NNQuant4
NNQuant4
Abstract—This paper provides a comprehensive overview of Healthcare: In healthcare, machine learning applications
the principles, challenges, and methodologies associated with include disease diagnosis, drug discovery, and medical imag-
quantizing large-scale neural network models. As neural net- ing analysis. For instance, models trained to recognize patho-
works have evolved towards larger and more complex archi-
tectures to address increasingly sophisticated tasks, the com- logical images can help doctors diagnose diseases like cancer
putational and energy costs have escalated significantly. We more quickly.
explore the necessity and impact of model size growth, high- Financial Services:Machine learning in finance is used for
lighting the performance benefits as well as the computational credit scoring, fraud detection, and algorithmic trading. It an-
challenges and environmental considerations. The core focus is
on model quantization as a fundamental approach to mitigate
alyzes customers’ transaction behaviors to identify fraudulent
these challenges by reducing model size and improving efficiency activities.
without substantially compromising accuracy. We delve into Autonomous Driving and Robotics
various quantization techniques, including both post-training Autonomous vehicles use machine learning to process com-
quantization (PTQ) and quantization-aware training (QAT), and plex data from sensors to recognize the environment, make
analyze several state-of-the-art algorithms such as LLM-QAT,
PEQA(L4Q), ZeroQuant, SmoothQuant, and others. Through decisions, and increase driving safety and efficiency.
comparative analysis, we examine how these methods address E-commerce and Recommendation Systems
issues like outliers, importance weighting, and activation quanti- E-commerce platforms utilize machine learning to analyze
zation, ultimately contributing to more sustainable and accessible user behavior, optimize search results, and recommend prod-
deployment of large-scale models.
ucts, significantly enhancing user experience and sales effi-
ciency.
Image Recognition
I. I NTRODUCTION
Machine learning now identifies and classifies objects in
A. Background and Motivation images, widely used in social media, security surveillance, and
industrial visual inspection systems.
1) Machine Learning: Machine Learning (ML) is a subfield
of artificial intelligence that enables computers to learn from Natural Language Processing
and make decisions based on data patterns without being From speech recognition to text analysis, machine learning’s
explicitly programmed. The core of machine learning is de- natural language processing technologies allow machines to
veloping algorithms that allow computers to receive input better understand and generate human language, used in chat-
data and use statistical analysis to predict or classify outputs, bots, translation services, and sentiment analysis.
thereby optimizing the performance of a specific task with 2) Neural Networks the Evolution Towards Larger Models:
minimal human intervention. Machine learning can be broadly Neural networks, a pivotal concept in machine learning, are
categorized into three types: inspired by the biological neural networks that constitute
animal brains. They are comprised of interconnected nodes
• Supervised Learning: The model is trained on a pre-
or neurons, which collectively learn to perform complex
defined set of data examples. The goal is to learn a
tasks. Typically, these tasks include but are not limited to
general rule that maps inputs to outputs.
classification, regression, and pattern recognition, making neu-
• Unsupervised Learning: The model looks for patterns
ral networks versatile tools in both theoretical and applied
and structures in data that is not labeled.
machine learning.
• Reinforcement Learning: The model learns through trial
Structure of Neural Networks
and error to perform a task to maximize the reward.
The basic structure of a neural network involves three types
Machine Learning has many Applications: of layers:
[email protected] • Input Layer: This layer receives the raw input data
[email protected] analogous to sensory input in biological systems.
• Hidden Layers: One or more hidden layers compute only a few layers and a limited number of neurons, which
functions applied to values from the previous layer. These constrained their learning capacity and applicability to com-
layers form the core computational engine of the neural plex tasks. However, as computational resources expanded,
network. so too did the size and depth of these models. For instance,
• Output Layer: The final layer produces output for models like AlexNet and VGG in the early 2010s marked the
the network, which corresponds to the predictions for beginning of what would be a rapid expansion in network
supervised learning tasks. depth and complexity, featuring layers deep into the double
Each neuron in these layers applies a non-linear transforma- digits.
tion to its input data and passes this output to the next layer. Advancements in Hardware and Algorithms: The advent
The strength and nature of the connections between neurons of GPUs and improvements in distributed computing have
are adjusted through a process known as learning, typically significantly reduced the time required to train large neural
implemented via backpropagation and gradient descent tech- networks. Simultaneously, advancements in optimization al-
niques. gorithms, such as Adam and RMSprop, have improved the
Learning Process in Neural Networks efficiency of training deep networks. These technical advance-
Learning in neural networks involves adjusting the weights of ments have facilitated the development of models such as the
connections based on the error between the predicted output Transformer, which underpins modern NLP systems like GPT
and the actual output. The most common learning algorithm and BERT. These models not only have hundreds of layers but
used is backpropagation combined with an optimization tech- also millions to billions of parameters.
nique such as gradient descent. This process involves: Implications of Larger Models: The shift towards larger
1) Propagating inputs forward through the network to gen- models has resulted in substantial improvements in tasks
erate the output. such as image recognition, natural language processing, and
2) Calculating the error between predicted and actual out- complex decision-making processes. For example, larger mod-
puts. els have led to breakthroughs in machine translation and
3) Propagating the error backward through the network to autonomous vehicle technology. However, the trend towards
update the weights, aiming to minimize the error by larger networks also presents new challenges, including in-
adjusting the weights. creased computational cost, energy consumption, and the need
for more sophisticated techniques to combat overfitting.
Types of Neural Networks
Future Prospects: As the trend towards larger models
There are several types of neural networks, each designed for
continues, the field of machine learning is likely to witness
specific types of problems and datasets. These include:
even more sophisticated architectures. This progression sug-
• Convolutional Neural Networks (CNNs): Highly effec-
gests a future where neural networks could approach and
tive for processing data that has a grid-like topology, such
even surpass human-like capabilities in certain tasks. However,
as images.
this potential also necessitates innovations in model efficiency,
• Recurrent Neural Networks (RNNs): Designed to han-
training techniques, and hardware design to make the training
dle sequential data, such as time series or language.
and deployment of such models sustainable.
• Deep Belief Networks (DBNs): A type of deep network
3) The Necessity and Impact of Model Size Growth: The
that uses a stack of restricted Boltzmann machines lay-
Necessity and Impact of Model Size Growth: The necessity
ered on top of each other.
for growth in neural network model size stems primarily from
Applications of Neural Networks the increasing complexity of tasks that modern AI systems are
Neural networks have been successfully applied in numerous expected to perform. As the ambition to develop systems that
domains including: can mimic human-level understanding and decision-making
• Vision Systems: From facial recognition to autonomous grows, so too does the need for models with greater capacity.
driving. Larger models, with their enhanced ability to model complex
• Speech Recognition: Enabling voice-activated assistants patterns and relationships, are pivotal in achieving higher
and real-time translation services. levels of accuracy in tasks ranging from natural language
• Natural Language Processing: Driving the development understanding to complex image recognition.
of conversational AI and other language understanding Impacts on Performance and Efficiency: Larger neural
applications. network models have consistently set new benchmarks in
Over the past few decades, there has been a significant shift AI performance. For instance, in natural language processing
in the architecture of neural networks, from relatively simple (NLP), models like OpenAI’s GPT-3 have demonstrated re-
designs to highly complex and large models. This trend is markable linguistic understanding and generation capabilities,
driven by the continuous growth in computational power and directly correlating their performance improvements to their
the availability of vast amounts of data, which have enabled vast number of parameters. Similarly, in image processing,
the training of larger models capable of performing a multitude larger Convolutional Neural Networks (CNNs) have achieved
of tasks with unprecedented accuracy. unprecedented accuracies in image classification challenges.
Evolution of Model Complexity: Initially, neural networks Computational Challenges and Solutions: However, the
were limited in size and complexity due to the computational growth in model size is not without its challenges. The primary
constraints of the time. Early networks often consisted of concern is the exponential increase in computational resources
2
and energy required for training such large models. This
n
has prompted significant research into more efficient training 1 XX T
LCE = − p (Xi ) log(pSc (Xi ))
algorithms, pruning techniques, and specialized hardware ac- n c i=1 c
celerations like GPUs and TPUs, which are designed to handle
extensive computational loads more efficiently. Here, i represents the i-th sample in the batch, c denotes
Economic and Environmental Considerations: Moreover, the number of classes (vocabulary size), and T and S are the
the economic and environmental impact of training and de- teacher and student networks, respectively.
ploying large-scale models cannot be overlooked. The financial Third, next token data generation from the pre-trained model
cost associated with accessing the necessary computational is proposed for synthesizing the distribution of pre-training
power can be prohibitive, limiting the accessibility of cutting- data. Data is generated by initiating with a random start
edge AI technology to well-funded organizations. Environ- token and generating subsequent tokens iteratively until the
mentally, the carbon footprint associated with training and end of the sentence or maximum length is reached. And
maintaining large models is substantial, prompting a push to- to ensure the generated data is diverse and accurate, LLM-
wards developing more energy-efficient computing techniques. QAT introduces a a hybrid approach where the first few
Balancing Scale with Sustainability: Moving forward, tokens are deterministically selected with top-1 strategy and
the challenge will be to balance the undeniable benefits of the remaining tokens are stochastically sampled from the pre-
larger neural network models with the practical limitations trained model’s SoftMax output distribution.
they impose. Innovations in model design, such as the develop- Lastly, The generated data is then used as input for fine-
ment of sparse networks and federated learning, offer promis- tuning the quantized model, where the teacher model’s pre-
ing avenues for maintaining model efficacy while mitigating dictions serve as labels to guide the training.
computational and environmental costs. The future of neural Experimental results show that LLM-QAT significantly out-
networks, therefore, lies not just in scaling up, but in scaling performs traditional PTQ methods at lower bit precisions.
smartly—enhancing model efficiency without compromising For example, in the 8-8-4 setting, the 30B LLM-QAT model
on their transformative potential. achieves an average zero-shot accuracy of 69.7, compared
to 50.7 with SmoothQuant, demonstrating its robustness in
maintaining accuracy. In the 4-8-4 setting, where both weights
II. O BJECTIVES , I MPORTANCE , AND F UNDAMENTAL
and the KV cache are quantized to 4 bits, LLM-QAT achieves
M ETHODS OF M ODEL Q UANTIZATION
69.9, only 1.5 points behind the full-precision model, while
A. Fundamental Approaches all PTQ methods perform poorly, highlighting LLM-QAT’s
1) LLM-QAT: LLM-QAT is an advanced method for superior quantization capabilities.
Quantization-Aware Training (QAT) specifically designed for Additionally, in the 4-8-8 setting, LLM-QAT outperforms
LLMs. Traditional post-training quantization methods have the best PTQ method (RTN) by 1.4 points. These results are
shown effectiveness up to 8-bits but struggle at lower bit consistent across different model sizes, with an 8-8-8 30B
precision levels. LLM-QAT leverages a data-free distillation quantized model surpassing a 13B full-precision model in
technique that generates training data using the pre-trained performance, and a 4-8-4 LLM-QAT 30B model outperform-
model itself. This method helps in preserving the original ing an 8-bit LLaMA-13B. These findings underscore LLM-
output distribution and allows the quantization of weights, QAT’s ability to maintain high performance with reduced
activations, and the key-value (KV) cache. The process aims computational costs and memory usage, offering a better
to improve the efficiency and performance of LLMs even at efficiency-accuracy tradeoff.
quantization levels as low as 4-bits. 2) PEQA: To address the increasing memory and compu-
In detail, based on the observation, Symmetric MinMax tational costs in large-scale NLP models, researchers Hyesung
quantization is first used to retain LLMs’ outliers and maintain Jeon, Yulhwa Kim, and Jae-Joon Kim from Seoul National
the performance of the model, which is defined as: University and Sungkyunkwan University proposed the Low-
rank adaptive Learning quantization algorithm geared towards
XRi
max(|XR |) LLMs (L4Q) [9] . L4Q combines quantization and parameter-
i
XQ =α , α= , efficient fine-tuning (PEFT) to overcome the limitations of
α 2N −1 − 1
traditional methods. While Post-Training Quantization (PTQ)
where XQ represents the quantized values, XR represents is efficient but error-prone, and Quantization-Aware Training
the real (full-precision) values, and α is the scaling factor. For (QAT) is accurate but resource-intensive, L4Q integrates QAT
weights, per-channel quantization is used, and for activations with PEFT to achieve both high precision and low memory
and the KV cache, per-token quantization is applied. usage, making it ideal for resource-constrained environments.
Second, LLM-QAT uses a student-teacher model framework L4Q achieves the integration of quantization and fine-tuning
to ensure that the quantized model retains the performance through the following steps:
of the full-precision model. Specifically, the student network, 1) Merging Weights and LoRA Parameters: The pre-trained
which is the lower-precision version of the model, is guided weights WF P and LoRA parameters A and B are merged
by the full-precision teacher network, the original pre-trained into a new weight matrix:
model. This guidance is provided through cross-entropy-based
logits distillation: W ′ = WF P + αBA
3
LUT-GEMM RTN SmoothQuant
1 1 -1 1 1 0 -3 12 1 0 -3 6
-1 -1 1 -1 -5 -98 2 -1 -2 24 1 -1
A LUT
1 -2 31 6 0 -2 8 3
Xdiag(s)-1
1 -1 1 1
1 -1 1 -1 9 -2 2 -11 4 -2 2 -5
LLM-QAT OliVe
8 8 7 8 1 0 -3 12
0.5 0.1 -2.6 11.5 1000 1111
7 0 8 7 2 -1
(0,-96)
8 7 10 8
0100 1000
8 7 8 7 1 -2
(32,0)
-5.2 -98.0 1.6 -0.8 9 -2 2 -11
SpQR
2 1 0 7 QLoRA
ZeroQuant
EasyQuant
Weight Matrix
0.043 0.008 0.226 1.000
0 0 -2 8
0.053 1.000 -0.016 0.008
Normal Value -4 -98.0 1 -1
s=1.409
0.016 0.068 1.000 0.192 0 -1 30.7 4
-0.868 0.179 -0.160 1.000 Outlier Value 7 -1 1 -7
GPTQ
14 7 9 0 14 7 9 0 14 7 9 0 -10.6
Fig. 1: Comparison of Different Algorithms for Quantizing Weight Matrices. Some algorithms, such as RTN and LLM-QAT
[1], directly quantize the weight matrix. Others, like SmoothQuant [2], SpQR [3], OliVe [4], and EasyQuant [5], process
outliers separately from normal values. Algorithms like GPTQ [6] and QLoRa [7] use matrix operation properties to preserve
outliers during computations. Additionally, ZeroQuant [8], SpQR [3], and GPTQ [6] address fine-grained quantization issues.
4
Model Core Algorithm
LLM-QAT [1] Data-free quantization-aware training
PEQA(L4Q) [9] Y = (W0 + αBA)X
QLORA [7] NF4 Quantization + Double Quantization (DQ) + LoRA
LUT-GEMM [10] Pre-calculation+Look up table
ZeroQuant [8] Fine-grained+LKD+Mixed quantizaiton/dequantization
SmoothQuant [2] Outliers Transfer:Y = (Xdiag(s)−1 )(diag(s)W )
SpQR [3] Sparse Quantization + Two-layer quantization
OliVe [4] Sacrificing normal values for outliers
GPTQ [6] MinMax quantization of approximate second-order information + adaptive batch update + Cholesky decomposition
AWQ [11] Look for key weights by observing activations
ACIQ [12] Clipping;Per-channel Bit Allocation
LowbitQ [13] mutiple tensor
DFQ [14] equalizing weight ranges
PWLQ [15] Partition quantization
SPARQ [16] Select the most important bit as the quantized value
Easyquant [5] Isolate outliers for weight
BRECQ [17] Block reconstruction
PTQD [18] quantization noise segmentation
Zeroq [19] distill data + Pareto frontier
TABLE II: Comparison of Matrix Quantization Across Different Large Model Quantization Algorithms. It displays the features
and considerations of various algorithms in matrix quantization. The ”✓” indicates that the respective algorithm considers
outliers or the importance of weights during the quantization process.
2) Quantizing the New Weight Matrix: The merged weight Experimental results show that the L4Q method significantly
matrix W ′ is quantized to generate quantized weights Wq : improves performance across various tasks. In the Common-
sense QA benchmark, L4Q achieves higher model accuracy
W′ − b
Wq = R(clamp( , Qn , Qp )) × s + b in most configurations, especially by 2% in 3-bit quantization
s scenarios. In the MMLU benchmark tests, L4Q outperforms
R represents the rounding function, and clamp function QLoRA and QA-LoRA in zero-shot (0-shot) and few-shot
limits the values within the quantization range. (5-shot) tasks, with accuracy improvements of approximately
3) Optimizing Quantization Parameters: L4Q uses the LSQ 1.5% and 2% respectively. These advantages make L4Q highly
method to update the quantization parameters s and b: valuable in the field of large model quantization.
∂L 3) QLORA:
= −w + w̃, if Qn ≤ w ≤ Qp
∂s 4) LUT-GEMM: To address the increasing memory and
∂L computational costs in large-scale NLP models, researchers
= 1, if w < Qn or w > Qp Gunho Park and Baeseong Park from Pohang University
∂b
of Science and Technology and NAVER Cloud proposed
4) Gradient Calculation for LoRA Parameters: The quan- the Lookup Table-based GEMM (LUT-GEMM) algorithm
∂L
tized gradients ∂W q
are used to update the LoRA param- [10]. LUT-GEMM enhances inference efficiency by quantiz-
eters A and B: ing weights while maintaining full precision for activations,
∂L ∂L eliminating the dequantization step.
= αB T
∂A ∂Wq LUT-GEMM achieves quantized matrix multiplication
through the following steps:
∂L ∂L T
=α A 1) Constructing Lookup Tables: For a binary matrix B ∈
∂B ∂Wq
5
Activation
Model
Feature Consider Outliers Consider Importance
LLM-QAT [1] Activation quantization of each token ✓ ✓
QLORA [7] Brain Floating Point 16 (BFloat16) ✓ ✓
LUT-GEMM [10] Full precision
ZeroQuant [8] Fine-grained+Token-wise
SmoothQuant [2] Xdiag(s)−1
AWQ [11] determine critical weights ✓ ✓
ACIQ [12] clip the range ✓
LowbitQ [13] quantizing the residual
DFQ [14] absorbs high biases
PWLQ [15] same as weight
Easyquant [5] 0: weight only
BRECQ [17] same as weight
TABLE III: Comparison of Activation Quantization Across Different Large Model Quantization Algorithms. It displays the
features and considerations of various algorithms in Activation quantization. The ”✓” indicates that the respective algorithm
considers outliers or the importance of activation during the quantization process.
Model Memory Aligned Trained Knowledge Distillation Feature Bias Correction Calibration Set Mixed-precision PTQ QAT
LLM-QAT [1] ✓ Logits Distillation ✓ ✓
PEQA(L4Q) [9] ✓ ✓ ✓ ✓
QLORA [7] ✓ ✓ ✓
LUT-GEMM [10] ✓ ✓ ✓
ZeroQuant [8] ✓ LKD ✓
SmoothQuant [2] ✓
SpQR [3] ✓ ✓ ✓
OliVe [4] ✓ ✓ ✓
GPTQ [6] ✓ ✓ ✓
AWQ [11] ✓ ✓ ✓
ACIQ [12] ✓ ✓ ✓
LowbitQ [13] ✓ ✓ ✓
DFQ [14] ✓ ✓
PWLQ [15] ✓ ✓
SPARQ [16] ✓
Easyquant [5] ✓
BRECQ [17] ✓ ✓
PTQD [18] ✓ ✓ ✓ ✓
Zeroq [19] ✓ ✓
TABLE IV: Comparison of Different Algorithms for Quantizing Large-Scale Models. The ”✓” symbol indicates that the
specified feature or attribute is implemented or considered by the algorithm. This symbol helps to quickly identify which
algorithms include certain functionalities, such as training, use of calibration sets, and implementation of quantization-aware
training (QAT), among others.
{−1, +1}m×n and an activation vector x ∈ Rn , all plied by an input vector x, the computational complexity
possible combinations of activation values and binary is:
patterns are precomputed and stored in a lookup table n
C =O m· ·q
(LUT). This can be expressed as: µ
6
PTQ-based QAT-based
LoRA Parameter Training LoRA & Quantization Parameter Training
QA-LoRA X L4Q X
QA-LoRA, L4Q
X
A
PTQ A* GA* Quantize
WFP Wq WFP + Wq
GB s,b Wq
B B
GA*
Gs,b
GB O
O Low-bit Precision Inference
O
G0
G0
QLoRA X QAT-LoRA X
QLoRA, QAT-LoRA
X
Gs,b
O
O O
Mixed-precision Inference
G0 G0
7
model, SmoothQuant achieved a 1.51x speed improvement
and
1.96x memory savings almost without loss of accuracy.
LLKD,k = MSE Lk · Lk−1 · · · · · L1 (X) − L̂k · Lk−1 · · · · · L1 (X)
This method not only maintains the model’s precision but also
where Lk represents the original model weights of the k- significantly enhances hardware efficiency, especially valuable
th layer, L̂k represents the quantized weights, X is the input in resource-constrained environments.
data, and MSE denotes the mean square error. 7) SpQR:
ZeroQuant demonstrates remarkable efficiency in model 8) OliVe: To address the increasing memory and compu-
quantization, reducing the precision of weights and activations tational costs in large-scale NLP models, researchers Cong
to INT8 for both BERT and GPT-3 models without retraining, Guo, Jiaming Tang, Weiming Hu, and others from Shanghai
achieving up to 5.19x and 4.16x speedup compared to FP16 Jiao Tong University and Microsoft Research proposed the
inference with minimal accuracy impact. When combined with Outlier-Victim Pair Quantization algorithm (OliVe) [4] for
Layer-wise Knowledge Distillation (LKD), ZeroQuant can fur- large language models. OliVe employs a hardware-friendly
ther quantize fully-connected module weights to INT4, while method to handle outliers, significantly enhancing performance
keeping attention module weights and activations at INT8, and energy efficiency while maintaining model accuracy in
leading to a 3x reduction in memory footprint compared to resource-constrained environments.
FP16 models. Additionally, ZeroQuant has been successfully The OliVe algorithm achieves effective handling of outliers
applied to large open-source language models like GPT-J6B through the following steps:
and GPT-NeoX20B, where the INT8 versions maintain similar 1) Pair-wise Analysis: OliVe first analyzes the tensor values
accuracy to FP16 but with up to 5.2x greater efficiency. in the model, classifying them into three types of pairs:
6) SmoothQuant: In response to the growing memory and normal-normal, outlier-normal, and outlier-outlier. The
computational costs of large-scale NLP models, researchers core of the algorithm is that for outlier-normal pairs, the
Guangxuan Xiao, Ji Lin from the Massachusetts Institute normal values are set to zero (referred to as ”victims”),
of Technology (MIT), and Mickael Seznec, Hao Wu, Julien freeing up space to handle the outliers.
Demouth from NVIDIA proposed a post-training quantization 2) Outlier Quantization: OliVe uses an adaptive biased float
(PTQ) method for large language models (LLMs) called (abfloat) data type to quantize outliers. This method adds
SmoothQuant [2]. This method specifically addresses the a suitable bias to adjust the range of floating-point values,
issues of maintaining accuracy and hardware efficiency during ensuring they do not overlap with the range of normal
quantization by implementing an INT8 quantization for both values, thus maximizing the utilization of the numerical
activations and weights through a training-free transformation, representation space. Specifically, the abfloat values are
optimizing model execution efficiency and memory usage on quantized using the formula:
hardware.
SmoothQuant centers around several key components: quant(e) = sign×(1 ≪ mantissa+mantissa) ≪ (exponent+bias)
1) Quantization Transformation Formula: SmoothQuant re- where mantissa denotes the width of the mantissa bits,
duces quantization difficulty by smoothing input activa- exponent denotes the width of the exponent bits, and bias
tions X and adjusting weights W . The core transforma- is the adaptive bias.
tion formula is: 3) Hardware-Friendly Memory Alignment: A key feature of
OliVe is memory alignment. By positioning the ”vic-
Y = (Xdiag(s)−1 ) · (diag(s)W ) = X̂ Ŵ
tims” adjacent to the outliers, OliVe achieves efficient
Here, s represents the smoothing factor for each channel, memory access with low hardware overhead. This design
making the adjusted activations X̂ and weights Ŵ easier avoids the complexity of sparse indexing hardware and
to quantize. is compatible with the memory subsystems of existing
2) Selection of Smoothing Factors: The choice of smoothing accelerators such as GPUs and TPUs.
factor sj is aimed at maximizing quantization effective- In terms of experimental results, OliVe demonstrated sig-
ness and accuracy, calculated by: nificant improvements across various tasks and datasets. For
max(|Xj |)α instance, in the GLUE benchmark with the BERT model,
sj = the 4-bit Post-Training Quantization (PTQ) method of OliVe
max(|Wj |)1−α
resulted in less than a 1% drop in accuracy compared to the
α is a hyperparameter between 0 and 1 that balances the full-precision model, outperforming other 4-bit, 6-bit, and 8-
quantization difficulty between activations and weights. bit PTQ and Quantization-Aware Training (QAT) methods.
3) Application to Transformer Blocks: Within Trans- Additionally, when applied to large-scale language models
former models, SmoothQuant specifically applies scaling like GPT2-XL, BLOOM-7B1, and OPT-6.7B, OliVe’s 8-bit
smoothing to the input activations of self-attention and PTQ method nearly preserved the original model performance,
feed-forward layers, and uses INT8 quantization for all highlighting its superior inference performance and energy
linear layers involving both weights and activations. efficiency.
In practical applications, SmoothQuant demonstrated signif- 9) GPTQ:
icant performance enhancements across various configurations 10) AWQ: In the field of quantizing large language models
and large models. For instance, in tests using the OPT-175B (LLMs), traditional methods face significant challenges such
8
as high training costs and notable accuracy loss at low bit B. The Significance of Quantization in Modern Deep Learning
rates. A research team from MIT, including Ji Lin, Jiaming III. T RAINING WITH Q UANTIZATION : A LGORITHMS AND
Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen A PPROACHES
Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and
Song Han, proposed an algorithm called Activation-aware A. Quantized Neural Networks (QNNs)
Weight Quantization (AWQ) [11] to address these challenges. Quantized Neural Networks(QNNs) are variants of neural
AWQ is a hardware-friendly low-bit weight-only quantization network that use quantized weights and/or activations in-
method based on the observation that weights are not equally stead of traditional full-precision (32-bit floating point) num-
important, and protecting only 1% of salient weights can bers. The motivation behind quantization els with more and
significantly reduce quantization error. AWQ protects salient more weights, which are not suitable for deployment on the
weights by observing activations rather than weights them- resource-constrained devices such as mobile phones, embed-
selves, requiring no backpropagation or reconstruction, thereby ded system and IoT devices. And the goal of quantization is
maintaining the generalization ability of LLMs across different to reduce the computational cost and memory requirements of
domains and modalities without overfitting to the calibration neural networks without performance degradation.
set.
The core idea of the AWQ algorithm is based on the B. Training neural networks With Quantization
observation that weights are not equally important, and pro- When quantizing neural networks, there are three compo-
tecting only 1% of the significant weights can greatly reduce nents could be quantized: weights, activations and gradients.
quantization error. The algorithm protects significant weights The quantization of weights and activations could effectively
by observing activations rather than the weights themselves. reduce the size of models, and the quantization of gradients
The specific steps are as follows: could reduce the resource and time cost when training QNNs.
1) Selecting Significant Weights: Significant weight chan- Before the advent of Large Language Models(LLMs), the
nels are selected based on the activation distribution QNNs had already attracted a lot of attention. Many re-
rather than the weight distribution. This is because weight searchers dived into the quantization of weights or activations,
channels associated with larger activation magnitudes and proposed a lot of techniques to reduce the cost of
process more important features. deployments and accelerate the inference. However, due to the
2) Weight Channel Scaling: A per-channel scaling method is need of the fidelity of gradient computations during backward
designed to automatically search for the optimal scaling propagation, the training of QNNs is unlike the inference,
factor to minimize quantization error under full-weight which could work well on low-precision. There is a high
quantization. The quantization function is defined as: probability that the models will not converge if training QNNs
with low-precision gradients. Therefore, the most quantization
w max(|w|) techniques try to quantize weights and/or activations of well-
Q(w) = ∆ · Round ,∆ = trained neural networks rather than train a QNNs from scratch.
∆ 2N −1
BinaryConnect(BC) [20] proposes a method in training
where N is the number of quantization bits, and ∆ is the DNNs with binary weights during propagations. This work
quantization scale determined by the maximum value of aims to use 1-bit to present weights w, and gives two quan-
the weights. tization method Qw to convert w to 1-bit. One of that is
To protect significant weights, quantize before scaling the deterministic method which is based on the sign function:
(
weights: +1, if w ≥ 0,
wq = Qw (w) = (1)
x w · s x −1, otherwise
Q(w · s) = ∆′ · Round ·
s ∆′ s This method make the weights greater than 0 to become 1 and
the other become -1. And the other is stochastic method:
By analyzing the error, determine the scaling factor s to (
+1, with probability p = δ(w),
reduce the quantization error of significant weights, with wq = Qw (w) = (2)
the optimization objective: −1, with probability 1 − p
where δ is the hardsigmoid function:
s∗ = arg min ∥Q(W · diag(s))(diag(s)−1 · X) − W X∥ x+1
s δ(x) = clip( , 0, 1) (3)
2
Experimental results show that AWQ outperforms existing which convert the weights greater than 1 to 1, the weights less
methods on various language modeling and domain-specific than -1 to -1 and the weights between -1 and 1 to either -1 or
benchmarks, especially in instruction-tuned and multi-modal 1 with a probability.
models. The team also implemented TinyChat, an efficient During backward propagation, the BC also adopt the gra-
and flexible inference framework for deploying 4-bit quantized dient decent algorithm like full-precision neural networks.
LLMs on various edge platforms, achieving more than 3x However the common quantization function is not differen-
speedup compared to the Huggingface FP16 implementation. tiable, or the derivative value is 0 almost everywhere(e.g.,
9
sign function mentioned above) which will make the gradient and the use linear mapping function as basic quantization
vanish. Therefor, They use the straight-through estimator(STE) function to convert the floating-point number x to k-bitwith
to obtain gradients. This technique has been proposed by singed integer. The basic function is defined as the follow:
Hinton et al. at 2012. The STE function is defined as follow: x
Q(x, k) = clip{δ(k)·round[ ], −1+δ(k), 1−δ(k)} (10)
STE(x) = clip(x, −1, 1) = min(1, max(x, −1)) (4) δ(k)
where round function approximate a value to the nearest
During the forward propagation, the BC us wb = Qw (wr ) = discrete value.
sign(wr ) to binarize the weights and forward with binary Before the linear mapping function applied, they shift values
weights. During backward propagation, the gradient of cost distribution to an appropriate order of magnitude. The shift
C respect wr as the follow equation: function is defined as the follow:
∂C ∂C ∂wb x
= (5) shift(x) = 2round(log2 ) (11)
∂wr ∂wb ∂wr
And through STE to obtain an estimate for the gradient of wb Weights are quantized by the linear mapping function above
respect wr : and activations are quantized by the same function after a shift-
∂wb based batch normalization.
= I|wr | (6) During backpropagation, WAGE first scales gradients g to
∂wr
the minimum step number.(for k-bitsinteger the minimum step
where I|wr | is the derivative of STE:
is ±1, for floating-point the minimum step is ±δ(k))and keeps
its direction. Then bring in shift-based learning rate η.
(
1, if |wr | ≤ 0,
I|wr | = (7)
0, otherwise
gs = η · g/shift(max(|g|)) (12)
The estimation could add noisy to weights as a form of
where η is an power of 2 integer. It is used to control update
regulation. Above all, the BC can correctly use gradient decent
step sizes of weights. To substitute accumulation of small
algorithm to update the weights, and obtain 98.8% accuracy
gradients, WAGE separates gs into integer part and decimal
on MNIST dataset.
part, and use bernoulli function to sample the decimal part to
BNN(Binary Neural Networks) [21] , which based on the
either 0 or 1. Above all, weights can be updated properly:
BinaryConnect and quantizes the activations further. Besides
that, considering that batch normalization can help the net- δW = QG (g) = δ(k)·sgn(gs )·{⌊|gs |⌋+bernoulli(|gs |−⌊|gs |⌋)}
works avoid the problem of exploding and vanishing gradients, (13)
accelerate the training, and ensure that BNN can converge. Besides STE, there is another way to solve the difficulty in
But the batch normalization operation cannot work efficiently propagating gradients which raises from non-differential quan-
without floating point unit. In the light of that, BNN simplifies tization function. Zhuang et al.proposed a solution by train-
batch normalization to a shift-based variation. ing the low-precision network with a full-precision auxiliary
XNOR-Net [22] implements the convolution operation us- module, creating additional full-precision routes to optimize
ing XNOR operations and bit-count operations as the follow the low-precision model. [25]
equation: Besides training the QNNs via gradient decent algorithm,
Peng et al. use the Evolutionary Algorithms(EAs) to search for
x · y = bitcount(and(x, y)), xi , yi ∈ {0, 1} (8)
the optimal low-bits weights of QNNs. [26] They formulate
DoReFa-Net [23], which has first quantized gradients to the quantization of neural networks as a large-scale discrete
low-bitwidth(less than 8) floating-point numbers with dis- optimization problem.
crete states in backward propagation. But when updating the
weights, DoReFa-Net also use the full-precision weights and C. Quantization after the advent of LLMs
gradients.
With the advent of LLMs, the excellent performance and
WAGE [24] realize the most work still keep high precision
the huge resource cost and memory requirement create a huge
and computational complexity during training.(e.g. BC and
demand for low cost deployment and fast inference. Much
BNN maintain the full-precision gradients and weights during
and much attention has been attracted in the quantiaziton of
backpropagation.) They develop a new method termed as
LLMs. But to the best of our knowledge, there doesn’t exist a
WAGE to quantize both training and inference. In WAGE
work that pre-train a quantized LLM so far. Considering that
the weights, activations, gradients, and errors among layers
training a LLM is very expensive, the pre-train of LLMs has
are shifted and linearly constrained to low-bitwidth integers.
been monopolized by large corporations. There is a very urgent
It is the first work that uses the discrete gradients to update
that propose techniques to effectively reduce the resource and
the quantized weights. WAGE uses shift-based linear mapping
time cost in LLMs pre-train. In our opinion, the difficulty of
and stochastic rounding technique for quantization. In WAGE
Pre-training Quantized LLMs lies in the following points:
quantization, the continuous and unbounded values are dis-
• LLMs have too many parameters. The more parameters
cretized by uniform distance δ:
there are, the harder it becomes to predict the overall
δ(k) = 21−k , k ∈ N+ (9) impact on training after quantizing.
10
• LLMs are deeper than traditional DNNs. As the depth in- • Technical Details: QKD uses a trainable uniform quan-
creases, the problem of exploding and vanishing gradients tization scheme for weights and activations, along
worsens. with gradient approximation techniques such as the
straight-through estimator (STE) to handle the non-
IV. I NFERENCE WITH Q UANTIZATION : A LGORITHMS AND differentiable nature of quantizers.
A PPROACHES 2) Self-Supervised Quantization-Aware Knowledge Distilla-
A. Knowledge Distillation (KD) tion (SQAKD):
Knowledge Distillation (KD) is a critical technique for • Methodology:SQAKD integrates QAT and KD by
model compression, enabling the transfer of knowledge from framing the problem as a constrained optimization task.
a large, well-trained teacher model to a smaller, more effi- It utilizes self-supervised learning techniques to mini-
cient student model. This process is particularly valuable in mize both discretization error and prediction discrep-
Quantization-Aware Training (QAT), where models are trained ancy, ensuring precise quantization without sacrificing
to maintain high performance despite being quantized to lower accuracy.
precision to reduce memory and computational costs. The • Performance: SQAKD has demonstrated significant
primary motivation behind using KD in QAT is to enhance performance improvements across various architectures
the performance of quantized models. KD helps mitigate the and datasets, showing its robustness and effectiveness
accuracy loss typically associated with model quantization in enhancing the accuracy of low-bit quantized models.
by transferring the knowledge encapsulated in the teacher’s The effectiveness of KD methods in QAT has been evaluated
full-precision model to the student’s lower-precision model. extensively on various benchmark datasets, including CIFAR-
This process is particularly beneficial for resource-constrained 10, CIFAR-100, and ImageNet. These evaluations show that
environments where computational efficiency is paramount. KD significantly enhances the performance of student models,
A. Key Techniques in KD for QAT even with aggressive quantization (e.g., sub-2-bit precision).
1) Layer-wise Distillation: Metrics such as cover length ratio and ranking loss have been
• Method: Applying KD at each layer of the model,
introduced to quantitatively assess the effectiveness of KD
aligning the student’s intermediate representations with losses, further validating the improvements achieved through
those of the teacher. This approach helps in capturing these techniques.
the hierarchical features learned by the teacher model.
• Benefits: This technique has been shown to signifi-
B. Key-Value cache(KV cache) compression
cantly improve the performance of quantized models
by maintaining the structural and functional integrity In large-scale neural network computation, the concept
of the model across all layers. of Key-Value Cache (KV Cache) is mainly related to the
2) Attention Mechanism Distillation: self-attention mechanism commonly used in the Transformer
• Method: Transferring attention maps from the teacher
architecture.The purpose of KV Cache is to reduce the compu-
to the student model. This ensures that the student tational complexity, improve the efficiency, and conserve the
model focuses on similar regions as the teacher, pre- computational resources, and it is widely used in large-scale
serving the interpretability and effectiveness of the models.
attention mechanism. In Transformer, the operation of the self-attention mech-
• Advantages: Enhances the student model’s perfor- anism involves the computation of query (Q), key (K) and
mance by ensuring that the critical attention mech- value (V) matrices [27]. These matrices are used to compute
anisms learned by the teacher are retained in the the attention distribution to balance the input information at
quantized student model. different locations. During inference, the Q matrix is usually
derived from the model input, which makes the Q matrix
3) Logit-based Distillation:
different for each input instance. In contrast, the K matrix and
• Method: Aligning the logits (pre-softmax outputs) of
the V matrix are computed from the output of the encoder and
the student model with those of the teacher. This are relatively stable.
ensures that the probability distributions produced by
The key idea behind the KV cache is that due to the
the student are similar to those of the teacher, aiding
relative stability of the K and V matrices, they can be cached
in better generalization.
at different time steps. This means that for the same input,
• Benefits: This method is straightforward and effective,
there is no need to recalculate the K and V matrices and
providing a strong baseline for performance improve-
therefore they can be reused.The KV cache can store the result
ment in quantized models.
of multiplying each token with the WK and WV parameter
B. Advanced KD Techniques for QAT matrices. In the converter architecture, each token is generated
1) Quantization-aware Knowledge Distillation (QKD): based on the result of the previous token. the KV cache caches
• Phases:QKD typically involves three phases - self- the K and V matrices, thereby reducing computation time.
studying, co-studying, and tutoring. These phases help Without the KV cache, recomputing the product of WK and
in progressively transferring knowledge and adapting WV for all tokens each time a new token is generated would
the student model to quantized constraints. be computationally intensive. Therefore, caching these results
11
Fig. 3: Summary of KV Cache Compression
is known as KV caching, which can improve the speed of KV Cache by token dropping methods. Scissorhands [31]
reasoning. and H2O (Heavy-Hitter Oracle) [32] are two state-of-the-art
To avoid recalculating attention keys and values, LLMs methods that significantly reduce the memory usage of the KV
store previously computed keys and values, which leads to cache. The Scissorhands system is based on the ‘importance
significant memory consumption as batch sizes and sequence persistence assumption’ and achieves up to 5x memory usage
lengths increase. For instance, In the experiment of KVQuant reduction without the need to fine-tune the model, by tracking
[28], they find that with the LLaMA-7B model, the KV and retaining the key tokens that have a significant impact on
cache consumes only 2% of memory during inference with future generation. The Scissorhands system is based on the
a sequence length of 512. However, this usage skyrockets to ‘Importance Persistence Assumption’ and achieves up to a 5x
84% when the sequence length increases to 128k. reduction in memory usage without the need to fine-tune the
In the field of KV cache compression methods, two hard- model, and can be combined with 4-bit quantization to further
ware optimisation methods: ‘DeepSpeed Inference’ [29]and compress the memory without compromising the quality of
‘FlexGen’ [30].DeepSpeed Inference, together with ZeRO In- the model while reducing memory usage.The H2O method,
ference, introduces a multi-GPU inference solution that lever- on the other hand, discovers that a small fraction of important
ages heterogeneous memory (including GPU memory, DRAM, words (Heavy Hitters) contribute most of the value by looking
and NVMe) to meet the demands of large-scale models and at the attention scores. The method solves the KV cache
improve inference performance. Instead, ‘FlexGen’ focuses elimination problem by using a dynamic retention strategy
on enabling high-throughput generative reasoning for large of recent and Heavy Hitters tokens, significantly improves
language models using a single GPU. The system employs a inference efficiency and throughput, and validates its accuracy
flexible memory and compute resource management strategy and efficiency on tasks such as OPT, LLAMA, and GPT-NeoX.
that integrates GPU, CPU and disk memory hierarchies to H2 O significantly reduces memory footprint and improves
optimise performance. Notably, FlexGen employs an adaptive performance in a variety of cases, while combining with
policy selection mechanism to efficiently adapt to various quantization methods works particularly well . Both methods
hardware configurations and model requirements. When run- effectively reduce memory bottlenecks in LLM deployments
ning on a single GPU, FlexGen achieves a more significant while maintaining model accuracy.
improvement in token generation rate compared to traditional KV cache compression method can be divided into KV
systems. cache quantization and KV cache compression.
In previous work, a number of scholars have optimised In terms of KV cache quantization, recent studies include
12
KVQuant [28], KIVI [33], QAQ [34], SKVQ [35], WKVQuant of pre-training large-scale language models while maintaining
[36], and L2 Norm-Based Strategy [37] high model accuracy. This helps improve the performance of
Experiments by KVQuant [28] and KIVI [33] have proved language models deployed in resource-constrained environ-
that key matrices often have distinct outlier channels. These ments such as mobile devices. WKVQuant employs a Past-
outlier channels significantly impact other channels when Only quantization (POQ) approach to improve the accuracy of
quantizing the key matrices. To mitigate these impacts, attention computation by quantising only the past KV caches
KVQuant and KIVI attempt to quantize the matrices along at the decoding stage; it then employs a 2D quantization
the channel direction, which is called per-cahnnel. Besides strategy that combines static channel smoothing with dynamic
the outlier problem, the rotary position embedding(RoPE) marker level pinpointing, which further reduces the error;
will cause mixing pairs of channels by different amounts finally, Cross-block Reconstruction Regularization (CRR) is
for different position in the sequence. To handle the RoPE introduced to optimise the parameters by calculating the
challenge, KVQuant develop a fused kernel to dequantize difference between subsequent layers to avoid the bias problem
the pre-RoPE quantization and efficiently aplly the posisional caused by local reconstruction loss, and using Mean Absolute
embedding after that. For value matrices, there exists both Error (MAE). Absolute Error (MAE) instead of MSE to reduce
outlier channels and outlier tokens, but they are much less than the effect of outliers.
the outlier key channels. KIVI also finds that when the value Alessio Devoto et al. proposed a KV cache compression
matrix is per-channel quantized, the accuracy significantly de- method based on the L2 paradigm number of key embeddings.
creases regardless of the quantization precision, and the most By observing the attention distributions and L2 norms of key
accurate quantization approach is to quantize key matrices per- embeddings in different heads and layers, the research team
channel and value matrices per-token when quantizing by a found that key embeddings with low L2 norms are usually
low numerical precision such as INT2. Besides that, KIVI associated with higher attention scores. Therefore, the authors
maintains a full precision KV cache sliding window for the concluded that the size of the KV cache can be reduced by
near tokens. Only the tokens are outside the sliding window compressing key embeddings with higher L2 paradigms and
are quantized to maintain the accuracy. reducing the impact on the output.
QAQ [34] finds through partial derivative derivation and In terms of KV cache compression, recent studies include
experimentation that the key cache is more sensitive to MLKV [38], CLA [39], KEYFORMER [40], GEAR [41], and
quantization, leading to more severe performance degradation MiniCache [42].
when quantized. Consequently, QAQ proposed using different MLKV can share KV heads between heads in the same
quantization methods and dynamic precisions for key cache layer and other layers. m is the number of layers that have
and value cache. Additionally, QAQ finds exceptions to the their respective KV heads, and the size of the KV cache is
’Importance Persistence Assumption.’ When performing quan- 2bsmgdk when MLKV is used. In addition, this paper proposes
tization based on this hypothesis without additional treatment a sliding-window quantization (SWQ) strategy, which applies
for exceptional cases, It could significantly impact the model’s a pre-population stage to the computed KV cache with pre-
performance. QAQ proposes the attention window as the population stage with full precision and then retaining a certain
additional treatment for the exceptional cases, which maintains number of token pairs at the end for high-precision processing,
a slide window of size n for attention, and only quantizes it which can obtain significant performance improvement in long
to lower bits if the attention scores within these n windows sequence tasks while adding little extra overhead, and mainly
remain consistently low. solves the quantization problem of KV cache with low bit-
SKVQ proposes a solution to the low-bitwidth KV cache width.
quantization problem, which mainly solves the problem of Reducing Transformer Key-Value Cache Size with Cross-
low-bitwidth KV cache quantization, i.e., how to improve Layer Attention (CLA) mainly solves the memory consump-
the quantization performance and reduce the quantization tion problem of large-scale pre-trained models in natural lan-
error under the premise of guaranteeing the computational guage processing tasks. CLA makes it possible to perform only
efficiency. Firstly, a channel reordering strategy is introduced one key/value projection between the same set of neighbouring
to group channels with similar data distribution and then layers by sharing the key/value header, which greatly reduces
quantise them together, thus reducing the quantization error. the memory occupation and computation. In addition, CLA
Then, a clipping dynamic quantization strategy is proposed, can be used in conjunction with other attention mechanisms
which introduces a clipping factor in the quantization process, to improve model effectiveness.
which can effectively reduce the influence of outliers in the M Adnan et al. proposed the KEYFORMER approach,
channel, and realise the efficient improvement of the quantiza- which aims to address the problem of increasing KV cache size
tion performance through offline calibration. Next, this paper in large language models.KEYFORMER utilises a sparsity ap-
proposes a sliding-window quantization strategy, which utilises proach to reduce the KV cache size. By exploiting the inherent
the KV cache obtained from full-precision computation in the sparsity in the attention mechanism, shorter subsequences can
pre-population stage, and retains a certain number of token be selected to reduce the size of the KV cache and improve
pairs at the end to be processed with high precision, which can reasoning efficiency. This approach is important for tasks that
achieve significant performance improvement in long sequence require processing long texts, saving computational resources
tasks while adding little extra overhead. and accelerating the text generation process.
WKVQuant can effectively reduce the memory footprint GEAR saves memory resources by compressing the KV
13
cache matrix. Traditional approaches can only apply one self-attention term, the information from all tokens can still
compression technique alone, such as quantization, singular be taken into account by means of bottom-up computation
value decomposition or sparsification, but these methods can- through the information transfer of residual connections. By
not consider different information types simultaneously. In pairing queries from all layers with key-value pairs using
contrast, GEAR uses an integrated compression strategy that only the top layer, the computation of key-value pairs from
combines three different compression techniques to handle all layers except the top layer is avoided, thus reducing
different types of information more efficiently and thus achieve memory consumption and computation. To solve the circular
better compression of the KV cache matrix to save memory dependency problem in self-attention, the method uses a mask
resources.GEAR first uses a filter to extract the maximum to handle the main diagonal so that the first token does not
and minimum values in the input tensor X and stores them depend on itself but still maintains the effective delivery of
in the sparse matrix S. The matrix is then sparsified by the information. Meanwhile, to address the fact that applying key-
filter. Then, the extracted matrix is uniformly quantised to value pairs at the same level may undermine the previously
obtain the quantised backbone D. Finally, the singular value observed tendency of focusing on syntactic information at
decomposition algorithm is used to compute the remaining lower levels and semantic information at higher levels, the
residual matrix R and its first r singular vectors are used to paper proposes a strategy of retaining some of the ‘warmer
construct the low-rank matrix L. This integrated compression layers’ in order to maintain the information transfer between
strategy allows GEAR to capture and recover information different levels.
more efficiently, resulting in higher compression ratios and W Lee et al. proposed the InfiniGen method, which mainly
smaller approximation errors. addresses the data transfer overhead of loading and computing
Liu et al. [42] propose a cross-layer KV cache compression method called MiniCache. The idea is inspired by prior studies showing that the KV cache of the middle-to-deep layers in LLMs is largely redundant, and it merges the KV cache layer-wise. The authors make two observations: first, the KV caches of adjacent layers are highly similar but may differ in magnitude, so they must be rescaled before being merged; second, not all tokens are equally suitable for merging, and a small number of highly distinct tokens must be retained. MiniCache therefore merges adjacent layers in the middle-to-deep part of the model after rescaling them, while keeping the few tokens that are sensitive to merging. At inference time, MiniCache first rescales the merged KV cache with the corresponding magnitudes along the token dimension and then restores the retained tokens at their original token indices.
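The merge-and-restore step can be illustrated with a small sketch (a simplified reading of the method under our own assumptions: a plain magnitude/direction decomposition and a cosine-similarity threshold chosen here for illustration; the published method uses a more elaborate interpolation and token-selection scheme):

```python
import numpy as np

def minicache_merge(kv_a, kv_b, keep_threshold=0.98):
    """Merge the per-token KV states of two adjacent layers.
    Returns shared directions, per-layer magnitudes, and the retained tokens."""
    norm_a = np.linalg.norm(kv_a, axis=-1, keepdims=True)   # per-token magnitude, layer A
    norm_b = np.linalg.norm(kv_b, axis=-1, keepdims=True)
    dir_a, dir_b = kv_a / norm_a, kv_b / norm_b             # unit directions

    cos = (dir_a * dir_b).sum(-1)                           # per-token similarity
    keep = np.where(cos < keep_threshold)[0]                # distinct tokens: retain as-is

    merged_dir = dir_a + dir_b
    merged_dir /= np.linalg.norm(merged_dir, axis=-1, keepdims=True)
    retained = {"idx": keep, "a": kv_a[keep], "b": kv_b[keep]}
    return merged_dir, (norm_a, norm_b), retained

def minicache_restore(merged_dir, norms, retained, which):
    """Rebuild one layer's KV cache: rescale the shared directions by that
    layer's magnitudes, then put the retained tokens back at their indices."""
    kv = merged_dir * norms[0 if which == "a" else 1]
    kv[retained["idx"]] = retained[which]
    return kv

layer_a = np.random.randn(16, 64)
layer_b = layer_a * 1.3 + 0.05 * np.random.randn(16, 64)    # similar direction, new magnitude
d, n, r = minicache_merge(layer_a, layer_b)
print(np.abs(minicache_restore(d, n, r, "b") - layer_b).mean())
```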
In the area of KV cache management and optimisation, recent research includes Cached Transformers [43], Layer-Condensed KV Cache [44], and InfiniGen.

Cached Transformers addresses the high computational complexity of the traditional self-attention mechanism and its difficulty in capturing long-term dependencies when dealing with long sequences. The approach combines the current input samples with historical samples, achieving efficient long-sequence modelling by dynamically recording historical samples and using them as a cache. In addition, the method employs a gated recurrent unit (GRU) update rule to capture dependencies on different time scales, and it provides a flexible paradigm that can be extended to applications such as cross-task memory modules.

The Layer-Condensed KV Cache method tackles two problems: reducing memory consumption and computation, and maintaining information transfer between layers. The method discards each token's own attention term in self-attention, which is equivalent to masking the main diagonal of the attention matrix. As a result, the first token has no self-attention term to refer to and simply uses the zero vector as a virtual key-value pair in its attention computation. Although each token then lacks a self-attention term, the information from all tokens can still be taken into account through bottom-up computation and the information flow of residual connections. By pairing the queries of all layers with the key-value pairs of only the top layer, the computation of key-value pairs for every layer except the top one is avoided, reducing both memory consumption and computation. To resolve the resulting circular dependency in self-attention, the method masks the main diagonal so that the first token does not depend on itself while still preserving effective information delivery. Meanwhile, because using key-value pairs from the same (top) level may undermine the previously observed tendency of lower layers to focus on syntactic information and higher layers on semantic information, the paper proposes retaining a few 'warmer layers' to maintain information transfer between different levels.

Lee et al. propose the InfiniGen method, which mainly addresses the data-transfer overhead of loading and computing key-value caches in large pre-trained models. InfiniGen introduces a predictive cache module that can modify weight matrices online to generate skewed query and key matrices, improving prediction accuracy and efficiency. In addition, InfiniGen employs a CPU memory-pool management strategy that automatically selects and evicts the least important KV entries when a user-defined memory limit is reached.

C. Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ)

Quantization is a vital technique in the field of deep learning aimed at reducing model size and improving inference speed without significantly compromising accuracy. Two primary methods for quantization are Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ).

Quantization-Aware Training (QAT) involves incorporating quantization into the training process itself. During QAT, the model is trained while simulating the effects of quantization, allowing the model to learn and adapt to the quantized weights and activations. This method typically yields higher accuracy compared to PTQ, as the model adjusts its parameters to minimize the loss function considering the quantization effects. QAT is particularly beneficial when the target hardware has stringent performance constraints and maintaining high accuracy is crucial.

Post-Training Quantization (PTQ), on the other hand, is applied after the model has been fully trained. It involves converting the weights and activations of the pre-trained model from floating-point precision (usually 32-bit) to lower-precision formats like 8-bit integers. PTQ is easier to implement since it doesn't require retraining the model and can be applied to a wide range of pre-trained models. However, the accuracy might be slightly lower than QAT, especially for models that are sensitive to quantization errors.
Both QAT and PTQ offer significant benefits in terms of reduced model size and increased inference speed, making them essential techniques for deploying deep learning models on resource-constrained devices like mobile phones and embedded systems. The choice between QAT and PTQ depends on the specific requirements of the application, such as the importance of model accuracy, available computational resources, and deployment constraints.
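As a concrete illustration of the PTQ half of this trade-off, the following self-contained sketch (a minimal example of uniform affine quantization, with an arbitrary random tensor standing in for real weights) converts a float32 tensor to int8 and back; this is essentially the primitive that the methods discussed below build on:

```python
import numpy as np

def quantize_int8(x):
    """Uniform affine (asymmetric) quantization of a float32 tensor to int8."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(qmin - x.min() / scale).astype(np.int32)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(256, 256).astype(np.float32)       # stand-in for a weight matrix
q, s, z = quantize_int8(w)
w_hat = dequantize(q, s, z)
print("max abs error:", np.abs(w - w_hat).max())        # bounded by roughly scale/2
print("memory ratio :", q.nbytes / w.nbytes)            # 0.25 (int8 vs float32)
```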
1) PTQ: Post-Training Quantization (PTQ) is a technique for compressing deep neural networks by reducing the numerical precision of weights and activations after the model has been trained. PTQ aims to quantize a pre-trained full-precision model without retraining or access to the original training data. Early work [45] proposed multipoint quantization, which approximates full-precision weights as a linear combination of multiple low-bit vectors, allowing flexible trade-offs between accuracy and model size on a per-channel basis. More recent works have explored PTQ for vision transformers [45], [46]. PTQ-ViT [45] proposes similarity-aware quantization for linear layers and ranking-aware quantization for self-attention layers in vision transformers, and uses mixed precision based on the nuclear norm of attention maps and output features. PTQ4ViT [46] introduces techniques such as twin uniform quantizers and Hessian-guided metrics for quantizing vision-transformer activations. Several works have focused on improving PTQ accuracy, especially at low bit-widths (< 8 bits). AdaQuant [47] proposes a three-stage pipeline: 1) minimizing per-layer quantization error on a small calibration set, 2) allocating bit-widths optimally via integer programming, and 3) tuning global model statistics; this achieves state-of-the-art results such as a 4-bit quantized ResNet50 with less than 1% accuracy drop. Other notable PTQ work surveys transformer compression, including quantization methods tailored for vision transformers, and outlier suppression has been proposed to handle outliers when quantizing transformers. In summary, PTQ has seen significant research interest, with works proposing novel quantization schemes for vision transformers, improving accuracy at low bit-widths, and optimally allocating bit-widths across layers.

Banner et al. [48] propose three methods for improving quantization. First, ACIQ clips the activation range to an optimal value determined by minimizing the mean squared error (MSE), thereby reducing quantization error in the regions with the highest information density. Second, per-channel bit allocation determines the optimal bit-width for each channel so as to minimize the overall MSE. Third, bias correction compensates for the inherent bias and variance changes in the weights after quantization.
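The clipping idea in ACIQ can be illustrated with a small search (a sketch of the general principle rather than the paper's analytical solution, which derives the clip value in closed form from a Laplace or Gaussian fit; the 4-bit setting and the grid of candidate clip values are choices made here for illustration):

```python
import numpy as np

def quant_error(x, clip, bits=4):
    """MSE after clipping x to [-clip, clip] and uniformly quantizing it."""
    levels = 2**bits - 1
    xc = np.clip(x, -clip, clip)
    scale = 2 * clip / levels
    xq = np.round(xc / scale) * scale
    return np.mean((x - xq) ** 2)

def aciq_like_clip(x, bits=4, num_candidates=100):
    """Pick the clip value that minimizes quantization MSE (grid search)."""
    candidates = np.linspace(1e-3, np.abs(x).max(), num_candidates)
    errors = [quant_error(x, c, bits) for c in candidates]
    return candidates[int(np.argmin(errors))]

acts = np.random.laplace(scale=1.0, size=100_000)   # long-tailed activations
best = aciq_like_clip(acts)
naive = np.abs(acts).max()
print(f"optimal clip = {best:.2f}  vs  max-abs clip = {naive:.2f}")
print(f"MSE: {quant_error(acts, best):.4f} (clipped) vs {quant_error(acts, naive):.4f} (no clipping)")
```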
DFQ [14] proposes a data-free method for 4-bit quantization. The approach equalizes the weight ranges across the network by exploiting the scale-equivariance property of activation functions, synchronizing the weights of adjacent layers and absorbing high biases into the subsequent layer. The method also includes bias correction during quantization.

LBQ [13] minimizes the MSE by optimizing each network layer individually and uses multiple quantization tensors for key layers with high MSE. The authors additionally propose a post-quantization adjustment of the scaling factors, which optimizes a small number of parameters to better approximate the original model.

PWLQ [49] introduces a piece-wise linear quantization scheme that improves precision by splitting the quantization range into two non-overlapping regions, each with an equal number of quantization levels. The optimal breakpoint is found by minimizing the quantization error, and bias correction is applied to further refine accuracy.
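The breakpoint search can be sketched as follows (our own simplified illustration on a synthetic heavy-tailed tensor; the paper derives the breakpoint analytically and spends one bit on selecting the region, whereas here both regions simply reuse the same small grid):

```python
import numpy as np

def pwl_quantize(x, p, bits=4):
    """Two-region piece-wise linear quantizer: |x| <= p and |x| > p each get
    their own uniform grid (with the same number of levels in both regions)."""
    levels = 2 ** (bits - 1) - 1
    m = np.abs(x).max()
    inner = np.clip(x, -p, p)
    outer = x - inner                       # non-zero only where |x| > p
    s_in = p / levels
    s_out = max(m - p, 1e-12) / levels
    return np.round(inner / s_in) * s_in + np.round(outer / s_out) * s_out

def best_breakpoint(x, bits=4):
    m = np.abs(x).max()
    grid = np.linspace(0.05 * m, m, 50)
    mse = [np.mean((x - pwl_quantize(x, p, bits)) ** 2) for p in grid]
    return grid[int(np.argmin(mse))]

w = np.random.randn(100_000)
w[np.random.rand(100_000) > 0.99] *= 10.0                        # heavy-tailed weights
p = best_breakpoint(w)
single = np.mean((w - pwl_quantize(w, np.abs(w).max())) ** 2)    # degenerate: one region
print(f"best breakpoint {p:.2f}; piecewise MSE "
      f"{np.mean((w - pwl_quantize(w, p))**2):.5f} vs single-region MSE {single:.5f}")
```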
SPARQ [16] dynamically selects the most significant bits of 8-bit activations, skipping leading zeros, and quantizes pairs of activations to 4 bits, using the full 8-bit budget when one activation of the pair is zero. Implemented on a systolic array and a Tensor Core dot-product unit, SPARQ demonstrates low area overhead while maximizing computational efficiency.

EasyQuant [5] introduces a data-free, weight-only quantization algorithm that optimizes the quantization range with a gradient-based method while leaving less than 1% of outliers unchanged to minimize quantization and reconstruction error. Because the algorithm can be run in parallel, it is highly efficient for LLMs with more than 100B parameters; remarkably, EasyQuant outperforms traditional data-dependent methods while running more than 10 times faster.

BRECQ [17] pushes neural network quantization to INT2 for the first time. By reconstructing the fundamental building blocks of the network one at a time and optimizing cross-layer dependencies through second-order error analysis, BRECQ balances precision and generalization. Mixed-precision techniques and a genetic algorithm further enhance the quantization, and BRECQ can produce 4-bit ResNet and MobileNetV2 models that rival QAT while producing the quantized models 240× faster.

PTQD [18] introduces a quantization approach for diffusion models that disentangles quantization noise into a correlated and a residual uncorrelated component. The linearly correlated part is mitigated by estimating the correlation coefficient, while the uncorrelated part is treated as Gaussian noise and addressed by subtracting its bias and modifying the variance schedule. A step-aware mixed-precision scheme is additionally employed to optimize bit-width allocation, maintaining a high SNR while significantly reducing computational complexity. This approach reduces the FID gap to just 0.06 over 250 generation steps under a W4A8 setting, compresses the model size by 6.83×, and cuts bit operations by 19.96×.

ZeroQ [19] generates a distilled dataset aligned with the network's batch-normalization statistics, enabling effective quantization without any training data. The method supports both uniform and mixed-precision quantization, automatically determining the bit settings with a Pareto-frontier approach.

2) QAT: Quantization-Aware Training (QAT) is a model compression and acceleration technique that takes quantization into account during the training process. Unlike traditional post-training quantization, QAT simulates the quantization operation in the training phase, so that the model can adapt to the information loss caused by quantization during the optimisation process.
Fig. 4: Timeline of QAT and PTQ methods from 2019 to 2024, covering DFQ [14], ACIQ [48], LBQ [13], PWLQ [49], AdaRound [50], ZeroQ [19], PTQ-ViT [45], AdaQuant [47], BRECQ [17], SparQ [16], PTQ4ViT [46], ZeroQuant [8], LUT-GEMM [10], SmoothQuant [2], GPTQ [6], PTQD [18], LLM-QAT [51], QLORA [7], SpQR [3], AWQ [11], OliVe [4], PB-LLM [52], EasyQuant [5], L4Q [9], EdgeQAT [53], QuaRot [54], SpinQuant [1], LR-QAT [55], and EfficientQAT [56]. The methods highlighted in red in the figure are QAT-related; the others are PTQ-based.
Specifically, QAT inserts 'pseudo-quantization nodes' into the model, i.e., simulated quantization and de-quantization of the weights and activation values, so that the model is exposed to the quantization error during training. Because the model learns in the presence of this error, it can maintain high performance even when low-precision integer arithmetic is used in the actual deployment. Compared with post-training quantization alone, QAT can better balance model size, inference speed and accuracy, which is especially important for large-scale language models deployed on resource-constrained devices.
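A minimal PyTorch sketch of such a pseudo-quantization node is shown below (our own illustration of the standard fake-quantization pattern with a straight-through estimator, not code from any of the surveyed papers; the 8-bit symmetric per-tensor setting is an arbitrary choice):

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Simulated (fake) quantization: quantize-dequantize in the forward pass,
    pass gradients straight through in the backward pass (STE)."""

    @staticmethod
    def forward(ctx, x, bits=8):
        qmax = 2 ** (bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax        # symmetric per-tensor scale
        return torch.round(x / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                            # straight-through estimator

class QuantLinear(torch.nn.Linear):
    """Linear layer that trains against quantized weights and activations."""
    def forward(self, x):
        w_q = FakeQuant.apply(self.weight)
        x_q = FakeQuant.apply(x)
        return torch.nn.functional.linear(x_q, w_q, self.bias)

layer = QuantLinear(64, 32)
out = layer(torch.randn(4, 64))
out.sum().backward()                                        # gradients reach the fp32 weights
print(layer.weight.grad.shape)                              # torch.Size([32, 64])
```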
Quantization-aware training was first introduced by Google researchers Benoit Jacob et al. in the 2018 paper 'Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference' [57]. They introduced a method to simulate quantization operations during neural network training, so that the model can adapt to the precision loss caused by quantization and can therefore use low-precision integer operations without significant performance degradation when deployed in practice.

Later, QAT was gradually adopted in a variety of settings, such as quantization optimisation for neural network training and inference [58] and quantization of graph neural networks [59].
LLM-QAT. LLM-QAT [51] is the first application of QAT techniques to large models. The authors propose a novel data-free distillation method to apply quantization-aware training to large language models (LLMs). They analyse the difficulties of quantization-aware training for LLMs: first, choosing appropriate fine-tuning data is critical, since data that does not match the original pre-training distribution, or that is too narrow, can damage model performance; second, because of the complexity and enormous scale of LLM pre-training, it is difficult to replicate the original training setup exactly; furthermore, the weight and activation distributions of LLMs differ significantly from those of small-scale models, so recipes designed for small models do not transfer directly. To address these difficulties, the paper proposes a data-free distillation method that uses the original pre-trained model itself to generate next-token data as the QAT training data, with a hybrid sampling strategy: the first 3-5 tokens are chosen by top-1 prediction, and the subsequent tokens are sampled stochastically from the output distribution; the synthetic data generated in this way are then used for QAT training. For the quantization scheme, the paper chooses symmetric MinMax linear quantization rather than clipping-based quantization, because LLMs contain a large number of outliers and clipping these values severely damages model performance. In addition, quantization is applied to the KV cache, a key component of LLM inference, for the first time, which further compresses the model.
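The data-generation loop can be sketched as follows (a schematic illustration under our own assumptions: next_token_probs stands in for one forward pass of the full-precision teacher and is not a real API, and the 3-5 token cut-off follows the description above):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 32_000

def next_token_probs(prefix):
    """Placeholder for a forward pass of the full-precision teacher model."""
    logits = rng.normal(size=VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate_synthetic_sample(start_token, length=128, greedy_prefix=4):
    """Hybrid sampling: deterministic (top-1) start, stochastic continuation."""
    tokens = [int(start_token)]
    for step in range(length - 1):
        p = next_token_probs(tokens)
        if step < greedy_prefix:
            tokens.append(int(np.argmax(p)))            # first few tokens: top-1 prediction
        else:
            tokens.append(int(rng.choice(VOCAB, p=p)))  # later tokens: sample the distribution
    return tokens

# A batch of generated sequences then serves as the QAT / distillation data.
batch = [generate_synthetic_sample(rng.integers(VOCAB)) for _ in range(2)]
print(len(batch), len(batch[0]))
```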
In the following, we categorise and summarise the research directions of these recent QAT studies.

Data-free QAT. This category of methods performs QAT by generating data, or via other techniques, without relying on the original training data. LLM-QAT [51] belongs to this class, and so does EdgeQAT [53]. EdgeQAT optimises the quantization process by introducing an adaptive quantization strategy: it adopts different quantization parameters according to the characteristics of different layers. This adaptive approach better balances model performance against model size, enabling the deployment of low-bit quantized models on edge devices while maintaining high inference performance. Unlike a one-size-fits-all quantization scheme, EdgeQAT's adaptive strategy exploits the characteristics of each layer: critical layers can retain higher bit-widths, while non-critical layers can be quantised to lower bit-widths. This differentiated quantization minimises the performance loss caused by quantization, so that a low-bit model can still maintain high accuracy and inference speed on edge devices. Because data-free QAT methods do not rely on the original training data, they offer new possibilities for deploying LLMs on resource-constrained edge devices; compared with traditional QAT methods, data-free methods such as EdgeQAT are more flexible and general and can be applied to many types of LLMs.

Matrix transform-based QAT. This type of method optimises the quantization process and improves performance by applying specific matrix transformations (e.g. rotations) to the model weights or activations. Representative works include QuaRot [54] and SpinQuant [1], two transformation-based LLM quantization methods that apply specific transforms to the weights and activations to mitigate the performance degradation caused by quantization. SpinQuant learns rotation matrices that are applied to the weights and activations before quantization, which reduces outliers and can significantly improve the performance of the quantised model [1]. QuaRot likewise applies orthogonal transformations to the weights and activations before quantization; because an orthogonal transformation preserves the computation, the expressive power of the model is maintained and the accuracy loss from quantization is reduced [54]. Compared with traditional quantization methods, these transformation-based approaches adapt the weight and activation distributions to the quantizer, mitigating the performance degradation caused by quantization, which is important for deploying high-performance LLMs on resource-constrained edge devices.
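The effect of such a rotation can be demonstrated with a small numerical sketch (our own toy example with a random orthogonal matrix and planted outlier channels; the published methods use Hadamard or learned rotations and fold them into adjacent layers):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(n):
    """Random orthogonal matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

def quantize_sym(x, bits=4):
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(x).max() / qmax
    return np.round(x / s).clip(-qmax, qmax) * s

d = 256
W = rng.normal(size=(d, d))
x = rng.normal(size=(16, d))
x[:, :4] *= 50.0                        # a few outlier channels, as observed in LLM activations

R = random_rotation(d)
# Computational invariance: (x R)(R^T W) == x W, so R can be folded into the weights.
y_ref = x @ W
y_rot = (x @ R) @ (R.T @ W)
print("invariance error:", np.abs(y_ref - y_rot).max())

# Quantizing the rotated activations spreads the outliers over all channels.
err_plain = np.abs(quantize_sym(x) @ W - y_ref).mean()
err_rot = np.abs(quantize_sym(x @ R) @ (R.T @ W) - y_ref).mean()
print(f"4-bit activation error: plain {err_plain:.3f}  rotated {err_rot:.3f}")
```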
Lightweight and Efficient QAT. These methods aim to improve the efficiency and practicality of QAT and to reduce training-resource consumption. Representative works include LR-QAT [55], EfficientQAT [56], and L4Q [9]. LR-QAT introduces a low-rank adaptation technique that dramatically reduces the memory and computational overheads of QAT by applying a low-rank decomposition to the model weights. Traditional QAT needs to store quantization-related state for each weight parameter individually, which brings a huge memory overhead; LR-QAT instead compresses the weight updates into low-rank representations, significantly reducing the memory and compute required, and it additionally adopts a block-wise training strategy that further improves efficiency. EfficientQAT, on the other hand, introduces an adaptive quantization strategy that adopts different quantization parameters according to the characteristics of each layer. Since the characteristics of different layers vary widely, a single uniform quantization parameter makes it difficult to balance the performance of every layer; EfficientQAT addresses this by designing adaptive per-layer quantization parameters, and it further employs knowledge distillation to improve the efficiency and practicality of QAT. L4Q proposes a parameter-efficient quantization-aware fine-tuning method that significantly reduces model size while maintaining performance, effectively balancing compression and accuracy. The work 'How to Parameterize Asymmetric Quantization Ranges for Quantization-Aware Training' investigates how to parameterize asymmetric quantization ranges so as to improve the effectiveness of QAT and better balance the performance of different layers. It introduces learnable asymmetric quantization parameters that adapt to the characteristics of each layer: specifically, two learnable scaling factors, α and β, scale the quantization range of the positive and negative weights respectively, so the quantization range of each layer can adapt to the asymmetry of its weight distribution. The authors also propose a gradient-based optimization method that updates the network parameters and the quantization parameters simultaneously during training, further improving quantization performance. This parametric asymmetric quantization belongs to the lightweight and efficient QAT category: it significantly reduces model size and computational overhead while maintaining performance, and compared with traditional symmetric quantization it adapts better to the characteristics of different layers.
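A compact sketch of such learnable asymmetric ranges is given below (a generic PyTorch illustration under our own naming and initialisation; the scaling factors alpha and beta follow the description above, while the STE details are choices made here):

```python
import torch

def round_ste(x):
    """Round with a straight-through gradient."""
    return x + (torch.round(x) - x).detach()

class AsymmetricFakeQuant(torch.nn.Module):
    """Fake-quantizer whose positive and negative clipping bounds are scaled by
    two learnable factors (alpha for the positive side, beta for the negative side)."""

    def __init__(self, bits=4):
        super().__init__()
        self.qmax = 2 ** (bits - 1) - 1
        self.alpha = torch.nn.Parameter(torch.tensor(1.0))    # scales the positive range
        self.beta = torch.nn.Parameter(torch.tensor(1.0))     # scales the negative range

    def forward(self, w):
        pos = self.alpha * w.detach().max().clamp(min=1e-5)   # learnable positive bound
        neg = self.beta * (-w.detach().min()).clamp(min=1e-5) # learnable negative bound
        scale = (pos + neg) / (2 * self.qmax + 1)
        zero = round_ste(neg / scale) - self.qmax
        q = torch.clamp(round_ste(w / scale) + zero, -self.qmax - 1, self.qmax)
        return (q - zero) * scale                              # dequantized weights

# alpha, beta and the weights are optimized jointly against a task loss.
fq = AsymmetricFakeQuant()
w = torch.nn.Parameter(torch.randn(128, 128))
opt = torch.optim.SGD(list(fq.parameters()) + [w], lr=1e-3)
loss = ((fq(w) - w.detach()) ** 2).mean()                      # stand-in for a task loss
loss.backward()
print(w.grad.abs().mean(), fq.alpha.grad, fq.beta.grad)
opt.step()
```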
QAT Category                      Details
Data-free QAT                     LLM-QAT [51], EdgeQAT [53]
Matrix transform-based QAT        QuaRot [54], SpinQuant [1]
Lightweight and efficient QAT     LR-QAT [55], EfficientQAT [56], L4Q [9]
Extreme low-bit quantization      PB-LLM [52]
Extreme low-bit quantization. This class of methods explores quantising models to extremely low bit-widths (e.g., 2-4 bits) while preserving performance as much as possible. A representative work is PB-LLM [52], an extreme low-bit quantization method for LLMs that maintains good performance at very low bit-widths through two key innovations. The first is the partial preservation of salient weights. Traditional quantization methods quantise all weight parameters to the same low bit-width, which causes a significant degradation of model performance; PB-LLM instead identifies the critical weight parameters of the model and keeps them at the original high bit-width, quantising (binarizing) only the non-critical weights. This partial retention of salient weights allows the model to maintain good inference quality at very low bit-widths. The second innovation is the optimisation of the quantization parameters: PB-LLM designs adaptive quantization parameters for each layer to account for the differences between layers, which balances per-layer performance and avoids the problems caused by a single uniform setting, and it further refines these parameters with knowledge distillation and related techniques. Compared with conventional 8-bit or 4-bit quantization, PB-LLM can compress the model to a lower bit-width while retaining higher inference performance, which makes it possible to deploy capable models on commodity hardware and eases their wider adoption.
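The salient-weight idea can be sketched in a few lines (a toy illustration under our own assumptions: magnitude is used as the saliency score and 10% of weights are kept in full precision, whereas the paper uses more refined saliency criteria and optimised binarization parameters):

```python
import numpy as np

def partially_binarize(W, keep_frac=0.10):
    """Keep the most salient weights in full precision, binarize the rest
    row-wise to {-a, +a} with a per-row scaling factor a."""
    saliency = np.abs(W)
    thresh = np.quantile(saliency, 1.0 - keep_frac)
    keep = saliency >= thresh                            # salient weights: stored as-is

    W_bin = np.zeros_like(W)
    for i in range(W.shape[0]):
        rest = W[i][~keep[i]]
        if rest.size:
            a = np.abs(rest).mean()                      # MSE-optimal scale for sign binarization
            W_bin[i][~keep[i]] = a * np.sign(rest)
    W_bin[keep] = W[keep]
    return W_bin, keep

W = np.random.randn(512, 512)
W_hat, mask = partially_binarize(W)
full_bin, _ = partially_binarize(W, keep_frac=0.0)
print("error with 10% salient weights kept:", np.abs(W - W_hat).mean())
print("error with (almost) full binarization:", np.abs(W - full_bin).mean())
```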
These approaches illustrate the characteristic strengths of QAT: models can be deployed on low-end hardware, and because quantization happens during training, the loss of model accuracy is reduced.

3) PTQ: The core idea of PTQ is to quantize model weights and activations after training, reducing the computational and memory footprint without further modifying the model. The seminal work by Jacob et al. (2018), 'Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference', lays the foundation for modern PTQ techniques. The authors introduced linear quantization methods that map floating-point values to integers, enabling convolutional neural networks to run efficiently on hardware that only supports integer operations. This work has been widely adopted and has significantly influenced subsequent developments in PTQ.

Practical Applications and Challenges of PTQ. Migacz (2017), in 'Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation', provided a practical evaluation of PTQ in real-world scenarios. The paper analyzed the impact of quantization precision on model inference performance and offered insights into how to choose appropriate quantization strategies for different applications; it serves as a valuable guide for engineers deploying deep learning models on mobile devices and in edge computing environments. A major challenge of PTQ, however, is the accuracy loss introduced by quantization. While PTQ does not require retraining, direct quantization can lead to significant performance degradation, particularly in tasks that demand high precision. This accuracy loss is often pronounced in certain layers or parts of the model, necessitating more refined quantization strategies.

Advanced PTQ Optimization Strategies. To address the accuracy loss associated with traditional PTQ methods, researchers have proposed various optimization strategies. Dong et al. (2019) introduced HAWQ (Hessian AWare Quantization of Neural Networks with Mixed-Precision), an approach that uses Hessian information to guide mixed-precision quantization: by computing the Hessian, the method identifies layers that are sensitive to quantization and assigns higher precision to them, maintaining overall model accuracy while reducing computational costs. Cai et al. (2020), in 'Towards Accurate Post-Training Network Quantization via Bit-Split and Stitching', proposed another way to enhance quantization accuracy: weights are split into smaller segments, quantized separately, and then stitched back together, effectively reducing quantization error; the approach performs well in applications requiring high-precision inference. Additionally, Choi et al. (2018) proposed PACT (Parameterized Clipping Activation for Quantized Neural Networks), a technique that optimizes quantization by adjusting the range of activation values: parameterized clipping aligns the activations better with the quantization grid, reducing accuracy loss during inference.

Q-DiT. Building upon this foundational work, recent advances have focused on the unique challenges posed by complex model architectures such as Diffusion Transformers. A notable contribution in this area is the Q-DiT framework, which introduces several techniques to enhance the accuracy and efficiency of post-training quantization for these models. Q-DiT combines fine-grained group quantization with dynamic activation quantization to tackle the high variance and structured outliers in activations that are characteristic of Diffusion Transformers. These techniques keep the quantization effective across the model's layers even under aggressive settings such as W4A8. Additionally, Q-DiT employs an evolutionary search algorithm to optimize the quantization granularity, balancing computational efficiency with performance. Experimental results on the ImageNet dataset validate the effectiveness of Q-DiT, showing that it can achieve near-lossless compression and setting a new standard for PTQ in high-performance generative models. These advances underscore the importance of tailored PTQ strategies for different model architectures and pave the way for further innovations in the field.
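Fine-grained group quantization, the first of these ingredients, is straightforward to illustrate (a generic sketch under our own assumptions: a random weight matrix, a planted outlier channel, and a group size of 64, whereas Q-DiT searches the granularity automatically):

```python
import numpy as np

def group_quantize(W, bits=4, group_size=64):
    """Symmetric quantization with one scale per contiguous group of
    `group_size` weights along the input dimension."""
    qmax = 2 ** (bits - 1) - 1
    rows, cols = W.shape
    W_g = W.reshape(rows, cols // group_size, group_size)
    scales = np.abs(W_g).max(axis=-1, keepdims=True) / qmax
    q = np.clip(np.round(W_g / scales), -qmax, qmax)
    return (q * scales).reshape(rows, cols)

W = np.random.randn(128, 512)
W[:, 7] *= 40.0                                   # a structured outlier channel
coarse = group_quantize(W, group_size=512)        # one scale per output row
fine = group_quantize(W, group_size=64)           # one scale per 64-weight group
print("per-row error  :", np.abs(W - coarse).mean())
print("per-group error:", np.abs(W - fine).mean())
```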
AdaLog. Further extending the boundaries of PTQ, the AdaLog framework addresses the specific challenges of Vision Transformer (ViT) architectures, which are increasingly popular but computationally expensive. Traditional PTQ methods often struggle with the power-law distributions of post-Softmax and post-GELU activations in ViTs, leading to significant accuracy degradation, especially at low-bit quantization levels. AdaLog introduces an Adaptive Logarithm Quantizer, which optimizes the logarithmic base used in the quantization process rather than relying on a fixed base. This adaptability allows AdaLog to better handle the unique activation distributions of ViTs, maintaining higher accuracy even under 3-bit and 4-bit quantization settings. The framework also incorporates bias reparameterization, enabling the effective quantization of both post-Softmax and post-GELU activations and making the process more hardware-friendly. To improve the efficiency of the quantization process, AdaLog employs a Fast Progressive Combining Search (FPCS) strategy, which efficiently determines the optimal hyperparameters for quantization, balancing precision with computational cost. Experimental results demonstrate that AdaLog significantly outperforms existing PTQ methods across various ViT architectures, providing a robust solution for deploying these complex models in resource-constrained environments.

Future Research Directions. Despite the success of PTQ in real-world applications, further reducing accuracy loss remains a key area of research. Future work may focus on:
• Adaptive Quantization Strategies: developing methods that dynamically adjust quantization precision for different parts of the model to minimize quantization error.
• Mixed-Precision Quantization: combining low-precision and high-precision computation to achieve the optimal balance between performance and efficiency.
• Combining PTQ with Quantization-Aware Training (QAT): exploring methods that further optimize models with minimal retraining, improving both compression efficiency and accuracy.

Post-Training Quantization is an effective model compression technique, particularly suitable for deploying deep learning models in resource-constrained environments. While PTQ offers significant advantages in reducing computational and storage demands, the challenge of accuracy loss persists. The Q-DiT and AdaLog frameworks represent significant advances in addressing these challenges, particularly for Diffusion Transformers and Vision Transformers, and ongoing research into quantization strategies promises to extend the efficacy of PTQ to a broader range of applications.

V. CONCLUSION

The quantization of large-scale neural network models emerges as a critical strategy for addressing the computational and energy demands associated with the growth of model sizes. Our comprehensive overview highlights the advancements in quantization techniques that enable significant reductions in model size and computational overhead while maintaining high levels of accuracy. By analyzing various algorithms and approaches, we observe that innovative methods such as LLM-QAT and SmoothQuant effectively balance performance and efficiency, making the deployment of large-scale models more feasible in resource-constrained environments. The integration of quantization-aware training and sophisticated post-training quantization methods demonstrates the potential for neural networks to continue scaling in complexity without proportional increases in computational costs. Future work in this domain is essential to further optimize quantization strategies, ensuring that the benefits of large-scale models can be widely accessible and environmentally sustainable.

REFERENCES

[1] Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort, "SpinQuant: LLM quantization with learned rotations," 2024. [Online]. Available: https://arxiv.org/abs/2405.16406
[2] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, "SmoothQuant: Accurate and efficient post-training quantization for large language models," in International Conference on Machine Learning. PMLR, 2023, pp. 38087–38099.
[3] T. Dettmers, R. Svirschevski, V. Egiazarian, D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov, T. Hoefler, and D. Alistarh, "SpQR: A sparse-quantized representation for near-lossless LLM weight compression," arXiv preprint arXiv:2306.03078, 2023.
[4] C. Guo, J. Tang, W. Hu, J. Leng, C. Zhang, F. Yang, Y. Liu, M. Guo, and Y. Zhu, "OliVe: Accelerating large language models via hardware-friendly outlier-victim pair quantization," in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–15.
[5] H. Tang, Y. Sun, D. Wu, K. Liu, J. Zhu, and Z. Kang, "EasyQuant: An efficient data-free quantization algorithm for LLMs," arXiv preprint arXiv:2403.02775, 2024.
[6] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, "GPTQ: Accurate post-training quantization for generative pre-trained transformers," arXiv preprint arXiv:2210.17323, 2022.
[7] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient finetuning of quantized LLMs," Advances in Neural Information Processing Systems, vol. 36, 2024.
[8] Z. Yao, R. Yazdani Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He, "ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers," Advances in Neural Information Processing Systems, vol. 35, pp. 27168–27183, 2022.
[9] H. Jeon, Y. Kim, and J.-J. Kim, "L4Q: Parameter efficient quantization-aware training on large language models via LoRA-wise LSQ," arXiv preprint arXiv:2402.04902, 2024.
[10] G. Park, B. Park, M. Kim, S. Lee, J. Kim, B. Kwon, S. J. Kwon, B. Kim, Y. Lee, and D. Lee, "LUT-GEMM: Quantized matrix multiplication based on LUTs for efficient inference in large-scale generative language models," arXiv preprint arXiv:2206.09557, 2022.
[11] J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, "AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration," Proceedings of Machine Learning and Systems, vol. 6, pp. 87–100, 2024.
[12] R. Banner, Y. Nahshan, E. Hoffer, and D. Soudry, "ACIQ: Analytical clipping for integer quantization of neural networks," 2018.
[13] Y. Choukroun, E. Kravchik, and P. Kisilev, "Low-bit quantization of neural networks for efficient inference," in 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019, pp. 3009–3018.
[14] M. Nagel, M. van Baalen, T. Blankevoort, and M. Welling, "Data-free quantization through weight equalization and bias correction," in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1325–1334.
[15] J. Fang, A. Shafiee, H. Abdel-Aziz, D. Thorsley, G. Georgiadis, and J. H. Hassoun, "Post-training piecewise linear quantization for deep neural networks," in Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II. Springer, 2020, pp. 69–86.
[16] G. Shomron, F. Gabbay, S. Kurzum, and U. C. Weiser, "Post-training sparsity-aware quantization," arXiv preprint arXiv:2105.11010, 2021.
[17] Y. Li, R. Gong, X. Tan, Y. Yang, P. Hu, Q. Zhang, F. Yu, W. Wang, and S. Gu, "BRECQ: Pushing the limit of post-training quantization by block reconstruction," arXiv preprint arXiv:2102.05426, 2021.
[18] Y. He, L. Liu, J. Liu, W. Wu, H. Zhou, and B. Zhuang, "PTQD: Accurate post-training quantization for diffusion models," arXiv preprint arXiv:2305.10657, 2023.
[19] Y. Cai, Z. Yao, Z. Dong, A. Gholami, M. W. Mahoney, and K. Keutzer, "ZeroQ: A novel zero shot quantization framework," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 13166–13175.
[20] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," Advances in Neural Information Processing Systems, vol. 28, 2015.
[21] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.
[22] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in European Conference on Computer Vision. Springer, 2016, pp. 525–542.
[23] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160, 2016.
[24] S. Wu, G. Li, F. Chen, and L. Shi, "Training and inference with integers in deep neural networks," arXiv preprint arXiv:1802.04680, 2018.
[25] B. Zhuang, L. Liu, M. Tan, C. Shen, and I. Reid, "Training quantized neural networks with a full-precision auxiliary module," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1488–1497.
[26] F. Peng, S. Liu, N. Lu, and K. Tang, "Training quantized deep neural networks via cooperative coevolution," in International Conference on Sensing and Imaging. Springer, 2022, pp. 81–93.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[28] C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, and A. Gholami, "KVQuant: Towards 10 million context length LLM inference with KV cache quantization," arXiv preprint arXiv:2401.18079, 2024.
[29] R. Y. Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li, E. Zheng, O. Ruwase, S. Smith, M. Zhang, J. Rasley et al., "DeepSpeed-Inference: Enabling efficient inference of transformer models at unprecedented scale," in SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2022, pp. 1–15.
[30] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang, C. Ré, I. Stoica, and C. Zhang, "FlexGen: High-throughput generative inference of large language models with a single GPU," in International Conference on Machine Learning. PMLR, 2023, pp. 31094–31116.
[31] Z. Liu, A. Desai, F. Liao, W. Wang, V. Xie, Z. Xu, A. Kyrillidis, and A. Shrivastava, "Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time," Advances in Neural Information Processing Systems, vol. 36, 2024.
[32] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett et al., "H2O: Heavy-hitter oracle for efficient generative inference of large language models," Advances in Neural Information Processing Systems, vol. 36, 2024.
[33] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu, "KIVI: A tuning-free asymmetric 2bit quantization for KV cache," arXiv preprint arXiv:2402.02750, 2024.
[34] S. Dong, W. Cheng, J. Qin, and W. Wang, "QAQ: Quality adaptive quantization for LLM KV cache," arXiv preprint arXiv:2403.04643, 2024.
[35] H. Duanmu, Z. Yuan, X. Li, J. Duan, X. Zhang, and D. Lin, "SKVQ: Sliding-window key and value cache quantization for large language models," arXiv preprint arXiv:2405.06219, 2024.
[36] Y. Yue, Z. Yuan, H. Duanmu, S. Zhou, J. Wu, and L. Nie, "WKVQuant: Quantizing weight and key/value cache for large language models gains more," arXiv preprint arXiv:2402.12065, 2024.
[37] A. Devoto, Y. Zhao, S. Scardapane, and P. Minervini, "A simple and effective L2 norm-based strategy for KV cache compression," arXiv preprint arXiv:2406.11430, 2024.
[38] Z. M. K. Zuhri, M. F. Adilazuarda, A. Purwarianti, and A. F. Aji, "MLKV: Multi-layer key-value heads for memory efficient transformer decoding," arXiv preprint arXiv:2406.09297, 2024.
[39] W. Brandon, M. Mishra, A. Nrusimha, R. Panda, and J. R. Kelly, "Reducing transformer key-value cache size with cross-layer attention," arXiv preprint arXiv:2405.12981, 2024.
[40] M. Adnan, A. Arunkumar, G. Jain, P. Nair, I. Soloveychik, and P. Kamath, "Keyformer: KV cache reduction through key tokens selection for efficient generative inference," Proceedings of Machine Learning and Systems, vol. 6, pp. 114–127, 2024.
[41] H. Kang, Q. Zhang, S. Kundu, G. Jeong, Z. Liu, T. Krishna, and T. Zhao, "GEAR: An efficient KV cache compression recipe for near-lossless generative inference of LLM," arXiv preprint arXiv:2403.05527, 2024.
[42] A. Liu, J. Liu, Z. Pan, Y. He, G. Haffari, and B. Zhuang, "MiniCache: KV cache compression in depth dimension for large language models," arXiv preprint arXiv:2405.14366, 2024.
[43] Z. Zhang, W. Shao, Y. Ge, X. Wang, J. Gu, and P. Luo, "Cached transformers: Improving transformers with differentiable memory cache," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 15, 2024, pp. 16935–16943.
[44] H. Wu and K. Tu, "Layer-condensed KV cache for efficient inference of large language models," arXiv preprint arXiv:2405.10637, 2024.
[45] Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, "Post-training quantization for vision transformer," Advances in Neural Information Processing Systems, vol. 34, pp. 28092–28103, 2021.
[46] J. Liu, L. Niu, Z. Yuan, D. Yang, X. Wang, and W. Liu, "PD-Quant: Post-training quantization based on prediction difference metric," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24427–24437.
[47] I. Hubara, Y. Nahshan, Y. Hanani, R. Banner, and D. Soudry, "Accurate post training quantization with small calibration sets," in International Conference on Machine Learning. PMLR, 2021, pp. 4466–4475.
[48] R. Banner, Y. Nahshan, and D. Soudry, "Post training 4-bit quantization of convolutional networks for rapid-deployment," in Neural Information Processing Systems, 2018.
[49] J. Fang, A. Shafiee, H. Abdel-Aziz, D. Thorsley, G. Georgiadis, and J. Hassoun, "Post-training piecewise linear quantization for deep neural networks," in European Conference on Computer Vision, 2020.
[50] M. Nagel, R. A. Amjad, M. van Baalen, C. Louizos, and T. Blankevoort, "Up or down? Adaptive rounding for post-training quantization," arXiv preprint arXiv:2004.10568, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:216056295
[51] Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chandra, "LLM-QAT: Data-free quantization aware training for large language models," arXiv preprint arXiv:2305.17888, 2023.
[52] Y. Shang, Z. Yuan, Q. Wu, and Z. Dong, "PB-LLM: Partially binarized large language models," 2023. [Online]. Available: https://arxiv.org/abs/2310.00034
[53] X. Shen, Z. Kong, C. Yang, Z. Han, L. Lu, P. Dong, C. Lyu, C.-H. Li, X. Guo, Z. Shu, W. Niu, M. Leeser, P. Zhao, and Y. Wang, "EdgeQAT: Entropy and distribution guided quantization-aware training for the acceleration of lightweight LLMs on the edge," 2024. [Online]. Available: https://arxiv.org/abs/2402.10787
[54] S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman, "QuaRot: Outlier-free 4-bit inference in rotated LLMs," arXiv preprint arXiv:2404.00456, 2024.
[55] Y. Bondarenko, R. D. Chiaro, and M. Nagel, "Low-rank quantization-aware training for LLMs," 2024. [Online]. Available: https://arxiv.org/abs/2406.06385
[56] M. Chen, W. Shao, P. Xu, J. Wang, P. Gao, K. Zhang, Y. Qiao, and P. Luo, "EfficientQAT: Efficient quantization-aware training for large language models," 2024. [Online]. Available: https://arxiv.org/abs/2407.11062
[57] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, "Quantization and training of neural networks for efficient integer-arithmetic-only inference," 2017. [Online]. Available: https://arxiv.org/abs/1712.05877
[58] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha, "Learned step size quantization," 2020. [Online]. Available: https://arxiv.org/abs/1902.08153
[59] X. Gao, W. Zhang, Y. Shao, Q. V. H. Nguyen, B. Cui, and H. Yin, "Efficient graph neural network inference at large scale," 2022. [Online]. Available: https://arxiv.org/abs/2211.00495