Quantization
Quantization is the process of shrinking a model's memory footprint by storing its values in a lower-precision format. For example, converting 32-bit floating-point values to 8-bit integers (INT8) reduces memory use and speeds up computation.
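As a concrete illustration, here is a minimal sketch of absmax (symmetric) INT8 quantization in NumPy; the function names are illustrative, not taken from any particular library.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric (absmax) quantization of an FP32 array to INT8."""
    scale = np.max(np.abs(x)) / 127.0                  # one FP32 scale for the array
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Map INT8 values back to approximate FP32 values."""
    return q.astype(np.float32) * scale

x = np.array([0.12, -0.5, 0.33, 0.9], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize_int8(q, scale)
print(q, scale, np.abs(x - x_hat).max())               # small round-trip error
```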
LLMs have billions of parameters, and as they scale up, values with very large magnitudes (outliers) appear in their activations. Quantizing these activations naively loses information and compromises accuracy: the large-magnitude values dominate the quantization scale and therefore cause the most error.
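To see why, here is a small sketch using the same absmax-style INT8 scheme as above; the numbers in the comments are approximate and illustrative. A single large value stretches the shared scale, so every other value is rounded much more coarsely.

```python
import numpy as np

def roundtrip_error(x: np.ndarray) -> float:
    """Quantize to INT8 with an absmax scale and report the worst round-trip error."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return float(np.abs(x - q * scale).max())

np.random.seed(0)
small = np.random.uniform(-1.0, 1.0, size=1000).astype(np.float32)
with_outlier = np.append(small, 60.0).astype(np.float32)    # one large activation value

print("error without outlier:", roundtrip_error(small))         # roughly 0.004
print("error with outlier:   ", roundtrip_error(with_outlier))  # roughly 0.24
```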
Two approaches can be applied for quantization, but both have problems associated with them.

The first approach adjusts the quantization to the activations: it applies dynamic per-token activation quantization and group-wise weight quantization. It works well for smaller models such as GPT-3-350M and GPT-J-6B, but struggles for larger models such as OPT, which has 175 billion parameters.

The second approach keeps the extreme values in a higher-precision format (FP16) while using a smaller format (INT8) for the rest of the activations. This mixed-precision approach is difficult to implement efficiently on current hardware.
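The second approach can be sketched roughly as an outlier decomposition: activation columns whose magnitude crosses a threshold stay in FP16, and the rest are quantized to INT8. The threshold value, the column-wise split, and the function names below are assumptions made for this illustration, not the exact production kernel.

```python
import numpy as np

def mixed_precision_matmul(X: np.ndarray, W: np.ndarray, threshold: float = 6.0):
    """Keep outlier activation columns in FP16 and quantize the rest to INT8.

    X: activations of shape (tokens, features); W: weights of shape (features, out).
    """
    outlier_cols = np.abs(X).max(axis=0) > threshold           # columns with extreme values
    X_out, W_out = X[:, outlier_cols], W[outlier_cols, :]      # stays in FP16
    X_reg, W_reg = X[:, ~outlier_cols], W[~outlier_cols, :]    # quantized to INT8

    def q(a):
        # absmax INT8 quantization for the regular part
        s = max(np.abs(a).max(), 1e-8) / 127.0
        return np.clip(np.round(a / s), -127, 127).astype(np.int8), s

    Xq, sx = q(X_reg)
    Wq, sw = q(W_reg)
    # INT8 matmul (emulated here in int32) rescaled back to floating point,
    # plus the FP16 matmul for the outlier columns.
    y_reg = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)
    y_out = X_out.astype(np.float16) @ W_out.astype(np.float16)
    return y_reg + y_out.astype(np.float32)

np.random.seed(0)
X = np.random.randn(4, 8).astype(np.float32)
X[:, 3] += 20.0                                                # make one feature an outlier column
W = np.random.randn(8, 5).astype(np.float32)
print(np.abs(mixed_precision_matmul(X, W) - X @ W).max())      # small approximation error
```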
SmoothQuant:
Existing approaches use a single scaling factor (∆) shared across all channels. SmoothQuant instead uses a different scaling factor for each channel, chosen according to that channel's weight and activation characteristics.
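A minimal sketch of this idea is shown below. The per-channel smoothing factor follows the formula described in the SmoothQuant paper, s_j = max|X_j|^α / max|W_j|^(1−α), where α is the migration strength (0.5 is a common default); the helper name and shapes here are illustrative.

```python
import numpy as np

def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    """Move quantization difficulty from activations to weights, channel by channel.

    X: activations (tokens, channels); W: weights (channels, out_features).
    Returns smoothed X', W' with X' @ W' equal to X @ W up to floating-point error.
    """
    act_max = np.abs(X).max(axis=0)                  # per-channel activation range
    w_max = np.abs(W).max(axis=1)                    # per-channel weight range
    s = (act_max ** alpha) / (w_max ** (1 - alpha))  # per-channel smoothing factor
    s = np.clip(s, 1e-5, None)                       # avoid division by zero
    X_smooth = X / s                                 # activations become easier to quantize
    W_smooth = W * s[:, None]                        # weights absorb the scale
    return X_smooth, W_smooth

np.random.seed(0)
X = np.random.randn(8, 4).astype(np.float32)
X[:, 2] *= 50                                        # channel 2 contains outliers
W = np.random.randn(4, 3).astype(np.float32)
Xs, Ws = smooth(X, W)
print(np.allclose(X @ W, Xs @ Ws, rtol=1e-4, atol=1e-4))   # True: the product is preserved
```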
SmoothQuant has been tested on several large language models, including:
OPT-175B (Zhang et al., 2022)
BLOOM-176B (Scao et al., 2022)
GLM-130B (Zeng et al., 2022)
MT-NLG 530B (Smith et al., 2022)
SmoothQuant is implemented with three efficiency levels of quantization settings, which differ in how the quantization scales are computed.
Per-Token Quantization:
In per-token quantization, each token (which could be a word, sub-word, or character in a sequence) is assigned its own quantization scale.
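For instance, a minimal per-token sketch (assuming a 2-D activation matrix with one row per token; the helper name is illustrative):

```python
import numpy as np

def per_token_quantize(X: np.ndarray):
    """One INT8 scale per token, i.e. per row of the activation matrix."""
    scales = np.abs(X).max(axis=1, keepdims=True) / 127.0    # shape (tokens, 1)
    q = np.clip(np.round(X / scales), -127, 127).astype(np.int8)
    return q, scales

X = np.array([[0.1, -0.2, 0.05],
              [4.0,  3.5, -2.0]], dtype=np.float32)           # two tokens, three channels
q, scales = per_token_quantize(X)
print(scales.ravel())   # each token keeps a scale suited to its own range
```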
Per-Tensor Quantization:
In per-tensor quantization, the entire tensor (matrix) is quantized using a single scale, based on the maximum value of the whole tensor. During model training in deep learning, tensors can represent input data (such as images and text), labels, model parameters, intermediate activations, and gradients. Per-tensor quantization does not mean quantizing each input separately based on its dimensions; instead, it applies one set of quantization parameters to the entire tensor, regardless of its structure.
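By contrast with the per-token sketch above, a per-tensor sketch (on the same illustrative matrix) uses one scale for everything, so the token with small values is forced onto the coarse scale set by the token with large values:

```python
import numpy as np

def per_tensor_quantize(X: np.ndarray):
    """A single INT8 scale for the whole tensor, regardless of its shape."""
    scale = np.abs(X).max() / 127.0
    q = np.clip(np.round(X / scale), -127, 127).astype(np.int8)
    return q, scale

X = np.array([[0.1, -0.2, 0.05],
              [4.0,  3.5, -2.0]], dtype=np.float32)
q, scale = per_tensor_quantize(X)
print(scale)            # one shared scale, set by the largest value (about 0.031)
print(q[0])             # the small-valued first token collapses to just a few levels
```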