
 Quantization:

Quantization is the process of compressing a model's memory footprint: instead of storing values as 32-bit floats, they are stored as 8-bit integers (INT8), which reduces memory use and inference time.
LLMs have billions of parameters, and as models grow their activations develop outlier values of large magnitude. Quantizing these naively loses information and compromises accuracy: the large-magnitude values stretch the quantization range, so they dominate the scale and cause larger quantization error for everything else.
Two approaches can be applied for quantization, but both have problems associated with them.
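
To make the outlier problem concrete, below is a minimal NumPy sketch of symmetric INT8 quantization (the helper names and values are illustrative, not taken from any of the papers). A single large-magnitude activation stretches the scale, so the smaller values lose almost all of their precision:

import numpy as np

def quantize_int8(x):
    # Symmetric quantization: one scale maps the largest |value| to 127.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Typical small activations plus one large-magnitude outlier (made-up numbers).
x = np.array([0.02, -0.05, 0.03, 0.01, 60.0], dtype=np.float32)

q, scale = quantize_int8(x)
print("scale:", scale)                         # ~0.47, dominated by the outlier
print("reconstructed:", dequantize(q, scale))  # the small values collapse to 0.0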

ZeroQuant Approach:
 Adapts the quantization to the activations: it applies dynamic per-token activation quantization and group-wise weight quantization.
 Works well for smaller models such as GPT-3-350M and GPT-J-6B.
 Struggles for larger models such as OPT-175B, which has 175 billion parameters.

LLM.int8() Solution:
 Keeps the extreme (outlier) values in a higher-precision format (FP16) while using a smaller format (INT8) for the rest of the activations.
 This mixed-precision approach is difficult to implement efficiently on current hardware.
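
The sketch below illustrates the LLM.int8()-style mixed-precision decomposition in NumPy. It is a simplification, not the actual bitsandbytes implementation: activation columns whose maximum absolute value exceeds a threshold are kept in FP16, while the remaining columns are quantized to INT8 before the matrix multiplication. The helper name mixed_precision_matmul and the toy shapes are assumptions; the 6.0 outlier threshold follows the paper's default.

import numpy as np

def mixed_precision_matmul(X, W, threshold=6.0):
    # Illustrative only: columns of X with extreme values are treated as outliers.
    outlier_cols = np.abs(X).max(axis=0) > threshold
    regular_cols = ~outlier_cols

    # INT8 path for the regular columns.
    X_reg = X[:, regular_cols]
    scale = np.abs(X_reg).max() / 127.0
    X_q = np.clip(np.round(X_reg / scale), -127, 127)
    out_int8 = (X_q * scale) @ W[regular_cols, :]

    # FP16 path for the outlier columns.
    out_fp16 = X[:, outlier_cols].astype(np.float16) @ W[outlier_cols, :].astype(np.float16)

    return out_int8 + out_fp16.astype(np.float32)

X = np.random.randn(4, 8).astype(np.float32)
X[:, 3] *= 50.0                               # inject an outlier feature dimension
W = np.random.randn(8, 5).astype(np.float32)

print("max abs error:", np.abs(mixed_precision_matmul(X, W) - X @ W).max())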

 SmoothQuant:
Existing quantization methods use the same scaling factor (∆) across all channels. SmoothQuant, in contrast, uses a different scaling factor for each channel, derived from that channel's weight and activation characteristics.
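
A minimal NumPy sketch of this idea, assuming the per-channel smoothing factor from the SmoothQuant paper, s_j = max(|X_j|)^α / max(|W_j|)^(1−α), with migration strength α = 0.5. Dividing the activations by s and multiplying the weights by s leaves the product X·W unchanged while flattening the activation outliers before quantization. The helper name smooth and the toy shapes are assumptions:

import numpy as np

def smooth(X, W, alpha=0.5):
    # X: (tokens, channels); W: (channels, out_features). Illustrative sketch only.
    act_max = np.abs(X).max(axis=0)            # per-channel activation maximum
    w_max = np.abs(W).max(axis=1)              # per-channel weight maximum
    s = np.clip(act_max ** alpha / w_max ** (1 - alpha), 1e-5, None)
    return X / s, W * s[:, None]               # X @ W is mathematically unchanged

X = np.random.randn(16, 8).astype(np.float32)
X[:, 2] *= 40.0                                # outlier channel
W = np.random.randn(8, 4).astype(np.float32)

X_s, W_s = smooth(X, W)
print("channel max before:", np.abs(X).max(axis=0).round(1))
print("channel max after: ", np.abs(X_s).max(axis=0).round(1))
print("product unchanged:", np.allclose(X @ W, X_s @ W_s, atol=1e-3))
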
SmoothQuant has been tested on several large language models, including:
 OPT-175B (Zhang et al., 2022)
 BLOOM-176B (Scao et al., 2022)
 GLM-130B (Zeng et al., 2022)
 MT-NLG 530B (Smith et al., 2022)
SmoothQuant defines three efficiency levels of quantization settings (O1, O2, and O3), which differ in how the activations are quantized (per-token vs. per-tensor, dynamic vs. static).

The baseline quantization computes a scale ∆ from the maximum absolute value:

X̄ = round(X / ∆),  where ∆ = max(|X|) / (2^(N−1) − 1), max(|X|) is the maximum absolute value in the tensor (per tensor, per token, or per channel, depending on the granularity), and N is the bit width (8 for INT8).

The entire tensor (matrix) is quantized using a single scale based on the maximum value of the whole tensor.
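
For example, if the largest absolute value in the tensor is 12.7 and N = 8, then ∆ = 12.7 / (2^7 − 1) = 12.7 / 127 = 0.1, so a value of 0.37 is stored as round(0.37 / 0.1) = 4 and dequantizes back to 0.4 (the numbers are made up for illustration).
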
 Per-Token Quantization:
In per-token quantization, each token (which could be a word, sub-word, or character in a sequence) is
assigned its own quantization scale.
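
A small NumPy sketch of per-token quantization (the helper name and toy data are illustrative): the activation matrix has one row per token, each row gets its own scale, and a token with small values is therefore not crushed by a large-valued token elsewhere in the batch.

import numpy as np

def quantize_per_token(X):
    # X: (tokens, channels). Each token (row) gets its own INT8 scale.
    scales = np.abs(X).max(axis=1, keepdims=True) / 127.0   # shape (tokens, 1)
    q = np.clip(np.round(X / scales), -127, 127).astype(np.int8)
    return q, scales

# Four tokens whose activation magnitudes differ by orders of magnitude.
X = np.random.randn(4, 8).astype(np.float32) * np.array([[0.1], [1.0], [10.0], [100.0]], dtype=np.float32)

q, scales = quantize_per_token(X)
X_hat = q.astype(np.float32) * scales
print("per-token scales:", scales.ravel())
print("max abs error per token:", np.abs(X_hat - X).max(axis=1))  # error stays proportional to each row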

 Per-Tensor Quantization:
During model training in deep learning, tensors can represent input data (like images and text), labels,
model parameters, intermediate activations, and gradients.

Per-tensor quantization does not mean quantizing each input separately based on its dimensions.
Instead, it means applying one set of quantization parameters to the entire tensor, regardless of its
structure.
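
For contrast, here is the per-tensor counterpart of the per-token sketch above (same illustrative data): one scale is computed from the global maximum and applied to every element, so the rows with small values are represented much more coarsely than under per-token quantization.

import numpy as np

def quantize_per_tensor(X):
    # One scale for the whole tensor, derived from its global maximum |value|.
    scale = np.abs(X).max() / 127.0
    q = np.clip(np.round(X / scale), -127, 127).astype(np.int8)
    return q, scale

X = np.random.randn(4, 8).astype(np.float32) * np.array([[0.1], [1.0], [10.0], [100.0]], dtype=np.float32)

q, scale = quantize_per_tensor(X)
X_hat = q.astype(np.float32) * scale
print("single scale:", scale)
print("max abs error per token:", np.abs(X_hat - X).max(axis=1))  # same coarse error for every row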
