Introduction to Weight Quantization
Post-Training Quantization (PTQ) is a straightforward technique where the weights of an already trained model are converted
to lower precision without necessitating any retraining. Although easy to implement, PTQ is associated with potential
performance degradation.
Quantization-Aware Training (QAT) incorporates the weight conversion process during the pre-training or fine-tuning stage,
resulting in enhanced model performance. However, QAT is computationally expensive and demands representative training
data.
In this article, we focus on PTQ to reduce the precision of our parameters. To get a good intuition, we will apply both naïve and more
sophisticated techniques to a toy example using a GPT-2 model.
The entire code is freely available on Colab
Among various data types, floating point numbers are predominantly employed in deep learning due to their ability to represent a
wide range of values with high precision. Typically, a floating point number uses n bits to store a numerical value. These n bits are
further partitioned into three distinct components:
1. Sign: The sign bit indicates the positive or negative nature of the number. It uses one bit where 0 indicates a positive number
and 1 signals a negative number.
2. Exponent: The exponent is a segment of bits that represents the power to which the base (usually 2 in binary representation) is
raised. The exponent can also be positive or negative, allowing the number to represent very large or very small values.
3. Significand/Mantissa: The remaining bits are used to store the significand, also referred to as the mantissa. This represents the
significant digits of the number. The precision of the number heavily depends on the length of the significand.
This design allows floating point numbers to cover a wide range of values with varying levels of precision. The formula used for this
representation is:

value = (-1)^sign × (1 + mantissa) × 2^(exponent - bias)

where the bias depends on the number of exponent bits (127 for FP32, for example).
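As a quick illustration (this snippet is not part of the original notebook), we can unpack these three components of a float32 number in Python:

import struct

def fp32_components(x: float) -> str:
    # Reinterpret the 4 bytes of a float32 as a 32-bit unsigned integer
    (bits,) = struct.unpack('>I', struct.pack('>f', x))
    b = f"{bits:032b}"
    return f"sign={b[0]} | exponent={b[1:9]} | mantissa={b[9:]}"

print(fp32_components(0.1))
# sign=0 | exponent=01111011 | mantissa=10011001100110011001101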
To understand this better, let’s delve into some of the most commonly used data types in deep learning: float32 (FP32), float16
(FP16), and bfloat16 (BF16):
FP32 uses 32 bits to represent a number: one bit for the sign, eight for the exponent, and the remaining 23 for the significand.
While it provides a high degree of precision, the downside of FP32 is its high computational and memory footprint.
FP16 uses 16 bits to store a number: one is used for the sign, five for the exponent, and ten for the significand. Although this
makes it more memory-efficient and accelerates computations, the reduced range and precision can introduce numerical
instability, potentially impacting model accuracy.
BF16 is also a 16-bit format but with one bit for the sign, eight for the exponent, and seven for the significand. BF16 expands the
representable range compared to FP16, thus decreasing underflow and overflow risks. Despite a reduction in precision due to
fewer significand bits, BF16 typically does not significantly impact model performance and is a useful compromise for deep
learning tasks.
[Figure: FP32, FP16, and BF16 representations (image by author)]
In ML jargon, FP32 is often termed “full precision” (4 bytes), while BF16 and FP16 are “half-precision” (2 bytes). But could we do
even better and store weights using a single byte? The answer is the INT8 data type, which consists of an 8-bit representation
capable of storing 2⁸ = 256 different values. In the next section, we’ll see how to convert FP32 weights into an INT8 format.
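As a sanity check (again, not from the original notebook), PyTorch exposes the range and precision of these data types through torch.finfo and torch.iinfo:

import torch

# Compare the range and precision of the floating point formats discussed above
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15}: bits={info.bits}, max={info.max:.2e}, eps={info.eps:.2e}")

# INT8 can only represent 256 integer values
print(torch.iinfo(torch.int8))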
With absmax quantization, the original number is divided by the absolute maximum value of the tensor and multiplied by a
scaling factor (127) to map inputs into the range [-127, 127]. To retrieve the original values, the INT8 number is divided by the
quantization factor, acknowledging some loss of precision due to rounding.
For instance, let’s say we have an absolute maximum value of 3.2. A weight of 0.1 would be quantized to round(0.1 × 127/3.2) = 4. If
we want to dequantize it, we would get 4 × 3.2/127 = 0.1008, which implies an error of 0.0008. Here’s the corresponding Python
implementation:
import torch
def absmax_quantize(X):
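    # Calculate the scale: map the largest absolute value to 127
    scale = 127 / torch.max(torch.abs(X))
    # Quantize: scale and round to the nearest integer
    X_quant = (scale * X).round()
    # Dequantize by dividing by the same scale (lossy due to rounding)
    X_dequant = X_quant / scale
    return X_quant.to(torch.int8), X_dequant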
With zero-point quantization, we can consider asymmetric input distributions, which is useful when you consider the output of
a ReLU function (only positive values), for example. The input values are first scaled by the total range of values (255) divided by
the difference between the maximum and minimum values. This distribution is then shifted by the zero-point to map it into the
range [-128, 127] (notice the extra value compared to absmax). First, we calculate the scale factor and the zero-point value:

scale = 255 / (max(X) - min(X))
zero-point = -round(scale × min(X)) - 128

Then, we can use these variables to quantize or dequantize our weights:

X_quant = round(scale × X + zero-point)
X_dequant = (X_quant - zero-point) / scale

Let’s take an example: we have a maximum value of 3.2 and a minimum value of -3.0. We can calculate the scale as 255/(3.2 + 3.0) =
41.13 and the zero-point as -round(41.13 × -3.0) - 128 = 123 - 128 = -5, so our previous weight of 0.1 would be quantized to round(41.13
× 0.1 - 5) = -1. This is very different from the previous value obtained using absmax (4 vs. -1).
[Figure: zero-point quantization (image by author)]
def zeropoint_quantize(X):
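    # Calculate the value range (avoid division by zero)
    x_range = torch.max(X) - torch.min(X)
    x_range = 1 if x_range == 0 else x_range
    # The scale maps the full range onto 255 values
    scale = 255 / x_range
    # The zero-point shifts the distribution into [-128, 127]
    zeropoint = (-scale * torch.min(X) - 128).round()
    # Quantize: scale, shift, round and clip to the INT8 range
    X_quant = torch.clip((X * scale + zeropoint).round(), -128, 127)
    # Dequantize: undo the shift and the scaling
    X_dequant = (X_quant - zeropoint) / scale
    return X_quant.to(torch.int8), X_dequant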
Instead of relying on complete toy examples, we can use these two functions on a real model thanks to the transformers library.
We start by loading the model and tokenizer for GPT-2. This is a very small model we probably don’t want to quantize, but it will be
good enough for this tutorial. First, we want to observe the model’s size so we can compare it later and evaluate the memory
savings due to 8-bit quantization.
from transformers import AutoModelForCausalLM, AutoTokenizer

device = 'cpu'
model_id = 'gpt2'
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Print the model size in bytes
print(f"Model size: {model.get_memory_footprint():,} bytes")
The size of the GPT-2 model is approximately 487MB in FP32. The next step consists of quantizing the weights using zero-point and
absmax quantization. In the following example, we apply these techniques to the first attention layer of GPT-2 to see the results.
weights = model.transformer.h[0].attn.c_attn.weight.data
print("Original weights:")
print(weights)
weights_abs_quant, _ = absmax_quantize(weights)
print("\nAbsmax quantized weights:")
print(weights_abs_quant)
weights_zp_quant, _ = zeropoint_quantize(weights)
print("\nZero-point quantized weights:")
print(weights_zp_quant)
Original weights:
tensor([[-0.4738, -0.2614, -0.0978, ..., 0.0513, -0.0584, 0.0250],
[ 0.0874, 0.1473, 0.2387, ..., -0.0525, -0.0113, -0.0156],
[ 0.0039, 0.0695, 0.3668, ..., 0.1143, 0.0363, -0.0318],
...,
[-0.2592, -0.0164, 0.1991, ..., 0.0095, -0.0516, 0.0319],
[ 0.1517, 0.2170, 0.1043, ..., 0.0293, -0.0429, -0.0475],
[-0.4100, -0.1924, -0.2400, ..., -0.0046, 0.0070, 0.0198]])
The difference between the original (FP32) and quantized (INT8) values is clear, but the difference between the absmax and zero-point
weights is more subtle. In this case, the zero-point quantized weights look shifted by a value of -1 compared to the absmax ones. This
suggests that the weight distribution in this layer is quite symmetric.
We can compare these techniques by quantizing every layer in GPT-2 (linear layers, attention layers, etc.) and creating two new
models: model_abs and model_zp. To be precise, we will actually replace the original weights with de-quantized ones. This has two
benefits: it allows us to 1/ compare the distribution of our weights (same scale) and 2/ actually run the models.
Indeed, PyTorch doesn’t allow INT8 matrix multiplication by default. In a real scenario, we would dequantize them to run the model
(in FP16 for example) but store them as INT8. In the next section, we will use the bitsandbytes library to solve this issue.
import numpy as np
from copy import deepcopy

# Create a copy of the model and replace its weights with absmax-dequantized values
model_abs = deepcopy(model)
weights_abs = []
for param in model_abs.parameters():
    _, dequantized = absmax_quantize(param.data)
    param.data = dequantized
    weights_abs.append(dequantized)

# Do the same with zero-point quantization
model_zp = deepcopy(model)
weights_zp = []
for param in model_zp.parameters():
    _, dequantized = zeropoint_quantize(param.data)
    param.data = dequantized
    weights_zp.append(dequantized)
Now that our models have been quantized, we want to check the impact of this process. Intuitively, we want to make sure that the
quantized weights are close to the original ones. A visual way to check it is to plot the distribution of the dequantized and original
weights. If the quantization is lossy, it would drastically change the weight distribution.
The following figure shows this comparison, where the blue histogram represents the original (FP32) weights, and the red one
represents the dequantized (from INT8) weights. Note that we only display this plot between -2 and 2 because of outliers with very
high absolute values (more on that later).
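The plotting code itself is omitted here; a rough sketch of how such a comparison can be produced with matplotlib (assuming the weights_abs list built above) could look like this:

import matplotlib.pyplot as plt

# Flatten all original and absmax-dequantized weights into 1-D arrays
weights_orig = np.concatenate([p.data.cpu().numpy().flatten() for p in model.parameters()])
weights_absmax = np.concatenate([w.cpu().numpy().flatten() for w in weights_abs])

# Overlay the two histograms, restricted to [-2, 2] to hide extreme outliers
plt.hist(weights_orig, bins=150, range=(-2, 2), alpha=0.5, color='blue', label='Original (FP32)')
plt.hist(weights_absmax, bins=150, range=(-2, 2), alpha=0.5, color='red', label='Absmax (dequantized)')
plt.legend()
plt.show()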
Both plots are quite similar, with a surprising spike around 0. This spike shows that our quantization is quite lossy since reversing
the process doesn’t output the original values. This is particularly true for the absmax model, which displays both a lower valley
and a higher spike around 0.
Let’s compare the performance of the original and quantized models. For this purpose, we define a generate_text() function to
generate 50 tokens with top-k sampling.
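The helper itself is not reproduced above; here is a minimal sketch of what it can look like (the value of k and the exact generation arguments are illustrative assumptions):

def generate_text(model, input_text, max_length=50):
    # Encode the prompt and sample up to max_length tokens with top-k sampling
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)
    output = model.generate(inputs=input_ids,
                            max_length=max_length,
                            do_sample=True,
                            top_k=30,
                            pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Generate text with the original and quantized models
original_text = generate_text(model, "I have a dream")
absmax_text = generate_text(model_abs, "I have a dream")
zp_text = generate_text(model_zp, "I have a dream")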
print(f"Original model:\n{original_text}")
print("-" * 50)
print(f"Absmax model:\n{absmax_text}")
print("-" * 50)
print(f"Zeropoint model:\n{zp_text}")
Original model:
I have a dream, and it is a dream I believe I would get to live in my future. I love my mother, and there was that one time I had
--------------------------------------------------
Absmax model:
I have a dream to find out the origin of her hair. She loves it. But there's no way you could be honest about how her hair is made We
Instead of trying to see if one output makes more sense than the others, we can quantify it by calculating the perplexity of each
output. This is a common metric used to evaluate language models, which measures the uncertainty of a model in predicting the
next token in a sequence. In this comparison, we make the common assumption that the lower the score, the better the model is. In
practice, a sentence with a high perplexity could also be correct.
We implement it using a minimal function: since our sentences are short, it doesn’t need to consider details like the length of the
context window.
def calculate_perplexity(model, text):
    # Encode the text and use the same tokens as labels
    encodings = tokenizer(text, return_tensors='pt').to(device)
    input_ids = encodings.input_ids
    target_ids = input_ids.clone()

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

    # Perplexity is the exponential of the average negative log-likelihood
    neg_log_likelihood = outputs.loss
    ppl = torch.exp(neg_log_likelihood)
    return ppl
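To reproduce the comparison described below, we can then call this helper on the three generated outputs (a usage sketch, assuming the generation step above):

ppl = calculate_perplexity(model, original_text)
ppl_abs = calculate_perplexity(model_abs, absmax_text)
ppl_zp = calculate_perplexity(model_zp, zp_text)

print(f"Original perplexity:  {ppl.item():.2f}")
print(f"Absmax perplexity:    {ppl_abs.item():.2f}")
print(f"Zeropoint perplexity: {ppl_zp.item():.2f}")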
We see that the perplexity of the original model is slightly lower than that of the two others. A single experiment is not very reliable, but
we could repeat this process multiple times to see the difference between each model. In theory, zero-point quantization should
be slightly better than absmax, but it is also more costly to compute.
In this example, we applied quantization techniques to entire layers (per-tensor basis). However, we could apply them at different
granularity levels: from the entire model down to individual values. Quantizing the entire model in one pass would seriously degrade the
performance, while quantizing individual values would create a big overhead. In practice, we often prefer vector-wise
quantization, which considers the variability of values in rows and columns within the same tensor.
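As an illustration (not from the original notebook), a hypothetical per-row variant of our absmax_quantize function would simply compute one scale per row instead of a single scale for the whole tensor:

def absmax_quantize_rowwise(X):
    # Assumes X is a 2-D weight matrix: one scaling factor per row
    scale = 127 / torch.max(torch.abs(X), dim=1, keepdim=True).values
    X_quant = (scale * X).round()
    X_dequant = X_quant / scale
    return X_quant.to(torch.int8), X_dequant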
However, even vector-wise quantization doesn’t solve the problem of outlier features. Outlier features are extreme values (negative
or positive) that appear in all transformer layers when the model reaches a certain scale (>6.7B parameters). This is an issue since a
single outlier can reduce the precision of all the other values. But discarding these outlier features is not an option, since it would
greatly degrade the model’s performance.
LLM.int8(), implemented in the bitsandbytes library, tackles this problem by relying on a vector-wise (absmax) quantization scheme and introducing mixed-precision computation: the outlier features are processed in FP16, while everything else is handled in INT8.

[Figure: image by author]

LLM.int8() works by conducting the matrix multiplication in three key steps:
1. Extract columns from the input hidden states X containing outlier features using a custom threshold.
2. Perform the matrix multiplication of the outliers using FP16 and the non-outliers using INT8 with vector-wise quantization (row-
wise for the hidden state X and column-wise for the weight matrix W).
3. Dequantize the non-outlier results (INT8 to FP16) and add them to the outlier results to get the full result in FP16.
[Figure: image by author]
This approach is necessary because 8-bit precision is limited and can lead to substantial errors when quantizing a vector with large
values. These errors also tend to amplify as they propagate through multiple layers.
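To make these three steps more concrete, here is a simplified, purely illustrative PyTorch sketch of the decomposition. The real bitsandbytes kernels perform the non-outlier product with actual INT8 arithmetic; the function name is hypothetical, the 6.0 threshold matches the library’s default, and no zero-range handling is included:

def mixed_precision_matmul(X, W, threshold=6.0):
    # Step 1: identify hidden-state columns containing at least one outlier
    outlier_cols = (X.abs() > threshold).any(dim=0)

    # Step 2a: multiply the outlier features directly in FP16
    out_fp16 = X[:, outlier_cols] @ W[outlier_cols, :]

    # Step 2b: vector-wise absmax quantization of the remaining values
    X_sub, W_sub = X[:, ~outlier_cols], W[~outlier_cols, :]
    scale_x = 127 / X_sub.abs().max(dim=1, keepdim=True).values  # row-wise for X
    scale_w = 127 / W_sub.abs().max(dim=0, keepdim=True).values  # column-wise for W
    out_q = (X_sub * scale_x).round() @ (W_sub * scale_w).round()

    # Step 3: dequantize the non-outlier result and add both partial results
    return out_fp16 + out_q / (scale_x * scale_w)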
We can easily use this technique thanks to the integration of the bitsandbytes library into the Hugging Face ecosystem. We just need
to specify load_in_8bit=True when loading the model (it also requires a GPU).
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_int8 = AutoModelForCausalLM.from_pretrained(model_id,
                                                  device_map='auto',
                                                  load_in_8bit=True,
                                                  )
print(f"Model size: {model_int8.get_memory_footprint():,} bytes")
With this extra line of code, the model is now almost three times smaller (168MB vs. 487MB). We can even compare the
distribution of the original and quantized weights as we did earlier:
In this case, we see spikes around -2, -1, 0, 1, 2, etc. These values correspond to the parameters stored in the INT8 format
(non-outliers). You can verify it by printing the model’s weights using model_int8.parameters().
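For instance, one possible way to inspect them (assuming the 8-bit parameters are exposed with a torch.int8 dtype):

# Print the first parameter stored in INT8 format
for name, param in model_int8.named_parameters():
    if param.dtype == torch.int8:
        print(name, param.flatten()[:10])
        break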
We can also generate text with this quantized model and compare it to the original model.
print(f"Original model:\n{original_text}")
print("-" * 50)
print(f"LLM.int8() model:\n{text_int8}")
Original model:
I have a dream, and it is a dream I believe I would get to live in my future. I love my mother, and there was that one time I had
--------------------------------------------------
LLM.int8() model:
I have a dream. I don't know what will come of it, but I am going to have to look for something that will be right. I haven't thou
Once again, it is difficult to judge which output is the best, but we can rely on the perplexity metric to give us an (approximate)
answer.
In this case, the perplexity of the quantized model is about half that of the original one. In general, this is not the case, but it shows that
this quantization technique is very competitive. In fact, the authors of LLM.int8() show that the performance degradation is so low
it’s negligible (<1%). However, it has an additional cost in terms of computation: LLM.int8() is roughly 20% slower for large
models.