
KD-Lib: A PyTorch library for Knowledge Distillation, Pruning and Quantization

Het Shah, Avishree Khare*, Neelay Shah*, Khizir Siddiqui*


{f201700931, f201701122, f201804003, f201804394}@goa.bits-pilani.ac.in
* Equal contribution

Abstract

In recent years, the growing size of neural networks has led to a vast amount of research concerning compression techniques to mitigate the drawbacks of such large sizes. Most of these research works can be categorized into three broad families: Knowledge Distillation, Pruning, and Quantization. While there has been steady research in this domain, adoption and commercial usage of the proposed techniques have not quite progressed at the same rate. We present KD-Lib, an open-source PyTorch based library, which contains state-of-the-art modular implementations of algorithms from the three families on top of multiple abstraction layers. KD-Lib is model and algorithm-agnostic, with extended support for hyperparameter tuning using Optuna and Tensorboard for logging and monitoring. The library can be found at https://github.com/SforAiDl/KD_Lib.

Introduction

Deep neural networks (DNNs) have gained widespread popularity in recent years, finding use in several domains including computer vision, natural language processing, human-computer interaction and more. These networks have achieved remarkable results on several tasks, often even surpassing human-level performance.

The number of parameters of such DNNs often increases multi-fold with an increase in their representation capacity, limiting the deployment capabilities and hence the commercial feasibility of these networks. This limitation warrants the need for efficient compression techniques that can shrink the networks in size while ensuring that the drop in performance is minimal. In this paper, we restrict our focus to three widely used compression techniques: Knowledge Distillation, Network Pruning and Quantization.

Knowledge Distillation (Hinton, Vinyals, and Dean 2015) is a compression paradigm that leverages the capability of large neural networks (called teacher networks) to transfer knowledge to smaller networks (called student networks). While large models (such as very deep neural networks or ensembles of many models) have higher knowledge capacity than small models, this capacity might not be fully utilized, and evaluating such a model can be just as computationally expensive even when little of its capacity is used. Knowledge distillation aims to transfer knowledge from a large model to a smaller model without loss of validity. Several advancements have been witnessed in the development of richer knowledge distillation algorithms, attempting to reduce the difference in test accuracies of the teacher and the student. These algorithms are model-agnostic and hence can be used for a wide variety of network architectures.

While knowledge distillation attempts to train an equally competent smaller network, network pruning (LeCun et al. 1990) attempts to reduce the size of the existing network by removing unimportant weights. Different pruning techniques differ in the choice of weights to eliminate and the methods used to do so. Pruning can help in reducing the size of the network by up to 90% with minimal loss in performance. Some approaches have also been empirically shown to result in faster training of the pruned network along with a higher test accuracy (Frankle and Carbin 2018).

Quantization is another way to compress neural networks by reducing the number of bits used to store the weights. As the weights of a network are usually stored as 32-bit floating-point values (FP32), reducing the precision to 8-bit integer values (INT8) reduces the size of the network by a factor of four. Several approaches have been developed to quantize networks with minimal loss in performance.

These compression techniques have become extremely popular in recent years and are actively being researched. New algorithms proposed in research papers can be difficult to understand and implement, especially for potential users in a non-academic setting, thereby limiting their commercial usage. To the best of our knowledge, there does not exist an umbrella framework containing implementations of state-of-the-art algorithms in Knowledge Distillation, Pruning and Quantization. In this paper, we present KD-Lib, a comprehensive PyTorch based library for model compression. KD-Lib aims to bridge the gap between research and widespread use of model compression techniques. We envision that such a framework would be helpful to researchers as well, providing them a tool to build upon existing algorithms and helping them in going from idea to implementation faster.
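To make the teacher-student paradigm above concrete, the following minimal sketch computes the classic distillation objective of Hinton, Vinyals, and Dean (2015): a temperature-softened KL-divergence between teacher and student outputs combined with the usual cross-entropy on the ground-truth labels. This is a library-agnostic illustration in plain PyTorch, not KD-Lib's own implementation; the function name, the temperature and the weighting factor alpha are chosen here only for demonstration.

import torch
import torch.nn.functional as F

def hinton_kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Weighted sum of the hard-label cross-entropy and the KL divergence
    between temperature-softened teacher and student distributions."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    distill = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1 - alpha) * hard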
Library                         Knowledge Distillation        Pruning    Quantization
KD-Lib (Ours)                   Present                       Present    Present
Distiller (Zmora et al. 2019)   Present (only 1 algorithm)    Present    Present
AIMET³                          -                             Present    Present
AquVitae¹                       Present                       -          -
Distiller²                      Present                       -          -

Table 1: Comparison of various libraries with KD-Lib.

Related work

We compare KD-Lib with several openly available frameworks and libraries. In our comparison, we do not include libraries that support fewer than two algorithms.

Distiller (Zmora et al. 2019) is the most extensive framework we found, but it primarily focuses on quantization and pruning, with only one knowledge distillation algorithm (Hinton, Vinyals, and Dean 2015). AquVitae¹ contains 4 distillation methods but no quantization and pruning algorithms. Similarly, Distiller² has 11 knowledge distillation techniques but lacks pruning and quantization methods. AIMET³ focuses mainly on quantization and some other relatively less popular model compression techniques such as tensor decomposition. In our survey, we found no library containing algorithms pertaining to all three of the popular compression paradigms: knowledge distillation, pruning and quantization. Table 1 shows a concise comparison with different frameworks.

¹ https://github.com/aquvitae/aquvitae
² https://github.com/karanchahal/distiller
³ https://github.com/quic/aimet

Features and Algorithms

KD-Lib houses several algorithms proposed in recent years for model compression. The following features have driven the design choices for the library:

• The main aim of KD-Lib is to make model compression algorithms accessible to a wide range of users, and hence the work is fully open-source.
• The library should act as a catalyst for further research in these fields. It should also be extendable to newer algorithms and other model compression fields. Hence, it is designed to be modular, allowing flexible modifications to essential components that can lead to novel algorithms or better extensions to existing algorithms.
• The interface should be easy to use. Hence, the core functionalities (distillation/pruning/quantization) are accessible in a few lines of code.
• As tuning the hyperparameters is essential for optimum performance, KD-Lib provides support for hyperparameter tuning via Optuna. Monitoring and logging support is also provided through Tensorboard.

A brief description of the implemented algorithms is as follows:

• Knowledge Distillation: The algorithms have been divided into two major task-types: Vision and Text. The Vision module currently supports 13 algorithms, while the Text module supports distillation from BERT to LSTM-based networks (Tang et al. 2019).
• Pruning: The library currently supports pruning based on the Lottery Ticket Hypothesis (Frankle and Carbin 2018).
• Quantization: Static Quantization, Dynamic Quantization and Quantization Aware Training (QAT) (Jacob et al. 2018) are currently supported by KD-Lib.

Code Structure

The structure of the library has been designed for efficient use with the following major principles kept in mind:

• The core function of an algorithm can be executed in one line of code. Hence, the classes contain a dedicated method for distillation/pruning/quantization.
• Each module allows extension to newer features and easy modifications. Hence, fluid components of algorithms (loss functions in distillation, for example) can be easily customized.
• Necessary statistics are available wherever needed. Hence, methods dedicated to these are also present (get_pruning_statistics, for example).

Distiller
    train_student
    train_teacher
    evaluate
    calculate_kd_loss

Figure 1: Structure of a Distiller.

Knowledge Distillation algorithms can be accessed as Distiller objects (Figure 1), with at least the methods shown in Figure 1. The train_student method distills knowledge from a teacher network to a student network, where the teacher network could optionally be trained using the train_teacher method. The evaluate method can be invoked to test the performance of the student network. The calculate_kd_loss method can be overridden to provide a custom loss function for distillation; this can also be leveraged by researchers to test novel Knowledge Distillation loss functions.
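The sketch below illustrates this workflow on toy data. The class name VanillaKD, its import path and the exact constructor and method signatures are assumptions of this sketch, modeled on the methods listed in Figure 1; they should be verified against the library documentation. The models, data and optimizers are placeholders.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from KD_Lib.KD import VanillaKD  # assumed import path and class name

# Toy data and models, purely for illustration.
X, y = torch.randn(512, 20), torch.randint(0, 4, (512,))
train_loader = DataLoader(TensorDataset(X, y), batch_size=32)
test_loader = DataLoader(TensorDataset(X, y), batch_size=32)

teacher = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 4))
student = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 4))
teacher_opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
student_opt = torch.optim.Adam(student.parameters(), lr=1e-3)

# Each core operation is a single method call on the Distiller object.
distiller = VanillaKD(teacher, student, train_loader, test_loader,
                      teacher_opt, student_opt)
distiller.train_teacher(epochs=5)   # optional: train the teacher first
distiller.train_student(epochs=5)   # distill knowledge into the student
distiller.evaluate()                # test the student network

A custom distillation objective can be explored by subclassing such a Distiller and overriding its calculate_kd_loss method.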
Pruner
    prune
    get_pruning_statistics

Figure 2: Structure of a Pruner.

Pruning Epoch    % Model Pruned    Accuracy
1                0.0               0.9878
2                0.10              0.9891
3                0.19              0.9890

Table 3: Pruning percentage and accuracy of a ResNet18 model on MNIST using Lottery Ticket Pruning (Frankle and Carbin 2018). Each pruning epoch consists of 5 training epochs. 'Model Pruned' is the percentage of the model pruned and 'Accuracy' is the corresponding accuracy at the end of the epoch.

Pruning algorithms have been implemented as Pruner objects (Figure 2). Each Pruner object can access the prune method for pruning the network. Additionally, the get_pruning_statistics method can be used to obtain information about the weights of the network after pruning (the percentage of the network pruned, for example).
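A minimal sketch of this interface, continuing with the toy student model and data loaders from the distillation example above: the class name Lottery_Tickets_Pruner, its import path, constructor arguments and the prune argument are assumptions made for illustration. The paper only specifies that a Pruner exposes prune and get_pruning_statistics, so exact signatures should be checked against the documentation.

from KD_Lib.Pruning import Lottery_Tickets_Pruner  # assumed import path and class name

# Assumed constructor: the model to prune plus loaders for retraining and evaluation.
pruner = Lottery_Tickets_Pruner(student, train_loader, test_loader)
pruner.prune(num_iterations=3)            # iterative lottery-ticket style pruning (assumed argument)
stats = pruner.get_pruning_statistics()   # e.g. percentage of weights pruned
print(stats)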

Quantizer
    quantize
    get_performance_statistics
    get_model_sizes

Figure 3: Structure of a Quantizer.

Algorithm    % Size Change    BA      NA
Static       -0.75            0.72    0.70
QAT          -0.75            0.72    0.71
Dynamic      -0.19            0.70    0.70

Table 4: Comparison of various quantization algorithms. 'BA' (Base Accuracy) is the accuracy of the model before quantization, and 'NA' (New Accuracy) is the accuracy of the model after quantization. '% Size Change' refers to the change in size after quantization. For Static Quantization and QAT, a ResNet18 is tested on the CIFAR10 dataset; for Dynamic Quantization, an LSTM is tested on the IMDB dataset.

Quantization algorithms can be accessed via Quantizer objects (Figure 3). The quantize method can be used for quantization (with differing implementations for different algorithms). Additionally, the get_model_sizes method can be used to compare the sizes of the model before and after quantization, and the get_performance_statistics method can be used to compare test-times and error metrics for the two networks.
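As an illustration of this interface, the sketch below applies dynamic quantization to the toy student model from the earlier examples. The class name Dynamic_Quantizer, its import path and constructor arguments are assumptions of this sketch; only the quantize, get_model_sizes and get_performance_statistics methods are specified above, so the exact signatures should be checked against the documentation. Dynamic quantization of linear layers to INT8 is the kind of transformation PyTorch's own torch.quantization utilities provide.

import torch
from KD_Lib.Quantization import Dynamic_Quantizer  # assumed import path and class name

# Assumed constructor: the model to quantize, a loader for measuring performance,
# and the module types whose weights should be stored in INT8.
quantizer = Dynamic_Quantizer(student, test_loader, qconfig_spec={torch.nn.Linear})
quantized_model = quantizer.quantize()    # returns the quantized (INT8) model
quantizer.get_model_sizes()               # compare model sizes before and after quantization
quantizer.get_performance_statistics()    # compare test times and error metrics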
The documentation for the library⁴ has the description of all classes and selected tutorials with example code snippets.

⁴ https://kd-lib.readthedocs.io/

Benchmarks

We summarize benchmark results for some of the algorithms implemented in KD-Lib in Tables 2, 3 and 4.

Algorithm                                           Accuracy
None                                                0.57
DML (Zhang et al. 2018)                             0.62
Self Training (Yun et al. 2020)                     0.61
Messy Collab (Arani, Sarfraz, and Zonooz 2019)      0.60
Noisy Teacher (Sau and Balasubramanian 2016)        0.59
TAKD (Mirzadeh et al. 2019)                         0.59
RCO (Jin et al. 2019)                               0.58
Probability Shift (Wen, Lai, and Qian 2019)         0.58

Table 2: Accuracies of student networks trained with some of the knowledge distillation algorithms packaged in KD-Lib, on the CIFAR10 dataset. All models were trained with the same hyperparameter set to ensure a fair comparison. We consider ResNet34 as the teacher network (with an accuracy of 0.63) and report accuracies for the student network (ResNet18). 'None' refers to a ResNet18 model trained from scratch without any model compression algorithm. The compression ratio for all of the knowledge distillation algorithms is 50.7%.

Conclusion and Future Work

In this paper, we present KD-Lib, an easy-to-use PyTorch-based library for Knowledge Distillation, Pruning and Quantization. KD-Lib is designed to facilitate the adoption of current model compression techniques and act as a catalyst for further research in this direction. We plan on actively maintaining the library and expanding it to include more algorithms and desirable features (distributed training, for example) in the future. We further plan on extending this library to other domains relevant to the research community, including but not limited to explainability and interpretability in knowledge distillation.

References

Arani, E.; Sarfraz, F.; and Zonooz, B. 2019. Improving Generalization and Robustness with Noisy Collaboration in Knowledge Distillation.

Frankle, J.; and Carbin, M. 2018. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.

Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network.

Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; and Kalenichenko, D. 2018. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Jin, X.; Peng, B.; Wu, Y.; Liu, Y.; Liu, J.; Liang, D.; Yan, J.; and Hu, X. 2019. Knowledge Distillation via Route Constrained Optimization.

LeCun, Y.; Denker, J. S.; and Solla, S. A. 1990. Optimal Brain Damage. In Advances in Neural Information Processing Systems.
Mirzadeh, S.-I.; Farajtabar, M.; Li, A.; Levine, N.; Matsukawa, A.; and Ghasemzadeh, H. 2019. Improved Knowledge Distillation via Teacher Assistant.

Sau, B. B.; and Balasubramanian, V. N. 2016. Deep Model Compression: Distilling Knowledge from Noisy Teachers.

Tang, R.; Lu, Y.; Liu, L.; Mou, L.; Vechtomova, O.; and Lin, J. 2019. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks. CoRR abs/1903.12136.

Wen, T.; Lai, S.; and Qian, X. 2019. Preparing Lessons: Improve Knowledge Distillation with Better Supervision.

Yun, S.; Park, J.; Lee, K.; and Shin, J. 2020. Regularizing Class-Wise Predictions via Self-Knowledge Distillation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Zhang, Y.; Xiang, T.; Hospedales, T. M.; and Lu, H. 2018. Deep Mutual Learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Zmora, N.; Jacob, G.; Zlotnik, L.; Elharar, B.; and Novik, G. 2019. Neural Network Distiller: A Python Package For DNN Compression Research. URL https://arxiv.org/abs/1910.12232.
