KD-Lib: A PyTorch library for Knowledge Distillation, Pruning and Quantization
Het Shah,1 Avishree Khare,2* Neelay Shah,3* Khizir Siddiqui4*
{f201700931, f201701122, f201804003, f201804394}@goa.bits-pilani.ac.in
* Equal contribution
arXiv:2011.14691v1 [cs.LG] 30 Nov 2020
Abstract

In recent years, the growing size of neural networks has led to a vast amount of research concerning compression techniques to mitigate the drawbacks of such large sizes. Most of these research works can be categorized into three broad families: Knowledge Distillation, Pruning, and Quantization. While there has been steady research in this domain, adoption and commercial usage of the proposed techniques has not quite progressed at the same rate. We present KD-Lib, an open-source PyTorch based library, which contains state-of-the-art modular implementations of algorithms from the three families on top of multiple abstraction layers. KD-Lib is model and algorithm-agnostic, with extended support for hyperparameter tuning using Optuna and Tensorboard for logging and monitoring. The library can be found at https://github.com/SforAiDl/KD_Lib

Introduction

Deep neural networks (DNNs) have gained widespread popularity in recent years, finding use in several domains including computer vision, natural language processing, human-computer interaction and more. These networks have achieved remarkable results on several tasks, often even surpassing human-level performance.

The number of parameters of such DNNs often increases multi-fold with an increase in their representation capacity, limiting the deployment capabilities and hence the commercial feasibility of these networks. This limitation warrants the need for efficient compression techniques that can shrink the networks in size while ensuring that the drop in performance is minimal. In this paper, we restrict our focus to three widely-used compression techniques: Knowledge Distillation, Network Pruning and Quantization.

Knowledge Distillation (Hinton, Vinyals, and Dean 2015) is a compression paradigm that leverages the capability of large neural networks (called teacher networks) to transfer knowledge to smaller networks (called student networks). While large models (such as very deep neural networks or ensembles of many models) have higher knowledge capacity than small models, this capacity might not be fully utilized; it can be computationally just as expensive to evaluate a model even if it utilizes little of its knowledge capacity. Knowledge distillation aims to transfer knowledge from a large model to a smaller model without loss of validity. Several advancements have been witnessed in the development of richer knowledge distillation algorithms, attempting to reduce the difference in test accuracies of the teacher and the student. These algorithms are model-agnostic and hence can be used for a wide variety of network architectures.

While knowledge distillation attempts to train an equally-competent smaller network, network pruning (LeCun et al. 1990) attempts to reduce the size of the existing network by removing unimportant weights. Different pruning techniques differ in the choice of weights to eliminate and in the methods used to do so. Pruning can help in reducing the size of the network by up to 90% with minimal loss in performance. Some approaches have also been empirically shown to result in faster training of the pruned network along with a higher test accuracy (Frankle and Carbin 2018).

Quantization is another way to compress neural networks by reducing the number of bits used to store the weights. As the weights of a network are usually stored as 32-bit floating-point values (FP32), reducing the precision to 8-bit integer values (INT8) reduces the size of the network by a factor of four. Several approaches have been developed to quantize networks with minimal loss in performance.

These compression techniques have become extremely popular in recent years and are actively being researched. New algorithms proposed in research papers can be difficult to understand and implement, especially for potential users in a non-academic setting, thereby limiting their commercial usage.
To the best of our knowledge, there does not exist an umbrella framework containing implementations of state-of-the-art algorithms in Knowledge Distillation, Pruning and Quantization. In this paper, we present KD-Lib, a comprehensive PyTorch based library for model compression. KD-Lib aims to bridge the gap between research and widespread use of model compression techniques. We envision that such a framework would be helpful to researchers as well, providing them a tool to build upon existing algorithms and helping them go from idea to implementation faster.

Library                          Knowledge Distillation        Pruning    Quantization
KD-Lib (Ours)                    Present                       Present    Present
Distiller (Zmora et al. 2019)    Present (only 1 algorithm)    Present    Present
AIMET3                           -                             Present    Present
AquVitae1                        Present                       -          -
Distiller2                       Present                       -          -
Table 1: Comparison of various libraries with KD-Lib.
Related Work

We compare KD-Lib with several openly available frameworks and libraries. In our comparison, we do not include libraries that support fewer than two algorithms.

Distiller (Zmora et al. 2019) is the most extensive framework we found, but it primarily focuses on quantization and pruning, with only one knowledge distillation algorithm (Hinton, Vinyals, and Dean 2015). AquVitae1 contains 4 distillation methods but no quantization or pruning algorithms. Similarly, Distiller2 has 11 knowledge distillation techniques but lacks pruning and quantization methods. AIMET3 focuses mainly on quantization and some other relatively less popular model compression techniques such as tensor decomposition.

In our survey, we found no library containing algorithms pertaining to all three of the popular compression paradigms - knowledge distillation, pruning and quantization. Table 1 shows a concise comparison with different frameworks.

1 https://github.com/aquvitae/aquvitae
2 https://github.com/karanchahal/distiller
3 https://github.com/quic/aimet

Features and Algorithms

KD-Lib houses several algorithms proposed in recent years for model compression. The following features have driven the design choices for the library:

• The main aim of KD-Lib is to make model compression algorithms accessible to a wide range of users, and hence the work is fully open-source.
• The library should act as a catalyst for further research in these fields. It should also be extendable to newer algorithms and other model compression fields. Hence, it is designed to be modular, allowing flexible modifications to essential components that can lead to novel algorithms or better extensions to existing algorithms.
• The interface should be easy to use. Hence, the core functionalities (distillation/pruning/quantization) are accessible in a few lines of code.
• As tuning the hyperparameters is essential for optimum performance, KD-Lib provides support for hyperparameter tuning via Optuna. Monitoring and logging support is also provided through Tensorboard.

A brief description of the implemented algorithms is as follows:

• Knowledge Distillation: The algorithms have been divided into two major task-types: Vision and Text. The Vision module currently supports 13 algorithms, while the Text module supports distillation from BERT to LSTM-based networks (Tang et al. 2019).
• Pruning: The library currently supports pruning based on the Lottery Ticket Hypothesis (Frankle and Carbin 2018).
• Quantization: Static Quantization, Dynamic Quantization and Quantization Aware Training (QAT) (Jacob et al. 2018) are currently supported by KD-Lib.

Code Structure

The structure of the library has been designed for efficient use with the following major principles kept in mind:

• The core function of an algorithm can be executed in one line of code. Hence, the classes contain a dedicated method for distillation/pruning/quantization.
• Each module allows extension to newer features and easy modifications. Hence, fluid components of algorithms (loss functions in distillation, for example) can be easily customized.
• Necessary statistics are available wherever needed. Hence, methods dedicated to these are also present (get_pruning_statistics, for example).

Figure 1: Structure of a Distiller (methods: train_teacher, train_student, evaluate, calculate_kd_loss).

Knowledge Distillation algorithms can be accessed as Distiller objects (Figure 1), with at least the mentioned methods. The train_student method distills knowledge from a teacher network to a student network, where the teacher network can optionally be trained using the train_teacher method. The evaluate method can be invoked to test the performance of the student network. The calculate_kd_loss method can be overridden to provide a custom loss function for distillation; this can also be leveraged by researchers to test novel Knowledge Distillation loss functions.
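As a concrete illustration, the sketch below shows how a Distiller might be driven end-to-end. The VanillaKD class name, constructor order and keyword arguments are assumptions made for this example and may differ from the library's exact API; only the train_teacher, train_student, evaluate and calculate_kd_loss methods are taken from the description above, so the documentation should be consulted for the precise signatures.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# Hypothetical import; the exact class name and module path may differ between versions.
from KD_Lib.KD import VanillaKD

# CIFAR10 loaders (dataset choice and batch size are illustrative).
transform = transforms.Compose([transforms.ToTensor()])
train_loader = DataLoader(
    datasets.CIFAR10("data", train=True, download=True, transform=transform),
    batch_size=128, shuffle=True)
test_loader = DataLoader(
    datasets.CIFAR10("data", train=False, download=True, transform=transform),
    batch_size=128)

teacher = models.resnet34(num_classes=10)  # larger teacher network
student = models.resnet18(num_classes=10)  # smaller student network
teacher_optimizer = torch.optim.SGD(teacher.parameters(), lr=0.01, momentum=0.9)
student_optimizer = torch.optim.SGD(student.parameters(), lr=0.01, momentum=0.9)

# Constructor arguments are assumed; check the KD-Lib documentation for the exact signature.
distiller = VanillaKD(teacher, student, train_loader, test_loader,
                      teacher_optimizer, student_optimizer)

distiller.train_teacher(epochs=5)   # optionally train the teacher first
distiller.train_student(epochs=5)   # distill knowledge into the student
distiller.evaluate()                # test the performance of the student network


# Researchers can override calculate_kd_loss to experiment with novel distillation losses.
# The method signature below is an assumption for illustration only.
class CustomKD(VanillaKD):
    def calculate_kd_loss(self, y_pred_student, y_pred_teacher, y_true):
        # Blend a soft-target KL term (temperature 4.0) with hard-label cross-entropy.
        soft = F.kl_div(F.log_softmax(y_pred_student / 4.0, dim=1),
                        F.softmax(y_pred_teacher / 4.0, dim=1),
                        reduction="batchmean")
        hard = F.cross_entropy(y_pred_student, y_true)
        return 0.5 * soft + 0.5 * hard

Keeping the fluid part of the algorithm (the distillation loss) in a single overridable method is what makes the Distiller abstraction easy to extend to new objectives without touching the training loop.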
Pruning algorithms have been implemented as Pruner objects (Figure 2). Each Pruner object can access the prune method for pruning the network. Additionally, the get_pruning_statistics method can be used to obtain information about the weights of the network after pruning (the percentage of the network pruned, for example); a usage sketch follows Table 3.

Figure 2: Structure of a Pruner (methods: prune, get_pruning_statistics).

Pruning Epoch    % Model Pruned    Accuracy
1                0.0               0.9878
2                0.10              0.9891
3                0.19              0.9890

Table 3: Pruning percentage and accuracy of a ResNet18 model on MNIST using Lottery Ticket Pruning (Frankle and Carbin 2018). Each pruning epoch consists of 5 training epochs. 'Model pruned' is the percentage of the model pruned and 'Accuracy' is the corresponding accuracy at the end of the epoch.
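The following is a minimal sketch of the Pruner interface. The LotteryTicketsPruner class name and the constructor and prune() arguments are hypothetical placeholders; only the prune and get_pruning_statistics methods come from the description above.

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# Hypothetical import and class name for a Lottery-Ticket-style pruner.
from KD_Lib.Pruning import LotteryTicketsPruner

transform = transforms.Compose([transforms.ToTensor()])
train_loader = DataLoader(
    datasets.CIFAR10("data", train=True, download=True, transform=transform),
    batch_size=128, shuffle=True)
test_loader = DataLoader(
    datasets.CIFAR10("data", train=False, download=True, transform=transform),
    batch_size=128)

model = models.resnet18(num_classes=10)
loss_fn = torch.nn.CrossEntropyLoss()

# Constructor and prune() arguments are assumptions; consult the documentation.
pruner = LotteryTicketsPruner(model, train_loader, test_loader, loss_fn)
pruner.prune(num_iterations=3, train_epochs=5)   # iterative train-prune-rewind cycles
print(pruner.get_pruning_statistics())           # e.g. fraction of weights pruned, accuracy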
Figure 3: Structure of a Quantizer (methods: quantize, get_performance_statistics, get_model_sizes).

Quantization algorithms can be accessed via Quantizer objects (Figure 3). The quantize method can be used for quantization (with differing implementations for different algorithms). Additionally, the get_model_sizes method can be used to compare the sizes of the model before and after quantization, and the get_performance_statistics method can be used to compare test-times and error metrics for the two networks; a usage sketch is shown below.

The documentation for the library4 has the description of all classes and selected tutorials with example code snippets.

4 https://kd-lib.readthedocs.io/
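The sketch below illustrates the Quantizer interface using dynamic quantization as an example. The DynamicQuantizer class name and constructor arguments are assumptions for illustration; only the quantize, get_model_sizes and get_performance_statistics methods are taken from the description above.

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# Hypothetical import and class name; the exact API may differ.
from KD_Lib.Quantization import DynamicQuantizer

transform = transforms.Compose([transforms.ToTensor()])
test_loader = DataLoader(
    datasets.CIFAR10("data", train=False, download=True, transform=transform),
    batch_size=128)

model = models.resnet18(num_classes=10)  # a previously trained FP32 model

# Constructor arguments are assumed; the test loader is used to measure
# accuracy and test-time before and after quantization.
quantizer = DynamicQuantizer(model, test_loader)

quantized_model = quantizer.quantize()   # produce the reduced-precision (e.g. INT8) model
quantizer.get_model_sizes()              # compare model sizes before and after quantization
quantizer.get_performance_statistics()   # compare test-times and error metrics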
Benchmarks

We summarize benchmark results for some of the algorithms implemented in KD-Lib in Tables 2, 3 and 4.

Algorithm                                         Accuracy
None                                              0.57
DML (Zhang et al. 2018)                           0.62
Self Training (Yun et al. 2020)                   0.61
Messy Collab (Arani, Sarfraz, and Zonooz 2019)    0.60
Noisy Teacher (Sau and Balasubramanian 2016)      0.59
TAKD (Mirzadeh et al. 2019)                       0.59
RCO (Jin et al. 2019)                             0.58
Probability Shift (Wen, Lai, and Qian 2019)       0.58

Table 2: Accuracies of networks trained with some of the knowledge distillation algorithms packaged in KD-Lib, on the CIFAR10 dataset. All models were trained with the same hyperparameter set to ensure a fair comparison. We consider ResNet34 as the teacher network (with an accuracy of 0.63) and report accuracies for the student network (ResNet18). 'None' refers to a ResNet18 model trained from scratch without any model compression algorithm. The compression ratio for all of the knowledge distillation algorithms is 50.7%.

Algorithm    % Size Change    BA      NA
Static       -0.75            0.72    0.70
QAT          -0.75            0.72    0.71
Dynamic      -0.19            0.70    0.70

Table 4: Comparison of various quantization algorithms. 'BA' (Base Accuracy) is the accuracy of the model before quantization, and 'NA' (New Accuracy) is the accuracy of the model after quantization. '% Size Change' refers to the change in size after quantization. For Static Quantization and QAT, ResNet18 is tested on the CIFAR10 dataset; for Dynamic Quantization, an LSTM is tested on the IMDB dataset.

Conclusion and Future Work

In this paper, we present KD-Lib, an easy-to-use PyTorch-based library for Knowledge Distillation, Pruning and Quantization. KD-Lib is designed to facilitate the adoption of current model compression techniques and to act as a catalyst for further research in this direction. We plan on actively maintaining the library and expanding it to include more algorithms and desirable features (distributed training, for example) in the future. We further plan on extending the library to other domains relevant to the research community, including but not limited to explainability and interpretability in knowledge distillation.

References

Arani, E.; Sarfraz, F.; and Zonooz, B. 2019. Improving Generalization and Robustness with Noisy Collaboration in Knowledge Distillation.
Frankle, J.; and Carbin, M. 2018. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.
Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network.
Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; and Kalenichenko, D. 2018. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Jin, X.; Peng, B.; Wu, Y.; Liu, Y.; Liu, J.; Liang, D.; Yan, J.; and Hu, X. 2019. Knowledge Distillation via Route Constrained Optimization.
LeCun, Y.; Denker, J. S.; and Solla, S. A. 1990. Optimal Brain Damage. In Advances in Neural Information Processing Systems.
Mirzadeh, S.-I.; Farajtabar, M.; Li, A.; Levine, N.; Matsukawa, A.; and Ghasemzadeh, H. 2019. Improved Knowledge Distillation via Teacher Assistant.
Sau, B. B.; and Balasubramanian, V. N. 2016. Deep Model Compression: Distilling Knowledge from Noisy Teachers.
Tang, R.; Lu, Y.; Liu, L.; Mou, L.; Vechtomova, O.; and Lin, J. 2019. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks. CoRR abs/1903.12136.
Wen, T.; Lai, S.; and Qian, X. 2019. Preparing Lessons: Improve Knowledge Distillation with Better Supervision.
Yun, S.; Park, J.; Lee, K.; and Shin, J. 2020. Regularizing Class-Wise Predictions via Self-Knowledge Distillation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Zhang, Y.; Xiang, T.; Hospedales, T. M.; and Lu, H. 2018. Deep Mutual Learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Zmora, N.; Jacob, G.; Zlotnik, L.; Elharar, B.; and Novik, G. 2019. Neural Network Distiller: A Python Package For DNN Compression Research. URL https://arxiv.org/abs/1910.12232.