A Gift from Knowledge Distillation:

Fast Optimization, Network Minimization and Transfer Learning

Junho Yim (1), Donggyu Joo (1), Jihoon Bae (2), Junmo Kim (1)

(1) School of Electrical Engineering, KAIST, South Korea
(2) Electronics and Telecommunications Research Institute
{junho.yim, jdg105, junmo.kim}@kaist.ac.kr
{baejh}@etri.re.kr

Abstract

We introduce a novel technique for knowledge transfer, where knowledge from a pretrained deep neural network (DNN) is distilled and transferred to another DNN. As the DNN maps from the input space to the output space through many layers sequentially, we define the distilled knowledge to be transferred in terms of flow between layers, which is calculated by computing the inner product between features from two layers. When we compare the student DNN and the original network with the same size as the student DNN but trained without a teacher network, the proposed method of transferring the distilled knowledge as the flow between two layers exhibits three important phenomena: (1) the student DNN that learns the distilled knowledge is optimized much faster than the original model; (2) the student DNN outperforms the original DNN; and (3) the student DNN can learn the distilled knowledge from a teacher DNN that is trained at a different task, and the student DNN outperforms the original DNN that is trained from scratch.

Figure 1. Concept diagram of the proposed transfer learning method. The FSP matrix, which represents the distilled knowledge from the teacher DNN, is generated by the features from two layers. By computing the inner product, which represents the direction, to generate the FSP matrix, the flow between two layers can be represented by the FSP matrix.

1. Introduction

Over the past several years, various deep neural network (DNN) models have provided state-of-the-art performance in many tasks, ranging from computer vision [8, 23] to natural language processing [1, 19]. Recently, several studies on the knowledge transfer technique have been conducted [11, 20]. Hinton et al. [11] first proposed the concept of knowledge distillation (KD) in the teacher–student framework by introducing the teacher's softened output. Although KD training achieved improved accuracy on several datasets, this method has limitations such as difficulty with optimizing very deep networks. To improve the performance of KD training for deeper networks, Romero et al. [20] devised a hint-based training approach that uses the pretrained teacher's hint layer and the student's guided layer. Thanks to the additional hint-based training, the trained deep student network showed better accuracy with fewer parameters compared to the original wide teacher network.

The knowledge transfer performance is very sensitive to how the distilled knowledge is defined. The distilled knowledge can be extracted from various features in the pretrained DNN. Considering that a real teacher teaches a student the flow for how to solve a problem, we defined high-level distilled knowledge as the flow for solving a problem. Because a DNN uses many layers sequentially to map from the input space to the output space, the flow of solving a problem can be defined as the relationship between features from two layers.

Gatys et al. [6] used the Gramian matrix to represent the texture information of the input image. Because the Gramian matrix is generated by computing the inner product of feature vectors, it can contain the directionality between features, which can be thought of as texture information. Similar to Gatys et al. [6], we represented the
flow of solving a problem by using a Gramian matrix consisting of the inner products between features from two layers. The key difference between the Gramian matrix in [6] and ours is that we compute the Gramian matrix across layers, whereas the Gramian matrix in [6] is computed from the inner products between features within a layer. Figure 1 shows the concept diagram of our proposed method of transferring distilled knowledge. The extracted feature maps from two layers are used to generate the flow of solution procedure (FSP) matrix. The student DNN is trained to make its FSP matrix similar to that of the teacher DNN.

Distilling the knowledge is a useful technique for various tasks. In this study, we verified the usefulness of the proposed distilled knowledge by using it to perform three tasks. The first was fast optimization. A DNN that understands the flow of solving a problem can provide a good initial weight for solving a main task and can learn faster than a normal DNN. Fast optimization is a very useful technique. Researchers have focused on achieving fast optimization not only by using advanced learning rate scheduling techniques [13, 27, 4] but also by finding good initial weights [5, 9, 18, 20]. Our approach is based on the initial weight method, so we only compared it with other initial weight methods. We compared the number of training iterations and the performance of our scheme with various other techniques.

The second task was to improve the performance of a small network, which is a shallow network with fewer parameters. Because a small network learns distilled knowledge from the teacher network, it is more powerful than the student network trained alone without help from the teacher network. We compared the performance of the original network and a network using various knowledge transfer techniques.

The third task was transfer learning. Although a new task may provide only a small dataset, transfer learning can take advantage of a deep and heavy DNN pretrained with a huge dataset [2]. Because our proposed method has the advantage of being able to transfer the distilled knowledge to a small DNN, the small network can perform similarly to a large DNN that uses a normal transfer learning method.

Our paper makes the following contributions: 1. We propose a novel technique to distill knowledge. 2. This approach is useful for fast optimization. 3. Using the proposed distilled knowledge to find the initial weight can improve the performance of a small network. 4. Even if the student DNN is trained at a different task from the teacher DNN, the proposed distilled knowledge improves the performance of the student DNN.

2. Related Work

Knowledge Transfer. Deep networks with many parameters usually perform well in computer vision tasks. The depth of most architectures is being increased to improve performance. When deep learning first began, AlexNet [16] had only five convolution layers. However, the recent well-known network GoogLeNet [23] has 22 convolution layers, and the residual network [8] has 152 layers.

A deep network with many parameters requires heavy computation for both training and testing. These deep networks are difficult to use in real-life applications because a normal computer cannot handle this work, let alone mobile devices. Therefore, many researchers have been trying to make networks smaller while maintaining the performance level. A typical way is to distill knowledge from trained deep networks and transfer it to a small network that can be used without large storage and heavy computation. Recently, Hinton et al. [11] introduced a model compression method based on the concept of dark knowledge. It uses a softened version of the final output of a teacher network to teach information to a small student network. With this teaching procedure, a small network can learn how a large network studied given tasks in a compressed form. Romero et al. [20] used not only the final output but also intermediate hidden layer values of the teacher network to train the student network and showed that using these intermediate layers can improve the performance of deeper and thinner student networks. Net2Net [3] also uses a teacher–student network system with a function-preserving transform to initialize the parameters of the student network according to the parameters of the teacher network.

Fast Optimization. A deep convolutional neural network (CNN) takes a relatively long time to reach its global optimum or a good local optimum. It is easy to train on small datasets such as MNIST [17] or CIFAR-10 [15]. In the case of big datasets like ILSVRC [21], however, a big network can take a few weeks to train. Therefore, fast optimization has recently become another important subject of research. There are several different approaches to fast optimization, such as finding good initial weights or reaching the optimal point with a different technique than the standard stochastic gradient descent (SGD) method.

In the early days, initialization by Gaussian noise with a zero mean and unit variance was very popular. Other initialization techniques such as Xavier initialization [7] are also widely used. However, these simple initializations are poor at training very deep networks. Therefore, some new techniques [18, 22, 14] based on mathematical approaches have appeared. With good initialization, if training starts at an appropriate location, then the parameters can rapidly reach the global optimum.

Optimization algorithms have also evolved with the development of deep learning. Conventionally, the SGD algorithm is widely used as a baseline. However, using SGD can make it difficult to escape from many saddle points. Because of this problem, several other algorithms have been suggested [13, 27, 4]. These algorithms help with
getting out of saddle points and reaching the global optimum quickly.

Transfer Learning. Transfer learning is a simple technique of modifying the parameters of an already trained network to adapt to a new task. Typically, input-side layers that play the role of feature extraction are copied from a pretrained network and kept frozen or fine-tuned, whereas a top classifier for the new task is randomly initialized and then trained at a slow learning rate. Fine-tuning often outperforms training from scratch because the pretrained model already has a great deal of information. For example, many researchers [19, 28, 1, 2] have recently used a model pretrained with the ILSVRC dataset to extract visual features from an image and fine-tuned the model to improve the final accuracy for the VQA [1] and CUB200 [25] tasks. Many other tasks such as detection and segmentation also use this ImageNet-pretrained model for the initial values of the model because the ILSVRC dataset can be helpful for generalization. Our approach also uses this fine-tuning technique, together with our own good initialization method.

3. Method

The main concept of our proposed method is how to define the important information of the teacher DNN and transfer the distilled knowledge to the other DNN. This section is divided into four parts to describe our main concept. Sec. 3.1 presents the useful distilled knowledge that we used in this study. Sec. 3.2 introduces the mathematical expression of our proposed distilled knowledge. Based on the carefully designed distilled knowledge, we define the loss term in Sec. 3.3. Finally, Sec. 3.4 presents the whole learning procedure of the student DNN.

3.1. Proposed Distilled Knowledge

The DNN generates features layer by layer. Higher-layer features are closer to the useful features for performing a main task. If we view the input of the DNN as the question and the output as the answer, we can think of the features generated in the middle of the DNN as intermediate results in the solution process. Following this idea, the knowledge transfer technique proposed by Romero et al. [20] lets the student DNN simply mimic the intermediate results of the teacher DNN. However, in the case of a DNN, there are many ways to solve the problem of generating the output from the input. In this sense, mimicking the generated features of the teacher DNN can be a hard constraint for the student DNN.

In the case of people, the teacher explains the solution process for a problem, and the student learns the flow of the solution procedure. The student DNN does not necessarily have to learn the intermediate output when a specific question is input, but it can learn the solution method when a specific type of question is encountered. In this manner, we believe that demonstrating the solution process for the problem provides better generalization than teaching the intermediate result.

3.2. Mathematical Expression of the Distilled Knowledge

The flow of the solution procedure can be defined by the relationship between two intermediate results. In the case of a DNN, the relationship can be mathematically considered as the direction between the features of two layers. We designed the FSP matrix to represent the flow of the solution process. The FSP matrix G ∈ R^{m×n} is generated by the features from two layers. Let one of the selected layers generate the feature map F^1 ∈ R^{h×w×m}, where h, w, and m represent the height, width, and number of channels, respectively. The other selected layer generates the feature map F^2 ∈ R^{h×w×n}. Then, the FSP matrix G ∈ R^{m×n} is calculated by

G_{i,j}(x; W) = \sum_{s=1}^{h} \sum_{t=1}^{w} \frac{F^1_{s,t,i}(x; W) \times F^2_{s,t,j}(x; W)}{h \times w},   (1)

where x and W represent the input image and the weights of the DNN, respectively. We prepared residual networks with 8, 26, and 32 layers that were trained on the CIFAR-10 dataset. There are three points in the residual network for the CIFAR-10 dataset where the spatial size changes. We selected several points to generate the FSP matrix, as shown in Figure 2.

3.3. Loss for the FSP Matrix

In order to help the student network, we transfer the distilled knowledge from the teacher network. As described before, the distilled knowledge is represented in the form of an FSP matrix that contains information about the flow of the solution procedure. We can assume that there are n FSP matrices G^T_i, i = 1, ..., n, generated by the teacher network, and n FSP matrices G^S_i, i = 1, ..., n, generated by the student network. In this study, we only considered pairs of FSP matrices (G^T_i, G^S_i), i = 1, ..., n, between the teacher and student networks with the same spatial size. We took the squared L2 norm as the cost function for each pair. The cost function of the task of transferring the distilled knowledge is defined as

L_{FSP}(W_t, W_s) = \frac{1}{N} \sum_{x} \sum_{i=1}^{n} \lambda_i \times \| G^T_i(x; W_t) - G^S_i(x; W_s) \|_2^2,   (2)

where λ_i and N represent the weight for each loss term and the number of data points, respectively. We assumed that all loss terms are equally significant. Therefore, we used the same λ_i for all experiments.

Figure 2. Complete architecture of our proposed method. The numbers of layers of the teacher and student networks can be changed.
The FSP matrices are extracted at the three sections that maintain the same spatial size. There are two stages of our proposed method. In
stage 1, the student network is trained to minimize the distance between the FSP matrices of the student and teacher networks. Then, the
pretrained weights of the student DNN are used for the initial weight in stage 2. Stage 2 represents the normal training procedure.
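As a rough illustration of how the two feature maps per section in Figure 2 might be collected in practice, the sketch below uses PyTorch forward hooks. The function name collect_section_features, the section_pairs argument, and the choice of PyTorch are our assumptions, not part of the paper.

```python
import torch

def collect_section_features(net, x, section_pairs):
    """Run `net` on `x` and capture the outputs of the first and last layer of
    each section, as depicted in Figure 2.

    section_pairs: list of (front_module, back_module) tuples chosen by the user.
    Returns (logits, [(f_front, f_back), ...]) with one feature pair per section.
    """
    captured = {}
    handles = []
    for front, back in section_pairs:
        for module in (front, back):
            # Store each module's output under the module object itself.
            handles.append(module.register_forward_hook(
                lambda mod, inp, out: captured.__setitem__(mod, out)))
    logits = net(x)
    for h in handles:
        h.remove()
    return logits, [(captured[front], captured[back]) for front, back in section_pairs]
```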

3.4. Learning Procedure

Our transfer method uses the distilled knowledge generated by the teacher network. To clearly explain what the teacher network represents in our paper, we define two conditions. First, the teacher network should be pretrained on some dataset. This dataset can be the same as or different from the one that the student network will learn. The teacher network uses a different dataset from that of the student network in the case of a transfer learning task. Second, the teacher network can be deeper or shallower than the student network. However, we consider a teacher network that is the same depth as or deeper than the student network.

The learning procedure contains two stages of training. First, we minimize the loss function L_FSP to make the FSP matrix of the student network similar to that of the teacher network. The student network that went through the first stage is then trained by the main task loss at the second stage. Because we used the classification task to verify the effectiveness of our proposed method, we can use the softmax cross-entropy loss L_ori as the main task loss. The learning procedure is summarized in Algorithm 1.

Algorithm 1 Transfer the distilled knowledge
Stage 1: Learning the FSP matrix
  Weights of the student and teacher networks: W_s, W_t
  1: W_s = arg min_{W_s} L_FSP(W_t, W_s)
Stage 2: Training for the original task
  1: W_s = arg min_{W_s} L_ori(W_s)

4. Experiments

We conducted three experiments to verify the effectiveness of our proposed knowledge transfer technique. For all experiment settings, we used a deep residual network [8] as the base architecture. Interestingly, the deep residual network has shortcut connections that make an ensemble structure [24]. Furthermore, the shortcut connections allow training of much deeper networks. Because of these two reasons, many researchers use the residual network for various tasks. Figure 2 shows the base architecture of the deep residual network. There are several sections that maintain the same spatial size of the feature maps by using zero padding. For example, the deep residual network in this figure consists of three sections. Although there are no constraints on how to select the two layers used to make the FSP matrix, we selected the first and last layers in a section. Furthermore, because the FSP matrix can only be generated from two layer features with the same spatial size, we used a max pooling layer to match the spatial sizes if the sizes of the two layer features are different.

We used three representative tasks to verify the usefulness of the proposed knowledge transfer technique. By learning the flow of the solution procedure, the student network can study a task faster than usual, as discussed in Sec. 4.1. Furthermore, the FSP matrix generated by the teacher network allows the student network to outperform a student network that is trained alone, as described in Sec. 4.2. We considered the case where the teacher and student networks are trained on the same dataset for the same task. Sec. 4.3 expands on the ideas for application by dealing with the transfer learning task.
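The two-stage procedure of Algorithm 1 and the section-wise FSP extraction described above could be wired together roughly as follows. This is a hedged sketch, not the authors' implementation: features_fn is assumed to behave like the hook-based helper shown earlier, loader is an ordinary (image, label) loader, iter_cycle is our own helper, and the learning rates are placeholders for the schedules reported in Sec. 4.

```python
import torch
import torch.nn.functional as F

def fsp(f_front, f_back):
    # Batched FSP matrix of Eq. (1) for PyTorch's (B, C, h, w) feature layout:
    # (B, m, h, w) x (B, n, h, w) -> (B, m, n).
    b, m, h, w = f_front.shape
    n = f_back.shape[1]
    return torch.bmm(f_front.reshape(b, m, h * w),
                     f_back.reshape(b, n, h * w).transpose(1, 2)) / (h * w)

def iter_cycle(loader):
    # Endlessly cycle over the data loader.
    while True:
        for batch in loader:
            yield batch

def train_two_stages(student, teacher, features_fn, loader, iters1, iters2):
    teacher.eval()
    data = iter_cycle(loader)
    # Stage 1: move the student's FSP matrices toward the teacher's (L_FSP).
    opt = torch.optim.SGD(student.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-4)
    for _ in range(iters1):
        x, _ = next(data)
        with torch.no_grad():
            _, t_pairs = features_fn(teacher, x)
        _, s_pairs = features_fn(student, x)
        loss = sum(((fsp(*t) - fsp(*s)) ** 2).sum(dim=(1, 2)).mean()
                   for t, s in zip(t_pairs, s_pairs))
        opt.zero_grad(); loss.backward(); opt.step()
    # Stage 2: ordinary training on the main task (L_ori), starting from the stage-1 weights.
    opt = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    for _ in range(iters2):
        x, y = next(data)
        logits, _ = features_fn(student, x)
        loss = F.cross_entropy(logits, y)
        opt.zero_grad(); loss.backward(); opt.step()
```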
For all experiments, we compared the proposed method with the existing knowledge transfer method FitNet [20]. For the first stage of FitNet, hint-based training was implemented by minimizing the L2 loss between the outputs of the two layers for 35 000 iterations, where the hint and guided layers were set to the middle layer of each DNN. The learning rate started at 1e-4 and was changed to 1e-5 after 25 000 iterations. To ensure a fair comparison of the recognition accuracy, the FitNet in the second stage also had the same learning rate policy and training iterations as the proposed method. At this stage, the softening factor τ was set to 3, and the value of λ in the KD loss function was linearly decreased from 4 to 1.

4.1. Fast optimization

Because recent DNNs have become deeper to increase performance, the training procedure takes many days [26, 8]. Furthermore, although a DNN takes a long time to train, many researchers use an ensemble of DNNs to outperform the performance of a single DNN [23]. In this case, if we use an ensemble of n DNNs, training takes n times longer. Because of this, interest in fast optimization techniques has been rising in recent years.

We first prepared the teacher DNN with the normal training procedure. The teacher DNN was then used to train the student DNNs with the learning procedure described in Sec. 3.4. By using one teacher network, we generated multiple student networks. The goal of the proposed fast optimization technique is to reach a performance with the ensemble of student networks similar to that of the teacher network while using less training time than the normal training procedure.

4.1.1 CIFAR-10

The CIFAR-10 dataset [15] contains 50 000 training images with 5000 images per class and 10 000 test images with 1000 images per class. The CIFAR-10 dataset comprises 32 × 32 pixel RGB images with 10 classes. We padded 4 pixels on each side to make the image size 40 × 40 pixels. Randomly cropped 32 × 32 pixel images were used for training, and the original 32 × 32 pixel images were used for testing.

We used a residual network with 26 layers for the teacher DNN, which provided 92% accuracy on the CIFAR-10 dataset as reported in [8]. Furthermore, we used the same structure as the teacher DNN for the student DNN. The teacher network was trained with the parameters described below. For all experiments, we used a batch size of 256. The learning rate started at 0.1, was changed to 0.01 and 0.001 at 32 000 and 48 000 iterations, respectively, and training terminated at 64 000 iterations. We used a weight decay of 0.0001 and momentum of 0.9 with the MSRA initialization [9] technique and BN [12].

Figure 3. Analysis of the optimization speed and test accuracy. We compared the teacher DNN and the student DNN that learned the distilled knowledge (i.e., the FSP matrix).

A student network with the same structure as the teacher network was used in stage 1 to set the initial weights, as described in Algorithm 1. We used learning rates of 0.001, 0.0001, and 0.00001 until 11 000, 16 000, and 21 000 iterations, respectively. We used a weight decay of 0.0001 and momentum of 0.9. We then trained the student DNN using the normal procedure and the initial weights provided at the end of stage 1. Note that we trained several student networks in stage 2 by using the same initial weights provided by the student network trained in stage 1. As the result of stage 1 is copied to many student networks as initial weights, stage 1 is an efficient way of initializing many student networks. One potential drawback of sharing the same initial weights across all student networks is that the networks can be more correlated than if the student networks were independently initialized.

Figure 3 shows the test accuracy and the change in the training loss over time. The student network showed faster optimization than the teacher network: it was three times faster at reaching the saturation region. Because we used the MSRA initialization technique for the teacher network, which is not a naive initialization method but a high-performance one, we believe that the FSP matrix provides good distilled knowledge for initializing the weights of the student network.

We trained student networks with one-third of the original number of iterations in stage 2 to demonstrate the fast optimization. In stage 2, we used learning rates of 0.1, 0.01, and 0.001 until 11 000, 16 000, and 21 000 iterations, which is less than one-third of the original number of iterations. From the results in Table 1, we observed that using one-third of the iterations was enough for the student networks
with the proposed method. Although the student networks used fewer iterations, the proposed method outperformed FitNet as well as the original teacher network.

Method                Net 1   Net 2   Net 3   Avg     Ensemble   #Iter
Teacher               91.61   91.56   92.09   91.75   93.48      192k
Teacher *             90.47   90.83   90.62   90.64   92.6       63k
Teacher ‡             91.84   92.26   92.01   92.04   92.71      63k
1 loss FitNet [20]*   91.69   91.85   91.64   91.72   92.98      98k
3 loss FitNet [20]*   88.90   89.35   89.02   89.09   89.92      98k
Student *             92.28   92.08   92.07   92.14   93.26      84k
Student *†            92.28   91.89   92.08   92.08   93.67      126k

Table 1. Recognition rates (%) on CIFAR-10. The symbol * indicates that each network was trained with 21 000 iterations, which is less than one-third of the iterations for the original case, which used 64 000 iterations. Student * was trained with 21 000 iterations in stage 1, whose result is copied to net 1, net 2, and net 3, and each student network was trained with 21 000 iterations in stage 2, resulting in 84 k iterations in total. The symbol ‡ represents the teacher network trained with 21 000 iterations, starting from one of the teacher networks trained with 64 000 iterations. The symbol † indicates the student network that learned the randomly shuffled FSP matrix in stage 1. In the case of Student *†, each net was trained with 21 000 iterations in stage 1 and 21 000 iterations in stage 2.

We also experimented with the FitNet method taking three losses applied to three intermediate layers as well as one loss applied to the middle layer only. It turned out that the one-loss FitNet outperformed the three-loss FitNet, as shown in Table 1.

The proposed method can decompose the entire network into several modules, and each module's behavior is captured by its FSP matrix. If the student's FSP matrix for a module is similar to that of the teacher network, it implies that the module in the student network behaves similarly to the corresponding module in the teacher network. Further, each module can be trained independently, in that the module can be trained from the correlations between the inputs and outputs of that module alone, even though the other modules are not fully trained. In contrast, the three-loss FitNet's upper modules, which are trained by matching only the outputs of the module without considering the relation between the input and the output, are less efficiently trained until the modules below them in the student network are sufficiently trained so that the input to the upper module begins to be meaningful. This explains why the one-loss FitNet outperformed the three-loss FitNet. For the three-loss FitNet, the network had four modules; the second and third modules would be difficult to train from the intermediate results. In addition, FSP is less restrictive than FitNet. If the student network and the teacher network have the same intermediate feature maps, they will have the same FSP matrix. However, the converse is not true, which allows diversity in the feature maps given the same FSP matrix.

As both the teacher networks and the student networks have the same architecture, one can also transfer knowledge by directly copying weights. We also compared the proposed method with knowledge transfer by copying weights. To this end, we simply trained three copies of one teacher network for an additional 21 000 iterations, which is equivalent to copying the weights from a single teacher network and starting from there. This did not provide a better result than Student *. As given in Table 1 (Teacher ‡), the individual performances were slightly better than the original teacher performance, but they provided a poor ensemble performance. FSP is less restrictive than copying the weights and allows for better diversity and ensemble performance.

In addition, the ensemble of student networks with fewer iterations provided a performance similar to the ensemble of teacher networks, but FitNet did not. Although the ensemble performance of the student networks was close to the ensemble performance of the teacher networks, there was an obvious loss in the gain achieved by the former (92.14 → 93.26) compared to the gain achieved by the latter (91.75 → 93.48). This is because the student networks were more closely correlated from sharing initial weights.

Interestingly, we developed a very simple but effective way to train less correlated student networks using the same single teacher network. The idea is that we can generate multiple FSP matrices that are essentially equivalent but apparently different. By using apparently different FSP matrices for the student networks instead of sharing the same FSP matrix, we can reduce the correlation among the student networks. The FSP matrix is generated by features from two selected layers. Note that we can permute the feature channels in the teacher network to obtain an equivalent teacher network that behaves essentially the same way. This means that the rows or columns of the FSP matrix can be shuffled without affecting the transfer of the distilled knowledge. The different FSP matrices obtained by row and column shuffling can be used in stage 1 to generate multiple student networks with different initial weights. Then, after stage 2, the resulting student networks are less correlated, and an improved ensemble can be obtained. As indicated in Table 1, despite the fewer iterations, the ensemble of student networks using a randomly shuffled FSP matrix outperformed even the ensemble of teacher networks.

In terms of the training time instead of the number of iterations, the original model took 16 s/100 iterations, while the proposed model took 35 s/100 iterations for stage 1. Therefore, in terms of the total learning time, it took 8.6 h to train three teacher DNNs with the original method and 4.84 h to train three student DNNs with the proposed method. The latter is 1.78 times faster. Moreover, by learning more efficiently (e.g., storing the FSP matrices and using them directly instead of calculating the FSP matrix every time, which took 19 s/100 iterations), Student * and Student *† could be trained 2.18 and 1.39 times faster, respectively.
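The channel-permutation trick described above for decorrelating the student ensemble can be illustrated in a few lines (again only a sketch of the idea, not the authors' code; the 64 × 128 shapes below are illustrative). The key point is that each student draws its permutations once and then reuses them for every input and every stage-1 iteration.

```python
import numpy as np

def make_fsp_shuffler(m, n, seed):
    """Return a function that applies one fixed row/column permutation to an
    (m, n) teacher FSP matrix. Permuting rows/columns corresponds to permuting
    the channels of the two teacher feature maps, so the shuffled target is
    essentially equivalent but different for each student."""
    rng = np.random.default_rng(seed)
    row_perm = rng.permutation(m)
    col_perm = rng.permutation(n)
    return lambda fsp: fsp[row_perm][:, col_perm]

# One shuffler per student network, e.g. for 64 x 128 FSP matrices:
shuffle_for_student = [make_fsp_shuffler(64, 128, seed=k) for k in range(3)]
```

Each student then minimizes L_FSP against its own shuffled teacher targets in stage 1 and proceeds to stage 2 as usual.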
Method         Net 1   Net 2   Net 3   Avg     Ensemble   #Iter
Teacher        64.06   64.19   64.21   64.15   69.3       192k
Teacher *      61.29   61.26   61.41   61.32   67.2       63k
FitNet [20]*   62.85   62.46   62.35   62.55   67.6       98k
Student *      64.66   64.64   64.65   64.65   68.8       95k

Table 2. Recognition rates (%) on CIFAR-100. The symbol * indicates that the network was trained with one-third of the iterations for the original case, which used 64 000 iterations.

Method            Accuracy
Teacher-original  91.91
Student-original  87.91
FitNet [20]       88.57
Proposed Method   88.70

Table 3. Recognition rates (%) on CIFAR-10. We used a residual DNN with 8 layers for the student DNN and 26 layers for the teacher DNN.

4.1.2 CIFAR-100

The CIFAR-100 dataset uses 50 000 training images with 500 images per class and 10 000 test images with 100 images per class. The CIFAR-100 dataset contains 32 × 32 pixel RGB images with 100 classes. Because of the small number of images per class with 100 classes, we used a residual network with 32 layers and four times as many channels as the one described in Sec. 4.1.1.

Unlike the CIFAR-10 case, we did not use augmentation methods, in order to vary the experiment settings. The teacher and student networks used the same parameters as those described in Sec. 4.1.1. The only difference was that we used learning rates of 0.001, 0.0001, and 0.00001 until 16 000, 24 000, and 32 000 iterations, respectively, in stage 1.

Table 2 presents the recognition rates for the different settings. Each setting was run three times. The second column from the right shows the performance of the ensemble of three DNNs. Considering the difference in accuracy between the prepared deep residual network with 32 layers, with an average of 64.15% accuracy, and the same network trained with one-third of the original iterations, with an average of 61.32% accuracy, we can conclude that the number of iterations is important for high performance. However, even though the student network used fewer iterations for training, the student network that used the distilled knowledge from the teacher network achieved a performance similar to the original teacher network.

We compared the other distilled knowledge transfer method, FitNet, with our proposed method. As given in Table 2, the student network with FitNet outperformed the teacher network with fewer iterations. However, when an ensemble of three networks was used, the teacher network with fewer iterations and the student network with FitNet had similar accuracies; there was not that much improvement. In terms of the performance and number of iterations, the proposed method was more efficient than the existing FitNet method, as presented in Table 2.

4.2. Performance improvement for the small DNN

Recently, many researchers have used very deep neural networks with a huge number of parameters for high performance. For example, one residual network uses more than 1000 layers for the classification task [10]. The wide residual network [26] increases the width of the residual network. However, there are many drawbacks. Even if we use the DNN as an inference network, we have to prepare a high-performance system, which is very expensive. Furthermore, many iterations are needed to train a deep and wide neural network, which is also expensive. Thus, methods to improve the performance of a small DNN are very important.

We conducted experiments to verify that our proposed method can be used with DNNs of different sizes. The goal of our proposed method is to improve the performance of a small student network by learning the distilled knowledge of a deep teacher network. Once again, we define a small network as a shallow network with few weights. As shown in Figure 2, the teacher DNN was deeper than the student DNN. The student DNN was constructed by simply reducing the number of residual modules in the teacher DNN. Therefore, the student DNN used fewer parameters than the teacher DNN.

The learning procedure was the same as that described in Sec. 4.1. Because the student DNN and the teacher DNN had the same number of channels, the sizes of the FSP matrices were the same. By minimizing the distance between the FSP matrices of the student network and the teacher network, we found a good initial weight for the student network. Then, the student network was trained to solve the main task.

4.2.1 CIFAR-10

We used a residual network with 26 layers for the teacher DNN and a residual network with 8 layers for the student DNN. For the parameter settings and learning procedure, we used the same parameters as described in Sec. 4.1.1, but not the same training iterations for stage 2. The student DNNs used the same number of training iterations as the teacher DNN.

For a fair comparison, we prepared a student DNN that was trained from scratch. As indicated in Table 3, the methods of transferring distilled knowledge outperformed the student DNN that used the original learning procedure. This means that distilled knowledge from the teacher DNN can be useful information even for a shallow student DNN. We can conclude that the proposed method is more useful than the existing method.

4.2.2 CIFAR-100

We also verified the network minimization ability of the proposed method on the CIFAR-100 dataset. In an experimental
setting similar to that of Sec. 4.1.2, we used residual networks with 32 and 14 layers for the teacher DNN and the student DNN, respectively. For all experiments in this section, we used the full 64 000 iterations.

Method            Accuracy
Teacher-original  64.06
Student-original  58.65
FitNet [20]       61.28
Proposed Method   63.33

Table 4. Recognition rates (%) on CIFAR-100. We used a residual DNN with 14 layers for the student DNN and 32 layers for the teacher DNN.

Table 4 presents the recognition rates for the different settings. Because we did not use any augmentation methods, the teacher DNN showed 64% accuracy. Furthermore, the student DNN that used the normal learning method showed a 58.65% recognition rate. Surprisingly, the proposed method made the student network generate a performance similar to that of the teacher DNN. The existing knowledge distillation method (i.e., FitNet) also showed improved performance. However, when we compared the performance of the student networks trained with the two distilled knowledge methods against the student network trained with the original method, the proposed method with distilled knowledge clearly performed better than the existing one.

4.3. Transfer Learning

In this section, we explain the applications to which the proposed method can be applied. The teacher DNN and student DNN can learn not only the same task but also different tasks. To deal with this problem, we focused on the transfer learning task. Transfer learning is widely used when the dataset is too small to generate useful features. In this case, most existing methods use a pretrained DNN that is trained on a huge dataset, such as the ImageNet dataset [21]. However, the most important issue is that most existing methods directly use the pretrained DNN, which contains many layers and a huge number of weights. This means that a high-quality machine needs to be prepared to improve the performance with a small dataset. Therefore, because the distilled knowledge can be transferred to a small DNN, the knowledge transfer technique can be an effective solution for this problem.

We prepared a 34-layer residual DNN [8] that was pretrained with the ImageNet dataset. For the different task containing a small number of images, we used the Caltech-UCSD Birds (CUB) 200-2011 dataset [25]. The CUB 200-2011 dataset contains 11,788 images of 200 bird subcategories. Because of the small number of images per class, it is difficult to generate a high level of performance by using only this dataset. As given in Table 5, although we used the deep structure of the 34-layer residual DNN, the accuracy was very poor when we trained it from scratch.

Method                            Accuracy
Teacher - fine tuning             77.72
Teacher - training from scratch   47.53
Student - training from scratch   47.73
FitNet [20]                       70.19
Proposed Method                   74.26

Table 5. Recognition rates (%) on CUB200. We used a residual DNN with 20 layers for the student DNN and 34 layers for the teacher DNN.

For the shallow DNN, we prepared a 20-layer residual DNN. The 34-layer residual DNN consisted of four parts that generated features of the same spatial size within each part. The four parts contained three, four, six, and three residual modules, respectively. The prepared student DNN (the 20-layer residual DNN) contained two, two, three, and two residual modules, respectively. For all settings, we used learning rates of 0.1, 0.01, and 0.001 up to 10 000, 20 000, and 30 000 iterations. The fine-tuning technique usually uses small learning rates. However, because we found that the base learning rate of 0.1 was better than a learning rate of 0.001, we decided to report the results with 0.1.

Because the proposed method has to transfer the FSP matrices of the teacher DNN to the student DNN, we used learning rates of 0.1, 0.01, and 0.001 up to 11 000, 16 000, and 21 000 iterations for stage 1. In this stage, we extracted the FSP matrices at each part of the DNN. As reported in Table 5, the proposed method generated a high level of performance, close to that of the teacher DNN with fine-tuning. Considering that the student DNN was 1.7 times shallower than the teacher DNN, we believe that the proposed method is an effective technique for transferring knowledge even to a different task.

5. Conclusion

We proposed a novel approach to generate distilled knowledge from a DNN. By defining the distilled knowledge as the flow of the solution procedure calculated with the proposed FSP matrix, the proposed method outperforms state-of-the-art knowledge transfer methods. We verified the effectiveness of our proposed method in three important aspects. The proposed method optimizes the DNN faster and generates a higher level of performance. Furthermore, the proposed method can be used for the transfer learning task.

Acknowledgement

This work was partly supported by the ICT R&D program of MSIP/IITP, 2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion, and by the National Research Council of Science & Technology (NST) grant by the Korea government (MSIP) (No. CRC-15-05-ETRI).
References

[1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
[2] S. Branson, G. Van Horn, S. Belongie, and P. Perona. Bird species categorization using pose normalized deep convolutional nets. arXiv preprint arXiv:1406.2952, 2014.
[3] T. Chen, I. Goodfellow, and J. Shlens. Net2Net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015.
[4] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[5] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
[6] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
[7] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, volume 9, pages 249–256, 2010.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027, 2016.
[11] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[13] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.
[15] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[18] D. Mishkin and J. Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.
[19] H. Noh, P. H. Seo, and B. Han. Image question answering using convolutional neural network with dynamic parameter prediction. arXiv preprint arXiv:1511.05756, 2015.
[20] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. In Proceedings of ICLR, 2015.
[21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[22] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
[23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[24] A. Veit, M. Wilber, and S. Belongie. Residual networks are exponential ensembles of relatively shallow networks. arXiv preprint arXiv:1605.06431, 2016.
[25] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, 2011.
[26] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[27] M. D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[28] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167, 2015.
