

Knowledge Distillation for Efficient Instance Semantic Segmentation with Transformers

Maohui Li1, Michael Halstead1, Chris McCool1,2
1University of Bonn, 2Lamarr Institute for Machine Learning and Artificial Intelligence
{mlii, michael.halstead, cmccool}@uni-bonn.de

Abstract

Instance-based semantic segmentation provides detailed per-pixel scene understanding information that is crucial for both computer vision and robotics applications. However, state-of-the-art approaches such as Mask2Former are computationally expensive, and reducing this computational burden while maintaining high accuracy remains challenging. Knowledge distillation has been regarded as a potential way to compress neural networks, but to date limited work has explored how to apply this to distill information from the output queries of a model such as Mask2Former.
In this paper, we match the output queries of the student and teacher models to enable a query-based knowledge distillation scheme. We independently match the teacher and the student to the groundtruth and use this to define the teacher-to-student relationship for knowledge distillation. Using this approach we show that it is possible to perform knowledge distillation where the student models can have a lower number of queries and the backbone can be changed from a Transformer architecture to a convolutional neural network architecture. Experiments on two challenging agricultural datasets, sweet pepper (BUP20) and sugar beet (SB20), and Cityscapes demonstrate the efficacy of our approach. Across the three datasets the student models obtain an average absolute performance improvement in AP of 1.8 and 1.9 points for the ResNet-50 and Swin-Tiny backbones respectively. To the best of our knowledge, this is the first work to propose knowledge distillation schemes for instance semantic segmentation with transformer-based models.

Index Terms: Computer Vision for Agriculture Automation, Knowledge Distillation, Efficient Instance Segmentation, Transformer

Figure 1. A visualization of our bipartite query-based matching between the teacher and the student. We first associate the predicted queries of each model, either teacher or student, to the groundtruth to obtain the teacher-gt and student-gt matchings respectively. We then use these to find the association between the predicted queries of the teacher and the student.

1. Introduction

The broad application of computer vision algorithms, such as detection, semantic segmentation, and instance-based semantic segmentation, brings the potential to greatly improve the efficiency of agricultural automation. Automatic fruit picking, weed removal, and pesticide drip irrigation have been made possible by the introduction of dense predictions [1]. Instance segmentation provides a wealth of information, such as per-pixel classification and the location of object instances, which contributes to agricultural efficiency and makes it possible to convert agricultural production from human labor to automation. Despite these advances, it remains highly challenging to deploy vision algorithms in agriculture, since they are vulnerable to clutter from leaves and other crops as well as highly variable lighting conditions. Furthermore, the limited computational and energy resources of edge devices constrain deployment on robots. Therefore, efficient and accurate models capable of instance-based segmentation are integral to enabling real-world deployment.
Transformer-based models were introduced to computer vision to further improve accuracy after the successful application of vanilla transformers [2] in natural language processing (NLP). DETR [3] was one of the first transformer-based object detection models, and it introduced the concept of an object query which only contains features from one instance.

More recently, the Mask2Former [4] architecture was proposed, which is capable of state-of-the-art semantic, instance-based, and panoptic segmentation by extracting masked attention within predicted mask regions. Despite the improved accuracy provided by transformer-based models, their complicated structures inevitably increase the computational complexity and hamper their application in real-time scenarios.
One of the most effective ways to improve the efficiency of deep neural networks is through knowledge distillation, by imparting knowledge from complex networks (teachers) to efficient networks (students). There are many kinds of knowledge that can be used for distillation, from features of the final or intermediate layers through to the mutual relationships between features. However, the majority of these distillation schemes are designed for use with convolutional neural network (CNN)-based structures.
This means that distillation of the joint image- and pixel-level classification of transformer-based networks is yet to be explored. Thus, how to distill instance-level knowledge from a transformer-based structure is the key contribution of this paper.
In this paper, we compress complex transformer-based models into more efficient networks while retaining a high degree of accuracy. First, we produce an optimal bipartite matching between the queries of the teacher and the student, as shown in Figure 1, by using the Hungarian algorithm [3]. Second, we train our efficient networks by distilling the class probabilities and mask maps from the teacher network. For our experiments we perform distillation using the Mask2Former [4] architecture due to its high accuracy for instance-based semantic segmentation and the ability to easily switch the backbone network. Finally, we show the validity of our approach by performing knowledge distillation in multiple domains: arable farmland, horticulture, and traffic scenarios.
Our results, on challenging agricultural and traffic datasets, demonstrate that our knowledge distillation scheme can be employed to learn efficient and accurate lightweight models. Our models are 2.3 or 2.0 times faster than the original complex teacher while only degrading AP (average precision) performance by as little as 2.8 or 4.0 points, and in one case the student outperforms the teacher by 1.0 point. The main contribution of our paper is that we establish pair-wise query matching between transformer-based models with different complexities (backbones and number of queries) and distill the query-based knowledge from the teacher to the student.

2. Related Work

2.1. Instance Segmentation

Instance-based semantic segmentation can be regarded as the process of labeling pixels with categories and object ids. This pixel-wise classification is fundamental for many advanced computer vision tasks, such as medical image analysis [5], autonomous driving [6], and scene understanding [7]. The methods can be roughly divided into two main categories: single-stage and two-stage methods.
Two-stage methods dominate instance segmentation and can be further divided into top-down and bottom-up methods. Top-down methods [8-10] predict bounding boxes first and then generate the instance masks within these boxes, which means that the final performance is highly dependent on the detection results. In contrast, bottom-up methods [11, 12] classify pixels into the corresponding categories and then apply post-processing to form the instances, which means that the final performance is highly dependent on the post-processing approach. Single-stage methods jointly perform detection and segmentation, and are further divided into anchor-based and anchor-free methods. For anchor-based methods [13, 14], most detectors rely heavily on pre-defined anchors which vary in scale depending on the target dataset. These techniques also rely on handcrafted approaches like non-maximal suppression (NMS) to remove redundancies, which can increase the computational burden. Anchor-free methods [15, 16] distinguish instances on the basis of the predicted locations and shapes of the objects. A limitation of anchor-free methods is that their performance generally decreases when many instances overlap, because each grid can only predict one location and mask.

2.2. Transformer-Based Dense Prediction

A variety of transformers have been designed for vision tasks [17, 18]. A standard backbone for various dense vision prediction tasks is the Swin-Transformer [19], which consists of a hierarchical architecture with multi-scale feature maps. Inspired by transformer structures, recent work has exploited self-attention to capture the long-range relationships needed to perform instance-based semantic segmentation. A query-based instance segmentation method based on the structure of Sparse R-CNN [20] is proposed by Fang et al. [21]. In this approach, a mask pooling operator is designed to extract the current-stage instance mask features and a dynamic convolution module is employed to link mask features and query embeddings. In ISTR [22], three task-specific heads and a fixed mask decoder work together to accomplish the classification, localization, and segmentation predictions by taking in image-level features and learnable position embeddings.

Dong et al. [23] propose a structure that completes classification, bounding box regression, and mask segmentation in one head, linking segmentation and detection as a unified query representation. However, complex transformer-based models do not currently meet the real-time requirements of robotic platforms.

2.3. Knowledge Distillation

Knowledge distillation has proven effective at compressing models while maintaining performance. It enables a lightweight network (student) to boost its performance by learning soft labels from a more complicated (teacher) network [24]. The form of knowledge is traditionally divided into three categories when performing distillation with CNN-based models: response-based knowledge, feature-based knowledge, and relation-based knowledge [25]. Response-based knowledge relies on the response of the last output layer of the networks [26, 27]. Feature-based knowledge uses the features from selected intermediate layers, and an adaptation layer is often required to address the dimension mismatch between the teacher and the student during distillation [28, 29]. Relation-based knowledge focuses more on the relationships between features from different network layers or data samples, with a typical example being [30].
These distillation techniques have been shown to perform well for CNN-based networks; however, for transformer-based networks they are not directly applicable due to the inherently different network structures. Due to these differences, self-attention based knowledge distillation approaches are required. Lin et al. [31] propose the target-aware transformer. Their approach assumes that the semantic information usually varies at the same spatial location since the teacher's and the student's receptive fields are different. The model generates the similarity between each pixel of the teacher's features and all spatial locations of the student's features during the distillation process. The performance of their technique surpasses state-of-the-art methods by a significant margin on common benchmarks. In [32], the authors find that the dominant factor affecting distillation performance is the inductive bias of the teacher rather than its accuracy. The student is more likely to learn diverse knowledge when the inductive biases of both a CNN and an involution-based neural network (INN) teacher are transferred during distillation. Touvron et al. [33] propose a novel attention distillation mechanism and achieve state-of-the-art performance on ImageNet. The authors exploit a distillation token, similar to the classification token within ViT [34], to enable the student to learn the attention maps from the teacher. To gain cross-architecture knowledge, a novel distillation scheme is proposed by Liu et al. [35] to combat the gap between heterogeneous architectures. Two projectors, one for partial cross-attention and one for group-wise linear projection, are designed to align the student's intermediate features into the teacher's attention space during this process. As the complex structure of the teacher consumes a large amount of resources while supervising the student, a framework [36] that stores the teacher's predictions in advance was proposed.
The distillation methods with transformer-based models mentioned above focus mainly on image-level or pixel-level classification, while distillation schemes for instance-based semantic segmentation and other dense prediction paradigms remain unexplored. In this paper we investigate this dense prediction setting by using a bipartite matching scheme to distill information from a teacher to a student.

3. Proposed Approach

In this paper, we propose a knowledge distillation scheme for instance-based semantic segmentation with transformer-based models. We build an ordered permutation for the queries of the student and the teacher, and then distill the query-based knowledge based on the established matching. Finally, we test our scheme on two agricultural datasets and a common city scene dataset. The approach consists of three aspects.
1. We adopt Mask2Former as the framework for both teacher and student networks. For the teacher network we use the Swin-Large backbone [19] with a high number of queries. For the student networks we explore the use of both ResNet [37] and Swin backbones with fewer queries to explore the trade-off between efficiency and accuracy.
2. We build a new matching technique to facilitate the query-based distillation process. We match the unordered queries from the sets of predictions of the teacher and the student by calculating their instance similarity using class probabilities and mask maps. We achieve this by exploiting the available groundtruth labels; the matching scheme is explicitly described in Sec. 3.2.
3. Based on the result of the matching scheme, we perform instance-based knowledge distillation using both the teacher and the groundtruth labels. For this, we define a teacher-to-student loss function that incorporates both class probabilities and segmentation masks. When using both the teacher and the groundtruth labels we combine them using a weighted loss function. This is described in Sec. 3.3.

3.1. Teacher and Student Network Architectures

We adopt Mask2Former as the framework, which is an encoder-decoder structure with self-attention layers. Mask2Former is composed of a backbone, a pixel decoder, and a transformer decoder, which results in N queries that are used to predict the class probabilities (including object or not) and the associated per-pixel mask. The backbone extracts information from an image and outputs features at different resolutions. The pixel decoder generates high-resolution per-pixel embeddings by sampling the features from the backbone. The transformer decoder processes the long-range relationships of the image information and generates the object queries, with which the final heads predict the probabilities and the masks. More details on this approach can be found in [4].
We take the competitively performing and commonly used ResNet and Swin-Transformer as the backbones of our models. The model inference complexity can be varied by selecting a backbone, from ResNet-18 to ResNet-101 or from Swin-T (tiny version) to Swin-L (large version). We set the teacher backbone to be the top performer (Swin-L), and the student uses either ResNet-50 or Swin-T as its backbone as they are computationally more efficient. The teacher backbone was selected to ensure that it can capture deeper relationships within the scene, and it uses 200 queries. As the student backbone is simpler we reduce its number of queries to 100.
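As a rough illustration of this configuration, the sketch below shows how the teacher and student variants could be instantiated. It assumes the public Mask2Former code base on top of Detectron2; the helper add_maskformer2_config, the config key MODEL.MASK_FORMER.NUM_OBJECT_QUERIES, and the config file names are assumptions that may differ from the exact setup used here.

```python
# Minimal sketch (not the authors' code) of configuring the teacher and student
# Mask2Former variants. Assumes the public Mask2Former project on Detectron2;
# config key names and file names below are assumptions, not verified values.
from detectron2.config import get_cfg
from detectron2.modeling import build_model
from mask2former import add_maskformer2_config  # provided by the Mask2Former repo (assumed)

def build_mask2former(config_file, num_queries, weights=None):
    cfg = get_cfg()
    add_maskformer2_config(cfg)            # may also need the project's other config extensions
    cfg.merge_from_file(config_file)
    cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES = num_queries
    if weights is not None:
        cfg.MODEL.WEIGHTS = weights        # e.g. COCO pre-trained weights
    return build_model(cfg)

# Teacher: Swin-Large backbone with 200 queries, used only to provide soft targets.
teacher = build_mask2former("swin_large_config.yaml", num_queries=200)
teacher.eval()

# Students: ResNet-50 (S_r50) or Swin-Tiny (S_st) backbone, each with 100 queries.
student_r50 = build_mask2former("resnet50_config.yaml", num_queries=100)
student_st = build_mask2former("swin_tiny_config.yaml", num_queries=100)
```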

Figure 2. An illustration of our instance knowledge distillation process. Given the association of the predicted queries from the teacher to the student model, we use an L2 loss to perform joint distillation on the class probabilities and mask predictions. In this process, only queries with effective instances are distilled; the other, redundant queries are discarded.

3.2. Student-Teacher Query Matching

In order to perform knowledge distillation we need to find matching queries between the teacher and student models. As shown in Figure 1, we establish the teacher-student matching by exploiting the existing groundtruth labels (gt). We do this by first resolving the teacher-gt and student-gt associations. From these, we can then derive the teacher-student association, ϕ̂.
To find the optimal bipartite matchings, teacher-gt and student-gt, we exploit the Hungarian algorithm commonly used in previous work [3]. This establishes the query permutation by optimizing the total cost obtained by combining both the class probabilities and the instance mask predictions. For this we use the following matching loss (cost),

\mathcal{L}_{match} = \delta_{cls}\mathcal{L}_{cls} + \delta_{msk}\mathcal{L}_{msk},   (1)

where δ_cls = 2 and δ_msk = 5, and L_cls and L_msk are the corresponding costs for the class probabilities and mask predictions.
For the class matching cost, we only use probabilities from queries that are matched to an object (i.e. C_i ≠ ∅),

\mathcal{L}_{cls} = -\mathbbm{1}_{C_i \neq \emptyset}\,\hat{p}_{\sigma(i)}(C_i),   (2)

where p̂_σ(i)(C_i) is the predicted probability of the class and C_i is the groundtruth class label.
For the mask matching cost, we calculate the pixel-wise similarity and the overlap between instances,

\mathcal{L}_{msk} = -\mathbbm{1}_{C_i \neq \emptyset}\left[\mathcal{L}_{dic}(m_i, \hat{m}_{\sigma(i)}) + \mathcal{L}_{xe}(m_i, \hat{m}_{\sigma(i)})\right],   (3)

where m_i and m̂_σ(i) are the groundtruth and predicted masks in one image, and L_dic and L_xe are the Dice loss and cross-entropy loss respectively.
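To make the matching concrete, the following is a minimal sketch (not our released code) of the two-step association: each model's queries are first matched to the groundtruth with the Hungarian algorithm using a cost built from the class probabilities (Eq. 2) and mask overlaps (Eq. 3), and the two assignments are then composed so that each teacher query maps to the student query matched to the same groundtruth instance. The helper names and tensor layouts are illustrative.

```python
# Minimal sketch of the query matching in Sec. 3.2 (illustrative only).
# Queries of each model are matched to the groundtruth via the Hungarian
# algorithm; composing the two assignments yields the teacher-student map phi_hat.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def dice_cost(pred_masks, gt_masks, eps=1.0):
    # pred_masks: (Q, H*W) mask probabilities; gt_masks: (G, H*W) binary float masks.
    inter = pred_masks @ gt_masks.T                                  # (Q, G)
    denom = pred_masks.sum(-1)[:, None] + gt_masks.sum(-1)[None, :]
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def match_to_gt(class_probs, mask_probs, gt_labels, gt_masks,
                delta_cls=2.0, delta_msk=5.0):
    # class_probs: (Q, C+1) softmax scores; gt_labels: (G,) long tensor of class ids.
    cost_cls = -class_probs[:, gt_labels]                            # Eq. (2), shape (Q, G)
    Q, G = mask_probs.shape[0], gt_masks.shape[0]
    bce = F.binary_cross_entropy(
        mask_probs[:, None, :].expand(-1, G, -1),
        gt_masks[None, :, :].expand(Q, -1, -1),
        reduction="none").mean(-1)                                   # (Q, G)
    cost_msk = dice_cost(mask_probs, gt_masks) + bce                 # Eq. (3) terms
    cost = delta_cls * cost_cls + delta_msk * cost_msk               # Eq. (1)
    q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return dict(zip(g_idx.tolist(), q_idx.tolist()))                 # gt id -> query id

def teacher_student_association(teacher_out, student_out, gt_labels, gt_masks):
    # teacher_out / student_out: (class_probs, mask_probs) tuples for one image.
    t_map = match_to_gt(*teacher_out, gt_labels, gt_masks)           # gt -> teacher query
    s_map = match_to_gt(*student_out, gt_labels, gt_masks)           # gt -> student query
    # phi_hat: teacher query i -> student query matched to the same gt instance.
    return {t_map[g]: s_map[g] for g in t_map if g in s_map}
```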

3.3. Knowledge Distillation Loss Function

As shown in Figure 2, the student uses both the teacher information and the groundtruth labels during distillation. This leads to our loss function being composed of two parts, the groundtruth loss and the teacher-student loss,

\mathcal{L}_{all} = \mathcal{L}_{gt} + \alpha\mathcal{L}_{dis},   (4)

where α is a balancing weight which varies according to the dataset, and L_gt and L_dis are the corresponding losses from the hard labels and the soft labels during distillation.
For the groundtruth loss, we compare the similarity of the predicted class probabilities and mask predictions with the corresponding hard labels,

\mathcal{L}_{gt} = \delta_{cls}\mathcal{L}_{xe}(GT_{cls}, S_{cls}) + \delta_{msk}\mathcal{L}_{msk},   (5)
\mathcal{L}_{msk} = \mathcal{L}_{dic}(GT_{msk}, S_{msk}) + \mathcal{L}_{xe}(GT_{msk}, S_{msk}),   (6)

where GT_msk is the groundtruth mask, S_msk is the predicted mask, GT_cls is the groundtruth class, and S_cls is the predicted class. The balancing weights are set to δ_cls = 2 and δ_msk = 5. The class-based loss L_xe is a standard cross-entropy loss. The mask-based loss L_msk uses a weighted combination of a standard cross-entropy loss and a Dice loss L_dic; similar to [4] we use equal weights of 5 for each.
The distillation loss is defined as

\mathcal{L}_{dis} = \sum_{i}^{N}\|T_{cls}^{i} - S_{cls}^{\hat{\phi}(i)}\|_2^2 + \beta\sum_{i}^{N}\|T_{msk}^{i} - S_{msk}^{\hat{\phi}(i)}\|_2^2,   (7)

where T_cls^i are the teacher class logits, S_cls^ϕ̂(i) are the student class logits, T_msk^i are the teacher predicted mask logits, S_msk^ϕ̂(i) are the student predicted mask logits, and β is a balancing weight. The i-th teacher-student association is denoted as ϕ̂(i). To compare the teacher and student logits we use an L2 loss. To address the imbalance between the class and mask losses we set the balancing weight β = 0.02.
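Under this reading of Eq. (7), a minimal PyTorch sketch of the distillation term and the combined objective of Eq. (4) is shown below; it assumes the matched teacher and student logits can be gathered with the association ϕ̂ (for example from the matching sketch in Sec. 3.2) and that the mask logits have been resized to a common resolution beforehand.

```python
# Minimal sketch of the distillation loss (Eq. 7) and combined objective (Eq. 4);
# illustrative only. phi_hat maps teacher query indices to student query indices.
import torch
import torch.nn.functional as F

def distillation_loss(t_cls, t_msk, s_cls, s_msk, phi_hat, beta=0.02):
    # t_cls: (Nt, C+1) teacher class logits, t_msk: (Nt, H, W) teacher mask logits.
    # s_cls: (Ns, C+1) student class logits, s_msk: (Ns, H, W) student mask logits.
    t_idx = torch.tensor(list(phi_hat.keys()), device=t_cls.device)
    s_idx = torch.tensor(list(phi_hat.values()), device=s_cls.device)
    # Sum of squared L2 distances over the matched (teacher, student) query pairs.
    loss_cls = F.mse_loss(s_cls[s_idx], t_cls[t_idx], reduction="sum")
    loss_msk = F.mse_loss(s_msk[s_idx], t_msk[t_idx], reduction="sum")
    return loss_cls + beta * loss_msk

def total_loss(loss_gt, loss_dis, alpha):
    # Eq. (4): hard-label loss plus the weighted soft-label (teacher) loss.
    return loss_gt + alpha * loss_dis
```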
4. Experimental Setup and Results

We evaluate our instance semantic segmentation systems on two challenging agricultural datasets, BUP20 and SB20, and on the Cityscapes dataset. BUP20 [38] is a sweet pepper dataset (glasshouse) which consists of 280 images, split into three sets with 124 images to train, 63 images to validate and 93 images to evaluate (test). SB20 [39] is a sugar beet dataset (arable farming) which consists of 143 images, split into three sets with 71 images to train, 37 images to validate and 35 images to evaluate (test). Cityscapes [40] is a traffic scene dataset which consists of 5000 images, split into three sets with 2975 images to train, 500 to validate and 1525 to evaluate (test). Samples of the three real-world datasets can be seen in Figure 3. BUP20 and Cityscapes both have high levels of occlusion, while SB20 has large variation due to the different growth stages of the plants.
All of our models are implemented in Detectron2 [41] and trained on an NVidia A6000 GPU. We fine-tune all the models with weights pre-trained on the COCO dataset [42] and do this 3 times to get the mean and variance of the performance. The only exception is that on Cityscapes we do not train multiple teacher models but instead use the pre-trained model provided by Mask2Former. For BUP20 and SB20, we use the AdamW optimizer [43] with a step learning rate of γ = 1e-4 and γ = 1e-5 for the ResNet and Swin-Transformer backbones respectively, with a batch size b = 1; the small batch size is due to the low number of training images. We search for the optimal weight α to combine the groundtruth and teacher labels in the range 0.2 to 5.0. For Cityscapes, we use the AdamW optimizer with a step learning rate of γ = 1e-4 and a batch size b = 16, and search for α in the range 0.2 to 2.0.
We report performance primarily using average precision (AP). Our teacher network has 200 queries with a Swin-Large backbone and is referred to as M2F_sl. The student networks have only 100 queries and use either a ResNet-50 (S_r50) or Swin-Tiny (S_st) backbone.
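A minimal sketch of this search is given below (illustrative only). The helpers train_student and validation_ap are hypothetical, standing in for a full distillation training run with Eq. (4) and the AP evaluation on the validation split, the optimizer is the standard PyTorch AdamW, and the grid of candidate α values is an assumption since the exact grid points are not specified above.

```python
# Minimal sketch of the alpha grid search described above (illustrative only).
# train_student and validation_ap are hypothetical helpers.
import torch

def search_alpha(student_factory, teacher, train_set, val_set,
                 alphas=(0.2, 0.5, 1.0, 2.0, 5.0), lr=1e-4):
    best_alpha, best_ap = None, float("-inf")
    for alpha in alphas:
        student = student_factory()                  # fresh COCO pre-trained student
        optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
        train_student(student, teacher, optimizer, train_set, alpha=alpha)
        ap = validation_ap(student, val_set)         # AP on the validation split
        if ap > best_ap:
            best_alpha, best_ap = alpha, ap
    return best_alpha, best_ap
```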

Figure 3. Example images from the three datasets with groundtruth instance segmentation masks: BUP20, SB20, and Cityscapes from left to right.

4.1. Experiments on Agricultural Data

4.1.1 Results on BUP20

Table 1. Instance knowledge distillation results on BUP20 in terms of AP, AP50 and AP75.

models               AP         AP50       AP75
teacher (M2F_sl)     51.3±0.1   81.1±0.1   53.3±0.3
baseline (M2F_r50)   44.8±0.3   72.3±0.8   46.2±0.6
dis (S_r50)          46.1±0.8   73.4±0.6   48.3±0.9
baseline (M2F_st)    47.7±0.3   76.9±0.2   49.6±0.3
dis (S_st)           48.5±0.3   77.2±0.4   50.9±0.7

Table 1 outlines the results for BUP20, where there is a clear improvement based on our distillation scheme. Our most efficient network, S_r50, obtains the greatest performance boost with an absolute AP improvement of 1.3 points when compared to its direct baseline M2F_r50 (44.8 to 46.1). The other distilled network, S_st, also improves results with an absolute improvement in AP of 0.8 points.
We attribute the greater increase of the S_r50 model to the larger performance gap between its baseline (M2F_r50) and the teacher network compared to that between M2F_st and the teacher. The difference between the two baseline approaches (M2F_r50 and M2F_st) is believed to be due to the more informative features output by the Swin-T backbone. Interestingly, the main performance gain for the distilled models occurs at the higher AP thresholds: there are large performance gains for AP75 but much lower gains for AP50. The performance gain for AP75 is 2.1 and 1.3 points for S_r50 and S_st respectively. By comparison, the gain for AP50 is 1.1 and 0.3 points for S_r50 and S_st respectively. This indicates that the distillation approach provides models with considerably more accurate semantic segmentation masks, which is important for downstream tasks such as automating the estimation of phenotypic attributes (e.g. the size of fruit) as well as robotic tasks such as harvesting.

4.1.2 Results on SB20

Table 2. Instance knowledge distillation results on SB20 in terms of AP, AP50 and AP75.

models               AP         AP50       AP75
teacher (M2F_sl)     38.4±0.7   79.2±0.9   31.9±0.7
baseline (M2F_r50)   34.9±1.3   75.3±2.0   28.1±2.1
dis (S_r50)          36.9±0.5   76.4±0.9   31.6±0.9
baseline (M2F_st)    36.4±0.8   78.6±0.6   29.8±1.5
dis (S_st)           39.4±0.5   80.1±0.5   35.0±1.0

The results for SB20, see Table 2, demonstrate that our distillation approach consistently improves the performance of our models. For SB20, the student network S_st achieves the greatest performance improvement. For the S_r50 network we achieve an absolute improvement in AP of 2.0 points (34.9 to 36.9) compared to 3.0 points for the more complex S_st (36.4 to 39.4). The impact of our distillation scheme is most evident when comparing S_st to the teacher network, where we improve performance by 1.0 point; however, we note that this improvement is still within the bounds of the variance of the models, where the teacher network has an AP of 38.4±0.7 and the distilled model S_st an AP of 39.4±0.5.
Similar to BUP20, the biggest performance improvements occur at the higher AP thresholds, which can be seen by examining the performance gains of AP50 vs AP75. The performance gain for AP75 is 3.5 and 5.2 points for S_r50 and S_st respectively. By comparison, the gain for AP50 is 1.1 and 1.5 points for S_r50 and S_st respectively. For arable farming data such as SB20, providing considerably more accurate segmentation masks is important for phenotyping tasks such as estimating leaf area as well as for precise robotic weeding.

4.2. Ablation on Cityscapes

To further validate the generalization ability of our distillation system, we apply it to Cityscapes. Cityscapes is very different to BUP20 and SB20 as it consists of pedestrians and vehicles. For our analysis we use a pre-trained teacher model obtained by downloading the online pre-trained weights from Mask2Former [4].

Table 3. Instance knowledge distillation results on Cityscapes in terms of AP and AP50.

models               AP         AP50
teacher (M2F_sl)     43.7       71.3
baseline (M2F_r50)   35.8±1.0   62.6±1.1
dis (S_r50)          37.9±0.3   64.2±0.3
baseline (M2F_st)    37.9±0.8   65.1±0.6
dis (S_st)           39.7±0.7   66.4±1.4

Our results in Table 3 demonstrate that our distillation approach is also effective on this data. The student S_r50 gains an absolute performance increase in AP of 2.1 points, from 35.8 to 37.9, and the student S_st gains 1.8 points, improving from 37.9 to 39.7. Despite these considerable performance gains there is still a large gap in performance between the students and the teacher: for the S_r50 network the absolute difference in terms of AP is 5.8 points, while for S_st it is 4.0 points.

4.3. Best Student Model and Inference Time

In the previous experiments we demonstrated that our distillation approach consistently improves performance. Overall, our distillation scheme achieves considerable improvements, with an average absolute performance improvement in terms of AP of 1.8 points for S_r50 and 1.9 points for S_st across the three datasets. On average, over the three datasets, the two student models are 2.0 and 2.3 times faster, for S_st and S_r50 respectively. Furthermore, the S_st model consistently outperforms S_r50 and so in most cases it would be the preferred distilled model. However, for SB20 the relative performance degradation is smallest, with S_r50 dropping by 1.5 points in AP, which is a 3.9% relative reduction in performance. Therefore, for this case S_r50 might be considered preferable if a lower inference time is imperative, as it is 14 milliseconds, or 14%, faster than S_st; SB20 has a smaller image size, so the relative speed difference is lower than on the other, higher-resolution datasets.
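For reference, a minimal sketch of how the per-image inference time could be measured on a GPU is shown below; it is not our benchmarking code and simply uses standard CUDA event timing around a generic model(images) forward call, whose exact input format depends on the framework.

```python
# Minimal sketch for measuring average per-image inference time on GPU
# (illustrative only). Uses CUDA events so asynchronous GPU work is timed correctly.
import torch

@torch.no_grad()
def mean_inference_time_ms(model, images, warmup=10, repeats=100):
    model.eval()
    for _ in range(warmup):                       # warm-up to amortize lazy initialization
        model(images)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(repeats):
        model(images)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / repeats      # milliseconds per forward pass
```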

5. Summary

In this paper, we have proposed a bipartite matching approach to perform knowledge distillation on the output queries of Mask2Former, a transformer-based network. The bipartite matching allows us to associate the queries predicted by the student and the teacher. Using this association we then distill the corresponding query-based class probabilities and instance masks from the teacher to the student. We apply this to student models with vastly different backbones, consisting of either a transformer backbone (Swin-Tiny) or even a CNN backbone (ResNet-50). To the best of our knowledge, this is the first time that such an approach for knowledge distillation has been proposed.
We evaluate our knowledge distillation scheme on two challenging agricultural datasets as well as on Cityscapes, which consists of pedestrian and vehicle data. In all cases, applying our approach leads to improved performance for the distilled models, with an average absolute performance improvement in terms of AP of 1.8 points for S_r50 and 1.9 points for S_st across the three datasets. In particular, our approach leads to more precise detections, as demonstrated by the larger gains in AP75 than in AP50. Overall, we show that simple student networks trained with our instance knowledge distillation scheme can retain a high accuracy with faster inference than the teacher model. Future work should examine the potential impact of changing not just the backbone but also the pixel decoder to further improve the trade-off between accuracy and computational efficiency.

References

[1] M. Halstead, P. Zimmer, and C. McCool, "A cross-domain challenge with panoptic segmentation in agriculture," The International Journal of Robotics Research, p. 02783649241227448, 2024.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[3] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in European Conference on Computer Vision. Springer, 2020, pp. 213–229.
[4] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, "Masked-attention mask transformer for universal image segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1290–1299.
[5] R. Wang, T. Lei, R. Cui, B. Zhang, H. Meng, and A. K. Nandi, "Medical image segmentation using deep learning: A survey," IET Image Processing, vol. 16, no. 5, pp. 1243–1267, 2022.
[6] L. Guan and X. Yuan, "Instance segmentation model evaluation and rapid deployment for autonomous driving using domain differences," IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 4, pp. 4050–4059, 2023.
[7] I. Balazevic, D. Steiner, N. Parthasarathy, R. Arandjelović, and O. Henaff, "Towards in-context scene understanding," Advances in Neural Information Processing Systems, vol. 36, 2024.
[8] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[9] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, "Mask scoring R-CNN," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6409–6418.
[10] H. Zhang, Y. Tian, K. Wang, W. Zhang, and F.-Y. Wang, "Mask SSD: An effective single-stage approach to object instance segmentation," IEEE Transactions on Image Processing, vol. 29, pp. 2078–2093, 2019.
[11] D. Neven, B. D. Brabandere, M. Proesmans, and L. V. Gool, "Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8837–8845.
[12] N. Gao, Y. Shan, Y. Wang, X. Zhao, Y. Yu, M. Yang, and K. Huang, "SSAP: Single-shot instance segmentation with affinity pyramid," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 642–651.
[13] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time instance segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9157–9166.
[14] X. Chen, R. Girshick, K. He, and P. Dollár, "TensorMask: A foundation for dense object segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2061–2069.
[15] X. Wang, T. Kong, C. Shen, Y. Jiang, and L. Li, "SOLO: Segmenting objects by locations," in Computer Vision – ECCV 2020. Springer, 2020, pp. 649–665.
[16] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, "SOLOv2: Dynamic and fast instance segmentation," Advances in Neural Information Processing Systems, vol. 33, pp. 17721–17732, 2020.
[17] X. Dong, J. Bao, D. Chen, W. Zhang, N. Yu, L. Yuan, D. Chen, and B. Guo, "CSWin Transformer: A general vision transformer backbone with cross-shaped windows," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12124–12134.
[18] Y. Lee, J. Kim, J. Willette, and S. J. Hwang, "MPViT: Multi-path vision transformer for dense prediction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7287–7296.
[19] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin Transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.

[20] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka, L. Li, Z. Yuan, C. Wang et al., "Sparse R-CNN: End-to-end object detection with learnable proposals," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14454–14463.
[21] Y. Fang, S. Yang, X. Wang, Y. Li, C. Fang, Y. Shan, B. Feng, and W. Liu, "Instances as queries," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6910–6919.
[22] J. Hu, L. Cao, Y. Lu, S. Zhang, Y. Wang, K. Li, F. Huang, L. Shao, and R. Ji, "ISTR: End-to-end instance segmentation with transformers," arXiv preprint arXiv:2105.00637, 2021.
[23] B. Dong, F. Zeng, T. Wang, X. Zhang, and Y. Wei, "SOLQ: Segmenting objects by learning queries," Advances in Neural Information Processing Systems, vol. 34, pp. 21898–21909, 2021.
[24] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[25] J. Gou, B. Yu, S. J. Maybank, and D. Tao, "Knowledge distillation: A survey," International Journal of Computer Vision, vol. 129, no. 6, pp. 1789–1819, 2021.
[26] C. Shu, Y. Liu, J. Gao, Z. Yan, and C. Shen, "Channel-wise knowledge distillation for dense prediction," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5311–5320.
[27] H. Bai, H. Mao, and D. Nair, "Dynamically pruning SegFormer for efficient semantic segmentation," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 3298–3302.
[28] C. Yang, H. Zhou, Z. An, X. Jiang, Y. Xu, and Q. Zhang, "Cross-image relational knowledge distillation for semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12319–12328.
[29] Z. Yang, Z. Li, M. Shao, D. Shi, Z. Yuan, and C. Yuan, "Masked generative distillation," in European Conference on Computer Vision. Springer, 2022, pp. 53–69.
[30] S. An, Q. Liao, Z. Lu, and J.-H. Xue, "Efficient semantic segmentation via self-attention and self-distillation," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, pp. 15256–15266, 2022.
[31] S. Lin, H. Xie, B. Wang, K. Yu, X. Chang, X. Liang, and G. Wang, "Knowledge distillation via the target-aware transformer," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10915–10924.
[32] S. Ren, Z. Gao, T. Hua, Z. Xue, Y. Tian, S. He, and H. Zhao, "Co-advise: Cross inductive bias distillation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16773–16782.
[33] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in International Conference on Machine Learning. PMLR, 2021, pp. 10347–10357.
[34] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[35] Y. Liu, J. Cao, B. Li, W. Hu, J. Ding, and L. Li, "Cross-architecture knowledge distillation," in Proceedings of the Asian Conference on Computer Vision, 2022, pp. 3396–3411.
[36] K. Wu, J. Zhang, H. Peng, M. Liu, B. Xiao, J. Fu, and L. Yuan, "TinyViT: Fast pretraining distillation for small vision transformers," in European Conference on Computer Vision. Springer, 2022, pp. 68–85.
[37] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[38] C. Smitt, M. Halstead, T. Zaenker, M. Bennewitz, and C. McCool, "PATHoBot: A robot for glasshouse crop phenotyping and intervention," in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 2324–2330.
[39] A. Ahmadi, M. Halstead, and C. McCool, "BonnBot-I: A precise weed management and crop monitoring platform," in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 9202–9209.
[40] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[41] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, "Detectron2," https://github.com/facebookresearch/detectron2, 2019.
[42] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Computer Vision – ECCV 2014. Springer, 2014, pp. 740–755.
[43] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in International Conference on Learning Representations, 2018.
