

Knowledge Distillation for Efficient Instance Semantic Segmentation with Transformers

Maohui Li1, Michael Halstead1, Chris McCool1,2
1University of Bonn, 2Lamarr Institute for Machine Learning and Artificial Intelligence
{mlii, michael.halstead, cmccool}@uni-bonn.de

Abstract

Instance-based semantic segmentation provides detailed per-pixel scene understanding information that is crucial for both computer vision and robotics applications. However, state-of-the-art approaches such as Mask2Former are computationally expensive, and reducing this computational burden while maintaining high accuracy remains challenging. Knowledge distillation has been regarded as a potential way to compress neural networks, but to date limited work has explored how to apply this to distill information from the output queries of a model such as Mask2Former.
In this paper, we match the output queries of the student and teacher models to enable a query-based knowledge distillation scheme. We independently match the teacher and the student to the groundtruth and use this to define the teacher-to-student relationship for knowledge distillation. Using this approach we show that it is possible to perform knowledge distillation where the student models can have a lower number of queries and the backbone can be changed from a Transformer architecture to a convolutional neural network architecture. Experiments on two challenging agricultural datasets, sweet pepper (BUP20) and sugar beet (SB20), and Cityscapes demonstrate the efficacy of our approach. Across the three datasets the student models obtain an average absolute performance improvement in AP of 1.8 and 1.9 points for the ResNet-50 and Swin-Tiny backbones respectively. To the best of our knowledge, this is the first work to propose knowledge distillation schemes for instance semantic segmentation with transformer-based models.

Index Terms: Computer Vision for Agriculture Automation, Knowledge Distillation, Efficient Instance Segmentation, Transformer

Figure 1. A visualization of our bipartite query-based matching between the teacher and the student. We first associate the predicted queries of each model, either teacher or student, to the groundtruth to obtain the teacher-gt and student-gt matchings respectively. We then use these to find the association between the predicted queries of the teacher and the student.

1. Introduction

The broad application of computer vision algorithms, such as detection, semantic segmentation, and instance-based semantic segmentation, brings the potential to greatly improve the efficiency of agricultural automation. Automatic fruit picking, weed removal, and pesticide drip irrigation have been made possible by the introduction of dense predictions [1]. Instance segmentation provides a wealth of information, such as per-pixel classification and the location of object instances, which contributes to agricultural efficiency and makes it possible to convert agricultural production from human labor to automation. Despite these advances, it remains highly challenging to deploy vision algorithms in agriculture, since they are vulnerable to clutter from leaves and other crops as well as highly variable lighting conditions. Furthermore, the limited computational and energy resources of edge devices constrain deployment on robots. Therefore, efficient and accurate models capable of instance-based segmentation are integral to enabling real-world deployment.
Transformer-based models were introduced to computer vision to further improve accuracy after the successful application of vanilla transformers [2] in natural language processing (NLP). DETR [3] was one of the first transformer-based object detection models, and it introduced the concept of an object query which only contains features from one instance.

More recently, the Mask2Former [4] architecture was proposed, which is capable of state-of-the-art semantic, instance-based, and panoptic segmentation by extracting masked attention within predicted mask regions. Despite the improved accuracy provided by transformer-based models, their complicated structures inevitably increase the computational complexity and hamper their application in real-time scenarios.
One of the most effective ways to improve the efficiency of deep neural networks is through knowledge distillation, by imparting knowledge from complex networks (teachers) to efficient networks (students). There are many kinds of knowledge that can be used for distillation, from features of the final or intermediate layers through to the mutual relationships between features. However, the majority of these distillation schemes are designed for use with convolutional neural network (CNN)-based structures.
This means that distillation of the joint image- and pixel-level classification of transformer-based networks is yet to be explored. Thus, how to distill instance-level knowledge from a transformer-based structure is the key contribution of this paper.
In this paper, we compress complex transformer-based models into more efficient networks while retaining a high degree of accuracy. First, we produce an optimal bipartite matching between the queries of the teacher and the student, as shown in Figure 1, by using the Hungarian algorithm [3]. Second, we train our efficient networks by distilling the class probabilities and mask maps from the teacher network. For our experiments we perform distillation using the Mask2Former [4] architecture due to its high accuracy for instance-based semantic segmentation and the ability to easily switch the backbone network. Finally, we show the validity of our approach by performing knowledge distillation in multiple domains: arable farmland, horticulture, and traffic scenarios.
Our results, on challenging agricultural and traffic datasets, demonstrate that our knowledge distillation scheme can be employed to learn efficient and accurate lightweight models. Our models are 2.3 or 2.0 times faster than the original complex teacher while only degrading AP (average precision) performance by as little as 2.8 or 4.0 points, and in one case the student outperforms the teacher by 1.0 point. The main contribution of our paper is that we establish pair-wise query matching between transformer-based models with different complexities (backbones and number of queries) and distill the query-based knowledge from the teacher to the student.

2. Related Work

2.1. Instance Segmentation

Instance-based semantic segmentation can be regarded as the process of labeling pixels with categories and object ids. This pixel-wise classification is fundamental for many advanced computer vision tasks, such as medical image analysis [5], autonomous driving [6], and scene understanding [7]. The methods can be roughly divided into two main categories: single-stage and two-stage methods.
Two-stage methods dominate instance segmentation and can be further divided into top-down and bottom-up methods. Top-down methods [8-10] predict bounding boxes first and then generate the instance masks within these boxes, which means that the final performance is highly dependent on the detection results. In contrast, bottom-up methods [11, 12] classify pixels into the corresponding categories and then apply post-processing to form the instances, which means that the final performance is highly dependent on the post-processing approach. Single-stage methods jointly perform detection and segmentation, and are further divided into anchor-based and anchor-free methods. For anchor-based methods [13, 14], most detectors rely heavily on pre-defined anchors which vary in scale depending on the target dataset. These techniques also rely on handcrafted approaches like non-maximal suppression (NMS) to remove redundancies, which can increase the computational burden. Anchor-free methods [15, 16] distinguish instances on the basis of the predicted locations and shapes of the objects. A limitation of anchor-free methods is that their performance generally decreases when many instances overlap, because each grid can only predict one location and mask.

2.2. Transformer-Based Dense Prediction

A variety of transformers have been designed for vision tasks [17, 18]. A standard backbone for various dense vision prediction tasks is the Swin-Transformer [19], which consists of a hierarchical architecture with multi-scale feature maps. Inspired by transformer structures, recent work has exploited self-attention to capture the long-range relationships needed to perform instance-based semantic segmentation. A query-based instance segmentation method based on the structure of Sparse R-CNN [20] is proposed by Fang et al. [21]. In this approach, a mask pooling operator is designed to extract the current-stage instance mask features and a dynamic convolution module is employed to link mask features and query embeddings. In ISTR [22], three task-specific heads and a fixed mask decoder work together to accomplish the classification, localization, and segmentation predictions by taking in image-level features and learnable position embeddings.

Dong et al. [23] propose a structure that completes classification, bounding box regression, and mask segmentation in one head, linking segmentation and detection as a unified query representation. However, complex transformer-based models do not currently meet the real-time requirements of robotic platforms.

2.3. Knowledge Distillation

Knowledge distillation has proven effective at compressing models while maintaining performance. It enables a lightweight network (student) to boost its performance by learning soft labels from a more complicated (teacher) network [24]. The form of knowledge is traditionally divided into three categories when performing distillation with CNN-based models: response-based knowledge, feature-based knowledge, and relation-based knowledge [25]. Response-based knowledge relies on the response of the last output layer of the networks [26, 27]. Feature-based knowledge uses the features from selected intermediate layers, and an adaptation layer is often required to address the dimension mismatch between the teacher and the student during distillation [28, 29]. Relation-based knowledge focuses more on the relationships between features from different network layers or data samples, with a typical example being [30].
These distillation techniques have been shown to perform well for CNN-based networks; however, for transformer-based networks they are not directly applicable due to the inherently different network structures. Due to these differences, self-attention based knowledge distillation approaches are required. Lin et al. [31] propose the target-aware transformer. Their approach assumes that the semantic information usually varies at the same spatial location since the teacher's and the student's receptive fields are different. The model generates the similarity between each pixel of the teacher's features and all spatial locations of the student's features during the distillation process. The performance of their technique surpasses state-of-the-art methods by a significant margin on common benchmarks. In [32], the authors find that the dominant factor affecting distillation performance is the inductive bias of the teacher rather than its accuracy. The student is more likely to learn diverse knowledge when the inductive biases of both a CNN and an involution-based neural network (INN) teacher are transferred during distillation. Touvron et al. [33] propose a novel attention distillation mechanism and achieve state-of-the-art performance on ImageNet. The authors exploit a distillation token, similar to the classification token within ViT [34], to enable the student to learn the attention maps from the teacher. To gain cross-architecture knowledge, a novel distillation scheme is proposed by Liu et al. [35] to combat the gap between heterogeneous architectures. Two projectors, one for partial cross-attention and one for group-wise linear projection, are designed to align the student's intermediate features into the teacher's attention space during this process. As the complex structure of the teacher consumes a large amount of resources while supervising the student, a framework [36] that stores the teacher's predictions in advance was proposed.
The distillation methods with transformer-based models mentioned above focus mainly on image-level or pixel-level classification, while distillation schemes for instance-based semantic segmentation and other dense prediction paradigms remain unexplored. In this paper we investigate this dense prediction setting by using a bipartite matching scheme to distill information from a teacher to a student.

3. Proposed Approach

In this paper, we propose a knowledge distillation scheme for instance-based semantic segmentation with transformer-based models. We build an ordered permutation for the queries of the student and the teacher, and then distill the query-based knowledge based on the established matching. Finally, we test our scheme on two agricultural datasets and a common city scene dataset. The approach consists of three aspects.
1. We adopt Mask2Former as the framework for both teacher and student networks. For the teacher network we use the Swin-Large backbone [19] with a high number of queries. For the student networks we explore the use of both ResNet [37] and Swin backbones with fewer queries to explore the trade-off between efficiency and accuracy.
2. We build a new matching technique to facilitate the query-based distillation process. We match the unordered queries from the sets of predictions of the teacher and the student by calculating their instance similarity using class probabilities and mask maps. We achieve this by exploiting the available groundtruth labels; the matching scheme is explicitly described in Sec. 3.2.
3. Based on the result of the matching scheme, we perform instance-based knowledge distillation using both the teacher and the groundtruth labels. For this, we define a teacher-to-student loss function that incorporates both class probabilities and segmentation masks. When using both the teacher and the groundtruth labels we combine them using a weighted loss function. This is described in Sec. 3.3.

3.1. Teacher and Student Network Architectures

We adopt Mask2Former as the framework, which is an encoder-decoder structure with self-attention layers. Mask2Former is composed of a backbone, a pixel decoder, and a transformer decoder, which results in N queries that are used to predict the class probabilities (including object or not) and the associated per-pixel mask. The backbone extracts information from an image and outputs features at different resolutions. The pixel decoder generates high-resolution per-pixel embeddings by sampling the features from the backbone. The transformer decoder processes the long-range relationships of the image information and generates the object queries, with which the final heads predict the probabilities and the masks. More details on this approach can be found in [4].
We take the competitively performing and commonly used ResNet and Swin-Transformer as the backbones of our models. The model inference complexity can be varied by selecting a backbone, from ResNet-18 to ResNet-101 or from Swin-T (tiny version) to Swin-L (large version). We set the teacher backbone to be the top performer (Swin-L), and the student uses either ResNet-50 or Swin-T as its backbone as they are computationally more efficient. The teacher backbone was selected to ensure that it can capture deeper relationships within the scene, and it uses 200 queries. As the student backbone is simpler we reduce its number of queries to 100.
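As a rough illustration of this configuration, the sketch below shows how the teacher and student variants could be instantiated. It assumes the public Mask2Former code base on top of Detectron2; the helper add_maskformer2_config, the config key MODEL.MASK_FORMER.NUM_OBJECT_QUERIES, and the config file names are assumptions that may differ from the exact setup used here.

```python
# Minimal sketch (not the authors' code) of configuring the teacher and student
# Mask2Former variants. Assumes the public Mask2Former project on Detectron2;
# config key names and file names below are assumptions, not verified values.
from detectron2.config import get_cfg
from detectron2.modeling import build_model
from mask2former import add_maskformer2_config  # provided by the Mask2Former repo (assumed)

def build_mask2former(config_file, num_queries, weights=None):
    cfg = get_cfg()
    add_maskformer2_config(cfg)            # may also need the project's other config extensions
    cfg.merge_from_file(config_file)
    cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES = num_queries
    if weights is not None:
        cfg.MODEL.WEIGHTS = weights        # e.g. COCO pre-trained weights
    return build_model(cfg)

# Teacher: Swin-Large backbone with 200 queries, used only to provide soft targets.
teacher = build_mask2former("swin_large_config.yaml", num_queries=200)
teacher.eval()

# Students: ResNet-50 (S_r50) or Swin-Tiny (S_st) backbone, each with 100 queries.
student_r50 = build_mask2former("resnet50_config.yaml", num_queries=100)
student_st = build_mask2former("swin_tiny_config.yaml", num_queries=100)
```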

Figure 2. An illustration of our instance knowledge distillation process. Given the association of the predicted queries from the teacher to the student model, we use an L2 loss to perform joint distillation on the class probabilities and mask predictions. In this process, only queries with effective instances are distilled; the other, redundant queries are discarded.

3.2. Student-Teacher Query Matching

In order to perform knowledge distillation we need to find matching queries between the teacher and student models. As shown in Figure 1, we establish the teacher-student matching by exploiting the existing groundtruth labels (gt). We do this by first resolving the teacher-gt and student-gt associations. From these, we can then derive the teacher-student association, ϕ̂.
To find the optimal bipartite matchings, teacher-gt and student-gt, we exploit the Hungarian algorithm commonly used in previous work [3]. This establishes the query permutation by optimizing the total cost obtained by combining both the class probabilities and the instance mask predictions. For this we use the following matching loss (cost),

\mathcal{L}_{match} = \delta_{cls}\mathcal{L}_{cls} + \delta_{msk}\mathcal{L}_{msk},   (1)

where δ_cls = 2 and δ_msk = 5, and L_cls and L_msk are the corresponding costs for the class probabilities and mask predictions.
For the class matching cost, we only use probabilities from queries that are matched to an object (i.e. C_i ≠ ∅),

\mathcal{L}_{cls} = -\mathbbm{1}_{C_i \neq \emptyset}\,\hat{p}_{\sigma(i)}(C_i),   (2)

where p̂_σ(i)(C_i) is the predicted probability of the class and C_i is the groundtruth class label.
For the mask matching cost, we calculate the pixel-wise similarity and the overlap between instances,

\mathcal{L}_{msk} = -\mathbbm{1}_{C_i \neq \emptyset}\left[\mathcal{L}_{dic}(m_i, \hat{m}_{\sigma(i)}) + \mathcal{L}_{xe}(m_i, \hat{m}_{\sigma(i)})\right],   (3)

where m_i and m̂_σ(i) are the groundtruth and predicted masks in one image, and L_dic and L_xe are the Dice loss and cross-entropy loss respectively.
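To make the matching concrete, the following is a minimal sketch (not our released code) of the two-step association: each model's queries are first matched to the groundtruth with the Hungarian algorithm using a cost built from the class probabilities (Eq. 2) and mask overlaps (Eq. 3), and the two assignments are then composed so that each teacher query maps to the student query matched to the same groundtruth instance. The helper names and tensor layouts are illustrative.

```python
# Minimal sketch of the query matching in Sec. 3.2 (illustrative only).
# Queries of each model are matched to the groundtruth via the Hungarian
# algorithm; composing the two assignments yields the teacher-student map phi_hat.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def dice_cost(pred_masks, gt_masks, eps=1.0):
    # pred_masks: (Q, H*W) mask probabilities; gt_masks: (G, H*W) binary float masks.
    inter = pred_masks @ gt_masks.T                                  # (Q, G)
    denom = pred_masks.sum(-1)[:, None] + gt_masks.sum(-1)[None, :]
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def match_to_gt(class_probs, mask_probs, gt_labels, gt_masks,
                delta_cls=2.0, delta_msk=5.0):
    # class_probs: (Q, C+1) softmax scores; gt_labels: (G,) long tensor of class ids.
    cost_cls = -class_probs[:, gt_labels]                            # Eq. (2), shape (Q, G)
    Q, G = mask_probs.shape[0], gt_masks.shape[0]
    bce = F.binary_cross_entropy(
        mask_probs[:, None, :].expand(-1, G, -1),
        gt_masks[None, :, :].expand(Q, -1, -1),
        reduction="none").mean(-1)                                   # (Q, G)
    cost_msk = dice_cost(mask_probs, gt_masks) + bce                 # Eq. (3) terms
    cost = delta_cls * cost_cls + delta_msk * cost_msk               # Eq. (1)
    q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return dict(zip(g_idx.tolist(), q_idx.tolist()))                 # gt id -> query id

def teacher_student_association(teacher_out, student_out, gt_labels, gt_masks):
    # teacher_out / student_out: (class_probs, mask_probs) tuples for one image.
    t_map = match_to_gt(*teacher_out, gt_labels, gt_masks)           # gt -> teacher query
    s_map = match_to_gt(*student_out, gt_labels, gt_masks)           # gt -> student query
    # phi_hat: teacher query i -> student query matched to the same gt instance.
    return {t_map[g]: s_map[g] for g in t_map if g in s_map}
```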

3.3. Knowledge Distillation Loss Function

As shown in Figure 2, the student uses both the teacher information and the groundtruth labels during distillation. This leads to our loss function being composed of two parts, the groundtruth loss and the teacher-student loss,

\mathcal{L}_{all} = \mathcal{L}_{gt} + \alpha\mathcal{L}_{dis},   (4)

where α is a balancing weight which varies according to the dataset, and L_gt and L_dis are the corresponding losses from the hard labels and the soft labels during distillation.
For the groundtruth loss, we compare the similarity of the predicted class probabilities and mask predictions with the corresponding hard labels,

\mathcal{L}_{gt} = \delta_{cls}\mathcal{L}_{xe}(GT_{cls}, S_{cls}) + \delta_{msk}\mathcal{L}_{msk},   (5)
\mathcal{L}_{msk} = \mathcal{L}_{dic}(GT_{msk}, S_{msk}) + \mathcal{L}_{xe}(GT_{msk}, S_{msk}),   (6)

where GT_msk is the groundtruth mask, S_msk is the predicted mask, GT_cls is the groundtruth class, and S_cls is the predicted class. The balancing weights are set to δ_cls = 2 and δ_msk = 5. The class-based loss L_xe is a standard cross-entropy loss. The mask-based loss L_msk uses a weighted combination of a standard cross-entropy loss and a Dice loss L_dic; similar to [4] we use equal weights of 5 for each.
The distillation loss is defined as

\mathcal{L}_{dis} = \sum_{i}^{N}\|T_{cls}^{i} - S_{cls}^{\hat{\phi}(i)}\|_2^2 + \beta\sum_{i}^{N}\|T_{msk}^{i} - S_{msk}^{\hat{\phi}(i)}\|_2^2,   (7)

where T_cls^i are the teacher class logits, S_cls^ϕ̂(i) are the student class logits, T_msk^i are the teacher predicted mask logits, S_msk^ϕ̂(i) are the student predicted mask logits, and β is a balancing weight. The i-th teacher-student association is denoted as ϕ̂(i). To compare the teacher and student logits we use an L2 loss. To address the imbalance between the class and mask losses we set the balancing weight β = 0.02.
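Under this reading of Eq. (7), a minimal PyTorch sketch of the distillation term and the combined objective of Eq. (4) is shown below; it assumes the matched teacher and student logits can be gathered with the association ϕ̂ (for example from the matching sketch in Sec. 3.2) and that the mask logits have been resized to a common resolution beforehand.

```python
# Minimal sketch of the distillation loss (Eq. 7) and combined objective (Eq. 4);
# illustrative only. phi_hat maps teacher query indices to student query indices.
import torch
import torch.nn.functional as F

def distillation_loss(t_cls, t_msk, s_cls, s_msk, phi_hat, beta=0.02):
    # t_cls: (Nt, C+1) teacher class logits, t_msk: (Nt, H, W) teacher mask logits.
    # s_cls: (Ns, C+1) student class logits, s_msk: (Ns, H, W) student mask logits.
    t_idx = torch.tensor(list(phi_hat.keys()), device=t_cls.device)
    s_idx = torch.tensor(list(phi_hat.values()), device=s_cls.device)
    # Sum of squared L2 distances over the matched (teacher, student) query pairs.
    loss_cls = F.mse_loss(s_cls[s_idx], t_cls[t_idx], reduction="sum")
    loss_msk = F.mse_loss(s_msk[s_idx], t_msk[t_idx], reduction="sum")
    return loss_cls + beta * loss_msk

def total_loss(loss_gt, loss_dis, alpha):
    # Eq. (4): hard-label loss plus the weighted soft-label (teacher) loss.
    return loss_gt + alpha * loss_dis
```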
4. Experimental Setup and Results

We evaluate our instance semantic segmentation systems on two challenging agricultural datasets, BUP20 and SB20, and on the Cityscapes dataset. BUP20 [38] is a sweet pepper dataset (glasshouse) which consists of 280 images, split into three sets with 124 images to train, 63 images to validate and 93 images to evaluate (test). SB20 [39] is a sugar beet dataset (arable farming) which consists of 143 images, split into three sets with 71 images to train, 37 images to validate and 35 images to evaluate (test). Cityscapes [40] is a traffic scene dataset which consists of 5000 images, split into three sets with 2975 images to train, 500 to validate and 1525 to evaluate (test). Samples of the three real-world datasets can be seen in Figure 3. BUP20 and Cityscapes both have high levels of occlusion, while SB20 has large variation due to the different growth stages of the plants.
All of our models are implemented in Detectron2 [41] and trained on an NVidia A6000 GPU. We fine-tune all the models with weights pre-trained on the COCO dataset [42] and do this 3 times to get the mean and variance of the performance. The only exception is that on Cityscapes we do not train multiple teacher models but instead use the pre-trained model provided by Mask2Former. For BUP20 and SB20, we use the AdamW optimizer [43] with a step learning rate of γ = 1e-4 and γ = 1e-5 for the ResNet and Swin-Transformer backbones respectively, with a batch size b = 1; the small batch size is due to the low number of training images. We search for the optimal weight α to combine the groundtruth and teacher labels in the range 0.2 to 5.0. For Cityscapes, we use the AdamW optimizer with a step learning rate of γ = 1e-4 and a batch size b = 16, and search for α in the range 0.2 to 2.0.
We report performance primarily using average precision (AP). Our teacher network has 200 queries with a Swin-Large backbone and is referred to as M2F_sl. The student networks have only 100 queries and use either a ResNet-50 (S_r50) or Swin-Tiny (S_st) backbone.
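A minimal sketch of this search is given below (illustrative only). The helpers train_student and validation_ap are hypothetical, standing in for a full distillation training run with Eq. (4) and the AP evaluation on the validation split, the optimizer is the standard PyTorch AdamW, and the grid of candidate α values is an assumption since the exact grid points are not specified above.

```python
# Minimal sketch of the alpha grid search described above (illustrative only).
# train_student and validation_ap are hypothetical helpers.
import torch

def search_alpha(student_factory, teacher, train_set, val_set,
                 alphas=(0.2, 0.5, 1.0, 2.0, 5.0), lr=1e-4):
    best_alpha, best_ap = None, float("-inf")
    for alpha in alphas:
        student = student_factory()                  # fresh COCO pre-trained student
        optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
        train_student(student, teacher, optimizer, train_set, alpha=alpha)
        ap = validation_ap(student, val_set)         # AP on the validation split
        if ap > best_ap:
            best_alpha, best_ap = alpha, ap
    return best_alpha, best_ap
```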

Figure 3. Example images from the three datasets with groundtruth instance segmentation masks: BUP20, SB20, and Cityscapes from left to right.

4.1. Experiments on Agricultural Data

4.1.1 Results on BUP20

Table 1. Instance knowledge distillation results on BUP20 in terms of AP, AP50 and AP75.

models               AP         AP50       AP75
teacher (M2F_sl)     51.3±0.1   81.1±0.1   53.3±0.3
baseline (M2F_r50)   44.8±0.3   72.3±0.8   46.2±0.6
dis (S_r50)          46.1±0.8   73.4±0.6   48.3±0.9
baseline (M2F_st)    47.7±0.3   76.9±0.2   49.6±0.3
dis (S_st)           48.5±0.3   77.2±0.4   50.9±0.7

Table 1 outlines the results for BUP20, where there is a clear improvement based on our distillation scheme. Our most efficient network, S_r50, obtains the greatest performance boost with an absolute AP improvement of 1.3 points when compared to its direct baseline M2F_r50 (44.8 to 46.1). The other distilled network, S_st, also improves results with an absolute improvement in AP of 0.8 points.
We attribute the greater increase of the S_r50 model to the larger performance gap between its baseline (M2F_r50) and the teacher network compared to that between M2F_st and the teacher. The difference between the two baseline approaches (M2F_r50 and M2F_st) is believed to be due to the more informative features output by the Swin-T backbone. Interestingly, the main performance gain for the distilled models occurs at the higher AP thresholds: there are large performance gains for AP75 but much lower gains for AP50. The performance gain for AP75 is 2.1 and 1.3 points for S_r50 and S_st respectively. By comparison, the gain for AP50 is 1.1 and 0.3 points for S_r50 and S_st respectively. This indicates that the distillation approach provides models with considerably more accurate semantic segmentation masks, which is important for downstream tasks such as automating the estimation of phenotypic attributes (e.g. the size of fruit) as well as robotic tasks such as harvesting.

4.1.2 Results on SB20

Table 2. Instance knowledge distillation results on SB20 in terms of AP, AP50 and AP75.

models               AP         AP50       AP75
teacher (M2F_sl)     38.4±0.7   79.2±0.9   31.9±0.7
baseline (M2F_r50)   34.9±1.3   75.3±2.0   28.1±2.1
dis (S_r50)          36.9±0.5   76.4±0.9   31.6±0.9
baseline (M2F_st)    36.4±0.8   78.6±0.6   29.8±1.5
dis (S_st)           39.4±0.5   80.1±0.5   35.0±1.0

The results for SB20, see Table 2, demonstrate that our distillation approach consistently improves the performance of our models. For SB20, the student network S_st achieves the greatest performance improvement. For the S_r50 network we achieve an absolute improvement in AP of 2.0 points (34.9 to 36.9) compared to 3.0 points for the more complex S_st (36.4 to 39.4). The impact of our distillation scheme is most evident when comparing S_st to the teacher network, where we improve performance by 1.0 point; however, we note that this improvement is still within the bounds of the variance of the models, where the teacher network has an AP of 38.4±0.7 and the distilled model S_st an AP of 39.4±0.5.
Similar to BUP20, the biggest performance improvements occur at the higher AP thresholds, which can be seen by examining the performance gains of AP50 vs AP75. The performance gain for AP75 is 3.5 and 5.2 points for S_r50 and S_st respectively. By comparison, the gain for AP50 is 1.1 and 1.5 points for S_r50 and S_st respectively. For arable farming data such as SB20, providing considerably more accurate segmentation masks is important for phenotyping tasks such as estimating leaf area as well as for precise robotic weeding.

4.2. Ablation on Cityscapes

To further validate the generalization ability of our distillation system, we apply it to Cityscapes. Cityscapes is very different to BUP20 and SB20 as it consists of pedestrians and vehicles. For our analysis we use a pre-trained teacher model obtained by downloading the online pre-trained weights from Mask2Former [4].

Table 3. Instance knowledge distillation results on Cityscapes in terms of AP and AP50.

models               AP         AP50
teacher (M2F_sl)     43.7       71.3
baseline (M2F_r50)   35.8±1.0   62.6±1.1
dis (S_r50)          37.9±0.3   64.2±0.3
baseline (M2F_st)    37.9±0.8   65.1±0.6
dis (S_st)           39.7±0.7   66.4±1.4

Our results in Table 3 demonstrate that our distillation approach is also effective on this data. The student S_r50 gains an absolute performance increase in AP of 2.1 points, from 35.8 to 37.9, and the student S_st gains 1.8 points, improving from 37.9 to 39.7. Despite these considerable performance gains there is still a large gap in performance between the students and the teacher: for the S_r50 network the absolute difference in terms of AP is 5.8 points, while for S_st it is 4.0 points.

4.3. Best Student Model and Inference Time

In the previous experiments we demonstrated that our distillation approach consistently improves performance. Overall, our distillation scheme achieves considerable improvements, with an average absolute performance improvement in terms of AP of 1.8 points for S_r50 and 1.9 points for S_st across the three datasets. On average, over the three datasets, the two student models are 2.0 and 2.3 times faster, for S_st and S_r50 respectively. Furthermore, the S_st model consistently outperforms S_r50 and so in most cases it would be the preferred distilled model. However, for SB20 the relative performance degradation is smallest, with S_r50 dropping by 1.5 points in AP, which is a 3.9% relative reduction in performance. Therefore, for this case S_r50 might be considered preferable if a lower inference time is imperative, as it is 14 milliseconds, or 14%, faster than S_st; SB20 has a smaller image size, so the relative speed difference is lower than on the other, higher-resolution datasets.
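For reference, a minimal sketch of how the per-image inference time could be measured on a GPU is shown below; it is not our benchmarking code and simply uses standard CUDA event timing around a generic model(images) forward call, whose exact input format depends on the framework.

```python
# Minimal sketch for measuring average per-image inference time on GPU
# (illustrative only). Uses CUDA events so asynchronous GPU work is timed correctly.
import torch

@torch.no_grad()
def mean_inference_time_ms(model, images, warmup=10, repeats=100):
    model.eval()
    for _ in range(warmup):                       # warm-up to amortize lazy initialization
        model(images)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(repeats):
        model(images)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / repeats      # milliseconds per forward pass
```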

5. Summary

In this paper, we have proposed a bipartite matching approach to perform knowledge distillation on the output queries of Mask2Former, a transformer-based network. The bipartite matching allows us to associate the queries predicted by the student and the teacher. Using this association we then distill the corresponding query-based class probabilities and instance masks from the teacher to the student. We apply this to student models with vastly different backbones, consisting of either a transformer backbone (Swin-Tiny) or even a CNN backbone (ResNet-50). To the best of our knowledge, this is the first time that such an approach for knowledge distillation has been proposed.
We evaluate our knowledge distillation scheme on two challenging agricultural datasets as well as on Cityscapes, which consists of pedestrian and vehicle data. In all cases, applying our approach leads to improved performance for the distilled models, with an average absolute performance improvement in terms of AP of 1.8 points for S_r50 and 1.9 points for S_st across the three datasets. In particular, our approach leads to more precise detections, as demonstrated by the larger gains in AP75 than in AP50. Overall, we show that simple student networks trained with our instance knowledge distillation scheme can retain a high accuracy with faster inference than the teacher model. Future work should examine the potential impact of changing not just the backbone but also the pixel decoder to further improve the trade-off between accuracy and computational efficiency.

References

[1] M. Halstead, P. Zimmer, and C. McCool, "A cross-domain challenge with panoptic segmentation in agriculture," The International Journal of Robotics Research, p. 02783649241227448, 2024.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[3] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in European Conference on Computer Vision. Springer, 2020, pp. 213–229.
[4] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, "Masked-attention mask transformer for universal image segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1290–1299.
[5] R. Wang, T. Lei, R. Cui, B. Zhang, H. Meng, and A. K. Nandi, "Medical image segmentation using deep learning: A survey," IET Image Processing, vol. 16, no. 5, pp. 1243–1267, 2022.
[6] L. Guan and X. Yuan, "Instance segmentation model evaluation and rapid deployment for autonomous driving using domain differences," IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 4, pp. 4050–4059, 2023.
[7] I. Balazevic, D. Steiner, N. Parthasarathy, R. Arandjelović, and O. Henaff, "Towards in-context scene understanding," Advances in Neural Information Processing Systems, vol. 36, 2024.
[8] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[9] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, "Mask scoring R-CNN," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6409–6418.
[10] H. Zhang, Y. Tian, K. Wang, W. Zhang, and F.-Y. Wang, "Mask SSD: An effective single-stage approach to object instance segmentation," IEEE Transactions on Image Processing, vol. 29, pp. 2078–2093, 2019.
[11] D. Neven, B. D. Brabandere, M. Proesmans, and L. V. Gool, "Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8837–8845.
[12] N. Gao, Y. Shan, Y. Wang, X. Zhao, Y. Yu, M. Yang, and K. Huang, "SSAP: Single-shot instance segmentation with affinity pyramid," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 642–651.
[13] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time instance segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9157–9166.
[14] X. Chen, R. Girshick, K. He, and P. Dollár, "TensorMask: A foundation for dense object segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2061–2069.
[15] X. Wang, T. Kong, C. Shen, Y. Jiang, and L. Li, "SOLO: Segmenting objects by locations," in Computer Vision – ECCV 2020. Springer, 2020, pp. 649–665.
[16] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, "SOLOv2: Dynamic and fast instance segmentation," Advances in Neural Information Processing Systems, vol. 33, pp. 17721–17732, 2020.
[17] X. Dong, J. Bao, D. Chen, W. Zhang, N. Yu, L. Yuan, D. Chen, and B. Guo, "CSWin Transformer: A general vision transformer backbone with cross-shaped windows," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12124–12134.
[18] Y. Lee, J. Kim, J. Willette, and S. J. Hwang, "MPViT: Multi-path vision transformer for dense prediction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7287–7296.
[19] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin Transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.

[20] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka, L. Li, Z. Yuan, C. Wang et al., "Sparse R-CNN: End-to-end object detection with learnable proposals," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14454–14463.
[21] Y. Fang, S. Yang, X. Wang, Y. Li, C. Fang, Y. Shan, B. Feng, and W. Liu, "Instances as queries," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6910–6919.
[22] J. Hu, L. Cao, Y. Lu, S. Zhang, Y. Wang, K. Li, F. Huang, L. Shao, and R. Ji, "ISTR: End-to-end instance segmentation with transformers," arXiv preprint arXiv:2105.00637, 2021.
[23] B. Dong, F. Zeng, T. Wang, X. Zhang, and Y. Wei, "SOLQ: Segmenting objects by learning queries," Advances in Neural Information Processing Systems, vol. 34, pp. 21898–21909, 2021.
[24] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[25] J. Gou, B. Yu, S. J. Maybank, and D. Tao, "Knowledge distillation: A survey," International Journal of Computer Vision, vol. 129, no. 6, pp. 1789–1819, 2021.
[26] C. Shu, Y. Liu, J. Gao, Z. Yan, and C. Shen, "Channel-wise knowledge distillation for dense prediction," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5311–5320.
[27] H. Bai, H. Mao, and D. Nair, "Dynamically pruning SegFormer for efficient semantic segmentation," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 3298–3302.
[28] C. Yang, H. Zhou, Z. An, X. Jiang, Y. Xu, and Q. Zhang, "Cross-image relational knowledge distillation for semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12319–12328.
[29] Z. Yang, Z. Li, M. Shao, D. Shi, Z. Yuan, and C. Yuan, "Masked generative distillation," in European Conference on Computer Vision. Springer, 2022, pp. 53–69.
[30] S. An, Q. Liao, Z. Lu, and J.-H. Xue, "Efficient semantic segmentation via self-attention and self-distillation," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, pp. 15256–15266, 2022.
[31] S. Lin, H. Xie, B. Wang, K. Yu, X. Chang, X. Liang, and G. Wang, "Knowledge distillation via the target-aware transformer," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10915–10924.
[32] S. Ren, Z. Gao, T. Hua, Z. Xue, Y. Tian, S. He, and H. Zhao, "Co-advise: Cross inductive bias distillation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16773–16782.
[33] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in International Conference on Machine Learning. PMLR, 2021, pp. 10347–10357.
[34] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[35] Y. Liu, J. Cao, B. Li, W. Hu, J. Ding, and L. Li, "Cross-architecture knowledge distillation," in Proceedings of the Asian Conference on Computer Vision, 2022, pp. 3396–3411.
[36] K. Wu, J. Zhang, H. Peng, M. Liu, B. Xiao, J. Fu, and L. Yuan, "TinyViT: Fast pretraining distillation for small vision transformers," in European Conference on Computer Vision. Springer, 2022, pp. 68–85.
[37] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[38] C. Smitt, M. Halstead, T. Zaenker, M. Bennewitz, and C. McCool, "PATHoBot: A robot for glasshouse crop phenotyping and intervention," in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 2324–2330.
[39] A. Ahmadi, M. Halstead, and C. McCool, "BonnBot-I: A precise weed management and crop monitoring platform," in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 9202–9209.
[40] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[41] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, "Detectron2," https://github.com/facebookresearch/detectron2, 2019.
[42] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Computer Vision – ECCV 2014. Springer, 2014, pp. 740–755.
[43] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in International Conference on Learning Representations, 2018.
