Self-Supervised Learning with Local Contrastive Loss for Detection and Semantic Segmentation

Abstract
…tion achieves superior performance. Specifically, our key technical contributions are: (1) a simple framework that computes a local contrastive loss (LC-loss) to make the corresponding pixels of two augmented versions of the same image similar, and that can be added to any self-supervised learning method, such as BYOL; and (2) state-of-the-art transfer learning results on several dense labeling tasks. Using ResNet-50 backbones pretrained on ImageNet, our BYOL variant achieves 40.6 AP for COCO object detection (+1.9 vs. SSL SOTA¹), 60.1 AP for VOC object detection (+1.4 vs. SOTA), 72.1 mIoU for VOC segmentation (+1 vs. SOTA), and 77.8 mIoU for CityScapes segmentation (+0.6 vs. SOTA) in the full-network fine-tuning setting. Our performance improvement is even more significant in the frozen backbone setting discussed in Section 4.4.

¹ State-Of-The-Art

2. Related Work

Self-supervised learning (SSL). In self-supervised learning, the supervisory signal is automatically generated from a pair of input images and a pretext task. The input pair is generated by applying two distinct transformations. The pretext task is a comparison between the learned representations of each pair of input images. Various pretext tasks have been explored, such as patch position [12], image colorization [45], image inpainting [31], rotation [15], and predictive coding [24]. The pretext task that has shown the most promise is the instance discrimination task, in which each image is considered as a single class. SimCLR [4], the first to propose this pretext task, adopts contrastive learning in which features from augmented versions of the same image are made closer in the feature space than all the other images in a mini-batch. SimCLR requires a large mini-batch to make contrastive learning feasible. MoCo [18] solves the issue of large batch size using a momentum queue and a moving-average encoder. Despite impressive results on image classification tasks, contrastive learning requires careful handling of negative pairs. Recent approaches like BYOL [17], SwAV [1], and DINO [2] do not require any negative pairs or a memory bank. They also achieve impressive performance on the ImageNet-1k linear evaluation task and downstream image classification-based transfer learning tasks.

SSL for Detection and Segmentation. For dense prediction tasks, SSL methods use an ImageNet-pretrained backbone within a larger architecture designed for a detection or segmentation task [29, 33, 40], and fine-tune the network on the downstream task dataset. He et al. [19] reported that ImageNet-pretrained models might be less helpful if the target task is more localization-sensitive than classification. One potential solution is to increase the target dataset size [19]; another is to impose local consistency during the ImageNet pretraining. We adopt the latter strategy.

Broadly, there are two approaches in the literature to ensuring local consistency during self-supervised pretraining: pixel-based and region-based. In region-based methods, region proposals are first generated, either during the self-supervised training [38, 34, 44] or before training starts [41], and then local consistency is applied between pooled features of the proposed regions. Our approach is pixel-based; local consistency is applied between the local features for corresponding pixels of transformed versions of the same image [37, 43, 42].

DenseCL [37] proposed dense contrastive learning for self-supervised visual pretraining. It follows the MoCo [18] framework to formulate the dense loss. However, DenseCL does not use known pixel correspondences to generate positive pairs of local features between two images. Instead, it extracts the correspondence across views. This creates a chicken-and-egg problem where DenseCL first requires learning a good feature representation to generate correct correspondences.

Our local contrastive loss is more similar to the PixContrast loss in the PixPro paper [43]. Given an image I, both methods operate on two distinct transforms J1 = T1(I) and J2 = T2(I) of I to produce two low-resolution, spatial feature maps. Both methods use a contrastive loss with pixel correspondences to generate positive and negative samples. The difference comes in how these samples are selected. In PixContrast, pixels in the low-resolution feature maps are warped back to the original image space using T1⁻¹ and T2⁻¹. Positive samples are then determined by all pairs of pixels that are sufficiently close after warping. Our method generates positive samples using the correspondences between J1 and J2 derived directly from T1 and T2. While similar to ours, the method proposed for PixContrast does not work; instead, an additional pixel-propagation module (PixPro) is introduced to measure the feature similarity between corresponding pixels. We show that no pixel propagation or feature warping is required in our simpler formulation.

In summary, our framework does not require: (1) an encoder-decoder architecture for the local correspondence loss [32]; (2) contrastive learning that needs carefully tuned negative pairs [32, 37, 41]; (3) a good local feature extractor to find local feature correspondences [37]; or (4) an additional propagation module to measure the local contrastive loss [43]. A simple local correspondence loss obtained from matching pixel pairs achieves state-of-the-art results in detection and segmentation tasks.

3. Methodology

Figure 2 depicts our LC-loss framework. We use BYOL for the global self-supervised loss function, and apply a local contrastive loss on dense feature representations obtained from the backbone networks.
We adopted the BYOL framework because it achieves higher performance than contrastive learning without using any negative pairs, and it is more resilient to changes in hyper-parameters like batch size and image transformations. In the following, we briefly describe the BYOL framework and introduce our approach.

Instance Discrimination from Global Features. BYOL consists of two neural networks: an online network with parameters θ and a target network with parameters ξ. The online network has three sub-networks: an encoder, a projector, and a predictor. The target network has the same architecture as the online network except the predictor. The online network is updated by gradient descent, and the parameters of the target network are exponential moving averages of the parameters of the online network. Given an input image I, two transformed views I_c and I_t of I are obtained. One view I_c is passed through the online network with parameters θ to obtain local features F_θ, the average-pooled encoder output f_θ, a projection g_θ, and a prediction q_θ. View I_t is passed through the target network with parameters ξ to obtain local features F_ξ, the average-pooled encoder output f_ξ, and a projection g_ξ (see Fig. 2). There is no predictor in the target network. This asymmetric design is adopted to prevent collapse during self-supervised training [17]. We follow the original BYOL network to set the dimensions of the encoder outputs (2048), projections (4096), and predictions (256). The global self-supervised loss function for a single input is defined as:

$$\mathcal{L}_G = 2 - 2\,\frac{q_\theta^{\top} g_\xi}{\lVert q_\theta \rVert_2 \, \lVert g_\xi \rVert_2} \tag{1}$$

where q_θ and g_ξ are learned global representations of the two transformed images that are forced to be similar under cosine similarity.

The view I_c used for the local contrastive loss is obtained without applying spatial transformations like image flipping or random crop; we apply a normal resize operation to resize the image to H × W shape. Another transformation T_sc is applied to I to obtain I_sc, where T_sc contains both spatial and color transformations. We obtain dense feature representations of I_sc from the backbone of the online network and denote them as F_θ ∈ ℝ^{h×w}, where h = H/p and w = W/p, with p the stride of the feature representation. We get a similar feature representation F_ξ ∈ ℝ^{h×w} by passing I_c through the target network.

Next, we select h × w image points on a 2D uniform grid in I_c. Each point p_c of I_c has the local feature representation F_ξ(p_c) obtained from the target network. For the point p_c there is a corresponding point p_sc in I_sc. Note that, because of the random crop and flipping in T_sc, the corresponding location p_sc may not lie at an integer pixel coordinate, so F_θ(p_sc) is not directly available on the feature grid. Instead of adopting an expensive high-dimensional warping of F_θ to obtain corresponding features, we compute the negative log-likelihood from the correspondence map and resample the negative log-likelihood using bilinear interpolation to deal with 2D points with subpixel coordinates, as described below, following [14].

We compute a dense correspondence map C' ∈ ℝ^{(h×w)×(h×w)} between I_c and I_sc as

$$C'(p_c, p_{sc}) = \frac{F_\xi(p_c)^{\top} F_\theta(p_{sc})}{\lVert F_\xi(p_c) \rVert_2 \, \lVert F_\theta(p_{sc}) \rVert_2} \tag{2}$$

where p_c and p_sc are image points from I_c and I_sc, respectively. Then, we calculate the negative log-likelihood

$$\mathrm{NLL}(p_c, p_{sc}) = -\log \frac{\exp\!\big(C'(p_c, p_{sc})/\tau\big)}{\sum_{k \in \Omega_{sc}} \exp\!\big(C'(p_c, p_k)/\tau\big)} \tag{3}$$
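For concreteness, the dense correspondence map (Eqn. 2), the per-pixel negative log-likelihood (Eqn. 3), and the bilinear resampling step can be expressed in a few lines of PyTorch. The sketch below is illustrative rather than the authors' released code: the function name, the temperature value, the normalized (x, y) coordinate convention for `coords_sc`, and the final averaging and α-weighted combination with the global loss (the paper's Eqns. 4–5 are not part of this excerpt) are all assumptions.

```python
import torch
import torch.nn.functional as F

def local_contrastive_loss(feat_c, feat_sc, coords_sc, tau=0.3):
    """Sketch of the LC-loss (Eqns. 2-3), assuming:
    feat_c:    (B, C, h, w) target-network local features of I_c
    feat_sc:   (B, C, h, w) online-network local features of I_sc
    coords_sc: (B, h, w, 2) sub-pixel (x, y) locations in I_sc corresponding to
               the uniform grid of points in I_c, in normalized [-1, 1]
               coordinates derived from the known spatial transform T_sc.
    tau: temperature (the value here is a placeholder, not from the paper).
    """
    B, C, h, w = feat_c.shape
    # L2-normalize so that dot products are cosine similarities.
    fc = F.normalize(feat_c.flatten(2), dim=1)    # (B, C, h*w), from I_c
    fsc = F.normalize(feat_sc.flatten(2), dim=1)  # (B, C, h*w), from I_sc

    # Dense correspondence map C'(p_c, p_k) over all pairs of grid points (Eqn. 2).
    corr = torch.einsum('bci,bcj->bij', fc, fsc) / tau   # (B, h*w, h*w)

    # Per-pair negative log-likelihood (Eqn. 3): softmax over all points of I_sc.
    nll = -F.log_softmax(corr, dim=2)                    # (B, h*w, h*w)
    nll = nll.view(B * h * w, 1, h, w)                   # an NLL map over I_sc per query point

    # Resample the NLL map at the sub-pixel correspondence of each grid point
    # (bilinear interpolation) instead of warping the high-dimensional features.
    grid = coords_sc.view(B * h * w, 1, 1, 2)
    matched_nll = F.grid_sample(nll, grid, align_corners=False)  # (B*h*w, 1, 1, 1)

    return matched_nll.mean()

# Assumed overall objective: global BYOL loss plus the weighted local term,
# with alpha = 0.1 as reported in Sec. 4.1:
# loss = global_loss + 0.1 * local_contrastive_loss(feat_c, feat_sc, coords_sc)
```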
Figure 2: Proposed Framework. We use BYOL as the self-supervised learning framework. BYOL consists of an online network with parameters θ and a target network with parameters ξ, which are the exponential moving average of θ. Given an image, we create two transformed versions. We apply a mean squared error loss between the L2-normalized global feature representations from the online and target networks. We also calculate a local contrastive loss from the dense feature representations of the image pairs.
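For completeness, a minimal sketch of the global branch summarized in the caption: the BYOL-style loss of Eqn. 1 between L2-normalized online predictions and target projections, and the exponential-moving-average update of the target parameters ξ. The helper names are illustrative; only the loss form and the 0.996 starting momentum come from the text.

```python
import torch
import torch.nn.functional as F

def byol_global_loss(q_online, g_target):
    """Eqn. 1: L_G = 2 - 2 * <q, g> / (||q|| * ||g||), i.e. an MSE between
    L2-normalized online predictions and target projections."""
    q = F.normalize(q_online, dim=-1)
    g = F.normalize(g_target, dim=-1)
    return (2 - 2 * (q * g).sum(dim=-1)).mean()

@torch.no_grad()
def ema_update(online_net, target_net, momentum=0.996):
    """Target parameters xi are an exponential moving average of the online
    parameters theta (the momentum is annealed from 0.996 to 1 during training)."""
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.data.mul_(momentum).add_(p_o.data, alpha=1 - momentum)
```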
4. Experiments

4.1. Implementation Details

Pretraining Setup. We use ResNet-50 [21] as the backbone network and BYOL [17] as the self-supervision architecture. We use identical architectures for the projection and prediction networks as in BYOL. For extracting local features, we add a local projection branch on the online network and another branch of similar architecture on the target network. As with the global branches, only the online local projection branch is updated through optimization, while the target local projection branch is the exponential moving average of the online one. The local projection branch consists of two convolution layers. The first layer is a 1×1 convolution with input dimension 2048 and output dimension 2048, followed by a BatchNorm layer. The second layer is a 1×1 convolution with output dimension 256. The input to the local projection branch is the local feature representation from the final stage of ResNet-50 (before the global average pooling layer). For image transformations during pretraining, following BYOL [17], we use random resized crop (resized to 224×224), random horizontal flip, color distortion, blurring, and solarization. We do not apply random crop to the image that is used to obtain the local contrastive loss.
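A minimal sketch of the local projection branch described above, assuming a plain Conv–BN–(ReLU)–Conv stack applied to the final-stage ResNet-50 feature map; the ReLU between the two 1×1 convolutions is an assumption, since the text only specifies the convolutions and the BatchNorm layer.

```python
import torch.nn as nn

# Sketch of the local projection head; applied to the (B, 2048, h, w) feature map
# taken before global average pooling. The ReLU is an assumption (see text).
local_projector = nn.Sequential(
    nn.Conv2d(2048, 2048, kernel_size=1),
    nn.BatchNorm2d(2048),
    nn.ReLU(inplace=True),          # assumed; the paper only states Conv-BN-Conv
    nn.Conv2d(2048, 256, kernel_size=1),
)
```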
Dataset. We use the ImageNet [11] dataset for pretraining the networks. ImageNet contains ~1.28M training images, mostly with a single foreground object.

Optimization. The default model is trained for 400 epochs unless specified otherwise in the results. See Sec. 4.5 for details on the effect of the number of pretraining epochs on transfer performance. The LARS optimizer is used with a base learning rate of 0.3 for batch size 256, momentum 0.9, weight decay 1e-6, and a cosine learning-rate decay schedule with learning-rate warm-up for 10 epochs. We use 16 GPUs with a batch size of 256 on each GPU; hence, the effective batch size is 4096. We linearly scale the learning rate with the effective batch size. The weight parameter α is set to 0.1 (Eqn. 5). For the momentum encoder, the momentum value starts from 0.996 and ends at 1. We use 16-bit mixed precision during pre-training.
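Two of the scheduling details above can be made concrete with a short sketch: the linear scaling of the base learning rate with the effective batch size, and the momentum of the EMA target network annealed from 0.996 towards 1. The cosine form of the momentum schedule is assumed from BYOL [17]; this excerpt only gives the start and end values.

```python
import math

def scaled_lr(base_lr=0.3, base_batch=256, effective_batch=4096):
    """Linear learning-rate scaling: lr = base_lr * effective_batch / base_batch."""
    return base_lr * effective_batch / base_batch   # 0.3 * 4096 / 256 = 4.8

def target_momentum(step, total_steps, base_momentum=0.996):
    """Momentum for the EMA target network, annealed from 0.996 towards 1.
    The cosine form follows BYOL and is an assumption here."""
    return 1 - (1 - base_momentum) * (math.cos(math.pi * step / total_steps) + 1) / 2
```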
4.2. Results on Object Detection and Instance Segmentation

We use the Detectron2 framework [39] for evaluation of downstream object detection and segmentation results on the COCO and PASCAL VOC datasets.

COCO Object Detection. For object detection on COCO, we adopt RetinaNet [27] following [38, 34, 42]. We fine-tune all layers with Sync BatchNorm for 90k iterations on the COCO train2017 set and evaluate on the COCO val2017 set. Table 1 shows object detection results on COCO for our method and other approaches in the literature with full-network finetuning. Note that ReSim, SoCo, and SCRL use region proposal networks during pretraining on ImageNet; hence, these approaches are not exactly comparable to ours. Our model is more similar to methods like DetCo, DenseCL, and PixPro. We achieve 40.6 AP for object detection, outperforming the second-best method, PixPro [43], by a significant 1.9%. We also report results on COCO detection using Mask R-CNN + FPN. We again outperform PixPro (our mAP is higher by 1.4) when using Mask R-CNN + FPN as the detector.

COCO Instance Segmentation. We use the Mask R-CNN framework [20] with a ResNet50-FPN backbone. We follow the 1× schedule. Table 1 shows that we achieve 38.3% AP for COCO instance segmentation, which is comparable with SoCo [38]. Note that SoCo performs selective search on the input image to find object proposals, and uses additional feature pyramid networks during pre-training.
| Method | Pretrain Epochs | OD (RetinaNet+FPN) AP^b | AP^b_50 | AP^b_75 | OD (Mask-RCNN+FPN) AP^b | AP^b_50 | AP^b_75 | IS (Mask-RCNN+FPN) AP^mk | AP^mk_50 | AP^mk_75 |
|---|---|---|---|---|---|---|---|---|---|---|
| Supervised [21] | 90 | 37.7 | 57.2 | 40.4 | 38.9 | 59.6 | 42.7 | 35.4 | 56.5 | 38.1 |
| MoCo v2 [6] | 200 | 37.3 | 56.2 | 40.4 | 40.4 | 60.2 | 44.2 | 36.4 | 57.2 | 38.9 |
| BYOL [17] | 300 | 35.4 | 54.7 | 37.4 | 40.4 | 61.6 | 44.1 | 37.2 | 58.8 | 39.8 |
| DetCo [42] | 800 | 38.4 | 57.8 | 41.2 | 40.1 | 61.0 | 43.9 | 36.4 | 58.0 | 38.9 |
| ReSim-FPN [41] | 200 | 38.6 | 57.6 | 41.6 | 39.8 | 60.2 | 43.5 | 36.0 | 57.1 | 38.6 |
| SCRL [34] | 800 | 39.0 | 58.7 | 41.9 | - | - | - | 37.7 | 59.6 | 40.7 |
| SoCo [38] | 400 | 38.3 | 57.2 | 41.2 | 43.0 | 63.3 | 47.1 | 38.2 | 60.2 | 41.0 |
| DenseCL [37] | 200 | 37.6 | 56.6 | 40.2 | 40.3 | 59.9 | 44.3 | 36.4 | 57.0 | 39.2 |
| PixPro [43] | 400 | 38.7 | 57.5 | 41.6 | 41.4 | 61.6 | 45.4 | - | - | - |
| OURS | 400 | 40.6 | 60.4 | 43.6 | 42.5 | 62.9 | 46.7 | 38.3 | 60.0 | 41.1 |

Table 1: Main Results on COCO. We use RetinaNet with FPN and Mask R-CNN with FPN for COCO object detection (OD), and Mask R-CNN with FPN for COCO instance segmentation (IS).
PASCAL VOC Object Detection [13]. We use the Faster R-CNN [33] object detector with a ResNet50-FPN backbone following [34]. For training, we use images from both trainval07+12 sets, and we evaluate only on the VOC07 test set. We use the pre-trained checkpoints released by the authors for the backbone network, and fine-tune the full networks on the VOC dataset. Table 2 shows that we achieve 60.1 AP for VOC detection. Our method improves mean AP by a significant 3.2% over baseline BYOL, and outperforms the current SOTA PixPro by 1.4% AP. The improvement is even more significant in AP75, where we outperform BYOL by 3.6% and PixPro by 1.9%.

| Method | Pretrain Epochs | AP^b (FRCNN+FPN) | AP^b_50 | AP^b_75 |
|---|---|---|---|---|
| Supervised [21] | 90 | 53.2 | 81.7 | 58.2 |
| BYOL [17] | 300 | 55.0 | 83.1 | 61.1 |
| SCRL [34] | 800 | 57.2 | 83.8 | 63.9 |
| DenseCL† [37] | 200 | 56.6 | 81.8 | 62.9 |
| PixPro† [43] | 400 | 58.7 | 82.9 | 65.9 |
| OURS | 400 | 60.1 | 84.2 | 67.8 |

Table 2: Main Results. We use Faster R-CNN with FPN for VOC object detection. Supervised and BYOL results are from [34]. (†): We use the pre-trained checkpoint released by the authors and fine-tune on the VOC dataset.

4.3. Results on Semantic Segmentation

We show semantic segmentation evaluation in Table 3 on the PASCAL VOC and CityScapes [10] datasets for both the fine-tuning and frozen backbone settings. We use an FCN backbone [29] following the settings in mmsegmentation [9].

PASCAL VOC Segmentation. We train on the VOC train-aug2012 set for 20k iterations and evaluate on the val2012 set. Table 3 shows that on the VOC2007 test set, our method yields 72.1% mIoU, outperforming BYOL by 7.7% mIoU and PixPro by 1% mIoU.

Cityscapes Segmentation. CityScapes [10] contains images from urban street scenes. Table 3 shows that, in the fine-tuning setting, our approach yields 77.8% mIoU, which is a 6.2% mIoU improvement over BYOL and a 0.6% improvement over PixPro.

| Method | Pretrain Epochs | VOC mIoU | CityScapes mIoU |
|---|---|---|---|
| Scratch | - | 40.7 | 63.5 |
| Supervised | 90 | 67.7 | 74.6 |
| MoCo v2 | 200 | 67.5 | 74.5 |
| BYOL | 300 | 63.3 | 71.6 |
| DenseCL | 200 | 69.4 | 69.4 |
| PixPro† | 400 | 71.1 | 77.2 |
| OURS | 400 | 72.1 | 77.8 |

Table 3: Evaluation on semantic segmentation using an FCN ResNet-50 network on the PASCAL VOC and CityScapes datasets. (†): We use the pretrained checkpoint released by the authors and fine-tune the full networks on the VOC dataset. All other scores are obtained from the respective papers.

4.4. Analysis

Frozen Backbone Analysis. We also report detection and segmentation results for a frozen backbone following [16, 23, 41]. Training a linear classifier on a frozen backbone is a standard approach to evaluate self-supervised representation quality for image classification [1, 5, 17, 18].
We adopt the standard strategy in the 'frozen backbone' setting, where we freeze the pre-trained ResNet-50 backbone and only fine-tune the remaining layers. The frozen backbone might be an ideal evaluation strategy because fine-tuning the full network evaluates the quality of the representations along with initialization and optimization, whereas the frozen backbone evaluates mostly the representation quality of the backbone [16, 41].

For the frozen backbone (Table 4), we achieve 30.5% AP on COCO object detection, outperforming PixPro by 2.8% AP, and 55.1% AP for VOC detection, outperforming PixPro by 1.6% and BYOL by 2.7%. We achieve 63.4% mIoU on VOC semantic segmentation, which is more than the score achieved by BYOL in the finetuning setting (63.3% mIoU). We also outperform PixPro by a significant 2.9% mIoU. On CityScapes semantic segmentation, we achieve 60.7% mIoU, which improves upon BYOL by 5.1% and PixPro by 2.5%.
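In practice, the frozen backbone protocol amounts to disabling gradient updates (and BatchNorm statistics updates) for the pre-trained ResNet-50 and optimizing only the task-specific layers. A framework-agnostic PyTorch sketch, not the actual Detectron2/mmsegmentation configuration used for Table 4; the checkpoint path is a placeholder:

```python
import torchvision

# Load a ResNet-50 and (hypothetically) restore self-supervised pre-trained weights.
backbone = torchvision.models.resnet50()
# backbone.load_state_dict(torch.load("pretrained_lc_loss_resnet50.pth"))  # placeholder path

# Freeze all backbone parameters and keep BatchNorm statistics fixed.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

# Only the remaining layers (FPN, RPN, classifier/regression heads, FCN decoder, ...)
# are passed to the optimizer during downstream fine-tuning.
```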
Efficient Pre-training. In Table 5, we report results of VOC object detection with Faster R-CNN-FPN, COCO object detection with Mask R-CNN-FPN, and VOC and CityScapes segmentation with FCN, for BYOL and OURS pre-trained for different numbers of epochs. The results reveal that our model pre-trained for 200 epochs with a training image size of 160 can achieve better results than BYOL pre-trained for 1000 epochs, saving 5.3× computational resources. Even our 100-epoch pre-trained model is comparable with the 1000-epoch pre-trained BYOL model. This validates the efficacy of our local loss during self-supervised pre-training.

Importance of Local Contrast. In Table 6, we show the relative performance of our local contrastive loss against a non-contrastive BYOL-type loss. In 'BYOL + Local MSE loss', we apply the same L2-normalized MSE local loss as the global loss in BYOL. Models are trained for 200 epochs on the ImageNet dataset. We report the average AP scores for VOC detection, COCO detection with Mask R-CNN, and CityScapes segmentation, which show that our approach of calculating local consistency using a contrastive loss works better than a non-contrastive BYOL-type local loss.
Few-shot Image Classification. Since the global and local losses appear to be complementary to each other, we ascertain whether our method hurts image classification performance for transfer learning. We use our pre-trained models as fixed feature extractors and perform 5-way 5-shot few-shot learning on 7 datasets from diverse domains using a logistic regression classifier. Table 7 reports the 5-shot top-1 accuracy for the 7 diverse datasets. Table 7 reveals that OURS shows the best performance on average among the self-supervised models that use local consistency. OURS outperforms PixPro by 2.4% top-1 accuracy on average; the minor fluctuation can be attributed to random noise.
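The few-shot protocol can be outlined as follows: features from the frozen backbone are fed to a logistic-regression classifier for each sampled episode. The sketch below uses scikit-learn; the episode sampling, feature extraction, and regularization setting are assumptions rather than the exact evaluation code.

```python
from sklearn.linear_model import LogisticRegression

def run_episode(support_feats, support_labels, query_feats, query_labels):
    """One 5-way 5-shot episode: fit a logistic-regression classifier on the
    25 support features from the frozen backbone, evaluate on the queries.
    The regularization strength C=1.0 is an assumption."""
    clf = LogisticRegression(max_iter=1000, C=1.0)
    clf.fit(support_feats, support_labels)              # (25, D), (25,)
    return (clf.predict(query_feats) == query_labels).mean()

# The reported numbers average this top-1 accuracy over 600 sampled episodes.
```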
Transfer to Other Vision Tasks. Even though we mainly evaluated on detection and segmentation, we also show results for keypoint estimation, a task that might benefit from models trained with local consistency. We use Mask R-CNN (keypoint version) with a ResNet50-FPN network to evaluate keypoint estimation. We fine-tune on COCO train2017 for 90k iterations. Table 8 shows that our method outperforms all other approaches on the keypoint estimation task.

Detection on Mini COCO. As the full COCO dataset contains extensive annotated images for supervision, it might not always reveal the generalization ability of the network [19]. We also report results for object detection on smaller versions of the COCO training set in Table 9. We report results when only 5% and 10% of the images (randomly sampled) are used for fine-tuning the Mask R-CNN with FPN network with the 1× schedule. The evaluation is performed on the full val2017 set. For the 5% setting, our method outperforms BYOL by 1.4% AP for the ImageNet-pretrained models. For the 10% setting, our method achieves an improvement over BYOL of 1.8% AP.

Generalization to other SSL methods. In the Appendix, we show results of our approach applied to other SSL approaches (e.g., DINO), where we also show consistent improvement over the baseline methods.

4.5. Ablation Studies

Effect of Pretraining Epochs. Figure 3a reports object detection performance on PASCAL VOC with Faster R-CNN-FPN and on MS-COCO with Mask R-CNN-FPN for different numbers of pre-training epochs. The models are pre-trained on the ImageNet training set. Longer training generally results in better downstream object detection performance. For example, for the 100-epoch pre-trained model, the AP is 39.8%, whereas for the 600-epoch pre-trained model it improves to 42.8% on the COCO evaluation. A similar upswing is observed for PASCAL VOC object detection.

Ablation on Loss Weight α. Figure 3b reports the AP for object detection on PASCAL VOC for different values of the weight parameter α. The models are pre-trained on the ImageNet dataset for 200 epochs with a training image size of 160 for faster training. α balances the weight between the global and local loss functions. For α = 0.05, the mean AP is 58.4%. We get slightly better performance with α = 0.1 (58.9%) and α = 0.3 (59.0%). The performance degrades a little when α is increased to 0.7 (57.5%). The results reveal that the best performance is achieved when we use both the global and local loss functions, and a proper balance between them ensures better downstream performance.

4.6. Qualitative Analysis

Correspondence Visualization. In Figure 4, we show visual examples of correspondences from our model. For two transformed images I1 and I2, we extract feature representations F1 and F2. For each feature in F1, the corresponding feature in F2 is calculated based on maximum cosine similarity.
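The correspondence computation described above reduces to an argmax over cosine similarities. A minimal PyTorch sketch for a single image pair, with the feature maps assumed to be already extracted:

```python
import torch
import torch.nn.functional as F

def nearest_correspondences(f1, f2):
    """f1, f2: (C, h, w) feature maps of two transformed views.
    Returns, for every location in f1, the (row, col) of its best match in f2
    under cosine similarity."""
    C, h, w = f1.shape
    a = F.normalize(f1.reshape(C, -1), dim=0)    # (C, h*w)
    b = F.normalize(f2.reshape(C, -1), dim=0)    # (C, h*w)
    sim = a.t() @ b                              # (h*w, h*w) cosine similarities
    idx = sim.argmax(dim=1)                      # best match in f2 for each f1 location
    rows = torch.div(idx, w, rounding_mode='floor')
    cols = idx % w
    return torch.stack((rows, cols), dim=1)      # (h*w, 2) integer coordinates
```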
| Method | PASCAL VOC OD AP^b | AP^b_50 | AP^b_75 | COCO OD AP^b | AP^b_50 | AP^b_75 | VOC SS mIoU | Cityscapes SS mIoU |
|---|---|---|---|---|---|---|---|---|
| Supervised | 50.7 | 80.4 | 55.1 | 30.3 | 50.0 | 31.3 | 56.6 | 55.7 |
| BYOL | 52.4 | 81.1 | 57.5 | 30.2 | 49.1 | 31.5 | 55.7 | 55.6 |
| DenseCL | 50.9 | 79.9 | 55.0 | 25.5 | 43.6 | 25.8 | 63.0 | 58.5 |
| PixPro | 53.5 | 80.4 | 59.7 | 27.7 | 44.6 | 29.1 | 60.3 | 58.2 |
| OURS | 55.1 | 82.6 | 61.7 | 30.5 | 49.8 | 31.7 | 63.4 | 60.7 |

Table 4: Frozen backbone evaluation. We freeze the ResNet-50 backbone and finetune the other layers (RPN, FPN, classifier networks, regression layers, etc.). We use Faster R-CNN with FPN for PASCAL VOC object detection (OD), RetinaNet for COCO detection (OD), and an FCN network for VOC and CityScapes segmentation (SS). For this experiment, we use publicly available checkpoints for the backbone networks and evaluate on the downstream tasks.
| Method | Pretrain Epochs | Pretrain Im-size | Pretrain time | VOC AP^b | COCO AP^b | VOC mIoU | Cityscapes mIoU |
|---|---|---|---|---|---|---|---|
| BYOL | 300 | 224 | ×1.6 | 56.9 | 40.4 | 63.3 | 71.6 |
| BYOL | 1000 | 224 | ×5.3 | 57.0 | 40.9 | 69.0 | 73.4 |
| OURS | 100 | 224 | ×0.8 | 58.2 | 40.9 | 68.4 | 76.5 |
| OURS | 200 | 160 | ×1 | 59.0 | 41.6 | 68.5 | 77.0 |
| OURS | 200 | 224 | ×1.6 | 59.6 | 42.0 | 70.9 | 77.4 |

Table 5: Efficient SSL training on ImageNet. Performance of object detection and segmentation for BYOL and OURS for different pre-training epochs and training image sizes. We achieve better performance than BYOL (1000 epochs of pretraining) with our model pre-trained for 200 epochs with training image size 160, which is 5.3× faster to pre-train.
| Method | EuroSAT [22] | CropDisease [30] | ChestX [36] | ISIC [8] | Sketch [35] | DTD [7] | Omniglot [26] | Avg |
|---|---|---|---|---|---|---|---|---|
| Supervised | 85.8 | 92.5 | 25.2 | 43.4 | 86.3 | 81.9 | 93.0 | 72.6 |
| SoCo | 78.3 | 84.1 | 25.1 | 41.2 | 81.5 | 73.9 | 92.2 | 68.0 |
| DenseCL | 77.7 | 81.0 | 23.8 | 36.8 | 76.5 | 78.3 | 77.4 | 64.5 |
| PixPro | 80.5 | 86.4 | 26.5 | 41.2 | 81.5 | 73.9 | 92.2 | 68.9 |
| OURS | 84.5 | 90.1 | 25.2 | 41.9 | 85.6 | 80.2 | 91.5 | 71.3 |

Table 7: Few-shot learning results on downstream datasets. The pre-trained models are used as fixed feature extractors. We report top-1 accuracy for 5-way 5-shot classification, averaged over 600 episodes. We use the publicly available pre-trained backbones as feature extractors for the few-shot evaluation.
References

[1] Mathilde Caron, Ishan Misra, Julien Mairal, et al. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020.
[2] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, et al. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.
[3] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: OpenMMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[5] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020.
[6] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[7] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014.
[8] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1902.03368, 2019.
[9] MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
[10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[12] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[13] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[14] Hugo Germain, Vincent Lepetit, and Guillaume Bourmaud. Visual correspondence hallucination: Towards geometric reasoning. arXiv preprint arXiv:2106.09711, 2021.
[15] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
[16] Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6391–6400, 2019.
[17] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
[18] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[19] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4918–4927, 2019.
[20] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[22] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
[23] Olivier J Hénaff, Skanda Koppula, Jean-Baptiste Alayrac, Aaron van den Oord, Oriol Vinyals, and João Carreira. Efficient visual pretraining with contrastive detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10086–10096, 2021.
[24] Olivier J Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.
[25] Ashraful Islam, Chun-Fu Richard Chen, Rameswar Panda, Leonid Karlinsky, Richard Radke, and Rogerio Feris. A broad study on the transferability of visual representations with contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8845–8855, 2021.
[26] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
[27] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[28] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
[29] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[30] Sharada P Mohanty, David P Hughes, and Marcel Salathé. Using deep learning for image-based plant disease detection. Frontiers in Plant Science, 7:1419, 2016.
[31] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.
[32] Pedro O Pinheiro, Amjad Almahairi, Ryan Y Benmalek, Florian Golemo, and Aaron Courville. Unsupervised learning of dense visual representations. arXiv preprint arXiv:2011.05499, 2020.
[33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2016.
[34] Byungseok Roh, Wuhyun Shin, Ildoo Kim, and Sungwoong Kim. Spatially consistent representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1144–1153, 2021.
[35] Haohan Wang, Songwei Ge, Eric P. Xing, and Zachary C. Lipton. Learning robust global representations by penalizing local predictive power. arXiv preprint arXiv:1905.13549, 2019.
[36] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2097–2106, 2017.
[37] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. Dense contrastive learning for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3024–3033, 2021.
[38] Fangyun Wei, Yue Gao, Zhirong Wu, Han Hu, and Stephen Lin. Aligning pretraining for detection via object-level contrastive learning. Advances in Neural Information Processing Systems, 34, 2021.
[39] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[40] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pages 418–434, 2018.
[41] Tete Xiao, Colorado J Reed, Xiaolong Wang, Kurt Keutzer, and Trevor Darrell. Region similarity representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10539–10548, 2021.
[42] Enze Xie, Jian Ding, Wenhai Wang, Xiaohang Zhan, Hang Xu, Peize Sun, Zhenguo Li, and Ping Luo. DetCo: Unsupervised contrastive learning for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8392–8401, 2021.
[43] Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16684–16693, 2021.
[44] Ceyuan Yang, Zhirong Wu, Bolei Zhou, and Stephen Lin. Instance localization for self-supervised detection pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3987–3996, 2021.
[45] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In ECCV, pages 649–666, 2016.