
Self-supervised Learning with Local Contrastive Loss for Detection and Semantic Segmentation

Ashraful Islam* (Nvidia)
Ben Lundell, Harpreet Sawhney, Sudipta N. Sinha (Microsoft Mixed Reality)
{Benjamin.Lundell,Harpreet.Sawhney,Sudipta.Sinha}@microsoft.com
Peter Morales
Richard J. Radke (Rensselaer Polytechnic Institute)

* This work was done while the author was an intern at Microsoft.
Abstract

We present a self-supervised learning (SSL) method suitable for semi-global tasks such as object detection and semantic segmentation. We enforce local consistency between self-learned features that represent corresponding image locations of transformed versions of the same image, by minimizing a pixel-level local contrastive (LC) loss during training. The LC-loss can be added to existing self-supervised learning methods with minimal overhead. We evaluate our SSL approach on two downstream tasks, object detection and semantic segmentation, using the COCO, PASCAL VOC, and CityScapes datasets. Our method outperforms the existing state-of-the-art SSL approaches by 1.9% on COCO object detection, 1.4% on PASCAL VOC detection, and 0.6% on CityScapes segmentation.

Figure 1: Our framework encourages local regions of two transformed images to learn similar features. The first image (left) uses color augmentation only, and the other image (right) employs both spatial (random resize crop) and color transformations. With known corresponding pixels, a consistency loss enforces maximal similarity between the corresponding learned features.
1. Introduction

Self-supervised learning (SSL) approaches learn generic feature representations from data in the absence of any external supervision. These approaches often solve an instance discrimination pretext task in which multiple transformations of the same image are required to generate similar learned features. Recent SSL methods have shown remarkable promise in global tasks such as classifying images by training simple classifiers on the features learned via instance discrimination [1, 2, 4, 17, 18]. However, global feature-learning SSL approaches do not explicitly retain spatial information, which renders them ill-suited for semi-global tasks such as object detection and instance and semantic segmentation [37, 43].

This work focuses on extending SSL to incorporate spatial locality by using a local contrastive (LC) loss function at a dense and fine-grained pixel level. The main idea is illustrated in Figure 1. Specifically, we encourage corresponding local pixels in the two transformed images to produce similar features. The true pixel correspondences are known since the input image pairs are generated by applying two distinct transformations to a single image. Note that this approach can be used alongside any conventional global SSL objective with minimal overhead.

We evaluate the impact of the LC-loss on several downstream tasks, namely object detection and instance and semantic segmentation, and report promising improvements over previous spatially-aware SSL methods [37, 38, 41, 43] on the PASCAL VOC, COCO, and Cityscapes datasets.

Contributions. Our main contribution is in demonstrating that adding a pixel-level contrastive loss to the BYOL [17] training procedure for the instance discrimination pretext task is sufficient to produce excellent results on many downstream dense prediction tasks. A similar pixel-level contrastive loss formulation was presented in PixPro [43], but a more complicated pixel-to-propagation consistency pretext task was required to achieve state-of-the-art results in dense prediction tasks. We show that no additional pretext task is necessary, and our simpler local contrastive loss formulation achieves superior performance.
Specifically, our key technical contributions are: (1) a simple framework that computes a local contrastive loss (LC-loss) to make the corresponding pixels of two augmented versions of the same image similar, and that can be added to any self-supervised learning method, such as BYOL; and (2) state-of-the-art transfer learning results on several dense labeling tasks. Using ResNet-50 backbones pretrained on ImageNet, our BYOL variant achieves 40.6 AP for COCO object detection (+1.9 vs. the SSL state of the art), 60.1 AP for VOC object detection (+1.4 vs. SOTA), 72.1 mIoU for VOC segmentation (+1 vs. SOTA), and 77.8 mIoU for CityScapes segmentation (+0.6 vs. SOTA) in the full-network fine-tuning setting. Our performance improvement is even more significant in the frozen backbone setting discussed in Section 4.4.

2. Related Work

Self-supervised learning (SSL). In self-supervised learning, the supervisory signal is automatically generated from a pair of input images and a pretext task. The input pair is generated by applying two distinct transformations, and the pretext task is a comparison between the learned representations of each pair of input images. Various pretext tasks have been explored, such as patch position [12], image colorization [45], image inpainting [31], rotation [15], and predictive coding [24]. The pretext task that has shown the most promise is the instance discrimination task, in which each image is considered as a single class. SimCLR [4], the first to propose this pretext task, adopts contrastive learning in which features from augmented versions of the same image are made closer in the feature space than all the other images in a mini-batch. SimCLR requires a large mini-batch to make contrastive learning feasible. MoCo [18] solves the issue of large batch size using a momentum queue and a moving average encoder. Despite impressive results in image classification tasks, contrastive learning requires careful handling of negative pairs. Recent approaches like BYOL [17], SwAV [1], and DINO [2] do not require any negative pairs or a memory bank. They also achieve impressive performance on the ImageNet-1k linear evaluation task and on downstream image classification-based transfer learning tasks.

SSL for Detection and Segmentation. For dense prediction tasks, SSL methods use an ImageNet-pretrained backbone within a larger architecture designed for a detection or segmentation task [29, 33, 40], and fine-tune the network on the downstream task dataset. He et al. [19] reported that ImageNet-pretrained models might be less helpful if the target task is more localization-sensitive than classification. One potential solution is to increase the target dataset size [19]; another is to impose local consistency during the ImageNet pretraining. We adopt the latter strategy.

Broadly, there are two approaches in the literature to ensuring local consistency during self-supervised pretraining: pixel-based and region-based. In region-based methods, region proposals are first generated, either during the self-supervised training [38, 34, 44] or before training starts [41], and local consistency is then applied between pooled features of the proposed regions. Our approach is pixel-based; local consistency is applied between the local features for corresponding pixels of transformed versions of the same image [37, 43, 42].

DenseCL [37] proposed dense contrastive learning for self-supervised visual pretraining. It follows the MoCo [18] framework to formulate the dense loss. However, DenseCL does not use known pixel correspondences to generate positive pairs of local features between two images. Instead, it extracts the correspondence across views. This creates a chicken-and-egg problem where DenseCL first requires learning a good feature representation to generate correct correspondences.

Our local contrastive loss is more similar to the PixContrast loss in the PixPro paper [43]. Given an image I, both methods operate on two distinct transforms J1 = T1(I) and J2 = T2(I) of I to produce two low-resolution spatial feature maps. Both methods use a contrastive loss based on pixel correspondences to generate positive and negative samples. The difference comes in how these samples are selected. In PixContrast, pixels in the low-resolution feature maps are warped back to the original image space using T1^{-1} and T2^{-1}, and positive samples are then determined by all pairs of pixels that are sufficiently close after warping. Our method generates positive samples using the correspondences in J1 and J2 derived directly from T1 and T2. While similar to ours, the method proposed for PixContrast does not work; instead, an additional pixel-propagation module is introduced (PixPro) to measure the feature similarity between corresponding pixels. We show that no pixel propagation or feature warping is required in our simpler formulation.

In summary, our framework does not require: (1) an encoder-decoder architecture for the local correspondence loss [32]; (2) contrastive learning that needs carefully tuned negative pairs [32, 37, 41]; (3) a good local feature extractor to find local feature correspondences [37]; or (4) an additional propagation module to measure the local contrastive loss [43]. A simple local correspondence loss obtained from matching pixel pairs achieves state-of-the-art results in detection and segmentation tasks.
3. Methodology

Figure 2 depicts our LC-loss framework. We use BYOL for the global self-supervised loss function, and apply a local contrastive loss on dense feature representations obtained from the backbone networks. We adopt the BYOL framework because it achieves higher performance than contrastive learning without using any negative pairs, and it is more resilient to changes in hyper-parameters like batch size and image transformations. In the following, we briefly describe the BYOL framework and introduce our approach.

Instance Discrimination from Global Features. BYOL consists of two neural networks: an online network with parameters θ and a target network with parameters ξ. The online network has three sub-networks: an encoder, a projector, and a predictor. The target network has the same architecture as the online network except for the predictor. The online network is updated by gradient descent, and the parameters of the target network are exponential moving averages of the parameters of the online network. Given an input image I, two transformed views I_c and I_t of I are obtained. One view I_c is passed through the online network with parameters θ to obtain local features F_θ, average-pooled encoder output f_θ, a projection g_θ, and a prediction q_θ. View I_t is passed through the target network with parameters ξ to obtain local features F_ξ, average-pooled encoder output f_ξ, and a projection g_ξ (see Fig. 2). There is no predictor in the target network; this asymmetric design is adopted to prevent collapse during self-supervised training [17]. We follow the original BYOL network to set the dimensions of the encoder outputs (2048), projections (4096), and predictions (256). The global self-supervised loss function for a single input is defined as:

    L_G = 2 - 2 \frac{q_\theta^\top g_\xi}{\|q_\theta\|_2 \, \|g_\xi\|_2}    (1)

where q_θ and g_ξ are learned global representations for the two transformed images that are forced to be similar under cosine similarity.
Local Contrastive Loss. Given two transformed versions of the same image, namely I_c and I_t, let an image point p_c in I_c correspond to another image point p_t in I_t. We can determine p_t for every p_c given the known image transformations. The correspondence map C_{p_c} ∈ R^{H×W} for a source point p_c is calculated, where C_{p_c}(p_l) denotes the similarity score between p_c of I_c and every pixel p_l of I_t; that is, C_{p_c}(p_l) is the score that p_l is the corresponding pixel of p_c. As we know that p_t is the actual corresponding pixel, we want C_{p_c}(p_t) to be maximized. The local loss for p_c is the negative log-likelihood at p_t, which encourages maximizing the likelihood estimate for the target location, log C_{p_c}(p_t).

We now describe how this is incorporated in our pipeline. We employ learned feature-level correspondence as a measure of pixel-level correspondence. Given an image I, we apply an image transformation T_c to I to obtain I_c. T_c contains strong color transformations (for example, Gaussian blur, solarization, color distortion), but does not include spatial transformations like image flipping or random crop; we apply a plain resize operation to resize the image to H × W. Another transformation T_sc is applied to I to obtain I_sc, where T_sc contains both spatial and color transformations. We obtain dense feature representations of I_sc from the backbone of the online network, and denote them as F_θ ∈ R^{h×w}, where h = H/p and w = W/p with p the stride of the feature representation. We get a similar feature representation F_ξ ∈ R^{h×w} by passing I_c through the target network.

Next we select h × w image points on a 2D uniform grid in I_c. Each point p_c of I_c has the local feature representation F_ξ(p_c) obtained from the target network. For the point p_c, there is a corresponding point p_sc in I_sc. Note that, because of the random crop and flipping in T_sc, the corresponding point p_sc may not lie at an integer pixel coordinate, so the feature F_θ(p_sc) is not directly available. Instead of adopting an expensive high-dimensional warping of F_θ to obtain corresponding features, we compute the negative log-likelihood from the correspondence map and resample the negative log-likelihood using bilinear interpolation to handle 2D points with subpixel coordinates, as described below [14].

We compute a dense correspondence map C' ∈ R^{(h×w)×(h×w)} between I_c and I_sc as

    C'(p_c, p_t) = \frac{F_\xi(p_c)^\top F_\theta(p_t)}{\|F_\xi(p_c)\|_2 \, \|F_\theta(p_t)\|_2}    (2)

where p_c and p_t are image points from I_c and I_sc, respectively. Then, we calculate the negative log-likelihood

    NLL(p_c, p_t) = -\log \frac{\exp(C'(p_c, p_t)/\tau)}{\sum_{p_k \in \Omega_{sc}} \exp(C'(p_c, p_k)/\tau)}    (3)

where Ω_sc is the set of locations in I_sc. If p_sc is not an integer location, we obtain the negative log-likelihood NLL(p_c, p_sc) by bilinearly interpolating NLL(p_c, ·). The contrastive loss, LC-loss, is defined as

    L_{LC} = \frac{1}{|P|} \sum_{(p_c, p_sc) \in P} NLL(p_c, p_sc)    (4)

where P contains all corresponding pairs {(p_c, p_sc)} for I_c and I_sc such that p_sc does not fall outside the boundary of I_sc, and |P| ≤ h × w. Our total loss is defined as

    L = (1 - α) L_G + α L_{LC}    (5)

where α is a multiplicative factor that balances the two loss components. See Section 4.5 for a study of the impact of α.
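The local loss of Eqs. (2)-(5) can be sketched as follows. This is our own illustrative implementation, not the authors' code: it assumes the two dense feature maps arrive as (B, D, h, w) tensors, that the sub-pixel correspondences p_sc are provided as normalized (x, y) coordinates so that grid_sample can resample the NLL map, that negatives are taken only from the same image, and that out-of-boundary pairs have already been masked out. The temperature value is also an assumption.

```python
import torch
import torch.nn.functional as F

def local_contrastive_loss(F_xi, F_theta, p_sc, valid, tau):
    """F_xi, F_theta: (B, D, h, w) dense features of I_c (target net) and I_sc (online net).
    p_sc: (B, h*w, 2) sub-pixel correspondence of every grid point of I_c in I_sc,
          given as normalized (x, y) coordinates in [-1, 1].
    valid: (B, h*w) boolean mask of pairs whose match falls inside I_sc.
    tau: softmax temperature (value not specified in this text; an assumption)."""
    B, D, h, w = F_xi.shape
    q = F.normalize(F_xi.reshape(B, D, -1), dim=1)      # (B, D, h*w), one column per point p_c
    k = F.normalize(F_theta.reshape(B, D, -1), dim=1)   # (B, D, h*w), one column per location in I_sc
    logits = torch.einsum('bdi,bdj->bij', q, k) / tau   # Eq. (2): cosine similarities, scaled by 1/tau
    # Eq. (3): negative log-likelihood over all locations of I_sc for each source point p_c.
    nll = -F.log_softmax(logits, dim=2)                 # (B, h*w, h*w)
    nll_maps = nll.reshape(B * h * w, 1, h, w)          # one spatial NLL map per source point
    grid = p_sc.reshape(B * h * w, 1, 1, 2)
    # Bilinearly resample each NLL map at the sub-pixel correspondence p_sc.
    nll_at_match = F.grid_sample(nll_maps, grid, align_corners=False).reshape(B, h * w)
    return (nll_at_match * valid).sum() / valid.sum().clamp(min=1)   # Eq. (4)

# Eq. (5): total loss combining the global BYOL loss with the LC-loss.
# loss = (1 - alpha) * global_byol_loss(q_theta, g_xi) + alpha * local_contrastive_loss(...)
```

The sketch mirrors the text's design choice: rather than warping high-dimensional features, the scalar NLL map is bilinearly resampled at the sub-pixel match locations.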
Figure 2: Proposed Framework. We use BYOL as the self-supervised learning framework. BYOL consists of an online network with parameters θ and a target network with parameters ξ, which are the exponential moving average of θ. Given an image, we create two transformed versions. We apply a mean squared error loss between the L2-normalized global feature representations from the online and target networks. We also calculate a local contrastive loss from the dense feature representations of the image pairs.

4. Experiments

4.1. Implementation Details

Pretraining Setup. We use ResNet-50 [21] as the backbone network and BYOL [17] as the self-supervision architecture. We use identical architectures for the projection and prediction networks as in BYOL. For extracting local features, we add a local projection branch to the online network and another branch of similar architecture to the target network. As with the global branches, only the online local projection branch is updated through optimization, while the target local projection branch is the exponential moving average of the online one. The local projection branch consists of two convolution layers. The first convolution layer consists of a 1×1 convolution kernel with input dimension 2048 and output dimension 2048, followed by a BatchNorm layer. The second convolution layer contains a 1×1 kernel with output dimension 256. The input to the local projection branch is the local feature representation from the final stage of ResNet-50 (before the global average pooling layer). For image transformations during pretraining, following BYOL [17], we use random resize crop (resized to 224×224), random horizontal flip, color distortion, blurring, and solarization. We do not apply random crop to the image that is used to obtain the local contrastive loss.
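Read literally, the local projection branch might be implemented as in the sketch below. The ReLU between the two layers is our assumption (the text only specifies the two 1×1 convolutions and the BatchNorm), as is the helper name.

```python
import torch.nn as nn

def make_local_projection_head(in_dim: int = 2048, hidden_dim: int = 2048, out_dim: int = 256) -> nn.Module:
    """Local projection branch applied to the final-stage ResNet-50 feature map (B, 2048, h, w)."""
    return nn.Sequential(
        nn.Conv2d(in_dim, hidden_dim, kernel_size=1),   # 1x1 conv, 2048 -> 2048
        nn.BatchNorm2d(hidden_dim),
        nn.ReLU(inplace=True),                          # assumption: a non-linearity, as in BYOL-style heads
        nn.Conv2d(hidden_dim, out_dim, kernel_size=1),  # 1x1 conv, 2048 -> 256
    )
```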
Dataset. We use the ImageNet [11] dataset for pretraining the networks. ImageNet contains ~1.28M training images, mostly with a single foreground object.

Optimization. The default model is trained for 400 epochs unless specified otherwise in the results; see Sec. 4.5 for details on the effect of the number of pretraining epochs on transfer performance. The LARS optimizer is used with a base learning rate of 0.3 for batch size 256, momentum 0.9, weight decay 1e-6, and a cosine learning rate decay schedule with learning rate warm-up for 10 epochs. We use 16 GPUs with a batch size of 256 on each GPU, so the effective batch size is 4096. We linearly scale the learning rate with the effective batch size. The weight parameter α is set to 0.1 (Eqn. 5). For the momentum encoder, the momentum value starts from 0.996 and ends at 1. We use 16-bit mixed precision during pre-training.
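For illustration, a sketch of the learning-rate recipe described above (linear scaling with the effective batch size, 10-epoch warm-up, cosine decay) is shown below. Plain SGD stands in for LARS, which is not part of torch.optim, and the exact warm-up and decay formulas are assumptions.

```python
import math
import torch

def scaled_lr(base_lr: float = 0.3, base_batch: int = 256, effective_batch: int = 4096) -> float:
    # Linear scaling rule: 0.3 * 4096 / 256 = 4.8
    return base_lr * effective_batch / base_batch

def lr_multiplier(epoch: int, total_epochs: int = 400, warmup_epochs: int = 10) -> float:
    # Linear warm-up for the first 10 epochs, then cosine decay to zero.
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Example wiring (SGD as a LARS stand-in):
# optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr(), momentum=0.9, weight_decay=1e-6)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_multiplier)
```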
4.2. Results on Object Detection and Instance Segmentation

We use the Detectron2 framework [39] to evaluate downstream object detection and segmentation results on the COCO and PASCAL VOC datasets.

COCO Object Detection. For object detection on COCO, we adopt RetinaNet [27] following [38, 34, 42]. We fine-tune all layers with Sync BatchNorm for 90k iterations on the COCO train2017 set and evaluate on the COCO val2017 set. Table 1 shows object detection results on COCO for our method and other approaches in the literature with full-network fine-tuning. Note that ReSim, SoCo, and SCRL use region proposal networks during pretraining on ImageNet, so these approaches are not exactly comparable to ours; our model is more similar to methods like DetCo, DenseCL, and PixPro. We achieve 40.6 AP for object detection, outperforming the second-best method, PixPro [43], by a significant 1.9%. We also report results on COCO detection using Mask R-CNN + FPN, where we again outperform PixPro (our mAP is higher by 1.4).

COCO Instance Segmentation. We use the Mask R-CNN framework [20] with a ResNet50-FPN backbone and follow the 1× schedule. Table 1 shows that we achieve 38.3% AP for COCO instance segmentation, which is comparable with SoCo [38]. Note that SoCo performs selective search on the input image to find object proposals, and uses additional feature pyramid networks during pre-training.
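As a rough guide to reproducing this kind of transfer evaluation, a minimal Detectron2 fine-tuning script might look like the sketch below. The config name, dataset identifiers, and the assumption that the self-supervised ResNet-50 checkpoint has already been converted to Detectron2's format are ours, not taken from the paper's code; options such as Sync BatchNorm are omitted.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

def finetune_retinanet_on_coco(pretrained_backbone_path: str):
    """Fine-tune a RetinaNet detector on COCO starting from a self-supervised ResNet-50 backbone."""
    cfg = get_cfg()
    # Standard RetinaNet R50-FPN recipe from the Detectron2 model zoo.
    cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_50_FPN_1x.yaml"))
    cfg.MODEL.WEIGHTS = pretrained_backbone_path   # SSL backbone converted to Detectron2 format
    cfg.DATASETS.TRAIN = ("coco_2017_train",)
    cfg.DATASETS.TEST = ("coco_2017_val",)
    cfg.SOLVER.MAX_ITER = 90000                    # 90k-iteration schedule used in the paper
    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()
```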
                              Object Detection         Object Detection         Instance Segmentation
                              RetinaNet + FPN          Mask-RCNN + FPN          Mask-RCNN + FPN
Method            Pretrain    APb    APb50  APb75      APb    APb50  APb75      APmk   APmk50  APmk75
                  Epochs
Supervised [21]   90          37.7   57.2   40.4       38.9   59.6   42.7       35.4   56.5    38.1
MoCo v2 [6]       200         37.3   56.2   40.4       40.4   60.2   44.2       36.4   57.2    38.9
BYOL [17]         300         35.4   54.7   37.4       40.4   61.6   44.1       37.2   58.8    39.8
DetCo [42]        800         38.4   57.8   41.2       40.1   61.0   43.9       36.4   58.0    38.9
ReSim-FPN [41]    200         38.6   57.6   41.6       39.8   60.2   43.5       36.0   57.1    38.6
SCRL [34]         800         39.0   58.7   41.9       -      -      -          37.7   59.6    40.7
SoCo [38]         400         38.3   57.2   41.2       43.0   63.3   47.1       38.2   60.2    41.0
DenseCL [37]      200         37.6   56.6   40.2       40.3   59.9   44.3       36.4   57.0    39.2
PixPro [43]       400         38.7   57.5   41.6       41.4   61.6   45.4       -      -       -
OURS              400         40.6   60.4   43.6       42.5   62.9   46.7       38.3   60.0    41.1

Table 1: Main Results on COCO. We use RetinaNet (and Mask R-CNN) with FPN for COCO object detection, Mask R-CNN with FPN for COCO instance segmentation, and Faster R-CNN with FPN for VOC object detection.

PASCAL VOC Object Detection [13]. We use the Faster R-CNN [33] object detector with a ResNet50-FPN backbone following [34]. For training, we use images from both trainval07+12 sets, and we evaluate only on the VOC07 test set. We use the pre-trained checkpoints released by the authors for the backbone network, and fine-tune the full networks on the VOC dataset. Table 2 shows that we achieve 60.1 AP for VOC detection. Our method improves mean AP by a significant 3.2% over the BYOL baseline, and outperforms the current SOTA, PixPro, by 1.4% AP. The improvement is even more significant in AP75, where we outperform BYOL by 3.6% and PixPro by 1.9%.

                    Pretrain    PASCAL VOC Object Detection (FRCNN + FPN)
Method              Epochs      APb    APb50   APb75
Supervised [21]     90          53.2   81.7    58.2
BYOL [17]           300         55.0   83.1    61.1
SCRL [34]           800         57.2   83.8    63.9
DenseCL† [37]       200         56.6   81.8    62.9
PixPro† [43]        400         58.7   82.9    65.9
OURS                400         60.1   84.2    67.8

Table 2: Main Results. We use Faster R-CNN with FPN for VOC object detection. Supervised and BYOL results are from [34]. (†): We use the pre-trained checkpoint released by the authors and fine-tune on the VOC dataset.

4.3. Results on Semantic Segmentation

We show the semantic segmentation evaluation in Table 3 on the PASCAL VOC and CityScapes [10] datasets for both the fine-tuning and frozen backbone settings. We use an FCN backbone [29] following the settings in mmsegmentation [9].

PASCAL VOC Segmentation. We train on the VOC train-aug2012 set for 20k iterations and evaluate on the val2012 set. Table 3 shows that our method yields 72.1% mIoU, outperforming BYOL by 7.7% and PixPro by 1% mIoU.

Cityscapes Segmentation. CityScapes [10] contains images from urban street scenes. Table 3 shows that in the fine-tuning setting our approach yields 77.8% mIoU, a 6.2% mIoU improvement over BYOL and a 0.6% improvement over PixPro.

                Pretrain    VOC      CityScapes
Method          Epochs      mIoU     mIoU
Scratch         -           40.7     63.5
Supervised      90          67.7     74.6
MoCo v2         200         67.5     74.5
BYOL            300         63.3     71.6
DenseCL         200         69.4     69.4
PixPro†         400         71.1     77.2
OURS            400         72.1     77.8

Table 3: Evaluation on Semantic Segmentation using an FCN ResNet-50 network on the PASCAL VOC and CityScapes datasets. (†): We use the pretrained checkpoint released by the authors and fine-tune the full networks on the VOC dataset. All other scores are obtained from the respective papers.

4.4. Analysis

Frozen Backbone Analysis. We also report detection and segmentation results for a frozen backbone, following [16, 23, 41]. Training a linear classifier on a frozen backbone is a standard approach to evaluate self-supervised representation quality for image classification [1, 5, 17, 18]. We adopt the standard strategy in the 'frozen backbone' setting, where we freeze the pre-trained ResNet50 backbone and only fine-tune the remaining layers. The frozen backbone might be an ideal evaluation strategy because fine-tuning the full network evaluates the quality of the representations along with initialization and optimization, whereas the frozen backbone setting mostly evaluates the representation quality of the backbone [16, 41].
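Concretely, the frozen-backbone protocol amounts to disabling gradients on the backbone before fine-tuning the remaining layers. A minimal sketch follows, with a hypothetical detector.backbone attribute, and with freezing the BatchNorm statistics as an additional assumption not stated in the text.

```python
import torch.nn as nn

def freeze_backbone(detector: nn.Module) -> None:
    """Frozen-backbone evaluation: keep the pre-trained ResNet-50 fixed and train only
    the remaining layers (RPN, FPN, classifier and regression heads, etc.)."""
    for p in detector.backbone.parameters():
        p.requires_grad = False
    # Assumption: also keep BatchNorm running statistics fixed in the frozen backbone.
    for m in detector.backbone.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()
```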
In the frozen backbone setting (Table 4), we achieve 30.5% AP on COCO object detection, outperforming PixPro by 2.8% AP, and 55.1% AP on VOC detection, outperforming PixPro by 1.6% and BYOL by 2.7%. We achieve 63.4% mIoU on VOC semantic segmentation, which is more than the score achieved by BYOL in the fine-tuning setting (63.3% mIoU); we also outperform PixPro by a significant 2.9% mIoU. On CityScapes semantic segmentation, we achieve 60.7% mIoU, which improves upon BYOL by 5.1% and PixPro by 2.5%.

Efficient Pre-training. In Table 5, we report results for VOC object detection with Faster R-CNN-FPN, COCO object detection with Mask R-CNN-FPN, and VOC and CityScapes segmentation with FCN, for BYOL and OURS pre-trained for different numbers of epochs. The results reveal that our model pre-trained for 200 epochs with training image size 160 achieves better results than BYOL pre-trained for 1000 epochs, saving 5.3× the computational resources. Even our 100-epoch pre-trained model is comparable with the 1000-epoch pre-trained BYOL model. This validates the efficacy of our local loss during self-supervised pre-training.

Importance of Local Contrast. In Table 6, we show the relative performance of our local contrastive loss against a non-contrastive BYOL-type loss. In 'BYOL + Local MSE loss', we apply the same L2-normalized MSE local loss as the global loss in BYOL. Models are trained for 200 epochs on the ImageNet dataset. We report the average AP scores for VOC detection, COCO detection with Mask R-CNN, and CityScapes segmentation, which show that our approach of calculating local consistency using a contrastive loss works better than the non-contrastive BYOL-type local loss.

Few-shot Image Classification. Since the global and local losses appear to be complementary to each other, we ascertain whether our method hurts image classification performance in transfer learning. We use our pre-trained models as fixed feature extractors, and perform 5-way 5-shot few-shot learning on 7 datasets from diverse domains using a logistic regression classifier. Table 7 reports the 5-shot top-1 accuracy for the 7 datasets and reveals that OURS shows the best performance on average among the self-supervised models that use local consistency. OURS outperforms PixPro by 2.4% top-1 accuracy on average; the minor fluctuations can be attributed to random noise.
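A sketch of this few-shot protocol, assuming features have already been extracted with the frozen backbone, is given below; sample_episode is a hypothetical helper that draws a 5-way 5-shot support/query split, and the classifier settings are our own choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def run_episode(support_x, support_y, query_x, query_y) -> float:
    """One 5-way 5-shot episode: fit a logistic regression classifier on the support
    features from the frozen backbone and report top-1 accuracy on the query features."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(support_x, support_y)        # support_x: (25, D) features, support_y: class ids
    return clf.score(query_x, query_y)   # top-1 accuracy on the query set

# accuracies = [run_episode(*sample_episode(dataset_features)) for _ in range(600)]
# print(np.mean(accuracies))   # averaged over 600 episodes, as reported in Table 7
```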
Transfer to Other Vision Tasks. Even though we mainly evaluated on detection and segmentation, we also show results for keypoint estimation, a task that might benefit from models trained with local consistency. We use Mask R-CNN (keypoint version) with a ResNet50-FPN network to evaluate keypoint estimation, fine-tuning on COCO train2017 for 90k iterations. Table 8 shows that our method outperforms all other approaches on the keypoint estimation task.

Detection on Mini COCO. As the full COCO dataset contains extensive annotated images for supervision, it might not always reveal the generalization ability of the network [19]. We therefore also report results for object detection on smaller versions of the COCO training set in Table 9, where only 5% or 10% of the images (randomly sampled) are used for fine-tuning the Mask R-CNN with FPN network with the 1× schedule. The evaluation is performed on the full val2017 set. For the 5% setting, our method outperforms BYOL by 1.4% AP for the ImageNet-pretrained models; for the 10% setting, it achieves a 1.8% AP improvement over BYOL.

Generalization to other SSL methods. In the Appendix, we show results of our approach applied to other SSL methods (e.g., DINO), where we also show consistent improvement over the baseline methods.

4.5. Ablation Studies

Effect of Pretraining Epochs. Figure 3a reports object detection performance on PASCAL VOC with Faster R-CNN-FPN and on MS-COCO with Mask R-CNN-FPN for different numbers of pre-training epochs. The models are pre-trained on the ImageNet training set. Longer training generally results in better downstream object detection performance. For example, for the 100-epoch pre-trained model, the AP is 39.8%, whereas for the 600-epoch pre-trained model it improves to 42.8% on the COCO evaluation. A similar upswing is observed for PASCAL VOC object detection.

Ablation on Loss Weight α. Figure 3b reports the AP for object detection on PASCAL VOC for different values of the weight parameter α. The models are pre-trained on the ImageNet dataset for 200 epochs with a training image size of 160 for faster training. α balances the weight between the global and local loss functions. For α = 0.05, the mean AP is 58.4%. We get slightly better performance with α = 0.1 (58.9%) and α = 0.3 (59.0%). The performance degrades a little when α is increased to 0.7 (57.5%). The results reveal that the best performance is achieved when we use both the global and local loss functions, and a proper balance between them ensures better downstream performance.

4.6. Qualitative Analysis

Correspondence Visualization. In Figure 4, we show visual examples of correspondences from our model. For two transformed images I1 and I2, we extract feature representations F1 and F2. For each feature in F1, the corresponding feature in F2 is calculated based on maximum cosine similarity between the feature representations. We show the matching at the original image resolution. Figure 4 shows that our method predicts accurate matches most of the time (considering the resolution error due to the 32 × 32 grid size in the pixel space for each feature point).
             PASCAL VOC OD             COCO OD                  VOC SS    Cityscapes SS
Method       APb    APb50  APb75       APb    APb50  APb75      mIoU      mIoU
Supervised   50.7   80.4   55.1        30.3   50.0   31.3       56.6      55.7
BYOL         52.4   81.1   57.5        30.2   49.1   31.5       55.7      55.6
DenseCL      50.9   79.9   55.0        25.5   43.6   25.8       63.0      58.5
PixPro       53.5   80.4   59.7        27.7   44.6   29.1       60.3      58.2
OURS         55.1   82.6   61.7        30.5   49.8   31.7       63.4      60.7

Table 4: Frozen backbone evaluation. We freeze the ResNet-50 backbone and fine-tune the other layers (RPN, FPN, classifier networks, regression layers, etc.). We use Faster R-CNN with FPN for PASCAL VOC object detection (OD), RetinaNet with FPN for COCO detection (OD), and an FCN network for VOC and CityScapes segmentation (SS). For this experiment, we use the publicly available checkpoints for the backbone networks and evaluate on the downstream tasks.
          Pretrain   Pretrain   Pretrain   VOC    COCO   VOC    Cityscapes
Method    Epochs     Im-size    time       APb    APb    mIoU   mIoU
BYOL      300        224        ×1.6       56.9   40.4   63.3   71.6
BYOL      1000       224        ×5.3       57.0   40.9   69.0   73.4
OURS      100        224        ×0.8       58.2   40.9   68.4   76.5
OURS      200        160        ×1         59.0   41.6   68.5   77.0
OURS      200        224        ×1.6       59.6   42.0   70.9   77.4

Table 5: Efficient SSL training on ImageNet. Performance on object detection and segmentation for BYOL and OURS for different pre-training epochs and training image sizes. We achieve better performance than BYOL (1000 epochs of pretraining) with our model pre-trained for 200 epochs with training image size 160, which is 5.3× faster to pre-train.

                         VOC    COCO   VOC    Cityscapes
Method                   APb    mAP    mIoU   mIoU
BYOL                     57.0   40.9   69.0   73.4
BYOL + Local MSE loss    58.7   42.0   70.5   76.7
OURS                     59.6   42.5   72.1   77.8

Table 6: Effectiveness of our LC-loss over an MSE loss.

Figure 3: Ablation studies on pretraining epochs and the relative weight of the local contrastive loss. (a) Effect of longer pre-training: the plots show average AP for object detection on the COCO validation set (left) and PASCAL VOC (right). (b) Effect of the relative weight of the LC-loss: the plot shows the average AP for object detection on VOC.

Figure 4: Correspondence Visualization. (Top) Ground-truth correspondences between two transformed versions of the same image. (Bottom) Correspondence predictions from OURS.
Method       EuroSAT [22]  CropDisease [30]  ChestX [36]  ISIC [8]  Sketch [35]  DTD [7]  Omniglot [26]  Avg
Supervised   85.8          92.5              25.2         43.4      86.3         81.9     93.0           72.6
SoCo         78.3          84.1              25.1         41.2      81.5         73.9     92.2           68.0
DenseCL      77.7          81.0              23.8         36.8      76.5         78.3     77.4           64.5
PixPro       80.5          86.4              26.5         41.2      81.5         73.9     92.2           68.9
OURS         84.5          90.1              25.2         41.9      85.6         80.2     91.5           71.3

Table 7: Few-shot learning results on downstream datasets. The pre-trained models are used as fixed feature extractors. We report top-1 accuracy for 5-way 5-shot classification averaged over 600 episodes. We use the publicly available pre-trained backbones as feature extractors for the few-shot evaluation.

Method       Pretrain Epochs   AP     AP50   AP75
Supervised   90                65.7   87.2   71.5
BYOL         300               66.3   87.4   72.4
VADeR [32]   200               66.1   87.3   72.1
SCRL [34]    1000              66.5   87.8   72.3
DenseCL†     200               66.2   87.3   71.9
PixPro†      400               66.6   87.8   72.8
OURS         400               67.2   87.4   73.7

Table 8: COCO keypoint estimation. Supervised and BYOL results are from [34]. (†): We use the publicly available ImageNet-pretrained checkpoints released by the authors and fine-tune on the COCO dataset.

                   5%                        10%
Method       AP     AP50   AP75        AP     AP50   AP75
Supervised   19.2   31.0   20.5        25.0   39.9   26.6
BYOL         21.9   36.2   23.2        27.1   43.4   29.3
PixPro       20.3   31.4   22.1        25.4   39.5   27.4
OURS         23.3   37.4   25.0        27.9   44.0   30.1

Table 9: Object detection on mini-COCO with the 1× schedule. All scores are obtained by fine-tuning the publicly available pre-trained backbones on the downstream dataset.

Figure 5: Correspondence maps for an image and its transformed version with only flipping and color transformations. (a) Results from BYOL without the local loss: there are many erroneous corresponding pixels, indicating that the global loss alone does not learn good features for local correspondence. (b) and (c): Results from PixPro and OURS, both of which correctly detect most of the correspondences.

More Analysis on Correspondence. To show that our method learns better correspondences across datasets, we perform a simple experiment. Given an image, we flip the image along the horizontal direction and apply color transformations (Gaussian blur, color jitter, and random grayscale operations). Since the only spatial transformation is a horizontal flip, the corresponding pixels are simply at the mirror locations of the original pixels, i.e., for a pixel location (x, y), the correct correspondence location in the transformed image is (w-x, y), where w is the width. Note that the correspondence is measured at feature locations, not actual pixel locations. We can thus measure the correspondence accuracy based on whether the matching is correct or not. We use the ImageNet pre-trained backbones from BYOL and OURS, and evaluate the correspondences on the COCO dataset. Figure 5 shows some visual examples of the correspondence maps on images from the COCO dataset. We also measure the accuracy of correct correspondences on the COCO val dataset. For the BYOL pre-trained model, the accuracy is only ~33%, while PixPro achieves ~99% accuracy and OURS achieves ~96% accuracy. The results suggest that our approach is more robust against color transformations. We infer that PixPro achieves better accuracy because PixPro is trained only with the local consistency loss, whereas we use both global and local correspondence during pre-training. The models have not been trained on COCO; hence, the results also show that the correspondence maps generalize to other datasets.
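This flip-based check can be sketched as follows, assuming dense (D, h, w) feature maps f1 and f2 of the original and the flipped, color-jittered image; the function name and the argmax-cosine matching criterion are our reading of the protocol.

```python
import torch
import torch.nn.functional as F

def flip_correspondence_accuracy(f1: torch.Tensor, f2: torch.Tensor) -> float:
    """f1, f2: (D, h, w) dense features of an image and its horizontally flipped version.
    A feature at column x in f1 should match column (w - 1 - x) in f2."""
    D, h, w = f1.shape
    a = F.normalize(f1.reshape(D, -1), dim=0)    # (D, h*w)
    b = F.normalize(f2.reshape(D, -1), dim=0)    # (D, h*w)
    sim = a.t() @ b                              # (h*w, h*w) cosine similarities
    pred = sim.argmax(dim=1)                     # predicted match for every grid point of f1
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    target = (ys * w + (w - 1 - xs)).reshape(-1) # mirrored feature locations
    return (pred == target).float().mean().item()
```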
does not require any external supervision, since all the lo-
5. Discussion and Conclusion cal and global constraints are generated from the input im-
Even though our model consistently improves perfor- age itself. We showed that our method outperforms exist-
mance on detection and segmentation, it has some limita- ing self-supervised methods that impose local consistency
tions. First, we calculate local correspondence loss at low without requiring complex architectural components such
spatial resolution (32⇥ down-sampled from the original im- as encoder-decoder layers, propagation modules, and re-
age resolution for ResNet50). Computing LC-loss at higher gions proposal networks.

5631
References

[1] Mathilde Caron, Ishan Misra, Julien Mairal, et al. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020.
[2] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, et al. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.
[3] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[5] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020.
[6] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[7] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, pages 3606–3613, 2014.
[8] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M. Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1902.03368, 2019.
[9] MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
[10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213–3223, 2016.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[12] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[13] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[14] Hugo Germain, Vincent Lepetit, and Guillaume Bourmaud. Visual correspondence hallucination: Towards geometric reasoning. arXiv preprint arXiv:2106.09711, 2021.
[15] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
[16] Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In ICCV, pages 6391–6400, 2019.
[17] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
[18] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020.
[19] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. In ICCV, pages 4918–4927, 2019.
[20] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[22] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
[23] Olivier J. Hénaff, Skanda Koppula, Jean-Baptiste Alayrac, Aaron van den Oord, Oriol Vinyals, and João Carreira. Efficient visual pretraining with contrastive detection. In ICCV, pages 10086–10096, 2021.
[24] Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.
[25] Ashraful Islam, Chun-Fu Richard Chen, Rameswar Panda, Leonid Karlinsky, Richard Radke, and Rogerio Feris. A broad study on the transferability of visual representations with contrastive learning. In ICCV, pages 8845–8855, 2021.
[26] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
[27] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
[28] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
[29] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[30] Sharada P. Mohanty, David P. Hughes, and Marcel Salathé. Using deep learning for image-based plant disease detection. Frontiers in Plant Science, 7:1419, 2016.
[31] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.
[32] Pedro O. Pinheiro, Amjad Almahairi, Ryan Y. Benmalek, Florian Golemo, and Aaron Courville. Unsupervised learning of dense visual representations. arXiv preprint arXiv:2011.05499, 2020.
[33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2016.
[34] Byungseok Roh, Wuhyun Shin, Ildoo Kim, and Sungwoong Kim. Spatially consistent representation learning. In CVPR, pages 1144–1153, 2021.
[35] Haohan Wang, Songwei Ge, Eric P. Xing, and Zachary C. Lipton. Learning robust global representations by penalizing local predictive power. arXiv preprint arXiv:1905.13549, 2019.
[36] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In CVPR, pages 2097–2106, 2017.
[37] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. Dense contrastive learning for self-supervised visual pre-training. In CVPR, pages 3024–3033, 2021.
[38] Fangyun Wei, Yue Gao, Zhirong Wu, Han Hu, and Stephen Lin. Aligning pretraining for detection via object-level contrastive learning. In Advances in Neural Information Processing Systems, volume 34, 2021.
[39] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[40] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, pages 418–434, 2018.
[41] Tete Xiao, Colorado J. Reed, Xiaolong Wang, Kurt Keutzer, and Trevor Darrell. Region similarity representation learning. In ICCV, pages 10539–10548, 2021.
[42] Enze Xie, Jian Ding, Wenhai Wang, Xiaohang Zhan, Hang Xu, Peize Sun, Zhenguo Li, and Ping Luo. DetCo: Unsupervised contrastive learning for object detection. In ICCV, pages 8392–8401, 2021.
[43] Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In CVPR, pages 16684–16693, 2021.
[44] Ceyuan Yang, Zhirong Wu, Bolei Zhou, and Stephen Lin. Instance localization for self-supervised detection pretraining. In CVPR, pages 3987–3996, 2021.
[45] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In ECCV, pages 649–666, 2016.