
Self-supervised Learning with Local Contrastive Loss for Detection and Semantic Segmentation

Ashraful Islam* (Nvidia)
Ben Lundell, Harpreet Sawhney, Sudipta N. Sinha (Microsoft Mixed Reality)
{Benjamin.Lundell,Harpreet.Sawhney,Sudipta.Sinha}@microsoft.com
Peter Morales
Richard J. Radke (Rensselaer Polytechnic Institute)

* This work was done while the author was an intern at Microsoft.
Abstract

We present a self-supervised learning (SSL) method suitable for semi-global tasks such as object detection and semantic segmentation. We enforce local consistency between self-learned features that represent corresponding image locations of transformed versions of the same image, by minimizing a pixel-level local contrastive (LC) loss during training. The LC-loss can be added to existing self-supervised learning methods with minimal overhead. We evaluate our SSL approach on two downstream tasks, object detection and semantic segmentation, using the COCO, PASCAL VOC, and CityScapes datasets. Our method outperforms the existing state-of-the-art SSL approaches by 1.9% on COCO object detection, 1.4% on PASCAL VOC detection, and 0.6% on CityScapes segmentation.

Figure 1: Our framework encourages local regions of two transformed images to learn similar features. The first image (left) uses color augmentation only, and the other image (right) employs both spatial (random resize crop) and color transformations. With known corresponding pixels, a consistency loss enforces maximal similarity between the corresponding learned features.
1. Introduction

Self-supervised learning (SSL) approaches learn generic feature representations from data in the absence of any external supervision. These approaches often solve an instance discrimination pretext task in which multiple transformations of the same image are required to generate similar learned features. Recent SSL methods have shown remarkable promise in global tasks such as classifying images by training simple classifiers on the features learned via instance discrimination [1, 2, 4, 17, 18]. However, global feature-learning SSL approaches do not explicitly retain spatial information, which renders them ill-suited for semi-global tasks such as object detection and instance and semantic segmentation [37, 43].

This work focuses on extending SSL to incorporate spatial locality by using a local contrastive (LC) loss function at a dense and fine-grained pixel level. The main idea is illustrated in Figure 1. Specifically, we encourage corresponding local pixels in the two transformed images to produce similar features. The true pixel correspondences are known since the input image pairs are generated by applying two distinct transformations to a single image. Note that this approach can be used alongside any conventional global SSL objective with minimal overhead.

We evaluate the impact of the LC-loss on several downstream tasks, namely object detection and instance and semantic segmentation, and report promising improvements over previous spatially-aware SSL methods [37, 38, 41, 43] on the PASCAL VOC, COCO, and Cityscapes datasets.

Contributions. Our main contribution is in demonstrating that adding a pixel-level contrastive loss to the BYOL [17] training procedure for the instance discrimination pretext task is sufficient to produce excellent results on many downstream dense prediction tasks. A similar pixel-level contrastive loss formulation was presented in PixPro [43], but a more complicated pixel-to-propagation consistency pretext task was required to achieve state-of-the-art results in dense prediction tasks. We show that no additional pretext task is necessary, and our simpler local contrastive loss formulation achieves superior performance.
Specifically, our key technical contributions are: (1) a simple framework that computes a local contrastive loss (LC-loss) to make the corresponding pixels of two augmented versions of the same image similar, and that can be added to any self-supervised learning method, such as BYOL; and (2) state-of-the-art transfer learning results on several dense labeling tasks. Using ResNet-50 backbones pretrained on ImageNet, our BYOL variant achieves 40.6 AP for COCO object detection (+1.9 vs. the SSL state of the art), 60.1 AP for VOC object detection (+1.4 vs. SOTA), 72.1 mIoU for VOC segmentation (+1 vs. SOTA), and 77.8 mIoU for CityScapes segmentation (+0.6 vs. SOTA) in the full-network fine-tuning setting. Our performance improvement is even more significant in the frozen backbone setting discussed in Section 4.4.

2. Related Work

Self-supervised learning (SSL). In self-supervised learning, the supervisory signal is automatically generated from a pair of input images and a pretext task. The input pair is generated by applying two distinct transformations, and the pretext task is a comparison between the learned representations of each pair of input images. Various pretext tasks have been explored, such as patch position [12], image colorization [45], image inpainting [31], rotation [15], and predictive coding [24]. The pretext task that has shown the most promise is the instance discrimination task, in which each image is considered as a single class. SimCLR [4], the first to propose this pretext task, adopts contrastive learning in which features from augmented versions of the same image are made closer in the feature space than all the other images in a mini-batch. SimCLR requires a large mini-batch to make contrastive learning feasible. MoCo [18] solves the issue of large batch size using a momentum queue and a moving average encoder. Despite impressive results in image classification tasks, contrastive learning requires careful handling of negative pairs. Recent approaches like BYOL [17], SwAV [1], and DINO [2] do not require any negative pairs or a memory bank. They also achieve impressive performance on the ImageNet-1k linear evaluation task and on downstream image classification-based transfer learning tasks.

SSL for Detection and Segmentation. For dense prediction tasks, SSL methods use an ImageNet-pretrained backbone within a larger architecture designed for a detection or segmentation task [29, 33, 40], and fine-tune the network on the downstream task dataset. He et al. [19] reported that ImageNet-pretrained models might be less helpful if the target task is more localization-sensitive than classification. One potential solution is to increase the target dataset size [19]; another is to impose local consistency during the ImageNet pretraining. We adopt the latter strategy.

Broadly, there are two approaches in the literature to ensuring local consistency during self-supervised pretraining: pixel-based and region-based. In region-based methods, region proposals are first generated, either during the self-supervised training [38, 34, 44] or before training starts [41], and local consistency is then applied between pooled features of the proposed regions. Our approach is pixel-based; local consistency is applied between the local features for corresponding pixels of transformed versions of the same image [37, 43, 42].

DenseCL [37] proposed dense contrastive learning for self-supervised visual pretraining. It follows the MoCo [18] framework to formulate the dense loss. However, DenseCL does not use known pixel correspondences to generate positive pairs of local features between two images. Instead, it extracts the correspondence across views. This creates a chicken-and-egg problem where DenseCL first requires learning a good feature representation to generate correct correspondences.

Our local contrastive loss is more similar to the PixContrast loss in the PixPro paper [43]. Given an image I, both methods operate on two distinct transforms J1 = T1(I) and J2 = T2(I) of I to produce two low-resolution spatial feature maps. Both methods use a contrastive loss based on pixel correspondences to generate positive and negative samples. The difference comes in how these samples are selected. In PixContrast, pixels in the low-resolution feature maps are warped back to the original image space using T1^{-1} and T2^{-1}, and positive samples are then determined by all pairs of pixels that are sufficiently close after warping. Our method generates positive samples using the correspondences in J1 and J2 derived directly from T1 and T2. While similar to ours, the method proposed for PixContrast does not work; instead, an additional pixel-propagation module is introduced (PixPro) to measure the feature similarity between corresponding pixels. We show that no pixel propagation or feature warping is required in our simpler formulation.

In summary, our framework does not require: (1) an encoder-decoder architecture for the local correspondence loss [32]; (2) contrastive learning that needs carefully tuned negative pairs [32, 37, 41]; (3) a good local feature extractor to find local feature correspondences [37]; or (4) an additional propagation module to measure the local contrastive loss [43]. A simple local correspondence loss obtained from matching pixel pairs achieves state-of-the-art results in detection and segmentation tasks.
3. Methodology

Figure 2 depicts our LC-loss framework. We use BYOL for the global self-supervised loss function, and apply a local contrastive loss on dense feature representations obtained from the backbone networks. We adopt the BYOL framework because it achieves higher performance than contrastive learning without using any negative pairs, and it is more resilient to changes in hyper-parameters like batch size and image transformations. In the following, we briefly describe the BYOL framework and introduce our approach.

Instance Discrimination from Global Features. BYOL consists of two neural networks: an online network with parameters θ and a target network with parameters ξ. The online network has three sub-networks: an encoder, a projector, and a predictor. The target network has the same architecture as the online network except for the predictor. The online network is updated by gradient descent, and the parameters of the target network are exponential moving averages of the parameters of the online network. Given an input image I, two transformed views I_c and I_t of I are obtained. One view I_c is passed through the online network with parameters θ to obtain local features F_θ, average-pooled encoder output f_θ, a projection g_θ, and a prediction q_θ. View I_t is passed through the target network with parameters ξ to obtain local features F_ξ, average-pooled encoder output f_ξ, and a projection g_ξ (see Fig. 2). There is no predictor in the target network; this asymmetric design is adopted to prevent collapse during self-supervised training [17]. We follow the original BYOL network to set the dimensions of the encoder outputs (2048), projections (4096), and predictions (256). The global self-supervised loss function for a single input is defined as:

    L_G = 2 - 2 \frac{q_\theta^\top g_\xi}{\|q_\theta\|_2 \, \|g_\xi\|_2}    (1)

where q_θ and g_ξ are learned global representations for the two transformed images that are forced to be similar under cosine similarity.
Local Contrastive Loss. Given two transformed versions of the same image, namely I_c and I_t, let an image point p_c in I_c correspond to another image point p_t in I_t. We can determine p_t for every p_c given the known image transformations. The correspondence map C_{p_c} ∈ R^{H×W} for a source point p_c is calculated, where C_{p_c}(p_l) denotes the similarity score between p_c of I_c and every pixel p_l of I_t; that is, C_{p_c}(p_l) is the score that p_l is the corresponding pixel of p_c. As we know that p_t is the actual corresponding pixel, we want C_{p_c}(p_t) to be maximized. The local loss for p_c is the negative log-likelihood at p_t, which encourages maximizing the likelihood estimate for the target location, log C_{p_c}(p_t).

We now describe how this is incorporated in our pipeline. We employ learned feature-level correspondence as a measure of pixel-level correspondence. Given an image I, we apply an image transformation T_c to I to obtain I_c. T_c contains strong color transformations (for example, Gaussian blur, solarization, color distortion), but does not include spatial transformations like image flipping or random crop; we apply a plain resize operation to resize the image to H × W. Another transformation T_sc is applied to I to obtain I_sc, where T_sc contains both spatial and color transformations. We obtain dense feature representations of I_sc from the backbone of the online network, and denote them as F_θ ∈ R^{h×w}, where h = H/p and w = W/p with p the stride of the feature representation. We get a similar feature representation F_ξ ∈ R^{h×w} by passing I_c through the target network.

Next we select h × w image points on a 2D uniform grid in I_c. Each point p_c of I_c has the local feature representation F_ξ(p_c) obtained from the target network. For the point p_c, there is a corresponding point p_sc in I_sc. Note that, because of the random crop and flipping in T_sc, the corresponding point p_sc may not lie at an integer pixel coordinate, so the feature F_θ(p_sc) is not directly available. Instead of adopting an expensive high-dimensional warping of F_θ to obtain corresponding features, we compute the negative log-likelihood from the correspondence map and resample the negative log-likelihood using bilinear interpolation to handle 2D points with subpixel coordinates, as described below [14].

We compute a dense correspondence map C' ∈ R^{(h×w)×(h×w)} between I_c and I_sc as

    C'(p_c, p_t) = \frac{F_\xi(p_c)^\top F_\theta(p_t)}{\|F_\xi(p_c)\|_2 \, \|F_\theta(p_t)\|_2}    (2)

where p_c and p_t are image points from I_c and I_sc, respectively. Then, we calculate the negative log-likelihood

    NLL(p_c, p_t) = -\log \frac{\exp(C'(p_c, p_t)/\tau)}{\sum_{p_k \in \Omega_{sc}} \exp(C'(p_c, p_k)/\tau)}    (3)

where Ω_sc is the set of locations in I_sc. If p_sc is not an integer location, we obtain the negative log-likelihood NLL(p_c, p_sc) by bilinearly interpolating NLL(p_c, ·). The contrastive loss, LC-loss, is defined as

    L_{LC} = \frac{1}{|P|} \sum_{(p_c, p_sc) \in P} NLL(p_c, p_sc)    (4)

where P contains all corresponding pairs {(p_c, p_sc)} for I_c and I_sc such that p_sc does not fall outside the boundary of I_sc, and |P| ≤ h × w. Our total loss is defined as

    L = (1 - α) L_G + α L_{LC}    (5)

where α is a multiplicative factor that balances the two loss components. See Section 4.5 for a study of the impact of α.
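The local loss of Eqs. (2)-(5) can be sketched as follows. This is our own illustrative implementation, not the authors' code: it assumes the two dense feature maps arrive as (B, D, h, w) tensors, that the sub-pixel correspondences p_sc are provided as normalized (x, y) coordinates so that grid_sample can resample the NLL map, that negatives are taken only from the same image, and that out-of-boundary pairs have already been masked out. The temperature value is also an assumption.

```python
import torch
import torch.nn.functional as F

def local_contrastive_loss(F_xi, F_theta, p_sc, valid, tau):
    """F_xi, F_theta: (B, D, h, w) dense features of I_c (target net) and I_sc (online net).
    p_sc: (B, h*w, 2) sub-pixel correspondence of every grid point of I_c in I_sc,
          given as normalized (x, y) coordinates in [-1, 1].
    valid: (B, h*w) boolean mask of pairs whose match falls inside I_sc.
    tau: softmax temperature (value not specified in this text; an assumption)."""
    B, D, h, w = F_xi.shape
    q = F.normalize(F_xi.reshape(B, D, -1), dim=1)      # (B, D, h*w), one column per point p_c
    k = F.normalize(F_theta.reshape(B, D, -1), dim=1)   # (B, D, h*w), one column per location in I_sc
    logits = torch.einsum('bdi,bdj->bij', q, k) / tau   # Eq. (2): cosine similarities, scaled by 1/tau
    # Eq. (3): negative log-likelihood over all locations of I_sc for each source point p_c.
    nll = -F.log_softmax(logits, dim=2)                 # (B, h*w, h*w)
    nll_maps = nll.reshape(B * h * w, 1, h, w)          # one spatial NLL map per source point
    grid = p_sc.reshape(B * h * w, 1, 1, 2)
    # Bilinearly resample each NLL map at the sub-pixel correspondence p_sc.
    nll_at_match = F.grid_sample(nll_maps, grid, align_corners=False).reshape(B, h * w)
    return (nll_at_match * valid).sum() / valid.sum().clamp(min=1)   # Eq. (4)

# Eq. (5): total loss combining the global BYOL loss with the LC-loss.
# loss = (1 - alpha) * global_byol_loss(q_theta, g_xi) + alpha * local_contrastive_loss(...)
```

The sketch mirrors the text's design choice: rather than warping high-dimensional features, the scalar NLL map is bilinearly resampled at the sub-pixel match locations.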
Figure 2: Proposed Framework. We use BYOL as the self-supervised learning framework. BYOL consists of an online network with parameters θ and a target network with parameters ξ, which are the exponential moving average of θ. Given an image, we create two transformed versions. We apply a mean squared error loss between the L2-normalized global feature representations from the online and target networks. We also calculate a local contrastive loss from the dense feature representations of the image pairs.

4. Experiments

4.1. Implementation Details

Pretraining Setup. We use ResNet-50 [21] as the backbone network and BYOL [17] as the self-supervision architecture. We use identical architectures for the projection and prediction networks as in BYOL. For extracting local features, we add a local projection branch to the online network and another branch of similar architecture to the target network. As with the global branches, only the online local projection branch is updated through optimization, while the target local projection branch is the exponential moving average of the online one. The local projection branch consists of two convolution layers. The first convolution layer consists of a 1×1 convolution kernel with input dimension 2048 and output dimension 2048, followed by a BatchNorm layer. The second convolution layer contains a 1×1 kernel with output dimension 256. The input to the local projection branch is the local feature representation from the final stage of ResNet-50 (before the global average pooling layer). For image transformations during pretraining, following BYOL [17], we use random resize crop (resized to 224×224), random horizontal flip, color distortion, blurring, and solarization. We do not apply random crop to the image that is used to obtain the local contrastive loss.
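Read literally, the local projection branch might be implemented as in the sketch below. The ReLU between the two layers is our assumption (the text only specifies the two 1×1 convolutions and the BatchNorm), as is the helper name.

```python
import torch.nn as nn

def make_local_projection_head(in_dim: int = 2048, hidden_dim: int = 2048, out_dim: int = 256) -> nn.Module:
    """Local projection branch applied to the final-stage ResNet-50 feature map (B, 2048, h, w)."""
    return nn.Sequential(
        nn.Conv2d(in_dim, hidden_dim, kernel_size=1),   # 1x1 conv, 2048 -> 2048
        nn.BatchNorm2d(hidden_dim),
        nn.ReLU(inplace=True),                          # assumption: a non-linearity, as in BYOL-style heads
        nn.Conv2d(hidden_dim, out_dim, kernel_size=1),  # 1x1 conv, 2048 -> 256
    )
```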
Dataset. We use the ImageNet [11] dataset for pretraining the networks. ImageNet contains ~1.28M training images, mostly with a single foreground object.

Optimization. The default model is trained for 400 epochs unless specified otherwise in the results; see Sec. 4.5 for details on the effect of the number of pretraining epochs on transfer performance. The LARS optimizer is used with a base learning rate of 0.3 for batch size 256, momentum 0.9, weight decay 1e-6, and a cosine learning rate decay schedule with learning rate warm-up for 10 epochs. We use 16 GPUs with a batch size of 256 on each GPU, so the effective batch size is 4096. We linearly scale the learning rate with the effective batch size. The weight parameter α is set to 0.1 (Eqn. 5). For the momentum encoder, the momentum value starts from 0.996 and ends at 1. We use 16-bit mixed precision during pre-training.
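For illustration, a sketch of the learning-rate recipe described above (linear scaling with the effective batch size, 10-epoch warm-up, cosine decay) is shown below. Plain SGD stands in for LARS, which is not part of torch.optim, and the exact warm-up and decay formulas are assumptions.

```python
import math
import torch

def scaled_lr(base_lr: float = 0.3, base_batch: int = 256, effective_batch: int = 4096) -> float:
    # Linear scaling rule: 0.3 * 4096 / 256 = 4.8
    return base_lr * effective_batch / base_batch

def lr_multiplier(epoch: int, total_epochs: int = 400, warmup_epochs: int = 10) -> float:
    # Linear warm-up for the first 10 epochs, then cosine decay to zero.
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Example wiring (SGD as a LARS stand-in):
# optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr(), momentum=0.9, weight_decay=1e-6)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_multiplier)
```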
4.2. Results on Object Detection and Instance Segmentation

We use the Detectron2 framework [39] to evaluate downstream object detection and segmentation results on the COCO and PASCAL VOC datasets.

COCO Object Detection. For object detection on COCO, we adopt RetinaNet [27] following [38, 34, 42]. We fine-tune all layers with Sync BatchNorm for 90k iterations on the COCO train2017 set and evaluate on the COCO val2017 set. Table 1 shows object detection results on COCO for our method and other approaches in the literature with full-network fine-tuning. Note that ReSim, SoCo, and SCRL use region proposal networks during pretraining on ImageNet, so these approaches are not exactly comparable to ours; our model is more similar to methods like DetCo, DenseCL, and PixPro. We achieve 40.6 AP for object detection, outperforming the second-best method, PixPro [43], by a significant 1.9%. We also report results on COCO detection using Mask R-CNN + FPN, where we again outperform PixPro (our mAP is higher by 1.4).

COCO Instance Segmentation. We use the Mask R-CNN framework [20] with a ResNet50-FPN backbone and follow the 1× schedule. Table 1 shows that we achieve 38.3% AP for COCO instance segmentation, which is comparable with SoCo [38]. Note that SoCo performs selective search on the input image to find object proposals, and uses additional feature pyramid networks during pre-training.
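As a rough guide to reproducing this kind of transfer evaluation, a minimal Detectron2 fine-tuning script might look like the sketch below. The config name, dataset identifiers, and the assumption that the self-supervised ResNet-50 checkpoint has already been converted to Detectron2's format are ours, not taken from the paper's code; options such as Sync BatchNorm are omitted.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

def finetune_retinanet_on_coco(pretrained_backbone_path: str):
    """Fine-tune a RetinaNet detector on COCO starting from a self-supervised ResNet-50 backbone."""
    cfg = get_cfg()
    # Standard RetinaNet R50-FPN recipe from the Detectron2 model zoo.
    cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_50_FPN_1x.yaml"))
    cfg.MODEL.WEIGHTS = pretrained_backbone_path   # SSL backbone converted to Detectron2 format
    cfg.DATASETS.TRAIN = ("coco_2017_train",)
    cfg.DATASETS.TEST = ("coco_2017_val",)
    cfg.SOLVER.MAX_ITER = 90000                    # 90k-iteration schedule used in the paper
    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()
```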
                              Object Detection         Object Detection         Instance Segmentation
                              RetinaNet + FPN          Mask-RCNN + FPN          Mask-RCNN + FPN
Method            Pretrain    APb    APb50  APb75      APb    APb50  APb75      APmk   APmk50  APmk75
                  Epochs
Supervised [21]   90          37.7   57.2   40.4       38.9   59.6   42.7       35.4   56.5    38.1
MoCo v2 [6]       200         37.3   56.2   40.4       40.4   60.2   44.2       36.4   57.2    38.9
BYOL [17]         300         35.4   54.7   37.4       40.4   61.6   44.1       37.2   58.8    39.8
DetCo [42]        800         38.4   57.8   41.2       40.1   61.0   43.9       36.4   58.0    38.9
ReSim-FPN [41]    200         38.6   57.6   41.6       39.8   60.2   43.5       36.0   57.1    38.6
SCRL [34]         800         39.0   58.7   41.9       -      -      -          37.7   59.6    40.7
SoCo [38]         400         38.3   57.2   41.2       43.0   63.3   47.1       38.2   60.2    41.0
DenseCL [37]      200         37.6   56.6   40.2       40.3   59.9   44.3       36.4   57.0    39.2
PixPro [43]       400         38.7   57.5   41.6       41.4   61.6   45.4       -      -       -
OURS              400         40.6   60.4   43.6       42.5   62.9   46.7       38.3   60.0    41.1

Table 1: Main Results on COCO. We use RetinaNet (and Mask R-CNN) with FPN for COCO object detection, Mask R-CNN with FPN for COCO instance segmentation, and Faster R-CNN with FPN for VOC object detection.

PASCAL VOC Object Detection [13]. We use the Faster R-CNN [33] object detector with a ResNet50-FPN backbone following [34]. For training, we use images from both trainval07+12 sets, and we evaluate only on the VOC07 test set. We use the pre-trained checkpoints released by the authors for the backbone network, and fine-tune the full networks on the VOC dataset. Table 2 shows that we achieve 60.1 AP for VOC detection. Our method improves mean AP by a significant 3.2% over the BYOL baseline, and outperforms the current SOTA, PixPro, by 1.4% AP. The improvement is even more significant in AP75, where we outperform BYOL by 3.6% and PixPro by 1.9%.

                    Pretrain    PASCAL VOC Object Detection (FRCNN + FPN)
Method              Epochs      APb    APb50   APb75
Supervised [21]     90          53.2   81.7    58.2
BYOL [17]           300         55.0   83.1    61.1
SCRL [34]           800         57.2   83.8    63.9
DenseCL† [37]       200         56.6   81.8    62.9
PixPro† [43]        400         58.7   82.9    65.9
OURS                400         60.1   84.2    67.8

Table 2: Main Results. We use Faster R-CNN with FPN for VOC object detection. Supervised and BYOL results are from [34]. (†): We use the pre-trained checkpoint released by the authors and fine-tune on the VOC dataset.

4.3. Results on Semantic Segmentation

We show the semantic segmentation evaluation in Table 3 on the PASCAL VOC and CityScapes [10] datasets for both the fine-tuning and frozen backbone settings. We use an FCN backbone [29] following the settings in mmsegmentation [9].

PASCAL VOC Segmentation. We train on the VOC train-aug2012 set for 20k iterations and evaluate on the val2012 set. Table 3 shows that our method yields 72.1% mIoU, outperforming BYOL by 7.7% and PixPro by 1% mIoU.

Cityscapes Segmentation. CityScapes [10] contains images from urban street scenes. Table 3 shows that in the fine-tuning setting our approach yields 77.8% mIoU, a 6.2% mIoU improvement over BYOL and a 0.6% improvement over PixPro.

                Pretrain    VOC      CityScapes
Method          Epochs      mIoU     mIoU
Scratch         -           40.7     63.5
Supervised      90          67.7     74.6
MoCo v2         200         67.5     74.5
BYOL            300         63.3     71.6
DenseCL         200         69.4     69.4
PixPro†         400         71.1     77.2
OURS            400         72.1     77.8

Table 3: Evaluation on Semantic Segmentation using an FCN ResNet-50 network on the PASCAL VOC and CityScapes datasets. (†): We use the pretrained checkpoint released by the authors and fine-tune the full networks on the VOC dataset. All other scores are obtained from the respective papers.

4.4. Analysis

Frozen Backbone Analysis. We also report detection and segmentation results for a frozen backbone, following [16, 23, 41]. Training a linear classifier on a frozen backbone is a standard approach to evaluate self-supervised representation quality for image classification [1, 5, 17, 18]. We adopt the standard strategy in the 'frozen backbone' setting, where we freeze the pre-trained ResNet50 backbone and only fine-tune the remaining layers. The frozen backbone might be an ideal evaluation strategy because fine-tuning the full network evaluates the quality of the representations along with initialization and optimization, whereas the frozen backbone setting mostly evaluates the representation quality of the backbone [16, 41].
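Concretely, the frozen-backbone protocol amounts to disabling gradients on the backbone before fine-tuning the remaining layers. A minimal sketch follows, with a hypothetical detector.backbone attribute, and with freezing the BatchNorm statistics as an additional assumption not stated in the text.

```python
import torch.nn as nn

def freeze_backbone(detector: nn.Module) -> None:
    """Frozen-backbone evaluation: keep the pre-trained ResNet-50 fixed and train only
    the remaining layers (RPN, FPN, classifier and regression heads, etc.)."""
    for p in detector.backbone.parameters():
        p.requires_grad = False
    # Assumption: also keep BatchNorm running statistics fixed in the frozen backbone.
    for m in detector.backbone.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()
```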
In the frozen backbone setting (Table 4), we achieve 30.5% AP on COCO object detection, outperforming PixPro by 2.8% AP, and 55.1% AP on VOC detection, outperforming PixPro by 1.6% and BYOL by 2.7%. We achieve 63.4% mIoU on VOC semantic segmentation, which is more than the score achieved by BYOL in the fine-tuning setting (63.3% mIoU); we also outperform PixPro by a significant 2.9% mIoU. On CityScapes semantic segmentation, we achieve 60.7% mIoU, which improves upon BYOL by 5.1% and PixPro by 2.5%.

Efficient Pre-training. In Table 5, we report results for VOC object detection with Faster R-CNN-FPN, COCO object detection with Mask R-CNN-FPN, and VOC and CityScapes segmentation with FCN, for BYOL and OURS pre-trained for different numbers of epochs. The results reveal that our model pre-trained for 200 epochs with training image size 160 achieves better results than BYOL pre-trained for 1000 epochs, saving 5.3× the computational resources. Even our 100-epoch pre-trained model is comparable with the 1000-epoch pre-trained BYOL model. This validates the efficacy of our local loss during self-supervised pre-training.

Importance of Local Contrast. In Table 6, we show the relative performance of our local contrastive loss against a non-contrastive BYOL-type loss. In 'BYOL + Local MSE loss', we apply the same L2-normalized MSE local loss as the global loss in BYOL. Models are trained for 200 epochs on the ImageNet dataset. We report the average AP scores for VOC detection, COCO detection with Mask R-CNN, and CityScapes segmentation, which show that our approach of calculating local consistency using a contrastive loss works better than the non-contrastive BYOL-type local loss.

Few-shot Image Classification. Since the global and local losses appear to be complementary to each other, we ascertain whether our method hurts image classification performance in transfer learning. We use our pre-trained models as fixed feature extractors, and perform 5-way 5-shot few-shot learning on 7 datasets from diverse domains using a logistic regression classifier. Table 7 reports the 5-shot top-1 accuracy for the 7 datasets and reveals that OURS shows the best performance on average among the self-supervised models that use local consistency. OURS outperforms PixPro by 2.4% top-1 accuracy on average; the minor fluctuations can be attributed to random noise.
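A sketch of this few-shot protocol, assuming features have already been extracted with the frozen backbone, is given below; sample_episode is a hypothetical helper that draws a 5-way 5-shot support/query split, and the classifier settings are our own choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def run_episode(support_x, support_y, query_x, query_y) -> float:
    """One 5-way 5-shot episode: fit a logistic regression classifier on the support
    features from the frozen backbone and report top-1 accuracy on the query features."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(support_x, support_y)        # support_x: (25, D) features, support_y: class ids
    return clf.score(query_x, query_y)   # top-1 accuracy on the query set

# accuracies = [run_episode(*sample_episode(dataset_features)) for _ in range(600)]
# print(np.mean(accuracies))   # averaged over 600 episodes, as reported in Table 7
```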
Transfer to Other Vision Tasks. Even though we mainly evaluated on detection and segmentation, we also show results for keypoint estimation, a task that might benefit from models trained with local consistency. We use Mask R-CNN (keypoint version) with a ResNet50-FPN network to evaluate keypoint estimation, fine-tuning on COCO train2017 for 90k iterations. Table 8 shows that our method outperforms all other approaches on the keypoint estimation task.

Detection on Mini COCO. As the full COCO dataset contains extensive annotated images for supervision, it might not always reveal the generalization ability of the network [19]. We therefore also report results for object detection on smaller versions of the COCO training set in Table 9, where only 5% or 10% of the images (randomly sampled) are used for fine-tuning the Mask R-CNN with FPN network with the 1× schedule. The evaluation is performed on the full val2017 set. For the 5% setting, our method outperforms BYOL by 1.4% AP for the ImageNet-pretrained models; for the 10% setting, it achieves a 1.8% AP improvement over BYOL.

Generalization to other SSL methods. In the Appendix, we show results of our approach applied to other SSL methods (e.g., DINO), where we also show consistent improvement over the baseline methods.

4.5. Ablation Studies

Effect of Pretraining Epochs. Figure 3a reports object detection performance on PASCAL VOC with Faster R-CNN-FPN and on MS-COCO with Mask R-CNN-FPN for different numbers of pre-training epochs. The models are pre-trained on the ImageNet training set. Longer training generally results in better downstream object detection performance. For example, for the 100-epoch pre-trained model, the AP is 39.8%, whereas for the 600-epoch pre-trained model it improves to 42.8% on the COCO evaluation. A similar upswing is observed for PASCAL VOC object detection.

Ablation on Loss Weight α. Figure 3b reports the AP for object detection on PASCAL VOC for different values of the weight parameter α. The models are pre-trained on the ImageNet dataset for 200 epochs with a training image size of 160 for faster training. α balances the weight between the global and local loss functions. For α = 0.05, the mean AP is 58.4%. We get slightly better performance with α = 0.1 (58.9%) and α = 0.3 (59.0%). The performance degrades a little when α is increased to 0.7 (57.5%). The results reveal that the best performance is achieved when we use both the global and local loss functions, and a proper balance between them ensures better downstream performance.

4.6. Qualitative Analysis

Correspondence Visualization. In Figure 4, we show visual examples of correspondences from our model. For two transformed images I1 and I2, we extract feature representations F1 and F2. For each feature in F1, the corresponding feature in F2 is calculated based on maximum cosine similarity between the feature representations. We show the matching at the original image resolution. Figure 4 shows that our method predicts accurate matches most of the time (considering the resolution error due to the 32 × 32 grid size in the pixel space for each feature point).
             PASCAL VOC OD             COCO OD                  VOC SS    Cityscapes SS
Method       APb    APb50  APb75       APb    APb50  APb75      mIoU      mIoU
Supervised   50.7   80.4   55.1        30.3   50.0   31.3       56.6      55.7
BYOL         52.4   81.1   57.5        30.2   49.1   31.5       55.7      55.6
DenseCL      50.9   79.9   55.0        25.5   43.6   25.8       63.0      58.5
PixPro       53.5   80.4   59.7        27.7   44.6   29.1       60.3      58.2
OURS         55.1   82.6   61.7        30.5   49.8   31.7       63.4      60.7

Table 4: Frozen backbone evaluation. We freeze the ResNet-50 backbone and fine-tune the other layers (RPN, FPN, classifier networks, regression layers, etc.). We use Faster R-CNN with FPN for PASCAL VOC object detection (OD), RetinaNet with FPN for COCO detection (OD), and an FCN network for VOC and CityScapes segmentation (SS). For this experiment, we use the publicly available checkpoints for the backbone networks and evaluate on the downstream tasks.
          Pretrain   Pretrain   Pretrain   VOC    COCO   VOC    Cityscapes
Method    Epochs     Im-size    time       APb    APb    mIoU   mIoU
BYOL      300        224        ×1.6       56.9   40.4   63.3   71.6
BYOL      1000       224        ×5.3       57.0   40.9   69.0   73.4
OURS      100        224        ×0.8       58.2   40.9   68.4   76.5
OURS      200        160        ×1         59.0   41.6   68.5   77.0
OURS      200        224        ×1.6       59.6   42.0   70.9   77.4

Table 5: Efficient SSL training on ImageNet. Performance on object detection and segmentation for BYOL and OURS for different pre-training epochs and training image sizes. We achieve better performance than BYOL (1000 epochs of pretraining) with our model pre-trained for 200 epochs with training image size 160, which is 5.3× faster to pre-train.

                         VOC    COCO   VOC    Cityscapes
Method                   APb    mAP    mIoU   mIoU
BYOL                     57.0   40.9   69.0   73.4
BYOL + Local MSE loss    58.7   42.0   70.5   76.7
OURS                     59.6   42.5   72.1   77.8

Table 6: Effectiveness of our LC-loss over an MSE loss.

Figure 3: Ablation studies on pretraining epochs and the relative weight of the local contrastive loss. (a) Effect of longer pre-training: the plots show average AP for object detection on the COCO validation set (left) and PASCAL VOC (right). (b) Effect of the relative weight of the LC-loss: the plot shows the average AP for object detection on VOC.

Figure 4: Correspondence Visualization. (Top) Ground-truth correspondences between two transformed versions of the same image. (Bottom) Correspondence predictions from OURS.
Method       EuroSAT [22]  CropDisease [30]  ChestX [36]  ISIC [8]  Sketch [35]  DTD [7]  Omniglot [26]  Avg
Supervised   85.8          92.5              25.2         43.4      86.3         81.9     93.0           72.6
SoCo         78.3          84.1              25.1         41.2      81.5         73.9     92.2           68.0
DenseCL      77.7          81.0              23.8         36.8      76.5         78.3     77.4           64.5
PixPro       80.5          86.4              26.5         41.2      81.5         73.9     92.2           68.9
OURS         84.5          90.1              25.2         41.9      85.6         80.2     91.5           71.3

Table 7: Few-shot learning results on downstream datasets. The pre-trained models are used as fixed feature extractors. We report top-1 accuracy for 5-way 5-shot classification averaged over 600 episodes. We use the publicly available pre-trained backbones as feature extractors for the few-shot evaluation.

Method       Pretrain Epochs   AP     AP50   AP75
Supervised   90                65.7   87.2   71.5
BYOL         300               66.3   87.4   72.4
VADeR [32]   200               66.1   87.3   72.1
SCRL [34]    1000              66.5   87.8   72.3
DenseCL†     200               66.2   87.3   71.9
PixPro†      400               66.6   87.8   72.8
OURS         400               67.2   87.4   73.7

Table 8: COCO keypoint estimation. Supervised and BYOL results are from [34]. (†): We use the publicly available ImageNet-pretrained checkpoints released by the authors and fine-tune on the COCO dataset.

                   5%                        10%
Method       AP     AP50   AP75        AP     AP50   AP75
Supervised   19.2   31.0   20.5        25.0   39.9   26.6
BYOL         21.9   36.2   23.2        27.1   43.4   29.3
PixPro       20.3   31.4   22.1        25.4   39.5   27.4
OURS         23.3   37.4   25.0        27.9   44.0   30.1

Table 9: Object detection on mini-COCO with the 1× schedule. All scores are obtained by fine-tuning the publicly available pre-trained backbones on the downstream dataset.

Figure 5: Correspondence maps for an image and its transformed version with only flipping and color transformations. (a) Results from BYOL without the local loss: there are many erroneous corresponding pixels, indicating that the global loss alone does not learn good features for local correspondence. (b) and (c): Results from PixPro and OURS, both of which correctly detect most of the correspondences.

More Analysis on Correspondence. To show that our method learns better correspondences across datasets, we perform a simple experiment. Given an image, we flip the image along the horizontal direction and apply color transformations (Gaussian blur, color jitter, and random grayscale operations). Since the only spatial transformation is a horizontal flip, the corresponding pixels are simply at the mirror locations of the original pixels, i.e., for a pixel location (x, y), the correct correspondence location in the transformed image is (w-x, y), where w is the width. Note that the correspondence is measured at feature locations, not actual pixel locations. We can thus measure the correspondence accuracy based on whether the matching is correct or not. We use the ImageNet pre-trained backbones from BYOL and OURS, and evaluate the correspondences on the COCO dataset. Figure 5 shows some visual examples of the correspondence maps on images from the COCO dataset. We also measure the accuracy of correct correspondences on the COCO val dataset. For the BYOL pre-trained model, the accuracy is only ~33%, while PixPro achieves ~99% accuracy and OURS achieves ~96% accuracy. The results suggest that our approach is more robust against color transformations. We infer that PixPro achieves better accuracy because PixPro is trained only with the local consistency loss, whereas we use both global and local correspondence during pre-training. The models have not been trained on COCO; hence, the results also show that the correspondence maps generalize to other datasets.
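This flip-based check can be sketched as follows, assuming dense (D, h, w) feature maps f1 and f2 of the original and the flipped, color-jittered image; the function name and the argmax-cosine matching criterion are our reading of the protocol.

```python
import torch
import torch.nn.functional as F

def flip_correspondence_accuracy(f1: torch.Tensor, f2: torch.Tensor) -> float:
    """f1, f2: (D, h, w) dense features of an image and its horizontally flipped version.
    A feature at column x in f1 should match column (w - 1 - x) in f2."""
    D, h, w = f1.shape
    a = F.normalize(f1.reshape(D, -1), dim=0)    # (D, h*w)
    b = F.normalize(f2.reshape(D, -1), dim=0)    # (D, h*w)
    sim = a.t() @ b                              # (h*w, h*w) cosine similarities
    pred = sim.argmax(dim=1)                     # predicted match for every grid point of f1
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    target = (ys * w + (w - 1 - xs)).reshape(-1) # mirrored feature locations
    return (pred == target).float().mean().item()
```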
does not require any external supervision, since all the lo-
5. Discussion and Conclusion cal and global constraints are generated from the input im-
Even though our model consistently improves perfor- age itself. We showed that our method outperforms exist-
mance on detection and segmentation, it has some limita- ing self-supervised methods that impose local consistency
tions. First, we calculate local correspondence loss at low without requiring complex architectural components such
spatial resolution (32⇥ down-sampled from the original im- as encoder-decoder layers, propagation modules, and re-
age resolution for ResNet50). Computing LC-loss at higher gions proposal networks.

5631
References

[1] Mathilde Caron, Ishan Misra, Julien Mairal, et al. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020.
[2] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, et al. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.
[3] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[5] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020.
[6] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[7] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, pages 3606–3613, 2014.
[8] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M. Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1902.03368, 2019.
[9] MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
[10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213–3223, 2016.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[12] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[13] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[14] Hugo Germain, Vincent Lepetit, and Guillaume Bourmaud. Visual correspondence hallucination: Towards geometric reasoning. arXiv preprint arXiv:2106.09711, 2021.
[15] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
[16] Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In ICCV, pages 6391–6400, 2019.
[17] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
[18] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020.
[19] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. In ICCV, pages 4918–4927, 2019.
[20] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[22] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
[23] Olivier J. Hénaff, Skanda Koppula, Jean-Baptiste Alayrac, Aaron van den Oord, Oriol Vinyals, and João Carreira. Efficient visual pretraining with contrastive detection. In ICCV, pages 10086–10096, 2021.
[24] Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.
[25] Ashraful Islam, Chun-Fu Richard Chen, Rameswar Panda, Leonid Karlinsky, Richard Radke, and Rogerio Feris. A broad study on the transferability of visual representations with contrastive learning. In ICCV, pages 8845–8855, 2021.
[26] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
[27] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
[28] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
[29] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[30] Sharada P. Mohanty, David P. Hughes, and Marcel Salathé. Using deep learning for image-based plant disease detection. Frontiers in Plant Science, 7:1419, 2016.
[31] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.
[32] Pedro O. Pinheiro, Amjad Almahairi, Ryan Y. Benmalek, Florian Golemo, and Aaron Courville. Unsupervised learning of dense visual representations. arXiv preprint arXiv:2011.05499, 2020.
[33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2016.
[34] Byungseok Roh, Wuhyun Shin, Ildoo Kim, and Sungwoong Kim. Spatially consistent representation learning. In CVPR, pages 1144–1153, 2021.
[35] Haohan Wang, Songwei Ge, Eric P. Xing, and Zachary C. Lipton. Learning robust global representations by penalizing local predictive power. arXiv preprint arXiv:1905.13549, 2019.
[36] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In CVPR, pages 2097–2106, 2017.
[37] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. Dense contrastive learning for self-supervised visual pre-training. In CVPR, pages 3024–3033, 2021.
[38] Fangyun Wei, Yue Gao, Zhirong Wu, Han Hu, and Stephen Lin. Aligning pretraining for detection via object-level contrastive learning. In Advances in Neural Information Processing Systems, volume 34, 2021.
[39] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[40] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, pages 418–434, 2018.
[41] Tete Xiao, Colorado J. Reed, Xiaolong Wang, Kurt Keutzer, and Trevor Darrell. Region similarity representation learning. In ICCV, pages 10539–10548, 2021.
[42] Enze Xie, Jian Ding, Wenhai Wang, Xiaohang Zhan, Hang Xu, Peize Sun, Zhenguo Li, and Ping Luo. DetCo: Unsupervised contrastive learning for object detection. In ICCV, pages 8392–8401, 2021.
[43] Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In CVPR, pages 16684–16693, 2021.
[44] Ceyuan Yang, Zhirong Wu, Bolei Zhou, and Stephen Lin. Instance localization for self-supervised detection pretraining. In CVPR, pages 3987–3996, 2021.
[45] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In ECCV, pages 649–666, 2016.