0% found this document useful (0 votes)

44 views

4.1 - Unsupervised Visual Representation Learning by Context Prediction

This paper explores using spatial context as a self-supervised signal to train a visual representation without image labels. The method extracts random patch pairs from unlabeled images and trains a neural network to predict the relative position of one patch to the other. The paper argues this task requires recognizing objects and parts to do well. Evaluation shows the representation captures visual similarity across images and allows unsupervised discovery of objects from unlabeled data, providing a boost to supervised object detection performance.

Uploaded by

farzad imanpour

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

44 views

4.1 - Unsupervised Visual Representation Learning by Context Prediction

Uploaded by

farzad imanpour

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 10

Unsupervised Visual Representation Learning by Context Prediction

Carl Doersch1,2 Abhinav Gupta1 Alexei A. Efros2

1 2
School of Computer Science Dept. of Electrical Engineering and Computer Science
Carnegie Mellon University University of California, Berkeley

Abstract Example:
arXiv:1505.05192v3 [cs.CV] 16 Jan 2016

This work explores the use of spatial context as a source

of free and plentiful supervisory signal for training a rich
visual representation. Given only a large, unlabeled image
collection, we extract random pairs of patches from each
image and train a convolutional neural net to predict the po-
sition of the second patch relative to the first. We argue that
Question 1: Question 2:
doing well on this task requires the model to learn to recog-
nize objects and their parts. We demonstrate that the fea-
ture representation learned using this within-image context
indeed captures visual similarity across images. For exam-
_? _?
ple, this representation allows us to perform unsupervised
Figure 1. Our task for learning patch representations involves ran-
visual discovery of objects like cats, people, and even birds
domly sampling a patch (blue) and then one of eight possible
from the Pascal VOC 2011 detection dataset. Furthermore, neighbors (red). Can you guess the spatial configuration for the
we show that the learned ConvNet can be used in the R- two pairs of patches? Note that the task is much easier once you
CNN framework [21] and provides a significant boost over have recognized the object!
a randomly-initialized ConvNet, resulting in state-of-the- Answer key: Q1: Bottom right Q2: Top center
art performance among algorithms which use only Pascal-
provided training set annotations. in the context (i.e., a few words before and/or after) given
the vector. This converts an apparently unsupervised prob-
1. Introduction lem (finding a good similarity metric between words) into
Recently, new computer vision methods have leveraged a “self-supervised” one: learning a function from a given
large datasets of millions of labeled examples to learn rich, word to the words surrounding it. Here the context predic-
high-performance visual representations [32]. Yet efforts tion task is just a “pretext” to force the model to learn a
to scale these methods to truly Internet-scale datasets (i.e. good word embedding, which, in turn, has been shown to
hundreds of billions of images) are hampered by the sheer be useful in a number of real tasks, such as semantic word
expense of the human annotation required. A natural way similarity [40].
to address this difficulty would be to employ unsupervised Our paper aims to provide a similar “self-supervised”
learning, which aims to use data without any annotation. formulation for image data: a supervised task involving pre-
Unfortunately, despite several decades of sustained effort, dicting the context for a patch. Our task is illustrated in Fig-
unsupervised methods have not yet been shown to extract ures 1 and 2. We sample random pairs of patches in one of
useful information from large collections of full-sized, real eight spatial configurations, and present each pair to a ma-
images. After all, without labels, it is not even clear what chine learner, providing no information about the patches’
should be represented. How can one write an objective original position within the image. The algorithm must then
function to encourage a representation to capture, for ex- guess the position of one patch relative to the other. Our
ample, objects, if none of the objects are labeled? underlying hypothesis is that doing well on this task re-
Interestingly, in the text domain, context has proven to quires understanding scenes and objects, i.e. a good visual
be a powerful source of automatic supervisory signal for representation for this task will need to extract objects and
learning representations [3, 41, 9, 40]. Given a large text their parts in order to reason about their relative spatial lo-
corpus, the idea is to train a model that maps each word cation. “Objects,” after all, consist of multiple parts that
to a feature vector, such that it is easy to predict the words can be detected independently of one another, and which

1
occur in a specific spatial configuration (if there is no spe-
cific configuration of the parts, then it is “stuff” [1]). We
present a ConvNet-based approach to learn a visual repre- 1 2 3
sentation from this task. We demonstrate that the resulting
visual representation is good for both object detection, pro-
viding a significant boost on PASCAL VOC 2007 compared
to learning from scratch, as well as for unsupervised object 4 5
discovery / visual data mining. This means, surprisingly,
that our representation generalizes across images, despite
being trained using an objective function that operates on a
single image at a time. That is, instance-level supervision
6 7 8
appears to improve performance on category-level tasks.

2. Related Work
One way to think of a good image representation is as X=( , ); Y = 3
the latent variables of an appropriate generative model. An
ideal generative model of natural images would both gener- Figure 2. The algorithm receives two patches in one of these eight
ate images according to their natural distribution, and be possible spatial arrangements, without any context, and must then
classify which configuration was sampled.
concise in the sense that it would seek common causes
for different images and share information between them. model (e.g. a deep network) to predict, from a single word,
However, inferring the latent structure given an image is in- the n preceding and n succeeding words. In principle, sim-
tractable for even relatively simple models. To deal with ilar reasoning could be applied in the image domain, a kind
these computational issues, a number of works, such as of visual “fill in the blank” task, but, again, one runs into the
the wake-sleep algorithm [25], contrastive divergence [24], problem of determining whether the predictions themselves
deep Boltzmann machines [48], and variational Bayesian are correct [12], unless one cares about predicting only very
methods [30, 46] use sampling to perform approximate in- low-level features [14, 33, 53]. To address this, [39] predicts
ference. Generative models have shown promising per- the appearance of an image region by consensus voting of
formance on smaller datasets such as handwritten dig- the transitive nearest neighbors of its surrounding regions.
its [25, 24, 48, 30, 46], but none have proven effective for Our previous work [12] explicitly formulates a statistical
high-resolution natural images. test to determine whether the data is better explained by a
Unsupervised representation learning can also be formu- prediction or by a low-level null hypothesis model.
lated as learning an embedding (i.e. a feature vector for The key problem that these approaches must address is
each image) where images that are semantically similar are that predicting pixels is much harder than predicting words,
close, while semantically different ones are far apart. One due to the huge variety of pixels that can arise from the same
way to build such a representation is to create a supervised semantic object. In the text domain, one interesting idea is
“pretext” task such that an embedding which solves the task to switch from a pure prediction task to a discrimination
will also be useful for other real-world tasks. For exam- task [41, 9]. In this case, the pretext task is to discriminate
ple, denoising autoencoders [56, 4] use reconstruction from true snippets of text from the same snippets where a word
noisy data as a pretext task: the algorithm must connect has been replaced at random. A direct extension of this to
images to other images with similar objects to tell the dif- 2D might be to discriminate between real images vs. im-
ference between noise and signal. Sparse autoencoders also ages where one patch has been replaced by a random patch
use reconstruction as a pretext task, along with a sparsity from elsewhere in the dataset. However, such a task would
penalty [42], and such autoencoders may be stacked to form be trivial, since discriminating low-level color statistics and
a deep representation [35, 34]. (however, only [34] was suc- lighting would be enough. To make the task harder and
cessfully applied to full-sized images, requiring a million more high-level, in this paper, we instead classify between
CPU hours to discover just three objects). We believe that multiple possible configurations of patches sampled from
current reconstruction-based algorithms struggle with low- the same image, which means they will share lighting and
level phenomena, like stochastic textures, making it hard to color statistics, as shown on Figure 2.
even measure whether a model is generating well. Another line of work in unsupervised learning from im-
Another pretext task is “context prediction.” A strong ages aims to discover object categories using hand-crafted
tradition for this kind of task already exists in the text do- features and various forms of clustering (e.g. [51, 47]
main, where “skip-gram” [40] models have been shown to learned a generative model over bags of visual words). Such
generate useful word representations. The idea is to train a representations lose shape information, and will readily dis-

2
cover clusters of, say, foliage. A few subsequent works have
attempted to use representations more closely tied to shape fc9 (8)
fc8 (4096)
[36, 43], but relied on contour extraction, which is difficult
in complex images. Many other approaches [22, 29, 16] fc7 (4096)
focus on defining similarity metrics which can be used in
fc6 (4096) fc6 (4096)
more standard clustering algorithms; [45], for instance, pool5 (3x3,256,2) pool5 (3x3,256,2)
re-casts the problem as frequent itemset mining. Geom- conv5 (3x3,256,1) conv5 (3x3,256,1)
etry may also be used to for verifying links between im- conv4 (3x3,384,1) conv4 (3x3,384,1)
conv3 (3x3,384,1) conv3 (3x3,384,1)
ages [44, 6, 23], although this can fail for deformable ob-
LRN2 LRN2
jects. pool2 (3x3,384,2) pool2 (3x3,384,2)
Video can provide another cue for representation learn- conv2 (5x5,384,2) conv2 (5x5,384,2)
ing. For most scenes, the identity of objects remains un- LRN1 LRN1
pool1 (3x3,96,2) pool1 (3x3,96,2)
changed even as appearance changes with time. This kind conv1 (11x11,96,4) conv1 (11x11,96,4)
of temporal coherence has a long history in visual learning
literature [18, 59], and contemporaneous work shows strong Patch 1 Patch 2
improvements on modern detection datasets [57].
Figure 3. Our architecture for pair classification. Dotted lines in-
Finally, our work is related to a line of research on dis- dicate shared weights. ‘conv’ stands for a convolution layer, ‘fc’
criminative patch mining [13, 50, 28, 37, 52, 11], which has stands for a fully-connected one, ‘pool’ is a max-pooling layer, and
emphasized weak supervision as a means of object discov- ‘LRN’ is a local response normalization layer. Numbers in paren-
ery. Like the current work, they emphasize the utility of theses are kernel size, number of outputs, and stride (fc layers have
learning representations of patches (i.e. object parts) before only a number of outputs). The LRN parameters follow [32]. All
learning full objects and scenes, and argue that scene-level conv and fc layers are followed by ReLU nonlinearities, except fc9
labels can serve as a pretext task. For example, [13] trains which feeds into a softmax classifier.
detectors to be sensitive to different geographic locales, but semantic reasoning for each patch separately. When design-
the actual goal is to discover specific elements of architec- ing the network, we followed AlexNet where possible.
tural style. To obtain training examples given an image, we sample
the first patch uniformly, without any reference to image
3. Learning Visual Context Prediction content. Given the position of the first patch, we sample the
We aim to learn an image representation for our pre- second patch randomly from the eight possible neighboring
text task, i.e., predicting the relative position of patches locations as in Figure 2.
within an image. We employ Convolutional Neural Net-
works (ConvNets), which are well known to learn complex 3.1. Avoiding “trivial” solutions
image representations with minimal human feature design. When designing a pretext task, care must be taken to en-
Building a ConvNet that can predict a relative offset for a sure that the task forces the network to extract the desired
pair of patches is, in principle, straightforward: the network information (high-level semantics, in our case), without tak-
must feed the two input patches through several convolu- ing “trivial” shortcuts. In our case, low-level cues like
tion layers, and produce an output that assigns a probability boundary patterns or textures continuing between patches
to each of the eight spatial configurations (Figure 2) that could potentially serve as such a shortcut. Hence, for the
might have been sampled (i.e. a softmax output). Note, relative prediction task, it was important to include a gap
however, that we ultimately wish to learn a feature embed- between patches (in our case, approximately half the patch
ding for individual patches, such that patches which are vi- width). Even with the gap, it is possible that long lines span-
sually similar (across different images) would be close in ning neighboring patches could could give away the correct
the embedding space. answer. Therefore, we also randomly jitter each patch loca-
To achieve this, we use a late-fusion architecture shown tion by up to 7 pixels (see Figure 2).
in Figure 3: a pair of AlexNet-style architectures [32] that However, even these precautions are not enough: we
process each patch separately, until a depth analogous to were surprised to find that, for some images, another triv-
fc6 in AlexNet, after which point the representations are ial solution exists. We traced the problem to an unexpected
fused. For the layers that process only one of the patches, culprit: chromatic aberration. Chromatic aberration arises
weights are tied between both sides of the network, such from differences in the way the lens focuses light at differ-
that the same fc6-level embedding function is computed for ent wavelengths. In some cameras, one color channel (com-
both patches. Because there is limited capacity for joint monly green) is shrunk toward the image center relative to
reasoning—i.e., only two layers receive input from both the others [5, p. 76]. A ConvNet, it turns out, can learn to lo-
patches—we expect the network to perform the bulk of the calize a patch relative to the lens itself (see Section 4.2) sim-

3
Input Random Initialization ImageNet AlexNet Ours

Figure 4. Examples of patch clusters obtained by nearest neighbors. The query patch is shown on the far left. Matches are for three different
features: fc6 features from a random initialization of our architecture, AlexNet fc7 after training on labeled ImageNet, and the fc6 features
learned from our method. Queries were chosen from 1000 randomly-sampled patches. The top group is examples where our algorithm
performs well; for the middle AlexNet outperforms our approach; and for the bottom all three features work well.
ply by detecting the separation between green and magenta we show both results.
(red + blue). Once the network learns the absolute location Implementation Details: We use Caffe [27], and train on
on the lens, solving the relative location task becomes triv- the ImageNet [10] 2012 training set ( 1.3M images), using
ial. To deal with this problem, we experimented with two only the images and discarding the labels. First, we resize
types of pre-processing. One is to shift green and magenta each image to between 150K and 450K total pixels, preserv-
toward gray (‘projection’). Specifically, let a = [−1, 2, −1] ing the aspect-ratio. From these images, we sample patches
(the ’green-magenta color axis’ in RGB space). We then at resolution 96-by-96. For computational efficiency, we
define B = I − aT a/(aaT ), which is a matrix that sub- only sample the patches from a grid like pattern, such that
tracts the projection of a color onto the green-magenta color each sampled patch can participate in as many as 8 separate
axis. We multiply every pixel value by B. An alternative ap- pairings. We allow a gap of 48 pixels between the sampled
proach is to randomly drop 2 of the 3 color channels from patches in the grid, but also jitter the location of each patch
each patch (‘color dropping’), replacing the dropped colors in the grid by −7 to 7 pixels in each direction. We prepro-
with Gaussian noise (standard deviation ∼ 1/100 the stan- cess patches by (1) mean subtraction (2) projecting or drop-
dard deviation of the remaining channel). For qualitative ping colors (see above), and (3) randomly downsampling
results, we show the ‘color-dropping’ approach, but found some patches to as little as 100 total pixels, and then upsam-
both performed similarly; for the object detection results, pling it, to build robustness to pixelation. When applying

4
Image layout
Initial layout, with sampled patches in red is discarded We can recover image layout automatically Cannot recover layout with color removed
Figure 5. We trained a network to predict the absolute (x, y) coordinates of randomly sampled patches. Far left: input image. Center left:
extracted patches. Center right: the location the trained network predicts for each patch shown on the left. Far right: the same result after
our color projection scheme. Note that the far right patches are shown after color projection; the operation’s effect is almost unnoticeable.
simple SGD to train the network, we found that the network fc8 (21) Figure 6. Our architecture for Pascal
predictions would degenerate to a uniform prediction over fc7 (4096) VOC detection. Layers from conv1
the 8 categories, with all activations for fc6 and fc7 col- pool6 (3x3,1024,2) through pool5 are copied from our
lapsing to 0. This meant that the optimization became per- conv6b (1x1,1024,1) patch-based network (Figure 3). The
conv6 (3x3,4096,1) new ’conv6’ layer is created by con-
manently stuck in a saddle point where it ignored the input
pool5 verting the fc6 layer into a convolu-
from the lower layers (which helped minimize the variance

…
tion layer. Kernel sizes, output units,
of the final output), and therefore that the net could not tune and stride are given in parentheses, as
the lower-level features and escape the saddle point. Hence, Image (227x227) in Figure 3.
our final implementation employs batch normalization [26],
without the scale and shift (γ and β), which forces the net- tialization). As shown in Figure 4, the matches returned by
work activations to vary across examples. We also find that our feature often capture the semantic information that we
high momentum values (e.g. .999) accelerated learning. For are after, matching AlexNet in terms of semantic content (in
experiments, we use a ConvNet trained on a K40 GPU for some cases, e.g. the car wheel, our matches capture pose
approximately four weeks. better). Interestingly, in a few cases, random (untrained)
ConvNet also does reasonably well.
4. Experiments
We first demonstrate the network has learned to associate
4.2. Aside: Learnability of Chromatic Aberration
semantically similar patches, using simple nearest-neighbor We noticed in early nearest-neighbor experiments that
matching. We then apply the trained network in two do- some patches retrieved match patches from the same ab-
mains. First, we use the model as “pre-training” for a stan- solute location in the image, regardless of content, be-
dard vision task with only limited training data: specifically, cause those patches displayed similar aberration. To further
we use the VOC 2007 object detection. Second, we evalu- demonstrate this phenomenon, we trained a network to pre-
ate visual data mining, where the goal is to start with an dict the absolute (x, y) coordinates of patches sampled from
unlabeled image collection and discover object classes. Fi- ImageNet. While the overall accuracy of this regressor is
nally, we analyze the performance on the layout prediction not very high, it does surprisingly well for some images:
“pretext task” to see how much is left to learn from this su- for the top 10% of images, the average (root-mean-square)
pervisory signal. error is .255, while chance performance (always predict-
ing the image center) yields a RMSE of .371. Figure 5
4.1. Nearest Neighbors
shows one such result. Applying the proposed “projection”
Recall our intuition that training should assign similar scheme increases the error on the top 10% of images to .321.
representations to semantically similar patches. In this sec-
tion, our goal is to understand which patches our network 4.3. Object Detection
considers similar. We begin by sampling random 96x96 Previous work on the Pascal VOC challenge [15] has
patches, which we represent using fc6 features (i.e. we re- shown that pre-training on ImageNet (i.e., training a Con-
move fc7 and higher shown in Figure 3, and use only one vNet to solve the ImageNet challenge) and then “fine-
of the two stacks). We find nearest neighbors using normal- tuning” the network (i.e. re-training the ImageNet model
ized correlation of these features. Results for some patches for PASCAL data) provides a substantial boost over training
(selected out of 1000 random queries) are shown in Fig- on the Pascal training set alone [21, 2]. However, as far as
ure 4. For comparison, we repeated the experiment using we are aware, no works have shown that unsupervised pre-
fc7 features from AlexNet trained on ImageNet (obtained training on images can provide such a performance boost,
by upsampling the patches), and using fc6 features from our no matter how much data is used.
architecture but without any training (random weights ini- Since we are already using a ConvNet, we adopt the cur-

5
VOC-2007 Test aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
DPM-v5[17] 33.2 60.3 10.2 16.1 27.3 54.3 58.2 23.0 20.0 24.1 26.7 12.7 58.1 48.2 43.2 12.0 21.1 36.1 46.0 43.5 33.7
[8] w/o context 52.6 52.6 19.2 25.4 18.7 47.3 56.9 42.1 16.6 41.4 41.9 27.7 47.9 51.5 29.9 20.0 41.1 36.4 48.6 53.2 38.5
Regionlets[58] 54.2 52.0 20.3 24.0 20.1 55.5 68.7 42.6 19.2 44.2 49.1 26.6 57.0 54.5 43.4 16.4 36.6 37.7 59.4 52.3 41.7
Scratch-R-CNN[2] 49.9 60.6 24.7 23.7 20.3 52.5 64.8 32.9 20.4 43.5 34.2 29.9 49.0 60.4 47.5 28.0 42.3 28.6 51.2 50.0 40.7
Scratch-Ours 52.6 60.5 23.8 24.3 18.1 50.6 65.9 29.2 19.5 43.5 35.2 27.6 46.5 59.4 46.5 25.6 42.4 23.5 50.0 50.6 39.8
Ours-projection 58.4 62.8 33.5 27.7 24.4 58.5 68.5 41.2 26.3 49.5 42.6 37.3 55.7 62.5 49.4 29.0 47.5 28.4 54.7 56.8 45.7
Ours-color-dropping 60.5 66.5 29.6 28.5 26.3 56.1 70.4 44.8 24.6 45.5 45.4 35.1 52.2 60.2 50.0 28.1 46.7 42.6 54.8 58.6 46.3
Ours-Yahoo100m 56.2 63.9 29.8 27.8 23.9 57.4 69.8 35.6 23.7 47.4 43.0 29.5 52.9 62.0 48.7 28.4 45.1 33.6 49.0 55.5 44.2
ImageNet-R-CNN[21] 64.2 69.7 50 41.9 32.0 62.6 71.0 60.7 32.7 58.5 46.5 56.1 60.6 66.8 54.2 31.5 52.8 48.9 57.9 64.7 54.2
K-means-rescale [31] 55.7 60.9 27.9 30.9 12.0 59.1 63.7 47.0 21.4 45.2 55.8 40.3 67.5 61.2 48.3 21.9 32.8 46.9 61.6 51.7 45.6
Ours-rescale [31] 61.9 63.3 35.8 32.6 17.2 68.0 67.9 54.8 29.6 52.4 62.9 51.3 67.1 64.3 50.5 24.4 43.7 54.9 67.1 52.7 51.1
ImageNet-rescale [31] 64.0 69.6 53.2 44.4 24.9 65.7 69.6 69.2 28.9 63.6 62.8 63.9 73.3 64.6 55.8 25.7 50.5 55.4 69.3 56.4 56.5
VGG-K-means-rescale 56.1 58.6 23.3 25.7 12.8 57.8 61.2 45.2 21.4 47.1 39.5 35.6 60.1 61.4 44.9 17.3 37.7 33.2 57.9 51.2 42.4
VGG-Ours-rescale 71.1 72.4 54.1 48.2 29.9 75.2 78.0 71.9 38.3 60.5 62.3 68.1 74.3 74.2 64.8 32.6 56.5 66.4 74.0 60.3 61.7
VGG-ImageNet-rescale 76.6 79.6 68.5 57.4 40.8 79.9 78.4 85.4 41.7 77.0 69.3 80.1 78.6 74.6 70.1 37.5 66.0 67.5 77.4 64.9 68.6
Table 1. Mean Average Precision on VOC-2007.

rent state-of-the-art R-CNN pipeline [21]. R-CNN works ically. The performance after fine-tuning is slightly worse
on object proposals that have been resized to 227x227. Our than Imagenet, but there is still a considerable boost over
algorithm, however, is aimed at 96x96 patches. We find that the from-scratch model.
downsampling the proposals to 96x96 loses too much detail. In the above fine-tuning experiments, we removed the
Instead, we adopt the architecture shown in Figure 6. As batch normalization layers by estimating the mean and vari-
above, we use only one stack from Figure 3. Second, we re- ance of the conv- and fc- layers, and then rescaling the
size the convolution layers to operate on inputs of 227x227. weights and biases such that the outputs of the conv and fc
This results in a pool5 that is 7x7 spatially, so we must con- layers have mean 0 and variance 1 for each channel. Recent
vert the previous fc6 layer into a convolution layer (which work [31], however, has shown empirically that the scal-
we call conv6) following [38]. Note our conv6 layer has ing of the weights prior to finetuning can have a strong im-
4096 channels, where each unit connects to a 3x3 region pact on test-time performance, and argues that our previous
of pool5. A conv layer with 4096 channels would be quite method of removing batch normalization leads too poorly
expensive to connect directly to a 4096-dimensional fully- scaled weights. They propose a simple way to rescale the
connected layer. Hence, we add another layer after conv6 network’s weights without changing the function that the
(called conv6b), using a 1x1 kernel, which reduces the di- network computes, such that the network behaves better
mensionality to 1024 channels (and adds a nonlinearity). during finetuning. Results using this technique are shown
Finally, we feed the outputs through a pooling layer to a in Table 1. Their approach gives a boost to all methods, but
fully connected layer (fc7) which in turn connects to a fi- gives less of a boost to the already-well-scaled ImageNet-
nal fc8 layer which feeds into the softmax. We fine-tune category model. Note that for this comparison, we used
this network according to the procedure described in [21] fast-rcnn [20] to save compute time, and we discarded all
(conv6b, fc7, and fc8 start with random weights), and use pre-trained fc-layers from our model, re-initializing them
fc7 as the final representation. We do not use bounding- with the K-means procedure of [31] (which was used to ini-
box regression, and take the appropriate results from [21] tialize all layers in the “K-means-rescale” row). Hence, the
and [2]. structure of the network during fine-tuning and testing was
Table 1 shows our results. Our architecture trained from the same for all models.
scratch (random initialization) performs slightly worse than Considering that we have essentially infinite data to train
AlexNet trained from scratch. However, our pre-training our model, we might expect that our algorithm should also
makes up for this, boosting the from-scratch number by provide a large boost to higher-capacity models such as
6% MAP, and outperforms an AlexNet-style model trained VGG [49]. To test this, we trained a model following the 16-
from scratch on Pascal by over 5%. This puts us about 8% layer structure of [49] for the convolutional layers on each
behind the performance of R-CNN pre-trained with Ima- side of the network (the final fc6-fc9 layers were the same
geNet labels [21]. This is the best result we are aware of as in Figure 3). We again fine-tuned the representation on
on VOC 2007 without using labels outside the dataset. We Pascal VOC using fast-rcnn, by transferring only the conv
ran additional baselines initialized with batch normaliza- layers, again following Krähenbühl et al. [31] to re-scale
tion, but found they performed worse than the ones shown. the transferred weights and initialize the rest. As a base-
To understand the effect of various dataset biases [55], line, we performed a similar experiment with the ImageNet-
we also performed a preliminary experiment pre-training pretrained 16-layer model of [49] (though we kept pre-
on a randomly-selected 2M subset of the Yahoo/Flickr 100- trained fc layers rather than re-initializing them), and also
million Dataset [54], which was collected entirely automat- by initializing the entire network with K-means [31]. Train-

6
Lower Better Higher Better texture has no global layout.
Mean Median 11.25◦ 22.5◦ 30◦ To implement this, we first sample a constellation of
Scratch 38.6 26.5 33.1 46.8 52.5 four adjacent patches from an image (we use four to reduce
Unsup. Tracking [57] 34.2 21.9 35.7 50.6 57.0 the likelihood of a matching spatial arrangement happen-
Ours 33.2 21.3 36.0 51.2 57.8 ing by chance). We find the top 100 images which have
ImageNet Labels 33.3 20.8 36.7 51.7 58.1 the strongest matches for all four patches, ignoring spatial
layout. We then use a type of geometric verification [7]
Table 2. Accuracy on NYUv2.
to filter away the images where the four matches are not
geometrically consistent. Because our features are more
ing time was considerably longer—about 8 weeks on a Titan
semantically-tuned, we can use a much weaker type of ge-
X GPU—but the the network outperformed the AlexNet-
ometric verification than [7]. Finally, we rank the different
style model by a considerable margin. Note the model ini-
constellations by counting the number of times the top 100
tialized with K-means performed roughly on par with the
matches geometrically verify.
analogous AlexNet model, suggesting that most of the boost
came from the unsupervised pre-training. Implementation Details: To compute whether a set of four
matched patches geometrically verifies, we first compute
4.4. Geometry Estimation the best-fitting square S to the patch centers (via least-
The results of Section 4.3 suggest that our representa- squares), while constraining that side of S be between 2/3
tion is sensitive to objects, even though it was not originally and 4/3 of the average side of the patches. We then compute
trained to find them. This raises the question: Does our the squared error of the patch centers relative to S (normal-
representation extract information that is useful for other, ized by dividing the sum-of-squared-errors by the square of
non-object-based tasks? To find out, we fine-tuned our net- the side of S). The patch is geometrically verified if this
work to perform the surface normal estimation on NYUv2 normalized squared error is less than 1. When sampling
proposed in Fouhey et al. [19], following the finetuning pro- patches do not use any of the data augmentation preprocess-
cedure of Wang et al. [57] (hence, we compare directly to ing steps (e.g. downsampling). We use the color-dropping
the unsupervised pretraining results reported there). We version of our network.
used the color-dropping network, restructuring the fully- We applied the described mining algorithm to Pascal
connected layers as in Section 4.3. Surprisingly, our re- VOC 2011, with no pre-filtering of images and no addi-
sults are almost equivalent to those obtained using a fully- tional labels. We show some of the resulting patch clusters
labeled ImageNet model. One possible explanation for this in Figure 7. The results are visually comparable to our pre-
is that the ImageNet categorization task does relatively little vious work [12], although we discover a few objects that
to encourage a network to pay attention to geometry, since were not found in [12], such as monitors, birds, torsos, and
the geometry is largely irrelevant once an object is identi- plates of food. The discovery of birds and torsos—which
fied. Further evidence of this can be seen in seventh row of are notoriously deformable—provides further evidence for
Figure 4: the nearest neighbors for ImageNet AlexNet are the invariances our algorithm has learned. We believe we
all car wheels, but they are not aligned well with the query have covered all objects discovered in [12], with the ex-
patch. ception of (1) trusses and (2) railroad tracks without trains
4.5. Visual Data Mining (though we do discover them with trains). For some objects
Visual data mining [44, 13, 50, 45], or unsupervised ob- like dogs, we discover more variety and rank the best ones
ject discovery [51, 47, 22], aims to use a large image col- higher. Furthermore, many of the clusters shown in [12] de-
lection to discover image fragments which happen to depict pict gratings (14 out of the top 100), whereas none of ours
the same semantic objects. Applications include dataset vi- do (though two of our top hundred depict diffuse gradients).
sualization, content-based retrieval, and tasks that require As in [12], we often re-discover the same object multiple
relating visual data to other unstructured information (e.g. times with different viewpoints, which accounts for most of
GPS coordinates [13]). For automatic data mining, our the gaps between ranks in Figure 7. The main disadvan-
approach from section 4.1 is inadequate: although object tages of our algorithm relative to [12] are 1) some loss of
patches match to similar objects, textures match just as purity, and 2) that we cannot currently determine an object
readily to similar textures. Suppose, however, that we sam- mask automatically (although one could imagine dynami-
pled two non-overlapping patches from the same object. cally adding more sub-patches to each proposed object).
Not only would the nearest neighbor lists for both patches To ensure that our algorithm has not simply learned an
share many images, but within those images, the nearest object-centric representation due to the various biases [55]
neighbors would be in roughly the same spatial configura- in ImageNet, we also applied our algorithm to 15,000 Street
tion. For texture regions, on the other hand, the spatial con- View images from Paris (following [13]). The results in
figurations of the neighbors would be random, because the Figure 8 show that our representation captures scene lay-

7
1 88

4 121

7 131

12 142

25 179

29 187

30 229 229

35 232 232

46 240

70 256

71 351

73 464
Figure 7. Object clusters discovered by our algorithm. The number beside each cluster indicates its ranking, determined by the fraction of
the top matches that geometrically verified. For all clusters, we show the raw top 7 matches that verified geometrically. The full ranking is
available on our project webpage.
out and architectural elements. For this experiment, to rank ters compared to [12]—which is not very surprising consid-
clusters, we use the de-duplication procedure originally pro- ering that our validation procedure is considerably simpler.
posed in [13]. Implementation Details: We initialize 16,384 clusters by
4.5.1 Quantitative Results sampling patches, mining nearest neighbors, and geomet-
ric verification ranking as described above. The resulting
As part of the qualitative evaluation, we applied our algo- clusters are highly redundant. The cluster selection proce-
rithm to the subset of Pascal VOC 2007 selected in [50]: dure of [12] relies on a likelihood ratio score that is cali-
specifically, those containing at least one instance of bus, brated across clusters, which is not available to us. To se-
dining table, motorbike, horse, sofa, or train, and evaluate lect clusters, we first select the top 10 geometrically-verified
via a purity coverage curve following [12]. We select 1000 neighbors for each cluster. Then we iteratively select the
sets of 10 images each for evaluation. The evaluation then highest-ranked cluster that contributes at least one image to
sorts the sets by purity: the fraction of images in the clus- our coverage score. When we run out of images that aren’t
ter containing the same category. We generate the curve by included in the coverage score, we choose clusters to cover
walking down the ranking. For each point on the curve, we each image at least twice, and then three times, and so on.
plot average purity of all sets up to a given point in the rank-
ing against coverage: the fraction of images in the dataset 4.6. Accuracy on the Relative Prediction Task Task
that are contained in at least one of the sets up to that point.
As shown in Figure 9, we have gained substantially in terms Can we improve the representation by further training
of coverage, suggesting increased invariance for our learned on our relative prediction pretext task? To find out, we
feature. However, we have also lost some highly-pure clus- briefly analyze classification performance on pretext task

8
age, the task is almost impossible. Might the task be easiest
1: for image regions corresponding to objects? To test this
hypothesis, we repeated our experiment using only patches
4: sampled from within Pascal object ground-truth bounding
boxes. We select only those boxes that are at least 240 pix-
els on each side, and which are not labeled as truncated,
5: occluded, or difficult. Surprisingly, this gave essentially the
same accuracy of 39.2%, and a similar experiment only on
cars yielded 45.6% accuracy. So, while our algorithm is
6: sensitive to objects, it is almost as sensitive to the layout of
the rest of the image.
Acknowledgements We thank Xiaolong Wang and Pulkit Agrawal for
13: help with baselines, Berkeley and CMU vision group members for many
fruitful discussions, and Jitendra Malik for putting gelato on the line. This
work was partially supported by Google Graduate Fellowship to CD, ONR
18: MURI N000141010934, Intel research grant, an NVidia hardware grant,
and an Amazon Web Services grant.

42: References
[1] E. H. Adelson. On seeing stuff: the perception of materials by hu-
53: mans and machines. In Photonics West 2001-Electronic Imaging,
2001. 2
Figure 8. Clusters discovered and automatically ranked via our al- [2] P. Agrawal, R. Girshick, and J. Malik. Analyzing the performance of
gorithm (§ 4.5) from the Paris Street View dataset. multilayer neural networks for object recognition. In ECCV. 2014.
5, 6

1.0
Purity-Coverage for Proposed Objects [3] R. K. Ando and T. Zhang. A framework for learning predictive struc-
tures from multiple tasks and unlabeled data. JMLR, 2005. 1
0.9 [4] Y. Bengio, E. Thibodeau-Laufer, G. Alain, and J. Yosinski. Deep
0.8 generative stochastic networks trainable by backprop. ICML, 2014.
2
0.7
[5] D. Brewster and A. D. Bache. Treatise on optics. Blanchard and Lea,
0.6 1854. 3
Purity

0.5 [6] O. Chum, M. Perdoch, and J. Matas. Geometric min-hashing: Find-

ing a (thick) needle in a haystack. In CVPR, 2009. 3
0.4
Visual Words .63 (.37) [7] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total
0.3 Russel et al. .66 (.38) recall: Automatic query expansion with a generative feature model
HOG Kmeans .70 (.40)
0.2 Singh et al. .83 (.47)
for object retrieval. In ICCV, 2007. 7
0.1 Doersch et al. .83 (.48) [8] R. G. Cinbis, J. Verbeek, and C. Schmid. Segmentation driven object
Our Approach .87 (.48) detection with Fisher vectors. In ICCV, 2013. 6
0.0
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 [9] R. Collobert and J. Weston. A unified architecture for natural lan-
Coverage guage processing: Deep neural networks with multitask learning. In
Figure 9. Purity vs coverage for objects discovered on a subset of ICML, 2008. 1, 2
Pascal VOC 2007. The numbers in the legend indicate area under [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Im-
the curve (AUC). In parentheses is the AUC up to a coverage of .5. agenet: A large-scale hierarchical image database. In CVPR, 2009.
4
itself. We sampled 500 random images from Pascal VOC [11] C. Doersch, A. Gupta, and A. A. Efros. Mid-level visual element
2007, sampled 256 pairs of patches from each, and clas- discovery as discriminative mode seeking. In NIPS, 2013. 3
[12] C. Doersch, A. Gupta, and A. A. Efros. Context as supervisory sig-
sified them into the eight relative-position categories from
nal: Discovering objects with predictable context. In ECCV. 2014.
Figure 2. This gave an accuracy of 38.4%, where chance 2, 7, 8
performance is 12.5%, suggesting that the pretext task is [13] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros. What
quite hard (indeed, human performance on the task is simi- makes Paris look like Paris? SIGGRAPH, 2012. 3, 7
lar). To measure possible overfitting, we also ran the same [14] J. Domke, A. Karapurkar, and Y. Aloimonos. Who killed the directed
model? In CVPR, 2008. 2
experiment on ImageNet, which is the dataset we used for [15] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisser-
training. The network was 39.5% accurate on the training man. The pascal visual object classes (voc) challenge. IJCV, 2010.
set, and 40.3% accurate on the validation set (which the net- 5
work never saw during training), suggesting that little over- [16] A. Faktor and M. Irani. clustering by composition–unsupervised dis-
covery of image categories. In ECCV. 2012. 3
fitting has occurred.
[17] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Ob-
One possible reason why the pretext task is so difficult ject detection with discriminatively trained part-based models. PAMI,
is because, for a large fraction of patches within each im- 2010. 6

9
[18] P. Földiák. Learning invariance from transformation sequences. Neu- [46] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropa-
ral Computation, 1991. 3 gation and approximate inference in deep generative models. ICML,
[19] D. F. Fouhey, A. Gupta, and M. Hebert. Data-driven 3D primitives 2014. 2
for single image understanding. In ICCV, 2013. 7 [47] B. C. Russell, W. T. Freeman, A. A. Efros, J. Sivic, and A. Zisserman.
[20] R. Girshick. Fast r-cnn. In ICCV, 2015. 6 Using multiple segmentations to discover objects and their extent in
[21] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hier- image collections. In CVPR, 2006. 2, 7
archies for accurate object detection and semantic segmentation. In [48] R. Salakhutdinov and G. E. Hinton. Deep boltzmann machines. In
CVPR, 2014. 1, 5, 6 ICAIS, 2009. 2
[22] K. Grauman and T. Darrell. Unsupervised learning of categories from [49] K. Simonyan and A. Zisserman. Very deep convolutional networks
sets of partially matching image features. In CVPR, 2006. 3, 7 for large-scale image recognition. CoRR, 2014. 6
[23] K. Heath, N. Gelfand, M. Ovsjanikov, M. Aanjaneya, and L. J. [50] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of
Guibas. Image webs: Computing and exploiting connectivity in im- mid-level discriminative patches. In ECCV, 2012. 3, 7, 8
age collections. In CVPR, 2010. 3 [51] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman.
[24] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for Discovering objects and their location in images. In ICCV, 2005. 2,
deep belief nets. Neural computation, 2006. 2 7
[25] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The “wake- [52] J. Sun and J. Ponce. Learning discriminative part detectors for image
sleep” algorithm for unsupervised neural networks. Proceedings. classification and cosegmentation. In ICCV, 2013. 3
IEEE, 1995. 2 [53] L. Theis and M. Bethge. Generative image modeling using spatial
[26] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep lstms. In NIPS, 2015. 2
network training by reducing internal covariate shift. arXiv preprint [54] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni,
arXiv:1502.03167, 2015. 5 D. Poland, D. Borth, and L.-J. Li. The new data and new challenges
[27] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, in multimedia research. arXiv preprint arXiv:1503.01817, 2015. 6
S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for [55] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR,
fast feature embedding. In ACM-MM, 2014. 4 2011. 6, 7
[28] M. Juneja, A. Vedaldi, C. V. Jawahar, and A. Zisserman. Blocks that [56] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extract-
shout: Distinctive parts for scene classification. In CVPR, 2013. 3 ing and composing robust features with denoising autoencoders. In
[29] G. Kim, C. Faloutsos, and M. Hebert. Unsupervised modeling of ICML, 2008. 2
object categories using link analysis techniques. In CVPR, 2008. 3 [57] X. Wang and A. Gupta. Unsupervised learning of visual representa-
[30] D. P. Kingma and M. Welling. Auto-encoding variational bayes. tions using videos. In ICCV, 2015. 3, 7
2014. 2 [58] X. Wang, M. Yang, S. Zhu, and Y. Lin. Regionlets for generic object
[31] P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell. Data- detection. In ICCV, 2013. 6
dependent initializations of convolutional neural networks. arXiv [59] L. Wiskott and T. J. Sejnowski. Slow feature analysis:unsupervised
preprint arXiv:1511.06856, 2015. 6 learning of invariances. Neural Computation, 2002. 3
[32] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification
with deep convolutional neural networks. In NIPS, 2012. 1, 3
[33] H. Larochelle and I. Murray. The neural autoregressive distribution
estimator. In AISTATS, 2011. 2
[34] Q. V. Le. Building high-level features using large scale unsupervised
learning. In ICASSP, 2013. 2
[35] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding
algorithms. In NIPS, 2006. 2
[36] Y. J. Lee and K. Grauman. Foreground focus: Unsupervised learning
from partially matching images. IJCV, 2009. 3
[37] Q. Li, J. Wu, and Z. Tu. Harvesting mid-level visual concepts from
large-scale internet images. In CVPR, 2013. 3
[38] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks
for semantic segmentation. arXiv preprint arXiv:1411.4038, 2014. 6
[39] T. Malisiewicz and A. Efros. Beyond categories: The visual memex
model for reasoning about object relationships. In NIPS, 2009. 2
[40] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Dis-
tributed representations of words and phrases and their composition-
ality. In NIPS, 2013. 1, 2
[41] D. Okanohara and J. Tsujii. A discriminative language model with
pseudo-negative samples. In ACL, 2007. 1, 2
[42] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive
field properties by learning a sparse code for natural images. Nature,
1996. 2
[43] N. Payet and S. Todorovic. From a set of shapes to object discovery.
In ECCV. 2010. 3
[44] T. Quack, B. Leibe, and L. Van Gool. World-scale mining of objects
and events from community photo collections. In CIVR, 2008. 3, 7
[45] K. Rematas, B. Fernando, F. Dellaert, and T. Tuytelaars. Dataset
fingerprints: Exploring image collections through data mining. In
CVPR, 2015. 3, 7

LCS 323 - Final Assessment - (3812303)
No ratings yet
LCS 323 - Final Assessment - (3812303)
13 pages
Father-Child Play: A Systematic Review of Its Frequency, T Characteristics and Potential Impact On Children's Development
No ratings yet
Father-Child Play: A Systematic Review of Its Frequency, T Characteristics and Potential Impact On Children's Development
17 pages
Post Appendectomy
No ratings yet
Post Appendectomy
3 pages
Context Encoders Feature Learning by Inpainting
No ratings yet
Context Encoders Feature Learning by Inpainting
9 pages
Pathak Context Encoders Feature CVPR 2016 Paper
No ratings yet
Pathak Context Encoders Feature CVPR 2016 Paper
9 pages
Context Encoders: Feature Learning by Inpainting
No ratings yet
Context Encoders: Feature Learning by Inpainting
12 pages
Self_Supervised_Learning
No ratings yet
Self_Supervised_Learning
5 pages
S - S L C - C G A N: EMI Upervised Earning With Ontext Onditional Enerative Dversarial Etworks
No ratings yet
S - S L C - C G A N: EMI Upervised Earning With Ontext Onditional Enerative Dversarial Etworks
10 pages
Weakly Supervised Contrastive Learning
No ratings yet
Weakly Supervised Contrastive Learning
10 pages
Cross Training
No ratings yet
Cross Training
11 pages
Dinov 2
No ratings yet
Dinov 2
31 pages
L U V U C - P N 3DM: Earning From Nlabelled Ideos Sing ON Trastive Redictive Eural Apping
No ratings yet
L U V U C - P N 3DM: Earning From Nlabelled Ideos Sing ON Trastive Redictive Eural Apping
19 pages
Paper SimAN
No ratings yet
Paper SimAN
10 pages
Self-Supervised Learning: Pretext Tasks
No ratings yet
Self-Supervised Learning: Pretext Tasks
3 pages
Unsupervised Embedding Learning Via Invariant and Spreading Instance Feature
No ratings yet
Unsupervised Embedding Learning Via Invariant and Spreading Instance Feature
10 pages
Learning Transferable Visual Models From Natural Language Supervision
No ratings yet
Learning Transferable Visual Models From Natural Language Supervision
48 pages
SCAN: Learning To Classify Images Without Labels: 1 Introduction and Prior Work
No ratings yet
SCAN: Learning To Classify Images Without Labels: 1 Introduction and Prior Work
26 pages
Revisiting Self-Supervised Visual Representation Learning PDF
No ratings yet
Revisiting Self-Supervised Visual Representation Learning PDF
10 pages
Learning Transferable Visual Models From Natural Language Supervision
No ratings yet
Learning Transferable Visual Models From Natural Language Supervision
47 pages
Telling Stories For Common Sense Zero-Shot Action Recognition
No ratings yet
Telling Stories For Common Sense Zero-Shot Action Recognition
16 pages
2206.10207v3
No ratings yet
2206.10207v3
13 pages
SSL 18 Mar 23 PDF
No ratings yet
SSL 18 Mar 23 PDF
50 pages
From Text to Mask Localizing Entities Using the
No ratings yet
From Text to Mask Localizing Entities Using the
43 pages
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
No ratings yet
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
23 pages
Revisting
No ratings yet
Revisting
13 pages
Unsupervised Embedding Learning Via Invariant and Spreading Instance Feature
No ratings yet
Unsupervised Embedding Learning Via Invariant and Spreading Instance Feature
11 pages
Carlucci Domain Generalization by Solving Jigsaw Puzzles CVPR 2019 Paper
No ratings yet
Carlucci Domain Generalization by Solving Jigsaw Puzzles CVPR 2019 Paper
10 pages
paper4
No ratings yet
paper4
12 pages
Recent Advances in Deep Learning For Object Detection
No ratings yet
Recent Advances in Deep Learning For Object Detection
26 pages
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
No ratings yet
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
13 pages
An Image Is Worth More Than 16x16 Patches
No ratings yet
An Image Is Worth More Than 16x16 Patches
23 pages
A Survey On Segment Anything Model (Sam)
No ratings yet
A Survey On Segment Anything Model (Sam)
20 pages
Generative Pretraining From Pixels
No ratings yet
Generative Pretraining From Pixels
13 pages
2019 Researchstatement
No ratings yet
2019 Researchstatement
6 pages
Jia Et Al. - 2021 - Scaling Up Visual and Vision-Language Representati
No ratings yet
Jia Et Al. - 2021 - Scaling Up Visual and Vision-Language Representati
11 pages
Self-Supervised Representation Learning From Temporal Ordering of Automated Driving Sequences
No ratings yet
Self-Supervised Representation Learning From Temporal Ordering of Automated Driving Sequences
13 pages
N-19377
No ratings yet
N-19377
8 pages
U R L D C G A N: Nsupervised Epresentation Earning With EEP Onvolutional Enerative Dversarial Etworks
No ratings yet
U R L D C G A N: Nsupervised Epresentation Earning With EEP Onvolutional Enerative Dversarial Etworks
15 pages
Learning To Compare Image Patches Via Convolutional Neural Networks
No ratings yet
Learning To Compare Image Patches Via Convolutional Neural Networks
9 pages
DL Tutorial NIPS2015 PDF
No ratings yet
DL Tutorial NIPS2015 PDF
133 pages
DW & Caption Generator - Paper 1
No ratings yet
DW & Caption Generator - Paper 1
6 pages
Technologies 09 00002 v2
No ratings yet
Technologies 09 00002 v2
22 pages
A Guide To Image Captioning. How Deep Learning Helps in Captioning
No ratings yet
A Guide To Image Captioning. How Deep Learning Helps in Captioning
17 pages
Image Captioning
No ratings yet
Image Captioning
17 pages
Research Paper 5
No ratings yet
Research Paper 5
10 pages
Generative Pretraining From Pixels V2
No ratings yet
Generative Pretraining From Pixels V2
12 pages
Islam 等 - 2020 - How Much Position Information Do Convolutional Neu
No ratings yet
Islam 等 - 2020 - How Much Position Information Do Convolutional Neu
11 pages
Elative Representations Enable Zero Shot Latent Space Communication
No ratings yet
Elative Representations Enable Zero Shot Latent Space Communication
20 pages
Self-Supervised_Contrastive_Representation_Learning_for_Semi-Supervised_Time-Series_Classification
No ratings yet
Self-Supervised_Contrastive_Representation_Learning_for_Semi-Supervised_Time-Series_Classification
15 pages
Yerxa et al - Efficient Coding of Natural Images using Maximum Manifold Capacity Representations
No ratings yet
Yerxa et al - Efficient Coding of Natural Images using Maximum Manifold Capacity Representations
26 pages
Xian Latent Embeddings For CVPR 2016 Paper
No ratings yet
Xian Latent Embeddings For CVPR 2016 Paper
9 pages
Tian Learning Vision From Models Rivals Learning Vision From Data CVPR 2024 Paper
No ratings yet
Tian Learning Vision From Models Rivals Learning Vision From Data CVPR 2024 Paper
12 pages
Beery_Synthetic_Examples_Improve_Generalization_for_Rare_Classes_WACV_2020_paper
No ratings yet
Beery_Synthetic_Examples_Improve_Generalization_for_Rare_Classes_WACV_2020_paper
11 pages
TCS Ocr
No ratings yet
TCS Ocr
39 pages
bag of words
No ratings yet
bag of words
11 pages
OD Trans Christopher-Lang2022 Q2
No ratings yet
OD Trans Christopher-Lang2022 Q2
15 pages
Jia Bin Huang Research Statement
No ratings yet
Jia Bin Huang Research Statement
6 pages
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
No ratings yet
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
21 pages
Deep Learning Important Studies
No ratings yet
Deep Learning Important Studies
6 pages
2406.15955v3
No ratings yet
2406.15955v3
37 pages
IT5409 - Ch7 - Part2 - Object Recognition - v2 - 4pages
No ratings yet
IT5409 - Ch7 - Part2 - Object Recognition - v2 - 4pages
38 pages
Perceptrons: Fundamentals and Applications for The Neural Building Block
From Everand
Perceptrons: Fundamentals and Applications for The Neural Building Block
Fouad Sabry
No ratings yet
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
Bodollo Math 3rd Quarter DLP Co 1
No ratings yet
Bodollo Math 3rd Quarter DLP Co 1
8 pages
VQAR 1 and 2
No ratings yet
VQAR 1 and 2
3 pages
Jennings Unit Plan
No ratings yet
Jennings Unit Plan
9 pages
ETHICS Module 3
No ratings yet
ETHICS Module 3
23 pages
Novelty: Ibnu Siswanto, M.PD.,PH.D
No ratings yet
Novelty: Ibnu Siswanto, M.PD.,PH.D
15 pages
As-Is Scenario Mapactivity 3
No ratings yet
As-Is Scenario Mapactivity 3
11 pages
Appendix I: Questionnaire: TOPIC: High Job Performance (To Be Filled in by The Employees of Bank A) Dear Sir/ Madam
No ratings yet
Appendix I: Questionnaire: TOPIC: High Job Performance (To Be Filled in by The Employees of Bank A) Dear Sir/ Madam
3 pages
Cross Cultural Communication
No ratings yet
Cross Cultural Communication
3 pages
Sohcahtoa Lesson Plan
100% (1)
Sohcahtoa Lesson Plan
3 pages
ҚМЖ Excel 7 term 4
No ratings yet
ҚМЖ Excel 7 term 4
15 pages
DLL - All Subjects 2 - Q4 - W6 - D1
No ratings yet
DLL - All Subjects 2 - Q4 - W6 - D1
8 pages
Identifying and Evaluating Claims
No ratings yet
Identifying and Evaluating Claims
19 pages
School Administrator Views
No ratings yet
School Administrator Views
4 pages
Content Based ML Repo
No ratings yet
Content Based ML Repo
36 pages
Strategic Change - 2021 - French - Personal finance apps and low‐income households
No ratings yet
Strategic Change - 2021 - French - Personal finance apps and low‐income households
9 pages
List of Departmental Files/Documents
No ratings yet
List of Departmental Files/Documents
2 pages
Conflict Notes
100% (1)
Conflict Notes
6 pages
Feb 4-8 DLL TLE 101
100% (3)
Feb 4-8 DLL TLE 101
2 pages
An Exploration of Chinese English Language Learners' Foreign Language Anxiety
No ratings yet
An Exploration of Chinese English Language Learners' Foreign Language Anxiety
9 pages
The+Impact+of+Artificial+Intelligence+on+Educational+Leadership+Challenges+and+Opportunities+in+School+Management
No ratings yet
The+Impact+of+Artificial+Intelligence+on+Educational+Leadership+Challenges+and+Opportunities+in+School+Management
12 pages
MMD 30105 Design Exercise Brief
No ratings yet
MMD 30105 Design Exercise Brief
4 pages
Reflection - Punctuation For Blog
No ratings yet
Reflection - Punctuation For Blog
3 pages
Greetings and Introductions
No ratings yet
Greetings and Introductions
52 pages
1.0 Research Methods For Architecture Module 1
No ratings yet
1.0 Research Methods For Architecture Module 1
5 pages
Handout - Challenges For Novice Leaders Facing Todays Issues in Administration - Xyna Jobyleen Nuyles
No ratings yet
Handout - Challenges For Novice Leaders Facing Todays Issues in Administration - Xyna Jobyleen Nuyles
9 pages
DLL, Music 8, Week 8, Quarter 1
No ratings yet
DLL, Music 8, Week 8, Quarter 1
4 pages
Group Therapy
100% (3)
Group Therapy
71 pages

4.1 - Unsupervised Visual Representation Learning by Context Prediction

Uploaded by

4.1 - Unsupervised Visual Representation Learning by Context Prediction

Uploaded by

Unsupervised Visual Representation Learning by Context Prediction

Carl Doersch1,2 Abhinav Gupta1 Alexei A. Efros2

This work explores the use of spatial context as a source

0.5 [6] O. Chum, M. Perdoch, and J. Matas. Geometric min-hashing: Find-

You might also like