4.1 - Unsupervised Visual Representation Learning by Context Prediction
Abstract Example:
occur in a specific spatial configuration (if there is no specific configuration of the parts, then it is “stuff” [1]). We present a ConvNet-based approach to learn a visual representation from this task. We demonstrate that the resulting visual representation is good for both object detection, providing a significant boost on PASCAL VOC 2007 compared to learning from scratch, as well as for unsupervised object discovery / visual data mining. This means, surprisingly, that our representation generalizes across images, despite being trained using an objective function that operates on a single image at a time. That is, instance-level supervision appears to improve performance on category-level tasks.
2. Related Work
One way to think of a good image representation is as the latent variables of an appropriate generative model. An ideal generative model of natural images would both generate images according to their natural distribution, and be concise in the sense that it would seek common causes for different images and share information between them. However, inferring the latent structure given an image is intractable for even relatively simple models. To deal with these computational issues, a number of works, such as the wake-sleep algorithm [25], contrastive divergence [24], deep Boltzmann machines [48], and variational Bayesian methods [30, 46], use sampling to perform approximate inference. Generative models have shown promising performance on smaller datasets such as handwritten digits [25, 24, 48, 30, 46], but none have proven effective for high-resolution natural images.

Unsupervised representation learning can also be formulated as learning an embedding (i.e. a feature vector for each image) where images that are semantically similar are close, while semantically different ones are far apart. One way to build such a representation is to create a supervised “pretext” task such that an embedding which solves the task will also be useful for other real-world tasks. For example, denoising autoencoders [56, 4] use reconstruction from noisy data as a pretext task: the algorithm must connect images to other images with similar objects to tell the difference between noise and signal. Sparse autoencoders also use reconstruction as a pretext task, along with a sparsity penalty [42], and such autoencoders may be stacked to form a deep representation [35, 34] (however, only [34] was successfully applied to full-sized images, requiring a million CPU hours to discover just three objects). We believe that current reconstruction-based algorithms struggle with low-level phenomena, like stochastic textures, making it hard to even measure whether a model is generating well.

Another pretext task is “context prediction.” A strong tradition for this kind of task already exists in the text domain, where “skip-gram” [40] models have been shown to generate useful word representations. The idea is to train a model (e.g. a deep network) to predict, from a single word, the n preceding and n succeeding words. In principle, similar reasoning could be applied in the image domain, a kind of visual “fill in the blank” task, but, again, one runs into the problem of determining whether the predictions themselves are correct [12], unless one cares about predicting only very low-level features [14, 33, 53]. To address this, [39] predicts the appearance of an image region by consensus voting of the transitive nearest neighbors of its surrounding regions. Our previous work [12] explicitly formulates a statistical test to determine whether the data is better explained by a prediction or by a low-level null hypothesis model.

The key problem that these approaches must address is that predicting pixels is much harder than predicting words, due to the huge variety of pixels that can arise from the same semantic object. In the text domain, one interesting idea is to switch from a pure prediction task to a discrimination task [41, 9]. In this case, the pretext task is to discriminate true snippets of text from the same snippets where a word has been replaced at random. A direct extension of this to 2D might be to discriminate between real images vs. images where one patch has been replaced by a random patch from elsewhere in the dataset. However, such a task would be trivial, since discriminating low-level color statistics and lighting would be enough. To make the task harder and more high-level, in this paper we instead classify between multiple possible configurations of patches sampled from the same image, which means they will share lighting and color statistics, as shown in Figure 2.

[Figure 2 shows an example input X = (a pair of patches) with label Y = 3.] Figure 2. The algorithm receives two patches in one of these eight possible spatial arrangements, without any context, and must then classify which configuration was sampled.

Another line of work in unsupervised learning from images aims to discover object categories using hand-crafted features and various forms of clustering (e.g. [51, 47] learned a generative model over bags of visual words). Such representations lose shape information, and will readily discover clusters of, say, foliage.
A few subsequent works have attempted to use representations more closely tied to shape [36, 43], but relied on contour extraction, which is difficult in complex images. Many other approaches [22, 29, 16] focus on defining similarity metrics which can be used in more standard clustering algorithms; [45], for instance, re-casts the problem as frequent itemset mining. Geometry may also be used for verifying links between images [44, 6, 23], although this can fail for deformable objects.

Video can provide another cue for representation learning. For most scenes, the identity of objects remains unchanged even as appearance changes with time. This kind of temporal coherence has a long history in the visual learning literature [18, 59], and contemporaneous work shows strong improvements on modern detection datasets [57].

Finally, our work is related to a line of research on discriminative patch mining [13, 50, 28, 37, 52, 11], which has emphasized weak supervision as a means of object discovery. Like the current work, they emphasize the utility of learning representations of patches (i.e. object parts) before learning full objects and scenes, and argue that scene-level labels can serve as a pretext task. For example, [13] trains detectors to be sensitive to different geographic locales, but the actual goal is to discover specific elements of architectural style.

3. Learning Visual Context Prediction

We aim to learn an image representation for our pretext task, i.e., predicting the relative position of patches within an image. We employ Convolutional Neural Networks (ConvNets), which are well known to learn complex image representations with minimal human feature design. Building a ConvNet that can predict a relative offset for a pair of patches is, in principle, straightforward: the network must feed the two input patches through several convolution layers, and produce an output that assigns a probability to each of the eight spatial configurations (Figure 2) that might have been sampled (i.e. a softmax output). Note, however, that we ultimately wish to learn a feature embedding for individual patches, such that patches which are visually similar (across different images) would be close in the embedding space.

To achieve this, we use the late-fusion architecture shown in Figure 3: a pair of AlexNet-style architectures [32] that process each patch separately, until a depth analogous to fc6 in AlexNet, after which point the representations are fused. For the layers that process only one of the patches, weights are tied between both sides of the network, such that the same fc6-level embedding function is computed for both patches. Because there is limited capacity for joint reasoning—i.e., only two layers receive input from both patches—we expect the network to perform the bulk of the semantic reasoning for each patch separately. When designing the network, we followed AlexNet where possible.

[Figure 3 shows two towers (Patch 1 and Patch 2), each: conv1 (11x11,96,4) → pool1 (3x3,96,2) → LRN1 → conv2 (5x5,384,2) → pool2 (3x3,384,2) → LRN2 → conv3 (3x3,384,1) → conv4 (3x3,384,1) → conv5 (3x3,256,1) → pool5 (3x3,256,2) → fc6 (4096), fused into fc7 (4096) → fc8 (4096) → fc9 (8).] Figure 3. Our architecture for pair classification. Dotted lines indicate shared weights. ‘conv’ stands for a convolution layer, ‘fc’ stands for a fully-connected one, ‘pool’ is a max-pooling layer, and ‘LRN’ is a local response normalization layer. Numbers in parentheses are kernel size, number of outputs, and stride (fc layers have only a number of outputs). The LRN parameters follow [32]. All conv and fc layers are followed by ReLU nonlinearities, except fc9, which feeds into a softmax classifier.

To obtain training examples given an image, we sample the first patch uniformly, without any reference to image content. Given the position of the first patch, we sample the second patch randomly from the eight possible neighboring locations as in Figure 2.
within an image. We employ Convolutional Neural Net-
works (ConvNets), which are well known to learn complex 3.1. Avoiding “trivial” solutions
image representations with minimal human feature design. When designing a pretext task, care must be taken to en-
Building a ConvNet that can predict a relative offset for a sure that the task forces the network to extract the desired
pair of patches is, in principle, straightforward: the network information (high-level semantics, in our case), without tak-
must feed the two input patches through several convolu- ing “trivial” shortcuts. In our case, low-level cues like
tion layers, and produce an output that assigns a probability boundary patterns or textures continuing between patches
to each of the eight spatial configurations (Figure 2) that could potentially serve as such a shortcut. Hence, for the
might have been sampled (i.e. a softmax output). Note, relative prediction task, it was important to include a gap
however, that we ultimately wish to learn a feature embed- between patches (in our case, approximately half the patch
ding for individual patches, such that patches which are vi- width). Even with the gap, it is possible that long lines span-
sually similar (across different images) would be close in ning neighboring patches could could give away the correct
the embedding space. answer. Therefore, we also randomly jitter each patch loca-
To achieve this, we use a late-fusion architecture shown tion by up to 7 pixels (see Figure 2).
in Figure 3: a pair of AlexNet-style architectures [32] that However, even these precautions are not enough: we
process each patch separately, until a depth analogous to were surprised to find that, for some images, another triv-
fc6 in AlexNet, after which point the representations are ial solution exists. We traced the problem to an unexpected
fused. For the layers that process only one of the patches, culprit: chromatic aberration. Chromatic aberration arises
weights are tied between both sides of the network, such from differences in the way the lens focuses light at differ-
that the same fc6-level embedding function is computed for ent wavelengths. In some cameras, one color channel (com-
both patches. Because there is limited capacity for joint monly green) is shrunk toward the image center relative to
reasoning—i.e., only two layers receive input from both the others [5, p. 76]. A ConvNet, it turns out, can learn to lo-
patches—we expect the network to perform the bulk of the calize a patch relative to the lens itself (see Section 4.2) sim-
To deal with this problem, we experimented with two types of pre-processing. One is to shift green and magenta toward gray (‘projection’). Specifically, let a = [−1, 2, −1] (the ‘green-magenta color axis’ in RGB space). We then define B = I − aᵀa/(aaᵀ), which is a matrix that subtracts the projection of a color onto the green-magenta color axis. We multiply every pixel value by B. An alternative approach is to randomly drop 2 of the 3 color channels from each patch (‘color dropping’), replacing the dropped colors with Gaussian noise (standard deviation ∼1/100 the standard deviation of the remaining channel). For qualitative results, we show the ‘color-dropping’ approach, but found both performed similarly; for the object detection results, we show both results.
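A NumPy sketch of the two pre-processings just described; the RGB channel ordering, image layout, and helper names are assumptions, while the green-magenta axis and the 1/100 noise scale follow the text:

import numpy as np

def project_out_green_magenta(patch):
    """'Projection': remove the green-magenta color axis a = [-1, 2, -1].
    patch: float array of shape (H, W, 3), RGB order."""
    a = np.array([-1.0, 2.0, -1.0])
    B = np.eye(3) - np.outer(a, a) / np.dot(a, a)   # B = I - a^T a / (a a^T)
    return patch @ B.T                              # apply B to every pixel

def color_drop(patch, rng=np.random.default_rng()):
    """'Color dropping': keep one random channel, replace the other two with
    Gaussian noise (std ~ 1/100 of the kept channel's std)."""
    out = patch.copy()
    keep = rng.integers(3)
    noise_std = patch[..., keep].std() / 100.0
    for c in range(3):
        if c != keep:
            out[..., c] = rng.normal(0.0, noise_std, size=patch.shape[:2])
    return out

patch = np.random.rand(96, 96, 3)
print(project_out_green_magenta(patch).shape, color_drop(patch).shape)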
proach is to randomly drop 2 of the 3 color channels from patches in the grid, but also jitter the location of each patch
each patch (‘color dropping’), replacing the dropped colors in the grid by −7 to 7 pixels in each direction. We prepro-
with Gaussian noise (standard deviation ∼ 1/100 the stan- cess patches by (1) mean subtraction (2) projecting or drop-
dard deviation of the remaining channel). For qualitative ping colors (see above), and (3) randomly downsampling
results, we show the ‘color-dropping’ approach, but found some patches to as little as 100 total pixels, and then upsam-
both performed similarly; for the object detection results, pling it, to build robustness to pixelation. When applying
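A sketch of the pair-sampling geometry using the numbers above (96x96 patches, a 48-pixel gap, ±7-pixel jitter). It samples a single pair per call rather than the full grid, and the boundary handling and function names are assumptions:

import numpy as np

PATCH, GAP, JITTER = 96, 48, 7
STEP = PATCH + GAP                     # top-left to top-left spacing of neighboring patches

# The 8 neighbor offsets around the first patch, indexed as in Figure 2.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           ( 0, -1),          ( 0, 1),
           ( 1, -1), ( 1, 0), ( 1, 1)]

def sample_pair(image, rng=np.random.default_rng()):
    """Return (patch1, patch2, label) from an HxWx3 array."""
    h, w = image.shape[:2]
    margin = STEP + JITTER             # keep the whole neighborhood (plus jitter) inside
    y = rng.integers(margin, h - margin - PATCH)
    x = rng.integers(margin, w - margin - PATCH)
    label = rng.integers(8)
    dy, dx = OFFSETS[label]
    jy1, jx1, jy2, jx2 = rng.integers(-JITTER, JITTER + 1, size=4)   # independent jitter
    p1 = image[y + jy1 : y + jy1 + PATCH, x + jx1 : x + jx1 + PATCH]
    p2 = image[y + dy * STEP + jy2 : y + dy * STEP + jy2 + PATCH,
               x + dx * STEP + jx2 : x + dx * STEP + jx2 + PATCH]
    return p1, p2, label

img = np.random.rand(450, 600, 3)
p1, p2, y = sample_pair(img)
print(p1.shape, p2.shape, y)   # (96, 96, 3) (96, 96, 3) and a label in 0..7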
[Figure 5 panel titles: ‘Initial layout, with sampled patches in red’; ‘Image layout is discarded’; ‘We can recover image layout automatically’; ‘Cannot recover layout with color removed’.] Figure 5. We trained a network to predict the absolute (x, y) coordinates of randomly sampled patches. Far left: input image. Center left: extracted patches. Center right: the location the trained network predicts for each patch shown on the left. Far right: the same result after our color projection scheme. Note that the far right patches are shown after color projection; the operation’s effect is almost unnoticeable.
When applying simple SGD to train the network, we found that the network predictions would degenerate to a uniform prediction over the 8 categories, with all activations for fc6 and fc7 collapsing to 0. This meant that the optimization became permanently stuck in a saddle point where it ignored the input from the lower layers (which helped minimize the variance of the final output), and therefore that the net could not tune the lower-level features and escape the saddle point. Hence, our final implementation employs batch normalization [26], without the scale and shift (γ and β), which forces the network activations to vary across examples. We also find that high momentum values (e.g. .999) accelerated learning. For experiments, we use a ConvNet trained on a K40 GPU for approximately four weeks.
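In a modern framework, the normalization-without-scale-and-shift and high-momentum SGD described above might look like the following sketch (PyTorch; the layer sizes and learning rate are placeholders, not values from the paper):

import torch
import torch.nn as nn
import torch.optim as optim

# affine=False drops the learnable gamma and beta, so fc6 activations are
# normalized but not re-scaled or shifted, forcing them to vary across a batch.
fc6 = nn.Sequential(nn.Linear(256 * 3 * 3, 4096),
                    nn.BatchNorm1d(4096, affine=False),
                    nn.ReLU())

# High-momentum SGD, as mentioned above (the learning rate is a guess).
opt = optim.SGD(fc6.parameters(), lr=1e-4, momentum=0.999)
print(fc6(torch.randn(8, 256 * 3 * 3)).shape)   # torch.Size([8, 4096])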
4. Experiments
We first demonstrate that the network has learned to associate semantically similar patches, using simple nearest-neighbor matching. We then apply the trained network in two domains. First, we use the model as “pre-training” for a standard vision task with only limited training data: specifically, VOC 2007 object detection. Second, we evaluate visual data mining, where the goal is to start with an unlabeled image collection and discover object classes. Finally, we analyze the performance on the layout prediction “pretext task” to see how much is left to learn from this supervisory signal.
4.1. Nearest Neighbors
Recall our intuition that training should assign similar representations to semantically similar patches. In this section, our goal is to understand which patches our network considers similar. We begin by sampling random 96x96 patches, which we represent using fc6 features (i.e. we remove fc7 and higher shown in Figure 3, and use only one of the two stacks). We find nearest neighbors using normalized correlation of these features. Results for some patches (selected out of 1000 random queries) are shown in Figure 4. For comparison, we repeated the experiment using fc7 features from AlexNet trained on ImageNet (obtained by upsampling the patches), and using fc6 features from our architecture but without any training (random weights initialization). As shown in Figure 4, the matches returned by our feature often capture the semantic information that we are after, matching AlexNet in terms of semantic content (in some cases, e.g. the car wheel, our matches capture pose better). Interestingly, in a few cases, a random (untrained) ConvNet also does reasonably well.

[Figure 4 columns, left to right: Input, Random Initialization, ImageNet AlexNet, Ours.] Figure 4. Examples of patch clusters obtained by nearest neighbors. The query patch is shown on the far left. Matches are for three different features: fc6 features from a random initialization of our architecture, AlexNet fc7 after training on labeled ImageNet, and the fc6 features learned from our method. Queries were chosen from 1000 randomly-sampled patches. The top group is examples where our algorithm performs well; for the middle, AlexNet outperforms our approach; and for the bottom, all three features work well.
4.2. Aside: Learnability of Chromatic Aberration

We noticed in early nearest-neighbor experiments that some patches retrieved match patches from the same absolute location in the image, regardless of content, because those patches displayed similar aberration. To further demonstrate this phenomenon, we trained a network to predict the absolute (x, y) coordinates of patches sampled from ImageNet. While the overall accuracy of this regressor is not very high, it does surprisingly well for some images: for the top 10% of images, the average (root-mean-square) error is .255, while chance performance (always predicting the image center) yields an RMSE of .371. Figure 5 shows one such result. Applying the proposed “projection” scheme increases the error on the top 10% of images to .321.

4.3. Object Detection

Previous work on the Pascal VOC challenge [15] has shown that pre-training on ImageNet (i.e., training a ConvNet to solve the ImageNet challenge) and then “fine-tuning” the network (i.e. re-training the ImageNet model for PASCAL data) provides a substantial boost over training on the Pascal training set alone [21, 2]. However, as far as we are aware, no works have shown that unsupervised pre-training on images can provide such a performance boost, no matter how much data is used.

Since we are already using a ConvNet, we adopt the current state-of-the-art R-CNN pipeline [21].
VOC-2007 Test aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
DPM-v5[17] 33.2 60.3 10.2 16.1 27.3 54.3 58.2 23.0 20.0 24.1 26.7 12.7 58.1 48.2 43.2 12.0 21.1 36.1 46.0 43.5 33.7
[8] w/o context 52.6 52.6 19.2 25.4 18.7 47.3 56.9 42.1 16.6 41.4 41.9 27.7 47.9 51.5 29.9 20.0 41.1 36.4 48.6 53.2 38.5
Regionlets[58] 54.2 52.0 20.3 24.0 20.1 55.5 68.7 42.6 19.2 44.2 49.1 26.6 57.0 54.5 43.4 16.4 36.6 37.7 59.4 52.3 41.7
Scratch-R-CNN[2] 49.9 60.6 24.7 23.7 20.3 52.5 64.8 32.9 20.4 43.5 34.2 29.9 49.0 60.4 47.5 28.0 42.3 28.6 51.2 50.0 40.7
Scratch-Ours 52.6 60.5 23.8 24.3 18.1 50.6 65.9 29.2 19.5 43.5 35.2 27.6 46.5 59.4 46.5 25.6 42.4 23.5 50.0 50.6 39.8
Ours-projection 58.4 62.8 33.5 27.7 24.4 58.5 68.5 41.2 26.3 49.5 42.6 37.3 55.7 62.5 49.4 29.0 47.5 28.4 54.7 56.8 45.7
Ours-color-dropping 60.5 66.5 29.6 28.5 26.3 56.1 70.4 44.8 24.6 45.5 45.4 35.1 52.2 60.2 50.0 28.1 46.7 42.6 54.8 58.6 46.3
Ours-Yahoo100m 56.2 63.9 29.8 27.8 23.9 57.4 69.8 35.6 23.7 47.4 43.0 29.5 52.9 62.0 48.7 28.4 45.1 33.6 49.0 55.5 44.2
ImageNet-R-CNN[21] 64.2 69.7 50 41.9 32.0 62.6 71.0 60.7 32.7 58.5 46.5 56.1 60.6 66.8 54.2 31.5 52.8 48.9 57.9 64.7 54.2
K-means-rescale [31] 55.7 60.9 27.9 30.9 12.0 59.1 63.7 47.0 21.4 45.2 55.8 40.3 67.5 61.2 48.3 21.9 32.8 46.9 61.6 51.7 45.6
Ours-rescale [31] 61.9 63.3 35.8 32.6 17.2 68.0 67.9 54.8 29.6 52.4 62.9 51.3 67.1 64.3 50.5 24.4 43.7 54.9 67.1 52.7 51.1
ImageNet-rescale [31] 64.0 69.6 53.2 44.4 24.9 65.7 69.6 69.2 28.9 63.6 62.8 63.9 73.3 64.6 55.8 25.7 50.5 55.4 69.3 56.4 56.5
VGG-K-means-rescale 56.1 58.6 23.3 25.7 12.8 57.8 61.2 45.2 21.4 47.1 39.5 35.6 60.1 61.4 44.9 17.3 37.7 33.2 57.9 51.2 42.4
VGG-Ours-rescale 71.1 72.4 54.1 48.2 29.9 75.2 78.0 71.9 38.3 60.5 62.3 68.1 74.3 74.2 64.8 32.6 56.5 66.4 74.0 60.3 61.7
VGG-ImageNet-rescale 76.6 79.6 68.5 57.4 40.8 79.9 78.4 85.4 41.7 77.0 69.3 80.1 78.6 74.6 70.1 37.5 66.0 67.5 77.4 64.9 68.6
Table 1. Mean Average Precision on VOC-2007.
R-CNN works on object proposals that have been resized to 227x227. Our algorithm, however, is aimed at 96x96 patches. We find that downsampling the proposals to 96x96 loses too much detail. Instead, we adopt the architecture shown in Figure 6. As above, we use only one stack from Figure 3. Second, we resize the convolution layers to operate on inputs of 227x227. This results in a pool5 that is 7x7 spatially, so we must convert the previous fc6 layer into a convolution layer (which we call conv6), following [38]. Note our conv6 layer has 4096 channels, where each unit connects to a 3x3 region of pool5. A conv layer with 4096 channels would be quite expensive to connect directly to a 4096-dimensional fully-connected layer. Hence, we add another layer after conv6 (called conv6b), using a 1x1 kernel, which reduces the dimensionality to 1024 channels (and adds a nonlinearity). Finally, we feed the outputs through a pooling layer to a fully connected layer (fc7) which in turn connects to a final fc8 layer which feeds into the softmax. We fine-tune this network according to the procedure described in [21] (conv6b, fc7, and fc8 start with random weights), and use fc7 as the final representation. We do not use bounding-box regression, and take the appropriate results from [21] and [2].

[Figure 6 stack, bottom to top: Image (227x227) → conv1 through pool5 → conv6 (3x3,4096,1) → conv6b (1x1,1024,1) → pool6 (3x3,1024,2) → fc7 (4096) → fc8 (21).] Figure 6. Our architecture for Pascal VOC detection. Layers from conv1 through pool5 are copied from our patch-based network (Figure 3). The new ‘conv6’ layer is created by converting the fc6 layer into a convolution layer. Kernel sizes, output units, and stride are given in parentheses, as in Figure 3.
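A PyTorch sketch of the fc6-to-conv6 conversion and the detection head just described; layer sizes follow Figure 6, while the 3x3x256 fc6 input geometry matches the earlier toy sketch and is otherwise an assumption:

import torch
import torch.nn as nn

def fc6_to_conv6(fc6: nn.Linear) -> nn.Conv2d:
    """Turn an fc6 layer trained on 3x3x256 pool5 maps into an equivalent
    3x3 convolution with 4096 output channels (the 'conv6' layer)."""
    conv6 = nn.Conv2d(256, fc6.out_features, kernel_size=3)
    with torch.no_grad():
        conv6.weight.copy_(fc6.weight.view(fc6.out_features, 256, 3, 3))
        conv6.bias.copy_(fc6.bias)
    return conv6

fc6 = nn.Linear(256 * 3 * 3, 4096)        # stands in for the pretrained patch-network fc6
conv6 = fc6_to_conv6(fc6)

# Detection head on top of a 7x7x256 pool5 map from a 227x227 proposal.
head = nn.Sequential(
    conv6, nn.ReLU(inplace=True),                        # conv6: 3x3, 4096 channels
    nn.Conv2d(4096, 1024, 1), nn.ReLU(inplace=True),     # conv6b: 1x1 bottleneck
    nn.MaxPool2d(3, stride=2),                           # pool6
    nn.Flatten(),
    nn.LazyLinear(4096), nn.ReLU(inplace=True),          # fc7 (randomly initialized)
    nn.Linear(4096, 21),                                 # fc8: 20 VOC classes + background
)
print(head(torch.randn(1, 256, 7, 7)).shape)             # torch.Size([1, 21])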
Table 1 shows our results. Our architecture trained from scratch (random initialization) performs slightly worse than AlexNet trained from scratch. However, our pre-training makes up for this, boosting the from-scratch number by 6% mAP, and outperforms an AlexNet-style model trained from scratch on Pascal by over 5%. This puts us about 8% behind the performance of R-CNN pre-trained with ImageNet labels [21]. This is the best result we are aware of on VOC 2007 without using labels outside the dataset. We ran additional baselines initialized with batch normalization, but found they performed worse than the ones shown.

To understand the effect of various dataset biases [55], we also performed a preliminary experiment pre-training on a randomly-selected 2M subset of the Yahoo/Flickr 100-million Dataset [54], which was collected entirely automatically. The performance after fine-tuning is slightly worse than ImageNet, but there is still a considerable boost over the from-scratch model.

In the above fine-tuning experiments, we removed the batch normalization layers by estimating the mean and variance of the conv- and fc-layers, and then rescaling the weights and biases such that the outputs of the conv and fc layers have mean 0 and variance 1 for each channel. Recent work [31], however, has shown empirically that the scaling of the weights prior to fine-tuning can have a strong impact on test-time performance, and argues that our previous method of removing batch normalization leads to poorly scaled weights. They propose a simple way to rescale the network’s weights without changing the function that the network computes, such that the network behaves better during fine-tuning. Results using this technique are shown in Table 1. Their approach gives a boost to all methods, but gives less of a boost to the already-well-scaled ImageNet-category model. Note that for this comparison, we used fast-rcnn [20] to save compute time, and we discarded all pre-trained fc-layers from our model, re-initializing them with the K-means procedure of [31] (which was used to initialize all layers in the “K-means-rescale” row). Hence, the structure of the network during fine-tuning and testing was the same for all models.
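A sketch of the first removal scheme (folding the estimated batch statistics into the preceding layer so that it alone produces per-channel zero-mean, unit-variance outputs); the rescaling method of [31] itself is not reproduced here, and the helper name is illustrative:

import torch
import torch.nn as nn

def absorb_batchnorm(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Rescale a conv layer's weights/biases so that, with the BN layer removed,
    its outputs keep roughly zero mean and unit variance per channel.
    Assumes bn was created with affine=False and holds running statistics."""
    std = torch.sqrt(bn.running_var + bn.eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding, bias=True)
    with torch.no_grad():
        fused.weight.copy_(conv.weight / std.view(-1, 1, 1, 1))
        bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((bias - bn.running_mean) / std)
    return fused

# Tiny check: conv followed by BN (eval mode) should match the fused conv alone.
conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8, affine=False)
bn.running_mean.uniform_(-1, 1); bn.running_var.uniform_(0.5, 2.0)
x = torch.randn(2, 3, 16, 16)
with torch.no_grad():
    ref = bn.eval()(conv(x))
    out = absorb_batchnorm(conv, bn)(x)
print(torch.allclose(ref, out, atol=1e-5))   # True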
Considering that we have essentially infinite data to train our model, we might expect that our algorithm should also provide a large boost to higher-capacity models such as VGG [49]. To test this, we trained a model following the 16-layer structure of [49] for the convolutional layers on each side of the network (the final fc6-fc9 layers were the same as in Figure 3). We again fine-tuned the representation on Pascal VOC using fast-rcnn, by transferring only the conv layers, again following Krähenbühl et al. [31] to re-scale the transferred weights and initialize the rest. As a baseline, we performed a similar experiment with the ImageNet-pretrained 16-layer model of [49] (though we kept pre-trained fc layers rather than re-initializing them), and also by initializing the entire network with K-means [31].
Training time was considerably longer—about 8 weeks on a Titan X GPU—but the network outperformed the AlexNet-style model by a considerable margin. Note that the model initialized with K-means performed roughly on par with the analogous AlexNet model, suggesting that most of the boost came from the unsupervised pre-training.
4.4. Geometry Estimation the best-fitting square S to the patch centers (via least-
The results of Section 4.3 suggest that our representation is sensitive to objects, even though it was not originally trained to find them. This raises the question: Does our representation extract information that is useful for other, non-object-based tasks? To find out, we fine-tuned our network to perform the surface normal estimation on NYUv2 proposed in Fouhey et al. [19], following the fine-tuning procedure of Wang et al. [57] (hence, we compare directly to the unsupervised pretraining results reported there). We used the color-dropping network, restructuring the fully-connected layers as in Section 4.3. Surprisingly, our results are almost equivalent to those obtained using a fully-labeled ImageNet model. One possible explanation for this is that the ImageNet categorization task does relatively little to encourage a network to pay attention to geometry, since the geometry is largely irrelevant once an object is identified. Further evidence of this can be seen in the seventh row of Figure 4: the nearest neighbors for ImageNet AlexNet are all car wheels, but they are not aligned well with the query patch.

Table 2. Accuracy on NYUv2 surface normal estimation. Mean and Median angular error: lower is better; fraction of pixels within 11.25°, 22.5°, and 30°: higher is better.

                        Mean   Median  11.25°  22.5°   30°
Scratch                 38.6   26.5    33.1    46.8    52.5
Unsup. Tracking [57]    34.2   21.9    35.7    50.6    57.0
Ours                    33.2   21.3    36.0    51.2    57.8
ImageNet Labels         33.3   20.8    36.7    51.7    58.1
4.5. Visual Data Mining (though we do discover them with trains). For some objects
Visual data mining [44, 13, 50, 45], or unsupervised object discovery [51, 47, 22], aims to use a large image collection to discover image fragments which happen to depict the same semantic objects. Applications include dataset visualization, content-based retrieval, and tasks that require relating visual data to other unstructured information (e.g. GPS coordinates [13]). For automatic data mining, our approach from Section 4.1 is inadequate: although object patches match to similar objects, textures match just as readily to similar textures. Suppose, however, that we sampled two non-overlapping patches from the same object. Not only would the nearest neighbor lists for both patches share many images, but within those images, the nearest neighbors would be in roughly the same spatial configuration. For texture regions, on the other hand, the spatial configurations of the neighbors would be random, because the texture has no global layout.

To implement this, we first sample a constellation of four adjacent patches from an image (we use four to reduce the likelihood of a matching spatial arrangement happening by chance). We find the top 100 images which have the strongest matches for all four patches, ignoring spatial layout. We then use a type of geometric verification [7] to filter away the images where the four matches are not geometrically consistent. Because our features are more semantically-tuned, we can use a much weaker type of geometric verification than [7]. Finally, we rank the different constellations by counting the number of times the top 100 matches geometrically verify.

Implementation Details: To compute whether a set of four matched patches geometrically verifies, we first compute the best-fitting square S to the patch centers (via least-squares), while constraining that the side of S be between 2/3 and 4/3 of the average side of the patches. We then compute the squared error of the patch centers relative to S (normalized by dividing the sum-of-squared-errors by the square of the side of S). The patch is geometrically verified if this normalized squared error is less than 1. When sampling patches, we do not use any of the data augmentation preprocessing steps (e.g. downsampling). We use the color-dropping version of our network.
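A NumPy sketch of that verification test. The square is parameterized by its center and side with a fixed, axis-aligned orientation, which is one simple reading of “best-fitting square”; the corner ordering and the clamping of the side constraint are assumptions:

import numpy as np

# Canonical corner pattern of a unit square (matches the 2x2 patch constellation).
UNIT = np.array([[-0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [0.5, 0.5]])

def geometrically_verified(centers, patch_sides):
    """centers: (4, 2) matched patch centers, in the same corner order as UNIT.
    patch_sides: (4,) side lengths of the matched patches."""
    c = centers.mean(axis=0)                               # least-squares square center
    s = np.sum(UNIT * (centers - c)) / np.sum(UNIT ** 2)   # least-squares side length
    lo, hi = (2 / 3) * patch_sides.mean(), (4 / 3) * patch_sides.mean()
    s = np.clip(s, lo, hi)                                 # constrain the side of S
    err = np.sum((centers - (c + s * UNIT)) ** 2) / s ** 2 # normalized squared error
    return err < 1.0

centers = np.array([[10., 12.], [11., 108.], [105., 9.], [107., 110.]])
print(geometrically_verified(centers, np.full(4, 96.0)))   # True for a near-square layout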
We applied the described mining algorithm to Pascal VOC 2011, with no pre-filtering of images and no additional labels. We show some of the resulting patch clusters in Figure 7. The results are visually comparable to our previous work [12], although we discover a few objects that were not found in [12], such as monitors, birds, torsos, and plates of food. The discovery of birds and torsos—which are notoriously deformable—provides further evidence for the invariances our algorithm has learned. We believe we have covered all objects discovered in [12], with the exception of (1) trusses and (2) railroad tracks without trains (though we do discover them with trains). For some objects like dogs, we discover more variety and rank the best ones higher. Furthermore, many of the clusters shown in [12] depict gratings (14 out of the top 100), whereas none of ours do (though two of our top hundred depict diffuse gradients). As in [12], we often re-discover the same object multiple times with different viewpoints, which accounts for most of the gaps between ranks in Figure 7. The main disadvantages of our algorithm relative to [12] are 1) some loss of purity, and 2) that we cannot currently determine an object mask automatically (although one could imagine dynamically adding more sub-patches to each proposed object).

To ensure that our algorithm has not simply learned an object-centric representation due to the various biases [55] in ImageNet, we also applied our algorithm to 15,000 Street View images from Paris (following [13]). The results in Figure 8 show that our representation captures scene layout and architectural elements. For this experiment, to rank clusters, we use the de-duplication procedure originally proposed in [13].

[Figure 7 shows clusters at ranks 1, 4, 7, 12, 25, 29, 30, 35, 46, 70, 71, 73, 88, 121, 131, 142, 179, 187, 229, 232, 240, 256, 351, and 464.] Figure 7. Object clusters discovered by our algorithm. The number beside each cluster indicates its ranking, determined by the fraction of the top matches that geometrically verified. For all clusters, we show the raw top 7 matches that verified geometrically. The full ranking is available on our project webpage.

[Figure 8 shows clusters at ranks 1, 4, 5, 6, 13, 18, 42, and 53.] Figure 8. Clusters discovered and automatically ranked via our algorithm (§ 4.5) from the Paris Street View dataset.
4.5.1 Quantitative Results

As part of the qualitative evaluation, we applied our algorithm to the subset of Pascal VOC 2007 selected in [50]: specifically, those containing at least one instance of bus, dining table, motorbike, horse, sofa, or train, and evaluate via a purity coverage curve following [12]. We select 1000 sets of 10 images each for evaluation. The evaluation then sorts the sets by purity: the fraction of images in the cluster containing the same category. We generate the curve by walking down the ranking. For each point on the curve, we plot average purity of all sets up to a given point in the ranking against coverage: the fraction of images in the dataset that are contained in at least one of the sets up to that point. As shown in Figure 9, we have gained substantially in terms of coverage, suggesting increased invariance for our learned feature. However, we have also lost some highly-pure clusters compared to [12]—which is not very surprising considering that our validation procedure is considerably simpler.

[Figure 9: ‘Purity-Coverage for Proposed Objects’; y-axis: purity, from 0.6 to 1.0.]
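A sketch of the purity-coverage computation just described; `clusters` (the discovered image sets) and `labels` (per-image category labels) are hypothetical inputs:

from collections import Counter

def purity_coverage_curve(clusters, labels, n_dataset):
    """clusters: list of image-id lists. labels: dict image-id -> category.
    n_dataset: total number of images in the evaluation set."""
    def purity(imgs):
        counts = Counter(labels[i] for i in imgs)
        return counts.most_common(1)[0][1] / len(imgs)   # fraction sharing the majority category

    ranked = sorted(clusters, key=purity, reverse=True)  # sort sets by purity
    covered, purities, curve = set(), [], []
    for imgs in ranked:                                  # walk down the ranking
        purities.append(purity(imgs))
        covered.update(imgs)
        curve.append((len(covered) / n_dataset,          # coverage
                      sum(purities) / len(purities)))    # average purity so far
    return curve

labels = {0: "bus", 1: "bus", 2: "bus", 3: "horse", 4: "horse", 5: "bus"}
print(purity_coverage_curve([[0, 1, 2], [3, 4, 5]], labels, n_dataset=6))
# [(0.5, 1.0), (1.0, 0.833...)]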
ing against coverage: the fraction of images in the dataset 4.6. Accuracy on the Relative Prediction Task Task
that are contained in at least one of the sets up to that point.
As shown in Figure 9, we have gained substantially in terms Can we improve the representation by further training
of coverage, suggesting increased invariance for our learned on our relative prediction pretext task? To find out, we
feature. However, we have also lost some highly-pure clus- briefly analyze classification performance on pretext task
8
age, the task is almost impossible. Might the task be easiest
1: for image regions corresponding to objects? To test this
hypothesis, we repeated our experiment using only patches
4: sampled from within Pascal object ground-truth bounding
boxes. We select only those boxes that are at least 240 pix-
els on each side, and which are not labeled as truncated,
5: occluded, or difficult. Surprisingly, this gave essentially the
same accuracy of 39.2%, and a similar experiment only on
cars yielded 45.6% accuracy. So, while our algorithm is
6: sensitive to objects, it is almost as sensitive to the layout of
the rest of the image.
Acknowledgements: We thank Xiaolong Wang and Pulkit Agrawal for help with baselines, Berkeley and CMU vision group members for many fruitful discussions, and Jitendra Malik for putting gelato on the line. This work was partially supported by a Google Graduate Fellowship to CD, ONR MURI N000141010934, an Intel research grant, an NVidia hardware grant, and an Amazon Web Services grant.
References
[1] E. H. Adelson. On seeing stuff: the perception of materials by humans and machines. In Photonics West 2001-Electronic Imaging, 2001.
[2] P. Agrawal, R. Girshick, and J. Malik. Analyzing the performance of multilayer neural networks for object recognition. In ECCV, 2014.
[3] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR, 2005.
[4] Y. Bengio, E. Thibodeau-Laufer, G. Alain, and J. Yosinski. Deep generative stochastic networks trainable by backprop. ICML, 2014.
[5] D. Brewster and A. D. Bache. Treatise on optics. Blanchard and Lea, 1854.
[18] P. Földiák. Learning invariance from transformation sequences. Neural Computation, 1991.
[19] D. F. Fouhey, A. Gupta, and M. Hebert. Data-driven 3D primitives for single image understanding. In ICCV, 2013.
[20] R. Girshick. Fast R-CNN. In ICCV, 2015.
[21] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[22] K. Grauman and T. Darrell. Unsupervised learning of categories from sets of partially matching image features. In CVPR, 2006.
[23] K. Heath, N. Gelfand, M. Ovsjanikov, M. Aanjaneya, and L. J. Guibas. Image webs: Computing and exploiting connectivity in image collections. In CVPR, 2010.
[24] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
[25] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The “wake-sleep” algorithm for unsupervised neural networks. Proceedings. IEEE, 1995.
[26] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[27] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM-MM, 2014.
[28] M. Juneja, A. Vedaldi, C. V. Jawahar, and A. Zisserman. Blocks that shout: Distinctive parts for scene classification. In CVPR, 2013.
[29] G. Kim, C. Faloutsos, and M. Hebert. Unsupervised modeling of object categories using link analysis techniques. In CVPR, 2008.
[30] D. P. Kingma and M. Welling. Auto-encoding variational bayes. 2014.
[31] P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.
[32] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[33] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In AISTATS, 2011.
[34] Q. V. Le. Building high-level features using large scale unsupervised learning. In ICASSP, 2013.
[35] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In NIPS, 2006.
[36] Y. J. Lee and K. Grauman. Foreground focus: Unsupervised learning from partially matching images. IJCV, 2009.
[37] Q. Li, J. Wu, and Z. Tu. Harvesting mid-level visual concepts from large-scale internet images. In CVPR, 2013.
[38] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. arXiv preprint arXiv:1411.4038, 2014.
[39] T. Malisiewicz and A. Efros. Beyond categories: The visual memex model for reasoning about object relationships. In NIPS, 2009.
[40] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[41] D. Okanohara and J. Tsujii. A discriminative language model with pseudo-negative samples. In ACL, 2007.
[42] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 1996.
[43] N. Payet and S. Todorovic. From a set of shapes to object discovery. In ECCV, 2010.
[44] T. Quack, B. Leibe, and L. Van Gool. World-scale mining of objects and events from community photo collections. In CIVR, 2008.
[45] K. Rematas, B. Fernando, F. Dellaert, and T. Tuytelaars. Dataset fingerprints: Exploring image collections through data mining. In CVPR, 2015.
[46] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. ICML, 2014.
[47] B. C. Russell, W. T. Freeman, A. A. Efros, J. Sivic, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In CVPR, 2006.
[48] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In ICAIS, 2009.
[49] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
[50] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of mid-level discriminative patches. In ECCV, 2012.
[51] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their location in images. In ICCV, 2005.
[52] J. Sun and J. Ponce. Learning discriminative part detectors for image classification and cosegmentation. In ICCV, 2013.
[53] L. Theis and M. Bethge. Generative image modeling using spatial LSTMs. In NIPS, 2015.
[54] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817, 2015.
[55] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR, 2011.
[56] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
[57] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
[58] X. Wang, M. Yang, S. Zhu, and Y. Lin. Regionlets for generic object detection. In ICCV, 2013.
[59] L. Wiskott and T. J. Sejnowski. Slow feature analysis: unsupervised learning of invariances. Neural Computation, 2002.