OBoW: Online Bag-of-Words Generation for Self-Supervised Learning
Spyros Gidaris1 , Andrei Bursuc1 , Gilles Puy1 , Nikos Komodakis2 , Matthieu Cord1,3 , Patrick Pérez1
1 valeo.ai    2 University of Crete    3 Sorbonne Université
trained teacher network. More importantly, it assumes that this teacher remains static throughout training. However, since during training the quality of the student's representations will surpass that of the teacher, a static teacher is prone to offer a suboptimal supervisory signal to the student and to lead to an inefficient usage of the computational budget for training.

In this paper, we propose a BoW-based self-supervised approach that overcomes the aforementioned limitations. To that end, our main technical contributions are three-fold:

1. We design a novel fully online teacher-student learning scheme for BoW-based self-supervised training (Fig. 1). This is achieved by online training of both the teacher and the student networks.

2. We significantly revisit key elements of the BoW-guided reconstruction task. This includes the proposal of a dynamic BoW prediction module used for reconstructing the BoW representation of an image from the student's features. This module is carefully combined with adequate online updates of the visual-words vocabulary used for the BoW targets.

3. We enforce the learning of powerful contextual reasoning skills in our image representations. By revisiting data augmentation with aggressive cropping and spatial image perturbations, and by exploiting multi-scale BoW reconstruction targets, we equip our student network with a powerful feature representation.

Overall, the proposed method leads to a simpler and much more efficient training methodology for the BoW-guided reconstruction task, which learns significantly better image representations and therefore matches or even surpasses the state of the art on several unsupervised learning benchmarks. We call our method OBoW after the online BoW generation mechanism that it uses.

2. Related Work

Bags of visual words. Bag-of-visual-words representations are powerful image models able to encode image statistics from hundreds of local features. Thanks to that, they were used extensively in the past [13, 39, 53, 59, 64] and continue to be a key ingredient of several recent deep learning approaches [2, 25, 28, 37]. Among them, BoWNet [25] was the first work to use BoWs as reconstruction targets for unsupervised representation learning. Inspired by it, we propose a novel BoW-based self-supervised method with a simpler and more effective training methodology that generates BoW targets in a fully online manner and further enforces the learning of context-aware representations.

Self-supervised learning. A prominent paradigm for unsupervised representation learning is to train a convnet on an artificially designed annotation-free pretext task, e.g., [3, 10, 15, 27, 44, 47, 48, 68, 70, 75, 79]. Many works rely on pretext reconstruction tasks [1, 29, 43, 52, 54, 67, 76, 77, 79], where the reconstruction target is defined at the image pixel level. This is in stark contrast with our method, which uses a reconstruction task defined over high-level visual concepts (i.e., visual words) that are learnt with a teacher-student scheme in a fully online manner.

Instance discrimination and contrastive objectives. Recently, unsupervised methods based on contrastive learning objectives [8, 9, 22, 23, 33, 35, 38, 46, 49, 62, 69, 71] have shown great results. Among them, contrastive-based instance discrimination training [9, 18, 33, 40, 71] is the most prominent example. In this case, a convnet is trained to learn image representations that are invariant to several perturbations and at the same time discriminative among different images (instances). Our method also learns intra-image invariant representations, since the convnet must predict the same BoW target (computed from the original image) regardless of the applied perturbation. Beyond that, however, our work also places emphasis on learning context-aware representations, which, we believe, is another important characteristic that effective representations should have. In that respect, it is closer to contrastive-based approaches that aim to learn context-aware representations by predicting (in a contrastive way) the state of missing image patches [35, 49, 65].

Teacher-student approaches. This paradigm has a long research history [24, 58] and is frequently used for distilling a single large network or an ensemble, the teacher, into a smaller network, the student [5, 36, 41, 51]. This setting has been revisited in the context of semi-supervised learning, where the teacher is no longer fixed but evolves during training [42, 61, 66]. In self-supervised learning, BoWNet [25] trains a student to match the BoW representations produced by a self-supervised pre-trained teacher. MoCo [33] relies on a slow-moving momentum-updated teacher to generate up-to-date representations that fill a memory bank of negative images. BYOL [31], a method concurrent to our work, also uses a momentum-updated teacher and trains the student to predict features generated by the teacher. However, BYOL, similar to contrastive-based instance discrimination methods, uses low-dimensional global image embeddings as targets (produced from the final convnet output) and primarily focuses on making them intra-image invariant. On the contrary, our training targets are produced by converting intermediate teacher feature maps into high-dimensional BoW vectors that capture multiple local visual concepts, thus constituting a richer target representation. Moreover, they are built over an online-updated vocabulary of randomly sampled local features, expressing the current image as statistics over this vocabulary (see § 3.1). Therefore, our BoW targets expose fewer learning "shortcuts" (a critical aspect in self-supervised learning [3, 15, 48, 70]), thus preventing to a larger extent teacher-student collapse and overfitting.
Figure 1: Unsupervised learning with Bag-of-Words guidance. Two encoders T and S learn at different tempos by interacting and
learning from each other. An image x is passed through the encoder T and its output feature maps Tℓ (x) are embedded into a BoW
representation yT (x) over a vocabulary V of features from T. The vocabulary V is updated at each step. The encoder S aims to reconstruct
yT (x) from data-augmented instances x̃. A dynamic BoW-prediction head learns to leverage the continuously updated vocabulary V to
compute the BoW representation from the features S(x̃). T slowly follows the learning trajectory of S via momentum updates.
Relation to SwAV [8]. OBoW presents some similarity (e.g., using online vocabularies) with SwAV [8]. However, the prediction tasks fundamentally differ: OBoW exploits a BoW prediction task while SwAV uses an image-cluster prediction task [4, 6, 7]. BoW targets are much richer representations than image-cluster assignments: a BoW encodes all the local-feature statistics of an image whereas an image-cluster assignment encodes only one global image feature.

3. Our approach

Here we explain our proposed approach for learning image representations by reconstructing bags of visual words. We start with an overview of our method.

Overview. The bag-of-words reconstruction task involves a student convnet S(·) that learns image representations, and a teacher convnet T(·) that generates BoW targets used for training the student network. The student S(·) is parameterized by θS and the teacher T(·) by θT.

To generate a BoW representation yT(x) out of an image x, the teacher first extracts the feature map Tℓ(x) ∈ R^(cℓ×hℓ×wℓ), of spatial size hℓ × wℓ with cℓ channels, from its ℓth layer (in our experiments ℓ is either the last L or the penultimate L−1 convolutional layer of T(·)). It quantizes the cℓ-dimensional feature vectors Tℓ(x)[u] at each location u ∈ {1, · · · , hℓ × wℓ} of the feature map over a vocabulary V = [v1, . . . , vK] of K visual words of dimension cℓ. This quantization process produces for each location u a K-dimensional code vector q(x)[u] that encodes the assignment of Tℓ(x)[u] to its closest (in terms of squared Euclidean distance) visual word(s). Then, the teacher reduces the quantized feature maps q(x) to a K-dimensional BoW, ỹT(x), by channel-wise max-pooling, i.e., ỹT(x)[k] = max_u q(x)[u][k] (alternatively, the reduction can be performed with average pooling), where q(x)[u][k] is the assignment value of the code q(x)[u] for the kth word. Finally, ỹT(x) is converted into a probability distribution over the visual words by L1-normalization, i.e.,

    yT(x)[k] = ỹT(x)[k] / Σ_{k′} ỹT(x)[k′].

To learn image representations, the student gets as input a perturbed version of the image x, denoted as x̃, and is trained to reconstruct the BoW representation yT(x), produced by the teacher, of the original unperturbed image x. To that end, it first extracts a global vector representation S(x̃) ∈ R^c (with c channels) from the entire image x̃ and then applies a linear-plus-softmax layer to S(x̃), as follows:

    yS(x̃)[k] = exp(w_k⊤ S(x̃)) / Σ_{k′} exp(w_{k′}⊤ S(x̃)),    (1)

where W = [w1, · · · , wK] are the c-dimensional weight vectors (one per word) of the linear layer. The K-dimensional vector yS(x̃) is the predicted softmax probability of the target yT(x). Hence, the training loss that is minimized for a single image x is the cross-entropy loss

    CE(yS(x̃), yT(x)) = − Σ_{k=1}^{K} yT(x)[k] log yS(x̃)[k]    (2)

between the softmax distribution yS(x̃) predicted by the student from the perturbed image x̃, and the BoW distribution yT(x) of the unperturbed image x given by the teacher.

Our technical contributions. In the following, we explain (i) in § 3.1, how to construct a fully online training methodology for the teacher, the student and the visual-words vocabulary, (ii) in § 3.2, how to implement a dynamic approach for the BoW prediction that can adapt to continuously-changing vocabularies of visual words, and finally (iii) in § 3.3, how to significantly enhance the learning of contextual reasoning skills by utilizing multi-scale BoW reconstruction targets and by revisiting the image augmentation schemes.
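Before detailing these components, the overview above can be made concrete with a short sketch. The snippet below is a hypothetical PyTorch-style illustration (ours, not the authors' released code): it hard-assigns each teacher feature vector to its closest visual word, reduces the codes to a BoW with channel-wise max-pooling, L1-normalizes the result into yT(x), and evaluates the cross-entropy of Eq. (2) against the student's linear-plus-softmax prediction of Eq. (1).

```python
import torch
import torch.nn.functional as F

def bow_target(teacher_feats, vocab):
    # teacher_feats: (B, c_l, h, w) feature map T_l(x); vocab: (K, c_l) visual words V.
    B = teacher_feats.size(0)
    feats = teacher_feats.flatten(2).transpose(1, 2)                         # (B, h*w, c_l)
    d2 = torch.cdist(feats, vocab.unsqueeze(0).expand(B, -1, -1)) ** 2       # (B, h*w, K) squared distances
    codes = F.one_hot(d2.argmin(dim=-1), num_classes=vocab.size(0)).float()  # hard codes q(x)[u]
    y_tilde = codes.max(dim=1).values                                        # (B, K) channel-wise max-pooling
    return y_tilde / y_tilde.sum(dim=1, keepdim=True)                        # L1-normalized BoW y_T(x)

def bow_loss(student_logits, y_target):
    # Cross-entropy (Eq. 2) between the student softmax (Eq. 1) and the BoW target.
    log_probs = F.log_softmax(student_logits, dim=1)                         # (B, K)
    return -(y_target * log_probs).sum(dim=1).mean()
```

In the full method, the hard codes above are replaced by the soft assignments of § 3.1 and the fixed linear layer producing the student logits is replaced by the dynamic prediction head of § 3.2.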
3.1. Fully online BoW-based learning
To make the BoW targets encode more high-level features,
BoWNet pre-trains the teacher convnet T(·) with another
unsupervised method, such as RotNet [27], and computes
the vocabulary V for quantizing the teacher feature maps off-
line by applying k-means on a set of teacher feature maps
extracted from training images. After the end of the stu-
dent training, during which the teacher’s parameters remain
frozen, the student becomes the new teacher T(·) ← S(·), a
new vocabulary V is learned off-line from the new teacher, and a new student is trained, starting a new training cycle. In this case however, (a) the final success depends on the quality of the first pre-trained teacher, (b) the teacher and the BoW reconstruction targets yT(x) remain frozen for long periods of time, which, as already explained, results in a suboptimal training signal, and (c) multiple training cycles are required, making the overall training time consuming.

To address these important limitations, in this work we propose a fully online training methodology that allows the teacher to be continuously updated as the student training progresses, with no need for off-line k-means stages. This requires an online updating scheme for the teacher as well as for the vocabulary of visual words used for generating the BoW targets, both of which are detailed below.

Updating the teacher network. Inspired by MoCo [33], the parameters θT of the teacher convnet are an exponential moving average of the student parameters. Specifically, at each training iteration the parameters θT are updated as

    θT ← α · θT + (1 − α) · θS,    (3)

where α ∈ [0, 1] is a momentum coefficient. Note that, as a consequence, the teacher has to share exactly the same architecture as the student. With a proper tuning of α, e.g., α = 0.99, this update rule allows slow and continuous updates of the teacher, avoiding rapid changes of its parameters, such as with α = 0, which would make the training unstable. As in MoCo, for its batch-norm units, the teacher maintains different batch-norm statistics from the student.

Updating the visual-words vocabulary. Since the teacher is continuously updated, off-line learning of V is not a viable option. Instead, we explore two solutions for computing V, online k-means and a queue-based vocabulary.

Online k-means. One possible choice for updating the vocabulary is to apply online k-means clustering after each training step. Specifically, as proposed in VQ-VAE [50, 56], we use exponential moving average for the vocabulary updates. A critical issue that arises in this case is that, as training progresses, the feature distribution changes over time. The visual words computed by online k-means do not adapt to this distribution shift, leading to extremely unbalanced cluster assignments and even to assignments that collapse to a single cluster. In order to counter this effect, we investigate different strategies: (a) detection of rarely used visual words over several mini-batches and replacement of these words with a randomly sampled feature vector from the current mini-batch; (b) enforcing uniform assignments to each cluster thanks to the Sinkhorn optimization as in, e.g., [4, 8]. For more details see § D in the Supplementary Material.

Figure 2: Vocabulary queue from randomly sampled local features. For each input image x to T, "local" features are pooled from Tℓ(x) by averaging over 3 × 3 sliding windows. One of the resulting vectors is selected randomly and added as a visual word to the vocabulary queue, replacing the oldest word in the vocabulary.

A queue-based vocabulary. In this case, the vocabulary V of visual words is a K-sized queue of random features. At each step, after computing the assignment codes over the current vocabulary V, we update V by selecting one feature vector per image from the current mini-batch, inserting it to the queue, and removing the oldest item in the queue if its size exceeds K. Hence, the visual words in V are always feature vectors from past mini-batches. We explore three different ways to select these local feature vectors: (a) uniform random sampling of one feature vector in Tℓ(x); (b) global average pooling of Tℓ(x) (average feature vector of each image); (c) an intermediate approach between (a) and (b) which consists of a local average pooling with a 3 × 3 kernel (stride 1, padding 0) of the feature map Tℓ(x) followed by a uniform random sampling of one of the resulting feature vectors (Fig. 2). Our intuition for option (c) is that, assuming that the local features in a 3 × 3 neighborhood belong to one common visual concept, local averaging selects a more representative visual-word feature from this neighborhood than simply sampling at random one local feature (option (a)). Likewise, the global averaging option (b) produces a representative feature from an entire image, which, however, might result in overly coarse visual-word features.

The advantage of the queue-based solution over online k-means is that it is simpler to implement and it does not require any extra mechanism for avoiding unbalanced clusters, since at each step the queue is updated with new randomly sampled features. Indeed, in our experiments, the queue-based vocabulary with option (c) provided the best results.
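As a minimal sketch of one training-step update under these choices (assuming PyTorch; the function and variable names are ours, not from the released code), the teacher EMA of Eq. (3) and the queue update with option (c) could look as follows:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    # theta_T <- alpha * theta_T + (1 - alpha) * theta_S   (Eq. 3)
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)

@torch.no_grad()
def update_queue_vocabulary(vocab, teacher_feats):
    # vocab: (K, c_l) queue of visual words; teacher_feats: (B, c_l, h, w) feature maps T_l(x).
    # Option (c): 3x3 local average pooling (stride 1, no padding), then one random vector per image.
    local = F.avg_pool2d(teacher_feats, kernel_size=3, stride=1, padding=0)  # (B, c_l, h-2, w-2)
    B, c, h, w = local.shape
    flat = local.flatten(2)                                                  # (B, c_l, h*w)
    idx = torch.randint(h * w, (B,), device=flat.device)
    new_words = flat[torch.arange(B), :, idx]                                # (B, c_l), one word per image
    # FIFO queue: drop the B oldest words and append the B newly sampled ones.
    return torch.cat([vocab[B:], new_words], dim=0)
```

Both updates run without gradients; as noted above, the teacher additionally keeps batch-norm statistics separate from the student's.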
Generating BoW targets with soft-assignment codes. For generating the BoW targets, we use soft-assignments instead of the hard-assignments used in BoWNet. This is preferable from an optimization perspective due to the fact that the vocabulary of visual words is continuously evolving. We thus compute the assignment codes q(x)[u] as

    q(x)[u][k] = exp(−(1/δ) ‖Tℓ(x)[u] − vk‖₂²) / Σ_{k′} exp(−(1/δ) ‖Tℓ(x)[u] − vk′‖₂²).    (4)

The parameter δ is a temperature value that controls the softness of the assignment. We use δ = δbase · µ̄MSD, where δbase > 0 and µ̄MSD is the exponential moving average (with momentum 0.99) of the mean squared distance of the feature vectors in Tℓ(x) from their closest visual words. The reason for using an adaptive temperature instead of a constant one is the change of magnitude of the feature activations during training, which induces a change of scale of the distances between the feature vectors and the words.
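A compact sketch of this soft-assignment step with the adaptive temperature follows (a hypothetical PyTorch-style illustration; the class and attribute names are our own):

```python
import torch
import torch.nn.functional as F

class SoftAssigner:
    """Soft-assignment codes (Eq. 4) with adaptive temperature delta = delta_base * mu_msd,
    where mu_msd is an EMA of the mean squared distance to the closest visual word."""
    def __init__(self, delta_base=1.0 / 15, momentum=0.99):
        self.delta_base = delta_base
        self.momentum = momentum
        self.mu_msd = None

    @torch.no_grad()
    def __call__(self, teacher_feats, vocab):
        # teacher_feats: (B, c_l, h, w); vocab: (K, c_l).
        B = teacher_feats.size(0)
        feats = teacher_feats.flatten(2).transpose(1, 2)                    # (B, h*w, c_l)
        d2 = torch.cdist(feats, vocab.unsqueeze(0).expand(B, -1, -1)) ** 2  # (B, h*w, K)
        msd = d2.min(dim=-1).values.mean()                                  # mean squared distance to closest word
        self.mu_msd = msd if self.mu_msd is None else \
            self.momentum * self.mu_msd + (1 - self.momentum) * msd
        delta = self.delta_base * self.mu_msd
        return F.softmax(-d2 / delta, dim=-1)                               # soft codes q(x)[u]
```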
3.2. Dynamic bag-of-visual-word prediction

To learn effective image representations, the student must predict the BoW distribution over V of an image using as input a perturbed version of that same image. However, in our method the vocabulary is constantly updated and the visual words are changing or being replaced from one step to the next. Therefore, predicting the BoW distribution over a continuously updating vocabulary V with a fixed linear layer would make training unstable, if not impossible. To address this issue we propose to use a dynamic BoW-prediction head that can adapt to the evolving nature of the vocabulary. To that end, instead of using fixed weights as in (1), we employ a generation network G(·) that takes as input the current vocabulary of visual words V = [v1, . . . , vK] and produces prediction weights for them as G(V) = [G(v1), . . . , G(vK)], where G(·) : R^cℓ → R^c is parameterized by θG and G(vk) represents the prediction weight vector for the kth visual word. Therefore, Equation 1 becomes

    yS(x̃)[k] = exp(κ · G(vk)⊤ S(x̃)) / Σ_{k′} exp(κ · G(vk′)⊤ S(x̃)),    (5)

where κ is a fixed coefficient that equally scales the magnitudes of all the predicted weights G(V) = [G(v1), . . . , G(vK)], which by design are L2-normalized. We implement G(·) with a 2-layer perceptron whose input and output vectors are L2-normalized (see Fig. 3). Its hidden layer has size 2 × c.

Figure 3: Dynamic BoW-prediction head. G(·) learns to quickly adapt to the visual words in the continuously refreshed vocabulary V. The outputs G(V) are in fact weights that are used for mapping the features S(x̃) to the corresponding BoW vector yS(x̃).

We highlight that dynamic weight-generation modules are extensively used in the context of few-shot learning for producing classification weight vectors of novel classes using as input a limited set of training examples [26, 30, 55]. The advantages of using G(·) instead of the fixed weights that BoWNet uses are the equivariance to permutations of the visual words, the increased stability to the frequent and abrupt updates of the visual words, and a number of parameters |θG| that is independent of the number of visual words K, hence requiring fewer parameters than a fixed-weights linear layer for large vocabularies.
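A possible implementation of this dynamic prediction head is sketched below (assuming PyTorch; the hidden size 2·c, the L2 normalizations and the κ scaling follow the description above, everything else is our choice):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicBoWHead(nn.Module):
    """Generation network G(.) producing prediction weights from the current vocabulary (Eq. 5)."""
    def __init__(self, c_l, c, kappa=8.0):
        super().__init__()
        self.kappa = kappa
        self.mlp = nn.Sequential(nn.Linear(c_l, 2 * c), nn.ReLU(inplace=True), nn.Linear(2 * c, c))

    def forward(self, student_feat, vocab):
        # student_feat: (B, c) global representation S(x~); vocab: (K, c_l) current visual words.
        w = self.mlp(F.normalize(vocab, dim=1))   # generated weights G(V): (K, c)
        w = F.normalize(w, dim=1)                 # L2-normalized by design
        logits = self.kappa * student_feat @ w.t()  # (B, K), argument of the softmax in Eq. 5
        return F.softmax(logits, dim=1)           # predicted BoW distribution y_S(x~)
```

Because the weights are regenerated from V at every forward pass, the head automatically tracks the queue updates of § 3.1, and its parameter count depends only on cℓ and c, not on K.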
3.3. Representation learning based on enhanced contextual reasoning

Data augmentation. The key factor for the success of many recent self-supervised representation learning methods [8, 9, 11, 31, 63] is to leverage several image augmentation/perturbation techniques, such as Gaussian blur [9], color jittering and random cropping, as well as cutmix [73], which substitutes one random-size patch of an image with a patch from another. In our method, we want to fully exploit the possibility of building strong image representations by hiding local information. As the teacher is randomly initialised, it is important to hide large regions of the original image from the student so as to prevent the student from relying only on low-level image statistics for reconstructing the distributions yT(x) over the teacher visual words, which capture low-level visual cues at the beginning of the training. Therefore, we carefully design our image perturbation scheme to make sure that the student has access to only a very small portion of the original image. Specifically, similar to [8, 46], we extract from a training image multiple crops with two different mechanisms: one that outputs 160 × 160-sized crops that cover less than 60% of the original image, and one with 96 × 96-sized crops that cover less than 14% of the original image (see Fig. 4). Given those image crops, the student must reconstruct the full bag of visual words from each of them independently. Therefore, our cropping strategy definitively forces the student network to understand and learn the spatial dependencies between visual parts.

Figure 4: Reconstructing BoWs from small parts of the original image. Given a training image (left), we extract two types of image crops. The first type (middle) is obtained by randomly sampling an image region whose area covers no more than 60% of the entire image, resizing it to 160 × 160 and then giving it as input to the student as part of the reconstruction task. The second type (right) is obtained by randomly selecting an area that covers between 60% and 100% of the entire image, resizing it to a 256 × 256 image, dividing it into 3 × 3 overlapping patches of size 96 × 96, and randomly choosing 5 out of these 9 patches (indicated with red rectangles) that are given as 5 separate inputs to the student. The student must then reconstruct the original BoW target independently for each patch. The blue rectangle on the left image indicates the central 224 × 224 crop from which the teacher produces the BoW target. Note that, except for horizontal flipping, no other perturbation is applied to the teacher's inputs.

Multi-scale BoW reconstruction targets. We also consider reconstructing BoWs from multiple network layers that correspond to different scale levels. In particular, we experiment with using both the last scale level L (i.e., layer conv5 in ResNet) and the penultimate scale level L−1 (i.e., layer conv4 in ResNet). The reasoning behind this is that the features of level L−1 still encode semantically important concepts but have a smaller receptive field than those in the last level. As a result, the visual words of level L−1 that belong to image regions hidden from the student are less likely to be influenced by pixels of the image regions the student is given as input. Therefore, by using BoWs from this extra feature level, the student is further enforced to learn contextual reasoning skills (and in fact, at a level with higher spatial detail due to the higher resolution of level L−1), thus learning richer and more powerful representations. When using BoWs extracted from two layers, our method includes a separate vocabulary for each layer, denoted by VL and VL−1 for layers L and L−1 respectively, and two different weight generators, denoted by GL(·) and GL−1(·) for layers L and L−1, respectively. Regardless of which layer the BoW target comes from, the student uses a single global image representation S(x̃), typically coming from the global average pooling layer after the last convolutional layer (i.e., layer pool5 in ResNet), to perform the reconstruction task.

We show empirically in Section 4.1 that the contextual reasoning skills implicitly developed via the above two schemes are decisive for learning effective image representations with the BoW-reconstruction task.
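For illustration, the cropping scheme of Fig. 4 could be sketched as follows (hypothetical torchvision-based code; the exact scale ranges passed to RandomResizedCrop are our assumption, chosen to respect the <60% and 60-100% area constraints stated above):

```python
import random
import torch
from torchvision import transforms

crop_160 = transforms.Compose([
    transforms.RandomResizedCrop(160, scale=(0.08, 0.6)),   # region covering < 60% of the image
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
crop_256 = transforms.Compose([
    transforms.RandomResizedCrop(256, scale=(0.6, 1.0)),    # region covering 60%-100% of the image
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
teacher_view = transforms.Compose([                         # central 224x224 crop, flip only
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.RandomHorizontalFlip(), transforms.ToTensor(),
])

def student_views(img):
    """Two 160x160 crops plus 5 of the 9 overlapping 96x96 patches of a 256x256 crop."""
    views = [crop_160(img) for _ in range(2)]
    big = crop_256(img)                                      # (3, 256, 256)
    offsets = [0, 80, 160]                                   # 3x3 grid, stride (256 - 96) / 2 = 80
    patches = [big[:, y:y + 96, x:x + 96] for y in offsets for x in offsets]
    views += random.sample(patches, 5)
    return views                                             # each view is fed separately to S(.)
```

Each returned student view is encoded independently by S(·) and trained to predict the same teacher BoW target computed from teacher_view.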
4. Experiments and results

We evaluate our method (OBoW) on the ImageNet [57], Places205 [78] and VOC07 [21] classification datasets as well as on the VOC07+12 detection dataset.

Implementation details. For our models, the vocabulary size is set to K = 8192 words and, as in BoWNet, when computing the BoW targets we ignore the visual words that correspond to the feature vectors on the edge of the teacher feature maps. The momentum coefficient α for the teacher updates is initialized at 0.99 and is annealed to 1.0 during training with a cosine schedule. The hyper-parameters κ and δbase are set to 5 and 1/10 respectively for the results in § 4.1, and to 8 and 1/15 respectively for the results in § 4.3. For more implementation details, see § C in the Supplementary.

4.1. Analysis

Here we perform a detailed analysis of our method. Due to the computationally intensive nature of pre-training on ImageNet, we use a smaller but still representative version created by keeping only 20% of its images, and we implement our model with the light-weight ResNet18 architecture. For training we use SGD for 80 epochs with a cosine learning rate initialized at 0.05, batch size 128 and weight decay 5e−4. We evaluate models trained with two versions of our method: the vanilla version, which uses single-scale BoWs and extracts from each training image one 160 × 160-sized crop (with which it trains the student), and the full version, which uses multi-scale BoWs and extracts from each training image two 160 × 160-sized crops plus five 96 × 96-sized patches.

Evaluation protocols. After pre-training, we freeze the learned representations and use two evaluation protocols. (1) The first one consists in training 1000-way linear classifiers for the ImageNet classification task. (2) For the second protocol, our goal is to analyze the ability of the representations to learn with few training examples. To that end, we use 300 ImageNet classes and run with them multiple (200) episodes of 50-way classification tasks with 1 or 5 training examples per class and a Prototypical-Networks [60] classifier.

4.2. Results

Online vocabulary updates. In Tab. 1, we compare the approaches for online vocabulary updates described in § 3.1. The queue-based solutions achieve in general better results than online k-means. Among the queue-based options, random sampling of locally averaged features, opt. (c), provides the best results. Its advantage over option (b) with global averaging is more evident with multi-scale BoWs, where an extra feature level with a higher resolution and more localized features is used, in which case global averaging produces visual words that are too coarse. In all remaining experiments, we use a queue-based vocabulary with option (c).

Table 1: Comparison of online vocabulary-update approaches. The results in the first two sections use the vanilla version of our method; the third section uses the full version.

                                              Few-shot
Updating method                          1-shot   5-shot   Linear
Online k-means
  (a) replacing rare clusters             40.98    60.35    44.45
  (b) Sinkhorn-based balancing            37.20    55.22    39.74
Queue-based vocabulary
  (a) local features                      40.29    60.81    44.39
  (b) globally-averaged features          41.57    62.54    45.79
  (c) locally-averaged features           42.11    62.44    45.86
Queue-based vocabulary – multi-scale BoW
  (b) globally-averaged features          41.29    63.09    49.40
  (c) locally-averaged features           44.18    64.89    50.89

Momentum for teacher updates. In Table 2, we study the sensitivity of our method w.r.t. the momentum α for the teacher updates (Equation 3). We notice a strong drop in performance when decreasing α from 0.9 to 0.5 (a rapidly-changing teacher), and to 0 (the teacher and student have identical parameters), while keeping the initial learning rate fixed (lr = 0.05). However, we noticed that this was not due to any cluster/mode collapse issue. The issue is that the teacher signal is noisier at low α because of the rapid change of its parameters, which prevents the student from converging when the learning rate is kept as high as 0.05. We notice in Table 2 that reducing the learning rate to adapt to the reduction of α reduces the performance gap. This indicates that our method is not as sensitive to the choice of the momentum as MoCo and BYOL were shown to be.

Table 2: Influence of the momentum coefficient α used for the teacher updates. For these results, we used the vanilla version. In the "0.99 → 1" row, α is initialized to 0.99 and annealed to 1.0 with a cosine schedule. The other entries use constant α values.

                     Few-shot
α          lr     1-shot   5-shot   Linear
0.99 → 1   0.05    42.11    62.44    45.86
0.999      0.05    40.87    61.41    45.76
0.99       0.05    41.19    61.65    46.25
0.9        0.05    40.79    60.92    44.89
0.5        0.05    12.70    23.20    15.41
0.0        0.05    13.19    24.85    17.47
0.5        0.03    39.52    60.18    43.82
0.0        0.01    33.80    55.02    39.90

Dynamic BoW prediction and soft quantization. In Table 3, we study the impact of the dynamic BoW prediction and of using soft assignment for the codes instead of hard assignment. We see that (1), as expected, the network is unable to learn useful features without the proposed dynamic BoW prediction, i.e., when using fixed weights; and (2) soft assignment indeed provides a performance boost.

Table 3: Ablation of dynamic BoW prediction and soft quantization. For these results, we used the vanilla version of our method. "Soft": soft assignment instead of hard assignment. "Dyn": dynamic weight generation instead of fixed weights.

                  Few-shot
Soft   Dyn     1-shot   5-shot   Linear
 X      X       42.11    62.44    45.86
        X       38.61    59.98    44.64
 X               2.00     2.00     0.10

Enforcing context-aware representations. In Table 4 we study different types of image crops for the BoW reconstruction task, as well as the impact of multi-scale BoW targets. We observe that: (1) as discussed in § 3.3, smaller crops that hide significant portions of the original image are better suited for our reconstruction task, leading to a dramatic increase in performance (compare the 1 × 224² entry with the 1 × 160² and 5 × 96² entries); (2) randomly sampling two 160 × 160-sized crops (entries 2 × 160²) and using 96 × 96-sized patches leads to another significant increase in performance; and (3) finally, employing multi-scale BoWs improves the performance even further.

Table 4: Evaluation of image crop augmentations and of multi-scale BoWs. See text.

Image crops               Multi-scale   Linear
1 × 224²                                 31.39
1 × 224² + cutmix                        39.46
1 × 160²                                 45.86
2 × 160²                                 47.64
5 × 96²                                  44.24
2 × 160² + 5 × 96²                       49.64
2 × 160²                       X         49.00
2 × 160² + 5 × 96²             X         50.89

BoW-like comparison. In Table 5, we compare our method with the reference BoW-like method BoWNet. For a fair comparison, we implemented BoWNet both with its proposed augmentations, i.e., using one 224 × 224-sized crop with cutmix ("BoWNet" row), and with the image augmentation we propose in the vanilla version of our method, i.e., one 160 × 160-sized crop ("BoWNet (160² crops)" row).

Table 5: Comparison with BoW-like methods. "EP": total number of epochs used for pre-training. Note that the BoWNet method consists of 40 epochs for teacher pre-training with the RotNet method followed by two BoWNet training rounds of 80 epochs.

                                Few-shot
Method               EP      n=1     n=5    Linear
BoWNet               200    33.80   55.02   41.30
BoWNet (160² crops)  200    29.26   49.68   43.59
OBoW (vanilla)        80    42.11   62.44   45.86
OBoW (full)           80    44.18   64.89   50.89
                                Linear classification              VOC detection           Semi-supervised
Method          Epochs  Batch   ImageNet  Places205  VOC07    AP50   AP75   APall    1% Labels  10% Labels
Supervised 100 256 76.5 53.2 87.5 81.3 58.8 53.5 48.4 80.4
BoWNet [25] 325 256 62.1 51.1 79.3 81.3 61.1 55.8 - -
PCL [45] 200 256 67.6 50.3 85.4 - - - 75.3 85.6
MoCo v2 [33] 200 256 67.5 - - 82.4 63.6 57.0 - -
SimCLR [9] 200 4096 66.8 - - - - - - -
SwAV [8] 200 256 72.7 56.2† 87.2† 81.8† 60.0† 54.4† 76.7† 88.7†
BYOL [31] 300 4096 72.5 - - - - - - -
OBoW (Ours) 200 256 73.8 56.8 89.3 82.9 64.8 57.9 82.9 90.7
PIRL [46] 800 1024 63.6 49.8 81.1 80.7 59.7 54.0 57.2 83.8
MoCo v2 [33] 800 256 71.1 52.9 87.1 82.5 64.0 57.4 - -
SimCLR [9] 1000 4096 69.3 53.3 86.4 - - - 75.5 87.8
BYOL [31] 1000 4096 74.3 - - - - - 78.4 89.0
SwAV [8] 800 4096 75.3 56.5 88.9 82.6 62.7 56.1 78.5 89.9
Table 6: Evaluation of ImageNet pre-trained ResNet50 models. The “Epochs” and “Batch” columns provide the number of pre-training
epochs and the batch size of each model respectively. The first section includes models pre-trained with a similar number of epochs as
our model (second section). We boldface the best results among all sections, as well as among the first two sections only. For the linear classification
tasks, we provide the top-1 accuracy. For object detection, we fine-tuned Faster R-CNN (R50-C4) on VOC trainval07+12 and report
detection AP scores by testing on test07. For semi-supervised learning, we fine-tune the pre-trained models on 1% and 10% of ImageNet
and report the top-5 accuracy. Note that, in this case the “Supervised” entry results come from [74] and are obtained by supervised training
using only 1% or 10% of the labelled data. All the classification results are computed with single-crop testing. † : results computed by us.
We see that our method, even in its vanilla version, achieves significantly better results, while using at least two times fewer training epochs, which validates the efficiency of our proposed fully-online training methodology.

4.3. Self-supervised training on ImageNet

Here we evaluate our method by pre-training convnet-based representations with it on the full ImageNet dataset. We implement the full version of our method (as described in § 4.1) using the ResNet50 (v1) [34] architecture. We evaluate the learned representations on the ImageNet, Places205, and VOC07 classification tasks as well as on the VOC07+12 detection task, and provide results in Table 6. On ImageNet classification we evaluate two settings: (1) training linear classifiers with 100% of the data, and (2) fine-tuning the model using 1% or 10% of the data, which is referred to as semi-supervised learning.

Results. Pre-training on the full ImageNet and then transferring to downstream tasks is the most popular benchmark for unsupervised representations, and thus many methods have configurations specifically tuned on it. In our case, due to the computationally intensive nature of pre-training on ImageNet, no full tuning of OBoW took place. Nevertheless, it achieves very strong empirical results across the board. Its classification performance on ImageNet is 73.8%, which is substantially better than the instance discrimination methods MoCo v2 and SimCLR, and even improves over the recently proposed BYOL and SwAV methods when considering a similar amount of pre-training epochs. Moreover, on VOC07 classification and Places205 classification, it achieves a new state of the art despite using significantly fewer pre-training epochs than related methods. On the semi-supervised ImageNet ResNet50 setting, it significantly surpasses the state of the art for 1% labels, and is also better for 10% labels, again using much fewer epochs. On VOC detection, it outperforms previous state-of-the-art methods while demonstrating strong performance improvements over supervised pre-training.

5. Conclusion

In this work, we introduce OBoW, a novel unsupervised teacher-student scheme that learns convnet-based representations with a BoW-guided reconstruction task. By employing an efficient fully-online training strategy and promoting the learning of context-aware representations, it delivers strong results that surpass prior state-of-the-art approaches on most evaluation protocols. For instance, when evaluating the derived unsupervised representations on the Places205 classification, Pascal classification or Pascal object detection tasks, OBoW attains a new state of the art, surpassing prior methods while demonstrating significant improvements over supervised representations.
References

[1] Jean-Baptiste Alayrac, Joao Carreira, and Andrew Zisserman. The visual centrifuge: Model-free layered video representations. In CVPR, 2019.
[2] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, 2016.
[3] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In ICCV, 2017.
[4] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.
[5] Cristian Bucilǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In KDD, 2006.
[6] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
[7] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.
[8] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.
[9] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[10] Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, and Neil Houlsby. Self-supervised GANs via auxiliary rotation loss. In CVPR, 2019.
[11] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv, 2020.
[12] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
[13] Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. Visual categorization with bags of keypoints. In ECCVW, 2004.
[14] Arnaud Dapogny, Matthieu Cord, and Patrick Pérez. The missing data encoder: Cross-channel image completion with hide-and-seek adversarial network. In AAAI, 2020.
[15] Carl Doersch, Abhinav Gupta, and Alexei Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[16] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In ICLR, 2017.
[17] Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. In NeurIPS, 2019.
[18] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Trans. PAMI, 38(9), 2015.
[19] Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NeurIPS, 2014.
[20] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. In ICLR, 2017.
[21] Mark Everingham, Luc Van Gool, Chris Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes (VOC) challenge. IJCV, 88(2), 2010.
[22] William Falcon and Kyunghyun Cho. A framework for contrastive self-supervised learning and designing a new approach. arXiv, 2020.
[23] Jonathan Frankle, David J Schwab, and Ari Morcos. Are all negatives created equal in contrastive instance discrimination? arXiv, 2020.
[24] Elizabeth Gardner and Bernard Derrida. Three unfinished works on the optimal storage capacity of networks. Journal of Physics A: Mathematical and General, 22(12), 1989.
[25] Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, and Matthieu Cord. Learning representations by predicting bags of visual words. In CVPR, 2020.
[26] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In CVPR, 2018.
[27] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
[28] Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. ActionVLAD: Learning spatio-temporal aggregation for action classification. In CVPR, 2017.
[29] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.
[30] Faustino Gomez and Jürgen Schmidhuber. Evolving modular fast-weight networks for control. In ICANN, 2005.
[31] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.
[32] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
[33] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[35] Olivier Henaff. Data-efficient image recognition with contrastive predictive coding. In ICML, 2020.
[36] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPSW, 2014.
[37] Himalaya Jain, Spyros Gidaris, Nikos Komodakis, Patrick Pérez, and Matthieu Cord. QuEST: Quantized embedding space for transferring knowledge. In ECCV, 2020.
[38] Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A survey on contrastive self-supervised learning. arXiv, 2020.
[39] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
[40] Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus. Hard negative mixing for contrastive learning. In NeurIPS, 2020.
[41] Anoop Korattikara Balan, Vivek Rathod, Kevin P Murphy, and Max Welling. Bayesian dark knowledge. In NeurIPS, 2015.
[42] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2017.
[43] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In ECCV, 2016.
[44] Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In ICCV, 2017.
[45] Junnan Li, Pan Zhou, Caiming Xiong, Richard Socher, and Steven Hoi. Prototypical contrastive learning of unsupervised representations. In ICLR, 2021.
[46] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In CVPR, 2020.
[47] Ishan Misra, Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In ECCV, 2016.
[48] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
[49] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv, 2018.
[50] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NeurIPS, 2017.
[51] George Papamakarios. Distilling model knowledge. arXiv, 2015.
[52] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[53] Florent Perronnin and Christopher Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
[54] Sudeep Pillai, Rareş Ambruş, and Adrien Gaidon. Superdepth: Self-supervised, super-resolved monocular depth estimation. In ICRA, 2019.
[55] Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L Yuille. Few-shot image recognition by predicting parameters from activations. In CVPR, 2018.
[56] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In NeurIPS, 2019.
[57] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3), 2015.
[58] David Saad and Sara A Solla. Dynamics of on-line gradient descent learning for multilayer neural networks. In NeurIPS, 1996.
[59] Josef Sivic and Andrew Zisserman. Video Google: Efficient visual search of videos. In Toward Category-Level Object Recognition. Springer, 2006.
[60] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NeurIPS, 2017.
[61] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, 2017.
[62] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In ECCV, 2020.
[63] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning. In NeurIPS, 2020.
[64] Giorgos Tolias, Yannis Avrithis, and Hervé Jégou. To aggregate or not to aggregate: Selective match kernels for image search. In ICCV, 2013.
[65] Trieu H Trinh, Minh-Thang Luong, and Quoc V Le. Selfie: Self-supervised pretraining for image embedding. arXiv, 2019.
[66] Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. In IJCAI, 2019.
[67] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
[68] Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. Tracking emerges by colorizing videos. In ECCV, 2018.
[69] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 2020.
[70] Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. Learning and using the arrow of time. In CVPR, 2018.
[71] Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance-level discrimination. In CVPR, 2018.
[72] Jun Yang, Yu-Gang Jiang, Alexander G Hauptmann, and Chong-Wah Ngo. Evaluating bag-of-visual-words representations in scene classification. In MIR, 2007.
[73] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.
[74] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4L: Self-supervised semi-supervised learning. In ICCV, 2019.
[75] Liheng Zhang, Guo-Jun Qi, Liqiang Wang, and Jiebo Luo. AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. In CVPR, 2019.
[76] Richard Zhang, Phillip Isola, and Alexei Efros. Colorful image colorization. In ECCV, 2016.
[77] Richard Zhang, Phillip Isola, and Alexei Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, 2017.
[78] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In NeurIPS, 2014.
[79] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.