Attribute Discovery Via Predictable Discriminative Binary Codes
Ali Farhadi
David Forsyth
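$$
\begin{aligned}
\min_{w,\xi,L,B}\quad & \frac{1}{2}\sum_{c\in\{1:C\}}\ \sum_{m,n\in c} d(B_m,B_n) \;+\; \gamma\sum_{s\in\{1:k\}}\|w^s\|^2 \;+\; \lambda_1\sum_{\substack{i\in\{1:N\}\\ s\in\{1:k\}}}\xi_i^s \\
& -\; \frac{\lambda_2}{2}\sum_{\substack{c'\in\{1:C\}\\ p\in c'}}\ \sum_{\substack{c''\in\{1:C\}\\ c'\neq c'',\ q\in c''}} d(B_p,B_q) \qquad (1)\\
\text{s.t.}\quad & l_i^s\,(w^{s\top}x_i)\ \ge\ 1-\xi_i^s && \forall i\in\{1:N\},\ s\in\{1:k\}\\
& b_i^s = \bigl(1+\operatorname{sign}(w^{s\top}x_i)\bigr)/2 && \forall i\in\{1:N\},\ s\in\{1:k\}\\
& \xi_i^s \ge 0 && \forall i\in\{1:N\},\ s\in\{1:k\}
\end{aligned}
$$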
where $d$ can be any distance in the Hamming space, $B_i = [b_i^1, b_i^2, \ldots, b_i^k]$, $w^s$ is the weight vector corresponding to the $s$th split, $\xi_i^s$ is the slack variable corresponding to the $s$th split and $i$th example, $C$ is the total number of categories, $k$ is the number of splits, $N$ is the total number of examples in the training set, $l_i^s$ is the training label for the $i$th example when training the $s$th split, and $b_i^s$ is the prediction for the $i$th example using split $s$.
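As a rough illustration of how the terms of optimization (1) combine, the following Python sketch evaluates the objective for given codes, weights, and slacks. It assumes that $d$ is the Hamming distance and that the pairwise sums run over ordered pairs (hence the factors of 1/2). The names `hamming_sum`, `objective`, `gamma`, `lam1`, and `lam2` are hypothetical and chosen only for this example.

```python
import numpy as np

def hamming_sum(B_a, B_b):
    """Sum of Hamming distances over all pairs (one row of B_a, one row of B_b)."""
    # For 0/1 codes, the number of differing bits is the Hamming distance.
    return np.abs(B_a[:, None, :] - B_b[None, :, :]).sum()

def objective(B, y, W, xi, gamma=1.0, lam1=1.0, lam2=1.0):
    """Evaluate the four terms of optimization (1) with d = Hamming distance.

    B  : (N, k) array of 0/1 codes      y  : (N,) category ids
    W  : (k, D) split weight vectors    xi : (N, k) slack variables
    """
    B = np.asarray(B, dtype=np.int64)   # signed dtype so differences do not wrap
    classes = np.unique(y)
    # Ordered pairs inside each category (every unordered pair counted twice).
    within = sum(hamming_sum(B[y == c], B[y == c]) for c in classes)
    # Ordered pairs across categories (again counted twice over all c).
    between = sum(hamming_sum(B[y == c], B[y != c]) for c in classes)
    margin = gamma * np.sum(W ** 2)     # sum of squared norms of the w^s
    slack = lam1 * np.sum(xi)           # total slack
    return 0.5 * within + margin + slack - 0.5 * lam2 * between
```

Smaller values indicate codes that are compact within categories, spread out across categories, and predictable with large margins.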
6 Rastegari, Farhadi, Forsyth
Algorithm 1 Optimization
Input: $X = [x_1, \ldots, x_N]$ ($x_i$ is the low-level feature vector for the $i$th image).
Output: $B$ ($B_i = [b_i^1, b_i^2, \ldots, b_i^k]$ is the binary code for the $i$th image).
1: Initialize $B$ by: $B \leftarrow$ projection of $X$ on the first $k$ components of PCA($X$)
2: Binarize $B$: $B \leftarrow (1 + \operatorname{sign}(B))/2$
3: repeat
4:   Optimize for $B$ in $\min_B\ \frac{1}{2}\sum_{c\in\{1:C\}}\sum_{m,n\in c} d(B_m,B_n) - \frac{\lambda_2}{2}\sum_{c'\in\{1:C\},\, p\in c'}\ \sum_{c''\in\{1:C\},\, c'\neq c'',\, q\in c''} d(B_p,B_q)$ (see supplementary materials for details)
5:   $l_i^s \leftarrow (2 b_i^s - 1)$ for all $i \in \{1:N\}$, $s \in \{1:k\}$
6:   Train $k$ linear SVMs to update $w^s$ for all $s \in \{1:k\}$ using $L$ as training labels ($l_i^s$: label for the $i$th image when training the $s$th split)
7:   $b_i^s \leftarrow (1 + \operatorname{sign}(w^{s\top} x_i))/2$ for all $i \in \{1:N\}$, $s \in \{1:k\}$
8: until convergence on optimization (1)
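The loop of Algorithm 1 can be sketched in Python as follows. This is a toy illustration under several assumptions, not the authors' implementation: step 4 is replaced by a hypothetical greedy update on a random subset of bits (the actual update is described only in the supplementary material), the SVMs are trained with a plain subgradient solver on centered features without a bias term, `lam2 = 1/(C-1)` is just one possible way to normalize for category size, and a fixed iteration count replaces the convergence test. All function and parameter names are made up for this example.

```python
import numpy as np

def binarize(Z):
    """Map real-valued scores to {0, 1}; scores <= 0 map to 0."""
    return (Z > 0).astype(int)

def update_codes(B, y, lam2, rng, flip_frac=0.1):
    """Stand-in for step 4 (an assumption, not the authors' update).

    For a random subset of entries, pick the bit value that lowers
    within-class distance minus lam2 * between-class distance.
    """
    B = B.copy()
    total_ones = B.sum(axis=0)                        # ones per bit over all data
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        ones_in = B[idx].sum(axis=0)                  # ones per bit inside class c
        zeros_in = len(idx) - ones_in
        ones_out = total_ones - ones_in
        zeros_out = (len(y) - len(idx)) - ones_out
        cost0 = ones_in - lam2 * ones_out             # cost of setting the bit to 0
        cost1 = zeros_in - lam2 * zeros_out           # cost of setting the bit to 1
        best = (cost1 < cost0).astype(int)
        pick = rng.random(B[idx].shape) < flip_frac   # entries allowed to change
        B[idx] = np.where(pick, best, B[idx])
    return B

def train_linear_svm(X, labels, reg=1e-2, lr=0.1, epochs=100):
    """Subgradient descent on the L2-regularized hinge loss (toy solver)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        viol = labels * (X @ w) < 1                   # margin violations
        grad = reg * w - (labels[viol][:, None] * X[viol]).sum(axis=0) / len(X)
        w -= lr * grad
    return w

def learn_codes(X, y, k, n_iter=5, lam2=None, seed=0):
    rng = np.random.default_rng(seed)
    if lam2 is None:                                  # assumed normalization
        lam2 = 1.0 / max(len(np.unique(y)) - 1, 1)
    Xc = X - X.mean(axis=0)
    # Steps 1-2: project on the first k PCA directions, then binarize.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    B = binarize(Xc @ Vt[:k].T)
    for _ in range(n_iter):                           # step 3 (fixed count here)
        B = update_codes(B, y, lam2, rng)             # step 4
        L = 2 * B - 1                                 # step 5: labels in {-1, +1}
        W = np.stack([train_linear_svm(Xc, L[:, s]) for s in range(k)])  # step 6
        B = binarize(Xc @ W.T)                        # step 7
    return B, W
```

The returned `W` stacks one weight vector per bit, so new (identically centered) features can be encoded by thresholding `Xc_new @ W.T` at zero.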
This is an extremely hard optimization problem, but we may not need to find the global minimum to obtain good binary codes. Good local minima are capable of producing promising discriminative binary codes. To decrease the objective function in optimization (1) we use an iterative block coordinate descent method. Algorithm 1 describes our optimization step by step.
We initialize by choosing $B$ to form orthogonal codes that come from projections along PCA directions. In our experiments we find that this initialization yields promising results. The supplementary material describes the (minor) effects of other choices of initialization. We then initialize $w^s$ to predict these codes. Notice that the $w^s$ are independent given a fixed $B$, so each of them can be trained with a separate SVM.
We now proceed by iterating three steps in sequence. First, we update $B$ for fixed $w^s$, $\xi_i^s$; this proposes an improvement in the codes that should achieve improved separation. This is an iterative procedure that starts at the current value of $B$. We use stochastic gradient descent (step 4) with an important optimization. Since $B$ is binary, if $b_i^s$ is 0 then the sum of its differences from the $s$th bits of all other codes is the number of 1s in that position, and vice versa. We can therefore precompute the number of 0s and 1s for the $s$th element of $B$, which reduces the complexity of computing the sum of differences from $O(N^2 k)$ to $O(Nk)$. Second, we update $L$ using $B$ and then (fixing $L$, $B$) we update $w^s$, $\xi_i^s$ by training SVMs using $L$ as training labels. This produces a set of SVMs to predict these improved codes. Each bit of $B$ represents a labeling of instances that we want an SVM to reproduce, so we can compute the optimal $w^s$ and $\xi_i^s$ with a standard SVM solver. Third, we update the current value of $B$ to reflect the codes that these SVMs actually predict; this biases the update of $B$ in the direction of codes that can be predicted. While this optimization procedure does not guarantee descent in each iteration, we have found that we get descent in practice (Figure 3). This is most likely because the steps balance the goodness of the code (updated in the first step) with our ability to predict it (second and third steps). In our experiments, we did not tune the parameters; $\lambda_1$ and $\gamma$ are set to 1 and $\lambda_2$ is set to normalize for the size of categories. Figure 3 shows the behavior of the objective function and all the terms in equation 1 after each iteration.
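The counting shortcut used in the first step can be checked with a small Python sketch. It compares a direct double loop over examples with the precomputed counts of 0s and 1s per bit position; the function names are hypothetical.

```python
import numpy as np

def diff_counts_naive(B):
    """out[i, s] = number of codes whose s-th bit differs from B[i, s]; O(N^2 k)."""
    N, k = B.shape
    out = np.zeros((N, k), dtype=int)
    for i in range(N):
        for j in range(N):
            out[i] += B[i] != B[j]
    return out

def diff_counts_fast(B):
    """Same quantity from per-bit counts of ones and zeros; O(N k)."""
    ones = B.sum(axis=0)            # number of 1s in each bit position
    zeros = B.shape[0] - ones       # number of 0s in each bit position
    # A 0 bit differs from every 1 in its column, and a 1 bit from every 0.
    return np.where(B == 0, ones, zeros)
```

Both functions return the same (N, k) array; only the second avoids the quadratic loop over pairs of examples.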
Fig. 2. We compare our binary codes (DBC) with Locality Sensitive Hashing (LSH), Spectral Hashing (SpH), and the supervised version of Iterative Quantization (ITQ) under several different settings: changing the length of binary codes (32, 64, 128, 256), classifiers (linear SVM or KNN), original features (Classeme, ColorSift), and also with L1 selection of category-specific bits (DBC-L1). Our codes (DBC) consistently outperform state-of-the-art methods like SpH and ITQ by large margins. The test set contains 25 examples per category. Due to space limitations only a few of the experimental settings can be shown in the paper. Please see supplementary material for all plots.
Once converged, optimization (1) provides us with the weight vectors $w^s$ for split classifiers that tend to produce binary codes with built-in margins. We use $w^s$ to project the data to the space of the binary codes.
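Projecting data into the code space amounts to thresholding the split scores at zero. A minimal Python sketch, assuming the learned weight vectors are stacked as rows of a matrix `W` (a layout chosen here for illustration) and that scores of exactly zero map to bit 0:

```python
import numpy as np

def encode(X, W):
    """Binary codes for the rows of X: bit s of row i is 1 iff w^s . x_i > 0.

    X : (N, D) feature matrix    W : (k, D) matrix whose rows are the w^s
    """
    return (X @ W.T > 0).astype(int)
```

The same function applies to training images and to unseen query images, since it depends only on the learned hyperplanes.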
Using Codes: There are several ways to use the resulting binary codes. We evaluate our codes by a) using them as hash codes and performing KNN on the codes (called KNN in our experiments), b) using them as features and learning SVM classifiers for each category (called SVM), and c) using the codes as features while accepting that these features might be redundant, and using L1-regularized models to pick category-specific codes (i.e. for each category we learn an L1-regularized SVM and pick the bits that correspond to the largest absolute weight values of the L1-SVM) and then learning normal SVM classifiers using the selected bits.
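Options a) and c) can be sketched in a few lines of Python. The sketch assumes Hamming distance with majority voting for KNN (the voting rule is an assumption, since the text does not state one) and assumes that the per-category weight vector of an L1-regularized SVM is already available; training that SVM is not shown. Function names are hypothetical.

```python
import numpy as np

def hamming_knn_predict(B_train, y_train, B_query, n_neighbors=3):
    """Majority vote among the nearest training codes in Hamming distance.

    y_train must hold non-negative integer category ids.
    """
    # (Q, N) matrix of Hamming distances between query and training codes.
    dist = (B_query[:, None, :] != B_train[None, :, :]).sum(axis=2)
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :n_neighbors]
    votes = y_train[nearest]
    return np.array([np.bincount(v).argmax() for v in votes])

def select_bits(weights, n_bits):
    """Indices of the n_bits code bits with the largest absolute weight."""
    return np.argsort(-np.abs(weights), kind="stable")[:n_bits]
```

For option c), a category-specific classifier would then be trained on `B[:, select_bits(w_c, n_bits)]`, where `w_c` is that category's L1-SVM weight vector.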
4 Experimental Evaluations
Tasks: The main tasks of our experiments are classification and category retrieval. We compare our method in several different settings with the state-of-the-art bit-code-based methods. We also compare our method with state-of-the-art classification techniques. Our bit learning algorithm leads to interesting observations about the data, such as attribute discovery, so we also qualitatively evaluate our method in retrieval and attribute discovery. Our method is also applicable to novel category recognition.
Datasets: We test our method on Caltech256 [27] and ImageNet [28] (ILSVRC2010). Both of these datasets have a large number of categories (256 and 1000, respectively) with huge intra-class variations. Category retrieval on Caltech256 is a challenging task because the number of categories is much higher than in typical experiments and the intra-class variations are much higher than in typical datasets like MNIST. There are around 30000 images belonging to 256 categories [27]. On average, there are about 120 images per category.
Fig. 3. Our optimization procedure finds descent directions in our challenging objective function. This figure shows that all terms in the objective function actually improve after each iteration.
Fig. 4. Our method outperforms state-of-the-art binary code methods (LSH, SpH, and supervised ITQ) on ImageNet (1000 categories, 20 examples per category for training). The left plot compares precision at 25 versus the code length. The test set contains 150 images per category. The right plot shows the precision-recall curve for the same dataset using 512-dimensional codes. Our codes consistently outperform all other methods.
Features: For experiments on Caltech256 we use two different sets of features that have been shown to produce state-of-the-art results on Caltech256: Classeme and ColorSift. We use Classeme features because they have been shown to outperform other features [20]. The Classeme features are 2659-dimensional. We also use ColorSift features as they show promising performance on classification tasks [29]. We use ColorSift bag-of-words features by building a 1000-word dictionary using ColorSift features provided by [30]. To make these features more discriminative we use a homogeneous kernel map [31] on top of SIFT-BoW. Homogeneous kernels have been shown to produce the best results in many classification tasks [31]. Both of these features are among the most discriminative features. For ImageNet experiments we also use Classeme features.
Controls: To evaluate our method we perform a series of extensive evaluations and comparisons. For our method, we change the following settings: the length of binary codes k (32, 64, 128, 256, 512), the number of training examples per category (5 or 50), the original features (Classemes or ColorSift), the classifier (linear SVM (LSVM) [32] or KNN), and the use of L1 selection of category-specific bit strings. To compare with methods in the literature we compare our results with Locality Sensitive Hashing (LSH) as a standard baseline, with the supervised version of Iterative Quantization (ITQ) [16] as the best supervised method, and with Spectral Hashing (SpH) [11] as the state-of-the-art unsupervised method for producing binary codes. Our experimental evaluations demonstrate that our method consistently outperforms state-of-the-art methods under all combinations of the above settings. To evaluate our method on a large-scale dataset we test it on ImageNet. We used the 1000 categories from ILSVRC2010 (ImageNet Challenge). For each category we randomly chose 20 examples for training and 150 examples for testing. Our results show that our codes also outperform state-of-the-art binary code results on this dataset.
Fig. 5. Our method produces state-of-the-art results on Caltech256. A linear SVM with only a 128-bit code is as accurate as the multiple kernel learning method LPBeta (marked with a big star), which uses 13000-dimensional features. As we increase the size of the code we outperform the LPBeta method significantly. This figure compares our category-specific codes (DBC-L1), our codes without L1 selection (DBC), ITQ, and SpH on precision at 25 versus the number of training examples per category on Caltech256. One interesting observation is that the ITQ method does a great job of following the original features (Classeme) with 512-bit codes. This, however, hurts its performance, as its 128- and 256-dimensional codes outperform the original features. This confirms our intuition that following the patterns in the original feature space does not necessarily result in good performance numbers.
Fig. 6. This figure qualitatively compares the quality of images retrieved by our method with those retrieved by ITQ and SpH. Each row corresponds to the top five images returned by three different methods: ours, ITQ, and spectral hashing. This retrieval is done by first projecting the query image to the space of binary codes and then running KNN in that space. Notice how, even with relatively short codes (32 bits), our method recovers relevant objects. This means that the discriminative training of the code has forced our code learning to focus on distinctive shared properties of categories. Our method consistently becomes more accurate as we increase the code size.
Measurements: In the case of SVM, we use the top k images to compute precision and recall values (here k denotes the number of retrieved images, not the code length). Varying k = [1 : 5 : 100] traces out the precision-recall curves. In the case of KNN, we compute a precision and a recall value for each number of nearest neighbors; varying this number traces out a precision-recall curve.
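The curve construction for the SVM case can be sketched in Python as follows. The sketch assumes that the test images have already been ranked by decreasing classifier score and flagged as relevant (1) or not (0), and it reads k = [1 : 5 : 100] as the MATLAB-style range 1, 6, 11, ..., 96, which is an interpretation. The function name is hypothetical.

```python
def precision_recall_at_k(ranked_relevance, n_relevant, ks):
    """Precision and recall of the top-k results for each k in ks.

    ranked_relevance : list of 0/1 flags, sorted by decreasing classifier score
    n_relevant       : total number of relevant items in the test set
    """
    points = []
    for k in ks:
        hits = sum(ranked_relevance[:k])        # relevant items among the top k
        points.append((hits / k, hits / n_relevant))
    return points

# MATLAB-style 1:5:100, i.e. 1, 6, 11, ..., 96
ks = range(1, 101, 5)
```

Plotting the returned (precision, recall) pairs for increasing k gives one precision-recall curve per category or per method.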
Results: There are four main categories of results. First, we compare our method with the state-of-the-art bit-code methods on Caltech256 and ImageNet. We also show interesting qualitative results. Second, we compare our results with the state-of-the-art method on Caltech256. Third, we compare our method on novel category recognition with the state-of-the-art method of [33]. Fourth, we show qualitative results that reveal interesting properties of our method. We show promising attribute discovery results and also projections of the resulting bit-code space.
Fig. 7. This figure qualitatively compares the quality of images retrieved by our method with those retrieved by ITQ and SpH. Each row corresponds to the top five images returned by three different methods: ours, ITQ, and spectral hashing. This retrieval is done by first projecting the query image to the space of binary codes and then running KNN in that space. Notice how, even with relatively short codes (32 bits), our method recovers round objects. Our method is consistent in terms of returned images as we increase the code size. With a 256-dimensional code our method returns 5 correct images. (Panel labels: query image; rows for code lengths 32, 64, 128, and 256; columns for Our Method (DBC), ITQ, and SpH.)
Comparisons to the state of the art bit-code methods: Figure 2 compares our method (DBC, DBC-L1) with LSH, SpH, and the supervised version of ITQ by varying the number of bits in the binary codes. We perform extensive evaluations on all combinations of different settings. Space does not allow showing all comparisons in all settings; please see supplementary material for all comparisons. The settings that we show here are (from left to right in Figure 2): KNN on 512-dimensional bit codes with 50 training examples per category and Classeme features; SVM on 128-dimensional bit codes with 5 training examples per category and Classeme features; and SVM on 256-dimensional bit codes with 5 training examples per category and ColorSift features. In all possible settings, including these three, our method outperforms state-of-the-art bit-code methods. We also show that DBC-L1 performs better than DBC in all settings. The gap between DBC and DBC-L1 increases as the number of bits decreases. The huge gap at lower numbers of bits is due to the fact that in DBC-L1 we choose the bits to be specific to each category. In all of the experiments we use the same random selection of train and test sets.
Our experiments show that as we increase the neighborhood size in KNN our method can still find the right categories (see supplementary materials). This implies that our hash cells remain pure as we increase the size of the neighborhood. This confirms that optimization (1) managed to produce codes with enough margin. It is also worth noting that, even with such a small training set per category, a linear SVM achieves excellent results using our codes.
In Figure 5 we compare all the methods in terms of the precision at the top-25 ranked images with different code lengths. We also compared our method with product quantization [34] for 5 training examples per class, following the same experimental setup. Product quantization obtained precisions of 0.04, 0.05, 0.064, 0.08, and 0.09 for 32, 64, 128, 256, and 512 bits, respectively. Our method outperforms all the methods at all code lengths. The left plot in Figure 4 shows this comparison on ImageNet. For all other comparisons on Caltech256 please see the supplementary material. In these experiments, the test set contains 25 images per category. Figures 6 and 7 qualitatively compare our discriminative binary codes with ITQ and SpH in an image retrieval task. We show the top five retrieved images for the query image. It is interesting to see that even with a 32-dimensional code our method is capable of extracting relevant properties. Our method is consistent in terms of returned images as one increases the code size.
Fig. 8. Discovering attributes: Each bit corresponds to a hyperplane that groups the data according to unknown notions of similarity. It is interesting to show what our bits have discovered. On the two sides of the black bar we show the 8 most confident images for 5 different hyperplanes/bits (one per row). Note that one can easily provide names for these attributes. For example, the bottom row corresponds to round objects versus objects with straight vertical lines. The top row has silver, metallic, and boxy objects on one side and natural images on the other side; the second row has water animals versus objects with checkerboard patterns. Discovered attributes are in the form of contrasts: each side has its own meaning. These attributes are compact representations of standard attributes, which only explain one property. For more examples of discovered attributes please see supplementary material.
Comparison to the state of the art models on Caltech256: Figure 5 compares our results with state-of-the-art methods on Caltech256. We use the same features as the state-of-the-art method LPBeta (the big star in the figure). With only 128 bits we can achieve the same results as LPBeta, which uses 13000-dimensional features. By increasing the number of bits our codes outperform the multiple kernel learning method of LPBeta. This shows that DBC can be significantly more discriminative than state-of-the-art features. We also compare with the Classeme features. In this figure we perform the same test with the other binary code methods. ITQ does a great job of getting close to the original Classeme features when using 512-bit codes. However, this is worse than its performance with 128 or 256 bits. This is mainly due to the fact that ITQ minimizes the quantization error of binarization, which does not necessarily result in better discrimination. Our method consistently gets better with more and more bits.
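For context, the quantization error that ITQ minimizes can be written as below. This is a brief summary of the formulation in [16], not part of our method: $V$ denotes the projected data (PCA in the unsupervised variant, CCA in the supervised one), $R$ is an orthogonal rotation, and $B$ holds the resulting codes with entries in $\{-1,+1\}$.

```latex
\min_{B,\,R}\ \| B - V R \|_F^2
\quad \text{s.t.} \quad B \in \{-1,+1\}^{N \times k},\qquad R^\top R = I
```

The loss only measures how closely the codes follow the rotated projections, so lowering it does not by itself increase the separation between categories.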
Novel Category Recognition: So far, we have shown that our codes are discriminative for the categories they have been trained on. Similar to cross-category generalization of attributes, we also evaluate our method on categories that have not been observed during training. For that, we learn the binary codes on 1000 categories of ImageNet with 20 examples per category and test our codes on Caltech256. We make sure that none of the 1000 ImageNet categories overlap with the 256 categories of Caltech256. We adopt an experimental setting from [33] for which training data is available. Figure 9 shows that our method outperforms PiCodes [33], the state-of-the-art novel category recognition method. We used the same low-level features as in [33].
Attribute Discovery: Binary codes can be thought of as attributes. Our algorithm discovers attributes that can be named without much difficulty. Figure 8 shows some of the attributes discovered by our method. Each row shows the 8 most confident examples for both sides of a hyperplane that corresponds to a bit in our code. Our learning procedure can discover attributes like is round, is boxy, is natural, has checkerboard pattern, etc. More discovered attributes can be found in the supplementary material. Our model learns strong contrasts that are discriminative. As a result, each side of a discovered attribute has its own meaning. The discovered attributes are compact versions of standard attributes. Standard attributes describe only one property, but our discovered attributes are in the form of contrasts. For example, the first row contrasts boxy and silver objects against natural objects. If the bit that corresponds to the first row is 1, the attribute boxy is 1, and if the bit is zero, the attribute natural is 1. It is also very interesting to look at the space of binary codes. To do that, we project our binary codes into a 2-dimensional space using multidimensional scaling. Figure 10 shows an interesting balance between discrimination and classification. In the projected space round things like wheels and coins are close together despite belonging to different categories. At the same time, round things are far away from horse and camel examples. Head shots of horses and camels are closer to each other than to side views of horses and camels. Category memberships are suitable proxies for visual similarity but should not be enforced as hard constraints. Our model manages to balance between discrimination in terms of basic-level categories and learnability of the codes from visual data.
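A projection of this kind can be sketched in Python with classical (Torgerson) multidimensional scaling applied to pairwise Hamming distances. The text does not say which MDS variant was used, so the classical variant and the function names below are assumptions made for illustration.

```python
import numpy as np

def hamming_matrix(B):
    """(N, N) matrix of pairwise Hamming distances between 0/1 codes."""
    return (B[:, None, :] != B[None, :, :]).sum(axis=2).astype(float)

def classical_mds(D, n_components=2):
    """Embed points so that Euclidean distances approximate the entries of D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    G = -0.5 * J @ (D ** 2) @ J                    # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(G)                 # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:n_components]    # largest eigenvalues first
    # Clip small negative eigenvalues that arise for non-Euclidean distances.
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))
```

Calling `classical_mds(hamming_matrix(B))` on an (N, k) code matrix `B` returns an (N, 2) array of coordinates that can be drawn as a scatter plot.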
5 Discussions
In this paper we demonstrate that by balancing discrimination and learnability of the codes one can achieve small binary codes that outperform state-of-the-art results. We do this by letting each image have its own code while jointly optimizing for discrimination and learnability of the codes. Our experimental evaluations show that when there is strong visual evidence against a categorical membership constraint, accounting for violations actually improves the discrimination. The codes learned in this way can reveal interesting properties of the data. The space of projected bits reveals interesting groupings of objects. Our method is also capable of producing meaningful codes for retrieval and can discover attributes. Different applications may need different trade-offs between discrimination and learnability of the codes. What remains is how to learn to balance them according to query tasks. Our software and supplementary material are publicly available at http://vision.ri.cmu.edu/projects/dbc/dbc.html.
Fig. 9. Our codes can be used beyond the training categories (novel categories): we use 1000 categories of ImageNet to train our codes and use the codes to recognize objects in Caltech256. The 1000 categories from ImageNet do not intersect with those of Caltech256. Our method outperforms state-of-the-art methods on novel categories.
Fig. 10. Projection of the space of binary codes: We use multidimensional scaling to project our 64-dimensional codes into a two-dimensional space. It is interesting to see that our method clearly balances between discrimination and learnability of the codes: round objects like wheels and coins appear close by while horses and camels are far away. The heads of horses and the heads of camels are close to each other and far away from side views of them. Supplementary material includes more examples of these projections.
This work has been supported by the ONR-MURI grant N000141010934.
References
1. Lyman, P., Varian, H.R., Charles, P., Good, N., Jordan, L.L., Pal, J.: How much information?
(2003)
2. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In:
VLDB. (1999)
3. Shakhnarovich, G., Viola, P.A., Darrell, T.: Fast pose estimation with parameter-sensitive
hashing. In: ICCV. (2003)
4. Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image search. In:
ICCV. (2009)
5. Kulis, B., Darrell, T.: Learning to hash with binary reconstructive embeddings. (2009)
6. Rastegari, M., Fang, C., Torresani, L.: Scalable object-class retrieval with approximate and
top-k ranking. In: ICCV. (2011)
7. Salakhutdinov, R., Hinton, G.: Semantic hashing. Int. J. Approx. Reasoning (2009)
8. Torralba, A., Fergus, R., Weiss, Y.: Small codes and large image databases for recognition.
In: CVPR. (2008)
9. Salakhutdinov, R., Hinton, G.: Learning a nonlinear embedding by preserving class neigh-
bourhood structure. In: AISTATS. (2007)
10. Norouzi, M., Fleet, D.: Minimal loss hashing for compact binary codes. In: ICML. (2011)
11. Weiss, Y., Torralba, A.B., Fergus, R.: Spectral hashing. In: NIPS. (2008)
12. Raginsky, M., Lazebnik, S.: Locality sensitive binary codes from shift-invariant kernels. In:
NIPS. (2009)
13. Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: NIPS. (2007)
14. Lin, R.S., Ross, D.A., Yagnik, J.: Spec hashing: Similarity preserving algorithm for
entropy-based coding. In: CVPR. (2010)
15. Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact
image representation. In: CVPR. (2010)
16. Gong, Y., Lazebnik, S.: Iterative quantization: A procrustean approach to learning binary
codes. In: CVPR. (2011)
17. Wang, J., Kumar, S., Chang, S.F.: Semi-Supervised Hashing for Scalable Image Retrieval.
In: CVPR. (2010)
18. Farhadi, A., Forsyth, D.: Transfer learning in sign language. In: CVPR. (2007)
19. Farhadi, A., Tabrizi, M.K.: Learning to recognize activities from the wrong view point. In:
ECCV. (2008)
20. Torresani, L., Szummer, M., Fitzgibbon, A.: Efficient object category recognition using
classemes. In: ECCV. (2010)
21. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In:
CVPR. (2009)
22. Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by
between-class attribute transfer. In: CVPR. (2009)
23. Parikh, D., Grauman, K.: Interactively building a discriminative vocabulary of nameable
attributes. In: CVPR. (2011) 1681-1688
24. Duan, K., Parikh, D., Crandall, D., Grauman, K.: Discovering localized attributes for fine-
grained recognition. In: CVPR. (2012)
25. Shrivastava, A., Singh, S., Gupta, A.: Constrained semi-supervised learning using attributes
and comparative attributes. In: ECCV. (2012)
26. Wang, J., Kumar, S., Chang, S.F.: Sequential projection learning for hashing with compact
codes. In: ICML. (2010)
27. Griffin, G., Holub, A., Perona, P.: The Caltech-256. Technical report (2007)
28. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hier-
archical Image Database. In: CVPR. (2009)
29. Burghouts, G.J., Geusebroek, J.M.: Performance evaluation of local colour invariants. CVIU
(2009)
30. Gehler, P.V., Nowozin, S.: On feature combination for multiclass object classification. In:
ICCV. (2009)
31. Vedaldi, A., Zisserman, A.: Efficient additive kernels via explicit feature maps. In: CVPR.
(2010)
32. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large
linear classification. JMLR (2008)
33. Bergamo, A., Torresani, L.: Picodes: Learning a compact code for novel-category recogni-
tion. In: NIPS. (2011)
34. Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE
Trans. Pattern Anal. Mach. Intell. 33 (2011)