Semantics-Preserving Bag-of-Words Models and Applications
Research Collection School Of Computing and Information Systems
School of Computing and Information Systems
7-2010
Steven C. H. HOI
Singapore Management University, [email protected]
Nenghai YU
University of Science and Technology of China
Citation
WU, Lei; HOI, Steven C. H.; and YU, Nenghai. Semantics-Preserving Bag-of-Words Models and
Applications. (2010). IEEE Transactions on Image Processing. 19, (7), 1908-1920.
Available at: https://ink.library.smu.edu.sg/sis_research/2309
This Journal Article is brought to you for free and open access by the School of Computing and Information
Systems at Institutional Knowledge at Singapore Management University. It has been accepted for inclusion in
Research Collection School Of Computing and Information Systems by an authorized administrator of Institutional
Knowledge at Singapore Management University. For more information, please email [email protected].
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 1, NO. 1, 2010 1
Abstract—The Bag-of-Words (BoW) model is a promising image representation technique for image categorization and annotation tasks. One critical limitation of existing BoW models is that much semantic information is lost during the codebook generation process, an important step of BoW. This is because the codebook is often obtained simply by clustering visual features in Euclidean space. However, visual features related to the same semantics may not be distributed in clusters in Euclidean space, which is primarily due to the semantic gap between low-level features and high-level semantics. In this paper, we propose a novel scheme to learn optimized BoW models, which aims to map semantically related features to the same visual words. In particular, we consider the distance between semantically identical features as a measurement of the semantic gap, and attempt to learn an optimized codebook by minimizing this gap, aiming to achieve the minimal loss of the semantics. We refer to this novel codebook as the Semantics-Preserving Codebook (SPC) and the corresponding model as the Semantics-Preserving Bag-of-Words (SPBoW) model. Extensive experiments on image annotation and object detection tasks with public testbeds from MIT’s Labelme and PASCAL VOC challenge databases show that the proposed SPC learning scheme is effective for optimizing the codebook generation process, and the SPBoW model is able to greatly enhance the performance of the existing BoW model.
Index Terms—bag-of-words models, object representation, semantic gap, distance metric learning, image annotation
Mr. Lei Wu is from MOE-MS Key Lab of MCC, Dept of EEIS, University of Science and Technology of China, Hefei China, 230026. This work was performed when Mr. Lei Wu was a research assistant at Nanyang Technological University.
Corresponding Author: Dr. Steven C.H. Hoi, School of Computer Engineering, Nanyang Technological University, Singapore 639798, E-mail: [email protected]
I. INTRODUCTION
With the advance of cameras and Web 2.0 technology, there has been a proliferation of digital photos on the Web. Massive numbers of photos that are unlabeled or carry few tags pose a great challenge for image retrieval tasks. Automatic image annotation is one promising solution to address this challenge. Generally, automatic image annotation is the process of employing computer programs to automatically assign an unlabeled image a set of keywords or tags, each of which represents a certain semantic object/concept. By automatic image annotation, an image retrieval problem is turned into a text retrieval task, which can be effectively resolved by taking advantage of mature text indexing and retrieval techniques.
In the past decade, numerous studies have focused on automatic image annotation [5], [8], [11], [20], [22]. Some earlier studies extract global visual features, such as color and texture, from whole images to represent them as data points in vector space. As a result, image annotation is formulated as a supervised classification problem where data are given in some vector space [5]. Such an approach enjoys the merits of efficient computation and compact storage, but often works effectively only for annotating scene images or single-object images. It usually performs poorly on generic images that contain multiple objects.
Later, besides extensive studies on global features, more promising studies have focused on regional features. One typical approach is to partition an image into multiple regions/blobs based on image segmentation and clustering techniques. As a result, image annotation is turned into a machine translation task of classifying regions/blobs into keywords [8]. Along this direction, a variety of statistical learning techniques, such as relevance models [20], [22], have been applied to model the relationships of words and regions/blobs. The performance of these approaches is often sensitive to the quality of image segmentation, which is still an open research challenge in image processing.
Recently, thanks to the advances of powerful local feature descriptors, such as SIFT [27], researchers in computer vision have attempted to resolve object recognition/image annotation problems by a new approach, known as the “Bag-of-Words” (BoW) model, which was derived from natural language processing. Specifically, given an image, BoW first employs some interest point detector, e.g. the DoG (Difference of Gaussians) detector, to detect salient patches/regions in the image. Next, a feature descriptor, e.g. SIFT, is applied to represent the local patches/regions as numerical feature vectors. The last step of BoW is to generate a codebook by converting the patches to “codewords”, e.g. applying the k-means algorithm to cluster all the feature vectors into k clusters, and then defining codewords based on the centers of the k resulting clusters. By mapping each visual feature in the image to the codewords, the image is represented by the histogram of the codewords. Based on the BoW representation, some well-known topic models, e.g. probabilistic latent semantic analysis (pLSA) [17], can be applied to analyze the topics of the images [6]. While sacrificing spatial information, BoW has generally shown promising performance for object categorization [34] and image annotation tasks [11].
However, BoW still has several important drawbacks. Other than the neglect of spatial information, which has been widely discussed in many recent papers [4], [26], [37], [24], [42], another critical disadvantage is that the semantics of objects are considerably lost during the processes of sub-region detection and visual word generation. Firstly, the detection and segmentation of sub-regions damage the semantic integrity.
Several methods have been proposed to locate the sub-regions in an image, e.g. regular grids [42], interest point detectors [36], [27], random sampling [28], sliding windows [23], and other segmentation methods [32], [16]. However, due to the lack of human knowledge, these methods cannot locate the semantically intact regions very accurately, which partially causes the semantic gap problem. Secondly, generating the visual words by k-means clustering in Euclidean space is problematic, because it implicitly assumes that SIFT features of similar semantics are distributed in the same cluster in Euclidean space. This however does not always hold, especially for high-dimensional SIFT features. Unlike the completely unsupervised clustering by k-means in visual word generation, we believe that a semi-supervised clustering approach with the aid of side information could lead to a more effective codebook for object representation.
To this end, this paper proposes a novel Semantics-Preserving Bag-of-Words (SPBoW) model, which considers the distance between the semantically identical features as a measurement of the semantic gap, and tries to learn a codebook by minimizing this semantic gap. We formulate the codebook generation task as a distance metric learning problem, which can be formalized as semi-definite programming (SDP). We then propose an eigen projection algorithm to solve the optimization problem efficiently. With the integrated knowledge and side information, the semantic gap can be minimized and the codebook is able to consistently represent the semantics of the objects. To the best of our knowledge, this is the first distance metric learning approach to overcome the limitation of semantic loss in BoW models.
In summary, the main contributions of this paper include: (1) we are the first to propose a measurement of the semantic gap; (2) we propose to bridge the semantic gap via a distance metric learning method; (3) we propose and implement an efficient algorithm to solve the codebook learning task; (4) we suggest a novel object based codebook scheme; (5) we propose a measurement for visual complexity; (6) the proposed method can automatically decide the size of the codebook for each category; (7) we evaluate and compare a number of different methods for the codebook generation process in building various bag-of-words models towards object annotation tasks.
The rest of the paper is organized as follows. Section II reviews related work. Section III presents the framework of the SPBoW model. Section III-B gives the details of the object representations for this novel model. Section IV elaborates on the codebook learning task and formulates the task as an optimization problem. Section V applies the proposed semantics-preserving codebook (SPC) technique on object annotation tasks. Section VI compares the SPBoW model and the metric learning algorithm with several state-of-the-art methods for object annotation experiments on MIT’s Labelme testbed [30] and object categorization on PASCAL VOC challenge testbed [9]. Section VII concludes the paper.
II. RELATED WORK
Our work is related to several research topics, including image annotation/object recognition, and distance metric learning. Below we briefly review the related work of both categories respectively.
A. Image Annotation and Object Recognition
In the literature, numerous studies have been devoted to image annotation and object recognition. They can be roughly grouped into three major categories. The first category is based on global features [13]. As a result, regular supervised classification techniques, such as SVM, can be applied to solve the categorization and annotation tasks.
The second category is to extract regional features such that an image can be represented by a set of visual regions/blobs [2], [8], [22]. The image annotation task is thus converted to a problem of learning keywords/tags from visual regions/blobs. For instance, Barnard et al. [2] treated image annotation as a machine translation problem. Jeon et al. [20] proposed the cross-media relevance model (CMRM), which combines both surrounding texts and image contents for annotation. Jin et al. [22] studied coherent language models that take into account the word-to-word correlation.
The last category focuses on applying bag-of-features or bag-of-words representations for image annotation/object recognition [6], [21], [35]. Csurka et al. [6] proposed a bag-of-keypoints approach similar to BoW in text categorization for visual object categorization. Jiang et al. [21] studied some practical techniques to improve the performance of bag-of-features for object recognition and retrieval. Recently, Wu et al. [42] proposed a language modeling approach to address one limitation of the BoW models, i.e., the loss of spatial information. These methods generate the codebook by clustering visual features in the original feature space. Due to the semantic gap, each visual word may contain multiple semantic meanings and the same semantic meaning may be represented by multiple visual words. In these models, a visual word does not actually correspond to a precise semantic meaning.
Besides, there are also some emerging paradigms for image annotation, such as search-based annotation [38] that explores WWW images in helping the annotation tasks, and the ALIPR paradigm [25], which used advanced statistical learning techniques to provide fully automatic and real-time annotation for digital pictures. These techniques are not highly relevant to our focus, and are thus beyond the scope of this paper.
B. Distance Metric Learning
From a machine learning point of view, our work is related to supervised distance metric learning (DML). Specifically, given a set of n data examples $X = \{x_i \in \mathbb{R}^d\}_{i=1}^{n}$ in d-dimensional vector space, the objective of DML is to find an optimal Mahalanobis metric M from training data with side information that can be either class labels or general pairwise constraints [43].
In the literature, DML has been actively studied recently. Existing DML studies can be roughly grouped into two major categories. One category is to learn metrics with class labels, such as Neighbourhood Components Analysis (NCA) [14], which
we adopt some images from MIT’s Labelme testbed [30] as training data, in which objects are well segmented and labeled by users. By the proposed SPBoW framework, we first apply SIFT to extract features from each image. The SIFT features that are located at the regions of the same semantics (label) in all the images are collected to represent the semantics. In order to preserve the semantics in the BoW model, all the collected features related to the same semantics are clustered into one or several discriminative visual words for representing the object based on an optimized distance metric that aims to minimize the overall semantic loss. The visual words used for representing an object may describe different semantic parts or different views of the object. Finally, we note that the set of visual words used for one object is often different from the set used for another. This is very different from the regular BoW model where all objects share the same set of visual words. Next we present a novel learning technique that aims to find an optimal distance metric to overcome the limitation of semantic loss during the codebook generation process.
each image are separated from the background by a bounding box. Thus, two features that lie in the same bounding box, or in bounding boxes with the same label, are treated as having the same semantic meaning.
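As a minimal sketch of how side information of this kind might be derived from user-labeled bounding boxes, the following Python snippet turns labeled boxes into pairwise constraints over keypoint locations. The function names, the (x_min, y_min, x_max, y_max) box layout, and the choice to mark any pair involving a background keypoint as unknown (0) are illustrative assumptions and are not taken from the authors' implementation.

```python
from itertools import combinations


def box_contains(box, point):
    """Return True if point (x, y) lies inside box (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max


def label_of(point, boxes):
    """Label of the first annotated box containing the point, or None for background."""
    for label, box in boxes:
        if box_contains(box, point):
            return label
    return None


def pairwise_constraints(keypoints, boxes):
    """Build (i, j, y) triples: y = +1 same label, -1 different labels, 0 unknown."""
    labels = [label_of(p, boxes) for p in keypoints]
    constraints = []
    for i, j in combinations(range(len(keypoints)), 2):
        if labels[i] is None or labels[j] is None:
            y = 0  # assumption: a background keypoint yields no usable constraint
        elif labels[i] == labels[j]:
            y = 1
        else:
            y = -1
        constraints.append((i, j, y))
    return constraints


if __name__ == "__main__":
    # Hypothetical annotations: two labeled boxes and four keypoint locations.
    boxes = [("car", (0, 0, 3, 3)), ("dog", (7, 7, 9, 9))]
    keypoints = [(1, 1), (2, 2), (8, 8), (20, 20)]
    for triple in pairwise_constraints(keypoints, boxes):
        print(triple)
```

Pairs labeled +1 or -1 could then serve as side information for metric learning, while pairs labeled 0 carry no constraint.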
semantically identical features closer and thus more likely to be assigned to the same visual word. The second term of the objective function is the regularization term, which prevents overfitting by minimizing the complexity of the model. The second equality constraint is introduced to prevent the trivial solution of shrinking metric A into a zero matrix, and λ is a constant parameter. By solving the optimization problem, we can obtain the optimized distance metric A and the threshold variable b that can be used to determine whether two features are similar or dissimilar. In general, the above optimization problem is a semi-definite program (SDP), which is often difficult to solve to global optimality for large applications.
B. Optimization
In this section, we present a stochastic gradient search algorithm combined with an active constraint selection scheme to efficiently solve the above optimization problem. To simplify the formulation, we denote the feature matrix as $X \in \mathbb{R}^{N_{tr} \times d}$, where $N_{tr}$ is the number of SIFT features in the training set, and d is the feature dimension. We also represent all the feature pairs $(x_{i1}, x_{i2})$ in the training data by two feature matrices $X_1 = [x_{11}, x_{21}, \cdots, x_{n1}]^\top$ and $X_2 = [x_{12}, x_{22}, \cdots, x_{n2}]^\top$, and similarly their constraints by three matrices $Z_1 = \mathrm{diag}(z_{11}, z_{21}, \cdots, z_{n1})$, $Z_2 = \mathrm{diag}(z_{12}, z_{22}, \cdots, z_{n2})$ and $Y = \mathrm{diag}(y_1, \cdots, y_n)$. The proposed iterative optimization scheme is described in the following steps.
First of all, we actively choose a subset of informative side information from the training data as the training instances. In particular, the training instances must satisfy either one of the two criteria: (1) the features are of the same semantics but with large distance in the current metric space; or (2) the features are of different semantics but with small distance in the current metric space.
Based on the selected training dataset $S_t$ in the t-th iteration, we then apply the gradient descent technique to search for the optimal metric A and threshold b.
Finally, to enforce the valid metric constraint, we project the current solution of metric A back to the positive semidefinite (PSD) cone by an eigen decomposition approach. The details of the proposed Semantics-Preserving Metric Learning (SPML) algorithm are described in Algorithm 1, in which γ is a learning rate variable that is determined empirically. $D_X = X_1 - X_2$ is the difference between the two feature matrices $X_1$ and $X_2$. Empirically, this iterative algorithm converges quickly with no more than 5 iterations.
Algorithm 1 The Semantics-Preserving Metric Learning (SPML) algorithm
INPUT:
• SIFT feature matrix: $X \in \mathbb{R}^{N \times d}$
• pairwise constraints $(x_{i1}, x_{i2}, z_{i1}, z_{i2}, y_i)$, where $x_{i1}$ is the $i_1$-th SIFT feature, $z_{i1}$ indicates whether the location of feature $x_{i1}$ is on the semantic object, and the constraint $y_i \in \{+1, 0, -1\}$ indicates that features $x_{i1}$ and $x_{i2}$ are on the same semantic part of the object, not known, or on different semantic parts.
• regularization parameter λ
• learning rate parameter γ
PROCEDURE:
initialize metric and threshold: $A = I$, $b = b_0$
set iteration step $t = 1$
repeat
    (1) update the learning rate:
        $\gamma = \gamma / t$, $t = t + 1$
    (2) update the subset of training instances:
        $S_t^{+} = \{(x_{i1}, x_{i2}, y_i) \mid (1 + y_i)\|x_{i1} - x_{i2}\|_A^2 > 1\}$
        $S_t^{-} = \{(x_{i1}, x_{i2}, y_i) \mid (1 - y_i)\|x_{i1} - x_{i2}\|_A^2 < 1\}$
        $S_t = S_t^{+} \cup S_t^{-}$
    (3) compute the gradient w.r.t. A:
        $\nabla_A L \leftarrow Z_1 Z_2 (\lambda A + D_X^\top Y D_X)$, $D_X = X_1 - X_2$
    (4) compute the gradient w.r.t. b:
        $\nabla_b L \leftarrow \mathrm{tr}(Z_1 Z_2 Y)$
    (5) update metric and threshold:
        $A_{t+1} \leftarrow A_t - \gamma_t \nabla_A L$, $b_{t+1} \leftarrow b_t - \gamma_t \nabla_b L$
    (6) project A back to the PSD cone:
        $A_{t+1} = \sum_{i=1}^{d} \lambda_i \phi_i \phi_i^\top$
        $A_{t+1} \leftarrow \sum_{i} \max(0, \lambda_i) \phi_i \phi_i^\top$
    (7) normalize $A_{t+1}$ to satisfy $\|A_{t+1}\| = \frac{1}{\sqrt{\lambda}}$:
        $A_{t+1} \leftarrow \frac{1/\sqrt{\lambda}}{\|A_{t+1}\|} A_{t+1}$
until convergence
OUTPUT:
• feature metric A, threshold variable b
C. Convergence Analysis
We now analyze the convergence of the algorithm. Let us denote the objective function in the t-th iteration as follows:
$$L(A_t, S_t) = \sum_{(x_{i1}, x_{i2}, y_i) \in S_t} y_i \left( \|x_{i1} - x_{i2}\| - b_t \right) + \frac{\lambda}{2} \mathrm{tr}(A_t A_t^\top) \tag{5}$$
To prove the convergence of the algorithm, we first calculate the bound of the objective function after T iterations. Here we adopt the following theorem proposed in [15], which provides a bound for a general sub-gradient method. The detailed proofs and explanations can be found in [31].
Theorem 1: Let $L_1, \cdots, L_T$ be a sequence of λ-strongly convex functions w.r.t. the objective function $\frac{1}{2}\mathrm{tr}(\cdot)$, where $L_t = L(A, S_t)$. Let $\mathcal{A}$ be a closed convex set and define $\Pi_{\mathcal{A}}(A) = \arg\min_{A' \in \mathcal{A}} \|A - A'\|$. Let $A_1, \cdots, A_{T+1}$ be a sequence of vectors such that $A_1 \in \mathcal{A}$ and for $t \ge 1$, $A_{t+1} = \Pi_{\mathcal{A}}(A_t - \gamma_t \nabla_t)$, where $\nabla_t$ is a subgradient of $L_t$ at $A_t$. Assume that for all t, $\|\nabla_t\| \le G$. Then, for all $u \in \mathcal{A}$ we have
$$\frac{1}{T}\sum_{t=1}^{T} L_t(A_t) \le \frac{1}{T}\sum_{t=1}^{T} L_t(u) + \frac{G^2 (1 + \ln(T))}{2\lambda T} \tag{6}$$
By applying Theorem 1, we can prove the following corollary.
Corollary 2: Let $L_1, \cdots, L_T$ be a sequence of λ-strongly convex functions. Let $\mathcal{A}$ be a closed convex set and define $\Pi_{\mathcal{A}}(A) = \arg\min_{A' \in \mathcal{A}} \|A - A'\|$. Let $A_1, \cdots, A_{T+1}$ be a sequence of matrices such that $A_1 \in \mathcal{A}$ and for $t \ge 1$, $A_{t+1} = \Pi_{\mathcal{A}}(A_t - \gamma_t \nabla_t)$, where $\nabla_t$ is a subgradient of $L_t$ at $A_t$. Then, the bound for the proposed objective function is
$$\frac{1}{T}\sum_{t=1}^{T} L_t(A_t) \le \frac{1}{T}\sum_{t=1}^{T} L_t(A^*) + \frac{\left(\sqrt{\lambda} + \sum_i \xi_i\right)^2 (1 + \ln(T))}{2\lambda T}$$
where $A^*$ is the optimal solution.
Using Corollary 2 and the convexity property of the objective function L, i.e., $L\left(\frac{1}{T}\sum_{t=1}^{T} A_t\right) \le \frac{1}{T}\sum_{t=1}^{T} L_t(A_t)$, we can further show the following corollary of the optimization bound.
Corollary 3: Let $L(A_t, S_t) = L_t(A_t)$, and $L(A_t) = E_{S_t} L(A_t, S_t)$. Assume the conditions stated in Corollary 2 and denote $\bar{A} = \frac{1}{T}\sum_{t=1}^{T} A_t$ and $G = \sqrt{\lambda} + \sum_i \xi_i$; then we have the following result:
$$L(\bar{A}) \le L(A^*) + \frac{G^2 (1 + \ln(T))}{\lambda T}$$
particular, consider each object category as a bag of features, each feature in the object category has a probability of being generated from the bag. Such a probability can be estimated by either the distance to the mean of all features or the frequency of the features. For example, let us denote by $C_i$ an object category and $x_j$ some feature, we can estimate the generative probability $p(x_j | C_i)$ as follows:
$$p(x_j | C_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\|x_j - \hat{x}\|_A^2}{2\sigma^2}\right) \tag{7}$$
where $\hat{x} = \frac{1}{n_{C_i}} \sum_{x_j \in C_i} x_j$, and $n_{C_i}$ is the total number of
$\max_i L_{C_i}$, $c_{ij}$ denotes the center of the j-th cluster and $r_{ij}$ denotes the range radius of the cluster, which is defined as the largest distance from the features to the cluster center.
To reduce noisy clusters, we further sort the K clusters according to their sizes $S_{ij}$ that are calculated below:
$$S_{ij} = \sum_{x} \delta\left(\|x - c_{ij}\|_A, r_{ij}\right), \quad \text{where } \delta(a, b) = \begin{cases} 1, & a \le b; \\ 0, & \text{otherwise.} \end{cases}$$
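The cluster size $S_{ij}$ defined above can be illustrated with a short Python sketch. It assumes that the sum runs over all features of the category, that the metric A is given as a positive semidefinite matrix stored as nested lists, and that each cluster is a (center, radius) pair; the function names are hypothetical and the code is only an illustration of the formula, not the authors' implementation.

```python
import math


def mahalanobis(x, c, A):
    """Distance ||x - c||_A = sqrt((x - c)^T A (x - c)) for a PSD matrix A."""
    diff = [xi - ci for xi, ci in zip(x, c)]
    dim = len(diff)
    quad = sum(diff[r] * A[r][k] * diff[k] for r in range(dim) for k in range(dim))
    return math.sqrt(max(quad, 0.0))  # guard against tiny negative rounding errors


def cluster_size(features, center, radius, A):
    """S_ij: number of features x with ||x - c_ij||_A <= r_ij (the delta function)."""
    return sum(1 for x in features if mahalanobis(x, center, A) <= radius)


def rank_clusters(features, clusters, A):
    """Return (size, center, radius) triples sorted by decreasing cluster size."""
    sized = [(cluster_size(features, c, r, A), c, r) for c, r in clusters]
    return sorted(sized, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]  # with A = I the metric is Euclidean
    features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
    clusters = [((5.0, 5.0), 0.5), ((0.0, 0.0), 1.0)]
    for size, center, radius in rank_clusters(features, clusters, identity):
        print(size, center, radius)
```

Small clusters end up at the tail of the ranking, which is where noisy clusters would be expected under this reading of the formula.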
Similar to existing BoW models, the proposed SPBoW can also be easily adopted for existing classification methods, including both generative and discriminative models. Fig. 3 illustrates the idea of applying SPBoW for annotating a novel image in an object annotation task. Below we discuss two representative methods for applying SPBoW in image annotation and object categorization applications.
A. Generative Model
Based on the SPBoW technique, we now discuss how to apply the resulting semantics-preserving codebook for building generative models in an object annotation task. Assume that
where $f_I(k)$ is the frequency of visual word $w_k$ appearing in the test image I, and prior $p(C_i)$ can be calculated based on the normalized frequencies of the object category that appears in the training data. Finally, we rank the object categories by their likelihood $p(C_i | I)$, $C_i = 1, \cdots, K$, and top N ($N = 1, \cdots, 10$) ranked categories are used to annotate the image.
B. Discriminative Models
The learned SPC can also be used in a discriminative learning setting. To illustrate this property, we apply the codebook to train SVM models for classifying the visual objects. Similarly, we are given a set of training images (or image
Fig. 4. Performance comparison of different approaches for image annotation: (a) Average Precision; (b) Average Recall.
Fig. 5. Evaluation of constraint sizes on the image annotation performance: (a) AP@N of SPBoW method under different constraints; (b) AR@N of SPBoW method under different constraints.
Categories  BOW    AP06-Lee  QMUL-LSPCH  XRCE   RCA    ITML   LMNN   NCA    SPBoW
bicycle     56.91  79.10     94.80       94.30  93.45  96.98  94.12  95.34  99.89
bus         56.61  63.70     98.10       97.80  97.57  98.17  97.79  95.98  97.15
car         60.31  83.30     97.50       96.70  94.42  93.17  93.13  93.13  94.54
cat         61.08  73.30     93.70       93.30  92.19  94.15  92.97  93.32  93.33
cow         68.53  75.60     93.80       94.00  93.91  92.18  92.77  92.75  94.18
dog         73.22  64.40     87.60       86.60  87.77  92.11  90.06  89.97  94.42
horse       28.83  60.70     92.60       92.50  93.22  95.58  96.18  93.85  95.18
motorbike   36.01  67.20     96.90       95.70  92.19  94.37  94.75  94.19  96.97
person      60.78  55.00     85.50       86.30  92.18  93.33  94.18  91.31  92.68
sheep       60.74  79.20     95.60       95.10  97.19  97.15  92.39  95.67  97.44
Average     56.30  70.15     93.61       93.23  93.41  94.72  93.83  93.55  95.58
TABLE I
AUC RESULTS ON THE VOC2006 DATASET.
of the data are used to generate the codebook and 1 fold is used for object annotation, we thus focus on measuring the computational time on codebook generation by the methods. We omit the results of the annotation time cost since they are almost the same for all the compared methods.
Method         BoW  RCA  ITML  LMNN  NCA  SPBoW
Time Cost (s)  121  3    96    1759  457  8
TABLE II
TIME EVALUATION OF CODEBOOK GENERATION BY DIFFERENT METHODS.
Table II shows the average computational time for generating the codebook. It consists of the time costs of both metric learning and k-means clustering. 50,000 random features are used to generate 2,500 visual words by the k-means algorithm. There are two kinds of codebook generation schemes: the global codebook and the object codebook. The global codebook scheme uses k-means to cluster the 50,000 features into 2,500 clusters. The object codebook scheme generates clusters within each category and then combines them into a codebook of size 2,500. So for each category, we only need to generate around $2{,}500/|C|$ clusters from around $50{,}000/|C|$ features, where $|C| = 495$ is the number of categories. The global codebook scheme needs to compute distances $2{,}500 \times 50{,}000 \sim O(10^8)$ times per iteration, but the proposed object codebook scheme only needs $2{,}500/|C| \times 50{,}000/|C| \times |C| \sim O(10^5)$ times per iteration. BoW adopts the global codebook, while the other methods employ the object codebook.
From the results, we found that BoW takes even more time than some of the SPBoW models due to the limitation of the global codebook, even though it does not incur any cost for metric learning. This again shows that the object codebook is not only more effective, but also more efficient than the regular BoW method. Finally, by comparing the time cost between different DML techniques, we can see that our algorithm is comparable to the simple RCA method, and is significantly more efficient than the other state-of-the-art metric learning techniques that are usually computationally intensive.
VII. CONCLUSION
This paper proposed a novel framework of Semantics-Preserving Bag-of-Words (SPBoW) for object representation. Unlike conventional Bag-of-Words (BoW) models that usually suffer from the semantic loss in the codebook generation process, our new technique overcomes this drawback by learning an effective distance metric that aims to bridge the semantic gap between low-level features and high-level semantics. We propose a novel measurement of semantic gap and then try to minimize the gap via distance metric learning. In addition to the new efficient algorithm for solving the challenging distance metric learning task, we also propose the object based codebook generation scheme, which not only improves the efficacy, but also significantly reduces the computational cost. Extensive experiments have been done on both image annotation and object categorization applications, in which encouraging results show that the new SPBoW technique is effective and promising for object representation in a large range of multimedia applications. Future work will study more advanced approaches of improving the estimation of distribution for the measurement of visual complexity, and investigate other distance metric learning techniques for improving the performance.
ACKNOWLEDGEMENTS
The work was supported by Singapore NRF Interactive Digital Media R&D Program, under research grant NRF2008IDM-IDM004-006, MOE tier-1 Research Grant (RG67/07), and the National High Technology Research and Development Program of China (863) (No. 2008AA01Z117).
REFERENCES
[1] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning distance functions using equivalence relations. In Proceedings of the Twentieth International Conference on Machine Learning, pages 11–18, 2003.
[2] K. Barnard, P. Duygulu, D. Forsyth, N. D. Freitas, D. M. Blei, J. K, T. Hofmann, T. Poggio, and J. Shawe-taylor. Matching words and pictures. Journal of Machine Learning Research, 3:1107–1135, 2003.
[3] D. M. Blei, A. Y. Ng, M. I. Jordan, and J. Lafferty. Latent dirichlet allocation. JMLR, 3:993–1022, 2003.
[4] L. Cao and L. Fei-Fei. Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes. In IEEE International Conference on Computer Vision, pages 1–8, 2007.
[5] G. Carneiro and N. Vasconcelos. Formulating semantic image annotation as a supervised learning problem. In IEEE CVPR, pages 163–168, 2005.
[6] C. Dance, J. Willamowski, L. Fan, C. Bray, and G. Csurka. Visual categorization with bags of keypoints. In ECCV International Workshop on Statistical Learning in Computer Vision, 2004.
[7] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML’07, pages 209–216, 2007.
[8] P. Duygulu, K. Barnard, J. de Freitas, and D. A. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In ECCV, pages 97–112, 2002.
[9] M. Everingham, C. W. A Zisserman, and L. V. Gool. The 2006 pascal visual object classes challenge. In Workshop in ECCV’06, 2006.
[10] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf.
[11] J. Fan, Y. Gao, and H. Luo. Multi-level annotation of natural scenes using dominant image components and semantic concepts. In ACM Multimedia, pages 540–547, 2004.
[12] A. Globerson and S. Roweis. Metric learning by collapsing classes. In NIPS’05, 2005.
[13] K.-S. Goh, B. Li, and E. Chang. Using one-class and two-class svms for multiclass image annotation. IEEE Trans. on Knowl. and Data Eng., 17(10):1333–1346, 2005.
[14] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighborhood component analysis. In NIPS, 2004.
[15] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Mach. Learn., 69(2-3):169–192, 2007.
[16] V. Hedau, H. Arora, and N. Ahuja. Matching images under unstable segmentations. In IEEE CVPR, pages 1–8, 2008.
[17] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR’99, pages 50–57, Berkeley, CA, 1999.
[18] S. C. H. Hoi, W. Liu, and S.-F. Chang. Semi-supervised distance metric learning for collaborative image retrieval. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR2008), June 2008.
[19] S. C. H. Hoi, W. Liu, M. R. Lyu, and W.-Y. Ma. Learning distance metrics with contextual constraints for image retrieval. In Proc. CVPR2006, New York, US, June 17–22 2006.
[20] J. Jeon, V. Lavrenko, and R. Manmatha. Automatic image annotation and retrieval using cross-media relevance models. In Proc. 26th ACM SIGIR Conference, pages 119–126, 2003.
[21] Y.-G. Jiang, C.-W. Ngo, and J. Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In Proc. 6th ACM Int. Conf. on Image and video retrieval, pages 494–501, Amsterdam, The Netherlands, 2007.
[22] R. Jin, J. Y. Chai, and L. Si. Effective automatic image annotation via a coherent language model and active learning. In Proc. 12th ACM International Conference on Multimedia, pages 892–899, New York, NY, USA, 2004.
[23] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: object localization by efficient subwindow search. In CVPR, 2008.
[24] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. volume 2, pages 2169–2178, 2006.
[35] P. Tirilly, V. Claveau, and P. Gros. Language modeling for bag-of-visual words image categorization. In Proc. ACM Int. Conf. on Content-based image and video retrieval, pages 249–258, Niagara Falls, Canada, 2008.
[36] E. Tola, V. Lepetit, and P. Fua. A fast local descriptor for dense matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR2008), pages 1–8, 2008.
[37] J. Z. Wang, J. Li, and G. Wiederhold. SIMPLIcity: semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:947–963, 2001.
[38] X.-J. Wang, L. Zhang, F. Jing, and W.-Y. Ma. Annosearch: Image auto-annotation by search. In CVPR’06, pages 1483–1490, 2006.
[39] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. Advances in Neural Information Processing Systems, 18:1473–1480, 2006.
[40] L. Wu, S. C. H. Hoi, R. Jin, J. Zhu, and N. Yu. Distance metric learning from uncertain side information with application to automated photo tagging. In Proceedings of the seventeen ACM international conference on Multimedia (MM’09), pages 135–144, Beijing, China, 2009.
[41] L. Wu, Y. Hu, M. Li, N. Yu, and X.-S. Hua. Scale-invariant visual language modeling for object categorization. Multimedia, IEEE Transactions on, 11(2):286–294, Feb. 2009.
[42] L. Wu, M. Li, Z. Li, W.-Y. Ma, and N. Yu. Visual language modeling for image classification. In Proc. Int. workshop on multimedia information retrieval (MIR’07), pages 115–124, Augsburg, Bavaria, Germany, 2007.
[43] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In NIPS2002, 2002.
[44] L. Yang, R. Jin, R. Sukthankar, and Y. Liu. An efficient algorithm for local distance metric learning. In AAAI, 2006.
Lei Wu received the Bachelor degree in Special Class for Gifted Young (SCGY) in 2005 from University of Science and Technology of China (USTC), from which he is now pursuing his Ph.D degree in Electronic Engineering and Information Science. From 2006 to 2008, he has been a research intern at Microsoft Research Asia working on image annotation and tagging. From 2008 to 2009, he was visiting Nanyang Technological University working on distance metric learning and multimedia retrieval. His research interests include machine learning, multimedia retrieval, and computer vision. He received Microsoft Fellowship in 2007.
Steven C.H. Hoi is currently an Assistant Professor in the School of Computer Engineering of Nanyang
[25] J. Li and J. Z. Wang. Real-time computerized annotation of pictures. Technological University, Singapore. He received his
In Proceedings of the 14th annual ACM international conference on Bachelor degree in Computer Science from Tsinghua
Multimedia, pages 911–920, Santa Barbara, CA, USA, 2006. University, Beijing, P.R. China, and his Master and
[26] J. Li, W. Wu, T. Wang, and Y. Zhang. One step beyond histograms: Ph.D degrees in Computer Science and Engineering
Image representation using markov stationary features. In IEEE CVPR from Chinese University of Hong Kong. His research
Conference, pages 1–8, 2008. interests include statistical machine learning, mul-
[27] D. G. Lowe. Distinctive image features from scale-invariant keypoints. timedia information retrieval, Web search and data
IJCV, 60:91–110, 2004. mining. He is a member of IEEE and ACM.
[28] R. Maree, P. Geurts, J. Piater, and L. Wehenkel. Random subwindows for
robust image classification. In Proc. IEEE Conf. on Computer Vision and
Nenghai Yu is currently a Professor in the De-
Pattern Recognition (CVPR’05), pages 34–40, Washington, DC, USA,
partment of Electronic Engineering and Information
2005.
Science of University of Science and Technology
[29] F. Perronnin, C. Dance, G. Csurka, and M. Bressan. Adapted vocab- of China (USTC). He is the Executive Director
ularies for generic visual categorization. In In ECCV, pages 464–475, of MOE-Microsoft Key Laboratory of Multimedia
2006. Computing and Communication, and the Director of
[30] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. Labelme: Information Processing Center at USTC. He grad-
A database and web-based tool for image annotation. Int. J. Comput. uated from Tsinghua University,Beijing, China, and
Vision, 77(1-3):157–173, 2008. obtained his M.Sc. Degree in Electronic Engineering
[31] S. Shalev-Shwartz and Y. Singer. Logarithmic regret algorithms for in 1992, and then he joined in USTC and worked
strongly convex repeated games (technical report). The Hebrew Univer- there until now. He received his Ph.D. Degree in
sity., 2007. Information and Communications Engineering from USTC, Hefei, China,
[32] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for in 2004. His research interests are in the field of multimedia information
image categorization and segmentation. In Computer Vision and Pattern retrieval, digital media analysis and representation, media authentication,
Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, 2008. video surveillance and communications etc. He has been responsible for many
[33] L. Si, R. Jin, S. C. H. Hoi, and M. R. Lyu. Collaborative image national research projects. Based on his contribution, Professor Yu and his
retrieval via regularized metric learning. ACM Multimedia Systems research group won the Excellent Person Award and Excellent Collectivity
Journal (MMSJ), 12(1):34–44, 2006. Award simultaneously from the National Hi-tech Development Project of
[34] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. China in 2004. He has contributed more than 80 papers to journals and
Discovering object categories in image collections. In Proceedings of international conferences.
the IEEE International Conference on Computer Vision (ICCV), 2005.