
Singapore Management University

Institutional Knowledge at Singapore Management University

Research Collection School Of Computing and Information Systems
School of Computing and Information Systems

7-2010

Semantics-Preserving Bag-of-Words Models and Applications


Lei WU
University of Science and Technology of China

Steven C. H. HOI
Singapore Management University, [email protected]

Nenghai YU
University of Science and Technology of China

Follow this and additional works at: https://ink.library.smu.edu.sg/sis_research

Part of the Databases and Information Systems Commons

Citation
WU, Lei; HOI, Steven C. H.; and YU, Nenghai. Semantics-Preserving Bag-of-Words Models and
Applications. (2010). IEEE Transactions on Image Processing. 19, (7), 1908-1920.
Available at: https://ink.library.smu.edu.sg/sis_research/2309

This Journal Article is brought to you for free and open access by the School of Computing and Information
Systems at Institutional Knowledge at Singapore Management University. It has been accepted for inclusion in
Research Collection School Of Computing and Information Systems by an authorized administrator of Institutional
Knowledge at Singapore Management University. For more information, please email [email protected].
Semantics-Preserving Bag-of-Words Models and Applications

Lei Wu, Steven C.H. Hoi, and Nenghai Yu

Lei Wu is with the MOE-MS Key Lab of MCC, Dept. of EEIS, University of Science and Technology of China, Hefei, China, 230026. This work was performed while he was a research assistant at Nanyang Technological University. Corresponding author: Dr. Steven C.H. Hoi, School of Computer Engineering, Nanyang Technological University, Singapore 639798; e-mail: [email protected].

Abstract—The Bag-of-Words (BoW) model is a promising image representation technique for image categorization and annotation tasks. One critical limitation of existing BoW models is that much semantic information is lost during the codebook generation process, an important step of BoW. This is because the codebook is often obtained by simply clustering visual features in Euclidean space. However, visual features related to the same semantics may not be distributed in clusters in the Euclidean space, primarily due to the semantic gap between low-level features and high-level semantics. In this paper, we propose a novel scheme to learn optimized BoW models, which aims to map semantically related features to the same visual words. In particular, we consider the distance between semantically identical features as a measurement of the semantic gap, and attempt to learn an optimized codebook by minimizing this gap, aiming to achieve the minimal loss of semantics. We refer to such a codebook as a Semantics-Preserving Codebook (SPC) and the corresponding model as the Semantics-Preserving Bag-of-Words (SPBoW) model. Extensive experiments on image annotation and object detection tasks with public testbeds from MIT's Labelme and the PASCAL VOC challenge databases show that the proposed SPC learning scheme is effective for optimizing the codebook generation process, and that the SPBoW model greatly enhances the performance of the existing BoW model.

Index Terms—bag-of-words models, object representation, semantic gap, distance metric learning, image annotation
I. INTRODUCTION

With the advance of cameras and Web 2.0 technology, there has been a proliferation of digital photos on the Web. Massive numbers of photos that are unlabeled or carry only a few tags have posed a great challenge for image retrieval tasks. Automatic image annotation is one promising solution to address this challenge. Generally, automatic image annotation is the process of employing computer programs to automatically assign an unlabeled image a set of keywords or tags, each of which represents a certain semantic object/concept. By automatic image annotation, an image retrieval problem is turned into a text retrieval task, which can be effectively resolved by taking advantage of mature text indexing and retrieval techniques.

In the past decade, numerous studies have been focused on automatic image annotation [5], [8], [11], [20], [22]. Some earlier studies extract global visual features, such as color and texture, from whole images to represent them as data points in vector space. As a result, image annotation is formulated as a supervised classification problem where data are given in some vector space [5]. Such an approach enjoys the merits of efficient computation and compact storage, but often works effectively only for annotating scene images or single-object images; it usually performs poorly on generic images that contain multiple objects.

Later, besides extensive studies on global features, more promising studies focused on regional features. One typical approach is to partition an image into multiple regions/blobs based on image segmentation and clustering techniques. As a result, image annotation is turned into a machine translation task of classifying regions/blobs into keywords [8]. Along this direction, a variety of statistical learning techniques, such as relevance models [20], [22], have been applied to model the relationships between words and regions/blobs. The performance of these approaches is often sensitive to the quality of image segmentation, which is still an open research challenge in image processing.

Recently, thanks to the advances of powerful local feature descriptors, such as SIFT [27], researchers in computer vision have attempted to resolve object recognition/image annotation problems with a new approach, known as the "Bag-of-Words" (BoW) model, which was derived from natural language processing. Specifically, given an image, BoW first employs some interest point detector, e.g. the DoG (Difference of Gaussians) detector, to detect salient patches/regions in the image. Further, a certain feature descriptor, e.g. SIFT, is applied to represent the local patches/regions as numerical feature vectors. The last step of BoW is to generate a codebook by converting the patches to "codewords", e.g. applying the k-means algorithm to cluster all the feature vectors into k clusters and then defining codewords based on the centers of the k resulting clusters. By mapping each visual feature in the image to the codewords, the image is represented by the histogram of the codewords. Based on the BoW representation, some well-known topic models, e.g. probabilistic latent semantic analysis (pLSA) [17], can be applied to analyze the topics of the images [6]. While sacrificing spatial information, BoW has generally shown promising performance for object categorization [34] and image annotation tasks [11].

However, BoW still has several important drawbacks. Besides the loss of spatial information, which has been widely discussed in many recent papers [4], [26], [37], [24], [42], another critical disadvantage is that the semantics of objects is considerably lost during the processes of sub-region detection and visual word generation. Firstly, the detection and segmentation of sub-regions damage the semantic integrity.
Several methods have been proposed to locate the sub-regions in an image, e.g. regular grid [42], interest point detectors [36], [27], random sampling [28], sliding windows [23], and other segmentation methods [32], [16]. However, due to the lack of human knowledge, these methods cannot locate the semantically intact regions very accurately, which partially causes the semantic gap problem. Secondly, it is problematic to generate the visual words using k-means clustering in Euclidean space, which implicitly assumes that SIFT features of similar semantics are distributed in the same cluster in Euclidean space. This, however, does not always hold, especially for high-dimensional SIFT features. Unlike the completely unsupervised clustering by k-means in visual word generation, we believe that a semi-supervised clustering approach with the aid of side information could lead to a more effective codebook for object representation.

To this end, this paper proposes a novel Semantics-Preserving Bag-of-Words (SPBoW) model, which considers the distance between semantically identical features as a measurement of the semantic gap, and tries to learn a codebook by minimizing this semantic gap. We formulate the codebook generation task as a distance metric learning problem, which can be formalized as semi-definite programming (SDP). We then propose an efficient eigen projection algorithm to solve the optimization problem efficiently. With the integrated knowledge and side information, the semantic gap can be minimized and the codebook is able to consistently represent the semantics of the objects. To the best of our knowledge, this is the first distance metric learning approach to overcome the limitation of semantic loss in BoW models.

As a summary, the main contributions of this paper include: (1) we are the first to propose a measurement of the semantic gap; (2) we propose to bridge the semantic gap via a distance metric learning method; (3) we propose and implement an efficient algorithm to solve the codebook learning task; (4) we suggest a novel object-based codebook scheme; (5) we propose a measurement for visual complexity; (6) the proposed method can automatically decide the size of the codebook for each category; (7) we evaluate and compare a number of different methods for the codebook generation process in building various bag-of-words models for object annotation tasks.

The rest of the paper is organized as follows. Section II reviews related work. Section III presents the framework of the SPBoW model, and Section III-B gives the details of the object representation for this novel model. Section IV elaborates on the codebook learning task and formulates it as an optimization problem. Section V applies the proposed semantics-preserving codebook (SPC) technique to object annotation tasks. Section VI compares the SPBoW model and the metric learning algorithm with several state-of-the-art methods in object annotation experiments on MIT's Labelme testbed [30] and object categorization on the PASCAL VOC challenge testbed [9]. Section VII concludes the paper.

II. RELATED WORK

Our work is related to several research topics, including image annotation/object recognition and distance metric learning. Below we briefly review the related work of both categories.

A. Image Annotation and Object Recognition

In the literature, numerous studies have been devoted to image annotation and object recognition. They can be roughly grouped into three major categories. The first category is based on global features [13]. As a result, regular supervised classification techniques, such as SVM, can be applied to solve the categorization and annotation tasks.

The second category is to extract regional features such that an image can be represented by a set of visual regions/blobs [2], [8], [22]. The image annotation task is thus converted to a problem of learning keywords/tags from visual regions/blobs. For instance, Barnard et al. [2] treated image annotation as a machine translation problem. Jeon et al. [8] proposed the cross-media relevance model (CMRM), which combines both surrounding texts and image contents for annotation. Jin et al. [22] studied coherent language models that take into account the word-to-word correlation.

The last category is focused on applying bag-of-features or bag-of-words representations to image annotation/object recognition [6], [21], [35]. Csurka et al. [6] proposed a bag-of-keypoints approach, similar to BoW in text categorization, for visual object categorization. Jiang et al. [21] studied some practical techniques to improve the performance of bag-of-features for object recognition and retrieval. Recently, Wu et al. [42] proposed a language modeling approach to address one limitation of the BoW models, i.e., the loss of spatial information. These methods generate the codebook by clustering visual features in the original feature space. Due to the semantic gap, each visual word may contain multiple semantic meanings, and the same semantic meaning may be represented by multiple visual words. In these models, each visual word therefore does not correspond to a precise semantic meaning.

Besides, there are also some emerging paradigms for image annotation, such as search-based annotation [38], which explores WWW images to help the annotation tasks, and the ALIPR paradigm [25], which uses advanced statistical learning techniques to provide fully automatic and real-time annotation of digital pictures. These techniques are not highly relevant to our focus, and are thus outside the discussion in this paper.

B. Distance Metric Learning

From a machine learning point of view, our work is related to supervised distance metric learning (DML). Specifically, given a set of n data examples X = {x_i ∈ R^d}_{i=1}^n in a d-dimensional vector space, the objective of DML is to find an optimal Mahalanobis metric M from training data with side information, which can be either class labels or general pairwise constraints [43].

In the literature, DML has been actively studied recently. Existing DML studies can be roughly grouped into two major categories.
One category is to learn metrics with class labels, such as Neighbourhood Components Analysis (NCA) [14], which are often studied for classification [12], [39], [44]. NCA [14] learns a distance metric by extending the nearest neighbor classifier. The large margin nearest neighbor (LMNN) classifier [39] extends NCA through a maximum margin framework. Information-Theoretic Metric Learning (ITML) [7] approaches the metric learning problem from information theory, and obtains the optimal metric by minimizing the differential relative entropy between two multivariate Gaussians under constraints on the distance function. The other category is to learn metrics from pairwise constraints, which are mainly used for clustering and retrieval. Examples include Relevant Components Analysis (RCA) [1] and Discriminative Component Analysis (DCA) [19], amongst others [43], [33], [18], [40]. RCA learns a global linear transformation from the equivalence constraints; the learned linear transformation can be used directly to compute the distance between any two examples. DCA and Kernel DCA [19] improve RCA by exploring negative constraints and aiming to capture nonlinear relationships using contextual information. Essentially, RCA and DCA can be viewed as extensions of Linear Discriminant Analysis (LDA) that exploit must-link and cannot-link constraints.

III. FRAMEWORK OF SEMANTICS-PRESERVING BAG-OF-WORDS MODELS

A. Overview

The BoW model treats an image as a bag of "codewords", which essentially consists of a set of independent local appearance features. These features are located by salient region detectors like SIFT, random samplings like random windows, segmentation, or a regular grid. These high-dimensional features may contain much noise and redundancy, and are often difficult to store and use directly. Hence, visual words are further generated by performing clustering on these features. Through feature clustering, each visual word usually corresponds to a cluster in the feature vector space. Based on the visual words, each of the features detected in an image can be mapped to the most similar visual word by measuring the distance between the feature and all visual words. Consequently, a histogram of visual words can be calculated to represent the image.
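To make this baseline concrete, here is a minimal sketch of the standard BoW pipeline just described (our illustration, not the authors' code; it assumes precomputed SIFT descriptors and clusters them with plain Euclidean k-means):

    import numpy as np
    from scipy.cluster.vq import kmeans2, vq  # standard k-means utilities

    def build_bow_codebook(descriptors, k=1000, seed=0):
        """Cluster all training descriptors (N x d) into k visual words."""
        centers, _ = kmeans2(descriptors.astype(float), k, minit='++', seed=seed)
        return centers

    def bow_histogram(image_descriptors, centers):
        """Map each descriptor to its nearest visual word and normalize the counts."""
        words, _ = vq(image_descriptors.astype(float), centers)
        hist = np.bincount(words, minlength=len(centers)).astype(float)
        return hist / max(hist.sum(), 1.0)

It is exactly the Euclidean clustering step in build_bow_codebook that the semantics-preserving scheme below replaces with a learned metric.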
BoW can be applied to object annotation with either a naïve Bayes classifier [41] or more complex latent topic analysis methods, such as pLSA [34] and LDA [3]. For example, with a naïve Bayes classification approach, object annotation is equivalent to matching the visual word histogram of an image against the visual word histograms of semantic objects. The name of an object is annotated to the image if the visual word histogram of the object matches the visual word histogram of the image.

In this paper, we aim to investigate a new BoW framework for object representation to overcome the limitations of existing BoW, with applications to image annotation and object detection tasks. In particular, we propose a novel Semantics-Preserving Bag-of-Words (SPBoW) framework. Fig. 1 illustrates the flowchart of our framework. First of all, in the training process, objects in the images are segmented and tagged by users. SIFT features are extracted from the images to represent these objects. The SIFT features that are located at the same semantic parts of objects are considered relevant to each other, and will be used as the similar pairwise constraints in our learning task; on the other hand, any two SIFT features that are located at different semantic parts of objects are considered irrelevant, and will be treated as the dissimilar pairwise constraints in our learning task. We refer to the collections of similar and dissimilar pairwise constraints as "side information".

Fig. 1. The process of building the semantics-preserving bag-of-words model.

In this paper, we propose a novel learning scheme to optimize the distance metric from the side information. By minimizing the semantic loss, the optimized distance metric aims to achieve the Semantics-Preserving Codebook (SPC) representation, which can be beneficial for image annotation and object categorization tasks.

B. SPBoW for Object Representation

In traditional BoW, an image is represented by the histogram of visual words from a codebook. This simple representation has some drawbacks. First of all, both the visual words extracted from the object regions and the visual words extracted from the background regions are incorporated when generating the BoW model. Such a simple approach brings background noise into the resulting model, which is supposed to describe only the object. Moreover, this representation may be distorted if an image contains multiple objects, and many real-world images usually do: all the irrelevant objects in an image become noise when building the regular BoW model for a certain object. Although this problem may be partially resolved by latent topic analysis, that approach also faces a number of challenges, e.g. how to determine the number of latent topics.

For the above reasons, our new SPBoW approach aims to preserve the semantics by modeling each individual object rather than simply modeling a whole image.
In particular, we adopt images from MIT's Labelme testbed [30] as training data, in which objects are well segmented and labeled by users. Within the proposed SPBoW framework, we first apply SIFT to extract features from each image. The SIFT features that are located at regions of the same semantics (label) in all the images are collected to represent that semantics. In order to preserve the semantics in the BoW model, all the collected features related to the same semantics are clustered into one or several discriminative visual words for representing the object, based on an optimized distance metric that aims to minimize the overall semantic loss. The visual words used for representing an object may describe different semantic parts or different views of the object. Finally, we note that the set of visual words used for one object is often different from the set used for another. This is very different from the regular BoW model, where all objects share the same set of visual words. Next we present a novel learning technique that aims to find an optimal distance metric to overcome the limitation of semantic loss during the codebook generation process.

IV. LEARNING TO OPTIMIZE CODEBOOKS

Codebook generation is a critical step of building the BoW model. Instead of generating the codebook by applying simple k-means clustering in Euclidean space, which often leads to much semantic loss, in this paper we suggest a novel metric learning scheme that exploits side information to minimize the semantic loss in the codebook generation process.

A. Problem Formulation

We first formalize the representation of side information, which is illustrated in Fig. 2. Assume we are given a set of pairwise feature instances {(x_{i1}, x_{i2})}_{i=1}^N and a set of corresponding instance constraints {(z_{i1}, z_{i2}, y_i)}_{i=1}^N, where x_{i1} ∈ R^d and x_{i2} ∈ R^d are two d-dimensional feature instances, e.g. SIFT feature vectors; x_{i1} indicates the first feature vector in the pair, and x_{i2} the second; z_{i1} and z_{i2} are binary indicators of whether a feature instance is located on the object region or the background region of the image. As shown in Fig. 2(a), if feature instance x_{i1} is on the object region, then z_{i1} = 1; otherwise z_{i1} = 0. The variable y_i indicates whether the feature instances in pair (x_{i1}, x_{i2}) are of the same semantics. If both x_{i1} and x_{i2} are on the same semantic parts of objects, e.g. the tyres of cars as shown in Fig. 2(d), then y_i = 1. If the two features are of different semantics, i.e., they appear on different semantic parts of two different objects (Fig. 2(c)), or they are located on the same object but on different semantic parts, e.g. the tyre and window of a car as shown in Fig. 2(b), then y_i = -1.

Fig. 2. Illustration of side information between objects and feature instances.

In general, side information can be generated automatically from the locations of feature points in the well-segmented images. For example, in the Labelme testbed, objects and background regions are manually separated for each image, and different parts of the objects are also manually segmented by users. Hence, if two feature vectors are located at the same region or at regions with the same semantic label, they are considered to have the same semantic meaning, i.e., y_i = 1. Similarly, in the PASCAL VOC2006 datasets, objects in each image are separated from the background by a bounding box. Thus, two features in the same bounding box, or in bounding boxes with the same label, are treated as having the same semantic meaning.

Given the above side information, the goal of our task is to learn a distance metric A to effectively measure the distance between any two visual features x_{i1} and x_{i2}, which is often represented in the following framework:

    d(x_{i1}, x_{i2}) = \sqrt{(x_{i1} - x_{i2})^\top A (x_{i1} - x_{i2})}    (1)

where matrix A ∈ R^{d×d} is the target distance metric, which must be positive semi-definite w.r.t. the properties of a valid metric, i.e., A ⪰ 0. To find an optimal metric A, the basic principle of our metric learning task is that distances between visual feature vectors of the same semantics should be minimized, while distances between feature vectors of different semantics should be maximized. Based on this principle, we can search for the optimal metric that facilitates clustering the feature vectors of the same semantics into the same visual words, so that each visual word has a specific semantic meaning. To this end, we formulate our distance metric learning problem as the following optimization:

    \min_{A \succeq 0,\, b} \; \sum_i z_{i1} z_{i2} \xi_i + \frac{\lambda}{2} \mathrm{tr}(A A^\top)    (2)
    \text{s.t.} \; y_i (\|x_{i1} - x_{i2}\|_A - b) \le \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n    (3)
    \|A\| = 1/\sqrt{\lambda}    (4)

where ||.||_A is the Mahalanobis distance between two features under metric A. The first term of the objective function is the slack variable term, which accounts for the semantic loss w.r.t. the side information of n pairwise constraints {(x_{i1}, x_{i2}, z_{i1}, z_{i2}, y_i)}_{i=1}^n. With the first inequality constraint, minimizing this term makes the distance between two semantically identical features smaller, so that they are more likely to be assigned to the same visual word. The second term of the objective function is a regularization term, which prevents overfitting by minimizing the complexity of the model. The equality constraint (4) is introduced to prevent the trivial solution of shrinking metric A to the zero matrix, and λ is a constant parameter. By solving the optimization problem, we obtain the optimized distance metric A and the threshold variable b, which can be used to determine whether two features are similar or dissimilar. In general, the above optimization problem is a semi-definite program (SDP), which is often difficult to solve to global optimality in large applications.
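For concreteness, a small sketch of the distance (1) and the loss in (2)-(3) (our illustration with hypothetical variable names; we use the squared Mahalanobis distance, as in the instance-selection step of Algorithm 1 below):

    import numpy as np

    def mahalanobis_sq(x1, x2, A):
        """Squared distance (x1 - x2)^T A (x1 - x2) under a PSD metric A, cf. Eq. (1)."""
        d = x1 - x2
        return float(d @ A @ d)

    def semantic_loss(pairs, A, b, lam):
        """Objective (2)-(3): hinge slacks over object-region pairs plus regularization."""
        loss = 0.0
        for x1, x2, z1, z2, y in pairs:      # y = +1 same semantics, -1 different
            xi = max(0.0, y * (mahalanobis_sq(x1, x2, A) - b))
            loss += z1 * z2 * xi             # only pairs on object regions contribute
        return loss + 0.5 * lam * np.trace(A @ A.T)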
B. Optimization

In this section, we present a stochastic gradient search algorithm, combined with an active constraint selection scheme, to efficiently solve the above optimization problem. To simplify the formulation, we denote the feature matrix as X ∈ R^{N_tr × d}, where N_tr is the number of SIFT features in the training set and d is the feature dimension. We also represent all the feature pairs (x_{i1}, x_{i2}) in the training data by two feature matrices X_1 = [x_{11}, x_{21}, ..., x_{n1}]^T and X_2 = [x_{12}, x_{22}, ..., x_{n2}]^T, and similarly their constraints by three matrices Z_1 = diag(z_{11}, z_{21}, ..., z_{n1}), Z_2 = diag(z_{12}, z_{22}, ..., z_{n2}), and Y = diag(y_1, ..., y_n). The proposed iterative optimization scheme proceeds in the following steps.

First of all, we actively choose a subset of informative side information from the training data as the training instances. In particular, a training instance must satisfy one of two criteria: (1) the features are of the same semantics but far apart under the current metric; or (2) the features are of different semantics but close under the current metric.

Based on the selected training dataset S_t in the t-th iteration, we then apply the gradient descent technique to search for the optimal metric A and threshold b.

Finally, to enforce the valid metric constraint, we project the current solution of metric A back onto the positive semidefinite (PSD) cone via an eigendecomposition. The details of the proposed Semantics-Preserving Metric Learning (SPML) algorithm are given in Algorithm 1, in which γ is a learning rate variable that is determined empirically, and D_X = X_1 - X_2 is the difference between the two feature matrices X_1 and X_2. Empirically, this iterative algorithm converges quickly, in no more than 5 iterations.

Algorithm 1 The Semantics-Preserving Metric Learning (SPML) algorithm
INPUT:
  - SIFT feature matrix: X ∈ R^{N×d}
  - pairwise constraints (x_{i1}, x_{i2}, z_{i1}, z_{i2}, y_i), where x_{i1} is the i1-th SIFT feature, z_{i1} indicates whether feature x_{i1} is located on the semantic object, and the constraint y_i ∈ {+1, 0, -1} represents that features x_{i1} and x_{i2} are on the same semantic part of the object, unknown, or on different semantic parts
  - regularization parameter λ
  - learning rate parameter γ
PROCEDURE:
 1: initialize metric and threshold: A = I, b = b_0
 2: set iteration step t = 1
 3: repeat
 4:   (1) update the learning rate: γ = γ/t, t = t + 1
 5:   (2) update the subset of training instances:
        S_t^+ = {(x_{i1}, x_{i2}, y_i) | (1 + y_i) ||x_{i1} - x_{i2}||_A^2 > 1}
        S_t^- = {(x_{i1}, x_{i2}, y_i) | (1 - y_i) ||x_{i1} - x_{i2}||_A^2 < 1}
        S_t = S_t^+ ∪ S_t^-
 6:   (3) compute the gradient w.r.t. A: ∇_A L ← Z_1 Z_2 (λA + D_X^T Y D_X), with D_X = X_1 - X_2
 7:   (4) compute the gradient w.r.t. b: ∇_b L ← tr(Z_1 Z_2 Y)
 8:   (5) update metric and threshold: A_{t+1} ← A_t - γ_t ∇_A L, b_{t+1} ← b_t - γ_t ∇_b L
 9:   (6) project A back onto the PSD cone:
        A_{t+1} = Σ_{i=1}^d λ_i φ_i φ_i^T,  then  A_{t+1} ← Σ_i max(0, λ_i) φ_i φ_i^T
10:   (7) normalize A_{t+1} to satisfy ||A_{t+1}|| = 1/√λ: A_{t+1} ← A_{t+1} · (1/√λ) / ||A_{t+1}||
11: until convergence
OUTPUT:
  - feature metric A, threshold variable b
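The loop above can be paraphrased in a few lines of code (a sketch under our reading of the update rules, not the authors' released implementation; the PSD projection uses a symmetric eigendecomposition):

    import numpy as np

    def spml(X1, X2, z1, z2, y, lam, gamma0=0.1, b0=1.0, iters=5):
        """Semantics-Preserving Metric Learning: gradient steps plus PSD projection."""
        n, d = X1.shape
        A, b = np.eye(d), b0
        for t in range(1, iters + 1):
            gamma = gamma0 / t                                # step 4: decaying rate
            D = X1 - X2                                       # pair differences D_X
            dist2 = np.einsum('ij,jk,ik->i', D, A, D)         # ||x_i1 - x_i2||_A^2
            # step 5: similar pairs that are far, dissimilar pairs that are near
            active = ((y > 0) & (dist2 > 0.5)) | ((y < 0) & (dist2 < 0.5))
            w = z1 * z2 * y * active                          # signed pair weights
            grad_A = lam * A + (D * w[:, None]).T @ D         # step 6
            grad_b = -w.sum()                                 # step 7 (sign per our reading)
            A -= gamma * grad_A
            b -= gamma * grad_b
            evals, evecs = np.linalg.eigh((A + A.T) / 2)      # step 9: PSD projection
            A = (evecs * np.clip(evals, 0, None)) @ evecs.T
            A *= (1 / np.sqrt(lam)) / np.linalg.norm(A)       # step 10: normalization
        return A, b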
C. Convergence Analysis

We now analyze the convergence of the algorithm. Let us denote the objective function in the t-th iteration as follows:

    L(A_t, S_t) = \sum_{(x_{i1}, x_{i2}, y_i) \in S_t} y_i \big(\|x_{i1} - x_{i2}\|_{A_t} - b_t\big) + \frac{\lambda}{2} \mathrm{tr}(A_t A_t^\top)    (5)

To prove the convergence of the algorithm, we first bound the objective function after T iterations. Here we adopt the following theorem proposed in [15], which provides a bound for a general sub-gradient method; detailed proofs and explanations can be found in [31].

Theorem 1: Let L_1, ..., L_T be a sequence of λ-strongly convex functions w.r.t. the objective function (1/2) tr(·), where L_t = L(A, S_t). Let A be a closed convex set and define Π_A(A) = arg min_{A' ∈ A} ||A - A'||. Let A_1, ..., A_{T+1} be a sequence of vectors such that A_1 ∈ A and, for t ≥ 1, A_{t+1} = Π_A(A_t - γ_t ∇_t), where ∇_t is a subgradient of L_t at A_t. Assume that for all t, ||∇_t|| ≤ G. Then, for all u ∈ A, we have

    \frac{1}{T} \sum_{t=1}^T L_t(A_t) \le \frac{1}{T} \sum_{t=1}^T L_t(u) + \frac{G^2 (1 + \ln T)}{2 \lambda T}    (6)

By applying Theorem 1, we can prove the following corollary.

Corollary 2: Let L_1, ..., L_T be a sequence of λ-strongly convex functions. Let A be a closed convex set and define Π_A(A) = arg min_{A' ∈ A} ||A - A'||. Let A_1, ..., A_{T+1} be a sequence of matrices such that A_1 ∈ A and, for t ≥ 1, A_{t+1} = Π_A(A_t - γ_t ∇_t), where ∇_t is a subgradient of L_t at A_t. Then the bound for the proposed objective function is

    \frac{1}{T} \sum_{t=1}^T L_t(A_t) \le \frac{1}{T} \sum_{t=1}^T L_t(A^*) + \frac{(\sqrt{\lambda} + \sum_i \xi_i)^2 (1 + \ln T)}{2 \lambda T}

where A^* is the optimal solution.

Using Corollary 2 and the convexity of the objective function L, i.e., L(\frac{1}{T} \sum_{t=1}^T A_t) \le \frac{1}{T} \sum_{t=1}^T L_t(A_t), we can further show the following corollary on the optimization bound.

Corollary 3: Let L(A_t, S_t) = L_t(A_t) and L(A_t) = E_{S_t} L(A_t, S_t). Assume the conditions stated in Corollary 2, and denote Ā = \frac{1}{T} \sum_{t=1}^T A_t and G = \sqrt{\lambda} + \sum_i \xi_i. Then we have

    L(\bar{A}) \le L(A^*) + \frac{G^2 (1 + \ln T)}{\lambda T}

The proofs of the above two corollaries can be found at http://www.cais.ntu.edu.sg/~chhoi/SPBOW/proofs.pdf. Denoting η(T) = G²(1 + ln T)/(2λT), we see that as the iteration number T → ∞, η(T) → 0. This corollary thus proves the convergence of the algorithm. Finally, by applying Corollary 3 and using the first-order Taylor expansion of ln(T), we obtain that, to achieve a solution with accuracy ε, the algorithm requires O(G²/(ελ)) iterations.
D. Codebook Generation

A codebook can be generated by clustering the features under the learned distance metric into visual words, or codes. Different visual words can represent different views or different parts of an object. In this paper, we propose to generate the codebook for each object category, so that the linkage between the codewords in the codebook and the high-level semantics of the object category can be established effectively, which is essential to bridge the gap between low-level features and high-level semantics.

Specifically, for each object category, we first collect all the related features from the same object regions, and then perform k-means clustering based on the optimized distance metric A obtained from the proposed SPML scheme. By the k-means clustering, we obtain a set of k clusters (i.e., visual words or codewords) for this object category. Finally, we form a global codebook by gathering all codewords from all object categories. We refer to the resulting codebook as the "Semantics-Preserving Codebook" (SPC).

In general, there are two important issues for SPC: (1) codebook size assignment, and (2) visual word generation.

1) Codebook Size Assignment: This determines how many codes should be assigned to each object category. One straightforward approach is to uniformly assign the same number of codes to every object. This, however, does not account for the differing complexity of semantic understanding across objects. A more desirable approach is to assign varied codebook sizes to different objects. To address this challenge, we introduce two principles for the assignment task: (1) the number of codes increases linearly w.r.t. the visual complexity of an object category; (2) the visual complexity of an object category can be measured by the diversity of its associated features.

In this paper, we suggest measuring the visual complexity of an object category using information theory. In particular, consider each object category as a bag of features; each feature in the object category has a probability of being generated from the bag. Such a probability can be estimated by either the distance to the mean of all features or the frequency of the features. For example, denoting by C_i an object category and by x_j some feature, we can estimate the generative probability p(x_j | C_i) as follows:

    p(x_j \mid C_i) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\!\Big(-\frac{\|x_j - \hat{x}\|_A^2}{2 \sigma^2}\Big)    (7)

where \hat{x} = \frac{1}{n_{C_i}} \sum_{x_j \in C_i} x_j, and n_{C_i} is the total number of features related to the objects from C_i. Based on the above estimated probability, we calculate the information entropy of the bag as a measurement of the object's visual complexity:

    H(C_i) = -\sum_{x_j \in C_i} p(x_j \mid C_i) \log p(x_j \mid C_i)    (8)

Finally, we assign object C_i a number of codes L_{C_i} proportional to its visual complexity, i.e.,

    L_{C_i} = \big\lfloor L_{max} \times H(C_i) / \log n_{C_i} \big\rfloor    (9)

where L_max is the maximum size of the SPC for each category. The total number of visual words over all categories is L_max × M, where M is the number of categories.
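A sketch of this entropy-based assignment (our illustration; sigma is a hypothetical bandwidth parameter, and we normalize the estimated probabilities so the entropy in (8) is well defined):

    import numpy as np

    def codebook_sizes(features_by_cat, A, L_max, sigma=1.0):
        """Assign each category a number of codes proportional to its entropy, Eq. (9)."""
        sizes = {}
        for cat, X in features_by_cat.items():         # X: (n_c x d) features of one category
            D = X - X.mean(axis=0)
            dist2 = np.einsum('ij,jk,ik->i', D, A, D)  # ||x_j - mean||_A^2
            p = np.exp(-dist2 / (2 * sigma ** 2))      # Eq. (7), up to the constant factor
            p = p / p.sum()                            # normalization (our assumption)
            H = -(p * np.log(p + 1e-12)).sum()         # Eq. (8)
            sizes[cat] = int(np.floor(L_max * H / np.log(len(X))))  # Eq. (9)
        return sizes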
2) Visual Word Generation: This task builds the codebook for each object category C_i by applying k-means clustering to the associated features to generate a set of L_{C_i} visual words.

Let us denote by X_i the collection of features belonging to object category C_i, i.e., X_i = {(x, y) | x ∈ X, y = C_i, C_i ∈ C}, where y denotes the object category label of feature x, X is the feature space, and C is the label space. The proposed algorithm first applies k-means clustering on X_i with the optimized metric A to generate a set of K clusters, denoted by {c_{ij}, r_{ij} | j = 1, ..., K}, where K is set to be larger than max_i L_{C_i}, c_{ij} denotes the center of the j-th cluster, and r_{ij} denotes the range radius of the cluster, defined as the largest distance from the cluster's features to its center.

To reduce noisy clusters, we further sort the K clusters according to their sizes S_{ij}, calculated as:

    S_{ij} = \sum_x \delta(\|x - c_{ij}\|_A, r_{ij}), \quad \text{where} \quad \delta(a, b) = 1 \text{ if } a \le b, \text{ and } 0 \text{ otherwise}

The algorithm then chooses the top L_{C_i} largest clusters as the set of visual words for the codebook of category C_i. Finally, the algorithm gathers all the visual words from every object category and outputs the set of visual words along with their ranges, i.e., {w_k, r_k}_{k=1}^{L_max}, as the final SPC. The visual word generation procedure is summarized in Algorithm 2.

Algorithm 2 Codebook Generation Algorithm
INPUT:
  - features and their object labels {(x, y), x ∈ X, y ∈ C}
  - optimized distance metric A
  - codebook size assigned to each object L_{C_i}, i = 1, ..., M
  - the number of clusters K > max_i L_{C_i}
PROCEDURE:
 1: initialize the number of visual words L = 0
 2: for i = 1 : M do
 3:   cluster the features of the i-th object X_i = {(x, C) | C = C_i} into K clusters: [c_{ij}, r_{ij}] = kmeans(X_i, K)
 4:   calculate the size of each cluster: S_{ij} = Σ_x δ(||x - c_{ij}||_A, r_{ij})
 5:   sort clusters by their sizes: c_{ij} ← sort(c_{ij}, S_{ij}), r_{ij} ← sort(r_{ij}, S_{ij})
 6:   adopt the top L_{C_i} largest clusters as visual words for the category: w_{L+j} = c_{ij}, r_{L+j} = r_{ij}, j = 1, ..., L_{C_i}
 7:   update the number of visual words: L = L + L_{C_i}
 8: end for
OUTPUT:
  - the centers of the visual words w_k and their range radii r_k, k = 1, ..., L_max
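A runnable sketch of Algorithm 2 (our paraphrase; k-means under metric A is implemented as Euclidean k-means on whitened data x → L^T x with A = L L^T, which preserves the Mahalanobis distances):

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def generate_spc(features_by_cat, A, sizes, K, seed=0):
        """Per-category k-means under metric A; keep the largest clusters as words."""
        L = np.linalg.cholesky(A + 1e-8 * np.eye(A.shape[0]))  # A = L L^T
        words, radii = [], []
        for cat, X in features_by_cat.items():
            Xw = (X @ L).astype(float)                 # map into the learned metric space
            centers, labels = kmeans2(Xw, K, minit='++', seed=seed)
            counts = np.bincount(labels, minlength=K)
            for j in np.argsort(-counts)[:sizes[cat]]: # top L_Ci largest clusters
                members = Xw[labels == j]
                r = np.linalg.norm(members - centers[j], axis=1).max() if len(members) else 0.0
                words.append(centers[j])
                radii.append(r)
        return np.array(words), np.array(radii)

Note that the codewords returned here live in the whitened space, so query features must be mapped by the same L before matching.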

E. Visual Word Histogram

To apply the SPC in the test phase, the key task is to generate the visual word histogram for a novel test image. In particular, we first extract SIFT features from the novel image, and then map each SIFT feature x ∈ R^d to a visual word id k in the codebook. Different from traditional BoW, in our approach one visual feature can be assigned to multiple visual words in different object categories. This is because the ranges of visual words may overlap each other, and the same semantics may appear in different objects. For example, "window" can appear in both "building" and "car" objects. Hence, instead of assigning a feature to the closest visual word, we suggest assigning the feature to a visual word whenever the distance between the feature and the visual word is smaller than the range radius. Specifically, we define a mapping function π(x, k) between feature x and visual word w_k as follows:

    \pi(x, k) = 1 \text{ if } \|x - w_k\|_A < r_k, \text{ and } 0 \text{ otherwise}    (10)

By the mapping function, we calculate the frequency of a visual word w_k appearing in image I as f_I(k) = \sum_{x \in I} \pi(x, k). Finally, we obtain the visual word histogram by normalizing the visual word frequencies:

    h_I(w_k) = \frac{f_I(k)}{\sum_{v=1}^{L_{max}} f_I(v)}    (11)

V. GENERATIVE AND DISCRIMINATIVE MODELS WITH SPBOW

Like existing BoW models, the proposed SPBoW can easily be adopted by existing classification methods, including both generative and discriminative models. Fig. 3 illustrates the idea of applying SPBoW to annotating a novel image in an object annotation task. Below we discuss two representative methods for applying SPBoW in image annotation and object categorization applications.

Fig. 3. Illustration of object annotation using the SPBoW representation.

A. Generative Model

Based on the SPBoW technique, we now discuss how to apply the resulting semantics-preserving codebook to building generative models for an object annotation task. Assume that we are given a set of labeled image regions {(I_j, C(I_j))}_{j=1}^{N_tr}. Our goal is to automatically annotate a novel image I.

First of all, we extract SIFT features from these training regions {x ∈ I_j, j = 1, ..., N_tr}. For each feature, we then find its mappings to the visual words in the codebook, based on the mapping function defined in (10). We also transfer the region's object labels from the feature x to the mapped visual word w_k whenever the mapping result is nonzero, i.e., π(x, k) = 1. Finally, by gathering all visual words associated with a certain semantic object, we estimate the visual word's conditional distribution:

    p(w_k \mid C_i) = \frac{\sum_{\{x \mid C(x) = C_i\}} \pi(x, k) + 1}{\sum_k \sum_{\{x \mid C(x) = C_i\}} \pi(x, k) + V}

where V is the vocabulary size. In the above formula, we adopt Laplace smoothing to avoid the zero-probability issue. Assuming a uniform distribution of images, the likelihood of object category C_i appearing in image I can be calculated by a naïve Bayes model as follows:

    p(C_i \mid I) \propto p(I \mid C_i)\, p(C_i) \propto p(C_i) \prod_k p(w_k \mid C_i)^{f_I(k)}    (12)

where f_I(k) is the frequency of visual word w_k appearing in the test image I, and the prior p(C_i) can be calculated from the normalized frequencies of the object categories appearing in the training data. Finally, we rank the object categories by their likelihood p(C_i | I), C_i = 1, ..., K, and the top N (N = 1, ..., 10) ranked categories are used to annotate the image.
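A sketch of this test-phase pipeline, combining the range-based mapping (10), the histogram (11), and the smoothed naïve Bayes scoring (12), in the log domain for numerical stability (our illustration, continuing the whitened-space convention above):

    import numpy as np

    def spbow_histogram(X, words, radii):
        """pi(x, k): a feature activates every word whose range radius covers it."""
        D = np.linalg.norm(X[:, None, :] - words[None, :, :], axis=2)
        pi = (D < radii[None, :]).astype(float)    # Eq. (10)
        f = pi.sum(axis=0)                         # frequencies f_I(k)
        return f, f / max(f.sum(), 1.0)            # counts and Eq. (11) histogram

    def word_given_cat(counts, V):
        """counts[i, k]: activations of word k on category i; Laplace smoothing."""
        return (counts + 1.0) / (counts.sum(axis=1, keepdims=True) + V)

    def annotate(f_I, p_w_c, priors, top_n=10):
        """Rank categories by log p(C_i) + sum_k f_I(k) log p(w_k | C_i), Eq. (12)."""
        scores = np.log(priors) + f_I @ np.log(p_w_c).T
        return np.argsort(-scores)[:top_n]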

B. Discriminative Models

The learned SPC can also be used in a discriminative learning setting. To illustrate this property, we apply the codebook to train SVM models for classifying the visual objects. As before, we are given a set of training images (or image regions) and their semantic categories {(I_j, C(I_j))}_{j=1}^{N_tr}. Based on the SPBoW representation, we can represent each image I_j by an L_max-dimensional vector, the visual word histogram h_I = [h_I(w_1), h_I(w_2), ..., h_I(w_{L_max})]. Since this is a multi-class classification task, we train multiple binary SVM models in a one-against-all fashion. Specifically, for the i-th category, we build a binary SVM classifier as follows:

    \min_{\omega, b} \; \frac{1}{2} \|\omega\|^2 + C \sum_j \xi_j    (13)
    \text{s.t.} \; y_j(i) (\omega \cdot h_{I_j} - b) \ge 1 - \xi_j, \quad \xi_j \ge 0, \quad 1 \le j \le N_{tr}    (14)

where ω is the SVM weight vector, C is the penalty constant, ξ_j are slack variables, and y_j(i) is a binary label function for the i-th category such that y_j(i) = 1 if C(I_j) = i and y_j(i) = -1 otherwise. In the object detection phase, with the same representation, each novel test image is classified by all of the binary SVM classifiers, and a positive output indicates that the corresponding object is detected in the image.
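In practice, (13)-(14) are standard one-vs-all SVMs over the SPBoW histograms; a brief sketch using scikit-learn (our choice of library, with an RBF kernel and C = 1, gamma = 0.07 matching the experimental settings reported later in Section VI):

    import numpy as np
    from sklearn.svm import SVC

    def train_one_vs_all(H, labels, n_categories, C=1.0, gamma=0.07):
        """One binary SVM per category over visual word histograms H (N x L_max)."""
        models = []
        for i in range(n_categories):
            y = np.where(labels == i, 1, -1)   # y_j(i) as in Eq. (14)
            models.append(SVC(C=C, kernel='rbf', gamma=gamma).fit(H, y))
        return models

    def detect(models, h):
        """A positive margin from classifier i means object i is detected."""
        return [i for i, m in enumerate(models) if m.decision_function([h])[0] > 0]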

VI. EXPERIMENTS

In this section, we conduct extensive experiments to empirically evaluate the performance of the proposed SPBoW model and the existing BoW model on image annotation and object categorization tasks. In addition to the proposed metric learning algorithm, we also show that our SPBoW framework can be integrated with other existing DML techniques: we extensively evaluate different implementations of SPBoW models obtained by adapting other existing DML algorithms to our framework.

A. Experimental Testbed

We adopt a dataset from MIT's Labelme project [30], which consists of 495 objects and 185 images, mostly of downtown streets. The objects include cars, trees, buildings, persons, lights, ladders, sidewalks, air conditioners, mail boxes, signs, bicycles, umbrellas, etc. In total, more than 400,000 local appearance features are extracted from these images.

We chose this dataset for several reasons. First of all, it has high-quality user-generated object segmentation and labeling information, which can be as detailed as parts of the objects, such as the front light of a car or the door of a building. Such detailed labeling helps to generate high-quality side information for learning the distance metric. Secondly, it contains around 495 common objects that frequently appear in daily life. Each image contains on average 12 objects, positioned and occluded as they are in the real world; detecting and annotating objects in such a complex setting is greatly challenging for any model. Finally, all the images are high-resolution and come from the real world, which helps us examine the performance and applicability of our technique in real applications.

B. Image Representation

In our experiments, we adopt SIFT to represent the local visual features. For each image, 1,000 SIFT features are extracted in a 128-dimensional vector space. We use SIFT for three reasons. Firstly, it is invariant to object scaling, rotation, and affine changes, and is thus relatively more robust than other feature descriptors, especially for object representation. Secondly, SIFT usually performs very well on street scenes, which account for a large portion of the images in our dataset. Finally, as the regular BoW model often uses SIFT, we adopt the same descriptor to ensure a fair comparison.

C. Experimental Settings

For the regular BoW model, the codebook is generated by performing k-means over all the SIFT features extracted from the training dataset. The centers of the resulting clusters are collected to form a set of k visual words as the codebook, in which each cluster represents one visual word. For the BoW representation, each feature in an image is then mapped to the nearest visual word in the codebook, and a visual word histogram is generated by summarizing the mapping results of all features of the image.

To examine how SPBoW can also benefit from existing DML methods, we implement several SPBoW variants by adapting four state-of-the-art metric learning algorithms: Relevant Component Analysis (RCA) [1], Information-Theoretic Metric Learning (ITML) [7], Large Margin Nearest Neighbor (LMNN) [39], and Neighborhood Components Analysis (NCA) [14]. All of them were implemented in the same experimental settings.

D. Experiment I: Annotation Performance

In this experiment, we evaluate the image annotation performance of the proposed techniques. The ground truth was generated by web users in the Labelme project [30]. We adopt standard performance metrics, Average Precision (AP@N) and Average Recall (AR@N), to evaluate the annotation performance over the top N annotated semantic labels/tags.

We perform 5-fold cross validation, in which 4 folds are used for building the codebook and 1 fold is used for testing the annotation performance. Our method has 2 key parameters: the number of sampled pairwise constraints and the codebook size. In this experiment, we simply fix the constraint size to 10,000 and the codebook size to 2,500; we examine their effects in subsequent experiments. Fig. 4 shows the comparison results of the different approaches, including the regular BoW method and five implementations of SPBoW with different DML techniques.

From Fig. 4, we find that most SPBoW algorithms significantly improve the annotation performance of the regular BoW in both precision and recall. Compared with the other existing DML algorithms, SPBoW with the newly proposed metric learning algorithm also shows a significant advantage. These results indicate that the codebook generated by our SPBoW technique is more discriminative than the regular BoW's, and that SPBoW is effective in reducing the semantic loss during the codebook generation process.

Fig. 4. Performance comparison of different approaches for image annotation: (a) Average Precision; (b) Average Recall.

E. Experiment II: Evaluation of Varied Constraint Sizes

In this experiment, we study the influence of the number of constraints on the final annotation performance. We sample a certain number of constraints from all the user-generated labels, gradually increasing the number from 1,000 to 10,000 in intervals of 1,000 constraints. For each number of constraints, we evaluate the object annotation performance of the resulting SPBoW. The average precision and average recall at top N (N = 10, ..., 100) are summarized in Fig. 5.

From the results, we can see that increasing the number of constraints generally improves the annotation performance in terms of both precision and recall. This is reasonable: as more side information is included in the metric learning task, we expect to learn a better metric, which is essential to generating the SPC for the annotation task. In practice, the selection of the number of constraints is a tradeoff between efficacy and efficiency.

Fig. 5. Evaluation of constraint sizes on the image annotation performance: (a) AP@N of the SPBoW method under different constraints; (b) AR@N of the SPBoW method under different constraints.

F. Experiment III: Evaluation of Different Codebook Sizes

In this section, we evaluate the influence of the codebook size on the final annotation performance. We gradually increase the codebook size from 2,500 to 4,500, and evaluate the average precision and recall under each setting. Fig. 6 shows the experimental results.

The results show that the codebook size does affect the annotation performance. In particular, we observe that the performance first improves as the codebook size increases from 2,500 to 3,000, but degrades when the size grows beyond 3,000. From the empirical results, the best codebook size is around 3,000 for this dataset.

Fig. 6. Evaluation of varied codebook sizes (C) on image annotation.
G. Experiment IV: Object Codebook vs. General Codebook

Our SPC solution is in general an object-based codebook, which we denote as the "object codebook". Unlike the regular BoW, which uses a general codebook without considering specific objects, our object codebook enjoys a number of advantages, such as high efficiency and scalability. In addition, similar to the regular BoW, we can also generate a "general codebook" by applying the same metric learning in SPBoW. This experiment compares the performance of the object codebook and the general codebook.

We implement two kinds of SPC. One is an object codebook as in the previous experiments, and the other is a global SPC similar to the regular BoW except for the use of the optimized distance metric. We also include the regular BoW codebook in the comparison. Fig. 7 summarizes the comparison results. We first observe that both SPC approaches perform considerably better than the regular BoW codebook.

Fig. 7. Comparison between general codebook and object codebook: (a) Average Precision; (b) Average Recall.

Further, comparing object and global codebooks, we find that the object codebooks consistently surpass their corresponding global codebooks across all of the top annotation results. These results again validate the effectiveness of the SPBoW technique.

Remark. We briefly explain why the object codebook outperforms the global codebook. Firstly, the visual words of the object codebook are obtained by clustering features related to the same semantic concept, so each corresponds to a single semantic meaning; in the global codebook, visual words are obtained by clustering features from various semantic concepts, so each visual word may relate to multiple semantic meanings. The object codebook is thus less likely to cause semantic loss. Secondly, the object codebook can be more robust than the global codebook: in the object codebook, only the semantics-related features are engaged in clustering, whereas in the global codebook all features, including background features, are engaged, which makes it more likely to suffer from noisy background features.

H. Experiment V: Fixed vs. Varied Codebook Sizes

One key step in generating our SPC is the codebook size assignment, which decides how many visual words (codes) should be assigned to each object. In our approach, we proposed a varied codebook size assignment based on information theory. This experiment examines whether the proposed varied code size approach is better than a simple fixed codebook size approach that assigns a uniform number of visual words to every object. In this experiment, we fix the total codebook size to 2,500. Fig. 8 shows the experimental results for average precision and recall.

Fig. 8. Comparison between fixed codebook and varied codebook schemes.

The results show that the varied codebook approach outperforms the fixed codebook approach by around 22% on average in terms of both average precision and recall.

I. Experiment VI: Application to Object Recognition

In this experiment, we apply the proposed SPBoW to the PASCAL VOC2006 challenge for object recognition, to further compare its performance with other algorithms. Note that there are some differences between the VOC2006 dataset and the Labelme dataset. Unlike the manually well-segmented objects in the Labelme dataset, objects in the VOC2006 data are marked only with a rough bounding box. Also, the number of object categories is only 10 for the VOC2006 data, much smaller than in the Labelme dataset. Since these two datasets have different data distributions and different numbers of categories, using both of them allows us to examine the robustness of our techniques.

We employ the discriminative model for object detection. Specifically, we use all the VOC2006 training data to learn the codebook and to train a set of binary SVM classifiers, each used to detect one object category. We then test the performance on the VOC2006 test dataset, and compare against the existing BoW model as well as some state-of-the-art object recognition methods, namely AP06-Lee (Lee et al.) [10], QMUL-LSPCH (Zhang et al.) [10], and XRCE (Perronnin et al.) [29]. We set the codebook size to 500 in the codebook learning process, since there are only 10 categories, and we use the default SVM settings (C = 1) with an RBF kernel of γ = 0.07. The detection performance is measured by the area under the ROC curve (AUC).

Table I summarizes the AUC results. First, we find that most of the proposed DML-based approaches significantly outperform the regular BoW. Second, among the different DML approaches, the proposed SPBoW yields the best average performance. Finally, compared to the other state-of-the-art approaches, SPBoW also performs best in most cases.

TABLE I: AUC results on the VOC2006 dataset.

Categories   BOW    AP06-Lee  QMUL-LSPCH  XRCE   RCA    ITML   LMNN   NCA    SPBoW
bicycle      56.91  79.10     94.80       94.30  93.45  96.98  94.12  95.34  99.89
bus          56.61  63.70     98.10       97.80  97.57  98.17  97.79  95.98  97.15
car          60.31  83.30     97.50       96.70  94.42  93.17  93.13  93.13  94.54
cat          61.08  73.30     93.70       93.30  92.19  94.15  92.97  93.32  93.33
cow          68.53  75.60     93.80       94.00  93.91  92.18  92.77  92.75  94.18
dog          73.22  64.40     87.60       86.60  87.77  92.11  90.06  89.97  94.42
horse        28.83  60.70     92.60       92.50  93.22  95.58  96.18  93.85  95.18
motorbike    36.01  67.20     96.90       95.70  92.19  94.37  94.75  94.19  96.97
person       60.78  55.00     85.50       86.30  92.18  93.33  94.18  91.31  92.68
sheep        60.74  79.20     95.60       95.10  97.19  97.15  92.39  95.67  97.44
Average      56.30  70.15     93.61       93.23  93.41  94.72  93.83  93.55  95.58

J. Experiment VII: Evaluation of Computational Cost

This experiment evaluates the time cost.
of the data are used to generate the codebook and 1 fold is suffer from the semantic loss in the codebook generation pro-
used for object annotation, we thus focus on measuring the cess, our new technique overcomes this drawback by learning
computational time on codebook generation by the methods. an effective distance metric that aims to bridge the semantic
We omit the results of the annotation time cost since they are gap between low-level features and high-level semantics. We
almost similar for all the compared methods. propose a novel measurement of semantic gap and then try
to minimize the gap via distance metric learning. In addition
Method BoW RCA ITML LMNN NCA SPBoW to the new efficient algorithm for solving the challenging
Time Cost (s) 121 3 96 1759 457 8 distance metric learning task, we also propose the object
based codebook generation scheme, which not only improves
TABLE II
T IME EVALUATION OF CODEBOOK GENERATION BY DIFFERENT METHODS . the efficacy, but also significantly reduces the computational
cost. Extensive experiments have been done on both image
annotation and object categorization applications, in which
Table II shows the average computational time for generating the codebook, which consists of the time costs of both metric learning and k-means clustering. 50,000 random features are used to generate 2,500 visual words by the k-means algorithm. There are two kinds of codebook generation schemes: the global codebook and the object codebook. The global codebook scheme uses k-means to cluster all 50,000 features into 2,500 clusters. The object codebook scheme generates clusters within each category and then combines them into a codebook of size 2,500; for each category, we thus only need to generate around 2,500/|C| clusters from around 50,000/|C| features, where |C| = 495 is the number of categories. The global codebook scheme requires computing distances 2,500 × 50,000 ∼ O(10^8) times per iteration, whereas the proposed object codebook scheme needs only 2,500/|C| × 50,000/|C| × |C| ∼ O(10^5) distance computations per iteration. BoW adopts the global codebook, while the other methods employ the object codebook.
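The two schemes can be contrasted with a short sketch (using scikit-learn's KMeans as the clustering routine; the function names are our own, and the comments restate the per-iteration counts from the text).

import numpy as np
from sklearn.cluster import KMeans

def global_codebook(features, n_words=2500):
    # One k-means over all features: every iteration compares each of
    # the 50,000 features with each of the 2,500 centroids, i.e. about
    # 2,500 x 50,000 ~ O(10^8) distance computations per iteration.
    return KMeans(n_clusters=n_words, n_init=1).fit(features).cluster_centers_

def object_codebook(features_by_category, n_words=2500):
    # Per-category k-means, then concatenation: with |C| = 495
    # categories, each one clusters ~50,000/|C| features into
    # ~2,500/|C| words, so one iteration costs roughly
    # (2,500/|C|) x (50,000/|C|) x |C| ~ O(10^5) distance computations.
    per_cat = max(1, n_words // len(features_by_category))
    books = [KMeans(n_clusters=per_cat, n_init=1).fit(f).cluster_centers_
             for f in features_by_category]
    return np.vstack(books)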
From the results, we found that BoW takes even more time than some of the SPBoW models due to the limitation of the global codebook, even though it incurs no cost for metric learning. This again shows that the object codebook is not only more effective but also more efficient than the regular BoW method. Finally, by comparing the time costs of the different DML techniques, we can see that our algorithm is comparable to the simple RCA method and is significantly more efficient than the other state-of-the-art metric learning techniques, which are usually computationally intensive.
VII. CONCLUSION

This paper proposed a novel framework of Semantics-Preserving Bag-of-Words (SPBoW) for object representation. Unlike conventional Bag-of-Words (BoW) models, which usually suffer from semantic loss in the codebook generation process, our new technique overcomes this drawback by learning an effective distance metric that aims to bridge the semantic gap between low-level features and high-level semantics. We propose a novel measurement of the semantic gap and then minimize the gap via distance metric learning. In addition to the new efficient algorithm for solving the challenging distance metric learning task, we also propose the object-based codebook generation scheme, which not only improves the efficacy but also significantly reduces the computational cost. Extensive experiments have been conducted on both image annotation and object categorization applications, in which encouraging results show that the new SPBoW technique is effective and promising for object representation in a wide range of multimedia applications. Future work will study more advanced approaches to improving the estimation of the distributions used in the measurement of visual complexity, and will investigate other distance metric learning techniques for further improving the performance.

ACKNOWLEDGEMENTS

The work was supported by the Singapore NRF Interactive Digital Media R&D Program under research grant NRF2008IDM-IDM004-006, MOE Tier-1 Research Grant RG67/07, and the National High Technology Research and Development Program of China (863) (No. 2008AA01Z117).
Lei Wu received the Bachelor degree in the Special Class for Gifted Young (SCGY) in 2005 from the University of Science and Technology of China (USTC), from which he is now pursuing his Ph.D. degree in Electronic Engineering and Information Science. From 2006 to 2008, he was a research intern at Microsoft Research Asia, working on image annotation and tagging. From 2008 to 2009, he was visiting Nanyang Technological University, working on distance metric learning and multimedia retrieval. His research interests include machine learning, multimedia retrieval, and computer vision. He received a Microsoft Fellowship in 2007.

Steven C.H. Hoi is currently an Assistant Professor in the School of Computer Engineering of Nanyang Technological University, Singapore. He received his Bachelor degree in Computer Science from Tsinghua University, Beijing, P.R. China, and his Master and Ph.D. degrees in Computer Science and Engineering from the Chinese University of Hong Kong. His research interests include statistical machine learning, multimedia information retrieval, Web search, and data mining. He is a member of the IEEE and the ACM.

Nenghai Yu is currently a Professor in the Department of Electronic Engineering and Information Science of the University of Science and Technology of China (USTC). He is the Executive Director of the MOE-Microsoft Key Laboratory of Multimedia Computing and Communication, and the Director of the Information Processing Center at USTC. He graduated from Tsinghua University, Beijing, China, obtained his M.Sc. degree in Electronic Engineering in 1992, and then joined USTC, where he has worked ever since. He received his Ph.D. degree in Information and Communications Engineering from USTC, Hefei, China, in 2004. His research interests are in the fields of multimedia information retrieval, digital media analysis and representation, media authentication, video surveillance, and communications. He has been responsible for many national research projects. Based on his contributions, Professor Yu and his research group won the Excellent Person Award and the Excellent Collectivity Award simultaneously from the National Hi-tech Development Project of China in 2004. He has contributed more than 80 papers to journals and international conferences.