Online Multi-Modal Distance Metric Learning With Application To Image Retrieval
2-2016
Steven C. H. HOI
Singapore Management University, [email protected]
Peilin ZHAO
Chunyan MIAO
Zhi-Yong LIU
Citation
WU, Pengcheng; HOI, Steven C. H.; ZHAO, Peilin; MIAO, Chunyan; and LIU, Zhi-Yong. Online multi-modal distance metric
learning with application to image retrieval. (2016). IEEE Transactions on Knowledge and Data Engineering. 28, (2), 454-467. Research
Collection School Of Information Systems.
Available at: http://ink.library.smu.edu.sg/sis_research/2924
Abstract—Distance metric learning (DML) is an important technique to improve similarity search in content-based image retrieval.
Despite being studied extensively, most existing DML approaches typically adopt a single-modal learning framework that learns the
distance metric on either a single feature type or a combined feature space where multiple types of features are simply concatenated.
Such single-modal DML methods suffer from some critical limitations: (i) some types of features may significantly dominate the others
in the DML task due to diverse feature representations; and (ii) learning a distance metric on the combined high-dimensional feature
space can be extremely time-consuming using the naive feature concatenation approach. To address these limitations, in this paper, we
investigate a novel scheme of online multi-modal distance metric learning (OMDML), which explores a unified two-level online learning
scheme: (i) it learns to optimize a distance metric on each individual feature space; and (ii) then it learns to find the optimal combination
of diverse types of features. To further reduce the expensive cost of DML on high-dimensional feature space, we propose a low-rank
OMDML algorithm which not only significantly reduces the computational cost but also retains highly competitive or even better learning
accuracy. We conduct extensive experiments to evaluate the performance of the proposed algorithms for multi-modal image retrieval,
in which encouraging results validate the effectiveness of the proposed technique.
Index Terms—content-based image retrieval, multi-modal retrieval, distance metric learning, online learning
1 INTRODUCTION

One of the core research problems in multimedia retrieval is to seek an effective distance metric/function for computing the similarity of two objects in content-based multimedia retrieval tasks [1], [2], [3]. Over the past decades, multimedia researchers have spent much effort on designing a variety of low-level feature representations and different distance measures [4], [5], [6]. Finding a good distance metric/function remains an open challenge for content-based multimedia retrieval tasks. In recent years, one promising direction to address this challenge is to explore distance metric learning (DML) [7], [8], [9], which applies machine learning techniques to optimize distance metrics from training data or side information, such as historical logs of user relevance feedback in content-based image retrieval (CBIR) systems [6], [7].

Although various DML algorithms have been proposed in the literature [7], [10], [11], [12], [13], most existing DML methods belong to single-modal DML in that they learn a distance metric either on a single type of feature or on a combined feature space formed by simply concatenating multiple types of diverse features. In real-world applications, such approaches may suffer from two practical limitations: (i) some types of features may significantly dominate the others in the DML task, weakening the ability to exploit the potential of all features; and (ii) the naive concatenation approach may result in a combined high-dimensional feature space, making the subsequent DML task computationally intensive.

To overcome the above limitations, this paper investigates a novel framework of Online Multi-modal Distance Metric Learning (OMDML), which learns distance metrics from multi-modal data, i.e., multiple types of features, via an efficient and scalable online learning scheme. Unlike the concatenation approach above, the key ideas of OMDML are twofold: (i) it learns to optimize a separate distance metric for each individual modality (i.e., each type of feature space), and (ii) it learns to find an optimal combination of the diverse distance metrics on multiple modalities. Moreover, OMDML takes advantage of online learning techniques for high efficiency and scalability towards large-scale learning tasks. To further reduce the computational cost, we also propose a Low-rank Online Multi-modal DML (LOMDML) algorithm, which avoids the need for intensive positive semi-definite (PSD) projections and thus saves a significant amount of computational cost for DML on high-dimensional data. In summary, the major contributions of this paper include:

• We present a novel framework of Online Multi-modal Distance Metric Learning (OMDML), which simultaneously learns optimal metrics on each individual modality and the optimal combination of the metrics from multiple modalities via efficient and scalable online learning;
• We further propose a low-rank OMDML algorithm which significantly reduces computational costs for high-dimensional data without requiring PSD projection;
• We offer theoretical analysis of the OMDML method;
• We conduct an extensive set of experiments to evaluate the performance of the proposed techniques for CBIR tasks using multiple types of features.

Corresponding author: Steven C.H. HOI. Steven C.H. HOI and Pengcheng WU are with the School of Information Systems, Singapore Management University, Singapore 178902. E-mail: {chhoi,pcwu}@smu.edu.sg. Peilin ZHAO is with the Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore 138632. E-mail: [email protected]. Chunyan MIAO is with the School of Computer Engineering, Nanyang Technological University, Singapore 639798. E-mail: [email protected]. Zhi-Yong LIU is with the State Key Lab of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. E-mail: [email protected].
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 first gives the problem formulation, then presents our method of online multi-modal metric learning, followed by an improved low-rank algorithm. Section 4 provides theoretical analysis of the proposed algorithms, Section 5 discusses our experimental results, and finally Section 6 concludes this work.

2 RELATED WORK

Our work is related to three major groups of research: content-based image retrieval, distance metric learning, and online learning. In the following, we briefly review the closely related representative works in each group.

2.1 Content-based Image Retrieval

With the rapid growth of digital cameras and photo-sharing websites, image retrieval has become one of the most important research topics in the past decades, among which content-based image retrieval is one of the key challenging problems [1], [2], [3]. The objective of CBIR is to search images by analyzing the actual contents of the image, as opposed to metadata such as keywords, title, and author; accordingly, extensive efforts have been devoted to investigating various low-level feature descriptors for image representation [14]. For example, researchers have spent many years studying various global features for image representation, such as color features [14], edge features [14], and texture features [15]. Recent years have also witnessed a surge of research on local feature based representation, such as the bag-of-words models [16], [17] using local feature descriptors (e.g., SIFT [18]).

Conventional CBIR approaches usually choose rigid distance functions on the extracted low-level features for multimedia similarity search, such as the classical Euclidean distance or cosine similarity. However, one key limitation is that a fixed rigid similarity/distance function may not always be optimal, because of the complexity of visual image representation and the semantic gap between the low-level visual features extracted by computers and high-level human perception and interpretation. Hence, recent years have witnessed a surge of active research on designing distance/similarity measures over low-level features by exploiting machine learning techniques [19], [20], [21]; some of these works focus on learning to hash for compact codes [22], [19], [23], [24], [25], while others fall under distance metric learning, which will be introduced in the next subsection. Our work is also related to multimodal/multiview studies, which have been widely explored in image classification and object recognition [26], [27], [28], [29]. However, it is usually hard to apply these techniques directly to CBIR because (i) image classes are generally not given explicitly in CBIR tasks; (ii) even if classes are given, their number can be very large; and (iii) image datasets tend to be much larger in CBIR than in classification tasks. We thus exclude direct comparisons to such existing works in this paper. There are still some other open issues in CBIR studies, such as the efficiency and scalability of the retrieval process, which often requires an effective indexing scheme; these are out of this paper's scope.

2.2 Distance Metric Learning

Distance metric learning has been extensively studied in both the machine learning and multimedia retrieval communities [30], [7], [31], [32], [33], [34], [35], [36]. The essential idea is to learn an optimal metric which minimizes the distance between similar/related images and simultaneously maximizes the distance between dissimilar/unrelated images. Existing DML studies can be grouped into different categories according to different learning settings and principles. For example, in terms of constraint settings, DML techniques are typically categorized into two groups:

• Global supervised approaches [30], [7]: learn a metric in a global setting, i.e., all constraints are to be satisfied simultaneously;
• Local supervised approaches [32], [33]: learn a metric in a local sense, i.e., only the given local constraints from neighboring information are to be satisfied.

Moreover, according to the form of the training data, DML studies in machine learning typically learn metrics directly from explicit class labels [32], while DML studies in multimedia mainly learn metrics from side information, which is usually given in one of the following two forms:

• Pairwise constraints [7], [9]: a must-link constraint set S and a cannot-link constraint set D are given, where a pair of images (p_i, p_j) ∈ S if p_i is related/similar to p_j, and (p_i, p_j) ∈ D otherwise. Some literature uses the term equivalent/positive constraint in place of "must-link", and the term inequivalent/negative constraint in place of "cannot-link".
• Triplet constraints [20]: a triplet set P is given, where P = {(p_t, p_t^+, p_t^-) | (p_t, p_t^+) ∈ S; (p_t, p_t^-) ∈ D, t = 1, ..., T}, S contains related pairs and D contains unrelated pairs, i.e., p_t is related/similar to p_t^+ and p_t is unrelated/dissimilar to p_t^-. T denotes the cardinality of the entire triplet set.

When only explicit class labels are provided, one can also construct side information by simply treating relationships between instances of the same class as related, and relationships between instances of different classes as unrelated. In this work, we focus on triplet constraints.

Finally, in terms of learning methodology, most existing DML studies employ batch learning methods, which often assume the whole collection of training data is given before the learning task and train a model from scratch, except for a few recent DML studies which begin to explore online learning techniques [37], [38]. All these works generally address single-modal DML, which is different from our focus on multi-modal DML. We also note that our work is very different from the existing multiview DML study [26], which is concerned with regular classification tasks by learning a metric on training data with explicit class labels, making a direct comparison with our method difficult. We further note that our work is different from another multimodal learning study [39], which addresses a very different problem of search-based face
[Figure 1: a query image is submitted; features on modalities 1, ..., m are extracted from the triplet training data and the image database; the per-modality features feed a multi-modal similarity function used for ranking; the top-ranked images are returned.]

Fig. 1. Overview of the proposed multi-modal distance metric learning scheme for multi-modal retrieval in CBIR
annotation, where their multimodal learning is formulated as a batch learning task optimizing a specific loss function tailored to search-based face annotation from weakly labeled data. Finally, we note that our work is also different from some existing distance learning studies that learn nonlinear distance functions using kernel or deep learning methods [21], [40], [35]. In comparison to linear distance metric learning methods, kernel methods may achieve better learning accuracy in some scenarios, but fall short in scalability for large-scale applications due to the curse of kernelization, i.e., the learning cost increases dramatically when the number of training instances increases. Thus, our empirical study focuses on direct comparisons to the family of linear methods.

2.3 Online Learning

Our work generally falls in the category of online learning methodology, which has been extensively studied in machine learning [41], [42]. Unlike batch learning methods, which usually suffer from expensive re-training costs when new training data arrive, online learning sequentially makes a highly efficient (typically constant-time) update for each new training instance, making it highly scalable for large-scale applications. In general, online learning operates on a sequence of data instances with time stamps. At each time step, an online learning algorithm processes an incoming example by first predicting its class label; after the prediction, it receives the true class label, which is then used to measure the loss suffered between the predicted label and the true label; at the end of each time step, the model is updated with the loss whenever it is nonzero. The overall objective of an online learning task is to minimize the cumulative loss over the entire sequence of received instances. In the literature, a variety of algorithms have been proposed for online learning [43], [44], [45], [46], [47]. Some well-known examples include the Hedge algorithm for online prediction with expert advice [48], the Perceptron algorithm [43], the family of Passive-Aggressive (PA) learning algorithms [44], and the online gradient descent algorithms [49]. There are also studies that attempt to improve the scalability of online kernel methods, such as [50], which proposes a bounded online gradient descent algorithm for online kernel-based classification tasks. In this work, we apply online learning techniques, i.e., the Hedge, PA, and online gradient descent algorithms, to tackle the multi-modal distance metric learning task for content-based image retrieval. Besides, we note that this work was partially inspired by the recent study of online multiple kernel learning, which aims to address online classification tasks using multiple kernels [51]. In the following, we give a brief overview of several popular online learning algorithms.

2.3.1 Hedge Algorithms

The Hedge algorithm [48], [52] aims to dynamically combine multiple strategies in an optimal way, i.e., to make the final cumulative loss asymptotically approach that of the best strategy. Its key idea is to maintain a dynamic weight distribution over the set of strategies. During the online learning process, the distribution is updated according to the performance of those strategies. Specifically, the weight of every strategy is decreased exponentially with
respect to its suffered loss, making the overall strategy approach the best strategy.

2.3.2 Passive-Aggressive Learning

As a classical and well-known online learning technique, the Perceptron algorithm [43] simply updates the model by adding an incoming instance with a constant weight whenever it is misclassified. Recent years have witnessed a variety of algorithms proposed to improve the Perceptron [53], [44], which usually follow the principle of maximum margin learning in order to maximize the margin of the classifier. Among them, one of the most notable approaches is the family of Passive-Aggressive (PA) learning algorithms [44], which updates the model whenever the classifier fails to produce a large margin on the incoming instance. In particular, online PA learning is formulated to trade off minimizing the distance between the target classifier and the previous classifier against minimizing the loss suffered by the target classifier on the current instance. The PA algorithms enjoy good efficiency and scalability due to their simple closed-form solutions. Finally, both theoretical analysis and most empirical studies demonstrate the advantages of the PA algorithms over the classical Perceptron algorithm.

2.3.3 Online Gradient Descent

Besides the Perceptron and PA methods, another well-known online learning method is the family of Online Gradient Descent (OGD) algorithms, which applies online convex optimization techniques to optimize a particular objective function of an online learning task [49]. It enjoys a solid theoretical foundation in online convex optimization, and thus works effectively in empirical applications. When training data is abundant and computing resources are comparatively scarce, some existing studies have shown that a properly designed OGD algorithm can asymptotically approach or even outperform the respective batch learning algorithm [54].

3 ONLINE MULTI-MODAL DISTANCE METRIC LEARNING

3.1 Overview

In the literature, many techniques have been proposed to improve the performance of CBIR. Some existing studies have investigated novel low-level feature descriptors in order to better represent the visual content of images, while others have focused on designing or learning effective distance/similarity measures based on some extracted low-level features. In practice, it is hard to find a single best low-level feature representation that consistently beats the others in all scenarios. Thus, it is highly desirable to explore machine learning techniques to automatically combine multiple types of diverse features and their respective distance measures. We refer to this open research problem as a multi-modal distance metric learning task, and present two new algorithms to solve it in this section. Figure 1 illustrates the system flow of the proposed multi-modal distance metric learning scheme for content-based image retrieval, which consists of two phases: a learning phase and a retrieval phase. The goal is to learn the distance metrics in the learning phase in order to facilitate the image ranking task in the retrieval phase. We note that these two phases may operate concurrently in practice, where the learning phase may never stop, learning from an endless stream of training data.

During the learning phase, we assume triplet training instances arrive sequentially, which is natural for a real-world CBIR system. For example, in online relevance feedback, a user is often asked to provide feedback indicating whether a retrieved image is related or unrelated to a query; as a result, users' relevance feedback log data can be collected to generate training data in a sequential manner for the learning task [55]. Once a triplet of images is received, we extract low-level feature descriptors on multiple modalities from these images. After that, every distance function on a single modality can be updated by exploiting the corresponding features and label information. Simultaneously, we also learn the optimal combination of the different modalities to obtain the final optimal distance function, which is applied to rank images in the retrieval phase.

During the retrieval phase, when the CBIR system receives a query from a user, it first applies the same approach to extract low-level feature descriptors on multiple modalities, then employs the learned optimal distance function to rank the images in the database, and finally presents the user with the list of corresponding top-ranked images. In the following, we first give the notation used throughout the rest of this paper, then formulate the problem of multi-modal distance metric learning, followed by online algorithms to solve it.

3.2 Notation

We use a bold upper-case letter to denote a matrix, e.g., M ∈ R^{n×n}, and a bold lower-case letter to denote a vector, e.g., p ∈ R^n. We adopt I to denote an identity matrix. Formally, we define the following terms and operators:

• m: the number of modalities (types of features).
• n_i: the dimensionality of the i-th visual feature space (modality).
• p^{(i)}: the i-th type of visual feature (modality) of the corresponding image, p^{(i)} ∈ R^{n_i}.
• M^{(i)}: the optimal distance metric on the i-th modality, where M^{(i)} ∈ R^{n_i×n_i}.
• W^{(i)}: a linear transformation matrix decomposing M^{(i)}, such that M^{(i)} = W^{(i)⊤} W^{(i)}, W^{(i)} ∈ R^{r_i×n_i}, where r_i is the dimensionality of the projected feature space.
• S: a positive constraint set, where a pair (p_i, p_j) ∈ S if and only if p_i is related/similar to p_j.
• D: a negative constraint set, where a pair (p_i, p_j) ∈ D if and only if p_i is unrelated/dissimilar to p_j.
• P: a triplet set, where P = {(p_t, p_t^+, p_t^-) | (p_t, p_t^+) ∈ S; (p_t, p_t^-) ∈ D, t = 1, ..., T}, where T denotes the cardinality of the entire triplet set.
• d_i(p_1, p_2): the distance function of two images p_1 and p_2 on the i-th type of visual feature (modality).

When only one modality is considered, we omit the superscript (i) or subscript i in the above terms.
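To make the notation concrete, the following minimal sketch (not from the paper; all names are hypothetical) computes the per-modality squared Mahalanobis distance d_i(p_1, p_2) = (p_1 - p_2)^T M^{(i)} (p_1 - p_2) and a weighted multi-modal combination of the per-modality distances:

```python
import numpy as np

def modal_distance(M, p1, p2):
    """Squared Mahalanobis distance (p1 - p2)^T M (p1 - p2) on one modality."""
    diff = p1 - p2
    return float(diff @ M @ diff)

def combined_distance(metrics, thetas, feats1, feats2):
    """Weighted combination of per-modality distances.

    metrics: list of (n_i x n_i) PSD matrices, one per modality.
    thetas:  combination weights (summing to 1).
    feats1, feats2: per-modality feature vectors of the two images.
    """
    return sum(theta * modal_distance(M, f1, f2)
               for theta, M, f1, f2 in zip(thetas, metrics, feats1, feats2))

# Two modalities of different dimensionality, identity metrics:
f1 = [np.array([1.0, 0.0]), np.array([0.0, 1.0, 0.0])]
f2 = [np.array([0.0, 0.0]), np.array([0.0, 0.0, 0.0])]
metrics = [np.eye(2), np.eye(3)]
thetas = [0.5, 0.5]
print(combined_distance(metrics, thetas, f1, f2))  # 0.5*1 + 0.5*1 = 1.0
```

Note that each modality keeps its own dimensionality n_i; only the scalar distances are combined, which is what lets the scheme sidestep the concatenated high-dimensional space.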
to learn the optimal combination weights. We discuss each of the two learning tasks in detail below.

Let us denote by M_t^{(i)} the metric matrix on the i-th modality at step t. To learn the optimal metric M_t^{(i)} on an individual modality, following ideas similar to OASIS [20] and PA [44], we can formulate the optimization task of the online distance metric learning as follows:

M_{t+1}^{(i)} = \arg\min_{M} \frac{1}{2}\|M - M_t^{(i)}\|_F^2 + C\xi,  (4)
s.t. \ell((p_t, p_t^+, p_t^-); d_i) \le \xi, \; \xi \ge 0.

It is not difficult to derive the closed-form solution:

M_{t+1}^{(i)} = M_t^{(i)} - \tau_t^{(i)} V_t^{(i)},  (5)

where \tau_t^{(i)} and V_t^{(i)} are computed as follows:

\tau_t^{(i)} = \min\left(C, \; \ell((p_t, p_t^+, p_t^-); d_i) / \|V_t^{(i)}\|_F^2\right),
V_t^{(i)} = (p_t - p_t^+)(p_t - p_t^+)^\top - (p_t - p_t^-)(p_t - p_t^-)^\top.

In the above, we omit the superscript (i) for each p_t.

One main issue of the above solution, as in OASIS [20], is that it does not guarantee the resulting matrix M_{t+1}^{(i)} is positive semi-definite (PSD), which is not desirable for DML. To fix this issue, at the end of each learning iteration, we need to perform a PSD projection of the matrix onto the PSD domain:

M_{t+1}^{(i)} \leftarrow PSD(M_{t+1}^{(i)}).

The other key task of multi-modal DML is to learn the optimal combination weights θ = (θ^{(1)}, ..., θ^{(m)}), where each θ^{(i)} is set to 1/m at the beginning of the learning task. We apply the well-known Hedge algorithm [48], a simple and effective algorithm for online learning with expert advice, to update the combination weights online. In particular, given a triplet training instance (p_t, p_t^+, p_t^-), at the end of each online learning iteration, the weights are updated as follows:

\theta_{t+1}^{(i)} = \frac{\theta_t^{(i)} \beta^{z_t^{(i)}}}{\sum_{i=1}^m \theta_t^{(i)} \beta^{z_t^{(i)}}},  (6)

where β ∈ (0, 1) is a discounting parameter to penalize a poor modality, and z_t^{(i)} is an indicator of the ranking result on the current instance, i.e., z_t^{(i)} = I(f_t^{(i)} > 0) = I(d_i(p_t, p_t^+) - d_i(p_t, p_t^-) > 0), which outputs 1 when f_t^{(i)} = d_i(p_t, p_t^+) - d_i(p_t, p_t^-) > 0 and 0 otherwise. In particular, f_t^{(i)} > 0, namely d_i(p_t, p_t^+) > d_i(p_t, p_t^-), indicates that the current i-th metric makes a mistake in predicting the ranking of the triplet (p_t, p_t^+, p_t^-).

Finally, Algorithm 1 summarizes the details of the proposed Online Multi-modal Distance Metric Learning (OMDML) algorithm.

Remark on space and time complexity. The space complexity of the algorithm is O(\sum_{i=1}^m n_i^2). Denoting n = max(n_1, ..., n_m), the worst-case space complexity is simply O(m × n^2). The overall time complexity is linear with respect to T, the total number of training triplets. The most computationally intensive step is the PSD projection, which can cost O(n^3) for a dense matrix. Hence, the worst-case overall time complexity is O(T × m × n^3).

3.5 Low-Rank Online Multi-modal Distance Metric Learning Algorithm

One critical drawback of the proposed OMDML algorithm in Algorithm 1 is the PSD projection step, which can be computationally intensive when some feature space is of high dimensionality. In this section, we present a low-rank learning algorithm that significantly improves the efficiency and scalability of OMDML.

Instead of learning a full-rank matrix, for each M^{(i)}, our goal is to learn a low-rank decomposition, i.e.,

M^{(i)} := W^{(i)\top} W^{(i)},

where W^{(i)} ∈ R^{r_i×n_i} and r_i ≪ n_i. Thus, for any two images p_1 and p_2, the distance function on the i-th modality can be expressed as:

d_i(p_1, p_2) = (p_1 - p_2)^\top W^{(i)\top} W^{(i)} (p_1 - p_2).

Following a similar idea to the previous section, we can apply online learning techniques to solve for W_t^{(i)} and θ_t, respectively. In this section, we consider the Online Gradient Descent (OGD) approach to solve for W_t^{(i)}. In particular, we denote by

\ell_t^{(i)} = \ell((p_t, p_t^+, p_t^-); d_i) = \max(0, \; d_i(p_t, p_t^+) - d_i(p_t, p_t^-) + 1),

and introduce the following notation:

q_t = W_t^{(i)} p_t, \quad q_t^+ = W_t^{(i)} p_t^+, \quad q_t^- = W_t^{(i)} p_t^-.

We can then compute the gradient of \ell_t^{(i)} with respect to W^{(i)} (when the loss is nonzero):

\nabla_t W^{(i)} = \frac{\partial \ell_t}{\partial W^{(i)}}
= \sum_{j=1}^{r_i} \left( \frac{\partial \ell_t}{\partial q_{j,t}} \frac{\partial q_{j,t}}{\partial W^{(i)}} + \frac{\partial \ell_t}{\partial q_{j,t}^+} \frac{\partial q_{j,t}^+}{\partial W^{(i)}} + \frac{\partial \ell_t}{\partial q_{j,t}^-} \frac{\partial q_{j,t}^-}{\partial W^{(i)}} \right) \Bigg|_{W^{(i)} = W_t^{(i)}}
= 2(-q_t^+ + q_t^-) p_t^\top + 2(-q_t + q_t^+) p_t^{+\top} + 2(q_t - q_t^-) p_t^{-\top},

where q_{j,t} is the j-th entry of q_t.

We then follow the idea of Online Gradient Descent [49] to update W_{t+1}^{(i)} for each modality as follows:

W_{t+1}^{(i)} \leftarrow W_t^{(i)} - \eta \nabla_t W^{(i)},  (7)

where η is a learning rate parameter.

Similarly, we also apply the Hedge algorithm introduced in the previous section to update the combination weight θ_t. Finally, Algorithm 2 summarizes the details of the proposed Low-rank Online Multi-modal Metric Learning (LOMDML) algorithm.

Clearly, this algorithm naturally preserves the PSD property of the resulting distance metric M^{(i)} = W^{(i)\top} W^{(i)} and thus avoids the need to perform the intensive PSD projection. Assuming r_1 = ... = r_m = r and n = max(n_1, ..., n_m), the overall time complexity of the algorithm is O(T × m × r × n).
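A minimal sketch of the low-rank per-modality update above (hypothetical names, not the authors' code): compute the hinge loss on a triplet, form the gradient 2(q^- - q^+)p^T + 2(q^+ - q)p^{+T} + 2(q - q^-)p^{-T} from Eq. (7), and take a gradient step.

```python
import numpy as np

def lowrank_update(W, p, p_pos, p_neg, eta=0.1):
    """One OGD step on W (r x n) for a triplet (p, p_pos, p_neg).

    Distance: d(a, b) = ||W(a - b)||^2; hinge loss
    l = max(0, d(p, p_pos) - d(p, p_neg) + 1).
    """
    q, qp, qn = W @ p, W @ p_pos, W @ p_neg
    loss = max(0.0, np.dot(q - qp, q - qp) - np.dot(q - qn, q - qn) + 1.0)
    if loss > 0:
        # gradient of the active hinge loss w.r.t. W
        grad = 2 * (np.outer(qn - qp, p)
                    + np.outer(qp - q, p_pos)
                    + np.outer(q - qn, p_neg))
        W = W - eta * grad
    return W, loss
```

Because the metric is only ever represented through W, the induced matrix W^T W stays PSD by construction, so no projection step is needed after the update.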
7
√
Algorithm 2 LOMDML—Low-rank OMDML algorithm By choosing β = √ T
√ , we then have
T + ln m
1: INPUT: r !
ln m √
• Discount weight parameter: β ∈ (0, 1) M ≤ 2 1+ min F (M(i) , ℓ, P) + ln m + T ln m
T 1≤i≤m
• Margin parameter: γ > 0
• Learning rate parameter: η > 0
(i) (i)
2: Initialization: θ1 = 1/m, Wt , ∀i = 1, . . . , m In general, it is not difficult to prove the above theorem
3: for t = 1, 2, . . . , T do by combining the results of the Hedge algorithm and the PA
4: Receive: (pt , p+ −
t , pt ) online learning, similar to the technique used in [51]. More
(i)
5: Compute: ft = di (pt , p+ )−d (p , p− ), i = 1, . . . , m details about the proof can be found in the online supplemental
Pm (i)t (i) i t t
6: Compute: ft = i=1 θt ft file 1 . Basically the above theorem indicates that the total
7: if ft + γ > 0 then number
√ of mistakes of the proposed algorithm is bounded by
8: for i = 1, 2, . . . , m do O( T ) compared with the optimal single metric.
(i) (i)
9: Set zt = I(ft > 0)
(i) (i) (i)
10: Update θt+1 ← θt β zt 5 E XPERIMENTS
(i) (i)
11: Wt+1 ← Wt − η∇t W(i) by Eq. (7) In this section, we conduct an extensive set of experiments to
12: end for P evaluate the efficacy of the proposed algorithms for similarity
m (i)
13: Θt+1 = i=1 θt+1 search with multiple types of visual features in CBIR.
(i) (i)
14: θt+1 ← θt+1 /Θt+1 , i = 1, . . . , m
15: end if 5.1 Experimental Testbeds
16: end for
We adopt four publicly-available image data sets in our exper-
iments, which have been widely adopted for the benchmarks
4 T HEORETICAL A NALYSIS of content-based image retrieval, image classification and
recognition tasks. TABLE 1 summarizes the statistics of these
We now analyze the theoretical performance of the proposed databases.
algorithms. To be concise, we give a theorem for the bound
of mistakes made by Algorithm 1 for predicting the relative TABLE 1
similarity of the sequence of triplet training instances. The List of image databases in our testbed.
similar result can be derived for Algorithm 2.
For the convenience of discussions in this section, we define: Datasets size classes # avg # per class
Caltech101 8,677 101 85.91
(i) (i)
zt = I ft > 0 , Indoor 15,620 67 233.14
ImageCLEF 7,157 20 367.85
where $I(x)$ is an indicator function that outputs 1 when $x$ is true and 0 otherwise. We further define the optimal margin similarity function error for $M^{(i)}$ with respect to a collection of training examples $P = \{(p_t, p_t^+, p_t^-),\ t = 1, \ldots, T\}$ as

$$F(M^{(i)}, \ell, P) = \min_{M^{(i)}} \frac{\|M^{(i)} - I\|_F^2 + 2C \sum_{t=1}^{T} \ell_t(d_i)}{\min(C, 1)}$$

where $\ell_t(d_i)$ denotes $\ell((p_t, p_t^+, p_t^-); d_i)$. We then have the following theorem for the mistake bound of the proposed OMDML algorithm.

Theorem 1. After receiving a sequence of $T$ training examples, denoted by $P = \{(p_t, p_t^+, p_t^-),\ t = 1, \ldots, T\}$, the number of mistakes $M$ on predicting the ranking of $(p_t, p_t^+, p_t^-)$ made by running Algorithm 1, denoted by

$$M = \sum_{t=1}^{T} I(f_t > 0) = \sum_{t=1}^{T} I\left(\sum_{i=1}^{m} \theta_t^{(i)} f_t^{(i)} > 0\right),$$

is bounded as follows:

$$M \le \frac{2\ln(1/\beta)}{1-\beta} \min_{1 \le i \le m} \sum_{t=1}^{T} z_t^{(i)} + \frac{2\ln m}{1-\beta} \le \frac{2\ln(1/\beta)}{1-\beta} \min_{1 \le i \le m} F(M^{(i)}, \ell, P) + \frac{2\ln m}{1-\beta}.$$

(TABLE 1 fragment, dataset statistics: Corel — 5,000 images, 50 categories, 100 images per category; ImageCLEFFlickr — 1,007,157 images, 21 categories, 47,959.86 images per category on average.)

The first testbed is the "caltech101"2, which has been widely adopted for object recognition and image retrieval [56], [20]. This dataset contains 101 object categories and 8,677 images.

The second testbed is the "indoor" dataset3, which was used for recognizing indoor scenes [57]. This dataset consists of 67 indoor categories and 15,620 images. The numbers of images in different categories are diverse, but each category contains at least 100 images. It is further divided into 5 subsets: store, home, public spaces, leisure, and working place. We simply consider it as a dataset of 67 categories and evaluate different algorithms on the whole indoor collection.

The third testbed is the "ImageCLEF" dataset4, which was also used in [58]. It is a medical image dataset and has 7,157 images in 20 categories.

The fourth testbed is the "Corel" dataset [7], which consists of photos from COREL image CDs. It has 50 categories, each of which has exactly 100 images randomly selected from related examples in COREL image CDs.

1. http://omdml.stevenhoi.org/
2. http://www.vision.caltech.edu/Image Datasets/Caltech101/
3. http://web.mit.edu/torralba/www/indoor.html
4. http://imageclef.org/
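The mistake bound in Theorem 1 has the form of the classic weighted-majority guarantee over the $m$ modality-level learners. The sketch below illustrates the multiplicative update of the combination weights $\theta$; it is an illustrative re-implementation, not the authors' code: the 0/1 mistake indicators `z[t, i]` and the weighted-majority mistake rule are simplifying assumptions.

```python
import numpy as np

def hedge_combine(z, beta=0.9):
    """Weighted-majority combination of m modality-level learners.

    z: (T, m) array with z[t, i] = 1 if the i-th modality's metric
       mispredicts the ranking of the t-th triplet, else 0.
    Returns the final normalized weights theta and the number of
    mistakes made by the weighted combination.
    """
    T, m = z.shape
    w = np.ones(m)                    # one unnormalized weight per modality
    mistakes = 0
    for t in range(T):
        theta = w / w.sum()           # normalized combination weights
        if theta @ z[t] > 0.5:        # the weighted vote of the modalities errs
            mistakes += 1
        w *= beta ** z[t]             # discount every modality that erred
    return w / w.sum(), mistakes
```

Consistent with the theorem, the combination's mistakes are bounded by roughly `2*ln(1/beta)/(1-beta)` times the best single modality's cumulative error plus an additive `2*ln(m)/(1-beta)` term.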
We also combine "ImageCLEF" with a collection of one million social photos crawled from Flickr; this larger set is named "ImageCLEFFlickr". We treat the Flickr photos as a special class of background noisy photos, which are mainly used to test the scalability of our algorithms.

5.2 Experimental Setup
For each database, we split the whole dataset into three disjoint partitions: a training set, a test set, and a validation set. In particular, we randomly choose 500 images to form a test set, and another 500 images to build a validation set. The remaining images form a training set for learning similarity functions.

To generate side information in the form of triplet instances for learning the ranking functions, we sample triplet constraints from the images in the training set according to their ground-truth labels. Specifically, we generate a triplet instance by randomly sampling two images belonging to the same class and one image from a different class. In total, we generate 100K triplet instances for each standard dataset (except for the small-scale and large-scale experiments).

To fairly evaluate the different algorithms, we choose their parameters by following the same cross-validation scheme. For simplicity, we empirically set ri = r = 50 for the i-th modality in the LOMDML algorithm and set the maximum number of iterations to 500 for LMNN. To evaluate the retrieval performance, we adopt the mean Average Precision (mAP) and top-K retrieval accuracy. As a widely used IR metric, the mAP value averages the Average Precision (AP) values of all queries, where each AP value denotes the area under the precision-recall curve for a query. The precision value is the ratio of related examples over total retrieved examples, while the recall value is the ratio of related examples retrieved over total related examples in the database.

Finally, we run all the experiments on a Linux machine with a 2.33GHz 8-core Intel Xeon CPU and 16GB RAM.

5.3 Diverse Visual Features for Image Descriptors
We adopt both global and local feature descriptors to extract features for representing images in our experiments. Each feature corresponds to one modality in the algorithm. Before the feature extraction, we preprocess the images by resizing all of them to the scale of 500×500 pixels while keeping the aspect ratio unchanged.

Specifically, for global features, we extract five types of features to represent an image, namely
• Color histogram and color moments (n = 81),
• Edge direction histogram (n = 37),
• Gabor wavelets transformation (n = 120),
• Local binary pattern (n = 59),
• GIST features (n = 512).

For local features, we extract the bag-of-visual-words representation using two kinds of descriptors:
• SIFT — we adopt the Hessian-Affine interest region detector with a threshold of 500;
• SURF — we use the SURF detector with a threshold of 500.

For the clustering step, we adopt a forest of 16 kd-trees and search 2048 neighbors to speed up the clustering task. By combining the different descriptors (SIFT/SURF) and vocabulary sizes (200/1000), we extract four types of local features: SIFT200, SIFT1000, SURF200 and SURF1000. Finally, we adopt the TF-IDF weighting scheme to generate the final bag-of-visual-words representation describing the local features. For all learning algorithms, we normalize the feature vectors to ensure that every feature entry is in [0, 1].

5.4 Comparison Algorithms
To extensively evaluate the efficacy of our algorithms, we compare the two proposed online multi-modal DML algorithms, i.e., OMDML and LOMDML, against a number of existing representative DML algorithms, including RCA [30], LMNN [32], and OASIS [20]. As a heuristic baseline, we also evaluate the squared Euclidean distance, denoted as "EUCL-*".

To adapt the existing DML methods for multi-modal image retrieval, we have implemented several variants of each DML algorithm by exploring three fusion strategies [59], [60]:
1) "Best" — apply DML to each modality individually and then select the best modality. We name these algorithms with the suffix "-B", e.g., RCA-B, in which we first learn metrics over each modality separately on the training set by Relevance Component Analysis (RCA) [30]. After that, we validate the retrieval performance of each metric on its corresponding modality against the validation set, choose the modality with the highest mAP as the best modality, and report the mAP score of the best modality by ranking on the test set with RCA.
2) "Concatenation" — an early fusion approach that concatenates the features of all modalities before applying DML. We name these algorithms with the suffix "-C", e.g., LMNN-C, in which we first concatenate all types of features together, then learn the optimal metric on this combined feature space by LMNN [32], and finally evaluate the mAP score with the learned metric.
3) "Uniform combination" — a late fusion approach that uniformly combines all modalities after metric learning. We name these algorithms with the suffix "-U", e.g., OASIS-U, in which we first learn an optimal metric by OASIS [20] for each modality, and then uniformly combine all the distance functions for the final ranking.

5.5 Evaluation on Small-Scale Datasets
In this section, we build four small-scale datasets, named "Caltech101(S)", "Indoor(S)", "COREL(S)" and "ImageCLEF(S)", from the corresponding standard datasets by first choosing 10 object categories and then randomly sampling 50 examples from each category. We adopt the 5 global features described above as the multi-modal inputs. To construct triplet constraints for the online learning approaches, we generate all positive pairs (two images belonging to the same class), and for each positive pair we randomly select an image from a different class to form a triplet. In total, about 10K triplets are generated for each dataset. TABLE 2 summarizes
[Figure: top-K retrieval precision (Prec vs. @K, for K = 20 to 100) on the small-scale datasets, including panels for "corel" and "caltech101", comparing EUCL-C, RCA-C, OASIS-C, RCA-U, OASIS-U and LOMDML.]
the evaluation results on the small-scale datasets, from which we can draw the following observations.

TABLE 2
Evaluation of the mAP performance.

Alg.      COREL(S)  Caltech101(S)  Indoor(S)  ImageCLEF(S)
Eucl-B    0.4431    0.4299         0.1726     0.4325
RCA-B     0.5097    0.4984         0.1915     0.4492
LMNN-B    0.4876    0.5462         0.1852     0.5231
OASIS-B   0.4445    0.5072         0.1884     0.4424
Eucl-C    0.5220    0.4306         0.1842     0.4431
RCA-C     0.6437    0.6156         0.2078     0.5927
LMNN-C    0.5816    0.5894         0.2027     0.5821
OASIS-C   0.5657    0.5441         0.2017     0.5618
Eucl-U    0.5220    0.4306         0.1842     0.4431
RCA-U     0.5625    0.4860         0.1894     0.4909
LMNN-U    0.6026    0.4282         0.2007     0.4647
OASIS-U   0.5679    0.5419         0.1989     0.5338
OMDML     0.6620    0.6543         0.2113     0.6824
LOMDML    0.6975    0.6646         0.2250     0.7080

First of all, the two kinds of fusion strategies, i.e., early fusion (with suffix "-C") and late fusion (with suffix "-U"), generally tend to perform better than the best-single-metric approaches (with suffix "-B"). This is primarily because combining multiple types of features with learning can better exploit the potential of all the features, which validates the importance of the proposed technique.

Second, some of the uniform combination algorithms (i.e., the late fusion strategy) failed to outperform the best single metric approach in some cases, e.g., "RCA-U" (compared with "RCA-B") and "LMNN-U" (compared with "LMNN-B") on Caltech101(S). This implies that uniform combination is not optimal for combining different kinds of features. Thus, it is critical to identify the effective features via machine learning and then assign them higher weights.

Third, among all the compared algorithms, the proposed OMDML and LOMDML algorithms outperform the others. Finally, it is interesting to observe that the proposed low-rank algorithm (LOMDML) not only improves the efficiency and scalability of OMDML, but also enhances the retrieval accuracy. This is probably because, by learning metrics in an intrinsic lower-dimensional space, we may avoid some of the impact of overfitting and noise.
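For reference, the Prec@K, AP and mAP quantities reported in these tables follow the standard definitions and can be computed as below (a minimal sketch, independent of the authors' evaluation code; it assumes every relevant item of a query appears somewhere in its ranked list):

```python
def precision_at_k(relevant, k):
    """Fraction of relevant items among the top-k ranked results.

    relevant: ranked list of 0/1 relevance flags for one query.
    """
    return sum(relevant[:k]) / k

def average_precision(relevant):
    """AP for one query: area under its precision-recall curve."""
    hits, ap = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            ap += hits / rank          # precision at each relevant hit
    total_relevant = sum(relevant)     # assumes all relevant items are ranked
    return ap / total_relevant if total_relevant else 0.0

def mean_average_precision(ranked_queries):
    """mAP: the AP values averaged over all queries."""
    return sum(average_precision(r) for r in ranked_queries) / len(ranked_queries)
```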
[Figure: top-K retrieval precision (Prec vs. @K) on "ImageCLEF+Flickr", comparing EUCL-C, RCA-C, OASIS-C, RCA-U, OASIS-U and LOMDML.]

TABLE 6
Comparison between LOMDML and OMDL-LR (gaussianmeanvar).

                       LOMDML   OMDL-LR
mAP   Caltech101(S)    0.6646   0.5994
      Indoor(S)        0.2250   0.2088
      ImageCLEF(S)     0.7080   0.6729
Time cost (in sec.)
      COREL(S)         22.11    209.57
5.9 Evaluation on the Large-scale Dataset
extend our framework to resolve other types of multi-modal data analytics tasks beyond image retrieval.

ACKNOWLEDGEMENTS
This work was supported by a Singapore MOE Tier-1 research grant from Singapore Management University, Singapore.

REFERENCES
[1] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, "Content-based multimedia information retrieval: State of the art and challenges," ACM Transactions on Multimedia Computing, Communications and Applications, vol. 2, no. 1, pp. 1–19, 2006.
[2] Y. Jing and S. Baluja, "VisualRank: Applying PageRank to large-scale image search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1877–1890, 2008.
[3] D. Grangier and S. Bengio, "A discriminative kernel-based approach to rank images from text queries," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 8, pp. 1371–1384, 2008.
[4] A. K. Jain and A. Vailaya, "Shape-based retrieval: a case study with trademark image databases," Pattern Recognition, vol. 31, no. 9, pp. 1369–1390, 1998.
[5] Y. Rubner, C. Tomasi, and L. J. Guibas, "The earth mover's distance as a metric for image retrieval," International Journal of Computer Vision, vol. 40, no. 2, pp. 99–121, 2000.
[6] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-based image retrieval at the end of the early years," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1349–1380, 2000.
[7] S. C. Hoi, W. Liu, M. R. Lyu, and W.-Y. Ma, "Learning distance metrics with contextual constraints for image retrieval," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, New York, US, Jun. 17–22, 2006.
[8] L. Si, R. Jin, S. C. Hoi, and M. R. Lyu, "Collaborative image retrieval via regularized metric learning," ACM Multimedia Systems Journal, vol. 12, no. 1, pp. 34–44, 2006.
[9] S. C. Hoi, W. Liu, and S.-F. Chang, "Semi-supervised distance metric learning for collaborative image retrieval," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2008.
[10] J. Goldberger, G. Hinton, S. Roweis, and R. Salakhutdinov, "Neighbourhood components analysis," in Advances in Neural Information Processing Systems, 2005.
[11] K. Fukunaga, Introduction to Statistical Pattern Recognition. Elsevier, 1990.
[12] A. Globerson and S. Roweis, "Metric learning by collapsing classes," in Advances in Neural Information Processing Systems, 2005.
[13] L. Yang, R. Jin, R. Sukthankar, and Y. Liu, "An efficient algorithm for local distance metric learning," in Proceedings of the AAAI Conference on Artificial Intelligence, 2006.
[14] A. K. Jain and A. Vailaya, "Image retrieval using color and shape," Pattern Recognition, vol. 29, pp. 1233–1244, 1996.
[15] B. S. Manjunath and W.-Y. Ma, "Texture features for browsing and retrieval of image data," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 8, pp. 837–842, 1996.
[16] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman, "Discovering objects and their location in images," in IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[17] J. Yang, Y.-G. Jiang, A. G. Hauptmann, and C.-W. Ngo, "Evaluating bag-of-visual-words representations in scene classification," in ACM International Conference on Multimedia Information Retrieval, 2007, pp. 197–206.
[18] D. G. Lowe, "Object recognition from local scale-invariant features," in IEEE International Conference on Computer Vision, 1999, pp. 1150–1157.
[19] M. Norouzi, D. J. Fleet, and R. Salakhutdinov, "Hamming distance metric learning," in Advances in Neural Information Processing Systems, 2012.
[20] G. Chechik, V. Sharma, U. Shalit, and S. Bengio, "Large scale online learning of image similarity through ranking," Journal of Machine Learning Research, vol. 11, pp. 1109–1135, 2010.
[21] H. Chang and D.-Y. Yeung, "Kernel-based distance metric learning for content-based image retrieval," Image and Vision Computing, vol. 25, no. 5, pp. 695–703, 2007.
[22] R. Salakhutdinov and G. Hinton, "Semantic hashing," International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, Jul. 2009. [Online]. Available: http://dx.doi.org/10.1016/j.ijar.2008.11.006
[23] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid, "Aggregating local image descriptors into compact codes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 9, pp. 1704–1716, Sep. 2012. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2011.235
[24] K. Chatfield, V. S. Lempitsky, A. Vedaldi, and A. Zisserman, "The devil is in the details: an evaluation of recent feature encoding methods," in BMVC, 2011, pp. 1–12.
[25] A. Joly and O. Buisson, "Random maximum margin hashing," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR'11), Washington, DC, USA, 2011, pp. 873–880.
[26] D. Zhai, H. Chang, S. Shan, X. Chen, and W. Gao, "Multiview metric learning with global consistency and local smoothness," ACM Transactions on Intelligent Systems and Technology, vol. 3, no. 3, p. 53, 2012.
[27] W. Di and M. Crawford, "View generation for multiview maximum disagreement based active learning for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 5, pp. 1942–1954, 2012.
[28] S. Akaho, "A kernel method for canonical correlation analysis," in Proceedings of the International Meeting of the Psychometric Society. Springer-Verlag, 2001.
[29] J. D. R. Farquhar, H. Meng, S. Szedmak, D. R. Hardoon, and J. Shawe-Taylor, "Two view learning: SVM-2K, theory and practice," in Advances in Neural Information Processing Systems. MIT Press, 2006.
[30] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning distance functions using equivalence relations," in Proceedings of International Conference on Machine Learning, 2003, pp. 11–18.
[31] J.-E. Lee, R. Jin, and A. K. Jain, "Rank-based distance metric learning: An application to image retrieval," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, 2008.
[32] K. Weinberger, J. Blitzer, and L. Saul, "Distance metric learning for large margin nearest neighbor classification," in Advances in Neural Information Processing Systems, 2006, pp. 1473–1480.
[33] C. Domeniconi, J. Peng, and D. Gunopulos, "Locally adaptive metric nearest-neighbor classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 9, pp. 1281–1285, 2002.
[34] P. Wu, S. C. H. Hoi, P. Zhao, and Y. He, "Mining social images with distance metric learning for automated image tagging," in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. ACM, 2011, pp. 197–206.
[35] P. Wu, S. C. Hoi, H. Xia, P. Zhao, D. Wang, and C. Miao, "Online multimodal deep similarity learning with application to image retrieval," in Proceedings of the 21st ACM International Conference on Multimedia. ACM, 2013, pp. 153–162.
[36] X. Gao, S. C. Hoi, Y. Zhang, J. Wan, and J. Li, "SOML: Sparse online metric learning with application to image retrieval," in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
[37] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman, "Online metric learning and fast similarity search," in Advances in Neural Information Processing Systems, 2008, pp. 761–768.
[38] R. Jin, S. Wang, and Y. Zhou, "Regularized distance metric learning: Theory and algorithm," in Advances in Neural Information Processing Systems, 2009, pp. 862–870.
[39] D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. He, and C. Miao, "Learning to name faces: a multimodal learning scheme for search-based face annotation," in SIGIR, 2013, pp. 443–452.
[40] H. Xia, P. Wu, and S. C. H. Hoi, "Online multi-modal distance learning for scalable multimedia retrieval," in WSDM, 2013, pp. 455–464.
[41] S. Shalev-Shwartz, "Online learning and online convex optimization," Foundations and Trends in Machine Learning, vol. 4, no. 2, pp. 107–194, 2011.
[42] S. C. Hoi, J. Wang, and P. Zhao, "LIBOL: A library for online learning algorithms," Journal of Machine Learning Research, vol. 15, pp. 495–499, 2014. [Online]. Available: https://github.com/LIBOL
[43] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, no. 6, pp. 386–408, 1958.
[44] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, "Online passive-aggressive algorithms," Journal of Machine Learning Research, vol. 7, pp. 551–585, 2006.
[45] M. Dredze, K. Crammer, and F. Pereira, "Confidence-weighted linear classification," in Proceedings of International Conference on Machine Learning, 2008, pp. 264–271.
[46] K. Crammer, A. Kulesza, and M. Dredze, "Adaptive regularization of weight vectors," in Advances in Neural Information Processing Systems, 2009, pp. 414–422.
[47] P. Zhao, S. C. H. Hoi, and R. Jin, "Double updating online learning," Journal of Machine Learning Research, vol. 12, pp. 1587–1615, 2011.
[48] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[49] M. Zinkevich, "Online convex programming and generalized infinitesimal gradient ascent," in Proceedings of International Conference on Machine Learning, 2003, pp. 928–936.
[50] S. C. H. Hoi, J. Wang, P. Zhao, R. Jin, and P. Wu, "Fast bounded online gradient descent algorithms for scalable kernel-based online learning," in ICML, 2012.
[51] S. C. Hoi, R. Jin, P. Zhao, and T. Yang, "Online multiple kernel classification," Machine Learning, vol. 90, no. 2, pp. 289–316, 2013.
[52] Y. Freund and R. E. Schapire, "Adaptive game playing using multiplicative weights," Games and Economic Behavior, vol. 29, no. 1, pp. 79–103, 1999.
[53] Y. Li and P. M. Long, "The relaxed online maximum margin algorithm," in Advances in Neural Information Processing Systems, 1999, pp. 498–504.
[54] L. Bottou and Y. LeCun, "Large scale online learning," in Advances in Neural Information Processing Systems, 2003.
[55] S. C. Hoi, M. R. Lyu, and R. Jin, "A unified log-based relevance feedback scheme for image retrieval," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 4, pp. 509–524, 2006.
[56] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," California Institute of Technology, Tech. Rep. 7694, 2007.
[57] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[58] L. Yang, R. Jin, L. B. Mummert, R. Sukthankar, A. Goode, B. Zheng, S. C. H. Hoi, and M. Satyanarayanan, "A boosting framework for visuality-preserving distance metric learning and its application to medical image retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 30–44, 2010.
[59] C. G. Snoek, M. Worring, and A. W. Smeulders, "Early versus late fusion in semantic video analysis," in Proceedings of the 13th Annual ACM International Conference on Multimedia, 2005, pp. 399–402.
[60] J. Kludas, E. Bruno, and S. Marchand-Maillet, "Information fusion in multimedia information retrieval," in Adaptive Multimedia Retrieval: Retrieval, User, and Semantics, 2008, pp. 147–159.

Steven C. H. Hoi is currently an Associate Professor in the School of Information Systems, Singapore Management University, Singapore. Prior to joining SMU, he was an Associate Professor with Nanyang Technological University, Singapore. He received his Bachelor degree from Tsinghua University, P.R. China, in 2002, and his Ph.D. degree in computer science and engineering from The Chinese University of Hong Kong in 2006. His research interests are machine learning and data mining and their applications to multimedia information retrieval (image and video retrieval), social media and web mining, and computational finance, and he has published over 150 refereed papers in top conferences and journals in these areas. He has served as Associate Editor-in-Chief for the Neurocomputing Journal, general co-chair for the ACM SIGMM Workshops on Social Media (WSM'09, WSM'10, WSM'11), program co-chair for the fourth Asian Conference on Machine Learning (ACML'12), book editor for "Social Media Modeling and Computing", guest editor for ACM Transactions on Intelligent Systems and Technology (ACM TIST), technical PC member for many international conferences, and external reviewer for many top journals and worldwide funding agencies, including NSF in the US and RGC in Hong Kong.

Peilin Zhao received his PhD from the School of Computer Engineering at Nanyang Technological University, Singapore, in 2012, and his bachelor degree from Zhejiang University, Hangzhou, P.R. China, in 2008. His research interests are statistical machine learning and data mining.
Fig. 6. Qualitative evaluation of the top-5 retrieved images by different algorithms. For each block, the first image is the query, and the results from the first line to the sixth line represent "Eucl-C", "RCA-C", "OASIS-C", "RCA-U", "OASIS-U" and "LOMDML", respectively. The left column is from the "Corel" dataset and the right column is from the "Caltech101" dataset.
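Conceptually, rankings like those in Fig. 6 are produced by combining the per-modality distance functions with the learned combination weights. A minimal sketch of this final ranking step is given below; it is illustrative only, with the plain squared Euclidean distance `sq_eucl` standing in for the learned per-modality metrics:

```python
import numpy as np

def retrieve_top_k(query_feats, db_feats, theta, metrics, k=5):
    """Rank database images by a weighted sum of per-modality distances.

    query_feats / db_feats: one entry per modality; db_feats[i] has shape
      (n_images, d_i) and query_feats[i] has shape (d_i,).
    theta: learned combination weights, one per modality.
    metrics: per-modality distance functions d_i(q, X) -> (n_images,).
    """
    combined = np.zeros(db_feats[0].shape[0])
    for w, d_i, q, X in zip(theta, metrics, query_feats, db_feats):
        combined += w * d_i(q, X)      # weighted per-modality distance
    return np.argsort(combined)[:k]    # indices of the k nearest images

def sq_eucl(q, X):
    """Squared Euclidean distance from query q to every row of X."""
    return ((X - q) ** 2).sum(axis=1)
```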