
Singapore Management University
Institutional Knowledge at Singapore Management University
Research Collection School Of Information Systems

2-2016

Online multi-modal distance metric learning with application to image retrieval

Pengcheng WU
Steven C. H. HOI, Singapore Management University, [email protected]
Peilin ZHAO
Chunyan MIAO
Zhi-Yong LIU

Citation
WU, Pengcheng; HOI, Steven C. H.; ZHAO, Peilin; MIAO, Chunyan; and LIU, Zhi-Yong. Online multi-modal distance metric learning with application to image retrieval. (2016). IEEE Transactions on Knowledge and Data Engineering, 28(2), 454-467. Research Collection School Of Information Systems.
Available at: http://ink.library.smu.edu.sg/sis_research/2924

Online Multi-modal Distance Metric Learning


with Application to Image Retrieval
Pengcheng Wu, Steven C. H. Hoi, Peilin Zhao, Chunyan Miao, Zhi-Yong Liu

Abstract—Distance metric learning (DML) is an important technique to improve similarity search in content-based image retrieval.
Despite being studied extensively, most existing DML approaches typically adopt a single-modal learning framework that learns the
distance metric on either a single feature type or a combined feature space where multiple types of features are simply concatenated.
Such single-modal DML methods suffer from some critical limitations: (i) some types of features may significantly dominate the others
in the DML task due to diverse feature representations; and (ii) learning a distance metric on the combined high-dimensional feature
space can be extremely time-consuming using the naive feature concatenation approach. To address these limitations, in this paper, we
investigate a novel scheme of online multi-modal distance metric learning (OMDML), which explores a unified two-level online learning
scheme: (i) it learns to optimize a distance metric on each individual feature space; and (ii) then it learns to find the optimal combination
of diverse types of features. To further reduce the expensive cost of DML on high-dimensional feature space, we propose a low-rank
OMDML algorithm which not only significantly reduces the computational cost but also achieves highly competitive or even better learning
accuracy. We conduct extensive experiments to evaluate the performance of the proposed algorithms for multi-modal image retrieval,
in which encouraging results validate the effectiveness of the proposed technique.

Index Terms—content-based image retrieval, multi-modal retrieval, distance metric learning, online learning

1 INTRODUCTION

One of the core research problems in multimedia retrieval is to seek an effective distance metric/function for computing the similarity of two objects in content-based multimedia retrieval tasks [1], [2], [3]. Over the past decades, multimedia researchers have spent much effort in designing a variety of low-level feature representations and different distance measures [4], [5], [6]. Finding a good distance metric/function remains an open challenge for content-based multimedia retrieval tasks till now. In recent years, one promising direction to address this challenge is to explore distance metric learning (DML) [7], [8], [9] by applying machine learning techniques to optimize distance metrics from training data or side information, such as historical logs of user relevance feedback in content-based image retrieval (CBIR) systems [6], [7].

Although various DML algorithms have been proposed in literature [7], [10], [11], [12], [13], most existing DML methods in general belong to single-modal DML in that they learn a distance metric either on a single type of feature or on a combined feature space by simply concatenating multiple types of diverse features together. In a real-world application, such approaches may suffer from some practical limitations: (i) some types of features may significantly dominate the others in the DML task, weakening the ability to exploit the potential of all features; and (ii) the naive concatenation approach may result in a combined high-dimensional feature space, making the subsequent DML task computationally intensive.

To overcome the above limitations, this paper investigates a novel framework of Online Multi-modal Distance Metric Learning (OMDML), which learns distance metrics from multi-modal data or multiple types of features via an efficient and scalable online learning scheme. Unlike the above concatenation approach, the key ideas of OMDML are twofold: (i) it learns to optimize a separate distance metric for each individual modality (i.e., each type of feature space), and (ii) it learns to find an optimal combination of diverse distance metrics on multiple modalities. Moreover, OMDML takes advantage of online learning techniques for high efficiency and scalability towards large-scale learning tasks. To further reduce the computational cost, we also propose a Low-rank Online Multi-modal DML (LOMDML) algorithm, which avoids the need of doing intensive positive semi-definite (PSD) projections and thus saves a significant amount of computational cost for DML on high-dimensional data. As a summary, the major contributions of this paper include:
• We present a novel framework of Online Multi-modal Distance Metric Learning (OMDML), which simultaneously learns optimal metrics on each individual modality and the optimal combination of the metrics from multiple modalities via efficient and scalable online learning;
• We further propose a low-rank OMDML algorithm which significantly reduces the computational cost for high-dimensional data without PSD projection;
• We offer theoretical analysis of the OMDML method;
• We conduct an extensive set of experiments to evaluate the performance of the proposed techniques for CBIR tasks using multiple types of features.

Corresponding author: Steven C. H. HOI and Pengcheng WU are with the School of Information Systems, Singapore Management University, Singapore 178902. E-mail: {chhoi,pcwu}@smu.edu.sg
Peilin Zhao is with the Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore 138632. E-mail: [email protected]
Chunyan Miao is with the School of Computer Engineering, Nanyang Technological University, Singapore 639798. E-mail: [email protected]
Zhi-Yong Liu is with the State Key Lab of Management and Control for Complex System, Institute of Automation, Chinese Academy of Sciences, Beijing, China. E-mail: [email protected]
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 first gives the problem formulation, and then presents our method of online multi-modal metric learning, followed by proposing an improved low-rank algorithm. Section 4 provides theoretical analysis for the proposed algorithms, Section 5 discusses our experimental results, and finally Section 6 concludes this work.

2 RELATED WORK

Our work is related to three major groups of research: content-based image retrieval, distance metric learning, and online learning. In the following, we briefly review the closely related representative works in each group.

2.1 Content-based Image Retrieval
With the rapid growth of digital cameras and photo sharing websites, image retrieval has become one of the most important research topics in the past decades, among which content-based image retrieval is one of the key challenging problems [1], [2], [3]. The objective of CBIR is to search images by analyzing the actual contents of the image as opposed to analyzing metadata like keywords, title and author, such that extensive efforts have been made on investigating various low-level feature descriptors for image representation [14]. For example, researchers have spent many years in studying various global features for image representation, such as color features [14], edge features [14], and texture features [15]. Recent years have also witnessed a surge of research on local feature based representation, such as the bag-of-words models [16], [17] using local feature descriptors (e.g., SIFT [18]).

Conventional CBIR approaches usually choose rigid distance functions on some extracted low-level features for multimedia similarity search, such as the classical Euclidean distance or cosine similarity. However, there exists one key limitation that the fixed rigid similarity/distance function may not always be optimal because of the complexity of visual image representation and the main challenge of the semantic gap between the low-level visual features extracted by computers and the high-level human perception and interpretation. Hence, recent years have witnessed a surge of active research efforts in the design of various distance/similarity measures on some low-level features by exploiting machine learning techniques [19], [20], [21], among which some works focus on learning to hash for compact codes [22], [19], [23], [24], [25], and some others can be categorized into distance metric learning that will be introduced in the next subsection. Our work is also related to multimodal/multiview studies, which have been widely studied in image classification and object recognition fields [26], [27], [28], [29]. However, it is usually hard to exploit these techniques directly on CBIR because (i) in general, image classes will not be given explicitly in CBIR tasks, (ii) even if classes are given, the number will be very large, and (iii) image datasets tend to be much larger in CBIR than in classification tasks. We thus exclude the direct comparisons to such existing works in this paper. There are still some other open issues in CBIR studies, such as the efficiency and scalability of the retrieval process that often requires an effective indexing scheme, which are out of this paper's scope.

2.2 Distance Metric Learning
Distance metric learning has been extensively studied in both the machine learning and multimedia retrieval communities [30], [7], [31], [32], [33], [34], [35], [36]. The essential idea is to learn an optimal metric which minimizes the distance between similar/related images and simultaneously maximizes the distance between dissimilar/unrelated images. Existing DML studies can be grouped into different categories according to different learning settings and principles. For example, in terms of different types of constraint settings, DML techniques are typically categorized into two groups:
• Global supervised approaches [30], [7]: to learn a metric in a global setting, e.g., all constraints will be satisfied simultaneously;
• Local supervised approaches [32], [33]: to learn a metric in the local sense, e.g., the given local constraints from neighboring information will be satisfied.
Moreover, according to different training data forms, DML studies in machine learning typically learn metrics directly from explicit class labels [32], while DML studies in multimedia mainly learn metrics from side information, which usually can be obtained in the following two forms:
• Pairwise constraints [7], [9]: A must-link constraint set S and a cannot-link constraint set D are given, where a pair of images (p_i, p_j) ∈ S if p_i is related/similar to p_j, otherwise (p_i, p_j) ∈ D. Some literature uses the term equivalent/positive constraint in place of "must-link", and the term inequivalent/negative constraint in place of "cannot-link".
• Triple constraints [20]: A triplet set P is given, where P = {(p_t, p_t^+, p_t^−) | (p_t, p_t^+) ∈ S; (p_t, p_t^−) ∈ D, t = 1, . . . , T}, S contains related pairs and D contains unrelated pairs, i.e., p_t is related/similar to p_t^+ and p_t is unrelated/dissimilar to p_t^−. T denotes the cardinality of the entire triplet set.
When only explicit class labels are provided, one can also construct side information by simply considering relationships of instances in the same class as related, and relationships of instances belonging to different classes as unrelated. In our work, we focus on triple constraints.
Finally, in terms of learning methodology, most existing DML studies generally employ batch learning methods which often assume the whole collection of training data must be given before the learning task and train a model from scratch, except for a few recent DML studies which begin to explore online learning techniques [37], [38]. All these works generally address single-modal DML, which is different from our focus on multi-modal DML. We also note that our work is very different from the existing multiview DML study [26] which is concerned with regular classification tasks by learning a metric on training data with explicit class labels, making it difficult to be compared with our method directly. We note that our work is different from another multimodal learning study in [39] which addresses a very different problem of search-based face
[Figure 1 appears here in the original: in the learning phase, triplet training data pass through feature extraction on modalities 1, . . . , m, and the similarity function on each modality is updated together with the multi-modal combination; in the retrieval phase, a submitted query is ranked against the image database with the learned multi-modal similarity function and the top-ranked images are returned.]
Fig. 1. Overview of the proposed multi-modal distance metric learning scheme for multi-modal retrieval in CBIR

annotation, where their multimodal learning is formulated as a batch learning task for optimizing a specific loss function tailored for search-based face annotation tasks from weakly labeled data. Finally, we note that our work is also different from some existing distance learning studies that learn nonlinear distance functions using kernel or deep learning methods [21], [40], [35]. In comparison to the linear distance metric learning methods, kernel methods usually may achieve better learning accuracy in some scenarios, but fall short in being difficult to scale up for large-scale applications due to the curse of kernelization, i.e., the learning cost increases dramatically when the number of training instances increases. Thus, our empirical study is focused on direct comparisons to the family of linear methods.

2.3 Online Learning
Our work generally falls in the category of online learning methodology, which has been extensively studied in machine learning [41], [42]. Unlike batch learning methods that usually suffer from expensive re-training cost when new training data arrive, online learning sequentially makes a highly efficient (typically constant) update for each new training instance, making it highly scalable for large-scale applications. In general, online learning operates on a sequence of data instances with time stamps. At each time step, an online learning algorithm processes an incoming example by first predicting its class label; after the prediction, it receives the true class label, which is then used to measure the suffered loss between the predicted label and the true label; at the end of each time step, the model is updated with the loss whenever it is nonzero. The overall objective of an online learning task is to minimize the cumulative loss over the entire sequence of received instances.

In literature, a variety of algorithms have been proposed for online learning [43], [44], [45], [46], [47]. Some well-known examples include the Hedge algorithm for online prediction with expert advice [48], the Perceptron algorithm [43], the family of Passive-Aggressive (PA) learning algorithms [44], and the online gradient descent algorithms [49]. There is also some study that attempts to improve the scalability of online kernel methods, such as [50] which proposed a bounded online gradient descent for addressing online kernel-based classification tasks. In this work, we apply online learning techniques, i.e., the Hedge, PA, and online gradient descent algorithms, to tackle the multi-modal distance metric learning task for content-based image retrieval. Besides, we note that this work was partially inspired by the recent study of online multiple kernel learning which aims to address online classification tasks using multiple kernels [51]. In the following, we give a brief overview of several popular online learning algorithms.

2.3.1 Hedge Algorithms
The Hedge algorithm [48], [52] is a learning algorithm which aims to dynamically combine multiple strategies in an optimal way, i.e., making the final cumulative loss asymptotically approach that of the best strategy. Its key idea is to maintain a dynamic weight distribution over the set of strategies. During the online learning process, the distribution is updated according to the performance of those strategies. Specifically, the weight of every strategy is decreased exponentially with respect to its suffered loss, making the overall strategy approach the best strategy.
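To make the exponential discounting concrete, the following is a minimal NumPy sketch of a single Hedge step over a set of strategies (an illustrative example with hypothetical names, not the implementation used in this paper): strategies that suffered a loss are discounted by a factor beta and the weights are renormalized into a distribution.

import numpy as np

def hedge_update(weights, losses, beta=0.9):
    """One Hedge step: discount each strategy's weight exponentially by its
    suffered loss and renormalize so the weights form a distribution."""
    weights = weights * (beta ** np.asarray(losses))  # exponential discount
    return weights / weights.sum()                    # renormalize

# toy usage: three strategies, only the second one makes a mistake (loss = 1)
w = np.ones(3) / 3
w = hedge_update(w, losses=[0, 1, 0], beta=0.5)
# w is now approximately [0.4, 0.2, 0.4]: the mistaken strategy is down-weighted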
2.3.2 Passive-Aggressive Learning
As a classical well-known online learning technique, the Perceptron algorithm [43] simply updates the model by adding an incoming instance with a constant weight whenever it is misclassified. Recent years have witnessed a variety of algorithms proposed to improve the Perceptron [53], [44], which usually follow the principle of maximum margin learning in order to maximize the margin of the classifier. Among them, one of the most notable approaches is the family of Passive-Aggressive (PA) learning algorithms [44], which updates the model whenever the classifier fails to produce a large margin on the incoming instance. In particular, the family of online PA learning is formulated to trade off the minimization of the distance between the target classifier and the previous classifier, and the minimization of the loss suffered by the target classifier on the current instance. The PA algorithms enjoy good efficiency and scalability due to their simple closed-form solutions. Finally, both theoretical analysis and most empirical studies demonstrate the advantages of the PA algorithms over the classical Perceptron algorithm.
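As a brief illustration of this passive/aggressive trade-off, the sketch below shows the standard PA-I update for binary classification with a hinge loss; the variable names and the choice of the PA-I variant are ours for illustration, and Section 3.4 applies the same principle to metric updates.

import numpy as np

def pa_update(w, x, y, C=1.0):
    """Passive-Aggressive (PA-I) step for binary classification.
    y in {-1, +1}; the model stays unchanged when the margin is large enough."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss on the incoming example
    if loss == 0.0:
        return w                              # passive: margin already >= 1
    tau = min(C, loss / np.dot(x, x))         # closed-form step size
    return w + tau * y * x                    # aggressive: enforce the margin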
2.3.3 Online Gradient Descent
Besides Perceptron and PA methods, another well-known online learning method is the family of Online Gradient Descent (OGD) algorithms, which applies the family of online convex optimization techniques to optimize some particular objective function of an online learning task [49]. It enjoys a solid theoretical foundation of online convex optimization, and thus works effectively in empirical applications. When the training data is abundant and computing resources are comparatively scarce, some existing studies showed that a properly designed OGD algorithm can asymptotically approach or even outperform a respective batch learning algorithm [54].

3 ONLINE MULTI-MODAL DISTANCE METRIC LEARNING

3.1 Overview
In literature, many techniques have been proposed to improve the performance of CBIR. Some existing studies have made efforts on investigating novel low-level feature descriptors in order to better represent the visual content of images, while others have focused on the investigation of designing or learning effective distance/similarity measures based on some extracted low-level features. In practice, it is hard to find a single best low-level feature representation that consistently beats the others in all scenarios. Thus, it is highly desirable to explore machine learning techniques to automatically combine multiple types of diverse features and their respective distance measures. We refer to this open research problem as a multi-modal distance metric learning task, and present two new algorithms to solve it in this section. Figure 1 illustrates the system flow of the proposed multi-modal distance metric learning scheme for content-based image retrieval, which consists of two phases, i.e., the learning phase and the retrieval phase. The goal is to learn the distance metrics in the learning phase in order to facilitate the image ranking task in the retrieval phase. We note that these two phases may operate concurrently in practice, where the learning phase may never stop by learning from endless streams of training data.

During the learning phase, we assume triplet training data instances arrive sequentially, which is natural for a real-world CBIR system. For example, in online relevance feedback, a user is often asked to provide feedback to indicate if a retrieved image is related or unrelated to a query; as a result, users' relevance feedback log data can be collected to generate the training data in a sequential manner for the learning task [55]. Once a triplet of images is received, we extract different low-level feature descriptors on multiple modalities from these images. After that, every distance function on a single modality can be updated by exploiting the corresponding features and label information. Simultaneously, we also learn the optimal combination of different modalities to obtain the final optimal distance function, which is applied to rank images in the retrieval phase.

During the retrieval phase, when the CBIR system receives a query from users, it first applies the similar approach to extract low-level feature descriptors on multiple modalities, then employs the learned optimal distance function to rank the images in the database, and finally presents the user with the list of corresponding top-ranked images. In the following, we first give the notation used throughout the rest of this paper, and then formulate the problem of multi-modal distance metric learning, followed by presenting online algorithms to solve it.

3.2 Notation
For the notation used in this paper, we use a bold upper case letter to denote a matrix, for example, M ∈ R^{n×n}, and a bold lower case letter to denote a vector, for example, p ∈ R^n. We adopt I to denote an identity matrix. Formally, we define the following terms and operators:
• m: the number of modalities (types of features).
• n_i: the dimensionality of the i-th visual feature space (modality).
• p^(i): the i-th type of visual feature (modality) of the corresponding image, p^(i) ∈ R^{n_i}.
• M^(i): the optimal distance metric on the i-th modality, where M^(i) ∈ R^{n_i×n_i}.
• W^(i): a linear transformation matrix obtained by decomposing M^(i), such that M^(i) = W^(i)^T W^(i), W^(i) ∈ R^{r_i×n_i}, where r_i is the dimensionality of the projected feature space.
• S: a positive constraint set, where a pair (p_i, p_j) ∈ S if and only if p_i is related/similar to p_j.
• D: a negative constraint set, where a pair (p_i, p_j) ∈ D if and only if p_i is unrelated/dissimilar to p_j.
• P: a triplet set, where P = {(p_t, p_t^+, p_t^−) | (p_t, p_t^+) ∈ S; (p_t, p_t^−) ∈ D, t = 1, . . . , T}, where T denotes the cardinality of the entire triplet set.
• d_i(p_1, p_2): the distance function of two images p_1 and p_2 on the i-th type of visual feature (modality).
When only one modality is considered, we will omit the superscript (i) or subscript i in the above terms.
3.3 Problem Formulation
Our goal is to learn a distance function from side information for content-based image retrieval. We restrict our discussion to learning the family of Mahalanobis distances. In particular, for any two images p_1, p_2 ∈ R^n, where n is the dimensionality of the represented feature space, we aim to learn an optimal distance metric M to calculate the distance between p_1 and p_2 as the following distance function:

d(p_1, p_2) = (p_1 − p_2)^T M (p_1 − p_2);  M ⪰ 0,    (1)

where M ⪰ 0 denotes that M is a positive semi-definite (PSD) matrix, i.e., p^T M p ≥ 0 for any nonzero real vector p ∈ R^n. Obviously, if one chooses M as the identity matrix I, the above formula is reduced to the (squared) Euclidean distance.

To formulate the learning task, we assume a collection of training data instances is given (sequentially) in the form of triplet constraints, i.e., P = {(p_t, p_t^+, p_t^−), t = 1, . . . , T}, where each triplet indicates the relationship of three images, i.e., image p_t is similar to image p_t^+ and dissimilar to p_t^−. Typically, we can pose such a triplet relationship as the following constraint:

d(p_t, p_t^+) ≤ d(p_t, p_t^−) − 1;  ∀t = 1, . . . , T;    (2)

where 1 is a margin parameter to ensure a sufficiently large difference.

The above discussion generally assumes DML on single-modal data. We now generalize it to multi-modal data. In particular, we assume each image can be represented by a total of m feature spaces (modalities) and assume each feature space F_i is an n_i-dimensional vector space, i.e., F_i = R^{n_i}. The general idea of our multi-modal distance metric learning is to learn a separate optimal distance metric M^(i) ∈ R^{n_i×n_i} for each feature space as

d_i(p_1^(i), p_2^(i)) = (p_1^(i) − p_2^(i))^T M^(i) (p_1^(i) − p_2^(i));  M^(i) ⪰ 0,

and meanwhile learn an optimal combination of the distance functions from different modalities to obtain the final optimal distance function:

d(p_1, p_2) = Σ_{i=1}^{m} θ^(i) d_i(p_1^(i), p_2^(i)) = Σ_{i=1}^{m} θ^(i) (p_1^(i) − p_2^(i))^T M^(i) (p_1^(i) − p_2^(i)),

where θ^(i) ∈ [0, 1] denotes the combination weight for the i-th modality and p_1^(i), p_2^(i) ∈ F_i denote the visual features on the space of the i-th modality. In the following, without loss of clarity, we will simply denote d_i(p_1^(i), p_2^(i)) as d_i(p_1, p_2) by removing the superscript.
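To make the combined distance concrete, the following minimal sketch (illustrative NumPy code with hypothetical names, not part of the paper) evaluates d_i on every modality and forms the θ-weighted combination.

import numpy as np

def modal_distance(p1, p2, M):
    """Squared Mahalanobis distance (p1 - p2)^T M (p1 - p2) on one modality."""
    diff = p1 - p2
    return float(diff @ M @ diff)

def combined_distance(x1, x2, metrics, theta):
    """Weighted combination over m modalities: sum_i theta[i] * d_i(x1[i], x2[i]).
    x1 and x2 are lists of per-modality feature vectors; metrics are the M^(i)."""
    return sum(t * modal_distance(a, b, M)
               for t, a, b, M in zip(theta, x1, x2, metrics))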
To simultaneously learn both the optimal combination weights θ = (θ^(1), . . . , θ^(m)) and the optimal individual distance metrics {M^(i) | i = 1, . . . , m}, we cast the multi-modal distance metric learning problem into the following optimization task:

min_{θ∈Δ} min_{M^(i)⪰0}  (1/2) Σ_{i=1}^{m} ||M^(i)||_F^2 + C Σ_{t=1}^{T} ℓ_t((p_t, p_t^+, p_t^−); d)    (3)

where ||·||_F denotes the Frobenius norm, Δ = {θ | Σ_{i=1}^{m} θ^(i) = 1, θ^(i) ∈ [0, 1], ∀i}, and ℓ_t(·) is a loss function such as ℓ((p_t, p_t^+, p_t^−); d) = max(0, d(p_t, p_t^+) − d(p_t, p_t^−) + 1). The constraints in Eqn. (2) are implicitly imposed in the above hinge loss function, and C is a regularization parameter to prevent overfitting.

3.4 OMDML Algorithm
One way is to directly solve the optimization task in Eqn. (3) via a batch learning approach. This is however not a good solution, primarily for two key reasons:
• A critical drawback of such a batch training solution is that it suffers from extremely high re-training cost, i.e., whenever there is a new training instance, the entire model has to be completely re-trained from scratch, making it non-scalable for real-world applications;
• Besides, solving Eqn. (3) directly can be computationally very expensive for a large amount of training data.
To address these challenges, we present an online learning algorithm to tackle the multi-modal distance metric learning task.

The key challenge in online multi-modal distance metric learning tasks is to develop an efficient and scalable learning scheme that can optimize both the distance metric on each individual modality and meanwhile optimize the combinational weights of different modalities. To this end, we propose to explore an online distance metric learning algorithm, i.e., a variant of OASIS [20] and PA [44], to learn the individual distance metrics, and apply the well-known Hedge algorithm [48] to learn the optimal combinational weights. We discuss each of the two learning tasks in detail below.
Let us denote by M_t^(i) the matrix on the i-th modality at step t. To learn the optimal metric M_t^(i) on an individual modality, following the similar ideas of OASIS [20] and PA [44], we can formulate the optimization task of the online distance metric learning as follows:

M_{t+1}^(i) = arg min_M  (1/2) ||M − M_t^(i)||_F^2 + C ξ,   s.t. ℓ((p_t, p_t^+, p_t^−); d_i) ≤ ξ, ξ ≥ 0    (4)

It is not difficult to derive the closed-form solution:

M_{t+1}^(i) = M_t^(i) − τ_t^(i) V_t^(i)    (5)

where τ_t^(i) and V_t^(i) are computed as follows:

τ_t^(i) = min(C, ℓ((p_t, p_t^+, p_t^−); d_i) / ||V_t^(i)||_F^2),
V_t^(i) = (p_t − p_t^+)(p_t − p_t^+)^T − (p_t − p_t^−)(p_t − p_t^−)^T.

In the above, we omit the superscript (i) for each p_t.
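The closed-form step above takes only a few lines; the sketch below (illustrative NumPy code for a single modality, with hypothetical names) computes the hinge loss, V_t and τ_t, and leaves the PSD correction to the projection discussed next.

import numpy as np

def omdml_metric_step(M, p, p_pos, p_neg, C=1.0):
    """PA-style closed-form update of Eq. (5) on one modality.
    Returns the (possibly non-PSD) updated metric; see the PSD projection below."""
    loss = max(0.0, (p - p_pos) @ M @ (p - p_pos)
                    - (p - p_neg) @ M @ (p - p_neg) + 1.0)   # hinge loss of Eq. (3)
    if loss == 0.0:
        return M
    d_pos, d_neg = p - p_pos, p - p_neg
    V = np.outer(d_pos, d_pos) - np.outer(d_neg, d_neg)      # V_t of Eq. (5)
    tau = min(C, loss / (np.linalg.norm(V, 'fro') ** 2))     # tau_t of Eq. (5)
    return M - tau * V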
One main issue of the above solution, which also exists in OASIS [20], is that it does not guarantee the resulting matrix M_{t+1}^(i) is positive semi-definite (PSD), which is not desirable for DML. To fix this issue, at the end of each learning iteration, we will need to perform a PSD projection of the matrix M^(i) onto the PSD domain:

M_{t+1}^(i) ← PSD(M_{t+1}^(i)).
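The PSD(·) operator is not spelled out in the text; the sketch below assumes the standard choice, namely the Euclidean projection onto the PSD cone obtained by zeroing out the negative eigenvalues.

import numpy as np

def psd_project(M):
    """Project a symmetric matrix onto the PSD cone by zeroing out the
    negative eigenvalues (the standard Euclidean projection)."""
    M = (M + M.T) / 2.0                      # symmetrize against numerical drift
    eigvals, eigvecs = np.linalg.eigh(M)     # O(n^3) for a dense matrix
    eigvals = np.clip(eigvals, 0.0, None)
    return (eigvecs * eigvals) @ eigvecs.T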
Another key task of multi-modal DML is to learn the optimal combinational weights θ = (θ^(1), . . . , θ^(m)), where θ^(i) is set to 1/m at the beginning of the learning task. We apply the well-known Hedge algorithm [48] to update the combinational weights online, which is a simple and effective algorithm for online learning with expert advice. In particular, given a triplet training instance (p_t, p_t^+, p_t^−), at the end of each online learning iteration, the weight is updated as follows:

θ_{t+1}^(i) = θ_t^(i) β^{z_t^(i)} / Σ_{i=1}^{m} θ_t^(i) β^{z_t^(i)}    (6)

where β ∈ (0, 1) is a discounting parameter to penalize the poor modality, and z_t^(i) is an indicator of the ranking result on the current instance, i.e., z_t^(i) = I(f_t^(i) > 0) = I(d_i(p_t, p_t^+) − d_i(p_t, p_t^−) > 0), which outputs 1 when f_t^(i) = d_i(p_t, p_t^+) − d_i(p_t, p_t^−) > 0 and 0 otherwise. In particular, f_t^(i) > 0, namely d_i(p_t, p_t^+) > d_i(p_t, p_t^−), indicates that the current i-th metric makes a mistake on predicting the ranking of the triplet (p_t, p_t^+, p_t^−).

Finally, Algorithm 1 summarizes the details of the proposed Online Multi-modal Distance Metric Learning (OMDML) algorithm.

Algorithm 1 OMDML — Online Multi-modal DML
1: INPUT:
   • Discount weight: β ∈ (0, 1)
   • Regularization parameter: C > 0
   • Margin parameter: γ ≥ 0
2: Initialization: θ_1^(i) = 1/m, M_1^(i) = I, ∀i = 1, . . . , m
3: for t = 1, 2, . . . , T do
4:   Receive: (p_t, p_t^+, p_t^−)
5:   f_t^(i) = d_i(p_t, p_t^+) − d_i(p_t, p_t^−), ∀i = 1, . . . , m
6:   f_t = Σ_{i=1}^{m} θ_t^(i) f_t^(i)
7:   if f_t + γ > 0 then
8:     for i = 1, 2, . . . , m do
9:       Set z_t^(i) = I(f_t^(i) > 0)
10:      Update θ_{t+1}^(i) ← θ_t^(i) β^{z_t^(i)}
11:      Update M_{t+1}^(i) ← M_t^(i) − τ_t^(i) V_t^(i) by Eq. (5)
12:      Update M_{t+1}^(i) ← PSD(M_{t+1}^(i))
13:    end for
14:    Θ_{t+1} = Σ_{i=1}^{m} θ_{t+1}^(i)
15:    θ_{t+1}^(i) ← θ_{t+1}^(i) / Θ_{t+1}, ∀i = 1, . . . , m
16:  end if
17: end for

Remark on space and time complexity. The space complexity of the algorithm is O(Σ_{i=1}^{m} n_i^2). Denoting n = max(n_1, . . . , n_m), the worst-case space complexity is simply O(m × n^2). The overall time complexity is linear with respect to T — the total number of training triplets. The most computationally intensive step is the PSD projection, which can be O(n^3) for a dense matrix. Hence, the worst-case overall time complexity is O(T × m × n^3).

3.5 Low-Rank Online Multi-modal Distance Metric Learning Algorithm
One critical drawback of the proposed OMDML algorithm in Algorithm 1 is the PSD projection step, which can be computationally intensive when some feature space is of high dimensionality. In this section, we present a low-rank learning algorithm to significantly improve the efficiency and scalability of OMDML.

Instead of learning a full-rank matrix, for each M^(i), our goal is to learn a low-rank decomposition, i.e.,

M^(i) := W^(i)^T W^(i),

where W^(i) ∈ R^{r_i×n_i} and r_i ≪ n_i. Thus, for any two images p_1 and p_2, the distance function on the i-th modality can be expressed as:

d_i(p_1, p_2) = (p_1 − p_2)^T W^(i)^T W^(i) (p_1 − p_2).

Following the similar idea in the previous section, we can apply online learning techniques to solve W_t^(i) and θ_t, respectively. In this section, we consider the Online Gradient Descent (OGD) approach to solve W_t^(i). In particular, we denote by

ℓ_t^(i) = ℓ((p_t, p_t^+, p_t^−); d_i) = max(0, d_i(p_t, p_t^+) − d_i(p_t, p_t^−) + 1),

and introduce the following notation

q_t = W_t^(i) p_t,  q_t^+ = W_t^(i) p_t^+,  q_t^− = W_t^(i) p_t^−,

so that we can compute the gradient of ℓ_t^(i) with respect to W^(i):

∇_t W^(i) = ∂ℓ_t / ∂W^(i)
          = Σ_{j=1}^{r_i} [ (∂ℓ_t/∂q_{j,t}) (∂q_{j,t}/∂W^(i)) + (∂ℓ_t/∂q_{j,t}^+) (∂q_{j,t}^+/∂W^(i)) + (∂ℓ_t/∂q_{j,t}^−) (∂q_{j,t}^−/∂W^(i)) ] |_{W^(i)=W_t^(i)}
          = 2(−q_t^+ + q_t^−) p_t^T + 2(−q_t + q_t^+)(p_t^+)^T + 2(q_t − q_t^−)(p_t^−)^T,

where q_{j,t} is the j-th entry of q_t.

We then follow the idea of Online Gradient Descent [49] to update W_{t+1}^(i) of each modality as follows:

W_{t+1}^(i) ← W_t^(i) − η ∇_t W^(i)    (7)

where η is a learning rate parameter.

Similarly, we also apply the Hedge algorithm as introduced in the previous section to update the combinational weights θ_t. Finally, Algorithm 2 summarizes the details of the proposed Low-rank Online Multi-modal Metric Learning (LOMDML) algorithm.

Clearly this algorithm naturally preserves the PSD property of the resulting distance metric M^(i) = W^(i)^T W^(i) and thus avoids the need of performing the intensive PSD projection. By assuming all r_1 = . . . = r_m = r and n = max(n_1, . . . , n_m), the overall time complexity of the algorithm is O(T × m × r × n).
Algorithm 2 LOMDML — Low-rank OMDML algorithm
1: INPUT:
   • Discount weight parameter: β ∈ (0, 1)
   • Margin parameter: γ > 0
   • Learning rate parameter: η > 0
2: Initialization: θ_1^(i) = 1/m, W_1^(i), ∀i = 1, . . . , m
3: for t = 1, 2, . . . , T do
4:   Receive: (p_t, p_t^+, p_t^−)
5:   Compute: f_t^(i) = d_i(p_t, p_t^+) − d_i(p_t, p_t^−), i = 1, . . . , m
6:   Compute: f_t = Σ_{i=1}^{m} θ_t^(i) f_t^(i)
7:   if f_t + γ > 0 then
8:     for i = 1, 2, . . . , m do
9:       Set z_t^(i) = I(f_t^(i) > 0)
10:      Update θ_{t+1}^(i) ← θ_t^(i) β^{z_t^(i)}
11:      W_{t+1}^(i) ← W_t^(i) − η ∇_t W^(i) by Eq. (7)
12:    end for
13:    Θ_{t+1} = Σ_{i=1}^{m} θ_{t+1}^(i)
14:    θ_{t+1}^(i) ← θ_{t+1}^(i) / Θ_{t+1}, i = 1, . . . , m
15:  end if
16: end for
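As a concrete illustration of lines 5-11 of Algorithm 2 for one modality, the sketch below applies the OGD step of Eq. (7) with the gradient derived above (illustrative NumPy code; for simplicity it gates the update on the per-modality hinge loss, whereas Algorithm 2 gates all modalities on the combined score f_t + γ).

import numpy as np

def lomdml_step(W, p, p_pos, p_neg, eta=0.01):
    """One OGD step on W for a single modality (Algorithm 2, lines 5-11).
    Distances are computed in the projected space q = W p."""
    q, q_pos, q_neg = W @ p, W @ p_pos, W @ p_neg
    f = np.sum((q - q_pos) ** 2) - np.sum((q - q_neg) ** 2)   # d_i(p,p+) - d_i(p,p-)
    mistake = f > 0                                            # z_t^(i) for the Hedge update
    if f + 1.0 > 0:                                            # hinge loss is nonzero
        grad = (2 * np.outer(q_neg - q_pos, p)
                + 2 * np.outer(q_pos - q, p_pos)
                + 2 * np.outer(q - q_neg, p_neg))              # gradient of Eq. (7)
        W = W - eta * grad
    return W, mistake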
4 THEORETICAL ANALYSIS

We now analyze the theoretical performance of the proposed algorithms. To be concise, we give a theorem for the bound of mistakes made by Algorithm 1 for predicting the relative similarity of the sequence of triplet training instances. A similar result can be derived for Algorithm 2.

For the convenience of discussions in this section, we define:

z_t^(i) = I(f_t^(i) > 0),

where I(x) is an indicator function that outputs 1 when x is true and 0 otherwise. We further define the optimal margin similarity function error for M^(i) with respect to a collection of training examples P = {(p_t, p_t^+, p_t^−), t = 1, . . . , T} as

F(M^(i), ℓ, P) = min_{M^(i)} [ ( ||M^(i) − I||_F^2 + 2C Σ_{t=1}^{T} ℓ_t(d_i) ) / min(C, 1) ],

where ℓ_t(d_i) denotes ℓ((p_t, p_t^+, p_t^−); d_i). We then have the following theorem for the mistake bound of the proposed OMDML algorithm.

Theorem 1. After receiving a sequence of T training examples, denoted by P = {(p_t, p_t^+, p_t^−), t = 1, . . . , T}, the number of mistakes M on predicting the ranking of (p_t, p_t^+, p_t^−) made by running Algorithm 1, denoted by

M = Σ_{t=1}^{T} I(f_t > 0) = Σ_{t=1}^{T} I( Σ_{i=1}^{m} θ_t^(i) f_t^(i) > 0 ),

is bounded as follows:

M ≤ (2 ln(1/β) / (1 − β)) min_{1≤i≤m} Σ_{t=1}^{T} z_t^(i) + (2 ln m) / (1 − β)
  ≤ (2 ln(1/β) / (1 − β)) min_{1≤i≤m} F(M^(i), ℓ, P) + (2 ln m) / (1 − β).

By choosing β = √T / (√T + √(ln m)), we then have

M ≤ 2 (1 + √(ln m / T)) min_{1≤i≤m} F(M^(i), ℓ, P) + ln m + √(T ln m).

In general, it is not difficult to prove the above theorem by combining the results of the Hedge algorithm and the PA online learning, similar to the technique used in [51]. More details about the proof can be found in the online supplemental file^1. Basically, the above theorem indicates that the total number of mistakes of the proposed algorithm is bounded by O(√T) compared with the optimal single metric.

5 EXPERIMENTS

In this section, we conduct an extensive set of experiments to evaluate the efficacy of the proposed algorithms for similarity search with multiple types of visual features in CBIR.

5.1 Experimental Testbeds
We adopt four publicly-available image data sets in our experiments, which have been widely adopted as benchmarks for content-based image retrieval, image classification and recognition tasks. TABLE 1 summarizes the statistics of these databases.

TABLE 1
List of image databases in our testbed.

Datasets        | size      | # classes | avg # per class
Caltech101      | 8,677     | 101       | 85.91
Indoor          | 15,620    | 67        | 233.14
ImageCLEF       | 7,157     | 20        | 367.85
Corel           | 5,000     | 50        | 100
ImageCLEFFlickr | 1,007,157 | 21        | 47959.86

The first testbed is the "caltech101" dataset^2, which has been widely adopted for object recognition and image retrieval [56], [20]. This dataset contains 101 object categories and 8,677 images.
The second testbed is the "indoor" dataset^3, which was used for recognizing indoor scenes [57]. This dataset consists of 67 indoor categories and 15,620 images. The numbers of images in different categories are diverse, but each category contains at least 100 images. It is further divided into 5 subsets: store, home, public spaces, leisure, and working place. We simply consider it as a dataset of 67 categories and evaluate different algorithms on the whole indoor collection.
The third testbed is the "ImageCLEF" dataset^4, which was also used in [58]. It is a medical image dataset and has 7,157 images in 20 categories.
The fourth testbed is the "Corel" dataset [7], which consists of photos from COREL image CDs. It has 50 categories, each of which has exactly 100 images randomly selected from related examples in COREL image CDs.

1. http://omdml.stevenhoi.org/
2. http://www.vision.caltech.edu/Image_Datasets/Caltech101/
3. http://web.mit.edu/torralba/www/indoor.html
4. http://imageclef.org/
We also combine "ImageCLEF" with a collection of one million social photos crawled from Flickr; this larger set is named "ImageCLEFFlickr". We treat the Flickr photos as a special class of background noisy photos, which are mainly used to test the scalability of our algorithms.

5.2 Experimental Setup
For each database, we split the whole dataset into three disjoint partitions: a training set, a test set, and a validation set. In particular, we randomly choose 500 images to form a test set, and another 500 images to build up a validation set. The remaining images are used to form a training set for learning similarity functions.
To generate side information in the form of triplet instances for learning the ranking functions, we sample triplet constraints from the images in the training set according to their ground truth labels. Specifically, we generate a triplet instance by randomly sampling two images belonging to the same class and one image from a different class. In total, we generate 100K triplet instances for each standard dataset (except for the small-scale and large-scale experiments).
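A minimal sketch of this sampling scheme is given below (illustrative Python with hypothetical names; the paper does not publish its sampling code).

import random

def sample_triplets(labels, num_triplets=100_000, seed=0):
    """Sample (anchor, positive, negative) index triplets from class labels:
    two images from the same class and one image from a different class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, c in enumerate(labels):
        by_class.setdefault(c, []).append(idx)
    classes = [c for c, idxs in by_class.items() if len(idxs) >= 2]
    triplets = []
    while len(triplets) < num_triplets:
        c = rng.choice(classes)
        anchor, pos = rng.sample(by_class[c], 2)
        neg_class = rng.choice([k for k in by_class if k != c])
        triplets.append((anchor, pos, rng.choice(by_class[neg_class])))
    return triplets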
To fairly evaluate different algorithms, we choose their parameters by following the same cross-validation scheme. For simplicity, we empirically set r_i = r = 50 for the i-th modality in the LOMDML algorithm and set the maximum number of iterations to 500 for LMNN. To evaluate the retrieval performance, we adopt the mean Average Precision (mAP) and top-K retrieval accuracy. As a widely used IR metric, the mAP value averages the Average Precision (AP) values of all the queries, each of which denotes the area under the precision-recall curve for a query. The precision value is the ratio of related examples over total retrieved examples, while the recall value is the ratio of related examples retrieved over total related examples in the database.
Finally, we run all the experiments on a Linux machine with a 2.33GHz 8-core Intel Xeon CPU and 16GB RAM.
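For reference, the AP of a query is commonly computed as the mean of the precision values taken at the ranks where related images are retrieved, and mAP averages the AP over all queries; the sketch below is a standard formulation offered for illustration rather than the paper's evaluation code.

def average_precision(ranked_relevance):
    """AP of one query: mean of precision@k taken at each rank k where a
    related image is retrieved (ranked_relevance is a list of 0/1 flags)."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(all_queries):
    """mAP: average the AP values over all queries."""
    return sum(average_precision(r) for r in all_queries) / len(all_queries)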
5.3 Diverse Visual Features for Image Descriptors
We adopt both global and local feature descriptors to extract features for representing images in our experiments. Each feature will correspond to one modality in the algorithm. Before the feature extraction, we have preprocessed the images by resizing all the images to the scale of 500×500 pixels while keeping the aspect ratio unchanged.
Specifically, for global features, we extract five types of features to represent an image, namely
• Color histogram and color moments (n = 81),
• Edge direction histogram (n = 37),
• Gabor wavelets transformation (n = 120),
• Local binary pattern (n = 59),
• GIST features (n = 512).
For local features, we extract the bag-of-visual-words representation using two kinds of descriptors:
• SIFT — we adopt the Hessian-Affine interest region detector with a threshold of 500;
• SURF — we use the SURF detector with a threshold of 500.
For the clustering step, we adopt a forest of 16 kd-trees and search 2048 neighbors to speed up the clustering task. By combining different descriptors (SIFT/SURF) and vocabulary sizes (200/1000), we extract four types of local features: SIFT200, SIFT1000, SURF200 and SURF1000. Finally, we adopt the TF-IDF weighting scheme to generate the final bag-of-visual-words for describing the local features. For all learning algorithms, we normalize the feature vectors to ensure that every feature entry is in [0, 1].

5.4 Comparison Algorithms
To extensively evaluate the efficacy of our algorithms, we compare the proposed two online multi-modal DML algorithms, i.e., OMDML and LOMDML, against a number of existing representative DML algorithms, including RCA [30], LMNN [32], and OASIS [20]. As a heuristic baseline method, we also evaluate the squared Euclidean distance, denoted as "EUCL-*".
To adapt the existing DML methods for multi-modal image retrieval, we have implemented several variants of each DML algorithm by exploring three fusion strategies [59], [60]:
1) "Best" — applying DML for each modality individually and then selecting the best modality. We name these algorithms with the suffix "-B", e.g., RCA-B, in which we first learn metrics over each modality separately on the training set by Relevance Component Analysis (RCA) [30]. After that, we validate the retrieval performance of all metrics on the corresponding modality against the validation set, and then choose the modality with the highest mAP as the best modality. We report the mAP score over the best modality by ranking on the test set with RCA.
2) "Concatenation" — an early fusion approach by concatenating the features of all modalities before applying DML. We name these algorithms with the suffix "-C", e.g., LMNN-C, in which we first concatenate all types of features together, then learn the optimal metric on this combined feature space by LMNN [32], and finally evaluate the mAP score on the optimal metric.
3) "Uniform combination" — a late fusion approach by uniformly combining all modalities after metric learning. We name these algorithms with the suffix "-U", e.g., OASIS-U, in which we first learn an optimal metric by OASIS [20] for each modality, and then uniformly combine all distance functions for the final ranking.

5.5 Evaluation on Small-Scale Datasets
In this section, we build four small-scale data sets, named "Caltech101(S)", "Indoor(S)", "COREL(S)" and "ImageCLEF(S)", from the corresponding standard datasets by first choosing 10 object categories, and then randomly sampling 50 examples from each category. We adopt the 5 global features described above as the multi-modal inputs. To construct triplet constraints for online learning approaches, we generate all positive pairs (two images belonging to the same class), and for each positive pair we randomly select an image from the other different classes to form a triplet. In total, about 10K triplets are generated for each dataset. TABLE 2 summarizes
[Figure 2 appears here in the original: four plots of precision at top-K (K = 20 to 100) comparing EUCL-C, RCA-C, OASIS-C, RCA-U, OASIS-U and LOMDML on (a) "Corel", (b) "Caltech101", (c) "Indoor" and (d) "ImageCLEF".]
Fig. 2. Evaluation of average precision at Top-K results on the datasets.

the evaluation results on the small-scale data sets, from which we can draw the following observations.

TABLE 2
Evaluation of the mAP performance.

Alg.    | COREL(S) | Caltech101(S) | Indoor(S) | ImageCLEF(S)
Eucl-B  | 0.4431   | 0.4299        | 0.1726    | 0.4325
RCA-B   | 0.5097   | 0.4984        | 0.1915    | 0.4492
LMNN-B  | 0.4876   | 0.5462        | 0.1852    | 0.5231
OASIS-B | 0.4445   | 0.5072        | 0.1884    | 0.4424
Eucl-C  | 0.5220   | 0.4306        | 0.1842    | 0.4431
RCA-C   | 0.6437   | 0.6156        | 0.2078    | 0.5927
LMNN-C  | 0.5816   | 0.5894        | 0.2027    | 0.5821
OASIS-C | 0.5657   | 0.5441        | 0.2017    | 0.5618
Eucl-U  | 0.5220   | 0.4306        | 0.1842    | 0.4431
RCA-U   | 0.5625   | 0.4860        | 0.1894    | 0.4909
LMNN-U  | 0.6026   | 0.4282        | 0.2007    | 0.4647
OASIS-U | 0.5679   | 0.5419        | 0.1989    | 0.5338
OMDML   | 0.6620   | 0.6543        | 0.2113    | 0.6824
LOMDML  | 0.6975   | 0.6646        | 0.2250    | 0.7080

First of all, the two kinds of fusion strategies, i.e., early fusion (with suffix "-C") and late fusion (with suffix "-U"), generally tend to perform better than the best single metric approaches (with suffix "-B"). This is primarily because combining multiple types of features with learning could better explore the potential of all the features, which validates the importance of the proposed technique.
Second, some of the uniform combination algorithms (i.e., the late fusion strategy) failed to outperform the best single metric approach in some cases, e.g., "RCA-U" (compared with "RCA-B") and "LMNN-U" (compared with "LMNN-B") on Caltech101(S). This implies that uniform combination is not optimal for combining different kinds of features. Thus, it is critical to identify the effective features via machine learning and then assign them higher weights.
Third, among all the compared algorithms, the proposed OMDML and LOMDML algorithms outperform the other algorithms. Finally, it is interesting to observe that the proposed low-rank algorithm (LOMDML) not only improves the efficiency and scalability of OMDML, but also enhances the retrieval accuracy. This is probably because by learning metrics in an intrinsic lower-dimensional space, we may potentially avoid the impact of overfitting and noise issues.
TABLE 3
Running time cost (in sec.) on "COREL(S)".

RCA-C  | LMNN-C  | OASIS-C | RCA-U
5.07   | 1442.66 | 404.35  | 2.91
LMNN-U | OASIS-U | OMDML    | LOMDML
858.94 | 376.77  | 34765.13 | 22.11

TABLE 3 shows the running CPU time cost (in seconds) on the "COREL(S)" data set. We can see that the running time of LOMDML results in a speedup factor of 10 in comparison to OASIS, and the gain in efficiency will increase when the data set gets larger or the data dimensionality increases. Conversely, OMDML has an extremely high computational cost because a PSD projection is performed after each iteration, which can be O(n^3) for a dense matrix. A possible solution to tackle this problem is that we could perform the PSD projection after a batch of iterations, instead of after each iteration.

5.6 Evaluation on the Standard Datasets

TABLE 4
Evaluation of the mAP performance.

Alg.    | COREL  | Caltech101 | Indoor | ImageCLEF
Eucl-B  | 0.1877 | 0.2187     | 0.0469 | 0.5523
RCA-B   | 0.2305 | 0.2837     | 0.0499 | 0.6010
OASIS-B | 0.1958 | 0.3025     | 0.0522 | 0.6723
Eucl-C  | 0.2628 | 0.2259     | 0.0559 | 0.5752
RCA-C   | 0.2714 | 0.2473     | 0.0604 | 0.6272
OASIS-C | 0.3202 | 0.3660     | 0.0726 | 0.7394
Eucl-U  | 0.2628 | 0.2259     | 0.0559 | 0.5752
RCA-U   | 0.2992 | 0.2413     | 0.0565 | 0.6161
OASIS-U | 0.3594 | 0.3243     | 0.0705 | 0.6891
LOMDML  | 0.4137 | 0.4128     | 0.0804 | 0.8155

We further evaluate the proposed algorithms on the standard-sized image datasets. We exclude LMNN and OMDML because of their extremely high computational cost. Following the standard experimental setup with 5 global features and 4 local features, TABLE 4 summarizes the experimental results, Figure 2 presents the top-K precisions on the four datasets, and TABLE 5 shows the running time cost on the COREL dataset with 100K triplet instances. From the results, we observed that the proposed LOMDML algorithm considerably surpasses all the other approaches in most cases. This clearly validates the efficacy of the proposed algorithm for learning effective metrics on multi-modal data. Finally, in terms of the time cost, the proposed LOMDML algorithm is considerably more efficient and scalable than the other algorithms, making it practical for large-scale applications.

TABLE 5
Running time (in sec.) on "COREL".

RCA-C  | OASIS-C  | RCA-U | OASIS-U | LOMDML
468.19 | 65060.93 | 184.3 | 8781.54 | 789.81

Remark. We note that the learnt metric/function can be easily integrated into a generic image indexing and retrieval system, i.e., by performing a linear projection for each image instance p by p ← Wp. The time cost for retrieval with OMDML is thus the same as with the original Euclidean distance, while the time cost with LOMDML is the same as the Euclidean distance on the dimension-reduced feature space. To avoid trivially redundant results, we thus skip the time cost evaluation of retrieval in our experiments.
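As an illustration of this remark for the low-rank case, the database can be projected once with W, after which ranking a query reduces to a squared Euclidean distance in the r-dimensional space (illustrative sketch; array names are assumptions).

import numpy as np

def project_database(W, database):
    """One-off linear projection p <- W p for every image in the database."""
    return database @ W.T                      # (N, n) -> (N, r)

def rank_by_query(W, query, projected_db, top_k=100):
    """Rank database images for a query by squared Euclidean distance
    in the projected (dimension-reduced) feature space."""
    q = W @ query
    dists = np.sum((projected_db - q) ** 2, axis=1)
    return np.argsort(dists)[:top_k]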

5.7 Evaluation of online mistake rate of individual metric learning on each single modality
To further examine how the proposed LOMDML algorithm performs in comparison to individual metric learning on each single modality, we evaluate the online average mistake rate of the proposed LOMDML algorithm and single-modal metric learning schemes on each individual modality. Figure 3 shows the experimental results on the "COREL" data set. Several observations can be drawn from the results as follows.
First of all, we notice that for all the schemes, the online cumulative mistake rate consistently decreases when the number of iterations increases in the online learning process. Second, among all kinds of features, we found that the scheme of single-modal metric learning on "Surf1000" achieved the best performance. Finally, by comparing the proposed LOMDML scheme and the best single-modal metric learning, we found that LOMDML consistently achieves a smaller mistake rate than that of the best single-modal metric learning scheme in the entire online learning process. This encouraging result again validates the efficacy of the proposed multi-modal online learning scheme for combining multiple modalities in an effective way.

[Figure 3 appears here in the original: online mistake rate versus the number of iterations t (up to 10^5) on "Corel" for LOMDML and single-modal metric learning on Edge, Sift200, GIST, Gabor, Color, Surf200, LBP, Sift1000 and Surf1000.]
Fig. 3. Evaluation of online mistake rates of LOMDML and single-modal metric learning on individual modalities on the "Corel" dataset.

5.8 Comparison with Online Multi-modal Distance Learning (OMDL-LR) with Multiple Kernels
In this section, we compare the proposed LOMDML algorithm with an existing Online Multi-modal Distance Learning method (OMDL-LR), a kernel-based low-rank online learning approach that learns distance functions on multi-modal data by combining multiple kernels.
We evaluate the mAP performance and the training time cost of OMDL-LR on four datasets, "COREL(S)", "Caltech101(S)", "Indoor(S)" and "ImageCLEF(S)", under the same experimental setting as in the previous sections. The parameters for the OMDL-LR algorithm are set as follows: (i) d_LR, the dimensionality of the low-rank decomposition for all the models, is set to 50, the same as the rank setting of r for the LOMDML algorithm; (ii) the other hyper-parameters, including C1, C2, η and the number of nearest neighbors ("NN") for the graph Laplacian, are determined by grid search on a separate validation set. Fig. 4 shows the mAP with respect to "NN" on each dataset.

TABLE 6
Comparison between LOMDML and OMDL-LR (gaussianmeanvar).

Metric              | Dataset       | LOMDML | OMDL-LR
mAP                 | COREL(S)      | 0.6975 | 0.6693
mAP                 | Caltech101(S) | 0.6646 | 0.5994
mAP                 | Indoor(S)     | 0.2250 | 0.2088
mAP                 | ImageCLEF(S)  | 0.7080 | 0.6729
Time cost (in sec.) | COREL(S)      | 22.11  | 209.57

[Figure 4 appears here in the original: four panels — COREL(S), Caltech101(S), Indoor(S) and ImageCLEF(S) — plotting the mAP of OMDL-LR against the number of nearest neighbors (0 to 20).]
Fig. 4. Evaluation of the mAP (y-axis) of OMDL-LR w.r.t. the number of Nearest Neighbors (x-axis).

From the comparison results in TABLE 6, we observed that LOMDML is even better than OMDL-LR in terms of the mAP performance. This may seem counterintuitive as OMDL-LR is a kernel-based approach. However, we conjecture that this is primarily because OMDL-LR fairly depends on a good selection of the underlying kernels and the parameters of the kernel functions. With carefully selected kernels, OMDL-LR would likely achieve better results. However, how to tune and find the best kernels is beyond the scope of this paper. In terms of training time cost, we observed that LOMDML is considerably more efficient than OMDL-LR. Similar to OMDML, the most computationally intensive step in OMDL-LR is the PSD projection, which can be O(r^3) for a dense matrix; thus the overall time complexity is O(T × m × r^3). In the above experiment, the dimensions of the raw features range from 37 to 512, which are much smaller than r^2 = 2500. Thus, LOMDML consumes much less time than OMDL-LR.

5.9 Evaluation on the Large-scale Dataset
To examine its scalability, we apply the proposed algorithm to a large-scale image retrieval application on "ImageCLEF+Flickr", which has over one million images and 300K triplet training data. TABLE 7 shows the mAP performance of the compared algorithms.

TABLE 7
Evaluation of mAP on the "ImageCLEF+Flickr" dataset.

Eucl-C | RCA-C  | OASIS-C | RCA-U  | OASIS-U | LOMDML
0.5766 | 0.6163 | 0.7161  | 0.6219 | 0.7028  | 0.7413

Clearly, our proposed algorithm LOMDML achieves the best mAP. Figure 5 presents the top-K precisions on ImageCLEF+Flickr. We can make the similar observation that our proposed methods significantly outperform the state of the art in terms of precision. In short, the proposed algorithm significantly outperforms the state of the art in terms of both mAP and retrieval accuracy performance measures.

[Figure 5 appears here in the original: precision at top-K (K = 20 to 100) on "ImageCLEF+Flickr" for EUCL-C, RCA-C, OASIS-C, RCA-U, OASIS-U and LOMDML.]
Fig. 5. Precision at Top-K on "ImageCLEF+Flickr".

5.10 Qualitative Comparison
Finally, to examine the qualitative retrieval performance, we randomly sample some query images from the query set, and compare the qualitative image similarity search results of different algorithms. Figure 6 shows the comparison of retrieval results on the "COREL" and "Caltech101" datasets using different algorithms. From the visual results, we can see that LOMDML generally returns more related results than the other baselines.

6 CONCLUSIONS

This paper investigated a novel family of online multi-modal distance metric learning (OMDML) algorithms for CBIR tasks by exploiting multiple types of features. We pinpointed some major limitations of traditional DML approaches in practice, and presented the online multi-modal DML method which simultaneously learns both the optimal distance metric on each individual feature space and the optimal combination of multiple metrics on different types of features. Further, we proposed the low-rank online multi-modal DML algorithm (LOMDML), which not only runs more efficiently and scalably, but also achieves the state-of-the-art performance among
12

extend our framework in resolving other types of multimodal [22] R. Salakhutdinov and G. Hinton, “Semantic hashing,” International
data analytics tasks beyond image retrieval. Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, Jul.

ACKNOWLEDGEMENTS
This work was supported by a Singapore MOE Tier-1 research grant from Singapore Management University, Singapore.
Pengcheng Wu received his PhD degree from the School of Computer Engineering at the Nanyang Technological University, Singapore, and his bachelor degree from Xiamen University, P.R. China. He is currently a research fellow in the School of Information Systems, Singapore Management University. His research interests include multimedia information retrieval, machine learning and data mining.

Steven C. H. Hoi is currently an Associate Professor of the School of Information Systems, Singapore Management University, Singapore. Prior to joining SMU, he was an Associate Professor with Nanyang Technological University, Singapore. He received his Bachelor degree from Tsinghua University, P.R. China, in 2002, and his Ph.D degree in computer science and engineering from The Chinese University of Hong Kong in 2006. His research interests are machine learning and data mining and their applications to multimedia information retrieval (image and video retrieval), social media and web mining, and computational finance, and he has published over 150 refereed papers in top conferences and journals in these areas. He has served as Associate Editor-in-Chief for the Neurocomputing Journal, general co-chair for the ACM SIGMM Workshops on Social Media (WSM'09, WSM'10, WSM'11), program co-chair for the fourth Asian Conference on Machine Learning (ACML'12), book editor for “Social Media Modeling and Computing”, guest editor for ACM Transactions on Intelligent Systems and Technology (ACM TIST), technical PC member for many international conferences, and external reviewer for many top journals and worldwide funding agencies, including NSF in the US and RGC in Hong Kong.

Peilin Zhao received his PhD from the School of Computer Engineering at the Nanyang Technological University, Singapore, in 2012 and his bachelor degree from Zhejiang University, Hangzhou, P.R. China, in 2008. His research interests are statistical machine learning and data mining.
Chunyan Miao is an Associate Professor in the School of Computer Engineering at Nanyang Technological University (NTU). Her research focus is on infusing intelligent agents into interactive new media (virtual, mixed, mobile and pervasive media) to create novel experiences and dimensions in game design, interactive narrative and other real-world agent systems. She has done significant research work in her research areas and has published many top-quality international conference and journal papers.
Zhi-Yong Liu received his Bachelor degree of Engineering from Tianjin University in 1997, his Master degree of Engineering from the Chinese Academy of Sciences in 2000, and his Ph.D degree from the Chinese University of Hong Kong in 2003. He is currently a professor at the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, China. His research interests include image analysis, pattern recognition, machine learning and computer vision.
Fig. 6. Qualitative evaluation of the top-5 retrieved images by different algorithms. For each block, the first image is the query, and the first to sixth rows show the results of “Eucl-C”, “RCA-C”, “OASIS-C”, “RCA-U”, “OASIS-U” and “LOMDML”, respectively. The left column is from the “Corel” dataset and the right column is from the “Caltech101” dataset.