A Review on Multi-task Metric Learning
Yang et al., Big Data Analytics (2018) 3:3

*Correspondence: [email protected]
1 National Laboratory of Pattern Recognition, 95 East Zhongguancun Road, 100190 Beijing, China. Full list of author information is available at the end of the article.

Abstract
Distance metric plays an important role in machine learning which is crucial to the
performance of a range of algorithms. Metric learning, which refers to learning a proper
distance metric for a particular task, has attracted much attention in machine learning.
In particular, multi-task learning deals with the scenario where there are multiple
related metric learning tasks. By jointly training these tasks, useful information is shared
among the tasks, which significantly improves their performances. This paper reviews
the literature on multi-task metric learning. Various methods are investigated
systematically and categorized into four families. The central ideas of these methods
are introduced in detail, followed by some representative applications. Finally, we
conclude the review and propose a number of future work directions.
Keywords: Multi-task learning, Metric learning, Review
Background
In the area of machine learning, pattern recognition, and data mining, the concept of
distance metric usually plays an important role. For many algorithms, a proper distance
metric is critical to their performance. For example, nearest neighbor classification
relies on the metric to identify the nearest neighbors and thereby determine the class of a sample, whilst
k-means clustering uses the metric to determine which cluster a sample should belong to.
The metric is usually used as a measure of the similarity or dissimilarity, and there are
various types of pre-defined distance metrics, such as Euclidean distance, cosine simi-
larity, Hamming distance, etc. However, in practical applications, these general-purpose
metrics are often insufficient to capture the particular properties of various tasks. There-
fore, researchers propose learning a metric from data for a particular task, to improve
algorithm performance. This is termed metric learning [1–7].
With the advent of data science, challenging and evolving problems have arisen.
Obtaining training data is a costly process, hence complex models are being trained
on small datasets, resulting in poor generalization. Alongside this the number of tasks
to be learnt has increased significantly. To overcome these problems, multi-task learn-
ing is proposed [8–13]. It aims to consider multiple tasks simultaneously at a higher
level, whilst transferring useful information among different tasks to improve their
performances.
Since multi-task learning was proposed by Caruana [8] in 1997, various strategies have
been designed based on different assumptions. There are also some closely related topics,
such as transfer learning [14, 15], domain adaptation [16], meta-learning [17], life-long
learning [18], learning to learn [19], etc. In spite of some minor discrepancies among
them, they share the same basic idea that the performance is improved by considering
multiple learning tasks jointly and sharing information with other tasks.
Under such a background, it is natural to consider the problem of multi-task metric
learning. However, most multi-task learning algorithms designed for traditional models
are difficult to apply to metric learning algorithms due to the obvious differences
between the two kinds of models. To resolve this problem, a series of multi-task met-
ric learning approaches are specifically designed for the metric learning models. By
properly coupling multiple metric learning tasks, their performances are effectively
improved.
Metric learning has the particularity that its effect on performance is only observed
indirectly, through the algorithm that relies on the metric. This requires a way of con-
structing the multi-task learning framework different from that of traditional models. As far as we know,
there is at present no review on multi-task metric learning, hence this paper will give
a general overview of the existing works.
The rest of the paper is organized as follows. First we provide an overview of the
basic concepts of metric learning and briefly introduce multi-task metric learning. Next,
various strategies of multi-task metric learning approaches are reviewed. We then intro-
duce some representative applications of multi-task metric learning, and conclude with a
discussion on potential future issues.
Overview
In this section, we first provide an overview of metric learning, including its concept and
several representative algorithms. Then a general description about multi-task metric
learning is presented, leaving the details of the algorithms for the next section.
Metric learning algorithms are typically trained with side-information given as pairwise
(positive/negative pair) constraints or relative (triplet) constraints. Using these constraints,
we briefly introduce the strategies of some metric learning
approaches. Xing’s method [1] aims to maximize the sum of distances between dis-
similar pairs while keeping the sum of squared distances between similar pairs to be
small. It is an example of learning with positive/negative pairs. Large Margin Nearest
Neighbors (LMNN) [2, 24] requires the k nearest neighbors to belong to the same class
and pushes out all the imposters (instances of other classes existing in the neighbor-
hood). The side-information is provided by the relative constraints. Information-theoretic
metric learning (ITML) [3], which is also built with positive/negative pairs, models the
problem with log-determinant. Sparse Metric Learning [6] uses the mixed L2,1 norm
to obtain a joint feature selection during metric learning, and Huang et al. [4, 5] pro-
poses a unified framework for Generalized Sparse Metric Learning (GSML). Robust
Metric Learning (RML) [25] deals with the noisy training constraints based on robust
optimization.
It is notable that learning a Mahalanobis matrix can also be regarded as learning a linear
transformation. For any symmetric positive semi-definite Mahalanobis matrix M, there is
a symmetric decomposition M = L^T L, and the distance can then be reformulated as

d_M(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j)
             = (x_i − x_j)^T L^T L (x_i − x_j)
             = (L x_i − L x_j)^T (L x_i − L x_j)
             = ‖L x_i − L x_j‖_2^2.    (1)
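For concreteness, the equivalence in (1) can be checked numerically; the following minimal NumPy sketch (with illustrative variable names, not taken from the original papers) verifies that the Mahalanobis distance induced by M = L^T L equals the squared Euclidean distance after applying L:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
L = rng.normal(size=(d, d))          # arbitrary linear transformation
M = L.T @ L                          # induced Mahalanobis matrix (PSD by construction)

x_i, x_j = rng.normal(size=d), rng.normal(size=d)
diff = x_i - x_j

d_mahalanobis = diff @ M @ diff                    # (x_i - x_j)^T M (x_i - x_j)
d_euclidean = np.sum((L @ x_i - L @ x_j) ** 2)     # ||L x_i - L x_j||_2^2

assert np.isclose(d_mahalanobis, d_euclidean)
```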
By (1), the Mahalanobis metric defined by M is equivalent to the Euclidean distance
after performing the linear transformation L, and thus metric learning can be also per-
formed by learning such a linear transformation. Neighbourhood Component Analysis
(NCA) [26] is an example of this class that optimizes the expected leave-one-out error of
a stochastic nearest neighbor classifier by learning a linear transformation. Furthermore,
the linear metric can be easily extended to the non-linear metric by replacing the linear
transformation L with a non-linear transformation f, so that the distance becomes
d_f(x_i, x_j) = ‖f(x_i) − f(x_j)‖_2^2.
To design a multi-task metric learning approach, two key questions have to be answered: (1) what
type of useful information is shared among different metric learning tasks; (2) how such
information is shared by the proposed model and algorithm. Parameswaran et al. [31]
proposes the first multi-task metric learning approach in 2010, and in the following years
a variety of strategies have been proposed for multi-task metric learning. We generally
categorize them into four families according to how the information is shared.
There are some representative works in each family and we will introduce them in detail
in the next section. Figure 1 gives a summary of the multi-task metric learning approaches
mentioned in this paper.
Fig. 1 A summary of multi-task metric learning approaches. This figure gives a summary of the approaches
mentioned in this paper, where the name of each method is under the branch of its corresponding type
Throughout this section, we use M to denote the Mahalanobis matrix to keep the notations uniform, which may be different from the
original papers.
Large margin multi-task metric learning (mt-LMNN) Parameswaran et al. [31] pro-
poses a multi-task learning model based on the idea to share a common composition of
the Mahalanobis matrices. It is motivated by the regularized multi-task learning (RMTL)
[9], and obtained by adapting RMTL to the large-margin nearest neighbor metric learning
(LMNN) [2, 24]. To couple multiple tasks, each Mahalanobis matrix is decomposed into
a common part M0 and a task-specific part Mt . Thus the distance between two points
x_i, x_j ∈ X defined by the metric of the t-th task is

d_t(x_i, x_j) = (x_i − x_j)^T (M_0 + M_t)(x_i − x_j).    (2)

By restricting that M_0 ⪰ 0 and M_t ⪰ 0, ∀t, the Mahalanobis matrix for each task is
ensured to be positive semi-definite, which induces a positive semi-definite metric. In this
model, M0 picks up the general trends across all tasks while Mt gathers the individual
information for each task. The obtained regularization of mt-LMNN is
γ_0 ‖M_0 − I‖_F^2 + Σ_{t=1}^T γ_t ‖M_t‖_F^2.    (3)
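As a rough illustration of how the shared and task-specific parts interact, the sketch below evaluates the per-task metric M_0 + M_t and the regularizer (3); it is a simplified sketch with placeholder matrices, not the actual mt-LMNN solver:

```python
import numpy as np

def mtlmnn_regularizer(M0, Ms, gamma0, gammas):
    """gamma_0 ||M_0 - I||_F^2 + sum_t gamma_t ||M_t||_F^2, as in Eq. (3)."""
    d = M0.shape[0]
    reg = gamma0 * np.linalg.norm(M0 - np.eye(d), "fro") ** 2
    reg += sum(g * np.linalg.norm(Mt, "fro") ** 2 for g, Mt in zip(gammas, Ms))
    return reg

def task_distance(M0, Mt, xi, xj):
    """Distance of the t-th task, using the combined metric M_0 + M_t as in Eq. (2)."""
    diff = xi - xj
    return diff @ (M0 + Mt) @ diff
```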
Multi-task multi-feature similarity learning (M²SL) Wang et al. [32] pro-
poses a multi-task multi-feature metric learning approach to adapt the metric learning to
large scale visual applications. For each sample, M types of features are extracted and the
metrics are learnt individually for each feature. For each feature channel, there are T tasks
and each task learns a distance metric. To make the information shared among tasks, the
Mahalanobis matrix of the t-th task in the m-th feature channel is defined to be a combi-
nation of a common part M_0^(m) and an individual part M_t^(m). Then the authors incorporate
such a formulation into the idealized kernel learning [33] and obtain the multi-feature
multi-task metric learning model as
min_{b,M}  (1/2) [ Σ_{m=1}^M (γ_0 / b_0^(m)) ‖M_0^(m)‖_F^2 + Σ_{t=1}^T Σ_{m=1}^M (γ_t / b_t^(m)) ‖M_t^(m)‖_F^2 ]
         + (C/N) Σ_{t=1}^T Σ_{ij∈S} ξ_t^{ij} + (η/2) Σ_{t=0}^T ‖b_t‖_p^2
s.t.  δ_t^{ij} (d_t^{ij} − d̃_t^{ij}) ≥ σ_t^{ij} − ξ_t^{ij},   ξ_t^{ij} ≥ 0,   b_t^{(m)} ≥ 0,   p > 1,   M_t^{(m)} ⪰ 0,
where the weights b_0^(m) and b_t^(m) correspond to the common part and discriminating parts respectively, and the last term is the regularization on these
weights.
Using this approach, the information contained in different tasks is shared among them
and the multiple features are used in a more effective way. It uses the same strategy as
mt-LMNN to construct the multi-task metric learning model and thus has similar
advantages and disadvantages.
Multi-task sparse compositional metric learning (mt-SCML) Shi et al. [34] proposes
a multi-task metric learning framework from the perspective of sparse combination. The
authors first propose a sparse compositional metric learning (SCML) approach which
regards a Mahalanobis matrix as a nonnegative weighted sum of K rank-1 positive semi-
definite matrices:
M = Σ_{i=1}^K w_i b_i b_i^T,  with w ≥ 0,    (4)
where the bi ’s are D-dimensional column vectors. Noting that the distance between any
two points (x, y) determined by M is calculated by
d_M^2(x, y) = (x − y)^T M (x − y) = Σ_{i=1}^K w_i (b_i^T (x − y))^2,
the vectors bi ’s span the common low-dimensional subspace in which the metric is
defined.
Using such a formulation, each rank-1 matrix is a basis and the metric can be refor-
mulated as a sparse combination of these bases. Then the metric learning is a process of
learning such weights, which is shown as
min_w  (1/|C|) Σ_{(x_i, x_j, x_k)∈C} L_w(x_i, x_j, x_k) + β‖w‖_1,
where L_w(x_i, x_j, x_k) = [1 + d_w^2(x_i, x_j) − d_w^2(x_i, x_k)]_+ is the hinge loss over triplets,
with [·]_+ = max(0, ·), and the ℓ_1 regularization encourages a sparse solution of w.
When there are T tasks to be learned together, the multi-task learning can be easily
obtained by applying a structure regularization on these weights. To be specific, the
authors assume that the different tasks share a common low-dimensional subspace for
the reconstruction weights, and use a mixed norm to obtain the structure sparsity. The
formulation of mt-SCML is shown as
min_W  Σ_{t=1}^T (1/|C_t|) Σ_{(x_i, x_j, x_k)∈C_t} L_{w_t}(x_i, x_j, x_k) + β‖W‖_{2,1},
where W is a T × K nonnegative matrix whose row w_t defines the reconstruction weight
vector for the t-th task, and ‖W‖_{2,1} is the ℓ_2/ℓ_1 mixed norm. It equals the ℓ_1 norm
applied to the ℓ_2 norms of the columns of W, which induces group sparsity at the
column level, i.e., it encourages some columns to be zero together and thus makes the
different tasks share the same reconstruction bases.
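For intuition, the following sketch computes the compositional squared distance of mt-SCML and the ℓ_{2,1} penalty on the weight matrix W (rows index tasks, columns index the shared rank-1 bases); the bases and weights here are placeholders rather than trained values:

```python
import numpy as np

def scml_sq_distance(w, B, x, y):
    """d_w^2(x, y) = sum_i w_i (b_i^T (x - y))^2, with basis matrix B of shape (K, D)."""
    proj = B @ (x - y)          # b_i^T (x - y) for all K bases at once
    return np.sum(w * proj ** 2)

def l21_norm(W):
    """||W||_{2,1}: the l1 norm of the l2 norms of the columns of W (shape T x K)."""
    return np.sum(np.linalg.norm(W, axis=0))
```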
This method naturally introduces the idea of group sparsity to construct multi-task met-
ric learning, and the proposed approach is not difficult to realize. However, this
algorithm requires the set of rank-one bases to be pre-trained, so they cannot be
optimized simultaneously with the weights.
Two-level multi-task metric learning (TMTL) Liu et al. [35] proposes a two-level
multi-task metric learning approach that combines multiple metrics directly without an
explicit optimization procedure. It is developed based on KISSME [36], which is a met-
ric learning approach motivated by a statistical inference and defines the Mahalanobis
matrix as
M = S^{−1} − D^{−1},
where S and D denote the covariance matrices of the differences of similar and dissimilar pairs, respectively.
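A minimal sketch of this construction is given below, assuming S and D are estimated from matched (similar) and mismatched (dissimilar) pairs; a practical implementation would additionally project the result back onto the positive semi-definite cone:

```python
import numpy as np

def kissme_metric(X_sim_a, X_sim_b, X_dis_a, X_dis_b):
    """M = S^{-1} - D^{-1} from similar pairs (rows of X_sim_*) and dissimilar pairs."""
    diff_s = X_sim_a - X_sim_b                  # differences of similar pairs
    diff_d = X_dis_a - X_dis_b                  # differences of dissimilar pairs
    S = diff_s.T @ diff_s / len(diff_s)         # covariance of similar-pair differences
    D = diff_d.T @ diff_d / len(diff_d)         # covariance of dissimilar-pair differences
    return np.linalg.inv(S) - np.linalg.inv(D)
```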
This model is extended to a two-level multi-task learning paradigm in a rather simple way.
The authors first learn a Mahalanobis matrix for each task respectively and a common
metric for all samples. Then the final individual Mahalanobis matrix is given by a direct
weighted composition of the task-specific and common matrices.
Online semi-supervised multi-task metric learning Li and Tao [37] propose an online semi-supervised multi-task distance metric learning approach that further exploits unlabeled data. Unlabeled pairs are assigned pseudo-labels as

y_ij = { 1, if x_i ∈ N(x_j) or x_j ∈ N(x_i);  0, otherwise },    (5)
where N(xi ) indicates the nearest neighbor set of xi calculated by Euclidean distance. The
Eq. (5) indeed assumes that if a point is one of the nearest neighbors of the other point,
they should have the same label. Then the semi-supervised model can be formulated as
min_M  Σ_{t=1}^T { (2 / (N_t^l (N_t^l − 1))) Σ_{(x_i, x_j)∈D_t^l} g_l( y_ij (1 − ‖x_i − x_j‖_{M_t+M_0}^2) )
               + (2β / (N_t^u (N_t^u − 1))) Σ_{(x_i, x_j)∈D_t^u} g_u( y_ij (1 − ‖x_i − x_j‖_{M_t+M_0}^2) )
               + (λ/2) ‖M_t‖_F^2 } + γ T ‖M_0‖_F^2,
s.t.  M ⪰ 0,
where D_t^l and D_t^u represent the sets of labeled data pairs and unlabeled data pairs respec-
tively, N_t^l and N_t^u are the numbers of the labeled and unlabeled training data, λ and γ
are both hyper-parameters to control the regularization on the individual parts and the
common part, and M represents all the M_t's and M_0 for brevity.
This method utilizes the unlabeled data by assigning labels for them according to the
original distances. The strategy of constructing the multi-task learning model is the same as in the
previous approaches.
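As a sketch of the pseudo-labelling rule in (5), the snippet below implements a naive O(n²) assignment with an assumed number of neighbours k; the original paper may differ in details:

```python
import numpy as np

def pseudo_labels(X, k=5):
    """y_ij = 1 if x_i is among the k Euclidean nearest neighbours of x_j or vice versa."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                    # exclude self-matches
    knn = np.argsort(dists, axis=1)[:, :k]             # indices of the k nearest neighbours
    Y = np.zeros((n, n), dtype=int)
    for i in range(n):
        Y[i, knn[i]] = 1
    return np.maximum(Y, Y.T)                          # symmetrize: "or" in Eq. (5)
```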
Fig. 2 An example of enhanced visual tree. The visual tree is constructed on the CIFAR-100 image set with
100 categories and its depth is 4. This figure is from the original paper [39]
Hierarchical multi-task metric learning (HMML) Zheng et al. [39] organize a large number of image categories with an enhanced visual tree and learn a series of metrics hierarchically over it. Figure 2 shows an example of the enhanced visual tree for CIFAR-100. In this tree, the categories are organized in a
hierarchical structure according to their similarities.
According to the construction procedure of the visual tree, categories on the same
branch are more similar to each other than the ones on other branches. Thus, it is reason-
able to perform multi-task metric learning over the sibling child nodes under the same
parent node to utilize the inter-node visual correlation among them. The authors exploit
the same strategy as mt-LMNN [31], which decomposes the metric into a common part
and an individual part as
d_t(x_i, x_j) = (x_i − x_j)^T (M_0 + M_t)(x_i − x_j),
where M0 defines the common metric shared among all sibling child nodes and Mt
defines the node-specific metric.
For root node, the joint objective function is then defined as
min_{M_0,...,M_T}  γ_0 ‖M_0 − I‖_F^2 + Σ_{t=1}^T α_t tr[M_0 + M_t]
                 + Σ_{t=1}^T [ γ_t ‖M_t‖_F^2 + Σ_{i,j} d_t^2(x_i, x_j) + Σ_{i,j,k} ξ_{i,j,k} ],    (6)
where the parameters γ0 and γt ’s control the regularization on the common part and
individual part respectively.
For non-root nodes at the mid-level of the visual tree, besides the inter-node corre-
lations, the inter-level visual correlations between the parent node and its sibling child
nodes at the next level should be also exploited. Since all nodes on the same branch are
similar, any node p characterizes the common visual properties of its sibling child nodes.
On the other hand, the task-specific metric Mp for node p contains the task-specific com-
position. Thus, it is reasonable to utilize the task-specific metric of node p to help the
learning of its sibling child nodes. Based on this idea, the regularization β‖M_0 − M_p‖^2
is added into the objective of (6) for non-root nodes, where M0 is the common metric
shared among the sibling child nodes under parent node p and Mp is the task-specific
metric for node p at the upper level.
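The extra inter-level coupling is easy to state in code: for a non-root parent node p, the objective of its children simply gains a term pulling the shared part M_0 towards the parent's task-specific metric M_p. The following is a schematic sketch of that single term, not the full HMML objective:

```python
import numpy as np

def inter_level_penalty(M0, Mp, beta):
    """beta * ||M_0 - M_p||_F^2: couples a node's shared metric to its parent's metric."""
    return beta * np.linalg.norm(M0 - Mp, "fro") ** 2
```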
This method introduces the hierarchical visual tree into multi-task metric learning,
which is used to guide the multi-task learning and thus provides a more powerful
capability of describing the relationship among tasks.
Multi-task metric learning by learning task relationships Zhang and Yeung [40, 41] propose a multi-task metric learning approach based on multi-task relationship learning (MTRL) [13], which places a matrix-variate normal prior
distribution [42] on the task parameters and automatically learns the relationships between tasks by a regular-
ization. Since the parameter to be learned in metric learning is a matrix rather than a
vector, the authors concatenate all columns of the Mahalanobis matrix to form a vector
for each task, M̃_t = vec(M_t), and then apply the regularization of MTRL to it: tr(M̃ Ω^{−1} M̃^T),
where M̃ = [vec(M_1), . . . , vec(M_T)] and Ω is the task covariance matrix. It is equivalent to applying the following matrix-variate
normal prior distribution to the M̃_t's:

M̃ ∼ MN_{d²×T}(0, I_{d²}, Ω).

In this definition, the row covariance matrix I_{d²} models the relationships between features
and the column covariance matrix Ω models the relationships between the different vector-
ized Mahalanobis matrices M̃_t's. Thus, Ω indeed determines the relationships between
tasks. Since Ω cannot be given a priori in most cases, the authors propose to estimate it
from data automatically.
The obtained model is shown in (7) and can be solved by alternating optimization.
min_{{M_t}, Ω}  Σ_{t=1}^T (2 / (n_t (n_t − 1))) Σ_{i<j} g( y_{ij}^t (1 − ‖x_i^t − x_j^t‖_{M_t}^2) )
             + (λ_1/2) Σ_{t=1}^T ‖M_t‖_F^2 + (λ_2/2) tr(M̃ Ω^{−1} M̃^T)    (7)
s.t.  M_t ⪰ 0, ∀t
      M̃ = (vec(M_1), . . . , vec(M_T))
      Ω ⪰ 0,  tr(Ω) = 1.
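A sketch of the relationship regularizer in (7) and of a closed-form update of Ω in an alternating optimization is given below; the update formula follows the original MTRL paper [13] and should be treated as an assumption here:

```python
import numpy as np
from scipy.linalg import sqrtm

def relationship_penalty(M_list, Omega):
    """tr(M_tilde Omega^{-1} M_tilde^T) with M_tilde = [vec(M_1), ..., vec(M_T)]."""
    M_tilde = np.stack([M.ravel(order="F") for M in M_list], axis=1)   # shape d^2 x T
    return np.trace(M_tilde @ np.linalg.inv(Omega) @ M_tilde.T)

def update_omega(M_list):
    """Closed-form Omega step: A^{1/2} / tr(A^{1/2}), with A = M_tilde^T M_tilde."""
    M_tilde = np.stack([M.ravel(order="F") for M in M_list], axis=1)
    A_sqrt = np.real(sqrtm(M_tilde.T @ M_tilde))
    return A_sqrt / np.trace(A_sqrt)
```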
In that paper, the authors further propose a transfer metric learning based on this model
by training the vectorized Mahalanobis matrix of only the target task while leaving the other
matrices fixed as source tasks. The idea of learning the relationships between tasks is inter-
esting, but the covariance between the vectorized Mahalanobis matrices is hard to explain
from the perspective of distance metric.
Multi-task maximally collapsing metric learning (MtMCML) Ma et al. [43] couple the multiple metric learning tasks through a graph regularization over the vectorized Mahalanobis matrices:

J(M_1, . . . , M_T) = Σ_{i=1}^T Σ_{j=1}^T W(i, j) ‖M̃_i − M̃_j‖_2^2
                   = trace( M̃ (DIA − W) M̃^T )
                   = trace( M̃ L M̃^T ),    (8)
where M̃_i = vec(M_i) converts the Mahalanobis matrix of the i-th task into a vector in a
column-wise manner, DIA is a diagonal matrix with DIA(i, i) = Σ_{j=1}^T W(i, j), and thus
the matrix L = DIA − W indeed defines the graph Laplacian matrix. The model can be
optimized by an alternating method.
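A small sketch of evaluating the Laplacian (trace) form of the regularizer in (8), with an illustrative adjacency matrix W:

```python
import numpy as np

def laplacian_penalty(M_list, W):
    """tr(M_tilde L M_tilde^T), L = DIA - W, with M_tilde = [vec(M_1), ..., vec(M_T)]."""
    M_tilde = np.stack([M.ravel(order="F") for M in M_list], axis=1)   # shape d^2 x T
    L = np.diag(W.sum(axis=1)) - W                                     # graph Laplacian
    return np.trace(M_tilde @ L @ M_tilde.T)
```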
In this work, the authors empirically set the adjacency matrix as W(i, j) = 1, which
indeed assumes that every pair of tasks is related. It is not difficult to prove that such a regular-
ization is just a variant of the regularization of mt-LMNN. Therefore, these two methods
are closely related in this special case.
This work naturally introduces the graph regularization into multi-task learning by
applying a Laplacian to the vectorized Mahalanobis matrices. However, the relationship
between two metrics is still vague, and the Laplacian matrix L is not easy to determine
reasonably.
In this framework, the loss L and constraints Ct are used to incorporate the side-
information from training samples into the learning process, while the regularization
D(Mt , Mc ) encourages the metric of each task to be similar to a common one Mc , and
D(M0 , Mc ) further regularizes the common metric to be close to a pre-defined metric.
Without more prior information available, M0 is set to the identity I to define a Euclidean
metric.
The mt-LMNN can be easily included as a special case of this framework by D(X, Y) =
‖X − Y‖_F^2. The only difference lies in the constraints: the Mahalanobis matrix of the
t-th task in mt-LMNN is M0 + Mt , where both the two parts are positive semi-definite;
the Mahalanobis matrix of the t-th task in (9) with Frobenius norm is Mt and the pos-
itive semi-definiteness of only this matrix is required. The authors indicate that the
latter actually provides a more reasonable model because the constraints in mt-LMNN are
unnecessarily strict.
Geometry preserving multi-task metric learning (GPmtML) Yang et al. [44] proposes
the geometry preserving multi-task metric learning approach based on the general frame-
work (9). Different from most previous approaches, the GPmtML considers the multi-task
metric learning problem from the perspective of propagating the relative distance. The
authors indicate that learning of a metric is a process of embedding the supervised infor-
mation from training samples into the learnt metric, and thus it is reasonable to couple
the multiple tasks by sharing the embedded supervised information among metrics. As
we have illustrated, it is an important class of metric learning approaches which learn
the metric from relative distances given by triplets, and thus it is reasonable to propa-
gate such relationships about the relative distance between metrics. Motivated by this,
the authors propose the concept of geometry preserving probability [44, 45] to mea-
sure such kind of characteristic between two metrics defined by Mahalanobis matrices
A and B.
P_{G_f}(d_A, d_B) = P( d_A(x_1, y_1) > d_A(x_2, y_2) ∧ d_B(x_1, y_1) > d_B(x_2, y_2) )
               + P( d_A(x_1, y_1) < d_A(x_2, y_2) ∧ d_B(x_1, y_1) < d_B(x_2, y_2) )
               + P( d_A(x_1, y_1) = d_A(x_2, y_2) ∧ d_B(x_1, y_1) = d_B(x_2, y_2) ),
where (x1 , y1 , x2 , y2 ) ∼ f and ∧ denotes the logical “and” operator.
Then the geometry preserving multi-task metric learning is proposed which aims to
increase the geometry preserving probability. The method is obtained by using the von
Neumann divergence [46, 47] (10) as regularization in (9).
D_vN(M, M_c) = tr( M log M − M log M_c − M + M_c )    (10)
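A sketch of evaluating the von Neumann divergence (10) for strictly positive definite matrices, using matrix logarithms (numerical care is needed near singular matrices):

```python
import numpy as np
from scipy.linalg import logm

def von_neumann_divergence(M, Mc):
    """D_vN(M, M_c) = tr(M log M - M log M_c - M + M_c), for positive definite M, M_c."""
    term = M @ (logm(M) - logm(Mc))
    return np.real(np.trace(term) - np.trace(M) + np.trace(Mc))
```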
Sharing transformation
According to (1), learning a Mahalanobis distance is equivalent to learning a correspond-
ing linear transformation. There are indeed some metric learning algorithms that aim
to learn such a transformation directly, and it naturally provides a way to construct a
multi-task metric learning by sharing some parts of the transformation.
Multi-task metric learning based on common subspace (mtMLCS) Yang et al. [48]
proposes a multi-task learning method based on the assumption of common subspace.
The idea is motivated by multi-task feature learning [11] which learns a common sparse
representations across multiple tasks. Based on the same assumption that all the tasks
share a common low-dimensional subspace, the authors propose a multi-task framework
for metric learning by transformation.
To couple multiple tasks with a common low-dimensional subspace, the authors notice
that for any low-rank Mahalanobis matrix M, the corresponding linear transformation
matrix L is of full row rank and has the size of r×d, where r = rank(M) is the dimension of
the subspace. Applying a compact SVD to L, there is L = UΣV^T, where V is a d × r matrix
whose transpose defines a projection onto the low-dimensional subspace, and UΣ defines a transformation
in the subspace. This fact motivates a straightforward multi-task strategy with a common
subspace: to share a common projection matrix V^T and learn an individual transformation
R_t = U_t Σ_t for each task.
However, it is computationally complex to apply an orthogonality constraint to V. On the
other hand, it is notable that orthogonality is not necessary for defining a subspace:
as long as the projection matrix is of size r × d with r < d, it indeed defines a subspace of dimensionality
no more than r, up to some extra full-rank transformation in the subspace. Therefore, a
common matrix L_0 of size r × d is used to realize the common projection instead of V^T, and
the extra transformation can be absorbed into R_t. The obtained model for multi-task metric
learning is thus to learn a transformation for each task, L_t = R_t L_0, where L_0 defines the common
subspace and R_t defines the task-specific metric. This strategy is then incorporated into
LMCA [49], which is a variant of LMNN [2] that learns the transformation directly.
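A schematic sketch of the mtMLCS parameterization: each task's transformation is the composition of a shared projection L_0 (of size r × d) and a task-specific square transformation R_t (of size r × r). The shapes and random values below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T = 20, 5, 3
L0 = rng.normal(size=(r, d))                     # shared projection onto the common subspace
R = [rng.normal(size=(r, r)) for _ in range(T)]  # task-specific transformations in the subspace

def task_sq_distance(t, xi, xj):
    """Squared distance of task t under L_t = R_t L_0."""
    Lt = R[t] @ L0
    return np.sum((Lt @ xi - Lt @ xj) ** 2)
```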
This approach is simple to implement. Compared with the approaches that learn
metrics by learning Mahalanobis matrices, mtMLCS does not require the symmetric
positive-definite constraints, and thus is much easier to optimize. However, this model is
not convex and thus the global optimum cannot be guaranteed.
Coupled projection multi-task metric learning (CP-mtML) Bhattarai et al. [50] pro-
poses a multi-task metric learning approach which also focuses on methods that
learn a linear transformation. In this paper, the authors refer to the transformation in (1)
as “projection”, and the idea to couple different tasks is to decompose it into a common
projection and a task-specific projection. Different from mtMLCS in which the com-
mon projection and task-specific projection are concatenated, CP-mtML decomposes the
projection in the manner of distance:
d_t^2(x_i, x_j) = d_{L_0}^2(x_i, x_j) + d_{L_t}^2(x_i, x_j)
             = ‖L_0 x_i − L_0 x_j‖_2^2 + ‖L_t x_i − L_t x_j‖_2^2.
It is easy to show that the relation among different tasks is the same as mt-LMNN where
both of them obtain the distance by summing the squared distances of common and task-
specific parts:
d_t^2(x_i, x_j) = (x_i − x_j)^T ( L_0^T L_0 + L_t^T L_t ) (x_i − x_j).
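A sketch of the CP-mtML distance, which simply sums the squared distances in the common and task-specific projection spaces; the projection matrices here are placeholders:

```python
import numpy as np

def cp_mtml_sq_distance(L0, Lt, xi, xj):
    """d_t^2(x_i, x_j) = ||L_0 x_i - L_0 x_j||_2^2 + ||L_t x_i - L_t x_j||_2^2."""
    return np.sum((L0 @ (xi - xj)) ** 2) + np.sum((Lt @ (xi - xj)) ** 2)
```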
Deep multi-task metric learning (DMML) Soleimani et al. [51] proposes a multi-
task learning version of deep metric learning. The method is constructed based on the
discriminative deep metric learning (DDML) [29]. For any pair of points, the DDML
transforms the two points with a neural network, and then the distance is defined to be
the Euclidean distance of their transformations. Thus the process of metric learning is
done by learning the parameters of the network.
The DMML uses a straightforward way to construct a multi-task version of DDML by
sharing the same first layer. Assuming there are T tasks, the outputs for two points x_{i,t}, x_{j,t}
in the t-th task are h_{1,t}^(1) = s(W^(1) x_{i,t} + b^(1)) and h_{2,t}^(1) = s(W^(1) x_{j,t} + b^(1)), where all tasks
share a common weight matrix W^(1) and a common bias vector b^(1), and s is a nonlinear
operator such as tanh. Then the outputs of the second layer are calculated separately for each
task as h_{1,t}^(2) = s(W_t^(2) h_{1,t}^(1) + b_t^(2)) and h_{2,t}^(2) = s(W_t^(2) h_{2,t}^(1) + b_t^(2)), where each task uses the
task-specific weight matrix W_t^(2) and bias vector b_t^(2), and s is the non-linear operator
again. The obtained distance can now be calculated by

d^2(x_{i,t}, x_{j,t}) = ‖h_{1,t}^(2) − h_{2,t}^(2)‖_2^2.
The network is trained by minimizing

min_{W,b}  J = (1/2) Σ_{t=1}^T Σ_{i,j} g( 1 − l_{i,j} (τ − d^2(x_{i,t}, x_{j,t})) )
           + (λ/2) ( ‖W^(1)‖_F^2 + ‖b^(1)‖_2^2 ) + (λ/2) Σ_{t=1}^T ( ‖W_t^(2)‖_F^2 + ‖b_t^(2)‖_2^2 ),

where g(z) = (1/β) log(1 + exp(βz)) is the smoothed approximation of [z]_+ = max(z, 0),
and β controls its sharpness.
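A NumPy sketch of the DMML forward computation with one shared layer and one task-specific layer (tanh nonlinearity); the weights are placeholders, and the actual networks in [51] may differ in depth and layer sizes:

```python
import numpy as np

def dmml_distance(x_i, x_j, W1, b1, W2_t, b2_t):
    """Squared distance for task t: shared first layer (W1, b1), task layer (W2_t, b2_t)."""
    h1_i, h1_j = np.tanh(W1 @ x_i + b1), np.tanh(W1 @ x_j + b1)             # shared layer
    h2_i, h2_j = np.tanh(W2_t @ h1_i + b2_t), np.tanh(W2_t @ h1_j + b2_t)   # task-specific layer
    return np.sum((h2_i - h2_j) ** 2)

def smoothed_hinge(z, beta=1.0):
    """g(z) = (1 / beta) * log(1 + exp(beta * z)), a smooth approximation of max(z, 0)."""
    return np.log1p(np.exp(beta * z)) / beta
```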
This method is based on a simple yet effective idea in which a part of the network weights
is shared across multiple tasks. It is not difficult to implement by slightly modifying the
original network architecture. However, only the first layer is shared across different tasks
in this model, which may not be the optimal choice, and it is not easy to determine how
many layers should be shared.
Deep metric learning with auxiliary tasks McLaughlin et al. [52] propose to train a deep network that learns a metric from similar/dissimilar pairs jointly with auxiliary softmax classification tasks. In their formulation, z is the feature representation of the input image G(x_i^1) or G(x_i^2), L_t is the label set
for the t-th task, and 1{l_t = j} is an indicator function that takes value one when j is equal
to the ground truth l_t and zero otherwise. Using this framework, several auxiliary tasks
can be included by using different label set Lt , such as identification, attributes, pose, etc.
Please refer to [52] for more details.
The strategy to construct the multi-task metric learning used in this paper is common
in the community of multi-task learning. It is a flexible model by using different auxiliary
tasks. However, for some task, it is difficult to choose a proper auxiliary task, and a bad
auxiliary task may induce deterioration of the performance.
Applications
Multi-task metric learning has been widely used in a variety of practical applications, and
we would like to introduce some representative works in this section.
Semantic categorization and social tagging with knowledge transfer among tasks
Wang et al. [32] uses their proposed multi-task multi-feature similarity learning to solve
large-scale visual applications. The metrics for visual categorization and automatic
tagging are learned jointly based on the framework, which benefits from several perspec-
tives. First, M2 SL learns a metric for each feature instead of concatenating the multiple
features into one feature. This effectively reduces the computation complexity growth
from O(M²d²) to O(Md²) and also reduces the risk of over-fitting. Second, the multi-task frame-
work is more flexible in exploring the intrinsic model sharing and feature weighting
relations on image data with a large number of classes. Third, the knowledge is transferred
among semantic labels and social tagging information by the model. This combines the
information fusion from both sides for effective image understanding.
The authors compare the performances of two versions of M2 SL (linear and kernelized)
with some other methods and the experimental results are shown in Fig. 3. From the
Fig. 3 Comparison of M2 SL with other methods. The performance curves of M2 SL and other methods on
ImageNet-250 dataset are shown, where M2 SL-K and M2 SL-L indicate the kernelized and linear M2 SL
respectively. The x-axis represents the number of tasks while y-axis the mean accuracy of all tasks. This figure
is from the original paper [32]
results, the kernelized M2 SL always achieves the best performance, especially when the
number of tasks is larger. The linear M2 SL also outperforms the single-task MSL.
Thus, the knowledge transfer by multi-task learning effectively improves the performance
of metric learning.
Person re-identification over camera networks Ma et al. [43] uses their proposed
multi-task maximally collapsing metric learning to solve the person re-identification over
camera networks. Person re-identification in a camera network is a challenging problem
because the data are collected from different cameras. The method to use a common
metric overlooks the differences between cameras, and thus the authors propose to use a
multi-task learning approach for this problem. With MtMCML, a particular metric
is learned for each pair of cameras, while the common information can be shared among
them. The experimental results show that the multi-task approach works substantially
better than other state-of-the-art methods as shown in Fig. 4.
Large-scale face retrieval Bhattarai et al. [50] uses their proposed coupled projection
multi-task metric learning to solve large-scale face retrieval. They use the multi-task
framework to learn different tasks on heterogeneous datasets simultaneously, where a
common projection is used to share information among these tasks. The tasks include
face identity, age recognition, and expression recognition. By jointly learning these tasks,
the authors get an improved performance as shown in Fig. 5.
Offline signature verification Soleimani et al. [51] aims to deal with the offline signa-
ture verification problem using the deep multi-task metric learning. For offline signature
verification, there are writer-dependent (WD) approaches and writer-independent (WI)
approaches. Each of the two has its own particular advantages.
These two approaches are well integrated in this model where the shared layer acts as a
WI approach while the separated layers learn WD factors. In the experiments, the DMML
Fig. 4 Comparison of MtMCML with other methods. The performance of MtMCML and other methods on
GRID datasets are presented, where the x-axis and y-axis represent the rank score and matching rate
respectively. From the results, the multi-task learning approach evidently improves the performance of
matching rate. This figure is from the original paper [43]
Fig. 5 Comparison of CP-mtML with other methods. Age retrieval performance (1-call@K) for different K with
auxiliary task of identity matching. This figure is from the original paper [50]
achieves better performance than other methods. For example, on the UTSig dataset and
using the HOG feature, the DMML achieves an equal error rate (EER) of 17.45% while the
SVM achieves an EER of 20.63%; using the DRT feature, the DMML achieves an EER of 20.28%
while the SVM achieves an EER of 27.64%.
Hierarchical large-scale image classification Zheng et al. [39] uses their proposed
hierarchical multi-task metric learning to solve the large-scale image classification prob-
lem. To deal with the large-scale problem, the authors first learn a visual tree to organize
a large number of image categories hierarchically in a coarse-to-fine fashion. Then a series of
metrics are learnt hierarchically. Using the HMML, both the inter-node visual corre-
lations and the inter-level visual correlations are utilized. The inter-node correlation
is obtained directly from the multi-task framework, while the inter-level correlation is
obtained by passing the task-specific part into the next level. The experimental results
shown in Fig. 6 demonstrate that the multi-task model obtains better performance on
large-scale classification.
Fig. 6 Comparison of HMML with other methods. Accuracy comparison on the ILSVRC-2012 image set with
1000 image categories. This figure is from the original paper [39]
Person re-identification with auxiliary tasks McLaughlin et al. [52] uses multi-task
learning to improve the performance of person re-identification. Using their proposed
deep convnets metric learning with multi-task learning, the authors train the network
to jointly perform verification and identification and to recognize attributes related to the
clothing and pose of the person in each image. The main job of the network is to learn a
metric using similar and dissimilar pairs. With the help of auxiliary tasks (attribute recog-
nition), the network learns a metric that gives satisfactory performance. Figure 7 shows the
experimental results. It is obvious that the accuracy is effectively improved by introducing
auxiliary tasks.
Conclusion
In this paper, we have systematically reviewed multi-task metric learning. Following a
brief overview of metric learning, various multi-task learning approaches are categorized
into four families and introduced respectively. We then review the motivations, models,
and algorithms of them, and also discuss and compare some closely related approaches.
Finally some representative applications of multi-task metric learning are illustrated.
For future work, we suggest potential issues for exploration. First, the theoretical anal-
ysis of multi-task metric learning should be addressed. This has long been an important
issue yielding multiple results [53–56], with most studies focusing on how multi-task
learning improves the generalization [57] of a conventional algorithm. However, as men-
tioned earlier, metric learning improves the performance of the algorithms that use
the metric only indirectly. This makes these results difficult to apply to metric learning
algorithms. There has also been some research [58–61] on the theoretical analysis of met-
ric learning; however, it has been difficult to explain these results in the context of multi-task
learning. Whilst Yang et al. [44] has attempted to provide an intuitive explanation, the
issue pertaining to multi-task learning remains unresolved. Second, negative transfer
among tasks should be avoided. Existing approaches are designed to couple multiple metrics
without considering the problem of negative transfer, and thus the performance is likely to deteriorate
when the tasks are not related. Third, most existing multi-task metric
learning approaches are designed for global linear metrics. Thus they should be extended to
Fig. 7 Comparison of mtDCML with other methods. CMC curve after multitask training on VIPeR dataset. This
figure is from the original paper [52]
more types of metric learning approaches, including local metric learning and non-linear
metric learning. Finally, increased applications of multi-task metric learning are expected
to be discovered.
Funding
The paper was partially supported by National Natural Science Foundation of China (NSFC) under grant no.61403388,
no.61473236, Natural science fund for colleges and universities in Jiangsu Province under grant no. 17KJD520010, Suzhou
Science and Technology Program under grant no. SYG201712, SZS201613, Key Program Special Fund in XJTLU (KSF-A-01),
and UK Engineering and Physical Sciences Research Council (EPSRC) grant numbers EP/I009310/1, EP/M026981/1.
Authors’ contributions
PY structured the idea and mainly drafted the manuscript. KH provided guidance for the
whole manuscript and revised the draft. AH participated in the discussion and gave valuable suggestions on the idea. All
authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author details
1 National Laboratory of Pattern Recognition, 95 East Zhongguancun Road, 100190 Beijing, China. 2 Xi’an
Jiaotong-Liverpool University, 111 Ren’ai Road, 215123 Suzhou, China. 3 University of Stirling, FK9 4LA Stirling, UK, Scotland.
References
1. Xing EP, Ng AY, Jordan MI, Russell SJ. Distance metric learning with application to clustering with side-information.
In: Advances in Neural Information Processing Systems 15 [Neural Information Processing Systems, NIPS 2002,
December 9-14, 2002, Vancouver, British Columbia, Canada]. 2002. p. 505–12. http://papers.nips.cc/paper/2164-
distance-metric-learning-with-application-to-clustering-with-side-information.
2. Weinberger KQ, Saul LK. Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res.
2009;10:207–44.
3. Davis JV, Kulis B, Jain P, Sra S, Dhillon IS. Information-theoretic metric learning. In: Proceedings of the 24th
International Conference on Machine Learning. 2007. p. 209–16.
4. Huang K, Ying Y, Campbell C. Gsml: A unified framework for sparse metric learning. In: Ninth IEEE International
Conference on Data Mining. 2009. p. 189–98.
5. Huang K, Ying Y, Campbell C. Generalized sparse metric learning with relative comparisons. Knowl Inf Syst.
2011;28(1):25–45.
6. Ying Y, Huang K, Campbell C. Sparse metric learning via smooth optimization. In: Bengio Y, Schuurmans D,
Lafferty J, Williams CKI, Culotta A, editors. Advances in Neural Information Processing Systems 22. 2009. p. 2214–222.
7. Ying Y, Li P. Distance metric learning with eigenvalue optimization. J Mach Learn Res. 2012;13:1–26.
8. Caruana R. Multitask learning. Mach Learn. 1997;28(1):41–75.
9. Evgeniou T, Pontil M. Regularized multi-task learning. In: Proceedings of the Tenth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. 2004. p. 109–17.
10. Argyriou A, Micchelli CA, Pontil M, Ying Y. A spectral regularization framework for multi-task structure learning.
In: Advances in Neural Information Processing Systems 20. 2008. p. 25–32.
11. Argyriou A, Evgeniou T. Convex multi-task feature learning. Mach Learn. 2008;73(3):243–72.
12. Zhang J, Ghahramani Z, Yang Y. Flexible latent variable models for multi-task learning. Mach Learn. 2008;73(3):
221–42.
13. Zhang Y, Yeung DY. A convex formulation for learning task relationships in multi-task learning. In: Proceedings of
the Twenty-Sixth Conference Annual Conference on Uncertainty in Artificial Intelligence. 2010. p. 733–442.
14. Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng. 2010;22(10):1345–59.
15. Dai W, Yang Q, Xue GR, Yu Y. Boosting for transfer learning. In: Proceedings of the 24th International Conference
on Machine Learning, ICML ’07. New York: ACM; 2007. p. 193–200.
16. Gopalan R, Li R, Chellappa R. Domain adaptation for object recognition: An unsupervised approach. In: Proceedings
of IEEE International Conference on Computer Vision, ICCV 2011. p. 999–1006.
17. Vilalta R, Drissi Y. A perspective view and survey of meta-learning. Artif Intell Rev. 2002;18(2):77–95.
18. Thrun S. Lifelong learning algorithms. In: Learning to Learn. USA: Springer; 1998. p. 181–209.
19. Thrun S, Pratt L. Learning to Learn. USA: Springer; 2012.
20. Burago D, Burago Y, Ivanov S. A Course in Metric Geometry. USA: American Mathematical Society; 2001. Chap. Ch 1.1.
21. Mahalanobis PC. On the generalised distance in statistics. In: Proceedings National Institute of Science, vol. 2. India;
1936. p. 49–55.
22. Bellet A, Habrard A, Sebban M. A survey on metric learning for feature vectors and structured data. arXiv preprint
arXiv:1306.6709v4, 2014.
23. Kulis B. Metric learning: A survey. Found Trends Mach Learn. 2013;5(4):287–364.
24. Weinberger KQ, Blitzer J, Saul L. Distance metric learning for large margin nearest neighbor classification.
In: Advances in Neural Information Processing Systems 18. 2006.
25. Huang K, Jin R, Xu Z, Liu CL. Robust metric learning by smooth optimization. In: The 26th Conference on
Uncertainty in Artificial Intelligence. 2010. p. 244–51.
26. Goldberger J, Roweis S, Hinton G, Salakhutdinov R. Neighbourhood components analysis. In: Advances in Neural
Information Processing Systems. 2004. p. 513–20.
27. Schmidhuber J. Deep learning in neural networks: An overview. Neural Netw. 2015;61:85–117.
28. Salakhutdinov R, Hinton G. Learning a nonlinear embedding by preserving class neighbourhood structure.
In: Artificial Intelligence and Statistics. 2007. p. 412–9.
29. Hu J, Lu J, Tan Y. Discriminative deep metric learning for face verification in the wild. In: 2014 IEEE Conference on
Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014. 2014. p. 1875–82.
30. Vapnik VN. Statistical Learning Theory, 1st ed. USA: Wiley; 1998.
31. Parameswaran S, Weinberger K. Large margin multi-task metric learning. In: Advances in Neural Information
Processing Systems 23. 2010. p. 1867–75.
32. Wang S, Jiang S, Huang Q, Tian Q. Multi-feature metric learning with knowledge transfer among semantics and
social tagging. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June
16-21, 2012. 2012. p. 2240–7.
33. Kwok JT, Tsang IW. Learning with idealized kernels. In: Machine Learning, Proceedings of the Twentieth
International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA; 2003. p. 400–7. http://www.aaai.
org/Library/ICML/2003/icml03-054.php.
34. Shi Y, Bellet A, Sha F. Sparse compositional metric learning. In: Proceedings of the Twenty-Eighth AAAI Conference
on Artificial Intelligence, July 27 -31, 2014, Québec City, Québec, Canada; 2014. p. 2078–084. http://www.aaai.org/
ocs/index.php/AAAI/AAAI14/paper/view/8224.
35. Liu H, Zhang X, Wu P. Two-level multi-task metric learning with application to multi-classification. In: 2015 IEEE
International Conference on Image Processing, ICIP 2015, Quebec City, QC, Canada, September 27-30, 2015; 2015.
p. 2756–60.
36. Köstinger M, Hirzer M, Wohlhart P, Roth PM, Bischof H. Large scale metric learning from equivalence constraints.
In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012; 2012.
p. 2288–95.
37. Li Y, Tao D. Online semi-supervised multi-task distance metric learning. In: IEEE International Conference on Data
Mining Workshops, ICDM Workshops 2016, December 12-15, 2016, Barcelona, Spain; 2016. p. 474–9.
38. Jin R, Wang S, Zhou Y. Regularized distance metric learning: Theory and algorithm. In: Advances in Neural
Information Processing Systems, vol. 22. 2009. p. 862–70.
39. Zheng Y, Fan J, Zhang J, Gao X. Hierarchical learning of multi-task sparse metrics for large-scale image
classification. Pattern Recogn. 2017;67:97–109.
40. Zhang Y, Yeung DY. Transfer metric learning by learning task relationships. In: Proceedings of the Tenth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining. 2010.
41. Zhang Y, Yeung DY. Transfer metric learning with semi-supervised extension. ACM Trans Intell Syst Tech (TIST).
2012;3(3):54–15428.
42. Gupta AK, Nagar DK. Matrix Variate Distributions. Chapman & Hall/CRC Monographs and Surveys in Pure and
Applied Mathematics, vol. 104. London: Chapman & Hall; 2000.
43. Ma L, Yang X, Tao D. Person re-identification over camera networks using multi-task distance metric learning.
IEEE Trans Image Process. 2014;23(8):3656–70.
44. Yang P, Huang K, Liu CL. Geometry preserving multi-task metric learning. Mach Learn. 2013;92(1):133–75.
45. Yang P, Huang K, Liu CL. Geometry preserving multi-task metric learning. In: European Conference on Machine
Learning and Knowledge Discovery in Databases, vol. 7523. 2012. p. 648–64.
46. Dhillon IS, Tropp JA. Matrix nearness problems with bregman divergences. SIAM J Matrix Anal Appl. 2008;29:1120–46.
47. Kulis B, Sustik MA, Dhillon IS. Low-rank kernel learning with bregman matrix divergences. J Mach Learn Res.
2009;10:341–76.
48. Yang P, Huang K, Liu C. A multi-task framework for metric learning with common subspace. Neural Comput Applic.
2013;22(7-8):1337–47.
49. Torresani L, Lee K. Large margin component analysis. In: Advances in Neural Information Processing Systems 19,
Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British
Columbia, Canada, December 4-7, 2006; 2006. p. 1385–92. http://papers.nips.cc/paper/3088-large-margin-
component-analysis.
50. Bhattarai B, Sharma G, Jurie F. Cp-mtml: Coupled projection multi-task metric learning for large scale face retrieval.
In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30,
2016; 2016. p. 4226–35.
51. Soleimani A, Araabi BN, Fouladi K. Deep multitask metric learning for offline signature verification. Pattern Recogn
Lett. 2016;80:84–90.
52. McLaughlin N, del Rincón JM, Miller PC. Person reidentification using deep convnets with multitask learning.
IEEE Trans Circ Syst Video Techn. 2017;27(3):525–39.
53. Baxter J. A bayesian/information theoretic model of learning to learn via multiple task sampling. Mach Learn.
1997;28(1):7–39.
54. Baxter J. A model of inductive bias learning. J Artif Intell Res. 2000;12:149–98.
55. Blitzer J, Crammer K, Kulesza A, Pereira F, Wortman J. Learning bounds for domain adaptation. In: Advances in
Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural
Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007; 2007. p. 129–36. http://
papers.nips.cc/paper/3212-learning-bounds-for-domain-adaptation.
56. Ben-David S, Blitzer J, Crammer K, Kulesza A, Pereira F, Vaughan JW. A theory of learning from different domains.
Mach Learn. 2010;79(1-2):151–75.
57. Bousquet O, Elisseeff A. Stability and generalization. J Mach Learn Res. 2002;2:499–526.
58. Balcan MF, Blum A, Srebro N. A theory of learning with similarity functions. Mach Learn. 2008;72(1-2):89–112.
59. Wang L, Sugiyama M, Yang C, Hatano K, Feng J. Theory and algorithm for learning with dissimilarity functions.
Neural Comput. 2009;21(5):1459–84.
60. Perrot M, Habrard A. A theoretical analysis of metric hypothesis transfer learning. In: Proceedings of the 32nd
International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015; 2015. p. 1708–17. http://
jmlr.org/proceedings/papers/v37/perrot15.html.
61. Bellet A, Habrard A. Robustness and generalization for metric learning. Neurocomputing. 2015;151:259–67.