Universal Multi-Task Kernels

Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil and Yiming Ying
Abstract
In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an
input space into a Hilbert space Y , an environment appropriate for multi-task learning. The re-
producing kernel K associated to HK has its values as operators on Y . Our primary goal here is to
derive conditions which ensure that the kernel K is universal. This means that on every compact
subset of the input space, every continuous function with values in Y can be uniformly approx-
imated by sections of the kernel. We provide various characterizations of universal kernels and
highlight them with several concrete examples of some practical importance. Our analysis uses
basic principles of functional analysis and especially the useful notion of vector measures which
we describe in sufficient detail to clarify our results.
Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued repro-
ducing kernel Hilbert spaces
1. Introduction
The problem of studying representations and methods for learning vector-valued functions has received increasing attention in Machine Learning in recent years. This problem is motivated by several applications in which it is required to estimate a vector-valued function from a set of input/output data. For example, one is frequently confronted with situations in which multiple supervised learning tasks must be learned simultaneously. This problem can be framed as that of learning a vector-valued function f = (f_1, f_2, . . . , f_n), where each of its components is a real-valued function and corresponds to a particular task. Often, these tasks are dependent on each other in
that they share some common underlying structure. By making use of this structure, each task is
easier to learn. Empirical studies indicate that one can benefit significantly by learning the tasks
simultaneously as opposed to learning them one by one in isolation (see, e.g., Evgeniou et al., 2005,
and references therein).
In this paper, we build upon the recent work of Micchelli et al. (2006) by addressing the issue of universality of multi-task kernels. Multi-task kernels were recently discussed in the Machine Learning context by Micchelli and Pontil (2005); however, there is an extensive literature on multi-task kernels, as they are important both in theory and practice (see Amodei, 1997; Burbea and Masani, 1984; Caponnetto and De Vito, 2006; Carmeli et al., 2006; Devinatz, 1960; Lowitzsch, 2005; Reisert and Burkhardt, 2007; Vazquez and Walter, 2003, and references therein for more information).
A multi-task kernel K is the reproducing kernel of a Hilbert space of functions from an input space X which take values in a Hilbert space Y. For example, in the discussion above, Y = Rⁿ. Generally, the kernel K is defined on X × X and takes values as an operator from Y to itself.¹ When Y is n-dimensional, the kernel K takes values in the set of n × n matrices. The theory of reproducing
kernel Hilbert spaces (RKHS) as described in Aronszajn (1950) for scalar-valued functions has
extensions to any vector-valued Y . Specifically, the RKHS is formed by taking the closure of the
linear span of kernel sections {K(·, x)y, x ∈ X , y ∈ Y }, relative to the RKHS norm. We emphasize
here that this fact is fundamentally tied to a norm induced by K and is generally non-constructive.
Here, we are concerned with conditions on the kernel K which ensure that all continuous functions
from X to Y can be uniformly approximated on any compact subset of X by the linear span of
kernel sections.
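To make this approximation property concrete, the following toy sketch (in Python; the kernel, target function and sample points are our own illustrative choices, not taken from the paper) fits a finite linear combination of kernel sections K(·, x_j)y_j to a continuous R²-valued target by least squares and reports the uniform error on a grid.

```python
import numpy as np

# Sections of the multi-task kernel K(x, t) = exp(-(x - t)^2) * B on Y = R^2,
# with an (illustrative) positive definite matrix B.
B = np.array([[2.0, 1.0], [1.0, 2.0]])
centers = np.linspace(0.0, 1.0, 15)

def section_matrix(x):
    """Row block [K(x, c_1), ..., K(x, c_J)] acting on stacked coefficients y_j."""
    return np.hstack([np.exp(-(x - c) ** 2) * B for c in centers])

f = lambda x: np.array([np.sin(2 * np.pi * x), abs(x - 0.5)])  # continuous target

grid = np.linspace(0.0, 1.0, 200)
A = np.vstack([section_matrix(x) for x in grid])       # (2*200) x (2*15)
b = np.concatenate([f(x) for x in grid])
coef, *_ = np.linalg.lstsq(A, b, rcond=None)           # coefficients y_1, ..., y_J

err = np.max(np.abs(A @ coef - b))
print("uniform error of the section expansion on the grid:", err)
```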
As far as we are aware, the first paper which addresses this question in Machine Learning
literature is Steinwart (2001). Steinwart uses the expression universal kernel and we follow that
terminology here. The problem of identifying universal kernels was also discussed by Poggio et
al. (2002). One of us was introduced to this problem in a lecture given at City University of Hong
Kong by Zhou (2003). Subsequently, some aspects of this problem were treated in Micchelli et al.
(2003) and Micchelli and Pontil (2004) and then in detail in Micchelli et al. (2006).
The question of identifying universal kernels has a practical basis. We wish to learn a continuous
target function f : X → Y from a finite number of samples. The learning algorithm used for this
purpose should be consistent. That is, as the sample size increases, the discrepancy between the
target function and the function learned from the data should tend to zero. Kernel-based algorithms
(Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004) generally use the representer
theorem and learn a function in the linear span of kernel sections. Therefore, here we interpret
consistency to mean that, for any compact subset Z of the input space X and every continuous
target function f : X → Y , the discrepancy between the target function and the learned function
goes to zero uniformly on Z as the sample size goes to infinity. It is important to keep in mind
that our input space is not assumed to be compact. However, we do assume that it is a Hausdorff
topological space so that there is an abundance of compact subsets, for example any finite subset of
the input space is compact.
Consistency in the sense we described above is important in order to study the statistical perfor-
mance of learning algorithms based on RKHS. For example, Chen et al. (2004) and Steinwart et al.
(2006) studied statistical analysis of soft margin SVM algorithms, Caponnetto and De Vito (2006)
gave a detailed analysis of the regularized least-squares algorithm over vector-valued RKHS and
1. Sometimes, such a kernel is called an operator-valued or a matrix-valued kernel if Y is infinite or finite dimensional, respectively. However, for simplicity's sake we adopt the terminology multi-task kernel throughout the paper.
proved universal consistency of this algorithm assuming that the kernel is universal and fulfills the
additional condition that the operators K(x, x) have finite trace. The results in these papers imply
universal consistency of kernel-based learning algorithms when the considered kernel is universal.
One more interesting application of universal kernels is described in Gretton et al. (2006).
This paper is organized as follows. In Section 2, we review the basic definition and properties
of multi-task kernels, define the notion of universal kernel and describe some examples. In Section
3, we introduce the notion of feature map associated to a multi-task kernel and show its relevance
to the question of universality. The main result in this section is Theorem 4, which establishes that
the closure of the RKHS in the space of continuous functions is the same as the closure of the space
generated by the feature map. The importance of this result is that universality of a kernel can
be established directly by considering its features. In Section 4 we provide an alternate proof of
Theorem 4 which uses the notion of vector measures and discuss ancillary results useful for several
concrete examples of some practical importance highlighted in Section 5.
Table 1: Notation.
We begin by introducing some notation. We let Y be a Hilbert space with inner product (·, ·)_Y (we drop the subscript Y when confusion does not arise). The vector-valued functions will take values in Y. We denote by L(Y) the space of all bounded linear operators from Y into itself, with the operator norm ||A|| := sup_{||y||=1} ||Ay||, A ∈ L(Y), and by L₊(Y) the set of all bounded, positive semi-definite linear operators, that is, A ∈ L₊(Y) provided that, for any y ∈ Y, (y, Ay) ≥ 0. We also denote, for any A ∈ L(Y), by A* its adjoint. Finally, for every m ∈ N, we define N_m = {1, . . . , m}. Table 1 summarizes the notation used in the paper.
$$C_K(\mathcal{Z},\mathcal{Y}) := \overline{\mathrm{span}}\{K_x y : x\in\mathcal{Z},\ y\in\mathcal{Y}\}, \qquad (2)$$

$$\sum_{i,j\in\mathbb{N}_m}\ \sum_{p,q\in\mathbb{N}_n} y_{pi}\,(K(x_i,x_j))_{pq}\,y_{qj} \ \ge\ 0. \qquad (3)$$
From the above observation, we conclude that K is a kernel if and only if (K(x_i, x_j))_{pq}, viewed as the matrix with row index (p, i) ∈ N_n × N_m and column index (q, j) ∈ N_n × N_m, is positive semi-definite. This fact makes it possible, as long as the dimension of Y is finite, to reduce the proof of some properties of operator-valued kernels to the proof of analogous properties of scalar-valued kernels; this process is illustrated by the following proposition.
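The reduction just described is easy to exercise numerically. The sketch below (a hedged illustration; the particular kernel is our own choice) assembles, for a finite sample, the scalar matrix indexed by row (p, i) and column (q, j) and checks that its smallest eigenvalue is nonnegative.

```python
import numpy as np

def flatten_gram(K, xs, n):
    """Scalar (n*m) x (n*m) matrix with entry K(x_i, x_j)[p, q] at row p*m+i,
    column q*m+j, as in the positive semi-definiteness criterion (3)."""
    m = len(xs)
    M = np.zeros((n * m, n * m))
    for i in range(m):
        for j in range(m):
            Kij = K(xs[i], xs[j])            # an n x n matrix
            for p in range(n):
                for q in range(n):
                    M[p * m + i, q * m + j] = Kij[p, q]
    return M

# Illustrative multi-task kernel: scalar Gaussian times a fixed PSD matrix B.
B = np.array([[2.0, 1.0], [1.0, 1.0]])
K = lambda x, t: np.exp(-(x - t) ** 2) * B

M = flatten_gram(K, np.linspace(-1.0, 1.0, 5), n=2)
print("min eigenvalue:", np.linalg.eigvalsh(M).min())   # >= 0 up to round-off
```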
Proposition 3 Let G and K be n × n multi-task kernels. Then the element-wise product kernel K ∘ G : X × X → R^{n×n} defined, for any x, t ∈ X and p, q ∈ N_n, by (K ∘ G)(x,t)_{pq} := K(x,t)_{pq} G(x,t)_{pq} is an n × n multi-task kernel.
Proof We have to check the positive semi-definiteness of K ∘ G. To see this, for any m ∈ N, {y_i ∈ Rⁿ : i ∈ N_m} and {x_i ∈ X : i ∈ N_m}, we observe that
$$\sum_{i,j\in\mathbb{N}_m} (y_i,\ (K\circ G)(x_i,x_j)\,y_j) = \sum_{p,i}\sum_{q,j} y_{pi}\,y_{qj}\,K(x_i,x_j)_{pq}\,G(x_i,x_j)_{pq}. \qquad (4)$$
By inequality (3), it follows that K(x_i, x_j)_{pq} is positive semi-definite as the matrix with (p, i) and (q, j) as row and column indices respectively, and so is G(x_i, x_j)_{pq}. Applying the Schur lemma (Aronszajn, 1950) to these matrices implies that the right-hand side of Equation (4) is nonnegative, and hence proves the assertion.
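A quick numerical illustration of Proposition 3 follows; the two matrix-valued kernels below are illustrative choices of ours (scalar kernel times PSD matrix), and the flattening convention is the one described before the proposition.

```python
import numpy as np

# Two illustrative 2x2 multi-task kernels: scalar kernel times PSD matrix.
B1 = np.array([[2.0, 1.0], [1.0, 1.0]])
B2 = np.array([[1.0, 0.5], [0.5, 2.0]])
K = lambda x, t: np.exp(-(x - t) ** 2) * B1          # Gaussian * B1
G = lambda x, t: (1.0 + x * t) ** 2 * B2             # quadratic polynomial * B2
KG = lambda x, t: K(x, t) * G(x, t)                  # element-wise (Schur) product

xs = np.linspace(-1.0, 1.0, 6)
m, n = len(xs), 2
M = np.zeros((n * m, n * m))                         # rows (p,i), columns (q,j)
for i in range(m):
    for j in range(m):
        Kij = KG(xs[i], xs[j])
        for p in range(n):
            for q in range(n):
                M[p * m + i, q * m + j] = Kij[p, q]
print("min eigenvalue:", np.linalg.eigvalsh(M).min())  # nonnegative up to round-off
```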
We now present some examples of multi-task kernels. They will be used in Section 5 to illustrate
the general results in Sections 3 and 4.
The first example is adapted from Micchelli and Pontil (2005).

Example 1 Let G_j : X × X → R be scalar kernels and B_j ∈ L₊(Y) for j ∈ N_m. Then the function defined, for every x, t ∈ X, by
$$K(x,t) := \sum_{j\in\mathbb{N}_m} G_j(x,t)\,B_j \qquad (5)$$
is a multi-task kernel.
The operators B j model smoothness across the components of the vector-valued function. For ex-
ample, in the context of multi-task learning (see, e.g., Evgeniou et al., 2005, and references therein),
we set Y = Rn , hence B j are n × n matrices. These matrices model the relationships across the
tasks. Evgeniou et al. (2005) considered kernels of the form (5) with m = 2, B 1 a multiple of the
identity matrix and B₂ a low rank matrix. A specific case for X = R^d is
$$K(x,t) = \lambda\,(x\cdot t)\,\mathbf{1}\mathbf{1}^{\top} + (1-\lambda)\,(x\cdot t)^2\,I_n,$$
where x · t is the standard inner product in R^d, 𝟙 ∈ Rⁿ is the all-ones vector, I_n is the n × n identity matrix, and λ ∈ [0, 1]. This kernel has an interesting interpretation. Using only the first term on the right-hand side of the above equation (λ = 1) corresponds to learning all tasks as the same task, that is, all components of the vector-valued function f = (f₁, . . . , f_n) are the same function, which will be a linear function since the kernel G₁(x,t) = x · t is linear. Whereas, using only the second term (λ = 0) corresponds to learning independent tasks, that
is, the components of the function f will be generally different functions. These functions will be
quadratic since G2 is a quadratic polynomial kernel. Thus, the above kernel combines two heteroge-
neous kernels to form a more flexible one. By choosing the parameter λ appropriately, the learning
model can be tailored to the data at hand.
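The trade-off controlled by λ is easy to see by evaluating the kernel at its two extremes. A minimal Python sketch, assuming the specific form displayed above (the sample points are arbitrary):

```python
import numpy as np

def K(x, t, lam, n):
    """lam * (x.t) * ones(n,n): rank one, all tasks share one linear function;
    (1-lam) * (x.t)^2 * I: diagonal, tasks are independent quadratic learners."""
    return lam * (x @ t) * np.ones((n, n)) + (1.0 - lam) * (x @ t) ** 2 * np.eye(n)

x, t = np.array([1.0, -0.5]), np.array([0.3, 2.0])
print(K(x, t, lam=1.0, n=3))   # rank-one coupling: identical tasks
print(K(x, t, lam=0.0, n=3))   # diagonal: decoupled tasks
print(K(x, t, lam=0.5, n=3))   # an interpolation between the two regimes
```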
We note that if K is a diagonal matrix-valued kernel, then each component of a vector-valued
function in the associated RKHS of K can be represented, independently of the other components,
as a function in the RKHS of a scalar kernel. However, in general, a multi-task kernel will not be
diagonal and, more importantly, will not be reduced to a diagonal one by linearly transforming the
output space. For example, the kernel in Equation (5) cannot be reduced to a diagonal kernel, unless the matrices B_j, j ∈ N_m, can all be simultaneously diagonalized. Therefore,
in general, the component functions share some underlying structure which is reflected by the choice
of the kernel and cannot be treated as independent objects. This fact is further illustrated by the next
example.
Example 2 Let X̃ be a set, G a scalar kernel on X̃ × X̃ and, for each p ∈ N_n, let T_p : X → X̃. Then the function K defined, for every x, t ∈ X and p, q ∈ N_n, by K(x,t)_{pq} := G(T_p x, T_q t) is a matrix-valued kernel on X.
A specific instance of the above example is described by Vazquez and Walter (2003) in the context
of system identification. It corresponds to the choices X̃ = X = R and T_p(x) = x + τ_p, where τ_p ∈ R. In this case, the kernel K models "delays" between the components of the vector-valued function. Indeed, it is easy to verify that, for this choice, for all f ∈ H_K and p, q ∈ N_n, f_p(x) = f_q(x + τ_p − τ_q) for every x ∈ R, so that the components of f are delayed versions of one another.
Although at first sight one may expect the matrix-valued function defined, for x, t ∈ R^d, by
$$K(x,t) := \big(e^{-\sigma_{pq}\|x-t\|^2}\big)_{p,q\in\mathbb{N}_n}, \qquad (6)$$
with σ_pq > 0 and σ_pq = σ_qp, to be a kernel over X = R^d, we will show later in Section 5 that this is not true, unless all entries of the matrix σ are the same.
Our next example, called Hessian of Gaussian, is motivated by the problem of learning gradients (Solak et al., 2002; Mukherjee and Zhou, 2006). In many applications, one wants to learn an unknown real-valued function f(x), x = (x₁, . . . , x_d) ∈ R^d, and its gradient function ∇f = (∂₁f, . . . , ∂_d f) where, for any p ∈ N_d, ∂_p f denotes the p-th partial derivative of f. Here the output y_{ip} denotes the observation of the p-th partial derivative at the sample x_i. Therefore, this problem is an appealing example of multi-task learning: learn the target function and its gradient function jointly.
To see why this problem is related with the Hessian of Gaussian, we adopt the Gaussian process
(Rasmussen and Williams, 2006) viewpoint of kernel methods. In this perspective, kernels are
interpreted as covariance functions of Gaussian prior probability distributions over suitable sets of
functions. More specifically, the (unknown) target function f is usually assumed as the realizations
of random variables indexed by its input vectors in a zero-mean Gaussian process. The Gaussian
process can be fully specified by giving the covariance matrix for any finite set of zero-mean random
variables {f(x_i) : i ∈ N_m}. The covariance between the function values corresponding to the inputs x_i and x_j can be defined by a given Mercer kernel, for example the Gaussian kernel G(x) = exp(−||x||²/σ) with σ > 0, that is,
$$\mathrm{cov}(f(x_i), f(x_j)) = G(x_i - x_j).$$
Consequently, the covariance between ∂_p f and ∂_q f is given by
$$\mathrm{cov}(\partial_p f(x_i),\ \partial_q f(x_j)) = -(\partial_p\partial_q G)(x_i - x_j).$$
This suggests to us to use the Hessian of the Gaussian to model the correlation of the gradient function ∇f, as we present in the following example.
Example 3 We let Y = X = Rⁿ and, for any x = (x_p : p ∈ N_n) ∈ X, G(x) = exp(−||x||²/σ) with σ > 0. Then the Hessian matrix of G given, for every x, t ∈ X, by
$$K(x,t) := \big(-(\partial_p\partial_q G)(x-t)\big)_{p,q\in\mathbb{N}_n}$$
is a matrix-valued kernel.
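A short numerical sketch of Example 3 follows; the closed form of the negative Hessian is computed directly from G(x) = exp(−||x||²/σ), and the sample points are arbitrary.

```python
import numpy as np

def hessian_gaussian_kernel(x, t, sigma):
    """K(x, t) = -(Hessian G)(x - t) for G(z) = exp(-||z||^2 / sigma):
    -d^2 G / dz_p dz_q = (2/sigma * delta_pq - 4/sigma^2 * z_p z_q) G(z)."""
    z = x - t
    g = np.exp(-z @ z / sigma)
    n = len(z)
    return (2.0 / sigma * np.eye(n) - 4.0 / sigma ** 2 * np.outer(z, z)) * g

rng = np.random.default_rng(0)
xs = rng.standard_normal((6, 3))                 # six sample points in R^3
m, n = xs.shape
M = np.zeros((m * n, m * n))
for i in range(m):
    for j in range(m):
        M[i * n:(i + 1) * n, j * n:(j + 1) * n] = hessian_gaussian_kernel(xs[i], xs[j], 2.0)
print("min eigenvalue:", np.linalg.eigvalsh(M).min())   # >= 0 up to round-off
```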
To illustrate our final example, we let L²(R) be the Hilbert space of square integrable functions on R with the norm $\|h\|_{L^2} := \big(\int_{\mathbb{R}} h^2(x)\,dx\big)^{1/2}$. Moreover, we denote by W¹(R) the Sobolev space of order one, which is defined as the space of real-valued functions h on R whose norm
$$\|h\|_{W^1} := \big(\|h\|_{L^2}^2 + \|h'\|_{L^2}^2\big)^{\frac{1}{2}}$$
is finite.
Example 4 Let Y = L²(R), X = R and consider the linear space of functions from R to Y which have finite norm
$$\|f\|^2 = \int_{\mathbb{R}} \Big( \|f(x,\cdot)\|^2_{W^1} + \Big\|\frac{\partial f(x,\cdot)}{\partial x}\Big\|^2_{W^1} \Big)\, dx.$$
Then this is an RKHS with multi-task kernel given, for every x, t ∈ X, by
$$(K(x,t)y)(r) = e^{-\pi|x-t|} \int_{\mathbb{R}} e^{-\pi|r-s|}\, y(s)\, ds, \qquad \forall y\in\mathcal{Y},\ r\in\mathbb{R}.$$
This example may be appropriate to learn the heat distribution in a medium if we think of x as time.
Another potential application extends the discussion following Example 1. Specifically, we consider
the case that the input x represents both time and a task (e.g., the profile identifying a customer)
and the output is the regression function associated to that task (e.g., the preference function of a
customer, see Evgeniou et al., 2005, for more information). So, this example may be amenable to learning the dynamics of the tasks.
Further examples for the case that Y = L2 (Rd ) will be provided in Section 5. We also postpone
to that section the proof of the claims in Examples 1-4 as well as the discussion about the universality
of the kernels therein.
We end this section with some remarks. It is well known that universality of kernels is a main hy-
pothesis in the proof of the consistency of kernel-based learning algorithms. Universal consistency
of learning algorithms and their error analysis also rely on the capacity of the RKHS. In particular,
following the exact procedure for the scalar case in Cucker and Smale (2001), one sufficient condi-
tion for universal consistency of vector-valued (multi-task) learning algorithms is the compactness
of the unit ball of vector-valued RKHS relative to the space of continuous vector-valued functions.
Another alternate sufficient condition was proved in Caponnetto and De Vito (2006) for the regu-
larized least-squares algorithm over vector-valued RKHS. There, it was assumed that, in addition
to the universality of the kernel, the trace of the operators K(x, x) is finite, for every x ∈ X . Clearly,
both conditions are fulfilled by the multi-task kernels presented above if the output space Y is finite
dimensional, but they become nontrivial in the infinite dimensional case. However, it is not clear to
the authors whether either of these two conditions is necessary for universal consistency. We hope
to come back to this problem in the future.
Our definition of the phrase “the feature representation is universal” means that C Φ (Z , Y ) = C (Z , Y )
for every compact subset Z of the input space X . The theorem below demonstrates, as we men-
tioned above, that the kernel K is universal if and only if its feature representation is universal. The
content of Theorem 4 and of the other results of this section (Lemmas 5, 6 and Proposition 7) are graphically represented by the diagram in Table 2.
Theorem 4 If K is a continuous multi-task kernel with feature representation Φ, then for every
compact subset Z of X , we have that CK (Z , Y ) = CΦ (Z , Y ).
Proof The theorem follows straightforwardly from Lemmas 5, 6 and Proposition 7, which we
present below.
As we know, the feature representation of a given kernel is not unique, therefore we conclude by
Theorem 4 that if some feature representation of a multi-task kernel is universal then every feature
representation is universal.
We shall give two different proofs of this general theorem. The first one will use a technique
highlighted in Micchelli and Pontil (2005) and will be given in this section. The second proof will
be given in the next section and uses the notion of vector measure. Both approaches adopt the point
of view of Micchelli et al. (2006), in which Theorem 4 is proved in the special case that Y = R.
We now begin to explain in detail our first proof. We denote the unit ball in Y by B₁ := {y ∈ Y : ||y|| ≤ 1} and let Z be a prescribed compact subset of X. Recall that B₁ is not compact in the norm topology on Y unless Y is finite dimensional. But it is compact in the weak topology on Y since Y is a Hilbert space (see, e.g., Yosida, 1980). Remember that a base of open neighborhoods of the origin in the weak topology is given by sets of the form {y ∈ Y : |(y, y_i)| ≤ 1, i ∈ N_m}, where y₁, . . . , y_m are arbitrary vectors in Y. We put on B₁ the weak topology and conclude, by Tychonoff's theorem (see, e.g., Folland, 1999, p.136), that the set Z × B₁ is also compact in the product topology.
The above observation allows us to associate Y -valued functions defined on Z to scalar-valued
functions defined on Z × B₁. Specifically, we introduce the map ι : C(Z, Y) → C(Z × B₁) which maps any function f ∈ C(Z, Y) to the function ι(f) ∈ C(Z × B₁) defined by the action
$$\iota(f) : (x,y) \mapsto (f(x), y)_{\mathcal{Y}}, \qquad \forall (x,y)\in\mathcal{Z}\times B_1. \qquad (10)$$
Consequently, it follows that the map ι is isometric, since
$$\sup_{x\in\mathcal{Z}} \|f(x)\|_{\mathcal{Y}} = \sup_{x\in\mathcal{Z}}\sup_{\|y\|\le 1} |(f(x), y)_{\mathcal{Y}}| = \sup_{x\in\mathcal{Z}}\sup_{y\in B_1} |\iota(f)(x,y)|,$$
where the first equality follows by Cauchy-Schwarz inequality. Moreover, we will denote by
ι(C (Z , Y )) the image of C (Z , Y ) under the mapping ι. In particular, this space is a closed lin-
ear subspace of C (Z × B1 ) and, hence, a Banach space.
Similarly, to any multi-task kernel K on Z we associate a scalar kernel J on Z × B₁ defined, for every (x, y), (t, u) ∈ X × B₁, as
$$J((x,y), (t,u)) := (K(x,t)u,\ y). \qquad (11)$$
Moreover, we denote by CJ (Z × B1 ) the closure in C (Z × B1 ) of the set of the sections of the kernel,
{J((x, y), (·, ·)) : (x, y) ∈ Z × B1 }. It is important to realize that whenever K is a valid multi-task
kernel, then J is a valid scalar kernel.
$$\begin{array}{ccc} \mathcal{C}_J(\mathcal{Z}\times B_1) & = & \mathcal{C}_\Psi(\mathcal{Z}\times B_1) \\ \uparrow\,\iota & & \uparrow\,\iota \\ \mathcal{C}_K(\mathcal{Z},\mathcal{Y}) & = & \mathcal{C}_\Phi(\mathcal{Z},\mathcal{Y}) \end{array}$$

Table 2: The top equality is Proposition 7, the bottom equality is Theorem 4 and the left and right arrows are Lemmas 5 and 6, respectively.
The lemma below relates the set C_K(Z, Y) to the corresponding set C_J(Z × B₁) for the kernel J on Z × B₁.

Lemma 5 If K is a continuous multi-task kernel on X, J is the scalar kernel defined by Equation (11) and Z is a compact subset of X, then ι(C_K(Z, Y)) = C_J(Z × B₁).

Proof The assertion follows by Equation (11) and the continuity of the map ι.
In order to prove Theorem 4, we also need to provide a similar lemma for the set C Φ (Z , Y ).
Before we state the lemma, we note that knowing the features of the multi-task kernel K leads us to
the features for the scalar-kernel J associated to K. Specifically, for every (x, y), (t, u) ∈ X × B 1 , we
have that
$$J((x,y), (t,u)) = (\Psi(x,y),\ \Psi(t,u))_{\mathcal{W}}, \qquad (12)$$
where Ψ(x, y) = Φ(x)y, for x ∈ X and y ∈ B₁. Thus, Equation (12) parallels Equation (8) except that X is replaced by X × B₁. We also denote by C_Ψ(Z × B₁) the closure in C(Z × B₁) of the linear span of the functions {(Ψ(·, ·), w)_W : w ∈ W}.

Lemma 6 If K is a continuous multi-task kernel with feature representation Φ and Z is a compact subset of X, then ι(C_Φ(Z, Y)) = C_Ψ(Z × B₁).

Proof The proof is immediate. Indeed, for each x ∈ X, w ∈ W and y ∈ Y, we have that (Φ*(x)w, y)_Y = (w, Φ(x)y)_W = (Ψ(x, y), w)_W.
Moreover, we have that ||L|| = ||ν||, where ||ν|| = |ν|(Ω) and ||L|| is the operator norm of L defined by ||L|| = sup_{||g||_{∞,Ω}=1} |L(g)|. Recall that a Borel measure ν is regular if, for any E ∈ B(X),
$$\nu(E) = \inf\{\nu(U) : E\subseteq U,\ U\ \text{open}\} = \sup\{\nu(\bar{U}) : \bar{U}\subseteq E,\ \bar{U}\ \text{compact}\}.$$
In particular, every finite Borel measure on Ω is regular, see Folland (1999, p.217). We denote by M(Ω) the space of all regular signed measures on Ω with the total variation norm. We emphasize here that the Riesz representation theorem stated above requires the compactness of the underlying space Ω.
As mentioned above, Z × B1 is compact relative to the weak topology if Z is compact. This
enables us to use the Riesz representation theorem on the underlying space Ω = Z × B 1 to show the
following proposition.
Proposition 7 For any compact set Z ⊆ X , and any continuous multi-task kernel K with feature
representation Φ, we have that CΨ (Z × B1 ) = CJ (Z × B1 ).
Proof For any compact set Z ⊆ X , recall that Z × B1 is compact if B1 is endowed with the weak
topology of Y . Hence, the result follows by applying Theorem 4 in Micchelli et al. (2006) to the
scalar kernel J on the set Z × B1 with the feature representation given by Equation (12). However,
for the convenience of the reader we review the steps of the argument used to prove that theorem.
The basic idea is the observation that two closed subspaces of a Banach space are equal if and only if
whenever a continuous linear functional vanishes on either one of the subspaces, it must also vanish
on the other one. This is a consequence of the Hahn-Banach Theorem (see, e.g., Lax, 2002). In the
case at hand, we know by the Riesz Representation Theorem that all continuous linear functionals
L on C(Z × B₁) are given by a regular signed Borel measure ν, that is, for every F ∈ C(Z × B₁), we have that
$$L(F) = \int_{\mathcal{Z}\times B_1} F(x,y)\, d\nu(x,y).$$
Now, suppose that L vanishes on C_J(Z × B₁); then we conclude, by (12), that
$$0 = \int_{\mathcal{Z}\times B_1}\int_{\mathcal{Z}\times B_1} (\Psi(x,y),\ \Psi(t,u))_{\mathcal{W}}\, d\nu(x,y)\, d\nu(t,u).$$
Also, since K is assumed to be continuous relative to the operator norm and Z is compact, we have that $\|\Psi(x,y)\|^2_{\mathcal{W}} = \|\Phi(x)y\|^2_{\mathcal{W}} = (K(x,x)y, y) \le \sup_{x\in\mathcal{Z}} \|K(x,x)\| < \infty$. This together with the inequality
$$\Big\|\int_{\mathcal{Z}\times B_1} \Psi(x,y)\, d\nu(x,y)\Big\|_{\mathcal{W}} \le \int_{\mathcal{Z}\times B_1} \|\Psi(x,y)\|_{\mathcal{W}}\, d|\nu|(x,y) \le \sup_{x\in\mathcal{Z}} \|K(x,x)\|^{\frac{1}{2}}\, |\nu|(\mathcal{Z}\times B_1)$$
implies that the integral $\int_{\mathcal{Z}\times B_1} \Psi(x,y)\, d\nu(x,y)$ exists. Consequently, it follows that
$$\int_{\mathcal{Z}\times B_1}\int_{\mathcal{Z}\times B_1} (\Psi(x,y),\ \Psi(t,u))_{\mathcal{W}}\, d\nu(x,y)\, d\nu(t,u) = \Big\|\int_{\mathcal{Z}\times B_1} \Psi(x,y)\, d\nu(x,y)\Big\|^2_{\mathcal{W}}, \qquad (13)$$
and hence $\int_{\mathcal{Z}\times B_1} \Psi(x,y)\, d\nu(x,y) = 0$, that is, L vanishes on C_Ψ(Z × B₁). Reversing the steps of this argument establishes the converse implication, and the result follows.
Definition 8 A map µ : B(Z) → Y is called a Borel vector measure if µ is countably additive, that is, $\mu(\cup_{j=1}^\infty E_j) = \sum_{j=1}^\infty \mu(E_j)$ in the norm of Y, for all sequences {E_j : j ∈ N} of pairwise disjoint sets in B(Z).
It is important to note that the definition of vector measure given in Diestel and Uhl, Jr. (1977)
only requires it to be finitely additive. For our purpose here, we only use countably additive mea-
sures and thus do not require the more general setting used in Diestel and Uhl, Jr. (1977).
For any vector measure µ, the variation of µ is defined, for any E ∈ B(Z), by the equation
$$|\mu|(E) := \sup\Big\{\sum_{j\in\mathbb{N}} \|\mu(A_j)\| : \{A_j : j\in\mathbb{N}\}\ \text{pairwise disjoint and}\ \cup_{j\in\mathbb{N}} A_j = E\Big\}.$$
In our terminology we conclude from (Diestel and Uhl, Jr., 1977, p.3) that µ is a vector measure if
and only if the corresponding variation |µ| is a scalar measure as explained in Section 3. Whenever
|µ|(Z ) < ∞, we call µ a vector measure of bounded variation on Z . Moreover, we say that a Borel
vector measure µ on Z is regular if its variation measure |µ| is regular as defined in Section 3.
We denote by M (Z , Y ) the Banach space of all vector measures with bounded variation and norm
kµk := |µ|(Z ).
For any scalar measure ν ∈ M(Z × B₁), we define a Y-valued function on B(Z) by the equation
$$\mu(E) := \int_{E\times B_1} y\, d\nu(x,y), \qquad \forall E\in\mathcal{B}(\mathcal{Z}). \qquad (15)$$
Let us confirm that µ is indeed a vector measure. For this purpose, choose any sequence of pairwise
disjoint subsets {E_j : j ∈ N} ⊆ B(Z) and observe that
$$\sum_{j\in\mathbb{N}} \|\mu(E_j)\|_{\mathcal{Y}} \le \sum_{j\in\mathbb{N}} \int_{E_j\times B_1} d|\nu|(x,y) \le |\nu|(\mathcal{Z}\times B_1),$$
which implies that |µ|(Z) is finite and, hence, µ is a regular vector measure. This observation
suggests that we define, for any f ∈ C (Z , Y ), the integral of f relative to µ as
$$\int_{\mathcal{Z}} (f(x),\ d\mu(x)) := \int_{\mathcal{Z}}\int_{B_1} (f(x),\ y)\, d\nu(x,y). \qquad (16)$$
To prepare for our description of the dual space of C(Z, Y), we introduce, for each µ ∈ M(Z, Y), a linear functional L_µ on C(Z, Y) defined, for f ∈ C(Z, Y), by
$$L_\mu f := \int_{\mathcal{Z}} (f(x),\ d\mu(x)). \qquad (17)$$
Then, we have the following useful lemmas, see the appendix for their proofs.
Since we established the isometry between C ∗ (Z , Y ) and M (Z , Y ), it follows that, for every regular
vector measure there corresponds a scalar measure on Z × B 1 for which Equation (15) holds true.
In order to provide our alternate proof of Theorem 4, we need to attend to one further issue. Specifically, we need to define the integral $\int_{\mathcal{Z}} K(t,x)(d\mu(x))$ as an element of Y. For this purpose, for any µ ∈ M(Z, Y) and t ∈ Z we define a linear functional L_t on Y at y ∈ Y as
$$L_t y := \int_{\mathcal{Z}} (K(x,t)y,\ d\mu(x)).$$
Since its norm has the property $\|L_t\| \le \sup_{x\in\mathcal{Z}} \|K(x,t)\|\,\|\mu\|$, by the Riesz representation lemma we conclude that there exists a unique element ȳ ∈ Y such that
$$\int_{\mathcal{Z}} (K(x,t)y,\ d\mu(x)) = (\bar{y},\ y).$$
It is this vector ȳ which we denote by the integral $\int_{\mathcal{Z}} K(t,x)(d\mu(x))$.
Similarly, we define the integral $\int_{\mathcal{Z}} \Phi(x)(d\mu(x))$ as an element in W. To do this, we note that ||Φ(x)|| = ||Φ*(x)|| and ||Φ(x)y||² = (K(x,x)y, y). Hence, we conclude that there exists a constant κ such that, for all x ∈ Z, $\|\Phi(x)\| \le \|K(x,x)\|^{\frac{1}{2}} \le \kappa$. Consequently, the linear functional on W defined, for w ∈ W, by $w \mapsto \int_{\mathcal{Z}} (\Phi^*(x)w,\ d\mu(x))$ is bounded, and there exists a unique element of W, denoted by $\int_{\mathcal{Z}} \Phi(x)(d\mu(x))$, such that
$$\Big(w,\ \int_{\mathcal{Z}} \Phi(x)(d\mu(x))\Big)_{\mathcal{W}} = \int_{\mathcal{Z}} (\Phi^*(x)w,\ d\mu(x)), \qquad \forall w\in\mathcal{W}. \qquad (18)$$
We have now assembled all the necessary properties of vector measures to provide an alternate
proof of Theorem 4.
Alternate Proof of Theorem 4. We see from the feature representation (7) that
$$\int_{\mathcal{Z}} K(t,x)(d\mu(x)) = \int_{\mathcal{Z}} \Phi^*(t)\Phi(x)\,(d\mu(x)) = \Phi^*(t)\Big(\int_{\mathcal{Z}} \Phi(x)(d\mu(x))\Big), \qquad \forall t\in\mathcal{Z}.$$
From this equation, we easily see that if $\int_{\mathcal{Z}} \Phi(x)(d\mu(x)) = 0$ then, for every t ∈ Z, $\int_{\mathcal{Z}} K(t,x)(d\mu(x)) = 0$. On the other hand, applying (18) with the choice $w = \int_{\mathcal{Z}} \Phi(x)(d\mu(x))$ we get
$$\int_{\mathcal{Z}} \Big(\Phi^*(t)\int_{\mathcal{Z}} \Phi(x)(d\mu(x)),\ d\mu(t)\Big) = \Big\|\int_{\mathcal{Z}} \Phi(x)(d\mu(x))\Big\|^2_{\mathcal{W}}.$$
Therefore, if, for any t ∈ Z, $\int_{\mathcal{Z}} K(t,x)(d\mu(x)) = 0$ then $\int_{\mathcal{Z}} \Phi(x)(d\mu(x)) = 0$, or equivalently, by Equation (18),
$$\int_{\mathcal{Z}} (\Phi^*(x)w,\ d\mu(x)) = 0, \qquad \forall w\in\mathcal{W}.$$
Consequently, a linear functional vanishes on C_K(Z, Y) if and only if it vanishes on C_Φ(Z, Y) and thus we obtain that C_K(Z, Y) = C_Φ(Z, Y).
We end this section with a review of our approach to the question of universality of multi-task kernels. The principal tool we employ is a notion of functional analysis referred to as the annihilator. Recall that the annihilator of a set V ⊆ C(Z, Y) is defined by
$$V^\perp := \Big\{\mu\in M(\mathcal{Z},\mathcal{Y}) : \int_{\mathcal{Z}} (v(x),\ d\mu(x)) = 0,\ \forall v\in V\Big\}.$$
Notice that the annihilator of the closed linear span of V is the same as that of V. Consequently, by applying the basic principle stated at the beginning of this section, we conclude that the linear span of V is dense in C(Z, Y), that is, $\overline{\mathrm{span}}(V) = C(\mathcal{Z},\mathcal{Y})$, if and only if the annihilator V⊥ = {0}. Hence, applying this observation to the set of kernel sections K(Z) := {K(·, x)y : x ∈ Z, y ∈ Y} or to the set of its corresponding feature sections Φ(Z) := {Φ*(·)w : w ∈ W}, we obtain from Lemma 10 and Theorem 4 the summary of our main result.
Theorem 11 Let Z be a compact subset of X , K a continuous multi-task kernel, and Φ its feature
representation. Then, the following statements are equivalent.
1. CK (Z , Y ) = C (Z , Y ).
2. CΦ (Z , Y ) = C (Z , Y ).
3. $K(\mathcal{Z})^\perp = \big\{\mu\in M(\mathcal{Z},\mathcal{Y}) : \int_{\mathcal{Z}} K(t,x)(d\mu(x)) = 0,\ \forall t\in\mathcal{Z}\big\} = \{0\}$.

4. $\Phi(\mathcal{Z})^\perp = \big\{\mu\in M(\mathcal{Z},\mathcal{Y}) : \int_{\mathcal{Z}} \Phi(x)(d\mu(x)) = 0\big\} = \{0\}$.
5. Universal Kernels
In this section, we prove the universality of some kernels, based on Theorem 11 developed above.
Specifically, the examples highlighted in Section 2 will be discussed in detail.
Kernel’s universality is a main hypothesis in the proof of consistency of learning algorithms.
Universal consistency of the regularized least-squares algorithm over vector-valued RKHS was
proved in Caponnetto and De Vito (2006); there, it was assumed that, in addition to universality
of the kernel, the trace of the operators K(x, x) is finite. In particular, this extra condition on the
kernel holds, for Example 1 highlighted in Section 2, when the operators B_j are trace class, and does not hold for Example 4. It is not clear to the authors whether the finite trace condition is necessary for consistency.
We first consider the kernel of Example 1 with a single summand, that is, K(x, t) := G(x, t)B, where G is a scalar kernel and B ∈ L₊(Y), and confirm that it is a multi-task kernel. For any {x_j ∈ X : j ∈ N_m} and {y_j ∈ Y : j ∈ N_m}, we know that (G(x_i, x_j))_{i,j∈N_m} and ((By_i, y_j))_{i,j∈N_m} are positive semi-definite. Applying Schur's lemma implies that the matrix (G(x_i, x_j)(By_i, y_j))_{i,j∈N_m} is positive semi-definite and hence K is positive semi-definite. Moreover, K*(x, t) = K(x, t) = K(t, x) for any x, t ∈ X. Therefore, we conclude by Definition 1 that K is a multi-task kernel.
Our goal below is to use the feature representation of the scalar kernel G to introduce the corresponding one for the kernel K. To this end, we first let W be a Hilbert space and φ : X → W a feature map of the scalar kernel G, so that
$$G(x,t) = (\phi(x),\ \phi(t))_{\mathcal{W}}, \qquad \forall x, t\in\mathcal{X}. \qquad (19)$$
Then, we introduce the tensor product space W ⊗ Y. Algebraically, this vector space is spanned by elements of the form w ⊗ y, with w ∈ W and y ∈ Y, subject to the relations
$$(w_1 + w_2)\otimes y = w_1\otimes y + w_2\otimes y$$
and
$$w\otimes(y_1 + y_2) = w\otimes y_1 + w\otimes y_2.$$
We can turn the tensor space into an inner product space by defining, for any w₁ ⊗ y₁, w₂ ⊗ y₂ ∈ W ⊗ Y,
$$(w_1\otimes y_1,\ w_2\otimes y_2) := (w_1, w_2)_{\mathcal{W}}\,(y_1, y_2)_{\mathcal{Y}}.$$
If W and Y are separable with orthonormal bases {w_i : i ∈ N} and {y_i : i ∈ N} respectively, then W ⊗ Y is exactly the Hilbert space spanned by the orthonormal basis {w_i ⊗ y_j : i, j ∈ N} under the inner product defined above. For instance, if W = Rᵏ and Y = Rⁿ, then W ⊗ Y = Rᵏⁿ.
The above tensor product suggests that we define the map Φ : X → L(Y, W ⊗ Y) of the kernel K by
$$\Phi(x)y := \phi(x)\otimes\sqrt{B}\,y, \qquad \forall x\in\mathcal{X},\ y\in\mathcal{Y},$$
and it follows that Φ* : X → L(W ⊗ Y, Y) is given by
$$\Phi^*(x)(w\otimes y) := (\phi(x),\ w)_{\mathcal{W}}\,\sqrt{B}\,y, \qquad \forall x\in\mathcal{X},\ w\in\mathcal{W},\ y\in\mathcal{Y}. \qquad (20)$$
From the above observation, it is easy to check, for any x, t ∈ X and y, u ∈ Y, that (K(x,t)y, u) = ⟨Φ(x)y, Φ(t)u⟩. Therefore, we conclude that Φ is a feature map for the multi-task kernel K.
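In finite dimensions this identity can be verified directly; the sketch below uses the quadratic kernel G(x, t) = (1 + x·t)² on R², whose exact feature map is standard, together with an illustrative positive definite B on Y = R³ (both choices are ours, for illustration only).

```python
import numpy as np

def phi(x):
    """Exact feature map of G(x, t) = (1 + x.t)^2 on R^2."""
    s = np.sqrt(2.0)
    return np.array([1.0, s * x[0], s * x[1], x[0] ** 2, x[1] ** 2, s * x[0] * x[1]])

B = np.array([[2.0, 1.0, 0.0], [1.0, 2.0, 0.5], [0.0, 0.5, 1.0]])   # PSD choice
w, V = np.linalg.eigh(B)
sqrtB = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T            # B^{1/2}

Phi = lambda x, y: np.kron(phi(x), sqrtB @ y)   # Phi(x)y = phi(x) tensor sqrt(B)y

x, t = np.array([0.5, -1.0]), np.array([1.5, 0.25])
y, u = np.array([1.0, -2.0, 0.5]), np.array([0.3, 0.7, -1.1])

lhs = u @ ((1.0 + x @ t) ** 2 * (B @ y))   # (K(x,t)y, u) with K(x,t) = G(x,t) B
rhs = Phi(x, y) @ Phi(t, u)                # <Phi(x)y, Phi(t)u> in W tensor Y
print(lhs, rhs)                            # the two values agree up to round-off
```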
Finally, we say that an operator B ∈ L₊(Y) is positive definite if (By, y) is positive whenever y is nonzero. We are now ready to present the result on the universality of the kernel K.

Theorem 12 Let G : X × X → R be a continuous scalar kernel, B ∈ L₊(Y), and define the multi-task kernel K(x, t) := G(x, t)B for x, t ∈ X. Then K is universal if and only if G is universal and B is positive definite.
Proof By Theorem 11 and the feature representation (20), we only need to show that Φ(Z ) ⊥ = {0}
if and only if G is universal and the operator B is positive definite.
We begin with the sufficiency, so assume that G is universal and B is positive definite. Suppose that µ is a vector measure such that, for any w ⊗ y ∈ W ⊗ Y, there holds
$$\int_{\mathcal{Z}} (\Phi^*(x)(w\otimes y),\ d\mu(x)) = \int_{\mathcal{Z}} (\phi(x),\ w)_{\mathcal{W}}\,(\sqrt{B}\,y,\ d\mu(x)) = 0. \qquad (21)$$
Here, with a little abuse of notation, we interpret, for a fixed y ∈ Y, $(\sqrt{B}\,y,\ d\mu(x))$ as a scalar measure defined, for any E ∈ B(Z), by
$$\int_E (\sqrt{B}\,y,\ d\mu(x)) = (\sqrt{B}\,y,\ \mu(E)).$$
Since µ ∈ M(Z, Y), $(\sqrt{B}\,y,\ d\mu(x))$ is a regular signed scalar measure. Therefore, we see from (21) that $(\sqrt{B}\,y,\ d\mu(x)) \in \phi(\mathcal{Z})^\perp$ for any y ∈ Y. Remember that G is universal if and only if φ(Z)⊥ = {0}, and thus we conclude from (21) that $(\sqrt{B}\,y,\ d\mu(x)) = 0$ for any y ∈ Y. It follows that $(\sqrt{B}\,y,\ \mu(E)) = 0$ for any y ∈ Y and E ∈ B(Z). Thus, for any fixed set E, taking the choice $y = \sqrt{B}\,\mu(E)$ implies that (Bµ(E), µ(E)) = 0 and, since B is positive definite, that µ(E) = 0. Since E is arbitrary, this means µ = 0, which finishes the proof of the sufficiency.
To prove the necessity, suppose first that G is not universal. Hence, we know that, for some compact subset Z of X, there exists a nonzero scalar measure ν ∈ M(Z) such that ν ∈ φ(Z)⊥, that is, $\int_{\mathcal{Z}} (\phi(x),\ w)\, d\nu(x) = 0$ for any w ∈ W. This suggests to us to choose, for a nonzero y₀ ∈ Y, the nonzero vector measure µ = y₀ν defined by µ(E) := y₀ν(E) for any E ∈ B(Z). Hence, the integral in Equation (21) equals
$$\int_{\mathcal{Z}} (\Phi^*(x)(w\otimes y),\ d\mu(x)) = (\sqrt{B}\,y,\ y_0) \int_{\mathcal{Z}} (\phi(x),\ w)_{\mathcal{W}}\, d\nu(x) = 0.$$
Therefore, we conclude that there exists a nonzero vector measure µ ∈ Φ(Z)⊥, which implies that K is not universal.
If B is not positive definite, namely, there exists a nonzero element y₁ ∈ Y such that (By₁, y₁) = 0, we observe that $\|\sqrt{B}\,y_1\|^2 = (By_1, y_1)$, which implies that $\sqrt{B}\,y_1 = 0$. This suggests to us to choose the nonzero vector measure µ = y₁ν with some nonzero scalar measure ν. Therefore, we conclude, for any w ∈ W and y ∈ Y, that
$$\int_{\mathcal{Z}} (\Phi^*(x)(w\otimes y),\ d\mu(x)) = (\sqrt{B}\,y,\ y_1) \int_{\mathcal{Z}} (\phi(x),\ w)_{\mathcal{W}}\, d\nu(x) = (y,\ \sqrt{B}\,y_1) \int_{\mathcal{Z}} (\phi(x),\ w)_{\mathcal{W}}\, d\nu(x) = 0,$$
which implies that the nonzero vector measure µ ∈ Φ(Z)⊥. This finishes the proof of the theorem.
We now proceed further and consider kernels produced by a finite combination of scalar kernels and operators. Specifically, for each j ∈ N_m, let G_j : X × X → R be a scalar kernel and B_j ∈ L₊(Y). We are interested in the kernel defined, for any x, t ∈ X, by
$$K(x,t) := \sum_{j\in\mathbb{N}_m} G_j(x,t)\, B_j.$$
Suppose also, for each scalar kernel G j , that there exists a Hilbert feature space W j and a feature
map φ j : X → W j .
To explain the associated feature map of kernel K, we need to define its feature space. For this
purpose, let H_j be a Hilbert space with inner product (·, ·)_j for j ∈ N_m and introduce the direct sum Hilbert space ⊕_{j∈N_m} H_j as follows. The elements in this space are of the form (h₁, . . . , h_m) with h_j ∈ H_j, and its inner product is defined, for any (h₁, . . . , h_m), (h′₁, . . . , h′_m) ∈ ⊕_{j∈N_m} H_j, by
$$\big((h_1,\ldots,h_m),\ (h'_1,\ldots,h'_m)\big) := \sum_{j\in\mathbb{N}_m} (h_j,\ h'_j)_j.$$
This observation suggests to us to define the feature space of the kernel K by the direct sum Hilbert space W := ⊕_{j∈N_m}(W_j ⊗ Y), and the map Φ : X → L(Y, W), for any x ∈ X and y ∈ Y, by
$$\Phi(x)y := \big(\phi_1(x)\otimes\sqrt{B_1}\,y,\ \ldots,\ \phi_m(x)\otimes\sqrt{B_m}\,y\big). \qquad (22)$$
Using the above observation, it is easy to see that, for any x,t ∈ X , K(x,t) = Φ ∗ (x)Φ(t). Thus K is
a multi-task kernel and Φ is a feature map of K.
We are now in a position to state the result about the universality of the kernel K.
Theorem 13 Suppose that G j : X × X → R is a continuous scalar universal kernel, and B j ∈
L+ (Y ) for j ∈ Nm . Then, K(x,t) := ∑ j∈Nm G j (x,t)B j is universal if and only if ∑ j∈Nm B j is pos-
itive definite.
Proof Following Theorem 11, we need to prove that Φ(Z)⊥ = {0} for any compact set Z if and only if ∑_{j∈N_m} B_j is positive definite. To see this, observe that µ ∈ Φ(Z)⊥ implies, for any (w₁ ⊗ y₁, . . . , w_m ⊗ y_m) ∈ ⊕_{j∈N_m}(W_j ⊗ Y), that
$$\int_{\mathcal{Z}} \sum_{j\in\mathbb{N}_m} (\phi_j(x),\ w_j)_{\mathcal{W}_j}\,\big(\sqrt{B_j}\,y_j,\ d\mu(x)\big) = 0.$$
To move on to the next step, one shows that the above equation is true if and only if, for every E ∈ B(Z),
$$\sum_{j\in\mathbb{N}_m} \big(B_j\,\mu(E),\ \mu(E)\big) = 0. \qquad (26)$$
Therefore, we conclude that µ ∈ Φ(Z)⊥ if and only if Equation (26) holds true.
Obviously, by Equation (26), we see that if ∑_{j∈N_m} B_j is positive definite then µ = 0. This means that the kernel K is universal. Suppose that ∑_{j∈N_m} B_j is not positive definite, that is, there exists a nonzero y₀ ∈ Y such that $\|(\sum_{j\in\mathbb{N}_m} B_j)^{\frac{1}{2}} y_0\|^2 = \big((\sum_{j\in\mathbb{N}_m} B_j)\,y_0,\ y_0\big) = 0$. Hence, choosing a nonzero vector measure µ := y₀ν, with ν a nonzero scalar measure, implies that Equation (26) holds true and thus the kernel K is not universal. This finishes the proof of the theorem.
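The necessity of the positivity of ∑_{j∈N_m} B_j has a concrete interpretation: if a nonzero y₀ satisfies ((∑_j B_j)y₀, y₀) = 0 then, since each B_j is positive semi-definite, B_j y₀ = 0 for every j, so no kernel section can produce output in the direction y₀. A hedged sketch with illustrative kernels and matrices:

```python
import numpy as np

B1 = np.array([[1.0, 0.0], [0.0, 0.0]])        # PSD, rank one
B2_ok = np.array([[0.0, 0.0], [0.0, 1.0]])     # B1 + B2_ok = I is positive definite
B2_bad = np.array([[1.0, 0.0], [0.0, 0.0]])    # B1 + B2_bad stays singular

G1 = lambda x, t: np.exp(-abs(x - t))          # Laplace kernel (universal on R)
G2 = lambda x, t: np.exp(-(x - t) ** 2)        # Gaussian kernel (universal on R)
K = lambda x, t, B2: G1(x, t) * B1 + G2(x, t) * B2

y0 = np.array([0.0, 1.0])                      # null direction of B1 + B2_bad
print(K(0.3, -0.7, B2_bad) @ y0)   # zero vector: the direction y0 is unreachable
print(K(0.3, -0.7, B2_ok) @ y0)    # nonzero: y0 is reachable, K can be universal
```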
Now we are in a position to analyze Examples 1 and 4 given in Section 2. Since the function K considered in Example 1 admits the feature representation (22), we conclude that it is a multi-task kernel and, by Theorem 13, it is universal provided each G_j is universal and ∑_{j∈N_m} B_j is positive definite.
We now discuss a class of kernels which includes that presented in Example 4. To this end, we use the notation Z₊ = {0} ∪ N and, for any smooth function f : Rᵐ → R and index α = (α₁, . . . , α_m) ∈ Z₊ᵐ, we denote the α-th partial derivative by $\partial^\alpha f(x) := \frac{\partial^{|\alpha|} f(x)}{\partial^{\alpha_1}x_1\cdots\partial^{\alpha_m}x_m}$. Then, recall that the Sobolev space W^k with integer order k is the space of real-valued functions with norm defined by
$$\|f\|^2_{W^k} := \sum_{|\alpha|\le k} \int_{\mathbb{R}^m} |\partial^\alpha f(x)|^2\, dx, \qquad (27)$$
where |α| = ∑_{j∈N_m} α_j, see Stein (1970). This space can be extended to any fractional index s > 0.
To see this, we need the Fourier transform defined, for any f ∈ L¹(Rᵐ), as
$$\hat{f}(\xi) := \int_{\mathbb{R}^m} e^{-2\pi i\langle x,\xi\rangle} f(x)\, dx, \qquad \forall\xi\in\mathbb{R}^m,$$
see Stein (1970). It has a natural extension to L²(Rᵐ) satisfying the Plancherel formula $\|f\|_{L^2(\mathbb{R}^m)} = \|\hat{f}\|_{L^2(\mathbb{R}^m)}$. In particular, we observe, for any α = (α₁, . . . , α_m) ∈ Z₊ᵐ and ξ = (ξ₁, . . . , ξ_m) ∈ Rᵐ, that $\widehat{\partial^\alpha f}(\xi) = \hat{f}(\xi)\,(2\pi i\xi_1)^{\alpha_1}\cdots(2\pi i\xi_m)^{\alpha_m}$. Hence, by the Plancherel formula, we see, for any f ∈ W^k with k ∈ N, that its norm $\|f\|_{W^k}$ is equivalent to
$$\Big(\int_{\mathbb{R}^m} (1 + 4\pi^2\|\xi\|^2)^k\, |\hat{f}(\xi)|^2\, d\xi\Big)^{\frac{1}{2}}.$$
This observation suggests to us to introduce the fractional Sobolev space W^s (see, e.g., Stein, 1970) with any order s > 0 with norm defined, for any function f, by
$$\|f\|^2_{W^s} := \int_{\mathbb{R}^m} (1 + 4\pi^2\|\xi\|^2)^s\, |\hat{f}(\xi)|^2\, d\xi.$$
Finally, we need the Sobolev embedding lemma, which states that, for any s > m/2, there exists an absolute constant c such that, for any f ∈ W^s and any x ∈ Rᵐ, there holds
$$|f(x)| \le c\, \|f\|_{W^s}.$$
Proposition 14 Let Y = L²(Rᵈ), X = R and H be the space of real-valued functions with norm
$$\|f\|^2 := \int_{\mathbb{R}} \Big[ \|f(x,\cdot)\|^2_{W^{\frac{d+1}{2}}} + \Big\|\frac{\partial f}{\partial x}(x,\cdot)\Big\|^2_{W^{\frac{d+1}{2}}} \Big]\, dx.$$
Then this is an RKHS with universal multi-task kernel given, for every x, t ∈ X, by
$$(K(x,t)y)(r) = e^{-\pi|x-t|} \int_{\mathbb{R}^d} e^{-\pi\|r-s\|}\, y(s)\, ds, \qquad \forall y\in\mathcal{Y},\ r\in\mathbb{R}^d. \qquad (28)$$
Proof For any fixed t ∈ Rᵈ, it follows from the Sobolev embedding lemma that
$$|f(x,t)| \le c\, \|f(\cdot,t)\|_{W^1}.$$
Combining this with the definition of the Sobolev space W¹ given by Equation (27), we have that
$$\|f(x)\|^2_{\mathcal{Y}} \le c^2 \int_{\mathbb{R}^d} \|f(\cdot,t)\|^2_{W^1}\, dt = c^2 \int_{\mathbb{R}^d}\int_{\mathbb{R}} \Big[ f(x,t)^2 + \Big|\frac{\partial f}{\partial x}(x,t)\Big|^2 \Big]\, dx\, dt \le c^2 \|f\|^2.$$
Since, for any y ∈ B₁ and x ∈ R, |(y, f(x))_Y| ≤ ||y||_Y ||f(x)||_Y ≤ ||f(x)||_Y, by the above equation there exists a constant c′ such that, for any y ∈ B₁, x ∈ R and f ∈ H,
$$|(y,\ f(x))_{\mathcal{Y}}| \le c'\, \|f\|.$$
Hence, by the Riesz representation lemma, H is an RKHS (Micchelli and Pontil, 2005).
Next, we confirm that Equation (28) gives the kernel associated to H. To this end, it suffices to show that the reproducing property holds, that is, for any f ∈ H, y ∈ Y and x ∈ X,
$$(y,\ f(x))_{\mathcal{Y}} = \langle f,\ K_x y\rangle_{\mathcal{H}}. \qquad (29)$$
On the other hand, note that K_x y(x′) := K(x′, x)y ∈ Y, and consider its Fourier transform
$$\widehat{(K(\cdot,x)y)}(\xi,\tau) = \int_{\mathbb{R}}\int_{\mathbb{R}^d} e^{-2\pi i\langle x',\xi\rangle}\, e^{-2\pi i\langle r,\tau\rangle}\, (K(x',x)y)(r)\, dr\, dx'.$$
Using Equation (28) and the Plancherel formula, the integral on the right-hand side of the above equation is equal to
$$\frac{e^{-2\pi i\langle x,\xi\rangle}\, \hat{y}(\tau)}{(1 + 4\pi^2|\xi|^2)\,(1 + 4\pi^2\|\tau\|^2)^{\frac{d+1}{2}}}. \qquad (30)$$
Putting (30) into the above equation, we immediately know that the reproducing property (29) holds true. This verifies that K is the reproducing kernel of the Hilbert space H.
To prove the universality of this kernel, let Z be any prescribed compact subset of X. We define the Laplace kernel, for any x, t ∈ R, by G(x, t) := e^{−|x−t|} and the operator B : L²(Rᵈ) → L²(Rᵈ) by
$$Bg(r) := \int_{\mathbb{R}^d} e^{-\|r-s\|}\, g(s)\, ds, \qquad \forall r\in\mathbb{R}^d.$$
A direct computation with the Fourier transform shows that there exists a constant c_d > 0 such that
$$\widehat{Bg}(\tau) = c_d\, \frac{\hat{g}(\tau)}{(1 + 4\pi^2\|\tau\|^2)^{\frac{d+1}{2}}}. \qquad (31)$$
By Theorem 12, it now suffices to prove that G is universal and B is positive definite. To this end, note that there exists a constant c_d such that
$$G(x,t) = c_d \int_{\mathbb{R}} \frac{e^{-2\pi i\langle x-t,\xi\rangle}}{1 + 4\pi^2|\xi|^2}\, d\xi.$$
Since the weight function $\frac{1}{1+4\pi^2|\xi|^2}$ is positive, G is universal according to Micchelli et al. (2003).
To show the positive definiteness of B, we obtain from Equation (31) and the Plancherel formula that
$$(Bg,\ g) = c_d \int_{\mathbb{R}^d} \frac{|\hat{g}(\tau)|^2\, d\tau}{(1 + 4\pi^2\|\tau\|^2)^{\frac{d+1}{2}}},$$
which is positive whenever g ≠ 0. Hence B is positive definite, and the proof is complete.
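The positive definiteness of B can also be observed numerically by discretizing the operator on a grid (a crude quadrature sketch for d = 1, separate from the Fourier argument above):

```python
import numpy as np

# Discretize (Bg)(r) = \int exp(-|r - s|) g(s) ds by a Riemann sum on a grid.
r = np.linspace(-5.0, 5.0, 400)
h = r[1] - r[0]
Bmat = np.exp(-np.abs(r[:, None] - r[None, :])) * h   # kernel matrix * weight h

print("min eigenvalue:", np.linalg.eigvalsh(Bmat).min())  # strictly positive
```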
We now discuss continuous parameterized multi-task kernels. For this purpose, let Ω be a locally
compact Hausdorff space and, for any ω ∈ Ω, B(ω) be an n × n positive semi-definite matrix. We
are interested in the kernel of the following form
$$K(x,t) = \int_\Omega G(\omega)(x,t)\, B(\omega)\, dp(\omega), \qquad \forall x, t\in\mathcal{X}, \qquad (32)$$
where p is a measure on Ω. We investigate this kernel in the case that, for any ω ∈ Ω, G(ω) is a scalar kernel with a feature representation given, for any x, t ∈ X, by the formula G(ω)(x, t) = ⟨φ_ω(x), φ_ω(t)⟩_W. Now, we introduce the Hilbert space $\widetilde{\mathcal{W}} = L^2(\Omega,\ \mathcal{W}\otimes\mathcal{Y},\ p)$ with norm defined, for any f : Ω → W ⊗ Y, by
$$\|f\|^2_{\widetilde{\mathcal{W}}} := \int_\Omega \|f(\omega)\|^2_{\mathcal{W}\otimes\mathcal{Y}}\, dp(\omega).$$
By an argument similar to that used just before Theorem 13, we conclude that K is a multi-task kernel and has a feature map Φ with feature space $\widetilde{\mathcal{W}}$.
We are ready to present a sufficient condition on the universality of K.
Theorem 15 Let p be a measure on Ω and for every ω in the support of p, let G(ω) be a continuous
universal kernel and B(ω) a positive definite operator. Then, the multi-task kernel K defined by
Equation (32) is universal.
Proof Following Theorem 11, for a compact set Z ⊆ X, suppose that there exists a vector measure µ such that
$$\int_{\mathcal{Z}} \phi_\omega(x)\otimes\sqrt{B(\omega)}\,(d\mu(x)) = 0$$
for p-almost every ω ∈ Ω. Therefore, there exists an ω₀ ∈ support(p) satisfying $\int_{\mathcal{Z}} \phi_{\omega_0}(x)\otimes\sqrt{B(\omega_0)}\,(d\mu(x)) = 0$. Equivalently, $\big(\int_{\mathcal{Z}} \phi_{\omega_0}(x)\,(\sqrt{B(\omega_0)}\,d\mu(x)),\ y\big) = 0$ for any y ∈ Y. Since we assume G(ω₀) is universal, appealing to the feature characterization in the scalar case (Micchelli et al., 2006) implies that the scalar measure $(\sqrt{B(\omega_0)}\,d\mu(x),\ y) = 0$. Since y ∈ Y is arbitrary and B(ω₀) is positive definite, we consequently obtain that µ ≡ 0. This completes the proof of this theorem.
Example 5 Suppose the measure p over [0, ∞) does not concentrate on zero and B(ω) is a positive definite n × n matrix for each ω ∈ (0, ∞). Then the kernel $K(x,t) = \int_0^\infty e^{-\omega\|x-t\|^2}\, B(\omega)\, dp(\omega)$ is a universal multi-task kernel by Theorem 15.
Further specializing this example, we choose the measure p to be the Lebesgue measure on [0, ∞) and choose B(ω) in the following manner. Let A be an n × n symmetric matrix. For every ω > 0, we define the (i, j)-th entry of the matrix B(ω) as $e^{-\omega A_{ij}}$, i, j ∈ N_n. Recall that a matrix A is conditionally negative semi-definite if, for any c_i ∈ R, i ∈ N_n, with ∑_{i∈N_n} c_i = 0, the quadratic form satisfies ∑_{i,j∈N_n} c_i A_{ij} c_j ≤ 0. A well-known theorem of I. J. Schoenberg (see, e.g., Micchelli, 1986) states that B(ω) is positive semi-definite for all ω > 0 if and only if A is conditionally negative semi-definite. Moreover, if the elements of the conditionally negative semi-definite matrix A satisfy, for any i ≠ j ∈ N_n, the inequalities $A_{ij} > \frac{1}{2}(A_{ii} + A_{jj})$ and $A_{ii} > 0$, then B(ω) is positive definite (Micchelli, 1986). With this choice of A, the universal kernel in Example 5 becomes
$$K(x,t)_{ij} = \frac{1}{\|x-t\|^2 + A_{ij}}, \qquad \forall i, j\in\mathbb{N}_n.$$
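As a sanity check on this construction, the sketch below picks a small symmetric matrix A satisfying the conditional negative semi-definiteness and the strict inequalities above (the particular entries are an illustrative choice of ours) and verifies that the resulting Gram matrices are positive semi-definite.

```python
import numpy as np

# An illustrative 3x3 symmetric A: A_ii > 0 and A_ij > (A_ii + A_jj)/2 for i != j;
# one can check directly that A is conditionally negative semi-definite.
A = np.array([[1.0, 2.5, 3.0],
              [2.5, 1.5, 2.8],
              [3.0, 2.8, 2.0]])

def K(x, t):
    return 1.0 / (np.sum((x - t) ** 2) + A)   # entrywise 1 / (||x-t||^2 + A_ij)

rng = np.random.default_rng(2)
xs = rng.standard_normal((5, 2))
m, n = len(xs), 3
M = np.zeros((m * n, m * n))
for i in range(m):
    for j in range(m):
        M[i * n:(i + 1) * n, j * n:(j + 1) * n] = K(xs[i], xs[j])
print("min eigenvalue:", np.linalg.eigvalsh(M).min())   # >= 0 up to round-off
```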
We next verify the positive semi-definiteness of the kernel of Example 2 (Proposition 16). Expanding the quadratic form of K in terms of G and the maps T_p, and using that G is a scalar reproducing kernel on X̃, the resulting expression is nonnegative, and hence K is a positive semi-definite matrix-valued kernel. This completes the proof of the assertion.
We turn our attention to the characterization of the universality of K defined by Equation (33). To this end, we assume that the scalar kernel G has a feature map φ : X̃ → W and define the mapping Φ(x) : Rⁿ → W, for any y = (y₁, . . . , y_n) ∈ Rⁿ, by $\Phi(x)y = \sum_{p\in\mathbb{N}_n} y_p\, \phi(T_p x)$. Its adjoint operator Φ*(x) : W → Rⁿ is given, for any w ∈ W, as $\Phi^*(x)w = (\langle\phi(T_1 x), w\rangle_{\mathcal{W}}, \ldots, \langle\phi(T_n x), w\rangle_{\mathcal{W}})$. Then, for any x, t ∈ X, the kernel K(x, t) = Φ*(x)Φ(t) and thus we conclude that W is the feature space of K and Φ is its feature map.
We also need some further notation and definitions. For a map T : X → X̃, we denote its range by TX := {Tx : x ∈ X} and set T⁻¹(E) := {x : Tx ∈ E} for any E ⊆ X̃. In addition, we say that T is continuous if T⁻¹(U) is open whenever U is an open set in X̃. Finally, for any scalar Borel measure ν on X and a continuous map T from X to X̃, we introduce the image measure ν ∘ T⁻¹ on X̃ defined, for any E ∈ B(X̃), by (ν ∘ T⁻¹)(E) := ν({x ∈ X : Tx ∈ E}).
We are ready to state the result about universality of the kernel K in Equation (33).
Proposition 17 Let G be a scalar universal kernel, let T_p : X → X̃ be continuous for each p ∈ N_n and define the kernel K by Equation (33). Then K is universal if and only if the sets T_q X, q ∈ N_n, are pairwise disjoint and T_q is one-to-one for each q ∈ N_n.
Proof Following Theorem 11, for any compact set Z ⊆ X, it suffices to verify the equation Φ(Z)⊥ = {0}. Before doing so, we recall that, by Lemma 10 and the remark which followed it, for any vector measure µ ∈ M(Z, Rⁿ) there exists a scalar regular measure ν ∈ M(Z × B₁) such that
$$d\mu(t) = \Big(\int_{B_1} y_1\, d\nu(t,y),\ \ldots,\ \int_{B_1} y_n\, d\nu(t,y)\Big).$$
Hence, any vector measure µ can be represented as µ = (µ₁, . . . , µ_n), where each µ_i is a scalar measure. Then, µ ∈ Φ(Z)⊥ can be rewritten as
$$\sum_{q\in\mathbb{N}_n} \int_{\mathcal{Z}} \phi(T_q t)\, d\mu_q(t) = 0.$$
Since T_q is continuous for any q ∈ N_n, the range T_q Z is compact, and so is Z̃ := ∪_{q∈N_n} T_q Z. Recall from Micchelli et al. (2006) that the scalar kernel G is universal on Z̃ if and only if its feature map φ is universal on Z̃. Therefore, the above equation reduces to the form
$$\sum_{q\in\mathbb{N}_n} \mu_q\circ T_q^{-1} = 0. \qquad (34)$$
With the above derivation, we can now prove the necessity. Suppose that the sets {T_q X : q ∈ N_n} are not pairwise disjoint. Without loss of generality, we assume that T₁X ∩ T₂X ≠ ∅. That means there exist x₁, x₂ ∈ X such that T₁x₁ = T₂x₂ = z₀. Let µ_q ≡ 0 for q ≥ 3, and denote by δ_{x=x₀} the point measure at x₀ ∈ X. Then, choosing µ₁ = δ_{x=x₁} and µ₂ = −δ_{x=x₂} implies that Equation (34) holds true. By Theorem 11 in Section 4, we know that K is not universal. This completes the first assertion.
Now suppose that there is a map, for example T_p, which is not one-to-one. This implies that there exist x₁, x₂ ∈ X, x₁ ≠ x₂, such that T_p x₁ = T_p x₂. Hence, if we let µ_q = 0 for any q ≠ p and µ_p = δ_{x=x₁} − δ_{x=x₂}, then ∑_{q∈N_n} µ_q ∘ T_q⁻¹ = 0. But µ_p ≠ 0; hence, by Theorem 11, K is not universal. This completes the second assertion.
Finally, we prove the sufficiency. Since µ_q ∘ T_q⁻¹ lives only on T_q X and the sets {T_q X : q ∈ N_n} are pairwise disjoint, ∑_{q∈N_n} µ_q ∘ T_q⁻¹ = 0 is equivalent to µ_q ∘ T_q⁻¹ = 0 for each q ∈ N_n. However, since T_q is one-to-one, E = T_q⁻¹(T_q(E)) for each Borel set E ∈ B(X). This means that µ_q(E) = µ_q ∘ T_q⁻¹(T_q(E)) = 0 for any E ∈ B(X). This concludes the proof of the proposition.
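Proposition 17 can be exercised on the delay kernel of Example 2. In the sketch below (the Gaussian G and the delays are illustrative choices of ours), the resulting matrix-valued function is a valid multi-task kernel; yet each T_p(x) = x + τ_p has range all of R, so the ranges are not pairwise disjoint and, by Proposition 17, this kernel is not universal on X = R.

```python
import numpy as np

G = lambda u, v: np.exp(-(u - v) ** 2)      # scalar Gaussian kernel on R
tau = np.array([0.0, 0.8, -1.3])            # delays: T_p x = x + tau_p

def K(x, t):
    # K(x, t)_{pq} = G(T_p x, T_q t) = G(x + tau_p, t + tau_q)
    return G((x + tau)[:, None], (t + tau)[None, :])

xs = np.linspace(-2.0, 2.0, 7)
m, n = len(xs), len(tau)
M = np.zeros((m * n, m * n))
for i in range(m):
    for j in range(m):
        M[i * n:(i + 1) * n, j * n:(j + 1) * n] = K(xs[i], xs[j])
print("min eigenvalue:", np.linalg.eigvalsh(M).min())   # >= 0: a valid kernel
```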
We end this subsection with detailed proofs of our claims about the examples presented in
Section 2. Indeed, we already proved the positive semi-definiteness of the kernel in Example 2 by
Proposition 16. Below, we prove the claim that the function K given by Equation (6) is not a kernel
in general.
Proposition 18 Let σ_pq > 0 and σ_pq = σ_qp for any p, q ∈ N_n. Then the matrix-valued function defined by
$$K(x,t) := \big(e^{-\sigma_{pq}\|x-t\|^2}\big)_{p,q=1}^n, \qquad \forall x, t\in\mathcal{X},$$
is positive semi-definite if and only if all the parameters σ_pq are the same.

To see the necessity, we choose any distinct positive integers p₀ and q₀. In Equation (35), we specify any m, n with m ≥ n such that p₀, q₀ ∈ N_m, points x₁, . . . , x_n with x_{p₀} ≠ x_{q₀}, and set c = ||x_{p₀} − x_{q₀}||².
Proposition 19 The Hessian of Gaussian kernel K of Example 3 is a multi-task kernel. Moreover, K is universal if and only if n = 1.

Proof The fact that K is positive semi-definite directly follows from the observation, for any m ∈ N, {y_i : y_i ∈ Rⁿ, i ∈ N_m} and {x_i : x_i ∈ X, i ∈ N_m}, that
$$\sum_{i,j\in\mathbb{N}_m} (y_i,\ K(x_i,x_j)\,y_j) = 4\pi^2 (2\pi\sigma)^{n/2} \int_{\mathbb{R}^n} \Big|\sum_{i\in\mathbb{N}_m} \langle y_i,\ \xi\rangle\, e^{2\pi i\langle x_i,\xi\rangle}\Big|^2\, e^{-\sigma\|\xi\|^2}\, d\xi.$$
In order to prove the universality of K, we follow Theorem 11. For this purpose, we assume that Z is a compact subset of X and µ ∈ K(Z)⊥, that is,
$$\int_{\mathcal{Z}} K(x,t)(d\mu(t)) = 0, \qquad \forall x\in\mathcal{Z}. \qquad (38)$$
By Equation (37), this equation is equivalent to
$$\int_{\mathbb{R}^n} e^{2\pi i\langle x,\xi\rangle}\, \xi\, e^{-\sigma\|\xi\|^2} \int_{\mathcal{Z}} e^{-2\pi i\langle t,\xi\rangle}\, (\xi,\ d\mu(t))\, d\xi = 0, \qquad \forall x\in\mathcal{Z},$$
which implies, by integrating both sides of this equation with respect to x ∈ Rⁿ, that
$$\int_{\mathbb{R}^n} e^{-\sigma\|\xi\|^2}\, \Big|\int_{\mathcal{Z}} e^{-2\pi i\langle t,\xi\rangle}\, (\xi,\ d\mu(t))\Big|^2\, d\xi = 0.$$
Hence, $\int_{\mathcal{Z}} e^{-2\pi i\langle t,\xi\rangle}\,(\xi,\ d\mu(t)) = 0$ for every ξ ∈ Rⁿ. When n = 1, taking the k-th derivative with respect to ξ of both sides of this equation and setting ξ = 0, we have, for every k ∈ N, that
$$\int_{\mathcal{Z}} t^k\, d\mu(t) = 0.$$
Since polynomials are dense in C(Z), we conclude from the above equation that µ = 0. Hence, by Theorem 11, the kernel K is universal when n = 1.
If n ≥ 2, we choose µ_q = 0 for q ≥ 3, $d\mu_1(t) = dt_1\,(\delta_{t_2=1} - \delta_{t_2=-1})\prod_{p=3}^n \delta_{t_p=0}$ and $d\mu_2(t) = (\delta_{t_1=-1} - \delta_{t_1=1})\, dt_2 \prod_{p=3}^n \delta_{t_p=0}$, and note that
$$\int_{[-1,1]^n} e^{-2\pi i\langle t,\xi\rangle}\, d\mu_1(t) = \big(-2\pi i\sin(2\pi\xi_2)\big)\, \frac{\sin(2\pi\xi_1)}{\pi\xi_1},$$
and
$$\int_{[-1,1]^n} e^{-2\pi i\langle t,\xi\rangle}\, d\mu_2(t) = \big(2\pi i\sin(2\pi\xi_1)\big)\, \frac{\sin(2\pi\xi_2)}{\pi\xi_2}.$$
Therefore, we conclude that
$$\int_{[-1,1]^n} e^{-2\pi i\langle t,\xi\rangle}\, (\xi,\ d\mu(t)) = \xi_1 \int_{[-1,1]^n} e^{-2\pi i\langle t,\xi\rangle}\, d\mu_1(t) + \xi_2 \int_{[-1,1]^n} e^{-2\pi i\langle t,\xi\rangle}\, d\mu_2(t) = 0.$$
Hence the nonzero vector measure µ = (µ₁, . . . , µ_n) belongs to K(Z)⊥ with Z = [−1, 1]ⁿ and, by Theorem 11, the kernel K is not universal when n ≥ 2.
We first show that K is a multi-task kernel. To see this, for any m ∈ N, {x_i : x_i ∈ X, i ∈ N_m} and {y_i : y_i ∈ L²(Ω), i ∈ N_m}, there holds
$$\sum_{i,j\in\mathbb{N}_m} (K(x_i,x_j)y_j,\ y_i) = \sum_{i,j\in\mathbb{N}_m} \int_\Omega\int_\Omega G((x_i,t),\ (x_j,s))\, y_j(s)\, y_i(t)\, dt\, ds = \Big\|\int_\Omega \sum_{i\in\mathbb{N}_m} \phi(x_i,s)\, y_i(s)\, ds\Big\|^2_{\mathcal{W}} \ge 0.$$
Moreover, the feature map Φ(x) : L²(Ω) → W of K is given by $\Phi(x)y := \int_\Omega \phi(x,s)\, y(s)\, ds$, and its adjoint operator Φ* is given, for any w ∈ W, by Φ*(x)w = ⟨φ(x, ·), w⟩_W. Hence, for any x, x′ ∈ X, we conclude that K(x, x′) = Φ*(x)Φ(x′), which implies that K is a multi-task kernel and Φ is its associated feature map.
Our next goal is to prove the universality of K.
Theorem 20 Let G and K be defined as in Equations (39) and (40). If G is a universal scalar kernel
then K is a universal multi-task kernel.
Proof By Theorem 11, it suffices to show that, for any compact Z ⊆ X, whenever there exists a vector measure µ such that
$$\int_{\mathcal{Z}} \Phi(x)(d\mu(x)) = 0,$$
then µ = 0. Since Z and Ω are both compact, Z × Ω is also compact by the Tychonoff theorem (Folland, 1999, p.136). By assumption, G is universal on X × Ω and φ is its feature map, and thus we conclude that the scalar measure dµ(x, s) is the zero measure. This means that, for any E ∈ B(Z) and E′ ∈ B(Ω),
$$\int_{E'} \mu(E)(s)\, ds = 0.$$
Since E and E′ are arbitrary, we conclude that the vector measure µ = 0, which completes the assertion.
6. Conclusion
Acknowledgments
This work was supported by EPSRC Grants GR/T18707/01 and EP/D052807/1, and by the IST Programme of the European Community, under the PASCAL Network of Excellence IST-2002-506778. The first author was supported by the NSF grant 0325113, the FIRB project RBIN04PARL, the EU Integrated Project Health-e-Child IST-2004-027749, and the City University of Hong Kong grant No.7200111(MA). The second author is supported by NSF grant DMS 0712827.
We are grateful to Alessandro Verri, Head of the Department of Computer Science at the Univer-
sity of Genova for providing us with the opportunity to complete part of this work in a scientifically
stimulating and friendly environment. We also wish to thank the referees for their valuable com-
ments.
Appendix A.
Proof of Lemma 9 By the definition of the integral appearing in the right-hand side of Equation (17) (see, e.g., Diestel and Uhl, Jr., 1977), it follows, for any f ∈ C(Z, Y), that |L_µ f| ≤ ||f||_{∞,Z} |µ|(Z), so that ||L_µ|| ≤ ||µ||. For the reverse inequality, we define h : Y → Y by h(y) = y if ||y||_Y ≤ 1 and h(y) = y/||y||_Y if ||y||_Y ≥ 1, and introduce another function in C(Z, Y) given by
$$\bar{f} := \sum_{j\in\mathbb{N}_n} \frac{\mu(A_j)}{\|\mu(A_j)\|_{\mathcal{Y}}}\, f_j.$$
Therefore, the function f = h ∘ f̄ is in C(Z, Y) as well, because f̄ ∈ C(Z, Y) and, for any y, y′ ∈ Y, ||h(y) − h(y′)||_Y ≤ 2||y − y′||_Y. Moreover, we observe, for any x ∈ ∪_{j∈N_n} E_j, that f(x) = g(x) and, for any x ∈ Z, that ||f(x)||_Y ≤ 1.
We are now ready to estimate the total variation of µ. First, observe that
$$\int_{\mathcal{Z}} \|f(x) - g(x)\|_{\mathcal{Y}}\, d|\mu|(x) \le \sum_{j\in\mathbb{N}_n} (n+1)\, |\mu|(E_j) \le \varepsilon,$$
References
L. Amodei. Reproducing kernels of vector-valued function spaces. In Proc. of Chamonix, A. Le Méhauté et al., Eds., pages 1–9, 1997.
N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc. 68:337–404, 1950.
J. Burbea and P. Masani. Banach and Hilbert Spaces of Vector-Valued Functions. Pitman Research
Notes in Mathematics Series, 90, 1984.
A. Caponnetto and E. De Vito. Optimal rates for regularized least-squares algorithm. Foundations
of Computational Mathematics, 7:331–368, 2007.
C. Carmeli, E. De Vito, and A. Toigo. Vector valued reproducing kernel Hilbert spaces of integrable
functions and Mercer theorem. Analysis and Applications, 4:377–408, 2006.
F. Cucker and S. Smale. On the mathematical foundations of learning. Bull. Amer. Math. Soc., 39:1–
49, 2001.
D. R. Chen, Q. Wu, Y. Ying, and D.X. Zhou. Support vector machine soft margin classifiers: error
analysis. Journal of Machine Learning Research, 5:1143–1175, 2004.
A. Devinatz. On measurable positive definite operator functions. J. London Math. Soc., 35:417–424, 1960.
J. Diestel and J. J. Uhl, Jr. Vector Measures. AMS, Providence (Math Surveys 15), 1977.
T. Evgeniou, C. A. Micchelli and M. Pontil. Learning multiple tasks with kernel methods. J. Ma-
chine Learning Research, 6:615–637, 2005.
G. B. Folland. Real Analysis: Modern Techniques and Their Applications. 2nd edition, New York,
John Wiley & Sons, 1999.
A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf and A. J. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt and T. Hoffman editors, pages 513–520, MIT Press, 2007.

P. D. Lax. Functional Analysis. John Wiley & Sons, New York, 2002.
S. Lowitzsch. A density theorem for matrix-valued radial basis functions. Numerical Algorithms, 39:253–256, 2005.
C. A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11–22, 1986.
C. A. Micchelli and M. Pontil. A function representation for learning in Banach spaces. In Proceedings of the 17th Annual Conference on Learning Theory (COLT'04), pages 255–269, 2004.

C. A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation, 17:177–204, 2005.

C. A. Micchelli and M. Pontil. Feature space perspectives for learning the kernel. Machine Learning, 66:297–319, 2007.
C. A. Micchelli, Y. Xu, and P. Ye. Cucker Smale learning theory in Besov spaces. NATO Science
Series sub Series III Computer and System Science, 190:47–68, 2003.
C. A. Micchelli, Y. Xu, and H. Zhang. Universal kernels. J. Machine Learning Research, 7:2651–2667, 2006.
S. Mukherjee and D. X. Zhou. Learning coordinate covariances via gradients. J. Machine Learning Research, 7:519–549, 2006.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning, MIT Press,
2006.
M. Reisert and H. Burkhardt. Learning equivariant functions with matrix valued kernels. J. Machine
Learning Research, 8:385–408, 2007.
B. Schölkopf and A. J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, USA, 2002.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University
Press, 2004.
E. Solak, R. Murray-Smith, W. E. Leithead, D. J. Leith and C. E. Rasmussen. Derivative observations in Gaussian process models of dynamic systems. In Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun and K. Obermayer editors, pages 1033–1040, MIT Press, 2003.
E. M. Stein. Singular Integrals and Differentiability Properties of Functions. Princeton University Press, Princeton, NJ, 1970.
I. Steinwart. On the influence of the kernel on the consistency of support vector machines. J. Ma-
chine Learning Research, 2:67–93, 2001.
I. Steinwart, D. Hush, and C. Scovel. Function classes that approximate the Bayes risk. In Proceedings of the 19th Annual Conference on Learning Theory, pages 79–93, 2006.
E. Vazquez and E. Walter. Multi-output support vector regression. In Proceedings of the 13th IFAC Symposium on System Identification, 2003.

K. Yosida. Functional Analysis. 6th edition, Springer-Verlag, Berlin, 1980.
D. X. Zhou. Density problem and approximation error in learning theory. Preprint, 2003.