Superkernels
Abstract
We consider the problem of choosing a kernel suitable for estimation
using a Gaussian Process estimator or a Support Vector Machine. A
novel solution is presented which involves defining a Reproducing Kernel
Hilbert Space on the space of kernels itself. By utilizing an analog
of the classical representer theorem, the problem of choosing a kernel
from a parameterized family of kernels (e.g. of varying width) is reduced
to a statistical estimation problem akin to the problem of minimizing a
regularized risk functional. Various classical settings for model or kernel
selection are special cases of our framework.
1 Introduction
Choosing suitable kernel functions for estimation using Gaussian Processes and Support
Vector Machines is an important step in the inference process. To date, there are few if
any systematic techniques to assist in this choice. Even the restricted problem of choosing
the “width” of a parameterized family of kernels (e.g. Gaussian) has not had a simple and
elegant solution.
A recent development [1] which solves the above problem in a restricted sense involves
the use of semidefinite programming to learn an arbitrary positive semidefinite matrix K,
guided by criteria such as the kernel target alignment [1], the posterior probability [2],
a learning-theoretic bound [3], or cross-validation performance [4]. The restriction mentioned
is that these methods work with the kernel matrix rather than the kernel itself. Furthermore,
whilst demonstrably improving the performance of estimators to some degree, they require clever
parameterization and design to work in particular situations, and they rely on more difficult
mathematical programming problems. There are still no general principles to guide the
choice of a) which family of kernels to choose, b) efficient parameterizations over this
space, and c) suitable penalty terms to combat overfitting. (The last point is particularly an
issue when we have a very large set of semidefinite matrices at our disposal.)
Whilst not yet providing a complete solution to these problems, this paper presents a
framework that allows optimization within a parameterized family of kernels relatively simply
and, crucially, intrinsically captures the tradeoff between the size of the family of kernels
and the available sample size. Furthermore, the solution presented optimizes kernels
themselves, rather than the kernel matrix as in [1].
Outline of the Paper We show (Section 2) that for most kernel-based learning methods
there exists a functional, the quality functional, which plays a similar role to the
empirical risk functional, and that subsequently (Section 3) the introduction of a kernel
on kernels, a so-called superkernel, in conjunction with regularization on the Reproducing
Kernel Hilbert Space formed on kernels, leads to a systematic way of parameterizing function
classes whilst managing overfitting. We give several examples of superkernels (Section 4)
and show (Section 5) how they can be used practically. Due to space constraints we only
consider Support Vector classification.
2 Quality Functionals
Let Xtrain := {x1 , . . . , xm } denote the set of training data and Ytrain := {y1 , . . . , ym } the
set of corresponding labels, jointly drawn iid from some probability distribution P (x, y)
on X × Y. Furthermore, let Xtest and Ytest denote the corresponding test sets (drawn from
the same P (x, y)). Let X := Xtrain ∪ Xtest and Y := Ytrain ∪ Ytest .
We introduce a new class of functionals Q on data which we call quality functionals. Their
purpose is to indicate, given a kernel k and the training data (Xtrain , Ytrain ), how suitable
the kernel is for explaining the training data. Examples of quality functionals are the kernel
target alignment, the negative log posterior, the minimum of the regularized risk functional,
or any luckiness function for kernel methods.
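For instance, given a quality functional one can compare members of a parameterized family of kernels and keep the best one. The sketch below is our own illustration (the Gaussian family, the function names, and the grid of widths are assumptions, not part of the text); `quality` stands for any of the functionals above, with lower values taken to mean a more suitable kernel.

import numpy as np

def make_gaussian_kernel(width):
    """One member of a parameterized family: k(x, x') = exp(-||x - x'||^2 / (2 width^2))."""
    return lambda x, xp: np.exp(
        -np.sum((np.asarray(x, float) - np.asarray(xp, float)) ** 2) / (2.0 * width ** 2))

def select_kernel(quality, widths, X_train, Y_train):
    """Evaluate the quality functional for each width and return the best (lowest) one."""
    scored = [(quality(make_gaussian_kernel(w), X_train, Y_train), w) for w in widths]
    return min(scored)  # (best quality value, best width)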
Example 1 (Kernel Target Alignment) This quality functional was introduced in [5] to
assess the “alignment” of a kernel with training labels. It is defined by
$$Q^{\mathrm{alignment}}_{\mathrm{emp}}[k, X, Y] := 1 - \frac{y^\top K y}{\|y\|_2^2 \, \|K\|_2}, \qquad (3)$$

where $y$ denotes the vector of elements of $Y$, $\|y\|_2$ denotes the $\ell_2$ norm of $y$, and $\|K\|_2$ is
the Frobenius norm: $\|K\|_2^2 := \operatorname{tr} K K^\top = \sum_{i,j} K_{ij}^2$. Note that the definition in [5] looks
somewhat different, yet it is algebraically identical to (3).
By decomposing $K$ into its eigensystem, one can see that (3) is minimized if $K = y y^\top$, in
which case

$$Q^{\mathrm{alignment}}_{\mathrm{emp}}[k^*, X, Y] = 1 - \frac{y^\top y y^\top y}{\|y\|_2^2 \, \|y y^\top\|_2} = 1 - \frac{\|y\|_2^4}{\|y\|_2^2 \, \|y\|_2^2} = 0. \qquad (4)$$

It is clear that one cannot expect that $Q^{\mathrm{alignment}}_{\mathrm{emp}}[k^*, X, Y] = 0$ for data other than the set
chosen to determine $k^*$.
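As a concrete sketch of (3) (our own code; the toy labels and the "ideal" precomputed kernel k*(x_i, x_j) = y_i y_j are illustrative assumptions, not data from the paper):

import numpy as np

def alignment_quality(kernel, X, y):
    """Kernel target alignment quality, cf. (3): 1 - y^T K y / (||y||_2^2 ||K||_F)."""
    m = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    return 1.0 - (y @ K @ y) / (np.dot(y, y) * np.linalg.norm(K, "fro"))

# Sanity check of (4): for K = y y^T the quality is exactly 0.
y = np.array([1.0, -1.0, 1.0, -1.0])
X = list(range(len(y)))                 # indices suffice for a precomputed kernel
ideal = lambda i, j: y[i] * y[j]        # k*(x_i, x_j) = y_i y_j, i.e. K = y y^T
print(alignment_quality(ideal, X, y))   # ~0.0 up to floating point error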
where $\|f\|_{\mathcal{H}}^2$ is the RKHS norm of $f$ in the regularized risk functional (5),
$R_{\mathrm{reg}}[f, X, Y] := \frac{1}{m} \sum_{i=1}^{m} c(x_i, y_i, f(x_i)) + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2$.
By virtue of the representer theorem (see e.g., [4, 7]) we know that the minimizer over $f \in \mathcal{H}$
of (5) can be written as a kernel expansion. For a given loss $c$ this leads to the quality functional
" m
#
regrisk 1 X λ >
Qemp [k, X, Y ] := minm c(xi , yi , [Kα]i ) + α Kα . (6)
α∈R m i=1 2
The minimizer of (6) is more difficult to find, since we have to carry out a double minimization
over $K$ and $\alpha$. First, note that for $K = \beta y y^\top$ and $\alpha = \frac{1}{\beta \|y\|_2^2}\, y$,
we have $K\alpha = y$ and $\alpha^\top K \alpha = \beta^{-1}$. Thus, provided the loss vanishes whenever
$[K\alpha]_i = y_i$ (as for the soft margin loss), $Q^{\mathrm{regrisk}}_{\mathrm{emp}}[k, X, Y] \le \frac{\lambda}{2\beta}$,
and for sufficiently large $\beta$ we can make $Q^{\mathrm{regrisk}}_{\mathrm{emp}}[k, X, Y]$ arbitrarily close to $0$.
If we add a scale restriction to $K$ by setting $\operatorname{tr} K = 1$, we can determine the minimum of
(6) as follows. Set $K = \frac{1}{\|z\|_2^2} z z^\top$, where $z \in \mathbb{R}^m$, and $\alpha = z$. Then $K\alpha = z$ and so

$$\frac{1}{m} \sum_{i=1}^{m} c(x_i, y_i, [K\alpha]_i) + \frac{\lambda}{2} \alpha^\top K \alpha = \frac{1}{m} \sum_{i=1}^{m} c(x_i, y_i, z_i) + \frac{\lambda}{2} \|z\|_2^2.$$

Choosing each $z_i = \operatorname{argmin}_{\zeta} \bigl[ c(x_i, y_i, \zeta) + \tfrac{m\lambda}{2} \zeta^2 \bigr]$ yields the minimum with respect to $z$, since the
objective decouples over the coordinates of $z$. The proof that this $K$ is the global minimizer of the quality functional is omitted for brevity.
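To make (6) and the degeneracy above concrete, here is a small numerical sketch. It is our own illustration and uses the squared loss c(x, y, t) = (y - t)^2 instead of the soft margin loss, since the inner minimization over alpha then has a closed form.

import numpy as np

def regrisk_quality(K, y, lam):
    """Q_emp^regrisk of (6) for the squared loss:
    min_alpha (1/m) ||y - K alpha||^2 + (lam/2) alpha^T K alpha.
    Setting the gradient to zero shows alpha = (K + (m lam / 2) I)^{-1} y is a minimizer."""
    m = len(y)
    alpha = np.linalg.solve(K + 0.5 * m * lam * np.eye(m), y)
    return np.mean((y - K @ alpha) ** 2) + 0.5 * lam * alpha @ K @ alpha

# Without a scale restriction, K = beta * y y^T drives the quality toward 0 as beta grows.
y = np.array([1.0, -1.0, 1.0, 1.0])
for beta in (1.0, 10.0, 100.0):
    print(beta, regrisk_quality(beta * np.outer(y, y), y, lam=0.1))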
Definition 3 (Reproducing Kernel Hilbert Space) Let $\mathcal{X}$ be a nonempty set (often called
the index set) and denote by $\mathcal{H}$ a Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$. Then $\mathcal{H}$ is
called a reproducing kernel Hilbert space endowed with the dot product $\langle \cdot, \cdot \rangle$ (and the
norm $\|f\| := \sqrt{\langle f, f \rangle}$) if there exists a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfying, for all $x, x' \in \mathcal{X}$:

1. $k$ has the reproducing property $\langle f, k(x, \cdot) \rangle = f(x)$ for all $f \in \mathcal{H}$; in particular,
$\langle k(x, \cdot), k(x', \cdot) \rangle = k(x, x')$.

2. $k$ spans $\mathcal{H}$, i.e. $\mathcal{H} = \overline{\operatorname{span}\{k(x, \cdot) \mid x \in \mathcal{X}\}}$, where $\overline{X}$ denotes the completion of the set $X$.
The advantage of optimization in an RKHS is that under certain conditions the optimal
solutions can be found as a linear combination of a finite number of basis functions,
regardless of the dimensionality of the space $\mathcal{H}$, as guaranteed by the representer theorem (see e.g., [4, 7]).
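Concretely (a standard statement of the representer theorem, see e.g. [4, 7], not reproduced verbatim from this paper): the minimizer of the regularized risk (5) can be written as

$$f^{\star}(x) = \sum_{i=1}^{m} \alpha_i \, k(x_i, x), \qquad \alpha \in \mathbb{R}^m,$$

which is precisely the expansion substituted into (6).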
Definition 5 (Super Reproducing Kernel Hilbert Space) Let $\mathcal{X}$ be a nonempty set and
let $\underline{\mathcal{X}} := \mathcal{X} \times \mathcal{X}$ (the compounded index set). Then the Hilbert space $\underline{\mathcal{H}}$ of functions
$k : \underline{\mathcal{X}} \to \mathbb{R}$, endowed with a dot product $\langle \cdot, \cdot \rangle$ (and the norm $\|k\| = \sqrt{\langle k, k \rangle}$), is called a
Super Reproducing Kernel Hilbert Space if there exists a superkernel $\underline{k} : \underline{\mathcal{X}} \times \underline{\mathcal{X}} \to \mathbb{R}$ with
the following properties:

1. $\underline{k}$ has the reproducing property $\langle k, \underline{k}(\underline{x}, \cdot) \rangle = k(\underline{x})$ for all $k \in \underline{\mathcal{H}}$; in particular,
$\langle \underline{k}(\underline{x}, \cdot), \underline{k}(\underline{x}', \cdot) \rangle = \underline{k}(\underline{x}, \underline{x}')$.

2. $\underline{k}$ spans $\underline{\mathcal{H}}$, i.e. $\underline{\mathcal{H}} = \overline{\operatorname{span}\{\underline{k}(\underline{x}, \cdot) \mid \underline{x} \in \underline{\mathcal{X}}\}}$.

3. For any fixed $\underline{x} \in \underline{\mathcal{X}}$ the superkernel $\underline{k}$ is a kernel in its second argument, i.e. for
any fixed $\underline{x} \in \underline{\mathcal{X}}$, the function $k(x, x') := \underline{k}(\underline{x}, (x, x'))$ with $x, x' \in \mathcal{X}$ is a kernel.
What distinguishes $\underline{\mathcal{H}}$ from a normal RKHS is the particular form of its index set ($\underline{\mathcal{X}} = \mathcal{X}^2$)
and the additional condition on $\underline{k}$ to be a kernel in its second argument for any fixed first
argument. This condition somewhat limits the choice of possible superkernels $\underline{k}$. On the other
hand, it allows for simple optimization algorithms which consider kernels $k \in \underline{\mathcal{H}}$ that lie
in the convex cone of $\underline{k}$.
Analogously to the definition of the regularized risk functional (5), we define the regularized
quality functional:

$$Q_{\mathrm{reg}}[k, X, Y] := Q_{\mathrm{emp}}[k, X, Y] + \frac{\lambda_s}{2} \|k\|_{\underline{\mathcal{H}}}^2, \qquad (10)$$

where $\lambda_s > 0$ is a regularization constant and $\|k\|_{\underline{\mathcal{H}}}^2$ denotes the RKHS norm of $k$ in $\underline{\mathcal{H}}$. Minimization
of $Q_{\mathrm{reg}}$ is less prone to overfitting than minimizing $Q_{\mathrm{emp}}$, since the regularization
term $\frac{\lambda_s}{2} \|k\|_{\underline{\mathcal{H}}}^2$ effectively controls the complexity of the class of kernels under consideration.
Regularizers other than $\frac{\lambda_s}{2} \|k\|_{\underline{\mathcal{H}}}^2$ are also possible.
The question arising immediately from (10) is how to minimize the regularized quality
functional efficiently. In the following we show that the minimum can be found as a linear
combination of superkernels.
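As a sketch of what such a minimizer looks like in code (our own illustration: `superkernel` stands for any function satisfying Definition 5, and the coefficient matrix `beta` would in practice come from minimizing (10); neither name is taken from the paper):

import numpy as np

def expanded_kernel(superkernel, X, beta):
    """Kernel obtained as a linear combination of superkernels anchored at training pairs:
    k(x, x') = sum_{p,q} beta_pq * superkernel((x_p, x_q), (x, x'))."""
    m = len(X)
    def k(x, xp):
        return sum(beta[p, q] * superkernel((X[p], X[q]), (x, xp))
                   for p in range(m) for q in range(m))
    return k

def superkernel_norm_sq(superkernel, X, beta):
    """||k||^2 in the Super-RKHS via the reproducing property:
    sum_{i,j,p,q} beta_ij beta_pq * superkernel((x_i, x_j), (x_p, x_q))."""
    m = len(X)
    pairs = [(X[i], X[j]) for i in range(m) for j in range(m)]
    G = np.array([[superkernel(a, b) for b in pairs] for a in pairs])  # (m^2, m^2) Gram matrix over pairs
    b = np.asarray(beta).reshape(-1)
    return float(b @ G @ b)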
4 Examples of Superkernels
Having introduced the theoretical basis of the Super-RKHS, we need to answer the question
of whether practically useful superkernels $\underline{k}$ exist which satisfy the conditions of Definition 5.
We address this question by giving a set of general recipes for building such kernels.
is a superkernel: for any fixed $\underline{x}$, $\underline{k}(\underline{x}, (x, x'))$ is a sum of kernel functions, hence it is
a kernel itself (since $k^i(x, x')$ is a kernel if $k$ is). To show that $\underline{k}$ is a kernel, note that
$\underline{k}(\underline{x}, \underline{x}') = \langle \Phi(\underline{x}), \Phi(\underline{x}') \rangle$, where $\Phi(\underline{x}) := (\sqrt{c_0}, \sqrt{c_1}\, k^1(\underline{x}), \sqrt{c_2}\, k^2(\underline{x}), \ldots)$.
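A minimal implementation of this construction (our own sketch: the nonnegative Gaussian base kernel, the geometric coefficients c_i = c^i, and the truncation of the series are illustrative assumptions, not the specific choices made in the paper):

import numpy as np

def base_kernel(x, xp, width=1.0):
    """Nonnegative base kernel k(x, x'), as required for the power-series construction."""
    return np.exp(-np.sum((np.asarray(x, float) - np.asarray(xp, float)) ** 2)
                  / (2.0 * width ** 2))

def power_series_superkernel(xpair, ypair, c=0.5, terms=25):
    """superk((x, x'), (y, y')) = sum_i c_i k(x, x')^i k(y, y')^i with c_i = c^i, i.e.
    <Phi(.), Phi(.)> for Phi = (sqrt(c_0), sqrt(c_1) k, sqrt(c_2) k^2, ...), truncated to
    finitely many terms.  For any fixed first argument this is a power series in k(y, y')
    with nonnegative coefficients, hence a kernel in its second argument (Definition 5, item 3)."""
    a = base_kernel(*xpair)
    b = base_kernel(*ypair)
    return sum((c ** i) * (a ** i) * (b ** i) for i in range(terms))

print(power_series_superkernel((0.0, 1.0), (0.5, 0.3)))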
Then $\partial_{\theta}^{j} R(\theta) = D_2^{j} f(x(\theta), \theta)$, where $j \in \mathbb{N}$ and $D_2$ denotes the derivative with respect to
the second argument of $f$.
We can immediately conclude the following for the quality functional arising from $R_{\mathrm{reg}}$.

Corollary 8 For $f = \sum_i \alpha_i k(x_i, x)$ and a convex loss function $c$, the regularized quality
functional is convex in $k$.
The corresponding regularized quality functional is:

$$Q^{\mathrm{regrisk}}_{\mathrm{reg}}[k, X, Y] = Q^{\mathrm{regrisk}}_{\mathrm{emp}}[k, X, Y] + \frac{\lambda_s}{2} \|k\|_{\underline{\mathcal{H}}}^2. \qquad (18)$$

Since the minimizer of (18) can be written as a kernel expansion (by the representer theorem
for Super-RKHS), the optimal regularized quality functional can be written as (using
the soft margin loss and $\underline{K}_{ijpq} := \underline{k}((x_i, x_j), (x_p, x_q))$):

$$Q^{\mathrm{regrisk}}_{\mathrm{reg}}[\underline{K}, \alpha, \beta, X, Y] = \frac{1}{m} \sum_{i=1}^{m} \max\Bigl(0,\; 1 - y_i \sum_{j,p,q=1}^{m} \alpha_j \beta_{pq} \underline{K}_{ijpq}\Bigr) + \frac{\lambda}{2} \sum_{i,j,p,q=1}^{m} \alpha_i \alpha_j \beta_{pq} \underline{K}_{ijpq} + \frac{\lambda_s}{2} \sum_{i,j,p,q=1}^{m} \beta_{ij} \beta_{pq} \underline{K}_{ijpq}. \qquad (19)$$
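For completeness, a sketch of evaluating (19) for given coefficients (our own code; `Ku` is the m x m x m x m array of superkernel values K_ijpq, and `lam`, `lam_s` are the two regularization constants above):

import numpy as np

def quality_19(Ku, alpha, beta, y, lam, lam_s):
    """Evaluate (19): the soft margin loss of f(x_i) = sum_{j,p,q} alpha_j beta_pq K_ijpq
    plus the quadratic regularizers in alpha and beta."""
    f = np.einsum("ijpq,j,pq->i", Ku, alpha, beta)          # predictions on the training set
    loss = np.mean(np.maximum(0.0, 1.0 - y * f))            # averaged soft margin loss
    reg_alpha = 0.5 * lam * np.einsum("ijpq,i,j,pq->", Ku, alpha, alpha, beta)
    reg_beta = 0.5 * lam_s * np.einsum("ijpq,ij,pq->", Ku, beta, beta)
    return loss + reg_alpha + reg_beta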
Using the same non-optimized parameters for different data sets, we achieved results comparable
to other recent work on classification such as boosting, optimized SVMs, and kernel
target alignment [9, 10, 5] (note that we use a much smaller part of the data for training:
only 60% rather than 90%). Results based on $Q_{\mathrm{reg}}$ are comparable to hand-tuned SVMs
(rightmost column), except for the ionosphere data. We suspect that this is due to the small
training sample.
Summary and Outlook The regularized quality functional allows the systematic solution
of problems associated with the choice of a kernel. Quality criteria that can be used
include target alignment, regularized risk, and the log posterior. The regularization implicit
in our approach allows the control of overfitting that occurs if one optimizes over too
large a choice of kernels.
A very promising aspect of the current work is that it opens the way to theoretical analyses
of the price one pays by optimising over a larger set K of kernels. Current and future
research is devoted to working through this analysis and subsequently developing methods
for the design of good superkernels.
Acknowledgements This work was supported by a grant of the Australian Research
Council. The authors thank Grace Wahba for helpful comments and suggestions.
References
[1] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the
kernel matrix with semidefinite programming. In Proceedings of the International Conference
on Machine Learning. Morgan Kaufmann, 2002.
[2] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to
linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in
Graphical Models. Kluwer Academic, 1998.
[3] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing kernel parameters
for support vector machines. Machine Learning, 2002. Forthcoming.
[4] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional
Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
[5] N. Cristianini, A. Elisseeff, and J. Shawe-Taylor. On optimizing kernel alignment.
Technical Report NC2-TR-2001-087, NeuroCOLT, http://www.neurocolt.com, 2001.
[6] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[7] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA,
2002.
[8] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representa-
tion. Technical report, IBM Watson Research Center, New York, 2000.
[9] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Pro-
ceedings of the International Conference on Machine Learning, pages 148–156. Mor-
gan Kaufmann Publishers, 1996.
[10] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost.
Machine Learning, 42(3):287–320, 2001.