Superkernels
Abstract
We consider the problem of choosing a kernel suitable for estimation
using a Gaussian Process estimator or a Support Vector Machine. A
novel solution is presented which involves defining a Reproducing Kernel
Hilbert Space on the space of kernels itself. By utilizing an analog
of the classical representer theorem, the problem of choosing a kernel
from a parameterized family of kernels (e.g. of varying width) is reduced
to a statistical estimation problem akin to the problem of minimizing a
regularized risk functional. Various classical settings for model or kernel
selection are special cases of our framework.
1 Introduction
Choosing suitable kernel functions for estimation using Gaussian Processes and Support
Vector Machines is an important step in the inference process. To date, there are few if
any systematic techniques to assist in this choice. Even the restricted problem of choosing
the “width” of a parameterized family of kernels (e.g. Gaussian) has not had a simple and
elegant solution.
A recent development [1] which solves the above problem in a restricted sense involves
the use of semidefinite programming to learn an arbitrary positive semidefinite matrix K,
guided by criteria such as the kernel target alignment [1], the posterior probability [2],
a learning-theoretic bound [3], or cross-validation performance [4]. The restriction mentioned
is that these methods work with the kernel matrix rather than the kernel itself. Furthermore,
whilst demonstrably improving the performance of estimators to some degree, they require clever
parameterization and design to work in particular situations, and they rely on more difficult
mathematical programming problems. There are still no general principles to guide the
choice of a) which family of kernels to choose, b) efficient parameterizations over this
space, and c) suitable penalty terms to combat overfitting. (The last point is particularly an
issue when we have a very large set of semidefinite matrices at our disposal.)
Whilst not yet providing a complete solution to these problems, this paper presents a
framework that allows optimization within a parameterized family of kernels relatively simply
and, crucially, intrinsically captures the tradeoff between the size of the family of kernels
and the available sample size. Furthermore, the solution presented optimizes kernels
themselves, rather than the kernel matrix as in [1].
Outline of the Paper We show (Section 2) that for most kernel-based learning methods
there exists a functional, the quality functional, which plays a similar role to the
empirical risk functional, and that subsequently (Section 3) the introduction of a kernel
on kernels, a so-called superkernel, in conjunction with regularization on the Reproducing
Kernel Hilbert Space formed on kernels, leads to a systematic way of parameterizing function
classes whilst managing overfitting. We give several examples of superkernels (Section 4)
and show (Section 5) how they can be used practically. Due to space constraints we only
consider Support Vector classification.
2 Quality Functionals
Let Xtrain := {x1 , . . . , xm } denote the set of training data and Ytrain := {y1 , . . . , ym } the
set of corresponding labels, jointly drawn iid from some probability distribution P (x, y)
on X × Y. Furthermore, let Xtest and Ytest denote the corresponding test sets (drawn from
the same P (x, y)). Let X := Xtrain ∪ Xtest and Y := Ytrain ∪ Ytest .
We introduce a new class of functionals Q on data which we call quality functionals. Their
purpose is to indicate, given a kernel k and the training data (Xtrain , Ytrain ), how suitable
the kernel is for explaining the training data. Examples of quality functionals are the kernel
target alignment, the negative log posterior, the minimum of the regularized risk functional,
or any luckiness function for kernel methods.
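For instance, given a quality functional one can compare members of a parameterized family of kernels and keep the best one. The sketch below is our own illustration (the Gaussian family, the function names, and the grid of widths are assumptions, not part of the text); `quality` stands for any of the functionals above, with lower values taken to mean a more suitable kernel.

import numpy as np

def make_gaussian_kernel(width):
    """One member of a parameterized family: k(x, x') = exp(-||x - x'||^2 / (2 width^2))."""
    return lambda x, xp: np.exp(
        -np.sum((np.asarray(x, float) - np.asarray(xp, float)) ** 2) / (2.0 * width ** 2))

def select_kernel(quality, widths, X_train, Y_train):
    """Evaluate the quality functional for each width and return the best (lowest) one."""
    scored = [(quality(make_gaussian_kernel(w), X_train, Y_train), w) for w in widths]
    return min(scored)  # (best quality value, best width)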
Example 1 (Kernel Target Alignment) This quality functional was introduced in [5] to
assess the “alignment” of a kernel with training labels. It is defined by
$$Q^{\mathrm{alignment}}_{\mathrm{emp}}[k, X, Y] := 1 - \frac{y^\top K y}{\|y\|_2^2 \, \|K\|_2}, \qquad (3)$$

where $y$ denotes the vector of elements of $Y$, $\|y\|_2$ denotes the $\ell_2$ norm of $y$, and $\|K\|_2$ is
the Frobenius norm: $\|K\|_2^2 := \operatorname{tr} K K^\top = \sum_{i,j} K_{ij}^2$. Note that the definition in [5] looks
somewhat different, yet it is algebraically identical to (3).
By decomposing $K$ into its eigensystem, one can see that (3) is minimized if $K = y y^\top$, in
which case

$$Q^{\mathrm{alignment}}_{\mathrm{emp}}[k^*, X, Y] = 1 - \frac{y^\top y y^\top y}{\|y\|_2^2 \, \|y y^\top\|_2} = 1 - \frac{\|y\|_2^4}{\|y\|_2^2 \, \|y\|_2^2} = 0. \qquad (4)$$

It is clear that one cannot expect that $Q^{\mathrm{alignment}}_{\mathrm{emp}}[k^*, X, Y] = 0$ for data other than the set
chosen to determine $k^*$.
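As a concrete sketch of (3) (our own code; the toy labels and the "ideal" precomputed kernel k*(x_i, x_j) = y_i y_j are illustrative assumptions, not data from the paper):

import numpy as np

def alignment_quality(kernel, X, y):
    """Kernel target alignment quality, cf. (3): 1 - y^T K y / (||y||_2^2 ||K||_F)."""
    m = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    return 1.0 - (y @ K @ y) / (np.dot(y, y) * np.linalg.norm(K, "fro"))

# Sanity check of (4): for K = y y^T the quality is exactly 0.
y = np.array([1.0, -1.0, 1.0, -1.0])
X = list(range(len(y)))                 # indices suffice for a precomputed kernel
ideal = lambda i, j: y[i] * y[j]        # k*(x_i, x_j) = y_i y_j, i.e. K = y y^T
print(alignment_quality(ideal, X, y))   # ~0.0 up to floating point error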
where $\|f\|_{\mathcal{H}}^2$ is the RKHS norm of $f$ in the regularized risk functional (5),
$R_{\mathrm{reg}}[f, X, Y] := \frac{1}{m} \sum_{i=1}^{m} c(x_i, y_i, f(x_i)) + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2$.
By virtue of the representer theorem (see e.g., [4, 7]) we know that the minimizer over $f \in \mathcal{H}$
of (5) can be written as a kernel expansion. For a given loss $c$ this leads to the quality functional
" m
#
regrisk 1 X λ >
Qemp [k, X, Y ] := minm c(xi , yi , [Kα]i ) + α Kα . (6)
α∈R m i=1 2
The minimizer of (6) is more difficult to find, since we have to carry out a double minimization
over $K$ and $\alpha$. First, note that for $K = \beta y y^\top$ and $\alpha = \frac{1}{\beta \|y\|_2^2}\, y$,
we have $K\alpha = y$ and $\alpha^\top K \alpha = \beta^{-1}$. Thus, provided the loss vanishes whenever
$[K\alpha]_i = y_i$ (as for the soft margin loss), $Q^{\mathrm{regrisk}}_{\mathrm{emp}}[k, X, Y] \le \frac{\lambda}{2\beta}$,
and for sufficiently large $\beta$ we can make $Q^{\mathrm{regrisk}}_{\mathrm{emp}}[k, X, Y]$ arbitrarily close to $0$.
If we add a scale restriction to $K$ by setting $\operatorname{tr} K = 1$, we can determine the minimum of
(6) as follows. Set $K = \frac{1}{\|z\|_2^2} z z^\top$, where $z \in \mathbb{R}^m$, and $\alpha = z$. Then $K\alpha = z$ and so

$$\frac{1}{m} \sum_{i=1}^{m} c(x_i, y_i, [K\alpha]_i) + \frac{\lambda}{2} \alpha^\top K \alpha = \frac{1}{m} \sum_{i=1}^{m} c(x_i, y_i, z_i) + \frac{\lambda}{2} \|z\|_2^2.$$

Choosing each $z_i = \operatorname{argmin}_{\zeta} \bigl[ c(x_i, y_i, \zeta) + \tfrac{m\lambda}{2} \zeta^2 \bigr]$ yields the minimum with respect to $z$, since the
objective decouples over the coordinates of $z$. The proof that this $K$ is the global minimizer of the quality functional is omitted for brevity.
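To make (6) and the degeneracy above concrete, here is a small numerical sketch. It is our own illustration and uses the squared loss c(x, y, t) = (y - t)^2 instead of the soft margin loss, since the inner minimization over alpha then has a closed form.

import numpy as np

def regrisk_quality(K, y, lam):
    """Q_emp^regrisk of (6) for the squared loss:
    min_alpha (1/m) ||y - K alpha||^2 + (lam/2) alpha^T K alpha.
    Setting the gradient to zero shows alpha = (K + (m lam / 2) I)^{-1} y is a minimizer."""
    m = len(y)
    alpha = np.linalg.solve(K + 0.5 * m * lam * np.eye(m), y)
    return np.mean((y - K @ alpha) ** 2) + 0.5 * lam * alpha @ K @ alpha

# Without a scale restriction, K = beta * y y^T drives the quality toward 0 as beta grows.
y = np.array([1.0, -1.0, 1.0, 1.0])
for beta in (1.0, 10.0, 100.0):
    print(beta, regrisk_quality(beta * np.outer(y, y), y, lam=0.1))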
Definition 3 (Reproducing Kernel Hilbert Space) Let $\mathcal{X}$ be a nonempty set (often called
the index set) and denote by $\mathcal{H}$ a Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$. Then $\mathcal{H}$ is
called a reproducing kernel Hilbert space endowed with the dot product $\langle \cdot, \cdot \rangle$ (and the
norm $\|f\| := \sqrt{\langle f, f \rangle}$) if there exists a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfying, for all $x, x' \in \mathcal{X}$:

1. $k$ has the reproducing property $\langle f, k(x, \cdot) \rangle = f(x)$ for all $f \in \mathcal{H}$; in particular,
$\langle k(x, \cdot), k(x', \cdot) \rangle = k(x, x')$.

2. $k$ spans $\mathcal{H}$, i.e. $\mathcal{H} = \overline{\operatorname{span}\{k(x, \cdot) \mid x \in \mathcal{X}\}}$, where $\overline{X}$ denotes the completion of the set $X$.
The advantage of optimization in an RKHS is that under certain conditions the optimal
solutions can be found as a linear combination of a finite number of basis functions,
regardless of the dimensionality of the space $\mathcal{H}$, as guaranteed by the representer theorem (see e.g., [4, 7]).
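Concretely (a standard statement of the representer theorem, see e.g. [4, 7], not reproduced verbatim from this paper): the minimizer of the regularized risk (5) can be written as

$$f^{\star}(x) = \sum_{i=1}^{m} \alpha_i \, k(x_i, x), \qquad \alpha \in \mathbb{R}^m,$$

which is precisely the expansion substituted into (6).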
Definition 5 (Super Reproducing Kernel Hilbert Space) Let $\mathcal{X}$ be a nonempty set and
let $\underline{\mathcal{X}} := \mathcal{X} \times \mathcal{X}$ (the compounded index set). Then the Hilbert space $\underline{\mathcal{H}}$ of functions
$k : \underline{\mathcal{X}} \to \mathbb{R}$, endowed with a dot product $\langle \cdot, \cdot \rangle$ (and the norm $\|k\| = \sqrt{\langle k, k \rangle}$), is called a
Super Reproducing Kernel Hilbert Space if there exists a superkernel $\underline{k} : \underline{\mathcal{X}} \times \underline{\mathcal{X}} \to \mathbb{R}$ with
the following properties:

1. $\underline{k}$ has the reproducing property $\langle k, \underline{k}(\underline{x}, \cdot) \rangle = k(\underline{x})$ for all $k \in \underline{\mathcal{H}}$; in particular,
$\langle \underline{k}(\underline{x}, \cdot), \underline{k}(\underline{x}', \cdot) \rangle = \underline{k}(\underline{x}, \underline{x}')$.

2. $\underline{k}$ spans $\underline{\mathcal{H}}$, i.e. $\underline{\mathcal{H}} = \overline{\operatorname{span}\{\underline{k}(\underline{x}, \cdot) \mid \underline{x} \in \underline{\mathcal{X}}\}}$.

3. For any fixed $\underline{x} \in \underline{\mathcal{X}}$ the superkernel $\underline{k}$ is a kernel in its second argument, i.e. for
any fixed $\underline{x} \in \underline{\mathcal{X}}$, the function $k(x, x') := \underline{k}(\underline{x}, (x, x'))$ with $x, x' \in \mathcal{X}$ is a kernel.
What distinguishes $\underline{\mathcal{H}}$ from a normal RKHS is the particular form of its index set ($\underline{\mathcal{X}} = \mathcal{X}^2$)
and the additional condition on $\underline{k}$ to be a kernel in its second argument for any fixed first
argument. This condition somewhat limits the choice of possible superkernels $\underline{k}$. On the other
hand, it allows for simple optimization algorithms which consider kernels $k \in \underline{\mathcal{H}}$ that lie
in the convex cone of $\underline{k}$.
Analogously to the definition of the regularized risk functional (5), we define the regularized
quality functional:

$$Q_{\mathrm{reg}}[k, X, Y] := Q_{\mathrm{emp}}[k, X, Y] + \frac{\lambda_s}{2} \|k\|_{\underline{\mathcal{H}}}^2, \qquad (10)$$

where $\lambda_s > 0$ is a regularization constant and $\|k\|_{\underline{\mathcal{H}}}^2$ denotes the RKHS norm of $k$ in $\underline{\mathcal{H}}$. Minimization
of $Q_{\mathrm{reg}}$ is less prone to overfitting than minimizing $Q_{\mathrm{emp}}$, since the regularization
term $\frac{\lambda_s}{2} \|k\|_{\underline{\mathcal{H}}}^2$ effectively controls the complexity of the class of kernels under consideration.
Regularizers other than $\frac{\lambda_s}{2} \|k\|_{\underline{\mathcal{H}}}^2$ are also possible.
The question arising immediately from (10) is how to minimize the regularized quality
functional efficiently. In the following we show that the minimum can be found as a linear
combination of superkernels.
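As a sketch of what such a minimizer looks like in code (our own illustration: `superkernel` stands for any function satisfying Definition 5, and the coefficient matrix `beta` would in practice come from minimizing (10); neither name is taken from the paper):

import numpy as np

def expanded_kernel(superkernel, X, beta):
    """Kernel obtained as a linear combination of superkernels anchored at training pairs:
    k(x, x') = sum_{p,q} beta_pq * superkernel((x_p, x_q), (x, x'))."""
    m = len(X)
    def k(x, xp):
        return sum(beta[p, q] * superkernel((X[p], X[q]), (x, xp))
                   for p in range(m) for q in range(m))
    return k

def superkernel_norm_sq(superkernel, X, beta):
    """||k||^2 in the Super-RKHS via the reproducing property:
    sum_{i,j,p,q} beta_ij beta_pq * superkernel((x_i, x_j), (x_p, x_q))."""
    m = len(X)
    pairs = [(X[i], X[j]) for i in range(m) for j in range(m)]
    G = np.array([[superkernel(a, b) for b in pairs] for a in pairs])  # (m^2, m^2) Gram matrix over pairs
    b = np.asarray(beta).reshape(-1)
    return float(b @ G @ b)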
4 Examples of Superkernels
Having introduced the theoretical basis of the Super-RKHS, we need to answer the question
of whether practically useful superkernels $\underline{k}$ exist which satisfy the conditions of Definition 5.
We address this question by giving a set of general recipes for building such kernels.
is a superkernel: for any fixed $\underline{x}$, $\underline{k}(\underline{x}, (x, x'))$ is a sum of kernel functions, hence it is
a kernel itself (since $k^i(x, x')$ is a kernel if $k$ is). To show that $\underline{k}$ is a kernel, note that
$\underline{k}(\underline{x}, \underline{x}') = \langle \Phi(\underline{x}), \Phi(\underline{x}') \rangle$, where $\Phi(\underline{x}) := (\sqrt{c_0}, \sqrt{c_1}\, k^1(\underline{x}), \sqrt{c_2}\, k^2(\underline{x}), \ldots)$.
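A minimal implementation of this construction (our own sketch: the nonnegative Gaussian base kernel, the geometric coefficients c_i = c^i, and the truncation of the series are illustrative assumptions, not the specific choices made in the paper):

import numpy as np

def base_kernel(x, xp, width=1.0):
    """Nonnegative base kernel k(x, x'), as required for the power-series construction."""
    return np.exp(-np.sum((np.asarray(x, float) - np.asarray(xp, float)) ** 2)
                  / (2.0 * width ** 2))

def power_series_superkernel(xpair, ypair, c=0.5, terms=25):
    """superk((x, x'), (y, y')) = sum_i c_i k(x, x')^i k(y, y')^i with c_i = c^i, i.e.
    <Phi(.), Phi(.)> for Phi = (sqrt(c_0), sqrt(c_1) k, sqrt(c_2) k^2, ...), truncated to
    finitely many terms.  For any fixed first argument this is a power series in k(y, y')
    with nonnegative coefficients, hence a kernel in its second argument (Definition 5, item 3)."""
    a = base_kernel(*xpair)
    b = base_kernel(*ypair)
    return sum((c ** i) * (a ** i) * (b ** i) for i in range(terms))

print(power_series_superkernel((0.0, 1.0), (0.5, 0.3)))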
Then $\partial_{\theta}^{j} R(\theta) = D_2^{j} f(x(\theta), \theta)$, where $j \in \mathbb{N}$ and $D_2$ denotes the derivative with respect to
the second argument of $f$.
We can immediately conclude the following for the quality functional arising from $R_{\mathrm{reg}}$.

Corollary 8 For $f = \sum_i \alpha_i k(x_i, x)$ and a convex loss function $c$, the regularized quality
functional is convex in $k$.
The corresponding regularized quality functional is:

$$Q^{\mathrm{regrisk}}_{\mathrm{reg}}[k, X, Y] = Q^{\mathrm{regrisk}}_{\mathrm{emp}}[k, X, Y] + \frac{\lambda_s}{2} \|k\|_{\underline{\mathcal{H}}}^2. \qquad (18)$$

Since the minimizer of (18) can be written as a kernel expansion (by the representer theorem
for Super-RKHS), the optimal regularized quality functional can be written as (using
the soft margin loss and $\underline{K}_{ijpq} := \underline{k}((x_i, x_j), (x_p, x_q))$):

$$Q^{\mathrm{regrisk}}_{\mathrm{reg}}[\underline{K}, \alpha, \beta, X, Y] = \frac{1}{m} \sum_{i=1}^{m} \max\Bigl(0,\; 1 - y_i \sum_{j,p,q=1}^{m} \alpha_j \beta_{pq} \underline{K}_{ijpq}\Bigr) + \frac{\lambda}{2} \sum_{i,j,p,q=1}^{m} \alpha_i \alpha_j \beta_{pq} \underline{K}_{ijpq} + \frac{\lambda_s}{2} \sum_{i,j,p,q=1}^{m} \beta_{ij} \beta_{pq} \underline{K}_{ijpq}. \qquad (19)$$
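For completeness, a sketch of evaluating (19) for given coefficients (our own code; `Ku` is the m x m x m x m array of superkernel values K_ijpq, and `lam`, `lam_s` are the two regularization constants above):

import numpy as np

def quality_19(Ku, alpha, beta, y, lam, lam_s):
    """Evaluate (19): the soft margin loss of f(x_i) = sum_{j,p,q} alpha_j beta_pq K_ijpq
    plus the quadratic regularizers in alpha and beta."""
    f = np.einsum("ijpq,j,pq->i", Ku, alpha, beta)          # predictions on the training set
    loss = np.mean(np.maximum(0.0, 1.0 - y * f))            # averaged soft margin loss
    reg_alpha = 0.5 * lam * np.einsum("ijpq,i,j,pq->", Ku, alpha, alpha, beta)
    reg_beta = 0.5 * lam_s * np.einsum("ijpq,ij,pq->", Ku, beta, beta)
    return loss + reg_alpha + reg_beta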
Using the same non-optimized parameters for different data sets, we achieved results comparable
to other recent work on classification such as boosting, optimized SVMs, and kernel
target alignment [9, 10, 5] (note that we use a much smaller part of the data for training:
only 60% rather than 90%). Results based on $Q_{\mathrm{reg}}$ are comparable to hand-tuned SVMs
(rightmost column), except for the ionosphere data. We suspect that this is due to the small
training sample.
Summary and Outlook The regularized quality functional allows the systematic solution
of problems associated with the choice of a kernel. Quality criteria that can be used
include target alignment, regularized risk, and the log posterior. The regularization implicit
in our approach allows the control of overfitting that occurs if one optimizes over too
large a choice of kernels.
A very promising aspect of the current work is that it opens the way to theoretical analyses
of the price one pays by optimising over a larger set K of kernels. Current and future
research is devoted to working through this analysis and subsequently developing methods
for the design of good superkernels.
Acknowledgements This work was supported by a grant of the Australian Research
Council. The authors thank Grace Wahba for helpful comments and suggestions.
References
[1] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the
kernel matrix with semidefinite programming. In Proceedings of the International Conference
on Machine Learning. Morgan Kaufmann, 2002.
[2] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to
linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in
Graphical Models. Kluwer Academic, 1998.
[3] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing kernel parameters
for support vector machines. Machine Learning, 2002. Forthcoming.
[4] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional
Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
[5] N. Cristianini, A. Elisseeff, and J. Shawe-Taylor. On optimizing kernel alignment.
Technical Report NC2-TR-2001-087, NeuroCOLT, http://www.neurocolt.com, 2001.
[6] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[7] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA,
2002.
[8] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representa-
tion. Technical report, IBM Watson Research Center, New York, 2000.
[9] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Pro-
ceedings of the International Conference on Machine Learning, pages 148–156. Mor-
gan Kaufmann Publishers, 1996.
[10] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost.
Machine Learning, 42(3):287–320, 2001.