Kernel Trick Exercises
Julien Mairal and Jean-Philippe Vert
Exercise 1. Kernels
Study whether the following kernels are positive definite:
1. X = (−1, 1), K(x, x′) = 1/(1 − xx′)
2. X = N, K(x, x′) = 2^{x+x′}
3. X = N, K(x, x′) = 2^{xx′}
4. X = R+, K(x, x′) = log(1 + xx′)
5. X = R, K(x, x′) = exp(−|x − x′|²)
6. X = R, K(x, x′) = cos(x + x′)
7. X = R, K(x, x′) = cos(x − x′)
8. X = R+, K(x, x′) = min(x, x′)
9. X = R+, K(x, x′) = max(x, x′)
10. X = R+, K(x, x′) = min(x, x′)/max(x, x′)
11. X = N, K(x, x′) = GCD(x, x′)
12. X = N, K(x, x′) = LCM(x, x′)
13. X = N, K(x, x′) = GCD(x, x′)/LCM(x, x′)
14. Given a probability space (Ω, A, P), on X = A:
∀A, B ∈ A, K(A, B) = P(A ∩ B) − P(A)P(B).
15. Let X be a set and f, g : X → R+ two non-negative functions:
∀A, B ⊂ E, K(A, B) = |A ∩ B| / |A ∪ B|,
where |F| denotes the cardinality of F, and with the convention 0/0 = 0.
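Before attempting formal proofs, it can help to probe a candidate kernel numerically: sample a few points, build the Gram matrix, and check whether its smallest eigenvalue is (numerically) non-negative. The sketch below does this for kernel 8, K(x, x′) = min(x, x′) on R+; the sampling scheme and sample size are arbitrary choices, and a passing check is evidence, not a proof.

```python
import numpy as np

def min_kernel(x, y):
    # Candidate kernel 8: K(x, x') = min(x, x') on R+.
    return np.minimum(x, y)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=50)          # random points in R+
gram = min_kernel(x[:, None], x[None, :])    # 50 x 50 Gram matrix
eigvals = np.linalg.eigvalsh(gram)           # symmetric matrix, so eigvalsh

# A clearly negative eigenvalue (beyond round-off) disproves positive definiteness;
# all-non-negative eigenvalues only support the conjecture.
print("smallest eigenvalue:", eigvals.min())
```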
Exercise 5. RKHS
In the course we have shown that H0 endowed with this inner product is a
pre-Hilbert space. Let us now show how to finish the construction of the
RKHS from H0.
1. Show that any Cauchy sequence (f_n) in H0 converges pointwise to a function f : X → R defined by f(x) = lim_{n→+∞} f_n(x).
2. Show that any Cauchy sequence (f_n)_{n∈N} in H0 which converges pointwise to 0 satisfies:
lim_{n→+∞} ‖f_n‖_{H0} = 0.
4. If (f_n) and (g_n) are two Cauchy sequences in H0 which converge pointwise to two functions f and g ∈ H, show that the inner product ⟨f_n, g_n⟩_{H0} converges to a number which only depends on f and g. This allows us to formally define the operation:
⟨f, g⟩_H := lim_{n→+∞} ⟨f_n, g_n⟩_{H0}.
A kernel k is said to be conditionally positive definite (c.p.d.) if and only if it is symmetric and satisfies:
∑_{i,j=1}^{n} a_i a_j k(x_i, x_j) ≥ 0
for any n ∈ N, x_1, x_2, ..., x_n ∈ X and a_1, a_2, ..., a_n ∈ R with ∑_{i=1}^{n} a_i = 0.
6. Show that if k is c.p.d., then the function exp(tk(x, y)) is p.d. for all t ≥ 0.
7. Conversely, show that if the function exp(tk(x, y)) is p.d. for any t ≥ 0,
then k is c.p.d.
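To see the statement of questions 6 and 7 in action, one can take the classical c.p.d. kernel k(x, y) = −(x − y)² on R (so that exp(tk) is a Gaussian kernel) and check numerically that the Gram matrices of exp(tk) stay positive semi-definite for several values of t ≥ 0. This only illustrates the claim, it proves nothing; the data and the values of t below are arbitrary choices.

```python
import numpy as np

def cpd_kernel(x, y):
    # k(x, y) = -(x - y)^2 is a classical example of a c.p.d. (but not p.d.) kernel on R.
    return -(x - y) ** 2

rng = np.random.default_rng(1)
x = rng.normal(size=40)
k = cpd_kernel(x[:, None], x[None, :])

for t in [0.1, 1.0, 10.0]:
    gram = np.exp(t * k)                      # here this is a Gaussian Gram matrix
    print(t, np.linalg.eigvalsh(gram).min())  # should be >= 0 up to round-off
```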
Exercise 9. COCO
Given two sets of real numbers X = (x_1, ..., x_n) ∈ R^n and Y = (y_1, ..., y_n) ∈ R^n, the covariance between X and Y is defined as
cov_n(X, Y) = E_n(XY) − E_n(X) E_n(Y),
where E_n(U) = (∑_{i=1}^{n} u_i)/n. The covariance is useful to detect linear relationships between X and Y. In order to extend this measure to potential nonlinear relationships between X and Y, we consider the following criterion:
C_n^K(X, Y) = max_{f,g ∈ B_K} cov_n(f(X), g(Y)),
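The criterion can be probed numerically before doing the analytical maximization, which is the object of the exercise. The sketch below assumes that B_K denotes the unit ball of the RKHS of a Gaussian kernel (an assumption about the course notation), picks arbitrary functions f and g in the span of the kernel evaluated at the data, rescales them to unit RKHS norm, and evaluates cov_n(f(X), g(Y)); this gives a feasible value, hence a lower bound on C_n^K(X, Y).

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(-1, 1, size=n)
y = x ** 2 + 0.05 * rng.normal(size=n)     # nonlinear dependence, small linear covariance

def cov_n(u, v):
    return np.mean(u * v) - np.mean(u) * np.mean(v)

# f = sum_i alpha_i K(x_i, .) and g = sum_i beta_i K(y_i, .), rescaled to unit RKHS norm,
# so that (f, g) is a feasible point of the maximization over B_K.
Kx, Ky = gaussian_gram(x, x), gaussian_gram(y, y)
alpha = rng.normal(size=n)
beta = rng.normal(size=n)
alpha /= np.sqrt(alpha @ Kx @ alpha)
beta /= np.sqrt(beta @ Ky @ beta)

print("linear covariance :", cov_n(x, y))
print("cov_n(f(X), g(Y)) :", cov_n(Kx @ alpha, Ky @ beta))
```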
1. For x ∈ X, let
Ψ(x) = P_d(Φ(x) − m) + m,
where P_d is the projection onto the linear span of the first d kernel principal components of S. Show that Ψ(x) can be expressed as
Ψ(x) = ∑_{i=1}^{n} γ_i Φ(x_i),
2. For y ∈ X , express
1. Starting from an initial assignment z^0, we can try to minimize C(z, µ) by alternately minimizing it with respect to µ for fixed z, and with respect to z for fixed µ. Explain how both minimizations can be carried out (note: this method is called k-means; see the sketch after question 4).
4. Let H = ZL^{1/2}. What can we say about H^⊤H? Do you see a connection between kernel k-means and kernel PCA? Propose an algorithm to estimate Z from the solution of kernel PCA.
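The alternating scheme of question 1 takes only a few lines. The sketch below uses plain (non-kernelized) k-means with the standard objective C(z, µ) = ∑_i ‖x_i − µ_{z_i}‖², which is an assumption about the notation of the exercise, not a statement of it; the kernelized variant replaces squared Euclidean distances with distances computed from the Gram matrix.

```python
import numpy as np

def kmeans(x, k, n_iter=20, seed=0):
    """Alternate the two minimizations of C(z, mu): assignments z, then centroids mu."""
    rng = np.random.default_rng(seed)
    mu = x[rng.choice(len(x), size=k, replace=False)].copy()   # initial centroids
    for _ in range(n_iter):
        # (a) z-step: with mu fixed, assign each point to its closest centroid
        dists = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # (b) mu-step: with z fixed, each centroid is the mean of its cluster
        for j in range(k):
            if np.any(z == j):
                mu[j] = x[z == j].mean(axis=0)
    return z, mu

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 4.0])
z, mu = kmeans(x, k=2)
print("centroids:\n", mu)
```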
Given two sets of points S_1 = {x^1_1, ..., x^1_{n_1}} and S_2 = {x^2_1, ..., x^2_{n_2}} in R^p, let us denote by m_i = (1/n_i) ∑_{j=1}^{n_i} x^i_j the mean of class i, and by S_B and S_W the between and within class scatter matrices, respectively. LDA constructs the function
f_w(x) = w^⊤ x,
where w is the vector which maximizes
J(w) = (w^⊤ S_B w) / (w^⊤ S_W w).
1. Why does it make sense to maximize J(w)? What do we expect to find? (You can take as an example the case where the two sets S_1 and S_2 form two clusters, e.g., two Gaussians.)
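A small numerical experiment can make question 1 concrete. The sketch below builds two Gaussian clusters, uses the standard unnormalized two-class scatter matrices (the normalization convention may differ from the one used in the course), and compares J(w) for the classical Fisher direction S_W^{-1}(m_1 − m_2) with J(w) for a random direction.

```python
import numpy as np

rng = np.random.default_rng(3)
# Two Gaussian clusters in R^2, as suggested in question 1.
s1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
s2 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(120, 2))

m1, m2 = s1.mean(axis=0), s2.mean(axis=0)
# Standard (unnormalized) two-class scatter matrices.
s_b = np.outer(m1 - m2, m1 - m2)
s_w = (s1 - m1).T @ (s1 - m1) + (s2 - m2).T @ (s2 - m2)

def J(w):
    return (w @ s_b @ w) / (w @ s_w @ w)

w_fisher = np.linalg.solve(s_w, m1 - m2)   # the classical Fisher direction
w_random = rng.normal(size=2)
print("J(Fisher direction):", J(w_fisher))
print("J(random direction):", J(w_random))
```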
2. Let K be a positive definite kernel on a space X, let H_K denote the associated reproducing kernel Hilbert space, and let B_R = {f ∈ H_K : ‖f‖_{H_K} ≤ R}. Let S = (x_1, x_2, ..., x_N) be a set of points with x_i ∈ X (i = 1, ..., N), and let σ_1, σ_2, ..., σ_N be N independent Rademacher variables. Show that:
E sup_{f∈B_R} ∑_{i=1}^{N} σ_i f(x_i) ≤ R √(∑_{i=1}^{N} K(x_i, x_i)).
(A numerical illustration of this bound is sketched after question 3.)
3. Under the hypotheses of questions 2.1 and 2.2, show that there exists a constant C, to be determined, such that if X is a random variable with values in X, then:
∀f ∈ B_R, E[ψ(f, X)²] ≤ C E[ψ(f, X)].
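The inequality of question 2 can be probed numerically. Computing the supremum over B_R exactly is part of the exercise, so the sketch below only builds a crude Monte-Carlo lower bound on it (random functions in the span of the K(x_i, ·), rescaled to RKHS norm R) and checks that this estimate stays below the right-hand side; the Gaussian kernel, R, and all sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
N, R = 30, 2.0
x = rng.uniform(-1, 1, size=N)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)      # Gaussian Gram matrix, so K(x_i, x_i) = 1

def sup_lower_bound(sigma, n_candidates=500):
    # Crude Monte-Carlo lower bound on sup_{f in B_R} sum_i sigma_i f(x_i):
    # draw functions f = sum_j alpha_j K(x_j, .) and rescale them to RKHS norm R.
    best = -np.inf
    for _ in range(n_candidates):
        alpha = rng.normal(size=N)
        alpha *= R / np.sqrt(alpha @ K @ alpha)  # now ||f||_{H_K} = R
        best = max(best, sigma @ (K @ alpha))    # f(x_i) = (K alpha)_i
    return best

estimates = [sup_lower_bound(rng.choice([-1.0, 1.0], size=N)) for _ in range(30)]
print("Monte-Carlo estimate of E sup   :", np.mean(estimates))
print("bound R * sqrt(sum_i K(x_i,x_i)):", R * np.sqrt(np.trace(K)))
```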
Exercise 16. Dual coordinate ascent algorithms for SVMs
1. We recall the primal formulation of SVMs seen in class (slide 142):
min_{f∈H} (1/n) ∑_{i=1}^{n} max(0, 1 − y_i f(x_i)) + λ ‖f‖²_H,
Can we still apply the representer theorem? Why? Derive the corre-
sponding dual formulation by using Lagrangian duality. Can we apply
the coordinate ascent method to this dual? If yes, what are the update
rules?
3. Consider a coordinate ascent method applied to this dual that consists of updating two variables (α_i, α_j) at a time (while keeping the other n − 2 variables fixed). What are the update rules for these two variables?
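For concreteness, the sketch below runs single-coordinate dual ascent on one common parametrization of the SVM dual without offset, max_α ∑_i α_i − (1/2) α^⊤Qα with Q_ij = y_i y_j K(x_i, x_j) and box constraints 0 ≤ α_i ≤ C. This parametrization and the constant C are assumptions and may differ from the dual you are asked to derive; each coordinate update is simply a clipped one-dimensional maximization.

```python
import numpy as np

def dual_coordinate_ascent(K, y, C=1.0, n_epochs=20):
    """Coordinate ascent on:  max_alpha sum_i alpha_i - 0.5 * alpha^T Q alpha,
    subject to 0 <= alpha_i <= C, with Q_ij = y_i y_j K_ij (no offset term)."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * K
    alpha = np.zeros(n)
    for _ in range(n_epochs):
        for i in range(n):
            grad_i = 1.0 - Q[i] @ alpha                       # partial derivative in alpha_i
            alpha[i] = np.clip(alpha[i] + grad_i / Q[i, i], 0.0, C)
    return alpha

# Tiny illustration on separable 1-D data with a linear kernel.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K = np.outer(x, x)
alpha = dual_coordinate_ascent(K, y)
print("alpha:", alpha)
print("predicted signs:", np.sign(K @ (alpha * y)))
```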
where ‖f‖ is the norm of f in the RKHS H_K of the kernel K, and L is the squared hinge loss function max(0, 1 − y f(x))².
Write the primal and dual problems associated with the 2-SVM, and compare the result with the SVM studied in the course.
1. Show that µ(P) is in H and that E_{X∼P}[f(X)] = ⟨f, µ(P)⟩_H for any f ∈ H.
Remark: If P and Q are two Borel probability measures, then
where E_S is the expectation obtained by randomizing over the training set (each x_i is a random variable distributed according to P). Remember that you are allowed to (and you should!) use any existing result from the slides.
3. Consider the quantity
and give a formula for this quantity in terms of kernel evaluations only.
Remark: this is called the maximum mean discrepancy criterion, which
can be used for statistical testing (are S1 and S2 coming from the same
distribution?).
4. We consider X = R^d and the normalized Gaussian kernel with bandwidth σ: K(x, y) = σ^{−d} exp(−‖x − y‖²/(2σ²)). For any two sets S_1 and S_2, show that MMD(S_1, S_2) is a decreasing function of σ.
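The behaviour claimed in question 4 can be observed numerically. The sketch below evaluates the standard plug-in expression of the MMD written with kernel evaluations only (which is essentially what question 3 asks you to derive) for several bandwidths; the data and the grid of σ values are arbitrary choices.

```python
import numpy as np

def normalized_gaussian_gram(a, b, sigma):
    d = a.shape[1]
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return sigma ** (-d) * np.exp(-sq / (2 * sigma ** 2))

def mmd(s1, s2, sigma):
    # Plug-in estimator written with kernel evaluations only.
    k11 = normalized_gaussian_gram(s1, s1, sigma).mean()
    k22 = normalized_gaussian_gram(s2, s2, sigma).mean()
    k12 = normalized_gaussian_gram(s1, s2, sigma).mean()
    return np.sqrt(max(k11 + k22 - 2 * k12, 0.0))

rng = np.random.default_rng(5)
s1 = rng.normal(0.0, 1.0, size=(200, 1))
s2 = rng.normal(0.5, 1.0, size=(200, 1))
for sigma in [0.1, 0.5, 1.0, 2.0, 5.0]:
    print(f"sigma = {sigma:4.1f}   MMD = {mmd(s1, s2, sigma):.4f}")
```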
1. Let
3. Same question, when H is endowed with the bilinear form:
∀f, g ∈ H, ⟨f, g⟩_H = ∫_0^1 (f(u)g(u) + f′(u)g′(u)) du.
4. Let 0 < x_1 < ... < x_n < 1 and (y_1, ..., y_n) ∈ R^n. In order to estimate a regression function f : [0, 1] → R, we consider the following optimization problem:
min_{f∈H} (1/n) ∑_{i=1}^{n} (f(x_i) − y_i)² + λ ∫_0^1 f″(t)² dt.   (3)
7. Show that
• f̂ ∈ C²([0, 1]);
• f̂ is a polynomial of degree 3 on each interval [x_i, x_{i+1}] for i = 1, ..., n − 1;
• f̂ is an affine function on both intervals [0, x_1] and [x_n, 1].
f̂ is called a spline.
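Problem (3) is the classical smoothing-spline problem, and the structure described in question 7 (piecewise cubic, C², affine outside [x_1, x_n]) is that of a natural cubic spline. As an illustration only, and assuming SciPy ≥ 1.10 where make_smoothing_spline implements this penalty (up to the 1/n factor on the data-fitting term), one can fit and inspect such an estimator:

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline  # SciPy >= 1.10

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0.05, 0.95, size=30))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

# make_smoothing_spline minimizes  sum_i (y_i - f(x_i))^2 + lam * int f''(t)^2 dt,
# i.e. problem (3) up to the 1/n factor on the data term.
f_hat = make_smoothing_spline(x, y, lam=1e-4)

t = np.linspace(0.0, 1.0, 200)
print(np.c_[t[:5], f_hat(t[:5])])   # the fitted spline evaluated on a grid
```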
where ℓ_y is a convex loss function (for y ∈ {−1, 1}) and B > 0 is a parameter.
1. Show that there exists λ ≥ 0 such that the solution to problem (7) can be found by solving the following problem:
min_{α∈R^n} R(Kα) + λ α^⊤ K α,   (5)
4. Make the dual problem explicit for the logistic and squared hinge losses:
ℓ_y(u) = log(1 + e^{−yu}),
ℓ_y(u) = max(0, 1 − yu)².
4. Can you describe the functions ϕ : R+ → R such that:
H_τ ⊂ H_σ ⊂ L²(R^d),
and that
0 ≤ ‖f‖²_{H_σ} − ‖f‖²_{L²(R^d)} ≤ (σ²/τ²) (‖f‖²_{H_τ} − ‖f‖²_{L²(R^d)}).
1. Show that the following kernel is positive definite for any σ > 0:
K_1(X, Y) = ∑_{x∈X} ∑_{y∈Y} exp(−(x − y)²/(2σ²)).
3. Let P be a partition of [0, 1]. For any bin p ∈ P, let n_p(X) be the number of points of X which are in p. Show that the following kernels are positive definite:
K_3(X, Y) = ∑_{p∈P} min(n_p(X), n_p(Y)),
K_4(X, Y) = ∏_{p∈P} min(n_p(X), n_p(Y)).
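These set kernels are easy to evaluate in practice, which can help build intuition before proving positive definiteness. The sketch below computes K_1 and K_3 for two random point sets in [0, 1]; the bandwidth and the regular 10-bin partition are arbitrary choices.

```python
import numpy as np

def k1(X, Y, sigma=0.1):
    # Sum of Gaussian evaluations over all pairs (x, y), x in X, y in Y.
    X, Y = np.asarray(X), np.asarray(Y)
    return np.exp(-(X[:, None] - Y[None, :]) ** 2 / (2 * sigma ** 2)).sum()

def k3(X, Y, n_bins=10):
    # Histogram-intersection style kernel over a regular partition of [0, 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    nX, _ = np.histogram(X, bins=edges)
    nY, _ = np.histogram(Y, bins=edges)
    return np.minimum(nX, nY).sum()

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=15)
Y = rng.uniform(0, 1, size=20)
print("K1(X, Y) =", k1(X, Y))
print("K3(X, Y) =", k3(X, Y))
```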
the process stops and the tree generated is the root only. Otherwise, the same rule is applied independently to both children, which themselves have 0 or 2 children with probabilities 1 − p and p. The process is repeated iteratively for all new children, until no more children are generated, or until we reach the D-th generation, where nodes have no children with probability 1. For any T ∈ S(T_D) we denote by π(T) the probability of generating T by this process. For any real-valued function h defined over the set of nodes s ∈ T_D, propose a factorization to compute the following sum efficiently:
∑_{T∈S(T_D)} π(T) ∏_{s∈leaves(T)} h(s).
8. Show that the following function is a positive definite kernel and propose an efficient implementation to compute it:
K_5(X, Y) = ∑_{T∈S(T_D)} π(T) K_T(X, Y).
where the expectation is taken over σi ∈ {−1, +1} for i = 1, . . . , n, which
are independent uniform Rademacher random variables. The following result
can be used without proof:
Lemma 1. For any n×n symmetric p.s.d. matrix K, and σ = (σ_1, ..., σ_n)^⊤ a vector of independent Rademacher random variables, the following holds:
∀r ∈ N*, E[(σ^⊤ K σ)^r] ≤ (2r trace(K))^r.
For any v ∈ V, we denote by D(v) ⊂ V the set of descendants of v (including
itself), and let dv ≥ 0 be a weight associated to each vertex v. We assume
that to each vertex v ∈ V is associated a positive definite kernel Kv over a
space X .
1. Using the notations of the course (slide 159), show that solving the following weighted MKL problem with the set of kernels {K_v : v ∈ V}:
min_{(f_{v_1},...,f_{v_M}) ∈ H_{K_{v_1}} × ... × H_{K_{v_M}}} R^n(∑_{v∈V} f_v) + λ (∑_{v∈V} d_v ‖f_v‖_{H_{K_v}})²
is equivalent to solving:
min_{η∈Σ} min_{f∈H_{K_η}} { R^n(f) + λ ‖f‖²_{H_{K_η}} }.
2. We now consider the following variant of MKL which takes the graph structure into account:
min_{(f_{v_1},...,f_{v_M}) ∈ H_{K_{v_1}} × ... × H_{K_{v_M}}} R^n(∑_{v∈V} f_v) + λ (∑_{v∈V} d_v (∑_{w∈D(v)} ‖f_w‖²_{H_{K_w}})^{1/2})².   (7)
Can you intuitively explain why we may want to do this, and what we can expect from the solution of this formulation?
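In both formulations the inner problem is a single-kernel learning problem with a combined kernel K_η indexed by weights η in the simplex Σ. The exact parametrization of K_η comes from slide 159 and is not reproduced here; the sketch below only illustrates the generic fact this relies on, namely that a convex combination of Gram matrices of p.d. kernels is again positive semi-definite (the linear and Gaussian base kernels are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=(30, 3))

# Two base Gram matrices on the same points (linear and Gaussian kernels).
k_lin = x @ x.T
sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
k_rbf = np.exp(-sq / 2.0)

# A convex combination of p.d. kernels: eta lives in the simplex Sigma.
eta = np.array([0.3, 0.7])
k_eta = eta[0] * k_lin + eta[1] * k_rbf

print("smallest eigenvalue of K_eta:", np.linalg.eigvalsh(k_eta).min())  # stays >= 0
```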
Exercise 28. Properties of the dot-product kernel
Consider the dot-product kernel on the sphere K_1 : S^{p−1} × S^{p−1} → R such that for every pair of points x, x′ in S^{p−1} (the unit sphere of R^p),
K_1(x, x′) = κ(⟨x, x′⟩),
where κ : [−1, 1] → R is an infinitely differentiable function that admits a polynomial expansion on [−1, 1]:
κ(u) = ∑_{i=0}^{+∞} a_i u^i,   (8)
where the a_i's are real coefficients and the sum above always converges.
1. Show that if all coefficients a_i are non-negative and κ ≠ 0, then K_1 is p.d.
2. If K_1 is p.d., show that the homogeneous dot-product kernel K_2 : R^p × R^p → R is also p.d., where
K_2(x, x′) = ‖x‖ ‖x′‖ κ(⟨x, x′⟩ / (‖x‖ ‖x′‖))   if ‖x‖ ≠ 0 and ‖x′‖ ≠ 0,
K_2(x, x′) = 0   otherwise.
Remark: it is in fact possible to show that all coefficients a_i need to be non-negative for the positive definiteness to hold in every dimension p, but we do not ask for a proof of this result, which is due to Schoenberg, 1942.
3. Assume that all coefficients a_i are non-negative (K_1 is thus p.d.) and that κ(1) = κ′(1) = 1. Let H be the RKHS of K_1 and consider its RKHS mapping ϕ : S^{p−1} → H such that K_1(x, x′) = ⟨ϕ(x), ϕ(x′)⟩_H for all x, x′ in S^{p−1}. Show that:
∀x, x′ ∈ S^{p−1}, ‖ϕ(x) − ϕ(x′)‖_H ≤ ‖x − x′‖.
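As a concrete instance of questions 1 and 3, κ(u) = exp(u − 1) has the non-negative expansion coefficients e^{−1}/i! and satisfies κ(1) = κ′(1) = 1, so it can be used to check numerically both the positive definiteness of K_1 and the contraction inequality of question 3 (using the standard identity ‖ϕ(x) − ϕ(x′)‖²_H = K_1(x, x) + K_1(x′, x′) − 2K_1(x, x′)). This illustrates the statements, it does not prove them.

```python
import numpy as np

# kappa(u) = exp(u - 1) has expansion coefficients e^{-1}/i! >= 0
# and satisfies kappa(1) = kappa'(1) = 1, as required in question 3.
kappa = lambda u: np.exp(u - 1.0)

rng = np.random.default_rng(9)
p, n = 5, 40
x = rng.normal(size=(n, p))
x /= np.linalg.norm(x, axis=1, keepdims=True)      # points on the unit sphere S^{p-1}

gram = kappa(x @ x.T)                              # K1(x_i, x_j) = kappa(<x_i, x_j>)
print("smallest eigenvalue of the Gram matrix:", np.linalg.eigvalsh(gram).min())

# ||phi(x) - phi(x')||_H vs ||x - x'|| for one pair of points.
i, j = 0, 1
lhs = np.sqrt(gram[i, i] + gram[j, j] - 2 * gram[i, j])
rhs = np.linalg.norm(x[i] - x[j])
print("RKHS distance:", lhs, "<= Euclidean distance:", rhs)
```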
5. Let us assume that you have found an explicit feature map ψ in the
previous question. Remember from one of our previous homeworks that
the RKHS H of K1 can be characterized by
with
‖f_w‖²_H = inf_{w′∈ℓ²} { ‖w′‖²_{ℓ²} : f_w = f_{w′} }.
g_z : x ↦ σ(⟨z, x⟩)