
MVA "Kernel methods in machine learning"

Exercices
Julien Mairal and Jean-Philippe Vert

Exercice 1. Kernels
Study whether the following kernels are positive definite:
1. X = (−1, 1), K(x, x′) = 1/(1 − xx′)
2. X = N, K(x, x′) = 2^(x+x′)
3. X = N, K(x, x′) = 2^(xx′)
4. X = R+, K(x, x′) = log(1 + xx′)
5. X = R, K(x, x′) = exp(−|x − x′|²)
6. X = R, K(x, x′) = cos(x + x′)
7. X = R, K(x, x′) = cos(x − x′)
8. X = R+, K(x, x′) = min(x, x′)
9. X = R+, K(x, x′) = max(x, x′)
10. X = R+, K(x, x′) = min(x, x′)/max(x, x′)
11. X = N, K(x, x′) = GCD(x, x′)
12. X = N, K(x, x′) = LCM(x, x′)
13. X = N, K(x, x′) = GCD(x, x′)/LCM(x, x′)
14. Given a probability space (Ω, A, P), on X = A:
∀A, B ∈ A ,  K(A, B) = P(A ∩ B) − P(A)P(B) .

15. Let X be a set and f, g : X → R+ two non-negative functions:

∀x, y ∈ X ,  K(x, y) = min(f(x)g(y), f(y)g(x)) .

16. Given a non-empty finite set E, on X = P(E) = {A : A ⊂ E}:

∀A, B ⊂ E ,  K(A, B) = |A ∩ B| / |A ∪ B| ,

where |F| denotes the cardinality of F, and with the convention 0/0 = 0.
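Before attempting a proof, it can help to probe a candidate kernel numerically: sample a handful of points, form the Gram matrix, and inspect its smallest eigenvalue. The sketch below (Python with NumPy; the helper name and the sampled points are ours, not part of the exercise) can only disprove positive definiteness, never prove it.

```python
import numpy as np

def min_gram_eigenvalue(kernel, points):
    """Smallest eigenvalue of the Gram matrix of `kernel` on `points`.

    A clearly negative value shows the kernel is not positive definite on
    these points; a non-negative value proves nothing by itself.
    """
    G = np.array([[kernel(x, y) for y in points] for x in points])
    return np.linalg.eigvalsh(G).min()

rng = np.random.default_rng(0)
pts = rng.uniform(0.1, 5.0, size=20)
# Items 8 and 9 on R+: min(x, x') and max(x, x') behave very differently.
print(min_gram_eigenvalue(min, pts), min_gram_eigenvalue(max, pts))
```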

Exercice 2. Function and kernel boundedness


Consider a p.d. kernel K : X × X → R such that K(x, z) ≤ b² for all x, z in
X . Show that ‖f‖_∞ = sup_{x∈X} |f(x)| ≤ b for any function f in the unit ball
of the corresponding RKHS.

Exercice 3. Non-expansiveness of the Gaussian kernel


Consider the Gaussian kernel K : Rp × Rp → R such that for all pairs of
points x, x′ in Rp,

K(x, x′) = exp( −(α/2) ‖x − x′‖² ) ,

where ‖·‖ is the Euclidean norm on Rp. Call H the RKHS of K and consider
its RKHS mapping ϕ : Rp → H such that K(x, x′) = ⟨ϕ(x), ϕ(x′)⟩_H for all
x, x′ in Rp. Show that

‖ϕ(x) − ϕ(x′)‖_H ≤ √α ‖x − x′‖ .

The mapping is called non-expansive whenever α ≤ 1.

Exercice 4. Kernels encoding equivalence classes.


Consider a similarity measure K : X × X → {0, 1} with K(x, x) = 1 for all
x in X . Prove that K is p.d. if and only if, for all x, x′, x″ in X ,

• K(x, x′) = 1 ⇔ K(x′, x) = 1, and

• K(x, x′) = K(x′, x″) = 1 ⇒ K(x, x″) = 1.

Exercice 5. RKHS

1. Let K1 and K2 be two positive definite kernels on a set X , and α, β


two positive scalars. Show that αK1 + βK2 is positive definite, and
describe its RKHS.

2. Let X be a set and F be a Hilbert space. Let Ψ : X → F, and


K : X × X → R be:

∀x, x′ ∈ X ,  K(x, x′) = ⟨Ψ(x), Ψ(x′)⟩_F .

Show that K is a positive definite kernel on X , and describe its RKHS.

3. Prove that for any p.d. kernel K on a space X , a function f : X → R


belongs to the RKHS H with kernel K if and only if there exists λ > 0
such that K(x, x′) − λ f(x) f(x′) is p.d.

Exercice 6. Completeness of the RKHS


We want to finish the construction of the RKHS associated to a positive
definite kernel K given in the course. Remember we have defined the set of
functions:
H_0 = { ∑_{i=1}^{n} α_i K_{x_i} : n ∈ N, α_1, . . . , α_n ∈ R, x_1, . . . , x_n ∈ X }

and for any two functions f, g ∈ H_0, given by:

f = ∑_{i=1}^{m} a_i K_{x_i} ,   g = ∑_{j=1}^{n} b_j K_{y_j} ,

we have defined the operation:

⟨f, g⟩_{H_0} := ∑_{i,j} a_i b_j K(x_i, y_j) .

In the course we have shown that H_0 endowed with this inner product is a
pre-Hilbert space. Let us now show how to finish the construction of the
RKHS from H_0.

1. Show that any Cauchy sequence (f_n) in H_0 converges pointwise to a
function f : X → R defined by f(x) = lim_{n→+∞} f_n(x).

2. Show that any Cauchy sequence (f_n)_{n∈N} in H_0 which converges point-
wise to 0 satisfies:
lim_{n→+∞} ‖f_n‖_{H_0} = 0 .

3. Let H ⊂ R^X be the set of functions f : X → R which are pointwise
limits of Cauchy sequences in H_0, i.e., f ∈ H if there exists a Cauchy sequence
(f_n) in H_0 such that f(x) = lim_{n→+∞} f_n(x) for all x ∈ X . Show that H_0 ⊂ H.

4. If (f_n) and (g_n) are two Cauchy sequences in H_0 which converge point-
wise to two functions f and g ∈ H, show that the inner product
⟨f_n, g_n⟩_{H_0} converges to a number which only depends on f and g. This
allows us to define formally the operation:

⟨f, g⟩_H = lim_{n→+∞} ⟨f_n, g_n⟩_{H_0} .

5. Show that ⟨·, ·⟩_H is an inner product on H.

6. Show that H_0 is dense in H (with respect to the metric defined by the
inner product ⟨·, ·⟩_H).

7. Show that H is complete.

8. Show that H is an RKHS whose reproducing kernel is K.

Exercice 7. Uniqueness of the RKHS


Prove that if K : X × X → R is a positive definite function, then it is the r.k. of
a unique RKHS. (Hint: consider the linear space spanned by the functions
K_x : t ↦ K(x, t), and use the fact that a linear subspace F of a Hilbert space
H is dense in H if and only if 0 is the only vector orthogonal to all vectors in
F.)

Exercice 8. Conditionally positive definite kernels


Let X be a set. A function k : X × X → R is called conditionally positive
definite (c.p.d.) if and only if it is symmetric and satisfies:

∑_{i,j=1}^{n} a_i a_j k(x_i, x_j) ≥ 0

for any n ∈ N, x_1, x_2, . . . , x_n ∈ X and a_1, a_2, . . . , a_n ∈ R with ∑_{i=1}^{n} a_i = 0.

1. Show that a positive definite (p.d.) function is c.p.d.

2. Is a constant function p.d.? Is it c.p.d.?

3. If X is a Hilbert space, then is k(x, y) = −‖x − y‖² p.d.? Is it c.p.d.?

4. Let X be a nonempty set, and x0 ∈ X a point. For any function


k : X × X → R, let k̃ : X × X → R be the function defined by:

k̃(x, y) = k(x, y) − k(x0 , x) − k(x0 , y) + k(x0 , x0 ).

Show that k is c.p.d. if and only if k̃ is p.d.

5. Let k be a c.p.d. kernel on X such that k(x, x) = 0 for any x ∈ X .


Show that there exists a Hilbert space H and a mapping Φ : X → H
such that, for any x, y ∈ X ,

k(x, y) = −‖Φ(x) − Φ(y)‖² .

6. Show that if k is c.p.d., then the function exp(tk(x, y)) is p.d. for all
t ≥ 0.

7. Conversely, show that if the function exp(tk(x, y)) is p.d. for any t ≥ 0,
then k is c.p.d.

8. Show that the negative shortest-path distance on a tree¹ is c.p.d. over
the set of vertices (a tree is a connected undirected graph without cycles). Is the
negative shortest-path distance over graphs c.p.d. in general?

¹ I.e., the function k(x, y) = −d(x, y), where d(x, y) is the shortest-path distance
between x and y, that is, the minimum number of edges of any path that connects
x to y.

Exercice 9. COCO
Given two sets of real numbers X = (x_1, . . . , x_n) ∈ Rn and Y = (y_1, . . . , y_n) ∈
Rn, the covariance between X and Y is defined as

cov_n(X, Y) = E_n(XY) − E_n(X) E_n(Y) ,

where E_n(U) = ( ∑_{i=1}^{n} u_i )/n. The covariance is useful to detect linear rela-
tionships between X and Y. In order to extend this measure to potential
nonlinear relationships between X and Y, we consider the following criterion:

C_n^K(X, Y) = max_{f,g∈B_K} cov_n(f(X), g(Y)) ,

where K is a positive definite kernel on R, B_K is the unit ball of the RKHS
of K, and f(U) = (f(u_1), . . . , f(u_n)) for a vector U = (u_1, . . . , u_n).

1. Express simply C_n^K(X, Y) for the linear kernel K(a, b) = ab.

2. For a general kernel K, express C_n^K(X, Y) in terms of the Gram matri-
ces of X and Y.

Exercice 10. RKHS-induced semi-metrics


Let H be an RKHS of functions with domain X , associated to a measurable
p.d. kernel K : X × X → R. Consider two probability distributions P and
Q on X . Show that
sup_{‖f‖_H ≤ 1} | E_P[f(X)] − E_Q[f(Z)] |² = E[ K(X, X′) + K(Z, Z′) − 2 K(X, Z) ] ,

where X, X′ ∼ P and Z, Z′ ∼ Q are jointly independent.

Exercice 11. Kernel PCA for data denoising


Let X be a space endowed with a p.d. kernel K, and Φ : X → H a mapping
to a Hilbert space H such that for all x, x′ ∈ X ,

⟨Φ(x), Φ(x′)⟩ = K(x, x′) .

Let S = {x_1, . . . , x_n} be a set of points in X , and

m = (1/n) ∑_{i=1}^{n} Φ(x_i)

their barycenter in the feature space.

1. For x ∈ X , let
Ψ(x) = Pd (Φ(x) − m) + m
where Pd is the projection onto the linear span of the first d kernel
principal components of S. Show that Ψ(x) can be expressed as
Ψ(x) = ∑_{i=1}^{n} γ_i Φ(x_i) ,

for some γi to be explicitly computed.

2. For y ∈ X , express

f(y) = ‖Φ(y) − Ψ(x)‖²

in terms of kernel evaluations. Explain why minimizing f(y) can be
thought of as a method to "denoise" x.

3. Express f and ∇f in the case X = Rp and K(x, x′) = exp( −‖x − x′‖²/(2σ²) ).
Propose an iterative algorithm (for example gradient descent) to find
a local minimum of f in that case.

4. Download the USPS ZIP code data from
http://statweb.stanford.edu/~tibs/ElemStatLearn/data.html . Vi-
sualize (a subset of) the dataset in two dimensions with kernel PCA,
for different kernels. Implement the denoising procedure discussed in question 3,
and test it on some data that you have corrupted with noise. Compute
how similar the denoised images are to the original (uncorrupted)
images as a function of the number of principal components used.
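For the visualization part of question 4, one possible starting point is sketched below. It is only one way to do it: it assumes numpy, matplotlib and scikit-learn are installed, uses scikit-learn's KernelPCA rather than your own implementation, and the file name and column layout of the USPS data are assumptions to adjust after downloading.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import KernelPCA

data = np.loadtxt("zip.train")          # hypothetical file name; first column = digit label
labels, images = data[:, 0], data[:, 1:]
subset = np.random.default_rng(0).choice(len(images), size=1000, replace=False)

# Two-dimensional kernel PCA embeddings for a few different kernels.
for kernel, params in [("linear", {}), ("rbf", {"gamma": 1e-2}), ("poly", {"degree": 3})]:
    kpca = KernelPCA(n_components=2, kernel=kernel, **params)
    Z = kpca.fit_transform(images[subset])
    plt.figure()
    plt.scatter(Z[:, 0], Z[:, 1], c=labels[subset], cmap="tab10", s=5)
    plt.title(f"Kernel PCA, {kernel} kernel")
plt.show()
```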

Exercice 12. Kernel k-means


In order to cluster a set of vectors x1 , . . . , xn ∈ Rp into K groups, we consider
the minimization of:

C(z, µ) = ∑_{i=1}^{n} ‖x_i − µ_{z_i}‖²

over the cluster assignment variables z_i (taking values in {1, . . . , K} for all i =
1, . . . , n) and over the cluster means µ_i ∈ Rp, i = 1, . . . , K.

1. Starting from an initial assignment z^0, we can try to minimize C(z, µ)
by iterating:

µ^i = argmin_µ C(z^i, µ) ,   z^{i+1} = argmin_z C(z, µ^i) .

Explain how both minimizations can be carried out (note: this method
is called k-means).

2. Propose a similar iterative algorithm to perform k-means in the RKHS


H of a p.d. kernel K over Rp, i.e., to minimize:

C_K(z, µ) = ∑_{i=1}^{n} ‖Φ(x_i) − µ_{z_i}‖² ,

where Φ : Rp → H satisfies ⟨Φ(x), Φ(x′)⟩_H = K(x, x′).

3. Let Z be the n × K assignment matrix with values Z_{ij} = 1 if x_i is
assigned to cluster j, and 0 otherwise. Let N_j = ∑_{i=1}^{n} Z_{ij} be the number of
points assigned to cluster j, and L be the K × K diagonal matrix with
entries L_{ii} = 1/N_i. Show that minimizing C_K(z, µ) is equivalent to
maximizing over the assignment matrix Z the trace of L^{1/2} Zᵀ K Z L^{1/2}.

4. Let H = Z L^{1/2}. What can we say about HᵀH? Do you see a connec-
tion between kernel k-means and kernel PCA? Propose an algorithm
to estimate Z from the solution of kernel PCA.

5. Implement the two variants of kernel k-means (Questions 2 and 4).
Test them with different kernels (linear, Gaussian) on the Libras Move-
ment Data Set (http://archive.ics.uci.edu/ml/datasets/Libras+Movement;
n = 360, p = 90, K = 15). Visualize the data mapped
to the first two principal components for different kernels, and check
how well clustering recovers the 15 classes. (Note: only use the first 90
attributes for clustering; the 91st one is the class label.)
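For question 5, a plain (non-kernel) k-means loop of the form described in question 1 can serve as a baseline; one possible sketch is given below (Python/NumPy, our own helper rather than a prescribed implementation). With the linear kernel, the two kernel variants you derive in questions 2 and 4 should produce assignments comparable to this baseline.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain k-means (question 1): alternate between cluster means and assignments."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, K, size=len(X))              # initial assignment z^0
    for _ in range(n_iter):
        # means given assignments; empty clusters are re-seeded at a random point
        mu = np.array([X[z == k].mean(axis=0) if np.any(z == k)
                       else X[rng.integers(len(X))] for k in range(K)])
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        z_new = d.argmin(axis=1)                     # assignments given means
        if np.array_equal(z_new, z):
            break
        z = z_new
    return z, mu
```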

Exercice 13. Kernel LDA


Fisher’s linear discriminant analysis (LDA) is a method for supervised bi-
nary classification of finite-dimensional vectors. Given two sets of points

 
S_1 = {x^1_1, . . . , x^1_{n_1}} and S_2 = {x^2_1, . . . , x^2_{n_2}} in Rp, let us denote by
m_i = (1/n_i) ∑_{j=1}^{n_i} x^i_j , and by:

S_B = (m_1 − m_2)(m_1 − m_2)ᵀ ,   (1)

S_W = ∑_{i=1,2} ∑_{x∈S_i} (x − m_i)(x − m_i)ᵀ ,   (2)

the between and within class scatter matrices, respectively. LDA constructs
the function
f_w(x) = wᵀ x ,
where w is the vector which maximizes

J(w) = (wᵀ S_B w) / (wᵀ S_W w) .
1. Why does it make sense to maximize J(w)? What do we expect to
find? (you can take as example the case where the two sets S1 and S2
form two clusters, e.g., two Gaussians).

2. We want to extend LDA to the feature space H induced by a positive
definite kernel K by the relation K(x, x′) = ⟨Φ(x), Φ(x′)⟩_H. For a
vector w ∈ H that is a linear combination of the form

w = ∑_{i=1,2} ∑_{j=1}^{n_i} α^i_j Φ(x^i_j) ,

express J(w) and f_w(x) as a function of α and K.

Exercice 14. Rademacher complexity


A Rademacher variable is a random variable σ that can take two possible
values, −1 and +1, each with probability 1/2.
1. Let (u_1, u_2, . . . , u_N) be N vectors in a Hilbert space endowed with
an inner product ⟨·, ·⟩, and let σ_1, σ_2, . . . , σ_N be N independent
Rademacher variables. Show that:

E( ∑_{i=1}^{N} ∑_{j=1}^{N} σ_i σ_j ⟨u_i, u_j⟩ ) = ∑_{i=1}^{N} ‖u_i‖² .

2. Let K be a positive definite kernel on a space X , H_K denote the associ-
ated reproducing kernel Hilbert space, and B_R = {f ∈ H_K : ‖f‖_{H_K} ≤ R}.
Let S = (x_1, x_2, . . . , x_N) be a set of points with x_i ∈ X (i = 1, . . . , N),
and let σ_1, σ_2, . . . , σ_N be N independent Rademacher variables. Show
that:

E( sup_{f∈B_R} ∑_{i=1}^{N} σ_i f(x_i) ) ≤ R √( ∑_{i=1}^{N} K(x_i, x_i) ) .

Exercice 15. Some upper bounds for learning theory


Let K be a positive definite kernel on a measurable set X , (H_K, ‖·‖_{H_K})
denote the corresponding reproducing kernel Hilbert space, λ > 0, and ϕ :
R → R a function. We assume that:

κ = sup_{x∈X} K(x, x) < +∞ ,

and we note B_R = {f ∈ H_K : ‖f‖_{H_K} ≤ R}. Let us define, for all f ∈ H_K and
x ∈ X,

R_ϕ(f, x) = ϕ(f(x)) + λ ‖f‖²_{H_K} .
1. ϕ is said to be Lipschitz if there exists a constant L > 0 such that,
for all u, v ∈ R, |ϕ(u) − ϕ(v)| ≤ L |u − v|. Show that, in that case,
there exists a constant C_1, to be determined, such that for all x ∈ X
and f, g ∈ B_R:

|R_ϕ(f, x) − R_ϕ(g, x)| ≤ C_1 ‖f − g‖_{H_K} .

2. ϕ is said to be convex if for all u, v ∈ R and t ∈ [0, 1], ϕ(tu + (1 − t)v) ≤
t ϕ(u) + (1 − t) ϕ(v). We assume that ϕ is convex, and that for all x ∈ X ,
there exists f_x ∈ H_K which minimizes f ↦ R_ϕ(f, x). Show that there
exists a constant C_2 > 0, to be determined, such that:

ψ(f, x) = R_ϕ(f, x) − R_ϕ(f_x, x) ≥ C_2 ‖f − f_x‖²_{H_K} .

3. Under the hypotheses of questions 1 and 2, show that there exists
a constant C, to be determined, such that if X is a random variable
with values in X , then:

∀f ∈ B_R ,  E[ψ(f, X)²] ≤ C E[ψ(f, X)] .

Exercice 16. Dual coordinate ascent algorithms for SVMs

1. We recall the primal formulation of SVMs seen in the class (slide 142):

min_{f∈H}  (1/n) ∑_{i=1}^{n} max(0, 1 − y_i f(x_i)) + λ ‖f‖²_H ,

and its dual formulation (slide 152):

max_{α∈Rn}  2 αᵀ y − αᵀ K α   such that 0 ≤ y_i α_i ≤ 1/(2λn) , for all i.
The coordinate ascent method consists of iteratively optimizing with
respect to one variable, while fixing the other ones. Assume that
you want to maximize the dual by following this approach. Find (and
justify) the update rule for α_j.
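Once you have derived the closed-form update for α_j, a loop of the following shape (Python/NumPy sketch; `update_rule` is deliberately left for you to fill in, and the helper names are ours) lets you check numerically that the dual objective 2αᵀy − αᵀKα increases at every pass.

```python
import numpy as np

def dual_objective(alpha, K, y):
    """Value of the dual objective 2 a'y - a'Ka."""
    return 2 * alpha @ y - alpha @ K @ alpha

def dual_coordinate_ascent(K, y, lam, update_rule, n_epochs=50):
    """Cycle over coordinates; `update_rule(j, alpha, K, y, lam)` returns the new alpha_j."""
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(n_epochs):
        for j in range(n):
            alpha[j] = update_rule(j, alpha, K, y, lam)
        # dual_objective(alpha, K, y) should be non-decreasing across passes
    return alpha
```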
2. Consider now the primal formulation of SVMs with intercept:

min_{f∈H, b∈R}  (1/n) ∑_{i=1}^{n} max(0, 1 − y_i (f(x_i) + b)) + λ ‖f‖²_H .

Can we still apply the representer theorem? Why? Derive the corre-
sponding dual formulation by using Lagrangian duality. Can we apply
the coordinate ascent method to this dual? If yes, what are the update
rules?
3. Consider a coordinate ascent method to this dual that consists of up-
dating two variables (αi , αj ) at a time (while fixing the n − 2 other
variables). What are the update rules for these two variables?

Exercice 17. 2-SVM


The 2-SVM algorithm is a method for supervised binary classification. Given
a training set (xi , yi )i=1,...,n of training patterns x1 , . . . , xn in a space X en-
dowed with a positive definite kernel K, and a set of corresponding labels
y1 , . . . , yn ∈ {−1, 1}, it solves the following problem:
min_{f∈H_K}  { (1/n) ∑_{i=1}^{n} L(f(x_i), y_i) + λ ‖f‖² } ,

where ‖f‖ is the norm of f in the RKHS H_K of the kernel K, and L is the
squared hinge loss function:

L(u, y) = max(1 − uy, 0)² .

Write the primal and dual problems associated to the 2-SVM, and compare
the result with the SVM studied in the course.

Exercice 18. Kernel mean embedding


Let us consider a Borel probability measure P of some random variable X
on a compact set X . Let K : X × X → R be a continuous, bounded, p.d.
kernel and H be its RKHS. The kernel mean embedding of P is defined as
the function
µ(P) : X → R ,   y ↦ E_{X∼P}[K(X, y)] .

1. Show that µ(P) is in H and that E_{X∼P}[f(X)] = ⟨f, µ(P)⟩_H for any
f ∈ H.
Remark: If P and Q are two Borel probability measures, then

µ(P) = µ(Q) implies { E_{X∼P}[f(X)] = E_{X∼Q}[f(X)] for all f ∈ H } .

When H is dense in the space of continuous bounded functions on X ,
this relation is sufficient to show that P = Q. Hence, the kernel mean
embedding (a single point in the RKHS!) carries all information about the
distribution. We call such kernels "universal". It is possible to show
that the Gaussian kernel is universal.

2. Consider the empirical distribution

P_S = (1/n) ∑_{i=1}^{n} δ_{x_i} ,

where S = {x_1, . . . , x_n} is a finite subset of X and δ_{x_i} is the Dirac distri-
bution centered at x_i. Show that

E_S[ ‖µ(P) − µ(P_S)‖_H ] ≤ 4 √( E[K(X, X)] ) / √n ,

where ES is the expectation by randomizing over the training set (each
xi is a r.v. distributed according to P ). Remember that you are allowed
to (and you should!) use any existing result from the slides.
3. Consider the quantity

MMD(S_1, S_2) = ‖µ(P_{S_1}) − µ(P_{S_2})‖²_H

for two sets S_1 = (x_1, . . . , x_n) and S_2 = (y_1, . . . , y_m). Show that

MMD(S_1, S_2) = sup_{‖f‖_H ≤ 1} { (1/n) ∑_{i=1}^{n} f(x_i) − (1/m) ∑_{j=1}^{m} f(y_j) }² ,

and give a formula for this quantity in terms of kernel evaluations only.
Remark: this is called the maximum mean discrepancy criterion, which
can be used for statistical testing (are S_1 and S_2 coming from the same
distribution?).
4. We consider X = Rd and the normalized Gaussian kernel with band-
width σ: K(x, y) = σ^{−d} exp( −‖x − y‖²/(2σ²) ). For any two sets S_1 and S_2,
show that MMD(S_1, S_2) is a decreasing function of σ.

Exercice 19. Sobolev spaces

1. Let

H = { f : [0, 1] → R , absolutely continuous, f′ ∈ L²([0, 1]), f(0) = 0 } ,

endowed with the bilinear form

∀f, g ∈ H ,  ⟨f, g⟩_H = ∫_0^1 f′(u) g′(u) du .

Show that H is an RKHS, and compute its reproducing kernel.

2. Same question when

H = { f : [0, 1] → R , absolutely continuous, f′ ∈ L²([0, 1]), f(0) = f(1) = 0 } .




3. Same question, when H is endowed with the bilinear form:

∀f, g ∈ H ,  ⟨f, g⟩_H = ∫_0^1 ( f(u) g(u) + f′(u) g′(u) ) du .

4. Same question when

H = { f : [0, 1] → R , f′ exists and is absolutely continuous, f″ ∈ L²([0, 1]), f(0) = f′(0) = 0 } ,

endowed with the bilinear form

∀f, g ∈ H ,  ⟨f, g⟩_H = ∫_0^1 f″(u) g″(u) du .

Exercice 20. Splines


Let H = C²([0, 1]) be the set of twice continuously differentiable functions
f : [0, 1] → R, and H_1 ⊂ H be the set of functions f ∈ H that satisfy
f(0) = f′(0) = 0.
1. Show that H_1 endowed with the norm:

‖f‖²_{H_1} = ∫_0^1 f″(t)² dt

is a reproducing kernel Hilbert space (RKHS), and compute the repro-
ducing kernel K_1.
2. Let H_2 be the set of affine functions f : [0, 1] → R (i.e., the functions
that can be written as f(x) = ax + b, with a, b ∈ R). Show that H_2
endowed with the norm:

‖f‖²_{H_2} = f(0)² + f′(0)²

is an RKHS and compute the corresponding kernel K_2.
3. Deduce that H endowed with the norm:

‖f‖²_H = ∫_0^1 f″(t)² dt + f(0)² + f′(0)²

is an RKHS and compute the reproducing kernel K.

4. Let 0 < x_1 < . . . < x_n < 1 and (y_1, . . . , y_n) ∈ Rn. In order to esti-
mate a regression function f : [0, 1] → R, we consider the following
optimization problem:

min_{f∈H}  (1/n) ∑_{i=1}^{n} (f(x_i) − y_i)² + λ ∫_0^1 f″(t)² dt .   (3)

Show that any solution of (3) can be expanded as:

f̂(x) = ∑_{i=1}^{n} α_i K_1(x_i, x) + β_1 x + β_2 ,

with α = (α_1, . . . , α_n)ᵀ ∈ Rn and β = (β_1, β_2)ᵀ ∈ R².

5. Let I be the n × n identity matrix, M be the square n × n matrix
defined by:

M_{i,j} = K_1(x_i, x_j)  if i ≠ j ,
M_{i,j} = K_1(x_i, x_j) + nλ  if i = j ,

T be the n × 2 matrix:

T = ( 1  x_1
      ⋮   ⋮
      1  x_n ) ,

and y = (y_1, . . . , y_n)ᵀ. Show that α and β satisfy:

Tᵀ α = 0 ,
M α + T β = y .

6. Deduce that α and β are given by:

α = M⁻¹ ( I − T (Tᵀ M⁻¹ T)⁻¹ Tᵀ M⁻¹ ) y ,
β = (Tᵀ M⁻¹ T)⁻¹ Tᵀ M⁻¹ y .
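A small numerical check of these expressions (Python/NumPy sketch, not part of the exercise) is given below; it assumes you plug in a vectorized implementation of the kernel K_1 computed in question 1, which is left as a placeholder argument.

```python
import numpy as np

def spline_coefficients(x, y, lam, K1):
    """Solve T'alpha = 0 and M alpha + T beta = y using the closed forms above.

    `K1(s, t)` is assumed to be your kernel from question 1, vectorized so that
    K1(x[:, None], x[None, :]) returns the n x n Gram matrix.
    """
    n = len(x)
    M = K1(x[:, None], x[None, :]) + n * lam * np.eye(n)
    T = np.column_stack([np.ones(n), x])
    Minv = np.linalg.inv(M)
    beta = np.linalg.solve(T.T @ Minv @ T, T.T @ Minv @ y)
    alpha = Minv @ (y - T @ beta)   # same as M^{-1}(I - T (T'M^{-1}T)^{-1} T'M^{-1}) y
    return alpha, beta
```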

7. Show that

• f̂ ∈ C²([0, 1]);
• f̂ is a polynomial of degree 3 on each interval [x_i, x_{i+1}] for i =
1, . . . , n − 1;
• f̂ is an affine function on both intervals [0, x_1] and [x_n, 1].

f̂ is called a spline.

Exercice 21. Duality


Let (x_1, y_1), . . . , (x_n, y_n) be a training set of examples where x_i ∈ X , a space
endowed with a positive definite kernel K, and y_i ∈ {−1, 1}, for i = 1, . . . , n.
H_K denotes the RKHS of the kernel K. We want to learn a function f :
X → R by solving the following optimization problem:

min_{f∈H_K}  (1/n) ∑_{i=1}^{n} ℓ_{y_i}(f(x_i))   such that ‖f‖_{H_K} ≤ B ,   (4)

where ℓ_y is a convex loss function (for y ∈ {−1, 1}) and B > 0 is a parameter.
1. Show that there exists λ ≥ 0 such that the solution to problem (4) can
be found by solving the following problem:

min_{α∈Rn}  R(Kα) + λ αᵀ K α ,   (5)

where K is the n × n Gram matrix and R : Rn → R should be made explicit.


2. Compute the Fenchel-Legendre transform³ R* of R in terms of the
Fenchel-Legendre transform ℓ*_y of ℓ_y.
3. Adding the slack variable u = Kα, problem (4) can be rewritten
as a constrained optimization problem:

min_{α∈Rn, u∈Rn}  R(u) + λ αᵀ K α   such that u = Kα .   (6)

Express the dual problem of (6) in terms of R*, and explain how a
solution to (6) can be found from a solution to the dual problem.
³ For any function f : R^N → R, the Fenchel-Legendre transform (or convex conjugate)
of f is the function f* : R^N → R defined by

f*(u) = sup_{x∈R^N} { xᵀ u − f(x) } .

4. Write explicitly the dual problem for the logistic and squared hinge losses:

ℓ_y(u) = log(1 + e^{−yu}) ,
ℓ_y(u) = max(0, 1 − yu)² .

Exercice 22. Bn -splines


The convolution between two functions f, g : R → R is defined by:

f ⋆ g(x) = ∫_{−∞}^{+∞} f(u) g(x − u) du ,

when this integral exists.


Now consider the function:

I(x) = 1 if −1 ≤ x ≤ 1,  and I(x) = 0 if x < −1 or x > 1,

and let B_n = I^{⋆n} for n ∈ N* (that is, the function I convolved n times with
itself: B_1 = I, B_2 = I ⋆ I, B_3 = I ⋆ I ⋆ I, etc.).
Is the function k(x, y) = B_n(x − y) a positive definite kernel over R × R?
If yes, describe the corresponding reproducing kernel Hilbert space.
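To build intuition before answering, the functions B_n can be approximated numerically by discretizing I on a grid and convolving repeatedly; a possible sketch (Python with NumPy and matplotlib, with our own choice of grid and step size) is below.

```python
import numpy as np
import matplotlib.pyplot as plt

dx = 1e-3
x = np.arange(-1, 1 + dx, dx)
I = np.ones_like(x)                       # I equals 1 on [-1, 1], 0 elsewhere

B = I.copy()                              # B_1 = I
for n in range(1, 4):
    grid = np.arange(-n, n + dx / 2, dx)[:len(B)]   # support of B_n is [-n, n]
    plt.plot(grid, B, label=f"B_{n}")
    B = np.convolve(B, I) * dx            # discrete approximation of B convolved with I
plt.legend()
plt.show()
```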

Exercice 23. Semigroup kernels

1. Are the following functions positive definite kernels?

∀x, y ∈ R,  K_2(x, y) = 1 / ( 2 − e^{−‖x−y‖²} )

∀x, y ∈ R,  K_3(x, y) = max(0, 1 − |x − y|)

2. For any n > 0, show that the n × n Hankel matrix A_{ij} = 1/(1 + i + j) is
positive semidefinite.

3. Describe the functions ϕ : [0, 1] → R such that:

K(x, y) = ϕ( max(x + y − 1, 0) )

is a positive definite kernel on [0, 1].

4. Can you describe the functions ϕ : R+ → R such that:

K(x, y) = ϕ (max(x, y))

is a positive definite kernel on R+ ?

Exercice 24. Gaussian RKHS


For any σ > 0, let K_σ be the normalized Gaussian kernel on Rd:

∀x, y ∈ Rd ,  K_σ(x, y) = ( 1/(√(2π) σ) )^d exp( −‖x − y‖²/(2σ²) ) ,

and let Hσ be its reproducing kernel Hilbert space (RKHS).


1. Recall a proof of the positive definiteness of K_σ.

2. For any 0 < σ < τ , show that

H_τ ⊂ H_σ ⊂ L²(Rd) .

3. For any 0 < σ < τ and f ∈ H_τ , show that

‖f‖_{H_τ} ≥ ‖f‖_{H_σ} ≥ ‖f‖_{L²(Rd)} ,

and that

0 ≤ ‖f‖²_{H_σ} − ‖f‖²_{L²(Rd)} ≤ (σ²/τ²) ( ‖f‖²_{H_τ} − ‖f‖²_{L²(Rd)} ) .

4. For any τ > 0 and f ∈ H_τ , show that

lim_{σ→0} ‖f‖_{H_σ} = ‖f‖_{L²(Rd)} .

Exercice 25. Kernel for sets


We wish to construct positive definite kernels for finite sets of points in the
interval [0, 1]. Let X = (x_1, . . . , x_n) and Y = (y_1, . . . , y_m) be two such sets, of
lengths n and m respectively.

1. Show that the following kernel is positive definite for any σ > 0:

K_1(X, Y) = ∑_{x∈X} ∑_{y∈Y} exp( −(x − y)²/(2σ²) ) .

2. To any finite set X of length n we associate the function g_X : R → R
defined by:

g_X(t) = (1/n) ∑_{x∈X} exp( −(x − t)²/(2σ²) ) .

Show that the following kernel is positive definite for any σ > 0:

K_2(X, Y) = ∫_R g_X(t) g_Y(t) dt .

Is there a simple relation between K_1(X, Y) and K_2(X, Y)?
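The two kernels can also be compared numerically before looking for the analytic relation; the sketch below (Python/NumPy, transcribing the formulas as printed above, with the integral in K_2 approximated by a Riemann sum on a grid) is one way to do it.

```python
import numpy as np

def K1(X, Y, sigma):
    # double sum of Gaussians between the two sets
    D = np.subtract.outer(np.asarray(X), np.asarray(Y))
    return np.exp(-D ** 2 / (2 * sigma ** 2)).sum()

def K2(X, Y, sigma, grid=np.linspace(-4.0, 5.0, 20001)):
    # g_X and g_Y evaluated on the grid, then the integral approximated numerically
    gX = np.exp(-np.subtract.outer(np.asarray(X), grid) ** 2 / (2 * sigma ** 2)).mean(axis=0)
    gY = np.exp(-np.subtract.outer(np.asarray(Y), grid) ** 2 / (2 * sigma ** 2)).mean(axis=0)
    return float(np.sum(gX * gY) * (grid[1] - grid[0]))

X, Y, sigma = [0.1, 0.4, 0.9], [0.2, 0.8], 0.3
print(K1(X, Y, sigma), K2(X, Y, sigma))
```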

3. Let P be a partition of [0, 1]. For any bin p ∈ P, let n_p(X) be the
number of points of X which are in p. Show that the following kernels
are positive definite:

K_3(X, Y) = ∑_{p∈P} min( n_p(X), n_p(Y) ) ,

K_4(X, Y) = ∏_{p∈P} min( n_p(X), n_p(Y) ) .

4. Let T_D be a complete binary tree of depth D, that is, a directed graph
such that, starting from the root, each node has two children, except the
nodes in the D-th generation which have no children (nodes with no
children are called leaves). The nodes of T_D are denoted s ∈ T_D. How
many nodes are there in T_D?

5. We denote by S(TD ) the set of connected subgraphs of TD which contain


the root and such that all their nodes have either 0 or 2 children. What
is the size of S(TD ) for D = 10?

6. For 0 < p < 1, we consider the following rule to generate randomly


a tree in S(TD ). We start at the root, and give it two children with
probability p, and no child with probability 1−p. If it has no child, then

the process stops and the tree generated is the root only. Otherwise,
the same rule is applied independently to both children, which have
themselves 0 or 2 children with probability 1 − p and p. The process is
repeated iteratively to all new children, until no more child is generated,
or until we reach the D-th generation where nodes have no children with
probability 1. For any T ∈ S(TD ) we denote by π(T ) the probability
of generating T by this process. For any real-valued function h defined
over the set of nodes s ∈ TD , propose a factorization to compute the
following sum efficiently:
∑_{T ∈ S(T_D)} π(T) ∏_{s ∈ leaves(T)} h(s) .

7. Suppose that each leaf s ∈ leaves(T_D) is associated to an interval p(s)
of [0, 1], these intervals together forming a partition. For any node s ∈ T_D we
denote by D(s) the set of leaves of T_D which are descendants of s, and
we associate to s the subset p(s) ⊂ [0, 1] defined by:

p(s) = ∪_{l ∈ D(s)} p(l) .

For any T ∈ S(T_D), show that the following function is a positive
definite kernel:

K_T(X, Y) = ∏_{s ∈ leaves(T)} min( n_{p(s)}(X), n_{p(s)}(Y) ) .

8. Show that the following function is a positive definite kernel and pro-
pose an efficient implementation to compute it:

K_5(X, Y) = ∑_{T ∈ S(T_D)} π(T) K_T(X, Y) .

Exercice 26. Rademacher complexity of MKL


Given a fixed sample of n points S = (x_1, . . . , x_n) in a space X , the empirical
Rademacher complexity of a set of functions F ⊂ R^X is:

R(F) = E[ sup_{f∈F} (1/n) ∑_{i=1}^{n} σ_i f(x_i) ] ,

where the expectation is taken over σ_i ∈ {−1, +1} for i = 1, . . . , n, which
are independent uniform Rademacher random variables. The following result
can be used without proof:
Lemma 1. For any n × n symmetric p.s.d. matrix K, and σ = (σ_1, . . . , σ_n)ᵀ
a vector of independent Rademacher random variables, the following holds:

∀r ∈ N* ,  E[ (σᵀ K σ)^r ] ≤ ( 2r · trace(K) )^r .

1. Let k be a p.d. kernel over X with RKHS H_k, K its Gram matrix on
S, and B(k, t) = {f ∈ H_k : ‖f‖_{H_k} ≤ t}. Show that

∀t > 0 ,  R( B(k, t) ) ≤ t √( trace(K) ) / n .

2. If, in addition, there exists M > 0 such that ∀x ∈ X , k(x, x) ≤ M²,
show that

∀t > 0 ,  R( B(k, t) ) ≤ t M / √n .
3. Let now k_1, . . . , k_p be p p.d. kernels on X , and k_η = ∑_{i=1}^{p} η_i k_i for any
η ∈ ∆ = { η ∈ Rp : ∀i = 1, . . . , p, η_i ≥ 0 and ∑_{i=1}^{p} η_i = 1 }. Show that
k_η is a p.d. kernel for any η ∈ ∆, and that, for any non-zero integer
r ∈ N*,

∀t > 0 ,  R( ∪_{η∈∆} B(k_η, t) ) ≤ t √(2r) ( ∑_{i=1}^{p} trace(K_i)^r )^{1/(2r)} / n .

4. If there exists M > 0 such that ∀i = 1, . . . , p, ∀x ∈ X , k_i(x, x) < M²,
show that

∀t > 0 ,  R( ∪_{η∈∆} B(k_η, t) ) ≤ t M √( 2e(ln p + 1) / n ) .

5. (Bonus) Prove Lemma 1.

Exercice 27. MKL on a DAG


Let V = (v1 , . . . , vM ) be the vertices of a directed acyclic graph (DAG). For

any v ∈ V , we denote by D(v) ⊂ V the set of descendants of v (including
itself), and let dv ≥ 0 be a weight associated to each vertex v. We assume
that to each vertex v ∈ V is associated a positive definite kernel Kv over a
space X .

1. Using the notations of the course (slide 159), show that the following
weighted MKL problem with the set of kernels {K_v : v ∈ V}:

min_{(f_{v_1},...,f_{v_M}) ∈ H_{K_{v_1}} × ... × H_{K_{v_M}}}  R^n( ∑_{v∈V} f_v ) + λ ( ∑_{v∈V} d_v ‖f_v‖_{H_{K_v}} )²

is equivalent to solving:

min_{η∈Σ} min_{f∈H_{K_η}}  { R^n(f) + λ ‖f‖²_{H_{K_η}} }

for some set Σ to be determined.

2. We now consider the following variant of MKL which takes the graph
structure into account:

min_{(f_{v_1},...,f_{v_M}) ∈ H_{K_{v_1}} × ... × H_{K_{v_M}}}  R^n( ∑_{v∈V} f_v ) + λ ( ∑_{v∈V} d_v ( ∑_{w∈D(v)} ‖f_w‖²_{H_{K_w}} )^{1/2} )²   (7)

Can you intuitively explain why we may want to do this, and what we
can expect from the solution of this formulation?

3. Show that the MKL formulation (7) is equivalent to solving:

min_{η∈Σ_V} min_{f∈H_{K_η}}  { R^n(f) + λ ‖f‖²_{H_{K_η}} }

for some set Σ_V to be determined.

4. Show that if the DAG is a tree, then ΣV is convex. Is it also convex


for a general DAG?

Exercice 28. Properties of the dot-product kernel
Consider the dot-product kernel on the sphere K_1 : S^{p−1} × S^{p−1} → R such
that for all pairs of points x, x′ in S^{p−1} (the unit sphere of Rp),

K_1(x, x′) = κ( ⟨x, x′⟩ ) ,

where κ : [−1, 1] → R is an infinitely differentiable function that admits a
polynomial expansion on [−1, 1]:

κ(u) = ∑_{i=0}^{+∞} a_i u^i ,   (8)

where the a_i's are real coefficients and the sum above always converges.
1. Show that if all coefficients a_i are non-negative and κ ≠ 0, then K_1 is
p.d.
2. If K_1 is p.d., show that the homogeneous dot-product kernel K_2 : Rp ×
Rp → R is also p.d.:

K_2(x, x′) = ‖x‖ ‖x′‖ κ( ⟨x, x′⟩ / (‖x‖ ‖x′‖) )  if x ≠ 0 and x′ ≠ 0 ,
K_2(x, x′) = 0  otherwise.

Remark: it is in fact possible to show that all coefficients a_i need to be
non-negative for the positive definiteness to hold for all dimensions p,
but we do not ask for a proof of this result, which is due to Schoenberg,
1942.
3. Assume that all coefficients a_i are non-negative (K_1 is thus p.d.) and
that κ(1) = κ′(1) = 1. Let H be the RKHS of K_1 and consider its
RKHS mapping ϕ : S^{p−1} → H such that K_1(x, x′) = ⟨ϕ(x), ϕ(x′)⟩_H for
all x, x′ in S^{p−1}. Show that:

∀x, x′ ∈ S^{p−1} ,  ‖ϕ(x) − ϕ(x′)‖_H ≤ ‖x − x′‖ .

4. Find an explicit feature map ψ : S^{p−1} → ℓ_2, where ℓ_2 is the Hilbert
space of real-valued square-summable sequences (see definition on slide 240),
such that for all x, y in S^{p−1},

K_1(x, y) = ⟨ψ(x), ψ(y)⟩_{ℓ_2} .

Hint: remember that ⟨x, y⟩² = ⟨xxᵀ, yyᵀ⟩_F, where ⟨·, ·⟩_F is the Frobenius
inner product. You may want to use the tensor product notation x^{⊗2} =
xxᵀ and its generalization to degrees higher than 2.

5. Let us assume that you have found an explicit feature map ψ in the
previous question. Remember from one of our previous homeworks that
the RKHS H of K1 can be characterized by

H = { f_w : w ∈ ℓ_2 }  where  f_w : x ↦ ⟨w, ψ(x)⟩_{ℓ_2} ,

with

‖f_w‖²_H = inf_{w′∈ℓ_2} { ‖w′‖²_{ℓ_2} : f_w = f_{w′} } .

Consider then a function g_z : S^{p−1} → R of the form

g_z : x ↦ σ( ⟨z, x⟩ )

with z in S^{p−1}, where σ admits a polynomial expansion σ(u) = ∑_{i=0}^{+∞} b_i u^i.
Could you find a sufficient condition on z and on the coefficients b_i
for g_z to be in H?
Remark: g_z can be interpreted as a one-layer neural network function.
We could ask you to do the same analysis for the homogeneous kernel
K_2, but this would be unnecessarily technical for this homework, which
is already too long. This being said, if you found it too short, we're
happy to see your analysis of K_2 and the type of functions g_z you will
consider.

