Kernel Trick Exercises
Julien Mairal and Jean-Philippe Vert
Exercise 1. Kernels
Study whether the following kernels are positive definite:
1. X = (−1, 1), K(x, x′) = 1/(1 − xx′)
2. X = N, K(x, x′) = 2^{x+x′}
3. X = N, K(x, x′) = 2^{xx′}
4. X = R+, K(x, x′) = log(1 + xx′)
5. X = R, K(x, x′) = exp(−|x − x′|²)
6. X = R, K(x, x′) = cos(x + x′)
7. X = R, K(x, x′) = cos(x − x′)
8. X = R+, K(x, x′) = min(x, x′)
9. X = R+, K(x, x′) = max(x, x′)
10. X = R+, K(x, x′) = min(x, x′)/max(x, x′)
11. X = N, K(x, x′) = GCD(x, x′)
12. X = N, K(x, x′) = LCM(x, x′)
13. X = N, K(x, x′) = GCD(x, x′)/LCM(x, x′)
14. Given a probability space (Ω, A, P), on X = A:
∀A, B ∈ A, K(A, B) = P(A ∩ B) − P(A)P(B).
15. Let X be a set and f, g : X → R+ two non-negative functions:
∀A, B ⊂ E, K(A, B) = |A ∩ B| / |A ∪ B|,
where |F| denotes the cardinality of F, and with the convention 0/0 = 0.
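Before attempting formal proofs, it can help to probe a candidate kernel numerically: sample a few points, build the Gram matrix, and check whether its smallest eigenvalue is (numerically) non-negative. The sketch below does this for kernel 8, K(x, x′) = min(x, x′) on R+; the sampling scheme and sample size are arbitrary choices, and a passing check is evidence, not a proof.

```python
import numpy as np

def min_kernel(x, y):
    # Candidate kernel 8: K(x, x') = min(x, x') on R+.
    return np.minimum(x, y)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=50)          # random points in R+
gram = min_kernel(x[:, None], x[None, :])    # 50 x 50 Gram matrix
eigvals = np.linalg.eigvalsh(gram)           # symmetric matrix, so eigvalsh

# A clearly negative eigenvalue (beyond round-off) disproves positive definiteness;
# all-non-negative eigenvalues only support the conjecture.
print("smallest eigenvalue:", eigvals.min())
```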
Exercise 5. RKHS
In the course we have shown that H0 endowed with this inner product is a
pre-Hilbert space. Let us now show how to finish the construction of the
RKHS from H0.
1. Show that any Cauchy sequence (f_n) in H0 converges pointwise to a function f : X → R defined by f(x) = lim_{n→+∞} f_n(x).
2. Show that any Cauchy sequence (f_n)_{n∈N} in H0 which converges pointwise to 0 satisfies:
lim_{n→+∞} ‖f_n‖_{H0} = 0.
4. If (f_n) and (g_n) are two Cauchy sequences in H0 which converge pointwise to two functions f and g ∈ H, show that the inner product ⟨f_n, g_n⟩_{H0} converges to a number which only depends on f and g. This allows us to formally define the operation:
⟨f, g⟩_H := lim_{n→+∞} ⟨f_n, g_n⟩_{H0}.
A kernel k is said to be conditionally positive definite (c.p.d.) if and only if it is symmetric and satisfies:
∑_{i,j=1}^{n} a_i a_j k(x_i, x_j) ≥ 0
for any n ∈ N, x_1, x_2, ..., x_n ∈ X and a_1, a_2, ..., a_n ∈ R with ∑_{i=1}^{n} a_i = 0.
6. Show that if k is c.p.d., then the function exp(tk(x, y)) is p.d. for all t ≥ 0.
7. Conversely, show that if the function exp(tk(x, y)) is p.d. for any t ≥ 0,
then k is c.p.d.
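To see the statement of questions 6 and 7 in action, one can take the classical c.p.d. kernel k(x, y) = −(x − y)² on R (so that exp(tk) is a Gaussian kernel) and check numerically that the Gram matrices of exp(tk) stay positive semi-definite for several values of t ≥ 0. This only illustrates the claim, it proves nothing; the data and the values of t below are arbitrary choices.

```python
import numpy as np

def cpd_kernel(x, y):
    # k(x, y) = -(x - y)^2 is a classical example of a c.p.d. (but not p.d.) kernel on R.
    return -(x - y) ** 2

rng = np.random.default_rng(1)
x = rng.normal(size=40)
k = cpd_kernel(x[:, None], x[None, :])

for t in [0.1, 1.0, 10.0]:
    gram = np.exp(t * k)                      # here this is a Gaussian Gram matrix
    print(t, np.linalg.eigvalsh(gram).min())  # should be >= 0 up to round-off
```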
Exercise 9. COCO
Given two sets of real numbers X = (x_1, ..., x_n) ∈ R^n and Y = (y_1, ..., y_n) ∈ R^n, the covariance between X and Y is defined as
cov_n(X, Y) = E_n(XY) − E_n(X) E_n(Y),
where E_n(U) = (∑_{i=1}^{n} u_i)/n. The covariance is useful to detect linear relationships between X and Y. In order to extend this measure to potential nonlinear relationships between X and Y, we consider the following criterion:
C_n^K(X, Y) = max_{f,g ∈ B_K} cov_n(f(X), g(Y)),
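The criterion can be probed numerically before doing the analytical maximization, which is the object of the exercise. The sketch below assumes that B_K denotes the unit ball of the RKHS of a Gaussian kernel (an assumption about the course notation), picks arbitrary functions f and g in the span of the kernel evaluated at the data, rescales them to unit RKHS norm, and evaluates cov_n(f(X), g(Y)); this gives a feasible value, hence a lower bound on C_n^K(X, Y).

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(-1, 1, size=n)
y = x ** 2 + 0.05 * rng.normal(size=n)     # nonlinear dependence, small linear covariance

def cov_n(u, v):
    return np.mean(u * v) - np.mean(u) * np.mean(v)

# f = sum_i alpha_i K(x_i, .) and g = sum_i beta_i K(y_i, .), rescaled to unit RKHS norm,
# so that (f, g) is a feasible point of the maximization over B_K.
Kx, Ky = gaussian_gram(x, x), gaussian_gram(y, y)
alpha = rng.normal(size=n)
beta = rng.normal(size=n)
alpha /= np.sqrt(alpha @ Kx @ alpha)
beta /= np.sqrt(beta @ Ky @ beta)

print("linear covariance :", cov_n(x, y))
print("cov_n(f(X), g(Y)) :", cov_n(Kx @ alpha, Ky @ beta))
```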
1. For x ∈ X, let
Ψ(x) = P_d(Φ(x) − m) + m,
where P_d is the projection onto the linear span of the first d kernel principal components of S. Show that Ψ(x) can be expressed as
Ψ(x) = ∑_{i=1}^{n} γ_i Φ(x_i),
2. For y ∈ X , express
1. Starting from an initial assignment z^0, we can try to minimize C(z, µ) by alternately minimizing it with respect to µ for fixed z, and with respect to z for fixed µ. Explain how both minimizations can be carried out (note: this method is called k-means; see the sketch after question 4).
4. Let H = ZL^{1/2}. What can we say about H^⊤H? Do you see a connection between kernel k-means and kernel PCA? Propose an algorithm to estimate Z from the solution of kernel PCA.
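The alternating scheme of question 1 takes only a few lines. The sketch below uses plain (non-kernelized) k-means with the standard objective C(z, µ) = ∑_i ‖x_i − µ_{z_i}‖², which is an assumption about the notation of the exercise, not a statement of it; the kernelized variant replaces squared Euclidean distances with distances computed from the Gram matrix.

```python
import numpy as np

def kmeans(x, k, n_iter=20, seed=0):
    """Alternate the two minimizations of C(z, mu): assignments z, then centroids mu."""
    rng = np.random.default_rng(seed)
    mu = x[rng.choice(len(x), size=k, replace=False)].copy()   # initial centroids
    for _ in range(n_iter):
        # (a) z-step: with mu fixed, assign each point to its closest centroid
        dists = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # (b) mu-step: with z fixed, each centroid is the mean of its cluster
        for j in range(k):
            if np.any(z == j):
                mu[j] = x[z == j].mean(axis=0)
    return z, mu

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 4.0])
z, mu = kmeans(x, k=2)
print("centroids:\n", mu)
```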
Given two sets of points S_1 = {x^1_1, ..., x^1_{n_1}} and S_2 = {x^2_1, ..., x^2_{n_2}} in R^p, let us denote by m_i = (1/n_i) ∑_{j=1}^{n_i} x^i_j the mean of class i, and by S_B and S_W the between and within class scatter matrices, respectively. LDA constructs the function
f_w(x) = w^⊤ x,
where w is the vector which maximizes
J(w) = (w^⊤ S_B w) / (w^⊤ S_W w).
1. Why does it make sense to maximize J(w)? What do we expect to find? (You can take as an example the case where the two sets S_1 and S_2 form two clusters, e.g., two Gaussians.)
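A small numerical experiment can make question 1 concrete. The sketch below builds two Gaussian clusters, uses the standard unnormalized two-class scatter matrices (the normalization convention may differ from the one used in the course), and compares J(w) for the classical Fisher direction S_W^{-1}(m_1 − m_2) with J(w) for a random direction.

```python
import numpy as np

rng = np.random.default_rng(3)
# Two Gaussian clusters in R^2, as suggested in question 1.
s1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
s2 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(120, 2))

m1, m2 = s1.mean(axis=0), s2.mean(axis=0)
# Standard (unnormalized) two-class scatter matrices.
s_b = np.outer(m1 - m2, m1 - m2)
s_w = (s1 - m1).T @ (s1 - m1) + (s2 - m2).T @ (s2 - m2)

def J(w):
    return (w @ s_b @ w) / (w @ s_w @ w)

w_fisher = np.linalg.solve(s_w, m1 - m2)   # the classical Fisher direction
w_random = rng.normal(size=2)
print("J(Fisher direction):", J(w_fisher))
print("J(random direction):", J(w_random))
```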
2. Let K be a positive definite kernel on a space X, let H_K denote the associated reproducing kernel Hilbert space, and let B_R = {f ∈ H_K : ‖f‖_{H_K} ≤ R}. Let S = (x_1, x_2, ..., x_N) be a set of points with x_i ∈ X (i = 1, ..., N), and let σ_1, σ_2, ..., σ_N be N independent Rademacher variables. Show that:
E sup_{f∈B_R} ∑_{i=1}^{N} σ_i f(x_i) ≤ R √(∑_{i=1}^{N} K(x_i, x_i)).
(A numerical illustration of this bound is sketched after question 3.)
3. Under the hypotheses of questions 2.1 and 2.2, show that there exists a constant C, to be determined, such that if X is a random variable with values in X, then:
∀f ∈ B_R, E[ψ(f, X)²] ≤ C E[ψ(f, X)].
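The inequality of question 2 can be probed numerically. Computing the supremum over B_R exactly is part of the exercise, so the sketch below only builds a crude Monte-Carlo lower bound on it (random functions in the span of the K(x_i, ·), rescaled to RKHS norm R) and checks that this estimate stays below the right-hand side; the Gaussian kernel, R, and all sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
N, R = 30, 2.0
x = rng.uniform(-1, 1, size=N)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)      # Gaussian Gram matrix, so K(x_i, x_i) = 1

def sup_lower_bound(sigma, n_candidates=500):
    # Crude Monte-Carlo lower bound on sup_{f in B_R} sum_i sigma_i f(x_i):
    # draw functions f = sum_j alpha_j K(x_j, .) and rescale them to RKHS norm R.
    best = -np.inf
    for _ in range(n_candidates):
        alpha = rng.normal(size=N)
        alpha *= R / np.sqrt(alpha @ K @ alpha)  # now ||f||_{H_K} = R
        best = max(best, sigma @ (K @ alpha))    # f(x_i) = (K alpha)_i
    return best

estimates = [sup_lower_bound(rng.choice([-1.0, 1.0], size=N)) for _ in range(30)]
print("Monte-Carlo estimate of E sup   :", np.mean(estimates))
print("bound R * sqrt(sum_i K(x_i,x_i)):", R * np.sqrt(np.trace(K)))
```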
Exercise 16. Dual coordinate ascent algorithms for SVMs
1. We recall the primal formulation of SVMs seen in class (slide 142):
min_{f∈H} (1/n) ∑_{i=1}^{n} max(0, 1 − y_i f(x_i)) + λ ‖f‖²_H,
Can we still apply the representer theorem? Why? Derive the corre-
sponding dual formulation by using Lagrangian duality. Can we apply
the coordinate ascent method to this dual? If yes, what are the update
rules?
3. Consider a coordinate ascent method applied to this dual that consists of updating two variables (α_i, α_j) at a time (while keeping the other n − 2 variables fixed). What are the update rules for these two variables?
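For concreteness, the sketch below runs single-coordinate dual ascent on one common parametrization of the SVM dual without offset, max_α ∑_i α_i − (1/2) α^⊤Qα with Q_ij = y_i y_j K(x_i, x_j) and box constraints 0 ≤ α_i ≤ C. This parametrization and the constant C are assumptions and may differ from the dual you are asked to derive; each coordinate update is simply a clipped one-dimensional maximization.

```python
import numpy as np

def dual_coordinate_ascent(K, y, C=1.0, n_epochs=20):
    """Coordinate ascent on:  max_alpha sum_i alpha_i - 0.5 * alpha^T Q alpha,
    subject to 0 <= alpha_i <= C, with Q_ij = y_i y_j K_ij (no offset term)."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * K
    alpha = np.zeros(n)
    for _ in range(n_epochs):
        for i in range(n):
            grad_i = 1.0 - Q[i] @ alpha                       # partial derivative in alpha_i
            alpha[i] = np.clip(alpha[i] + grad_i / Q[i, i], 0.0, C)
    return alpha

# Tiny illustration on separable 1-D data with a linear kernel.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K = np.outer(x, x)
alpha = dual_coordinate_ascent(K, y)
print("alpha:", alpha)
print("predicted signs:", np.sign(K @ (alpha * y)))
```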
where ‖f‖ is the norm of f in the RKHS H_K of the kernel K, and L is the squared hinge loss function max(0, 1 − y f(x))².
Write the primal and dual problems associated with the 2-SVM, and compare the result with the SVM studied in the course.
1. Show that µ(P) is in H and that E_{X∼P}[f(X)] = ⟨f, µ(P)⟩_H for any f ∈ H.
Remark: If P and Q are two Borel probability measures, then
where E_S is the expectation obtained by randomizing over the training set (each x_i is a random variable distributed according to P). Remember that you are allowed to (and you should!) use any existing result from the slides.
3. Consider the quantity
and give a formula for this quantity in terms of kernel evaluations only.
Remark: this is called the maximum mean discrepancy criterion, which
can be used for statistical testing (are S1 and S2 coming from the same
distribution?).
4. We consider X = R^d and the normalized Gaussian kernel with bandwidth σ: K(x, y) = σ^{−d} exp(−‖x − y‖²/(2σ²)). For any two sets S_1 and S_2, show that MMD(S_1, S_2) is a decreasing function of σ.
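The behaviour claimed in question 4 can be observed numerically. The sketch below evaluates the standard plug-in expression of the MMD written with kernel evaluations only (which is essentially what question 3 asks you to derive) for several bandwidths; the data and the grid of σ values are arbitrary choices.

```python
import numpy as np

def normalized_gaussian_gram(a, b, sigma):
    d = a.shape[1]
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return sigma ** (-d) * np.exp(-sq / (2 * sigma ** 2))

def mmd(s1, s2, sigma):
    # Plug-in estimator written with kernel evaluations only.
    k11 = normalized_gaussian_gram(s1, s1, sigma).mean()
    k22 = normalized_gaussian_gram(s2, s2, sigma).mean()
    k12 = normalized_gaussian_gram(s1, s2, sigma).mean()
    return np.sqrt(max(k11 + k22 - 2 * k12, 0.0))

rng = np.random.default_rng(5)
s1 = rng.normal(0.0, 1.0, size=(200, 1))
s2 = rng.normal(0.5, 1.0, size=(200, 1))
for sigma in [0.1, 0.5, 1.0, 2.0, 5.0]:
    print(f"sigma = {sigma:4.1f}   MMD = {mmd(s1, s2, sigma):.4f}")
```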
1. Let
3. Same question, when H is endowed with the bilinear form:
∀f, g ∈ H, ⟨f, g⟩_H = ∫_0^1 (f(u)g(u) + f′(u)g′(u)) du.
4. Let 0 < x_1 < ... < x_n < 1 and (y_1, ..., y_n) ∈ R^n. In order to estimate a regression function f : [0, 1] → R, we consider the following optimization problem:
min_{f∈H} (1/n) ∑_{i=1}^{n} (f(x_i) − y_i)² + λ ∫_0^1 f″(t)² dt.   (3)
7. Show that
• f̂ ∈ C²([0, 1]);
• f̂ is a polynomial of degree 3 on each interval [x_i, x_{i+1}] for i = 1, ..., n − 1;
• f̂ is an affine function on both intervals [0, x_1] and [x_n, 1].
f̂ is called a spline.
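Problem (3) is the classical smoothing-spline problem, and the structure described in question 7 (piecewise cubic, C², affine outside [x_1, x_n]) is that of a natural cubic spline. As an illustration only, and assuming SciPy ≥ 1.10 where make_smoothing_spline implements this penalty (up to the 1/n factor on the data-fitting term), one can fit and inspect such an estimator:

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline  # SciPy >= 1.10

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0.05, 0.95, size=30))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

# make_smoothing_spline minimizes  sum_i (y_i - f(x_i))^2 + lam * int f''(t)^2 dt,
# i.e. problem (3) up to the 1/n factor on the data term.
f_hat = make_smoothing_spline(x, y, lam=1e-4)

t = np.linspace(0.0, 1.0, 200)
print(np.c_[t[:5], f_hat(t[:5])])   # the fitted spline evaluated on a grid
```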
where ℓ_y is a convex loss function (for y ∈ {−1, 1}) and B > 0 is a parameter.
1. Show that there exists λ ≥ 0 such that the solution to problem (7) can be found by solving the following problem:
min_{α∈R^n} R(Kα) + λ α^⊤ K α,   (5)
4. Make the dual problem explicit for the logistic and squared hinge losses:
ℓ_y(u) = log(1 + e^{−yu}),
ℓ_y(u) = max(0, 1 − yu)².
4. Can you describe the functions ϕ : R+ → R such that:
H_τ ⊂ H_σ ⊂ L²(R^d),
and that
0 ≤ ‖f‖²_{H_σ} − ‖f‖²_{L²(R^d)} ≤ (σ²/τ²) (‖f‖²_{H_τ} − ‖f‖²_{L²(R^d)}).
1. Show that the following kernel is positive definite for any σ > 0:
K_1(X, Y) = ∑_{x∈X} ∑_{y∈Y} exp(−(x − y)²/(2σ²)).
3. Let P be a partition of [0, 1]. For any bin p ∈ P, let n_p(X) be the number of points of X which are in p. Show that the following kernels are positive definite:
K_3(X, Y) = ∑_{p∈P} min(n_p(X), n_p(Y)),
K_4(X, Y) = ∏_{p∈P} min(n_p(X), n_p(Y)).
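These set kernels are easy to evaluate in practice, which can help build intuition before proving positive definiteness. The sketch below computes K_1 and K_3 for two random point sets in [0, 1]; the bandwidth and the regular 10-bin partition are arbitrary choices.

```python
import numpy as np

def k1(X, Y, sigma=0.1):
    # Sum of Gaussian evaluations over all pairs (x, y), x in X, y in Y.
    X, Y = np.asarray(X), np.asarray(Y)
    return np.exp(-(X[:, None] - Y[None, :]) ** 2 / (2 * sigma ** 2)).sum()

def k3(X, Y, n_bins=10):
    # Histogram-intersection style kernel over a regular partition of [0, 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    nX, _ = np.histogram(X, bins=edges)
    nY, _ = np.histogram(Y, bins=edges)
    return np.minimum(nX, nY).sum()

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=15)
Y = rng.uniform(0, 1, size=20)
print("K1(X, Y) =", k1(X, Y))
print("K3(X, Y) =", k3(X, Y))
```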
the process stops and the tree generated is the root only. Otherwise, the same rule is applied independently to both children, which themselves have 0 or 2 children with probabilities 1 − p and p. The process is repeated iteratively for all new children, until no more children are generated, or until we reach the D-th generation, where nodes have no children with probability 1. For any T ∈ S(T_D) we denote by π(T) the probability of generating T by this process. For any real-valued function h defined over the set of nodes s ∈ T_D, propose a factorization to compute the following sum efficiently:
∑_{T∈S(T_D)} π(T) ∏_{s∈leaves(T)} h(s).
8. Show that the following function is a positive definite kernel and propose an efficient implementation to compute it:
K_5(X, Y) = ∑_{T∈S(T_D)} π(T) K_T(X, Y).
where the expectation is taken over σi ∈ {−1, +1} for i = 1, . . . , n, which
are independent uniform Rademacher random variables. The following result
can be used without proof:
Lemma 1. For any n×n symmetric p.s.d. matrix K, and σ = (σ_1, ..., σ_n)^⊤ a vector of independent Rademacher random variables, the following holds:
∀r ∈ N*, E[(σ^⊤ K σ)^r] ≤ (2r trace(K))^r.
For any v ∈ V, we denote by D(v) ⊂ V the set of descendants of v (including
itself), and let dv ≥ 0 be a weight associated to each vertex v. We assume
that to each vertex v ∈ V is associated a positive definite kernel Kv over a
space X .
1. Using the notations of the course (slide 159), show that solving the following weighted MKL problem with the set of kernels {K_v : v ∈ V}:
min_{(f_{v_1},...,f_{v_M}) ∈ H_{K_{v_1}} × ... × H_{K_{v_M}}} R^n(∑_{v∈V} f_v) + λ (∑_{v∈V} d_v ‖f_v‖_{H_{K_v}})²
is equivalent to solving:
min_{η∈Σ} min_{f∈H_{K_η}} { R^n(f) + λ ‖f‖²_{H_{K_η}} }.
2. We now consider the following variant of MKL which takes the graph structure into account:
min_{(f_{v_1},...,f_{v_M}) ∈ H_{K_{v_1}} × ... × H_{K_{v_M}}} R^n(∑_{v∈V} f_v) + λ (∑_{v∈V} d_v (∑_{w∈D(v)} ‖f_w‖²_{H_{K_w}})^{1/2})².   (7)
Can you intuitively explain why we may want to do this, and what we can expect from the solution of this formulation?
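In both formulations the inner problem is a single-kernel learning problem with a combined kernel K_η indexed by weights η in the simplex Σ. The exact parametrization of K_η comes from slide 159 and is not reproduced here; the sketch below only illustrates the generic fact this relies on, namely that a convex combination of Gram matrices of p.d. kernels is again positive semi-definite (the linear and Gaussian base kernels are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=(30, 3))

# Two base Gram matrices on the same points (linear and Gaussian kernels).
k_lin = x @ x.T
sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
k_rbf = np.exp(-sq / 2.0)

# A convex combination of p.d. kernels: eta lives in the simplex Sigma.
eta = np.array([0.3, 0.7])
k_eta = eta[0] * k_lin + eta[1] * k_rbf

print("smallest eigenvalue of K_eta:", np.linalg.eigvalsh(k_eta).min())  # stays >= 0
```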
Exercise 28. Properties of the dot-product kernel
Consider the dot-product kernel on the sphere K_1 : S^{p−1} × S^{p−1} → R such that for every pair of points x, x′ in S^{p−1} (the unit sphere of R^p),
K_1(x, x′) = κ(⟨x, x′⟩),
where κ : [−1, 1] → R is an infinitely differentiable function that admits a polynomial expansion on [−1, 1]:
κ(u) = ∑_{i=0}^{+∞} a_i u^i,   (8)
where the a_i's are real coefficients and the sum above always converges.
1. Show that if all coefficients a_i are non-negative and κ ≠ 0, then K_1 is p.d.
2. If K_1 is p.d., show that the homogeneous dot-product kernel K_2 : R^p × R^p → R is also p.d., where
K_2(x, x′) = ‖x‖ ‖x′‖ κ(⟨x, x′⟩ / (‖x‖ ‖x′‖))   if ‖x‖ ≠ 0 and ‖x′‖ ≠ 0,
K_2(x, x′) = 0   otherwise.
Remark: it is in fact possible to show that all coefficients a_i need to be non-negative for the positive definiteness to hold in every dimension p, but we do not ask for a proof of this result, which is due to Schoenberg, 1942.
3. Assume that all coefficients a_i are non-negative (K_1 is thus p.d.) and that κ(1) = κ′(1) = 1. Let H be the RKHS of K_1 and consider its RKHS mapping ϕ : S^{p−1} → H such that K_1(x, x′) = ⟨ϕ(x), ϕ(x′)⟩_H for all x, x′ in S^{p−1}. Show that:
∀x, x′ ∈ S^{p−1}, ‖ϕ(x) − ϕ(x′)‖_H ≤ ‖x − x′‖.
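As a concrete instance of questions 1 and 3, κ(u) = exp(u − 1) has the non-negative expansion coefficients e^{−1}/i! and satisfies κ(1) = κ′(1) = 1, so it can be used to check numerically both the positive definiteness of K_1 and the contraction inequality of question 3 (using the standard identity ‖ϕ(x) − ϕ(x′)‖²_H = K_1(x, x) + K_1(x′, x′) − 2K_1(x, x′)). This illustrates the statements, it does not prove them.

```python
import numpy as np

# kappa(u) = exp(u - 1) has expansion coefficients e^{-1}/i! >= 0
# and satisfies kappa(1) = kappa'(1) = 1, as required in question 3.
kappa = lambda u: np.exp(u - 1.0)

rng = np.random.default_rng(9)
p, n = 5, 40
x = rng.normal(size=(n, p))
x /= np.linalg.norm(x, axis=1, keepdims=True)      # points on the unit sphere S^{p-1}

gram = kappa(x @ x.T)                              # K1(x_i, x_j) = kappa(<x_i, x_j>)
print("smallest eigenvalue of the Gram matrix:", np.linalg.eigvalsh(gram).min())

# ||phi(x) - phi(x')||_H vs ||x - x'|| for one pair of points.
i, j = 0, 1
lhs = np.sqrt(gram[i, i] + gram[j, j] - 2 * gram[i, j])
rhs = np.linalg.norm(x[i] - x[j])
print("RKHS distance:", lhs, "<= Euclidean distance:", rhs)
```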
5. Let us assume that you have found an explicit feature map ψ in the
previous question. Remember from one of our previous homeworks that
the RKHS H of K1 can be characterized by
with
‖f_w‖²_H = inf_{w′∈ℓ²} { ‖w′‖²_{ℓ²} : f_w = f_{w′} }.
g_z : x ↦ σ(⟨z, x⟩)