
(chapters 1,2,3,4)

Introduction to Kernels

Max Welling
October 1 2004

Introduction
• What is the goal of (pick your favorite name):
- Machine Learning
- Data Mining
- Pattern Recognition
- Data Analysis
- Statistics

Automatic detection of non-coincidental structure in data.

• Desiderata:
- Robust algorithms insensitive to outliers and wrong
model assumptions.
- Stable algorithms: generalize well to unseen data.
- Computationally efficient algorithms: large datasets.
Let’s Learn Something
What is the common characteristic (structure) shared by the following
statistical methods?

1. Principal Components Analysis


2. Ridge regression
3. Fisher discriminant analysis
4. Canonical correlation analysis

Answer:
We consider linear combinations of the input vector: $f(x) = w^T x$

Linear algorithms are very well understood and enjoy strong guarantees
(convexity, generalization bounds).
Can we carry these guarantees over to non-linear algorithms?
Feature Spaces

$\Phi : x \mapsto \Phi(x), \qquad \mathbb{R}^d \to F$

non-linear mapping to $F$:
1. high-dimensional space
2. infinite-dimensional countable space: $L_2$
3. function space (Hilbert space)

example: $\Phi(x, y) = (x^2,\, y^2,\, \sqrt{2}\, x y)$
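A quick numerical check of this example (a minimal sketch, not part of the slides): the inner product of the mapped points equals the squared inner product of the originals, i.e. $\langle \Phi(a), \Phi(b) \rangle = \langle a, b \rangle^2$.

```python
import numpy as np

def phi(v):
    """Explicit quadratic feature map (x, y) -> (x^2, y^2, sqrt(2)*x*y)."""
    x, y = v
    return np.array([x**2, y**2, np.sqrt(2) * x * y])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# Inner product in feature space equals the squared inner product in input space.
print(np.dot(phi(a), phi(b)))   # 1.0
print(np.dot(a, b) ** 2)        # 1.0
```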
Ridge Regression (duality)

problem:   $\min_w \; \sum_{i=1}^{\ell} (y_i - w^T x_i)^2 + \lambda \|w\|^2$
           (targets $y_i$, inputs $x_i$, regularization $\lambda$)

solution:  $w = (X^T X + \lambda I_d)^{-1} X^T y$          ($d \times d$ inverse)

           $\;\;\; = X^T (X X^T + \lambda I_\ell)^{-1} y$   ($\ell \times \ell$ inverse)

           $\;\;\; = X^T (G + \lambda I_\ell)^{-1} y$,  $\quad G_{ij} = \langle x_i, x_j \rangle$   (Gram matrix)

           $\;\;\; = \sum_{i=1}^{\ell} \alpha_i x_i$         (linear combination of the data: the Dual Representation)
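The equivalence of the two solutions is easy to verify numerically. A minimal NumPy sketch on synthetic data (not from the slides): the dual form touches the data only through the Gram matrix $G = X X^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1            # n samples, d features, regularization
X = rng.standard_normal((n, d))   # rows are the inputs x_i
y = rng.standard_normal(n)

# Primal solution: (d x d) inverse.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual solution: (n x n) inverse of the Gram matrix G = X X^T.
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(n), y)
w_dual = X.T @ alpha              # w = sum_i alpha_i x_i

print(np.allclose(w_primal, w_dual))  # True
```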


Kernel Trick
Note: In the dual representation we used the Gram matrix
to express the solution.

Kernel Trick:

Replace $x \to \Phi(x)$, so that

$G_{ij} = \langle x_i, x_j \rangle \;\;\longrightarrow\;\; G_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle = K(x_i, x_j)$   (the kernel)

If we use algorithms that only depend on the Gram matrix, G,
then we never have to know (compute) the actual features $\Phi$.

This is the crucial point of kernel methods.
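Once an algorithm is written in terms of G, kernelizing it is just a matter of swapping in K. A minimal sketch (not from the slides) of the resulting nonlinear regressor, fitting noisy sine data with an RBF kernel; all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

def rbf_kernel(A, B, c=1.0):
    """K_ij = exp(-||a_i - b_j||^2 / c)."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / c)

lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)    # dual coefficients

X_test = np.linspace(-3, 3, 5)[:, None]
f_test = rbf_kernel(X_test, X) @ alpha                   # f(x) = sum_i alpha_i K(x, x_i)
print(np.round(f_test, 2))
print(np.round(np.sin(X_test[:, 0]), 2))                 # predictions roughly track sin(x)
```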
Modularity

Kernel methods consist of two modules:

1) The choice of kernel (this is non-trivial)


2) The algorithm which takes kernels as input

Modularity: Any kernel can be used with any kernel-algorithm.


some kernels:
- $k(x, y) = e^{-\|x - y\|^2 / c}$
- $k(x, y) = (\langle x, y \rangle + \theta)^d$
- $k(x, y) = \tanh(\alpha \langle x, y \rangle + \theta)$
- $k(x, y) = \dfrac{1}{\|x - y\|^2 + c^2}$

some kernel algorithms:
- support vector machine
- Fisher discriminant analysis
- kernel regression
- kernel PCA
- kernel CCA
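As an illustration of this modularity, here is a minimal sketch of the listed kernels as interchangeable plain functions (the parameter names c, theta, d, alpha and their defaults are illustrative); any of them can feed a Gram-matrix-based algorithm such as the dual ridge regression above.

```python
import numpy as np

def rbf(x, y, c=1.0):
    return np.exp(-np.sum((x - y) ** 2) / c)

def polynomial(x, y, theta=1.0, d=3):
    return (np.dot(x, y) + theta) ** d

def sigmoid(x, y, alpha=1.0, theta=0.0):
    return np.tanh(alpha * np.dot(x, y) + theta)

def inverse_quadratic(x, y, c=1.0):
    return 1.0 / (np.sum((x - y) ** 2) + c ** 2)

def gram(kernel, X):
    """Gram matrix K_ij = k(x_i, x_j) for the rows of X."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])
```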
What is a proper kernel
Definition: A finitely positive semi-definite function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
is a symmetric function of its arguments for which every matrix formed
by restriction to a finite subset of points is positive semi-definite:
$\alpha^T K \alpha \ge 0 \quad \forall \alpha$

Theorem: A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ can be written
as $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$, where $\Phi(x)$ is a feature map
$x \to \Phi(x) \in F$, iff $k(x, y)$ satisfies the semi-definiteness property.

Relevance: We can now check whether $k(x, y)$ is a proper kernel using
only properties of $k(x, y)$ itself,
i.e. without the need to know the feature map!
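On any finite set of points this condition can be checked directly from k alone, for instance via the eigenvalues of the kernel matrix. A minimal sketch (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def is_psd_on_sample(kernel, X, tol=1e-10):
    """Check alpha^T K alpha >= 0 for all alpha via the eigenvalues of K."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    eigvals = np.linalg.eigvalsh(K)          # K is symmetric
    return eigvals.min() >= -tol

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
rbf = lambda x, y: np.exp(-np.sum((x - y) ** 2))
print(is_psd_on_sample(rbf, X))              # True: the RBF kernel is a proper kernel
```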
Reproducing Kernel Hilbert Spaces
The proof of the above theorem proceeds by constructing a very
special feature map (note that more than one feature map may give rise to the same kernel):

$\Phi : x \to \Phi(x) = k(x, \cdot)$,   i.e. we map to a function space.

definition of the function space:

$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(x_i, \cdot)$,   for any $m$ and $\{x_i\}$

$\langle f, g \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x_j)$

$\langle f, f \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j k(x_i, x_j) \ge 0$   (finite positive semi-definiteness)

reproducing property:

$\langle f, \Phi(x) \rangle = \langle f, k(x, \cdot) \rangle = \sum_{i=1}^{m} \alpha_i \langle k(x_i, \cdot), k(x, \cdot) \rangle = \sum_{i=1}^{m} \alpha_i k(x_i, x) = f(x)$

$\Rightarrow \langle \Phi(x), \Phi(y) \rangle = k(x, y)$


Mercer’s Theorem
Theorem: $X$ is compact and $k(x, y)$ is a symmetric continuous function s.t.
$T_k f = \int k(\cdot, x) f(x)\, dx$ is a positive semi-definite operator: $T_k \succeq 0$,
i.e.
$\int\!\int k(x, y) f(x) f(y)\, dx\, dy \ge 0 \quad \forall f \in L_2(X)$.
Then there exists an orthonormal feature basis of eigen-functions
such that:

$k(x, y) = \sum_{i=1}^{\infty} \phi_i(x)\, \phi_i(y)$

Hence: k(x, y) is a proper kernel.

Note: Here we construct feature vectors in $L_2$, whereas the RKHS
construction was in a function space.
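A finite-sample analogue of Mercer's expansion (a sketch, not in the slides): eigen-decomposing a kernel matrix gives nonnegative eigenvalues and orthonormal eigenvectors, and absorbing $\sqrt{\lambda_i}$ into the eigenvectors reconstructs $K_{ij}$ as an inner product of "feature vectors".

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 3))
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))  # RBF Gram matrix

lam, V = np.linalg.eigh(K)                   # eigenvalues lam >= 0, orthonormal columns V
Phi = V * np.sqrt(np.clip(lam, 0, None))     # rows are feature vectors phi(x_i)

print(np.allclose(Phi @ Phi.T, K))           # K_ij = <phi(x_i), phi(x_j)>
```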
Learning Kernels
• All information is tunneled through the Gram-matrix information
bottleneck.
• The real art is to pick an appropriate kernel.
e.g. take the RBF kernel:  $k(x, y) = e^{-\|x - y\|^2 / c}$

if c is very small: $G = I$ (all data are dissimilar): over-fitting

if c is very large: $G = \mathbf{1}$ (all data are very similar): under-fitting

We need to learn the kernel. Here are some ways to combine
kernels to improve them:

$\alpha\, k_1(x, y) + \beta\, k_2(x, y) \to k(x, y)$,  $\alpha, \beta \ge 0$   ($k_1$, $k_2$ lie in the cone of kernels)

$k_1(x, y)\, k_2(x, y) \to k(x, y)$

$k_1(\Phi(x), \Phi(y)) \to k(x, y)$   (for any map $\Phi$)

any positive polynomial of kernels is again a kernel
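These closure rules can be checked numerically on any finite sample, in the same spirit as the PSD check above (a sketch with arbitrary weights and kernels):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((25, 3))
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

K1 = np.exp(-D2)                       # RBF kernel matrix
K2 = (X @ X.T + 1.0) ** 2              # polynomial kernel matrix

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

print(min_eig(0.5 * K1 + 2.0 * K2) >= -1e-8)   # conic combination stays PSD
print(min_eig(K1 * K2) >= -1e-8)               # elementwise product stays PSD
```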
Stability of Kernel Algorithms
Our objective for learning is to improve generalization performance:
cross-validation, Bayesian methods, generalization bounds,...

Call $\hat{E}_S[f(x)] = 0$ a pattern in a sample $S$.

Is this pattern also likely to be present in new data: $E_P[f(x)] \approx 0$?
We can use concentration inequalities (McDiarmid's theorem)
to prove that:

Theorem: Let $S = \{x_1, \ldots, x_\ell\}$ be an IID sample from $P$ and define
the sample mean of $f(x)$ as $\bar{f} = \frac{1}{\ell} \sum_{i=1}^{\ell} f(x_i)$. Then it follows that:

$P\left( \|\bar{f} - E_P[f]\| \le \dfrac{R}{\sqrt{\ell}} \left( 2 + \sqrt{2 \ln \tfrac{1}{\delta}} \right) \right) \ge 1 - \delta$,   where $R = \sup_x \|f(x)\|$

(the probability that the sample mean and the population mean differ by less than this bound
is more than $1 - \delta$, independent of P!)
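The concentration statement can be illustrated by Monte Carlo. A minimal sketch (an illustration under stated assumptions, not from the slides): take f to be the identity feature map and P uniform on $[-1,1]^2$, so $E_P[f] = 0$ and $R = \sqrt{2}$, and count how often the sample mean lands inside the stated radius.

```python
import numpy as np

rng = np.random.default_rng(4)
ell, delta, trials = 200, 0.05, 2000
R = np.sqrt(2.0)                          # sup ||x|| for x uniform on [-1, 1]^2
bound = (R / np.sqrt(ell)) * (2 + np.sqrt(2 * np.log(1 / delta)))

hits = 0
for _ in range(trials):
    S = rng.uniform(-1, 1, size=(ell, 2))         # IID sample, E_P[x] = 0
    if np.linalg.norm(S.mean(axis=0)) <= bound:   # ||sample mean - population mean||
        hits += 1

print(hits / trials, ">=", 1 - delta)             # empirical coverage vs 1 - delta
```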
Rademacher Complexity
Problem: we only checked the generalization performance for a
single fixed pattern f(x).
What if we want to search over a function class F?

Intuition: we need to incorporate the complexity of this function class.

Rademacher complexity captures the ability of the function class to
fit random noise ($\sigma_i = \pm 1$, uniformly distributed).

empirical RC:
$\hat{R}_\ell(F) = E_\sigma \left[ \sup_{f \in F} \left| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right| \;\middle|\; x_1, \ldots, x_\ell \right]$

$R_\ell(F) = E_S\, E_\sigma \left[ \sup_{f \in F} \left| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right| \right]$
Generalization Bound
Theorem: Let $f$ be a function in $F$ which maps to $[0, 1]$ (e.g. loss functions).
Then, with probability at least $1 - \delta$ over random draws of samples of size $\ell$,
every $f$ satisfies:

$E_P[f(x)] \;\le\; \hat{E}_{\text{data}}[f(x)] + R_\ell(F) + \sqrt{\dfrac{\ln(2/\delta)}{2\ell}}$

$\qquad\quad\;\; \le\; \hat{E}_{\text{data}}[f(x)] + \hat{R}_\ell(F) + 3 \sqrt{\dfrac{\ln(2/\delta)}{2\ell}}$

Relevance: The expected pattern E[f] = 0 will also be present in a new
data set, if the last two terms are small:
- complexity of the function class F small
- number of training data large
Linear Functions (in feature space)
Consider the function class:   $F_B = \{\, f : x \mapsto \langle w, \Phi(x) \rangle,\; \|w\| \le B \,\}$,   with $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$,

and a sample:   $S = \{x_1, \ldots, x_\ell\}$.

Then the empirical RC of $F_B$ is bounded by:   $\hat{R}_\ell(F_B) \le \dfrac{2B}{\ell} \sqrt{\mathrm{tr}(K)}$

Relevance: Since $\{\, x \mapsto \sum_{i=1}^{\ell} \alpha_i k(x_i, x),\; \alpha^T K \alpha \le B^2 \,\} \subseteq F_B$, it follows that
if we control the norm $\alpha^T K \alpha = \|w\|^2$ in kernel algorithms, we control
the complexity of the function class (regularization).
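The bound itself is cheap to evaluate (a sketch with arbitrary data): for an RBF kernel the diagonal entries are 1, so $\mathrm{tr}(K) = \ell$ and the bound reduces to $2B/\sqrt{\ell}$.

```python
import numpy as np

rng = np.random.default_rng(6)
ell, B = 200, 1.0
X = rng.standard_normal((ell, 3))
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))  # RBF Gram matrix

rc_bound = 2 * B / ell * np.sqrt(np.trace(K))
print(rc_bound, 2 * B / np.sqrt(ell))   # identical: tr(K) = ell for an RBF kernel
```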
Margin Bound (classification)
Theorem: Choose $c > 0$ (the margin).
$F$:  $f(x, y) = -y\, g(x)$,  $y = \pm 1$
$S$:  $\{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$  IID sample
$\delta \in (0, 1)$:  probability of violating the bound.

$P_P[\, y \ne \mathrm{sign}(g(x)) \,] \;\le\; \dfrac{1}{\ell c} \sum_{i=1}^{\ell} \xi_i \;+\; \dfrac{4}{\ell c} \sqrt{\mathrm{tr}(K)} \;+\; 3 \sqrt{\dfrac{\ln(2/\delta)}{2\ell}}$

(probability of misclassification)

$\xi_i = (c - y_i\, g(x_i))_+$   (slack variable)
$(f)_+ = f$ if $f \ge 0$ and $0$ otherwise

Relevance: We can bound our classification error on new samples. Moreover, we have a
strategy to improve generalization: choose the margin c as large as possible such
that all samples are correctly classified: $\xi_i = 0$ (e.g. support vector machines).
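A sketch evaluating the three terms of the bound on a toy, nearly separable dataset, with $g(x) = \langle w, x \rangle$ picked by hand rather than trained; all names and numbers are illustrative, and at this sample size the bound is loose.

```python
import numpy as np

rng = np.random.default_rng(7)
ell, delta, c = 200, 0.05, 0.5
y = rng.choice([-1.0, 1.0], size=ell)
X = y[:, None] * rng.uniform(0.5, 1.5, size=(ell, 2)) + 0.05 * rng.standard_normal((ell, 2))

w = np.array([1.0, 1.0]) / np.sqrt(2.0)      # hand-picked separating direction
g = X @ w                                    # g(x_i)
xi = np.maximum(0.0, c - y * g)              # slack variables xi_i

K = X @ X.T                                  # linear-kernel Gram matrix
bound = (xi.sum() / (ell * c)
         + 4.0 / (ell * c) * np.sqrt(np.trace(K))
         + 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * ell)))

print("empirical error:", np.mean(y != np.sign(g)))
print("margin bound   :", bound)
```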
