
(chapters 1,2,3,4)

Introduction to Kernels

Max Welling
October 1 2004

Introduction
• What is the goal of (pick your favorite name):
- Machine Learning
- Data Mining
- Pattern Recognition
- Data Analysis
- Statistics

Automatic detection of non-coincidental structure in data.

• Desiderata:
- Robust algorithms insensitive to outliers and wrong
model assumptions.
- Stable algorithms: generalize well to unseen data.
- Computationally efficient algorithms: large datasets.
Let’s Learn Something
What is the common characteristic (structure) shared by the following
statistical methods?

1. Principal Components Analysis


2. Ridge regression
3. Fisher discriminant analysis
4. Canonical correlation analysis

Answer:
We consider linear combinations of the input vector: $f(x) = w^T x$

Linear algorithms are very well understood and enjoy strong guarantees
(convexity, generalization bounds).
Can we carry these guarantees over to non-linear algorithms?
Feature Spaces

$\Phi : x \mapsto \Phi(x), \qquad \mathbb{R}^d \to F$

non-linear mapping to $F$:
1. high-dimensional space
2. infinite-dimensional countable space: $L_2$
3. function space (Hilbert space)

example: $\Phi(x, y) = (x^2,\, y^2,\, \sqrt{2}\, x y)$
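A quick numerical check of this example (a minimal sketch, not part of the slides): the inner product of the mapped points equals the squared inner product of the originals, i.e. $\langle \Phi(a), \Phi(b) \rangle = \langle a, b \rangle^2$.

```python
import numpy as np

def phi(v):
    """Explicit quadratic feature map (x, y) -> (x^2, y^2, sqrt(2)*x*y)."""
    x, y = v
    return np.array([x**2, y**2, np.sqrt(2) * x * y])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# Inner product in feature space equals the squared inner product in input space.
print(np.dot(phi(a), phi(b)))   # 1.0
print(np.dot(a, b) ** 2)        # 1.0
```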
Ridge Regression (duality)

problem:   $\min_w \; \sum_{i=1}^{\ell} (y_i - w^T x_i)^2 + \lambda \|w\|^2$
           (targets $y_i$, inputs $x_i$, regularization $\lambda$)

solution:  $w = (X^T X + \lambda I_d)^{-1} X^T y$          ($d \times d$ inverse)

           $\;\;\; = X^T (X X^T + \lambda I_\ell)^{-1} y$   ($\ell \times \ell$ inverse)

           $\;\;\; = X^T (G + \lambda I_\ell)^{-1} y$,  $\quad G_{ij} = \langle x_i, x_j \rangle$   (Gram matrix)

           $\;\;\; = \sum_{i=1}^{\ell} \alpha_i x_i$         (linear combination of the data: the Dual Representation)
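The equivalence of the two solutions is easy to verify numerically. A minimal NumPy sketch on synthetic data (not from the slides): the dual form touches the data only through the Gram matrix $G = X X^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1            # n samples, d features, regularization
X = rng.standard_normal((n, d))   # rows are the inputs x_i
y = rng.standard_normal(n)

# Primal solution: (d x d) inverse.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual solution: (n x n) inverse of the Gram matrix G = X X^T.
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(n), y)
w_dual = X.T @ alpha              # w = sum_i alpha_i x_i

print(np.allclose(w_primal, w_dual))  # True
```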


Kernel Trick
Note: In the dual representation we used the Gram matrix
to express the solution.

Kernel Trick:

Replace $x \to \Phi(x)$, so that

$G_{ij} = \langle x_i, x_j \rangle \;\;\longrightarrow\;\; G_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle = K(x_i, x_j)$   (the kernel)

If we use algorithms that only depend on the Gram matrix, G,
then we never have to know (compute) the actual features $\Phi$.

This is the crucial point of kernel methods.
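Once an algorithm is written in terms of G, kernelizing it is just a matter of swapping in K. A minimal sketch (not from the slides) of the resulting nonlinear regressor, fitting noisy sine data with an RBF kernel; all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

def rbf_kernel(A, B, c=1.0):
    """K_ij = exp(-||a_i - b_j||^2 / c)."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / c)

lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)    # dual coefficients

X_test = np.linspace(-3, 3, 5)[:, None]
f_test = rbf_kernel(X_test, X) @ alpha                   # f(x) = sum_i alpha_i K(x, x_i)
print(np.round(f_test, 2))
print(np.round(np.sin(X_test[:, 0]), 2))                 # predictions roughly track sin(x)
```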
Modularity

Kernel methods consist of two modules:

1) The choice of kernel (this is non-trivial)


2) The algorithm which takes kernels as input

Modularity: Any kernel can be used with any kernel-algorithm.


some kernels:
- $k(x, y) = e^{-\|x - y\|^2 / c}$
- $k(x, y) = (\langle x, y \rangle + \theta)^d$
- $k(x, y) = \tanh(\alpha \langle x, y \rangle + \theta)$
- $k(x, y) = \dfrac{1}{\|x - y\|^2 + c^2}$

some kernel algorithms:
- support vector machine
- Fisher discriminant analysis
- kernel regression
- kernel PCA
- kernel CCA
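As an illustration of this modularity, here is a minimal sketch of the listed kernels as interchangeable plain functions (the parameter names c, theta, d, alpha and their defaults are illustrative); any of them can feed a Gram-matrix-based algorithm such as the dual ridge regression above.

```python
import numpy as np

def rbf(x, y, c=1.0):
    return np.exp(-np.sum((x - y) ** 2) / c)

def polynomial(x, y, theta=1.0, d=3):
    return (np.dot(x, y) + theta) ** d

def sigmoid(x, y, alpha=1.0, theta=0.0):
    return np.tanh(alpha * np.dot(x, y) + theta)

def inverse_quadratic(x, y, c=1.0):
    return 1.0 / (np.sum((x - y) ** 2) + c ** 2)

def gram(kernel, X):
    """Gram matrix K_ij = k(x_i, x_j) for the rows of X."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])
```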
What is a proper kernel
Definition: A finitely positive semi-definite function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
is a symmetric function of its arguments for which every matrix formed
by restriction to a finite subset of points is positive semi-definite:
$\alpha^T K \alpha \ge 0 \quad \forall \alpha$

Theorem: A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ can be written
as $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$, where $\Phi(x)$ is a feature map
$x \to \Phi(x) \in F$, iff $k(x, y)$ satisfies the semi-definiteness property.

Relevance: We can now check whether $k(x, y)$ is a proper kernel using
only properties of $k(x, y)$ itself,
i.e. without the need to know the feature map!
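On any finite set of points this condition can be checked directly from k alone, for instance via the eigenvalues of the kernel matrix. A minimal sketch (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def is_psd_on_sample(kernel, X, tol=1e-10):
    """Check alpha^T K alpha >= 0 for all alpha via the eigenvalues of K."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    eigvals = np.linalg.eigvalsh(K)          # K is symmetric
    return eigvals.min() >= -tol

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
rbf = lambda x, y: np.exp(-np.sum((x - y) ** 2))
print(is_psd_on_sample(rbf, X))              # True: the RBF kernel is a proper kernel
```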
Reproducing Kernel Hilbert Spaces
The proof of the above theorem proceeds by constructing a very
special feature map (note that more than one feature map may give rise to the same kernel):

$\Phi : x \to \Phi(x) = k(x, \cdot)$,   i.e. we map to a function space.

definition of the function space:

$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(x_i, \cdot)$,   for any $m$ and $\{x_i\}$

$\langle f, g \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x_j)$

$\langle f, f \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j k(x_i, x_j) \ge 0$   (finite positive semi-definiteness)

reproducing property:

$\langle f, \Phi(x) \rangle = \langle f, k(x, \cdot) \rangle = \sum_{i=1}^{m} \alpha_i \langle k(x_i, \cdot), k(x, \cdot) \rangle = \sum_{i=1}^{m} \alpha_i k(x_i, x) = f(x)$

$\Rightarrow \langle \Phi(x), \Phi(y) \rangle = k(x, y)$


Mercer’s Theorem
Theorem: $X$ is compact and $k(x, y)$ is a symmetric continuous function s.t.
$T_k f = \int k(\cdot, x) f(x)\, dx$ is a positive semi-definite operator: $T_k \succeq 0$,
i.e.
$\int\!\int k(x, y) f(x) f(y)\, dx\, dy \ge 0 \quad \forall f \in L_2(X)$.
Then there exists an orthonormal feature basis of eigen-functions
such that:

$k(x, y) = \sum_{i=1}^{\infty} \phi_i(x)\, \phi_i(y)$

Hence: k(x, y) is a proper kernel.

Note: Here we construct feature vectors in $L_2$, whereas the RKHS
construction was in a function space.
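A finite-sample analogue of Mercer's expansion (a sketch, not in the slides): eigen-decomposing a kernel matrix gives nonnegative eigenvalues and orthonormal eigenvectors, and absorbing $\sqrt{\lambda_i}$ into the eigenvectors reconstructs $K_{ij}$ as an inner product of "feature vectors".

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 3))
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))  # RBF Gram matrix

lam, V = np.linalg.eigh(K)                   # eigenvalues lam >= 0, orthonormal columns V
Phi = V * np.sqrt(np.clip(lam, 0, None))     # rows are feature vectors phi(x_i)

print(np.allclose(Phi @ Phi.T, K))           # K_ij = <phi(x_i), phi(x_j)>
```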
Learning Kernels
• All information is tunneled through the Gram-matrix information
bottleneck.
• The real art is to pick an appropriate kernel.
e.g. take the RBF kernel:  $k(x, y) = e^{-\|x - y\|^2 / c}$

if c is very small: $G = I$ (all data are dissimilar): over-fitting

if c is very large: $G = \mathbf{1}$ (all data are very similar): under-fitting

We need to learn the kernel. Here are some ways to combine
kernels to improve them:

$\alpha\, k_1(x, y) + \beta\, k_2(x, y) \to k(x, y)$,  $\alpha, \beta \ge 0$   ($k_1$, $k_2$ lie in the cone of kernels)

$k_1(x, y)\, k_2(x, y) \to k(x, y)$

$k_1(\Phi(x), \Phi(y)) \to k(x, y)$   (for any map $\Phi$)

any positive polynomial of kernels is again a kernel
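These closure rules can be checked numerically on any finite sample, in the same spirit as the PSD check above (a sketch with arbitrary weights and kernels):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((25, 3))
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

K1 = np.exp(-D2)                       # RBF kernel matrix
K2 = (X @ X.T + 1.0) ** 2              # polynomial kernel matrix

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

print(min_eig(0.5 * K1 + 2.0 * K2) >= -1e-8)   # conic combination stays PSD
print(min_eig(K1 * K2) >= -1e-8)               # elementwise product stays PSD
```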
Stability of Kernel Algorithms
Our objective for learning is to improve generalization performance:
cross-validation, Bayesian methods, generalization bounds,...

Call $\hat{E}_S[f(x)] = 0$ a pattern in a sample $S$.

Is this pattern also likely to be present in new data: $E_P[f(x)] \approx 0$?
We can use concentration inequalities (McDiarmid's theorem)
to prove that:

Theorem: Let $S = \{x_1, \ldots, x_\ell\}$ be an IID sample from $P$ and define
the sample mean of $f(x)$ as $\bar{f} = \frac{1}{\ell} \sum_{i=1}^{\ell} f(x_i)$. Then it follows that:

$P\left( \|\bar{f} - E_P[f]\| \le \dfrac{R}{\sqrt{\ell}} \left( 2 + \sqrt{2 \ln \tfrac{1}{\delta}} \right) \right) \ge 1 - \delta$,   where $R = \sup_x \|f(x)\|$

(the probability that the sample mean and the population mean differ by less than this bound
is more than $1 - \delta$, independent of P!)
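The concentration statement can be illustrated by Monte Carlo. A minimal sketch (an illustration under stated assumptions, not from the slides): take f to be the identity feature map and P uniform on $[-1,1]^2$, so $E_P[f] = 0$ and $R = \sqrt{2}$, and count how often the sample mean lands inside the stated radius.

```python
import numpy as np

rng = np.random.default_rng(4)
ell, delta, trials = 200, 0.05, 2000
R = np.sqrt(2.0)                          # sup ||x|| for x uniform on [-1, 1]^2
bound = (R / np.sqrt(ell)) * (2 + np.sqrt(2 * np.log(1 / delta)))

hits = 0
for _ in range(trials):
    S = rng.uniform(-1, 1, size=(ell, 2))         # IID sample, E_P[x] = 0
    if np.linalg.norm(S.mean(axis=0)) <= bound:   # ||sample mean - population mean||
        hits += 1

print(hits / trials, ">=", 1 - delta)             # empirical coverage vs 1 - delta
```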
Rademacher Complexity
Problem: we only checked the generalization performance for a
single fixed pattern f(x).
What if we want to search over a function class F?

Intuition: we need to incorporate the complexity of this function class.

Rademacher complexity captures the ability of the function class to
fit random noise ($\sigma_i = \pm 1$, uniformly distributed).

empirical RC:
$\hat{R}_\ell(F) = E_\sigma \left[ \sup_{f \in F} \left| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right| \;\middle|\; x_1, \ldots, x_\ell \right]$

$R_\ell(F) = E_S\, E_\sigma \left[ \sup_{f \in F} \left| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right| \right]$
Generalization Bound
Theorem: Let $f$ be a function in $F$ which maps to $[0, 1]$ (e.g. loss functions).
Then, with probability at least $1 - \delta$ over random draws of samples of size $\ell$,
every $f$ satisfies:

$E_P[f(x)] \;\le\; \hat{E}_{\text{data}}[f(x)] + R_\ell(F) + \sqrt{\dfrac{\ln(2/\delta)}{2\ell}}$

$\qquad\quad\;\; \le\; \hat{E}_{\text{data}}[f(x)] + \hat{R}_\ell(F) + 3 \sqrt{\dfrac{\ln(2/\delta)}{2\ell}}$

Relevance: The expected pattern E[f] = 0 will also be present in a new
data set, if the last two terms are small:
- complexity of the function class F small
- number of training data large
Linear Functions (in feature space)
Consider the function class:   $F_B = \{\, f : x \mapsto \langle w, \Phi(x) \rangle,\; \|w\| \le B \,\}$,   with $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$,

and a sample:   $S = \{x_1, \ldots, x_\ell\}$.

Then the empirical RC of $F_B$ is bounded by:   $\hat{R}_\ell(F_B) \le \dfrac{2B}{\ell} \sqrt{\mathrm{tr}(K)}$

Relevance: Since $\{\, x \mapsto \sum_{i=1}^{\ell} \alpha_i k(x_i, x),\; \alpha^T K \alpha \le B^2 \,\} \subseteq F_B$, it follows that
if we control the norm $\alpha^T K \alpha = \|w\|^2$ in kernel algorithms, we control
the complexity of the function class (regularization).
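The bound itself is cheap to evaluate (a sketch with arbitrary data): for an RBF kernel the diagonal entries are 1, so $\mathrm{tr}(K) = \ell$ and the bound reduces to $2B/\sqrt{\ell}$.

```python
import numpy as np

rng = np.random.default_rng(6)
ell, B = 200, 1.0
X = rng.standard_normal((ell, 3))
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))  # RBF Gram matrix

rc_bound = 2 * B / ell * np.sqrt(np.trace(K))
print(rc_bound, 2 * B / np.sqrt(ell))   # identical: tr(K) = ell for an RBF kernel
```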
Margin Bound (classification)
Theorem: Choose $c > 0$ (the margin).
$F$:  $f(x, y) = -y\, g(x)$,  $y = \pm 1$
$S$:  $\{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$  IID sample
$\delta \in (0, 1)$:  probability of violating the bound.

$P_P[\, y \ne \mathrm{sign}(g(x)) \,] \;\le\; \dfrac{1}{\ell c} \sum_{i=1}^{\ell} \xi_i \;+\; \dfrac{4}{\ell c} \sqrt{\mathrm{tr}(K)} \;+\; 3 \sqrt{\dfrac{\ln(2/\delta)}{2\ell}}$

(probability of misclassification)

$\xi_i = (c - y_i\, g(x_i))_+$   (slack variable)
$(f)_+ = f$ if $f \ge 0$ and $0$ otherwise

Relevance: We can bound our classification error on new samples. Moreover, we have a
strategy to improve generalization: choose the margin c as large as possible such
that all samples are correctly classified: $\xi_i = 0$ (e.g. support vector machines).
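A sketch evaluating the three terms of the bound on a toy, nearly separable dataset, with $g(x) = \langle w, x \rangle$ picked by hand rather than trained; all names and numbers are illustrative, and at this sample size the bound is loose.

```python
import numpy as np

rng = np.random.default_rng(7)
ell, delta, c = 200, 0.05, 0.5
y = rng.choice([-1.0, 1.0], size=ell)
X = y[:, None] * rng.uniform(0.5, 1.5, size=(ell, 2)) + 0.05 * rng.standard_normal((ell, 2))

w = np.array([1.0, 1.0]) / np.sqrt(2.0)      # hand-picked separating direction
g = X @ w                                    # g(x_i)
xi = np.maximum(0.0, c - y * g)              # slack variables xi_i

K = X @ X.T                                  # linear-kernel Gram matrix
bound = (xi.sum() / (ell * c)
         + 4.0 / (ell * c) * np.sqrt(np.trace(K))
         + 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * ell)))

print("empirical error:", np.mean(y != np.sign(g)))
print("margin bound   :", bound)
```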
