Introduction To Kernels: Max Welling
Introduction to Kernels
Max Welling
October 1 2004
Introduction
• What is the goal of (pick your favorite name):
- Machine Learning
- Data Mining
- Pattern Recognition
- Data Analysis
- Statistics
• Desiderata:
- Robust algorithms insensitive to outliers and wrong
model assumptions.
- Stable algorithms: generalize well to unseen data.
- Computationally efficient algorithms: large datasets.
Let’s Learn Something
Find the common characteristic (structure) among the following
statistical methods.
Answer:
We consider linear combinations of the input vector: $f(x) = w^T x$
Linear algorithms are very well understood and enjoy strong guarantees
(convexity, generalization bounds).
Can we carry these guarantees over to non-linear algorithms?
Feature Spaces
$\Phi : x \mapsto \Phi(x), \quad \mathbb{R}^d \to F$
non-linear mapping to feature space F, which can be:
1. a high-dimensional space
2. an infinite-dimensional countable space: $\ell_2$
3. a function space (Hilbert space)
example: $\Phi(x, y) = (x^2, y^2, \sqrt{2}\,xy)$
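The example map can be checked numerically: for $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$, the inner product in feature space equals the squared inner product in input space, $k(x,y) = \langle x, y\rangle^2$. A minimal sketch (the helper names are ours):

```python
import math

def phi(x):
    """Quadratic feature map: Phi(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, 4.0)
lhs = dot(phi(x), phi(y))   # inner product computed in feature space
rhs = dot(x, y) ** 2        # kernel evaluated in input space: <x, y>^2
print(lhs, rhs)             # both equal 121.0
```

So the kernel evaluates the feature-space inner product without ever forming $\Phi(x)$ explicitly.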
Ridge Regression (duality)
problem: $\min_w \sum_{i=1}^{N} (y_i - w^T x_i)^2 + \lambda \|w\|^2$
(targets $y_i$, inputs $x_i$, regularization term $\lambda \|w\|^2$)
for $f = \sum_{i=1}^{m} \alpha_i k(x_i, \cdot)$ the reproducing property gives $\langle k(x, \cdot), f \rangle = f(x)$, and
$\langle f, f \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j\, k(x_i, x_j) \ge 0 \quad \forall \alpha$
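The ridge problem above has a well-known dual form in which the weights are a linear combination of the inputs, $w = \sum_i \alpha_i x_i$ with $\alpha = (K + \lambda I)^{-1} y$ and Gram matrix $K = XX^T$. A sketch checking that primal and dual solutions agree on random data (NumPy; all names and the data are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # 5 inputs x_i in R^3
y = rng.standard_normal(5)        # 5 targets y_i
lam = 0.1                         # regularization strength lambda

# Primal solution: w = (X^T X + lambda I)^{-1} X^T y
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Dual solution: alpha = (K + lambda I)^{-1} y with Gram matrix K = X X^T;
# the weights are a linear combination of the inputs, w = X^T alpha
K = X @ X.T
alpha = np.linalg.solve(K + lam * np.eye(5), y)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))  # True
```

The dual only touches the data through the Gram matrix $K$, which is what allows replacing $\langle x_i, x_j \rangle$ by a kernel $k(x_i, x_j)$.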
Theorem: Let $S = \{x_1, \ldots, x_\ell\}$ be an IID sample from P and define
the sample mean of f(x) as $\bar{f} = \frac{1}{\ell} \sum_{i=1}^{\ell} f(x_i)$; then it follows that:
$P\Big( \|\bar{f} - E_P[f]\| \le \frac{R}{\sqrt{\ell}}\big(2 + \sqrt{2 \ln(1/\delta)}\big) \Big) \ge 1 - \delta, \qquad R = \sup_x \|f(x)\|$
(the probability that sample mean and population mean differ by less than this bound is more than $1 - \delta$, independent of P!)
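A quick Monte Carlo illustration of the theorem, under assumptions of our own choosing: $f(x) = x$ with x uniform on $[-1, 1]$, so $R = \sup_x |f(x)| = 1$ and $E_P[f] = 0$:

```python
import math
import random

ell, delta, trials = 100, 0.05, 2000
R = 1.0
# The deviation bound from the theorem: (R/sqrt(l)) * (2 + sqrt(2 ln(1/delta)))
bound = (R / math.sqrt(ell)) * (2 + math.sqrt(2 * math.log(1 / delta)))

random.seed(0)
violations = 0
for _ in range(trials):
    sample_mean = sum(random.uniform(-1, 1) for _ in range(ell)) / ell
    if abs(sample_mean - 0.0) > bound:   # population mean is 0 here
        violations += 1

# The theorem guarantees a violation rate of at most delta = 0.05;
# in practice it is far smaller because the bound is distribution-free.
print(violations / trials)
```

The observed violation rate is essentially zero, consistent with the bound holding for any P.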
Rademacher Complexity
Problem: we only checked the generalization performance for a
single fixed pattern f(x).
What if we want to search over a function class F?
$R_\ell(F) = E_S\, E_\sigma \Big[ \sup_{f \in F} \Big| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \Big| \Big]$
(with $\sigma_i$ independent random signs, $P(\sigma_i = \pm 1) = \tfrac{1}{2}$)
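The inner expectation over the signs $\sigma_i$ can be estimated by simple averaging. A sketch for a tiny, hand-picked finite class F on a fixed sample (the sample and the functions are illustrative choices of ours, not from the slides):

```python
import random

# Monte Carlo estimate of the empirical Rademacher complexity
#   R_hat(F) = E_sigma[ sup_{f in F} | (2/l) * sum_i sigma_i f(x_i) | ]
xs = [0.5, -1.0, 2.0, 1.5]                          # fixed sample x_1..x_l
F = [lambda x: x, lambda x: x * x, lambda x: 1.0]   # a small function class
ell = len(xs)

random.seed(0)
draws = 5000
total = 0.0
for _ in range(draws):
    sigma = [random.choice([-1, 1]) for _ in range(ell)]  # Rademacher signs
    total += max(abs(2.0 / ell * sum(s * f(x) for s, x in zip(sigma, xs)))
                 for f in F)
R_hat = total / draws
print(R_hat)
```

Richer classes can correlate with more sign patterns, which is exactly what drives the estimate (and the complexity) up.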
Generalization Bound
Theorem: Let f be a function in F which maps to [0,1] (e.g. loss functions).
Then, with probability at least $1 - \delta$ over random draws of samples of size $\ell$,
every f satisfies:
$E_P[f(x)] \le \hat{E}_{data}[f(x)] + R_\ell(F) + \sqrt{\frac{\ln(2/\delta)}{2\ell}}$
$\hphantom{E_P[f(x)]} \le \hat{E}_{data}[f(x)] + \hat{R}_\ell(F) + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}}$
(where $\hat{R}_\ell(F)$ is the empirical Rademacher complexity)
Relevance: the expected pattern E[f]=0 will also be present in a new
data set if the last two terms are small:
- the complexity of the function class F is small
- the number of training data is large
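Numerically, the deviation term $\sqrt{\ln(2/\delta)/(2\ell)}$ shrinks like $1/\sqrt{\ell}$. A quick sketch ($\delta = 0.05$ is our own choice here):

```python
import math

# Size of the deviation term sqrt(ln(2/delta) / (2*l)) as l grows
delta = 0.05
for ell in [100, 1000, 10000]:
    term = math.sqrt(math.log(2 / delta) / (2 * ell))
    print(ell, round(term, 4))   # 100 -> 0.1358, 1000 -> 0.0429, 10000 -> 0.0136
```

So for moderate sample sizes this term is already small, and the complexity term $R_\ell(F)$ dominates the bound.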
Linear Functions (in feature space)
Consider the function class: $F_B = \{\, f : x \mapsto \langle w, \Phi(x) \rangle, \; \|w\| \le B \,\}$
with $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$
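For this class the empirical Rademacher complexity admits the standard kernel bound $\hat{R}_\ell(F_B) \le \frac{2B}{\ell}\sqrt{\sum_i k(x_i, x_i)} = \frac{2B}{\ell}\sqrt{\mathrm{tr}\,K}$. A sketch checking it by Monte Carlo, using the quadratic kernel $k(x,y) = \langle x, y \rangle^2$ as an illustrative choice (data and names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 2))          # sample x_1..x_l in R^2
B, ell = 1.0, X.shape[0]

# Kernel bound: (2B/l) * sqrt(tr K), with K the Gram matrix of k(x,y) = <x,y>^2
K = (X @ X.T) ** 2
bound = (2 * B / ell) * np.sqrt(np.trace(K))

def phi(x):
    """Explicit feature map for the quadratic kernel."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

# Monte Carlo estimate of (2B/l) * E_sigma || sum_i sigma_i phi(x_i) ||,
# which Jensen's inequality guarantees is below the trace bound.
Phi = np.apply_along_axis(phi, 1, X)
sigmas = rng.choice([-1.0, 1.0], size=(2000, ell))
est = (2 * B / ell) * np.mean(np.linalg.norm(sigmas @ Phi, axis=1))
print(est <= bound)
```

The bound depends on the data only through the kernel's diagonal, so it can be evaluated without the feature map; the explicit $\Phi$ is used here just to simulate the supremum.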