Chap6.1-KernelMethods
Kernel Methods
Sargur Srihari
[email protected]
Machine Learning Srihari
Memory-Based Methods
• Training data points are kept and used directly when making predictions
• Examples of such methods
• Parzen probability density model
• Linear combination of kernel functions centered on each training data point
Kernel Functions
• Linear models can be re-cast into an equivalent dual representation
• where predictions are based on kernel functions
evaluated at the training points
• The kernel function is given by
k(x,x′) = ϕ(x)T ϕ(x′)
• where ϕ(x) is a fixed nonlinear mapping (basis function)
• The kernel is a symmetric function of its arguments
k(x,x′) = k(x′,x)
• The kernel can be interpreted as the similarity of x and x′
• Simplest case: identity mapping in feature space, ϕ(x) = x
• In which case k(x,x′) = xTx′
• Called the linear kernel
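A minimal sketch of the linear kernel: with the identity feature map ϕ(x) = x, the kernel is just the scalar product, and symmetry is immediate.

```python
import numpy as np

# Linear kernel: identity feature map phi(x) = x gives k(x, x') = x^T x'.
def linear_kernel(x, x_prime):
    return float(np.dot(x, x_prime))

x = np.array([1.0, 2.0])
x_prime = np.array([3.0, 0.5])
# linear_kernel(x, x_prime) == linear_kernel(x_prime, x): the kernel is symmetric
```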
Kernel Trick
• Formulating an algorithm in terms of inner products
allows well-known algorithms to be extended
• by using the kernel trick (also called kernel substitution)
• Basic idea of the kernel trick
• If an input vector x appears only in the form of
scalar products, then we can replace those scalar
products with some other choice of kernel
• Used widely
• in support vector machines
• in developing a non-linear variant of PCA
• in the kernel Fisher discriminant
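A small sketch of the trick: any quantity expressible through scalar products can be computed from kernel evaluations alone. For instance, the squared distance in feature space is ||ϕ(x) − ϕ(x′)||² = k(x,x) + k(x′,x′) − 2k(x,x′); with the linear kernel this reduces to the ordinary squared distance.

```python
import numpy as np

# Kernel trick sketch: feature-space squared distance from kernel values only:
#   ||phi(x) - phi(x')||^2 = k(x,x) + k(x',x') - 2 k(x,x')
def feature_space_sq_dist(k, x, x_prime):
    return k(x, x) + k(x_prime, x_prime) - 2.0 * k(x, x_prime)

# With the linear kernel this is the usual Euclidean squared distance.
linear = lambda a, b: float(np.dot(a, b))
x, x_prime = np.array([1.0, 0.0]), np.array([0.0, 1.0])
```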
Dual Representation
• Setting the gradient of the regularized sum-of-squares error
to zero gives the solution w = ΦTa
• where a = (a1,..,aN)T with an = −(1/λ){wTφ(xn) − tn}
• Solution for w: a linear combination of the vectors φ(xn),
whose coefficients an are themselves functions of w
• Φ is the design matrix whose nth row is given by φ(xn)T

    [ φ0(x1)   . .   φM−1(x1) ]
    [    .      .       .     ]
Φ = [ φ0(xn)   . .   φM−1(xn) ]   is an N × M matrix
    [    .      .       .     ]
    [ φ0(xN)   . .   φM−1(xN) ]

• This is a transformation from w to a
• Notes:
• Φ is N×M and K = ΦΦT is N×N (an N×M matrix times an M×N matrix)
• K is the Gram matrix of similarities of pairs of samples (thus it is symmetric)
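A quick numerical sketch of these properties: building K = ΦΦT from a random design matrix (standing in for evaluated basis functions) yields a symmetric, positive semidefinite N×N matrix.

```python
import numpy as np

# Gram matrix sketch: K = Phi Phi^T, with K_nm = phi(x_n)^T phi(x_m).
# Random Phi here stands in for basis functions evaluated at N inputs.
rng = np.random.default_rng(0)
N, M = 5, 3
Phi = rng.normal(size=(N, M))
K = Phi @ Phi.T   # N x N, symmetric, positive semidefinite
```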
Error Function in Terms of Gram Matrix
• The error function is
J(w) = (1/2) Σn=1..N {wTφ(xn) − tn}2 + (λ/2) wTw
• Substituting w = ΦTa into J(w) gives
J(a) = (1/2) aTΦΦTΦΦTa − aTΦΦTt + (1/2) tTt + (λ/2) aTΦΦTa
• where t = (t1,..,tN)T
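Minimizing J with respect to a gives the standard dual solution a = (K + λI_N)⁻¹t. A small numerical sketch (random data) checking that the dual weights recover the primal regularized least-squares solution:

```python
import numpy as np

# Dual vs primal regularized least squares on random data.
rng = np.random.default_rng(1)
N, M, lam = 6, 3, 0.5
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

# Primal: w = (lam * I_M + Phi^T Phi)^{-1} Phi^T t
w_primal = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Dual: a = (K + lam * I_N)^{-1} t, then w = Phi^T a
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(N), t)
w_dual = Phi.T @ a
```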
Constructing Kernels
• To exploit kernel substitution we need valid kernels
• Two methods:
• (1) from ϕ(x) to k(x,x′)
• (2) from k(x,x′) to ϕ(x)
• First method
• Choose ϕ(x) and use it to find the corresponding kernel
k(x,x′) = φ(x)T φ(x′) = Σi=1..M φi(x) φi(x′)
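A sketch of method (1) with a hypothetical one-dimensional feature map ϕ(x) = (1, x, x²): the kernel is the sum of products of the basis functions.

```python
import numpy as np

# Method (1): choose phi, then k(x,x') = sum_i phi_i(x) phi_i(x').
# Hypothetical 1-D polynomial features phi(x) = (1, x, x^2).
def phi(x):
    return np.array([1.0, x, x * x])

def k(x, x_prime):
    return float(phi(x) @ phi(x_prime))
```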
[Figure: basis functions ϕ(x) and the corresponding kernel functions
k(x,x′) = ϕ(x)T ϕ(x′) plotted as functions of x; the red cross marks
the location of x′]
Techniques for Constructing New Kernels
• Given valid kernels k1(x,x′) and k2(x,x′), the following are also valid:
1. k(x,x′) = c k1(x,x′), where c > 0 is a constant
2. k(x,x′) = f(x) k1(x,x′) f(x′), where f(·) is any function
3. k(x,x′) = q(k1(x,x′)), where q(·) is a polynomial with non-negative coefficients
4. k(x,x′) = exp(k1(x,x′))
5. k(x,x′) = k1(x,x′) + k2(x,x′)
6. k(x,x′) = k1(x,x′) k2(x,x′)
7. k(x,x′) = k3(f(x), f(x′)), where f(x) is a function from x to RM
and k3 is a valid kernel in RM
8. k(x,x′) = xTAx′, where A is a symmetric positive semidefinite matrix
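A numerical spot-check of rules 5 and 6 (a sketch, not a proof): sums and elementwise products of valid kernels yield Gram matrices that remain positive semidefinite.

```python
import numpy as np

# Spot-check rules 5 and 6 on random points.
rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))
K1 = X @ X.T                                        # linear kernel (valid)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K2 = np.exp(-sq)                                    # Gaussian kernel (valid)
K_sum = K1 + K2                                     # rule 5
K_prod = K1 * K2                                    # rule 6 (elementwise product)

def is_psd(K, tol=1e-8):
    return bool(np.all(np.linalg.eigvalsh(K) > -tol))
```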
Gaussian Kernel
• A commonly used kernel is
k(x,x′) = exp(−||x−x′||2 / 2σ2)
• It is seen to be a valid kernel by expanding the square
||x−x′||2 = xTx + (x′)Tx′ − 2xTx′
• to give
k(x,x′) = exp(−xTx/2σ2) exp(xTx′/σ2) exp(−(x′)Tx′/2σ2)
• which is valid by kernel construction rules 2 and 4
• together with the validity of the linear kernel k(x,x′) = xTx′
• Can be extended to non-Euclidean distances
k(x,x′) = exp{(−1/2σ2)[κ(x,x) + κ(x′,x′) − 2κ(x,x′)]}
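A minimal sketch computing the Gaussian kernel matrix on a few points; the diagonal is 1 (distance zero) and the matrix is symmetric.

```python
import numpy as np

# Gaussian (RBF) kernel matrix: k(x,x') = exp(-||x - x'||^2 / (2 sigma^2)).
def gaussian_kernel_matrix(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return np.exp(-sq / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel_matrix(X, sigma=1.0)
```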
TF-IDF
• Cosine similarity performance can be improved
with some preprocessing
• Use a feature vector called the TF-IDF representation
• Term frequency–inverse document frequency
• xij is the number of times word j occurs in document i
• TF is a log-transform of the count:
tf(xij) ≜ log(1 + xij)
• IDF is defined as
idf(j) ≜ log [ N / (1 + Σi=1..N I(xij > 0)) ]
• and we define ϕ(x) = tf-idf(x)
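A sketch of these definitions on a tiny term-count matrix; combining them as tf-idf = tf × idf (per word j) is the usual convention and is assumed here.

```python
import numpy as np

# TF-IDF sketch: x_ij = count of word j in document i.
X = np.array([[2, 0, 1],
              [0, 3, 1]])                  # N=2 documents, 3 words
N = X.shape[0]
tf = np.log1p(X)                           # tf(x_ij) = log(1 + x_ij)
df = (X > 0).sum(axis=0)                   # document frequency of word j
idf = np.log(N / (1.0 + df))               # idf(j) = log(N / (1 + df_j))
tfidf = tf * idf                           # assumed combination: tf * idf
```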
String Kernels
• The real power of kernels arises when the inputs are structured objects
• e.g., comparing two variable-length strings
• Strings x and x′ of lengths D and D′ defined over an alphabet A
• E.g., two amino acid sequences defined over the 20-letter
alphabet A = {A,R,N,D,C,E,Q,G,H,I,L,K,M,F,P,S,T,W,Y,V}
• x is a sequence of length 110:
IPTSALVKETLALLSTHRTLLIANETLRIPVPVHKN……VNQFLDYLQEFLGVMNTEWI
Mercer Kernel
• If s is a substring of x, we can write x = usv for
some (possibly empty) strings u and v
• ϕs(x) is the number of times substring s appears in x
• Define the kernel between two strings x and x′ as
k(x,x′) = Σs∈A* ws ϕs(x) ϕs(x′)
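A sketch of the simplest case, restricting to substrings of a fixed length p with unit weights ws (sometimes called a spectrum kernel): count substring occurrences in each string and sum the products.

```python
from collections import Counter

# Fixed-length substring kernel: phi_s(x) counts occurrences of s in x;
# with unit weights, k(x,x') = sum_s phi_s(x) phi_s(x').
def substring_counts(x, p):
    return Counter(x[i:i + p] for i in range(len(x) - p + 1))

def string_kernel(x, x_prime, p=2):
    c1, c2 = substring_counts(x, p), substring_counts(x_prime, p)
    return sum(c1[s] * c2[s] for s in c1)
```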
Fisher Kernel
• An alternative technique for using generative models
• e.g., in document retrieval, protein sequences
• Consider a parametric generative model p(x|θ)
where θ denotes the vector of parameters
• Goal: find a kernel that measures the similarity of two
vectors x and x′ induced by the generative model
• Define the Fisher score as the gradient wrt θ
g(θ,x) = ∇θ ln p(x|θ), a vector of the same dimensionality as θ
• (the Fisher score is more generally the gradient of the log-likelihood)
• The Fisher kernel is then
k(x,x′) = g(θ,x)T F−1 g(θ,x′)
• where F is the Fisher information matrix
F = Ex[ g(θ,x) g(θ,x)T ]
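A minimal worked sketch for a hypothetical one-dimensional Gaussian model p(x|μ) = N(x | μ, 1): the Fisher score is g(μ,x) = ∂/∂μ ln p(x|μ) = x − μ, the Fisher information is F = E[g²] = 1, so the kernel reduces to k(x,x′) = (x − μ)(x′ − μ).

```python
# Fisher kernel sketch for p(x|mu) = N(x | mu, 1).
def fisher_score(mu, x):
    return x - mu              # gradient of ln N(x | mu, 1) wrt mu

def fisher_kernel(mu, x, x_prime, F=1.0):
    # F is the Fisher information; for N(mu, 1) wrt mu, F = 1.
    return fisher_score(mu, x) * (1.0 / F) * fisher_score(mu, x_prime)
```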