Presentation of paper #7:

Nonlinear component analysis as a kernel eigenvalue problem
Schölkopf, Smola, Müller
Neural Computation 10, 1299-1319, MIT Press (1998)


Group C: M. Filannino, G. Rates, U. Sandouk
COMP61021: Modelling and Visualization of high-dimensional data
Introduction
● Kernel Principal Component Analysis (KPCA)
  ○ KPCA is an extension of Principal Component Analysis (PCA)
  ○ It computes PCA in a new, typically higher-dimensional, feature space
  ○ Useful for feature extraction and dimensionality reduction
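
For a first feel of what the technique does, here is a minimal usage sketch with scikit-learn's KernelPCA (a modern library implementation, not part of the paper; the data and parameter values are illustrative):

import numpy as np
from sklearn.decomposition import KernelPCA

X = np.random.rand(200, 10)        # toy data: 200 points in 10 dimensions
kpca = KernelPCA(n_components=2,   # keep 2 nonlinear principal components
                 kernel="rbf",     # Gaussian (RBF) kernel
                 gamma=0.5)        # illustrative kernel width
Z = kpca.fit_transform(X)          # project the data onto the kernel PCs
print(Z.shape)                     # (200, 2)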
Motivation: possible solutions
Principal Curves

Trevor Hastie; Werner Stuetzle, “Principal Curves,” Journal of the American
Statistical Association, Vol. 84, No. 406 (Jun. 1989), pp. 502-516.

●   Optimization (including the quality of data approximation)
●   Natural geometric meaning
●   Natural projection

http://pisuerga.inf.ubu.es/cgosorio/Visualization/imgs/review3_html_m20a05243.png
Motivation: possible solutions
Autoencoders

Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of
data with neural networks. Science, 313, 504-507.

●   Feed-forward neural network
●   Approximates the identity function

http://www.nlpca.de/fig_NLPCA_bottleneck_autoassociative_autoencoder_neural_network.png
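
A minimal sketch of the bottleneck idea, using scikit-learn's MLPRegressor as a stand-in autoencoder (the architecture and parameter values are illustrative, not those of Hinton & Salakhutdinov):

import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.random.rand(500, 10)                      # toy data in 10 dimensions
# A network trained to reproduce its own input; the 2-unit middle layer is the
# bottleneck that forces a low-dimensional code.
autoencoder = MLPRegressor(hidden_layer_sizes=(8, 2, 8),   # encoder -> code -> decoder
                           activation="tanh",
                           max_iter=2000)
autoencoder.fit(X, X)                            # approximate the identity function
reconstruction_error = np.mean((autoencoder.predict(X) - X) ** 2)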
Motivation: some new problems


● Limited to low input dimensions

● Problem dependent

● Hard optimization problems
Motivation: kernel trick
KPCA captures the overall variance of patterns
Motivation: kernel trick

(Sequence of illustration slides and a video demonstrating the kernel trick.)
Principle

(Diagram: input Data are mapped nonlinearly to Features / New features.)

"We are not interested in PCs in the input space, we are
interested in PCs of features that are nonlinearly related to
the original ones"
Principle
Given a data set of N centered observations x_1, ..., x_N in a d-dimensional space,

●   PCA diagonalizes the covariance matrix:

        C = (1/N) Σ_{j=1..N} x_j x_jᵀ

●   It is necessary to solve the following system of equations (the eigenvalue problem):

        λ v = C v,   with λ ≥ 0 and v ∈ R^d

●   We can define the same computation in another dot product space F, related to the
    input space by a (possibly nonlinear) map Φ: R^d → F.
Principle
Given a data set of N centered observations Φ(x_1), ..., Φ(x_N) in the high-dimensional space F,

●   Covariance matrix in the new space:

        C̄ = (1/N) Σ_{j=1..N} Φ(x_j) Φ(x_j)ᵀ

●   Again, it is necessary to solve the following system of equations:

        λ V = C̄ V

●   This means that all solutions V with λ ≠ 0 lie in the span of Φ(x_1), ..., Φ(x_N):

        V = Σ_{i=1..N} α_i Φ(x_i)
Principle
●   Combining the last three equations, we obtain:

        λ (Φ(x_k) · V) = (Φ(x_k) · C̄ V)   for all k = 1, ..., N

●   we define a new (kernel) function

        k(x_i, x_j) = (Φ(x_i) · Φ(x_j))

●   and a new N x N matrix:

        K_ij = k(x_i, x_j)

●   our equation becomes:

        N λ α = K α
Principle
●   let λ1 ≤ λ2 ≤ ... ≤ λN denote the eigenvalues of K, and α1, ..., αN the
    corresponding eigenvectors, with λp being the first nonzero eigenvalue
    then we require they are normalized in F:




●   Encoding a data point y means computing:
Algorithm


● Centralization
  For a given data set, subtract the mean from all observations to obtain
  centered data in the input space.
● Finding principal components
  Compute the matrix K_ij = k(x_i, x_j) using the kernel function, and find its
  eigenvectors α^k and eigenvalues λ_k.
● Encoding training/testing data
  Compute (V^k · Φ(x)) = Σ_i α_i^k k(x_i, x), where x is the vector to be
  encoded. This can be done since we have calculated the eigenvalues and
  eigenvectors.
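
A minimal NumPy sketch of these steps (the function, variable names, and toy kernel are illustrative, not taken from the paper):

import numpy as np

def kernel_pca(X, kernel, n_components):
    # Step 1: Gram matrix K_ij = k(x_i, x_j)
    N = X.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    # Step 2: center in feature space: K~ = K - 1_N K - K 1_N + 1_N K 1_N
    one_n = np.full((N, N), 1.0 / N)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Step 3: eigendecomposition (eigh returns ascending order; flip to descending)
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # Step 4: normalize so that lambda_k (alpha^k . alpha^k) = 1
    alphas = eigvecs[:, :n_components] / np.sqrt(np.maximum(eigvals[:n_components], 1e-12))
    # Step 5: encode the training data (projections onto the kernel PCs)
    Z = Kc @ alphas
    return Z, alphas, K          # keep the uncentered Gram matrix for test points

# Example with a degree-2 polynomial kernel on toy data
poly2 = lambda x, y: (x @ y) ** 2
X = np.random.rand(50, 2) - 0.5
Z, alphas, K = kernel_pca(X, poly2, n_components=3)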
Algorithm
● Reconstructing training data
  The operation cannot be done exactly, because the eigenvectors in F do not,
  in general, have pre-images in the original input space.
● Reconstructing a test data point
  The operation cannot be done exactly for the same reason: the eigenvectors
  do not, in general, have pre-images in the original input space.
Disadvantages
● Centering in the original space does not imply centering in F; we need
  to adjust the K matrix as follows:

      K̃ = K - 1_N K - K 1_N + 1_N K 1_N,   where (1_N)_ij = 1/N

● KPCA is now a parametric technique:
  ○ choice of a proper kernel function
     ■ Gaussian, sigmoid, polynomial
  ○ Mercer's theorem
     ■ k(x,y) must be continuous, symmetric, and positive semi-definite
       (xᵀAx ≥ 0)
     ■ it guarantees that K has no negative eigenvalues, i.e. that k is a
       dot product in some feature space F
● Exact data reconstruction is not possible, unless an approximate pre-image
  is used, e.g. a point z minimizing ||Φ(z) - P_n Φ(x)||² (see [1], [9], [10])
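
The same centering correction applies to the kernel values of a new test point. A companion sketch, continuing the illustrative kernel_pca example from the Algorithm section (the helper name is ours, not the paper's):

def project_new_point(y, X, kernel, alphas, K):
    # K is the *uncentered* training Gram matrix returned by kernel_pca above.
    k_y = np.array([kernel(y, x) for x in X])   # kernel vector vs. the training set
    # Centering correction for a test point:
    # k~_y[j] = k_y[j] - mean_i K[i, j] - mean(k_y) + mean(K)
    k_y_centered = k_y - K.mean(axis=0) - k_y.mean() + K.mean()
    # Projections onto the kernel principal components
    return k_y_centered @ alphas

y_new = np.random.rand(2) - 0.5
z_new = project_new_point(y_new, X, poly2, alphas, K)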
Advantages


●   Time complexity

    ○   we will return to this point later

●   Handles nonlinearly separable problems

●   Extraction of more principal components than linear PCA (up to N rather than d)

    ○   Feature extraction vs. dimensionality reduction
Experiments

●   Applications
●   Data Sets
●   Methods compared
●   Assessment
●   Experiments
●   Results
Applications
●   Clustering
    ○   Density estimation
        ■ e.g. high correlation between features
    ○   De-noising
        ■ e.g. removing lighting effects from bright images
    ○   Compression
        ■ e.g. image compression

●   Classification
    ○   e.g. categorisation
Datasets

Experiment name                 Created by                             Representation

● Simple example 1 (1+2=3)      Uniform distribution on [-1, 1];       y = x²
                                y = x² + C, noise C with sd 0.1        - Unlabelled
                                                                       - 2 dimensions

● Simple example 2 / Kernels    Three Gaussians, sd = 0.1,             Three clusters
  (1+2=3)                       Dist [1, 1] x [0.5, 1]                 - Unlabelled
                                                                       - 2 dimensions

● De-noising                    The eleven Gaussians,                  A circle and a square
                                Dist [-1, 1] with zero mean            - Unlabelled
                                                                       - 10 dimensions

● USPS Character                                                       Handwritten digits
  Recognition                                                          - Labelled
                                                                       - 256 dimensions
                                                                       - 9298 digits
Experiments
1 Simple Example 1
  Dataset: 1+2=3, uniform distribution, sd = 0.2
  Kernel: polynomial, degree 1 – 4

2 USPS Character Recognition
  Dataset: USPS
  Methods: five-layer neural networks, kernel PCA, SVM
  Parameters:
    Kernel PCA: polynomial kernel of degree 1 – 7; components 32 – 2048 (doubling)
    Neural networks and SVM: the best parameters for the task

3 De-noising
  Dataset: de-noising, 11 Gaussians, sd = 0.1
  Methods: kernel autoencoders, principal curves, kernel PCA, linear PCA
  Parameters: the best parameters for the task

4 Kernels
  Kernels: radial basis function, sigmoid
  Parameters: the best parameters for the task
Methods
These are the methods compared in the experiments:

Classification (supervised)            Dimensionality reduction (unsupervised)
● Neural networks                      ● Linear PCA (linear)
● SVM                                  ● Kernel PCA (non-linear)
● Kernel LDA (face recognition)        ● Kernel autoencoders (non-linear)
                                       ● Principal curves (non-linear)
Assessment
●   1 Accuracy
    Classification: exact classification
    Clustering: comparable to other clusters

●   2 Time complexity
    The time to compute

●   3 Storage complexity
    The storage of the data

●   4 Interpretability
    How easy it is to understand
Simple Example
 ●   Recreated example                             ●   Nonlinear PCA paper example
     Dataset: the USPS handwritten digits              Dataset: 1+2=3, uniform distribution, sd 0.2
     Training set: 3000                                Classifier: polynomial kernel, degree 1 – 4
     Classifier: SVM, dot-product kernel 1 – 7         PC: 1 – 3
     PC: 32 – 2048 (doubling)

(Figure: the data are mapped to 3D by a kernel, PCA is done there, and the result is
shown in 2D; the eigenvectors 1 – 3 of highest eigenvalue give accurate clustering of
the nonlinear features. Polynomial kernel of degree 1 – 4 applied to the function
y = x² + B, with noise B of sd = 0.2 from the uniform distribution on [-1, 1].)
Character recognition
     Dataset: the USPS handwritten digits
     Training set: 3000
     Classifier: SVM with dot-product (polynomial) kernel of degree 1 – 7
     PC: 32 – 2048 (doubling)

●   The performance of a linear classifier trained on nonlinear
    components is better than one trained on the same number of
    linear components

●   The performance improves over linear PCA as the number of
    components is increased

Fig.: Results of the character recognition experiment.
De-noising
  Dataset: the de-noising eleven Gaussians
  Training set: 100
  Classifier: Gaussian kernel (sd parameter)
  PC: 2

The de-noising is performed on the nonlinear features of the distribution.

Fig.: Results of the de-noising experiment.
Kernels
The choice of kernel regulates the accuracy of the algorithm and is dependent on the
application. The Mercer kernels used to build the Gram matrix are:

    polynomial:              k(x, y) = (x · y)^d
    radial basis function:   k(x, y) = exp(-||x - y||² / (2 σ²))
    sigmoid:                 k(x, y) = tanh(κ (x · y) + Θ)

Experiments

Radial Basis Function
Dataset: three Gaussians, sd 0.1
Kernel: k(x, y) = exp(-||x - y||² / 0.1)
PC: 1 – 8

Sigmoid
Dataset: three Gaussians, sd 0.1
Kernel: sigmoid
PC: 1 – 3
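
For concreteness, the three kernels as plain Python functions (the default parameter values are illustrative only):

import numpy as np

def polynomial_kernel(x, y, d=4):
    return (x @ y) ** d

def rbf_kernel(x, y, sigma=0.2):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (x @ y) + theta)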
Results

RBF (PC 1 – 8 shown)
- PCs 1 – 2 separate the 3 clusters
- PCs 3 – 5 halve the clusters
- PCs 6 – 8 split them orthogonally
- The clusters are split into 12 regions

Sigmoid (PC 1 – 3 shown)
- PCs 1 – 2 separate the 3 clusters
- PC 3 halves the 3 clusters
- The same number of PCs is needed to separate the clusters
- The sigmoid kernel needs fewer PCs to halve them
Results
                      Experiment 1      Experiment 2     Experiment 3     Experiment 4

 1 Accuracy
   Kernel             Polynomial 4      Polynomial 4     Gaussian 0.2     Sigmoid
   Components         8 (split to 12)   512              2                3 (split to 6)
   Accuracy                             4.4

 2 Time

 3 Space

 4 Interpretability   Very good         Very good        Complicated      Very good
Discussions: KDA
Kernel Fisher Discriminant (KDA)

Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, Klaus-Robert Müller

● Best discriminant projection

http://lh3.ggpht.com/_qIDcOEX659I/S14l1wmtv6I/AAAAAAAAAxE/3G9kOsTt0VM/s1600-h/kda62.png
Discussions
Doing PCA in F rather than in R^d

●   The first k principal components carry more variance than any
    other k directions

●   The mean squared approximation error given by the first k principal
    components is minimal

●   The principal components are uncorrelated
Discussions
Going into a higher-dimensional space to obtain a lower-dimensional
representation

● Pick the right high-dimensional space

Need for a proper kernel

● What kernel to use?
   ○ Gaussian, sigmoid, polynomial
● Problem dependent
Discussions
Time Complexity

● A lot of features (a lot of dimensions) in F.
● KPCA still works!
   ○ It only operates in the subspace of F spanned by the observed x's
   ○ No explicit dot product computation in F
● Computational complexity is hardly changed by the fact that we
   need to evaluate kernel functions rather than just dot products
   ○ (if the kernel is easy to compute)
   ○ e.g. polynomial kernels
                   Payback: a linear classifier can then be used on the nonlinear features.
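
A small sanity check of why the cost hardly changes (a toy illustration, not from the paper): for a degree-2 polynomial kernel, evaluating k(x, y) = (x · y)² equals a dot product in the explicit space of all second-order monomials, without ever constructing that space:

import numpy as np
from itertools import product

def phi_degree2(x):
    # Explicit feature map: all ordered second-order monomials x_i * x_j
    return np.array([xi * xj for xi, xj in product(x, x)])

x = np.random.rand(5)
y = np.random.rand(5)

implicit = (x @ y) ** 2                      # kernel evaluation: O(d)
explicit = phi_degree2(x) @ phi_degree2(y)   # dot product among d^2 features
print(np.isclose(implicit, explicit))        # True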
Discussions
Pre-image reconstruction may be impossible

Approximation can be done in F

Finding an approximate pre-image requires an explicit Φ, or one of the following:

● solving a regression learning problem
● solving a non-linear optimization problem
● an algebraic solution (rarely available)
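
For the Gaussian kernel, an approximate pre-image can be obtained with the fixed-point iteration of Mika et al. ("Kernel PCA and De-Noising in Feature Spaces", cited in the references). A sketch, assuming gamma holds the expansion coefficients of the projected point in F (all names are illustrative):

import numpy as np

def gaussian_preimage(X, gamma, sigma, n_iter=100):
    # Fixed-point iteration z <- sum_i gamma_i k(z, x_i) x_i / sum_i gamma_i k(z, x_i)
    # with the Gaussian kernel k(z, x) = exp(-||z - x||^2 / (2 sigma^2)).
    z = X[np.argmax(gamma)]                   # start from a plausible training point
    for _ in range(n_iter):
        w = gamma * np.exp(-np.sum((X - z) ** 2, axis=1) / (2 * sigma ** 2))
        if abs(w.sum()) < 1e-12:              # guard against division by (near) zero
            break
        z = (w[:, None] * X).sum(axis=0) / w.sum()
    return z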
Discussions
Interpretability

● Cross-feature features

   ○   Dependent on the kernel

● Reduced-space features

   ○   Preserve the highest variance among the data in F.
Conclusions
Applications

●   Feature Extraction (Classification)

●   Clustering

●   Denoising

●   Novelty detection

●   Dimensionality Reduction (Compression)
References
[1] J.T. Kwok and I.W. Tsang, “The Pre-Image Problem in Kernel Methods,”
IEEE Trans. Neural Networks, vol. 15, no. 6, pp. 1517-1525, 2004.
[2] G.E. Hinton and R.R. Salakhutdinov, “Reducing the dimensionality of data
with neural networks,” Science, 313, pp. 504-507, 2006.
[3] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, “Fisher
discriminant analysis with kernels,” Neural Networks for Signal Processing IX,
IEEE, 1999.
[4] T. Hastie and W. Stuetzle, “Principal Curves,” Journal of the American
Statistical Association, vol. 84, no. 406, pp. 502-516, Jun. 1989.
[5] G. Moser, “Analisi delle componenti principali” (Principal component
analysis), Tecniche di trasformazione di spazi vettoriali per analisi statistica
multi-dimensionale.
[6] I.T. Jolliffe, Principal Component Analysis, Springer-Verlag, 2002.
[7] Wikipedia, “Kernel Principal Component Analysis”, 2011.
[8] A. Ghodsi, “Data visualization”, 2006.
[9] B. Schölkopf, S. Mika, A. Smola, G. Rätsch, and K.-R. Müller, “Kernel PCA
pattern reconstruction via approximate pre-images,” in Proceedings of the 8th
International Conference on Artificial Neural Networks, pp. 147-152, 1998.
References
[10] J.T. Kwok and I.W. Tsang, “The pre-image problem in kernel methods,” in
Proceedings of the Twentieth International Conference on Machine Learning
(ICML-2003), 2003.
[11] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, “An
Introduction to Kernel-Based Learning Algorithms,” IEEE Transactions on
Neural Networks, vol. 12, no. 2, March 2001.
[12] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, and G. Rätsch,
“Kernel PCA and De-Noising in Feature Spaces,” in Advances in Neural
Information Processing Systems 11, 1999.
Thank you
