
Principal Component Analysis

Chris Ding
Department of Computer Science and Engineering
University of Texas at Arlington
PCA is the procedure of finding the intrinsic dimensions of the data

1. Data analysis
2. Data reduction
3. Data visualization

Represent high-dimensional data in a low-dimensional space


High-dimensional data

Examples: gene expression, face images, handwritten digits


Example…
Application of feature reduction

• Face recognition
• Handwritten digit recognition
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification
Use PCA to approximate an image (a data matrix)

[Figure: a 112 × 92 face image and its PCA approximations with k = 10, 20, 30, and 40 components, compared with the original.]
Use PCA to approximate a set of images

[Figure: a set of face images and their PCA approximations with k = 1, 2, 4, and 6 components, compared with the originals.]
Display the characters in 2-dim space

$$\tilde{x} = G^T x = \begin{bmatrix} a_1^T x \\ a_2^T x \end{bmatrix}$$
Intrinsic dimensions of the data
Samples of children: hours of study, hours on the internet, vs. their age

[Figure: 3-D scatter plot with axes: hours on study / homework, hours on internet, children's age.]
Intrinsic dimensions of the data
Samples of children: hours of study, hours on the internet, vs. their age

The data lie in a subspace (the intrinsic dimensions).

[Figure: the same 3-D scatter plot, with the subspace containing the data highlighted.]
PCA is the procedure of finding the intrinsic dimensions of the data
• Find the lines that best represent the data
• PCA is a rotation of the space to the proper directions (the principal directions)
Geometric picture of principal components (PCs)

• The 1st PC $z_1$ is a minimum-distance fit to a line in the $x$ space.
• The 2nd PC $z_2$ is a minimum-distance fit to a line in the plane perpendicular to the 1st PC.

PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones.
PCA represents the data: the closer the data lie to a linear subspace, the more accurate the representation.
PCA Step 0: move the coordinate origin to the data center
This is equivalent to centering the data.

[Figure: the 3-D scatter plot (hours on study / homework, hours on internet, children's age) with the origin moved to the data mean.]
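A tiny sketch of this step in NumPy (not part of the original slides; the numbers in `X` are made up, one sample per row):

```python
import numpy as np

X = np.array([[2.0, 14.0,  8.0],   # hours of study, hours on internet, age (made-up values)
              [1.0, 20.0, 10.0],
              [3.0,  9.0, 12.0]])

Xc = X - X.mean(axis=0)            # Step 0: move the origin to the data center
print(Xc.mean(axis=0))             # each column now has (numerically) zero mean
```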
PCA Step 1: find a line that best represents the data

[Figure: a sequence of slides showing different candidate lines through the centered 3-D data (hours on study / homework, hours on internet, children's age).]
PCA Step 1: find a line that best represents the data

[Figure: the same 3-D data with a candidate line and each point's projection error onto the line.]

Which error to minimize? Minimize the sum of squared projection errors.
PCA Step 1: find the line that best represents the data
Fitting the data to a curve (a straight line, the simplest curve).

[Figure: the best-fit line through the 3-D data (hours on study / homework, hours on internet, children's age).]

Minimizing the sum of squared projection errors gives the 1st principal direction.
PCA directions are eigenvectors of the covariance matrix
Repeating this process to find the 2nd, 3rd, … lines that best fit the remaining variation gives $u_2, u_3, \ldots, u_k$.
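As a numerical illustration of this equivalence (not from the slides; the synthetic data and names are illustrative), the following sketch checks that the unit direction minimizing the sum of squared projection errors coincides with the top eigenvector of the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # 200 samples, 2 dims
X = X - X.mean(axis=0)                                              # center the data

S = X.T @ X / len(X)                  # covariance matrix
w, V = np.linalg.eigh(S)              # eigenvalues in ascending order
u1 = V[:, -1]                         # eigenvector of the largest eigenvalue

# Brute-force search over unit directions for the one with the smallest
# sum of squared projection errors (distance from each point to the line).
angles = np.linspace(0.0, np.pi, 1000)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)           # candidate unit directions
errors = (X ** 2).sum() - ((X @ dirs.T) ** 2).sum(axis=0)           # Pythagoras: total - projected
best = dirs[np.argmin(errors)]

print(abs(best @ u1))   # close to 1: the best-fit line is the top eigenvector direction
```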
Intrinsic dimensions of the data
Samples of children: hours of study, hours on the internet, vs. their age

[Figure: the same 3-D scatter plot of the children data.]
PCA from maximum variance
PCA from maximum spread-out

PCA represents the data: the closer the data lie to a linear subspace, the more accurate the representation.

[Figure: the same data with one direction of smaller variance and one of larger variance marked.]

Larger spread-out = larger variance.
What is Principal Component Analysis?

• Principal component analysis (PCA)
  – Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables
  – Retains most of the sample's information
  – Useful for the compression and classification of data

• By information we mean the variation present in the sample, given by the correlations between the original variables.
  – The new variables, called principal components (PCs), are uncorrelated and are ordered by the fraction of the total information each retains.

Principal Component as maximum variance


Let $x = (x_1, x_2, \ldots, x_p)^T$ be a vector random variable in $p$ dimensions/variables, with coordinate axes $(e_1, e_2, \ldots, e_p)$.

Given $n$ observations/samples of $x$:
$$x_1, x_2, \ldots, x_n \in \mathbb{R}^p.$$

The first principal component: define a scalar random variable as a linear combination of the dimensions,
$$z_1 = a_1^T x = \sum_{j=1}^{p} a_{j1} x_j, \qquad a_1 = (a_{11}, a_{21}, \ldots, a_{p1})^T,$$
and choose $a_1$ so that $\mathrm{var}[z_1]$ is maximized.
Principal Component as maximum variance
Because
$$\mathrm{var}[z_1] = E\big[(z_1 - \bar{z}_1)^2\big] = \frac{1}{n}\sum_{i=1}^{n}\big(a_1^T x_i - a_1^T \bar{x}\big)^2 = \frac{1}{n}\sum_{i=1}^{n} a_1^T (x_i - \bar{x})(x_i - \bar{x})^T a_1 = a_1^T S a_1,$$
where
$$S = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$$
is the covariance matrix and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the mean.

In the following, we assume the data are centered: $\bar{x} = 0$.
Principal Component as maximum variance

To find $a_1$ that maximizes $\mathrm{var}[z_1]$ subject to $a_1^T a_1 = 1$, let $\lambda$ be a Lagrange multiplier:
$$L = a_1^T S a_1 - \lambda (a_1^T a_1 - 1)$$
$$\frac{\partial L}{\partial a_1} = S a_1 - \lambda a_1 = 0 \quad\Longrightarrow\quad S a_1 = \lambda a_1$$
("eigen" is German for "own/self": the operator, here a matrix, maps the vector to a multiple of itself.)

Therefore $a_1$ is an eigenvector of $S$ corresponding to the largest eigenvalue $\lambda = \lambda_1$.
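A minimal NumPy check of this result (not part of the slides; the synthetic data and variable names are illustrative): compute $S$, take its top eigenvector as $a_1$, and verify that the variance of $z_1 = a_1^T x$ equals the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # n = 500 samples in p = 5 dims
Xc = X - X.mean(axis=0)                                  # center: x_bar = 0

S = Xc.T @ Xc / len(Xc)           # covariance matrix S
evals, evecs = np.linalg.eigh(S)  # eigenvalues in ascending order
a1, lam1 = evecs[:, -1], evals[-1]

z1 = Xc @ a1                      # scores of the first principal component
print(np.isclose(z1.var(), lam1))       # var[z1] equals the largest eigenvalue
print(np.allclose(S @ a1, lam1 * a1))   # S a1 = lambda1 a1
```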


Algebraic derivation of PCs

To find the next coefficient vector $a_2$ maximizing $\mathrm{var}[z_2]$
subject to $\mathrm{cov}[z_2, z_1] = 0$ (uncorrelated) and $a_2^T a_2 = 1$:

First note that $\mathrm{cov}[z_2, z_1] = a_1^T S a_2 = \lambda_1 a_1^T a_2$.

Then let $\lambda$ and $\phi$ be Lagrange multipliers, and maximize
$$L = a_2^T S a_2 - \lambda (a_2^T a_2 - 1) - \phi\, a_2^T a_1.$$
Algebraic derivation of PCs

$$L = a_2^T S a_2 - \lambda (a_2^T a_2 - 1) - \phi\, a_2^T a_1$$
$$\frac{\partial L}{\partial a_2} = S a_2 - \lambda a_2 - \phi a_1 = 0 \quad\Longrightarrow\quad \phi = 0$$
$$S a_2 = \lambda a_2 \qquad\text{and}\qquad \lambda = a_2^T S a_2$$
Algebraic derivation of PCs

We find that $a_2$ is also an eigenvector of $S$, whose eigenvalue $\lambda = \lambda_2$ is the second largest.

In general,
$$\mathrm{var}[z_k] = a_k^T S a_k = \lambda_k.$$

• The $k$-th largest eigenvalue of $S$ is the variance of the $k$-th PC.
• The $k$-th PC $z_k$ retains the $k$-th greatest fraction of the variation in the sample.
Projection to PCA subspace
• Main steps for computing the PCA subspace
  – Form the covariance matrix $S$.
  – Compute its eigenvectors $\{a_i\}_{i=1}^{p}$ (elsewhere also written $u_1, u_2, \ldots, u_k$).
  – The PCA subspace is spanned by the first $d$ eigenvectors $\{a_i\}_{i=1}^{d}$.
  – The transformation $G$ is given by
$$G = [a_1, a_2, \ldots, a_d], \qquad \tilde{x} = G^T x = \begin{bmatrix} a_1^T x \\ a_2^T x \\ \vdots \\ a_d^T x \end{bmatrix}$$

$$x \in \mathbb{R}^p \;\longrightarrow\; \tilde{x} = G^T x \in \text{PCA subspace}$$
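A minimal sketch of these steps (not from the slides; the function `pca_subspace` and its names are illustrative), assuming the data matrix stores one sample per column as on the following slides:

```python
import numpy as np

def pca_subspace(X, d):
    """Return G = [a1, ..., ad] and the projected data G^T X.

    X : (p, n) data matrix, one sample per column.
    d : number of principal directions to keep.
    """
    Xc = X - X.mean(axis=1, keepdims=True)   # center the data
    S = Xc @ Xc.T / X.shape[1]               # covariance matrix (p x p)
    evals, evecs = np.linalg.eigh(S)         # ascending eigenvalues
    G = evecs[:, ::-1][:, :d]                # first d eigenvectors (largest eigenvalues)
    return G, G.T @ Xc                       # projection onto the PCA subspace

# Example: project 3-dimensional samples onto a 2-dimensional PCA subspace.
rng = np.random.default_rng(2)
X = rng.normal(size=(3, 100))
G, Y = pca_subspace(X, d=2)
print(G.shape, Y.shape)   # (3, 2) (2, 100)
```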
Algebraic derivation of PCs

Assume $\bar{x} = 0$ and form the matrix
$$X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n}, \qquad \text{then } S = \frac{1}{n} X X^T.$$

Obtain the eigenvectors of $S$ by computing the SVD of $X$:
$$X = U \Sigma V^T, \qquad \text{so } U^T X = \Sigma V^T.$$
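A quick numerical check of this relation (a sketch, not from the slides): the eigenvalues of $S$ are the squared singular values of $X$ divided by $n$, and the projected data satisfy $U^T X = \Sigma V^T$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 50))
X = X - X.mean(axis=1, keepdims=True)      # assume x_bar = 0 (centered columns)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
S = X @ X.T / X.shape[1]

# Eigenvalues of S are the squared singular values divided by n.
print(np.allclose(np.sort(np.linalg.eigvalsh(S)), np.sort(s**2 / X.shape[1])))

# The PCA projections of the data: U^T X = Sigma V^T.
print(np.allclose(U.T @ X, np.diag(s) @ Vt))
```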
Homework:

After you
1. compute the covariance matrix $S$, and
2. obtain the first $k$ eigenvectors of $S$ as $(u_1, \ldots, u_k)$,

show that you can obtain $(v_1, \ldots, v_k)$ by matrix–vector multiplications alone; there is no need to compute the eigenvectors of the kernel (Gram) matrix.
Reduction and Reconstruction

• Dimension reduction: $X \in \mathbb{R}^{p \times n} \;\rightarrow\; Y = G^T X \in \mathbb{R}^{d \times n}$, with $G \in \mathbb{R}^{p \times d}$ (so $G^T \in \mathbb{R}^{d \times p}$).
• Reconstruction: $Y = G^T X \in \mathbb{R}^{d \times n} \;\rightarrow\; \hat{X} = G (G^T X) \in \mathbb{R}^{p \times n}$.
Optimality property of PCA
Main theoretical result:
The matrix $G$ consisting of the first $d$ eigenvectors of the covariance matrix $S$ solves the following minimization problem:
$$\min_{G \in \mathbb{R}^{p \times d}} \; \| X - G(G^T X) \|_F^2 \quad \text{subject to } G^T G = I_d,$$
where $\| X - \hat{X} \|_F^2$ is the reconstruction error.

The PCA projection minimizes the reconstruction error among all linear projections of size $d$.
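A small sketch illustrating this optimality numerically (not from the slides; the random data are illustrative): the PCA basis yields a smaller Frobenius reconstruction error than a random orthonormal basis of the same size.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 200))
X = X - X.mean(axis=1, keepdims=True)
d = 3

# PCA basis: first d eigenvectors of the covariance matrix.
evals, evecs = np.linalg.eigh(X @ X.T / X.shape[1])
G_pca = evecs[:, ::-1][:, :d]

# A random orthonormal basis of the same size, for comparison.
G_rand, _ = np.linalg.qr(rng.normal(size=(10, d)))

def recon_error(G):
    return np.linalg.norm(X - G @ (G.T @ X), 'fro') ** 2

print(recon_error(G_pca) <= recon_error(G_rand))   # True: PCA minimizes the error
```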
Applications of PCA

• Eigenfaces for recognition. Turk and Pentland, 1991.
• Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo, 2001.
• Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien, 2003.
Outline of lecture

• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels
Motivation

Linear projections will not detect the pattern.
Nonlinear PCA using Kernels

• Traditional PCA applies a linear transformation
  – May not be effective for nonlinear data

• Solution: apply a nonlinear transformation to a potentially very high-dimensional feature space:
$$\phi : x \rightarrow \phi(x)$$

• Computational efficiency: apply the kernel trick.
  – Requires that PCA can be rewritten in terms of dot products:
$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) \quad \text{(more on kernels later)}$$
Nonlinear PCA using Kernels

Rewrite PCA in terms of dot products.

Assume the data have been centered, i.e., $\sum_i x_i = 0$.

The covariance matrix $S$ can be written as $S = \frac{1}{n}\sum_i x_i x_i^T$.

Let $v$ be an eigenvector of $S$ corresponding to a nonzero eigenvalue $\lambda$:
$$S v = \frac{1}{n}\sum_i x_i x_i^T v = \lambda v \quad\Longrightarrow\quad v = \frac{1}{\lambda n}\sum_i (x_i^T v)\, x_i.$$

Eigenvectors of $S$ lie in the space spanned by all the data points.
Nonlinear PCA using Kernels
$$S v = \frac{1}{n}\sum_i x_i x_i^T v = \lambda v \quad\Longrightarrow\quad v = \frac{1}{\lambda n}\sum_i (x_i^T v)\, x_i$$

The covariance matrix can be written in matrix form:
$$S = \frac{1}{n} X X^T, \qquad X = [x_1, x_2, \ldots, x_n].$$

Write $v = \sum_i \alpha_i x_i = X \alpha$. Then
$$S v = \frac{1}{n} X X^T X \alpha = \lambda X \alpha$$
$$\frac{1}{n} (X^T X)(X^T X)\alpha = \lambda (X^T X)\alpha$$
$$\frac{1}{n} (X^T X)\alpha = \lambda \alpha. \qquad \text{Any benefits?}$$
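One benefit, sketched numerically (not from the slides; the random data are illustrative): when $p \gg n$, the $n \times n$ Gram matrix $X^T X$ is far smaller than the $p \times p$ covariance matrix, yet its nonzero eigenvalues coincide with those of $S = \frac{1}{n} X X^T$.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 1000, 20                       # many more dimensions than samples
X = rng.normal(size=(p, n))
X = X - X.mean(axis=1, keepdims=True)

cov_eig  = np.linalg.eigvalsh(X @ X.T / n)    # p x p covariance matrix
gram_eig = np.linalg.eigvalsh(X.T @ X / n)    # n x n Gram matrix

# The n largest eigenvalues of the covariance matrix match the Gram eigenvalues.
print(np.allclose(np.sort(cov_eig)[-n:], np.sort(gram_eig)))
```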
Nonlinear PCA using Kernels

Now consider the feature space $\phi : x \rightarrow \phi(x)$:
$$S^{\phi} = \frac{1}{n} X^{\phi} (X^{\phi})^T, \qquad X^{\phi} = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)].$$

Write $v = \sum_i \alpha_i \phi(x_i) = X^{\phi}\alpha$; as before,
$$\frac{1}{n} (X^{\phi})^T X^{\phi}\, \alpha = \lambda \alpha.$$

The $(i,j)$-th entry of $(X^{\phi})^T X^{\phi}$ is $\phi(x_i) \cdot \phi(x_j)$.

Apply the kernel trick: $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$.

$K$ is called the kernel matrix, and the eigenproblem becomes $\frac{1}{n} K \alpha = \lambda \alpha$.
Nonlinear PCA using Kernels

• Projection of a test point $x$ onto $v$:
$$\phi(x) \cdot v = \phi(x) \cdot \sum_i \alpha_i \phi(x_i) = \sum_i \alpha_i\, \phi(x) \cdot \phi(x_i) = \sum_i \alpha_i K(x, x_i).$$

The explicit mapping $\phi$ is not required here.
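A minimal kernel-PCA sketch along these lines (not part of the slides; it assumes an RBF kernel and, for brevity, skips the usual centering of $K$ in feature space, which the slides sidestep by assuming $\sum_i \phi(x_i) = 0$):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2), computed for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_pca(X, k, gamma=1.0):
    """Return alphas for the first k kernel-PCA directions.

    X : (n, p) data, one sample per row.  This sketch does not center K
    in feature space (the slides assume sum_i phi(x_i) = 0).
    """
    K = rbf_kernel(X, X, gamma)
    evals, evecs = np.linalg.eigh(K / len(X))    # solve (1/n) K alpha = lambda alpha
    order = np.argsort(evals)[::-1][:k]
    return evecs[:, order]

def project(x_new, X, alphas, gamma=1.0):
    """Projection of test points onto each v: sum_i alpha_i K(x, x_i)."""
    return rbf_kernel(x_new, X, gamma) @ alphas

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 2))
alphas = kernel_pca(X, k=2)
print(project(X[:3], X, alphas).shape)   # (3, 2)
```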


Reference

• Principal Component Analysis. I. T. Jolliffe.
• Kernel Principal Component Analysis. Schölkopf et al.
• Geometric Methods for Feature Extraction and Dimensional Reduction. Burges.
Principal component analysis (PCA) = K-means clustering

Move the data points of each cluster to its cluster center (assuming each cluster is roughly spherical).
These K cluster centers span the PCA subspace!
(This can be proved rigorously.)

in p-dim space

One early major advance using matrix analysis
(Zha, He, Ding, et al., NIPS 2000)
(Ding & He, ICML 2004)
PCA ⇔ k-means clustering

- Move every data point to its cluster center
- The K cluster centers span a cluster subspace ((k−1)-dimensional)
- Cluster subspace = PCA subspace (the first k−1 PCA directions)

in p-dim space

One early major advance on PCA and K-means (Zha, He, Ding, et al., NIPS 2000)
(Ding & He, ICML 2004)
The solution of K-means is represented by cluster indicators: the indicator matrix $H$ has one block of 1's per cluster, for clusters of sizes $n_1, n_2, \ldots, n_k$.

We actually use scaled indicators $Q$, in which each cluster's column is scaled by $1/\sqrt{n_j}$, so that
$$Q^T Q = I.$$
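A small sketch of these scaled indicators (not from the slides; the labels are made up, and the $1/\sqrt{n_j}$ scaling is the one that makes the columns orthonormal):

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 1, 2])    # cluster assignment of 6 points, k = 3
n, k = len(labels), labels.max() + 1

H = np.zeros((n, k))
H[np.arange(n), labels] = 1.0            # unscaled cluster indicators

sizes = H.sum(axis=0)                    # n_1, ..., n_k
Q = H / np.sqrt(sizes)                   # scale each column by 1/sqrt(n_j)

print(np.allclose(Q.T @ Q, np.eye(k)))   # True: Q^T Q = I
```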
