
CS464

Introduction to
Machine Learning

Feature Extraction - PCA


Bilkent University
PCA Applications
• Noise reduction
• Data visualization
• Data compression
Data Visualization

• Example:
– Given 53 blood and urine test results (features) from 65 people
– How can we visualize these measurements?
Data Visualization
• Is there a representation better than the coordinate axes?
• Is it really necessary to show all 53 dimensions?
– What if there are strong correlations between the features?
• How could we find the smallest subspace of the 53-dimensional space that keeps most of the information about the original data?

• A solution: Principal Component Analysis


Dimensionality Reduction
Assumption: Data lies in a lower-dimensional space.
Feature Extraction: Lower-Dimensional Projections

• Rather than picking a subset of the features x1, x2, …, xn, we can obtain new features by combining the existing ones:

z1 = w0^(1) + Σi wi^(1) xi
…
zk = w0^(k) + Σi wi^(k) xi

• New features are linear combinations of the old ones
• Reduces dimension when k < n
• Let's consider how to do this in an unsupervised setting (only X, no Y)
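As a rough sketch (not from the original slides), the linear feature construction above can be written in a few lines of NumPy; the sample x, the weights W, and the offsets w0 below are made-up placeholders:

```python
import numpy as np

# Hypothetical example: map n = 5 original features to k = 2 new features.
n, k = 5, 2
x = np.random.randn(n)        # one sample with features x_1 ... x_n (placeholder values)
w0 = np.zeros(k)              # offsets w_0^(1) ... w_0^(k)
W = np.random.randn(k, n)     # weights w_i^(j), one row per new feature

z = w0 + W @ x                # each z_j = w_0^(j) + sum_i w_i^(j) * x_i
print(z.shape)                # (2,) -- k new features built from the n old ones
```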
Data Compression
Reduce data from 2D to 1D
(figure: data plotted against axes in inches and cm)

Andrew Ng Coursera slide


Example

Country   | GDP (trillions of US$) | Per capita GDP (thousands of intl. $) | Human Development Index | Life expectancy | Poverty Index (Gini as percentage) | Mean household income (thousands of US$) | …
Canada    | 1.577  | 39.17 | 0.908 | 80.7 | 32.6 | 67.293 | …
China     | 5.878  | 7.54  | 0.687 | 73   | 46.9 | 10.22  | …
India     | 1.632  | 3.41  | 0.547 | 64.7 | 36.8 | 0.735  | …
Russia    | 1.48   | 19.84 | 0.755 | 65.5 | 39.9 | 0.72   | …
Singapore | 0.223  | 56.69 | 0.866 | 80   | 42.5 | 67.1   | …
USA       | 14.527 | 46.86 | 0.91  | 78.3 | 40.8 | 84.3   | …
…         | …      | …     | …     | …    | …    | …      | …

Andrew Ng Coursera slide


Data Visualization

Country   | z1  | z2
Canada    | 1.6 | 1.2
China     | 1.7 | 0.3
India     | 1.6 | 0.2
Russia    | 1.4 | 0.5
Singapore | 0.5 | 1.7
USA       | 2   | 1.5
…         | …   | …

Andrew Ng Coursera slide

Data represented in two dimensions

Andrew Ng Coursera slide


Principal Component Analysis
A 2D Dataset
First PCA Axis
Second PCA Axis
PCA for 3D Data
PCA Algorithm

• Compute the covariance matrix Σ
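As a minimal sketch, assuming a data matrix X with one sample per row (the random X below is only a placeholder), Σ can be computed as:

```python
import numpy as np

# Placeholder data matrix: 65 samples (rows) x 53 features (columns)
X = np.random.randn(65, 53)

X_centered = X - X.mean(axis=0)                      # mean-center each feature
Sigma = (X_centered.T @ X_centered) / (len(X) - 1)   # d x d sample covariance matrix
# Equivalent shortcut: Sigma = np.cov(X, rowvar=False)
```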


Covariance Matrix & Eigen Vecs

Credit: Vincent Spruyt


Steps in PCA
• Mean-center the data
• Compute the covariance matrix (or the scatter matrix)
• Calculate the eigenvalues and eigenvectors of the covariance matrix
– Eigenvector with the largest eigenvalue λ1 is the 1st principal component (PC)
– Eigenvector with the kth largest eigenvalue λk is the kth PC
– Proportion of variance captured by the kth PC = λk / Σi λi
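The steps above can be sketched in NumPy as follows; this is only an illustrative implementation assuming a data matrix X with one sample per row, not code provided by the course:

```python
import numpy as np

def pca(X, k):
    """Sketch of PCA via eigendecomposition of the covariance matrix."""
    X_centered = X - X.mean(axis=0)              # 1. mean-center the data
    Sigma = np.cov(X_centered, rowvar=False)     # 2. covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(Sigma)     # 3. eigenvalues/eigenvectors (symmetric matrix)
    order = np.argsort(eigvals)[::-1]            #    sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]                           # columns are the top-k PCs
    Z = X_centered @ W                           # data projected onto the PCs
    var_captured = eigvals[:k] / eigvals.sum()   # lambda_k / sum_i lambda_i for each kept PC
    return W, Z, var_captured
```

For example, pca(X, 2) on the 53-feature blood/urine data would give a 2D view together with the fraction of variance each axis retains.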
Steps in PCA with notation
Scaling up
• The covariance matrix can be really big!
– Σ is d x d
– 10,000 features can be common!
– finding eigenvectors is very slow...
• Use singular value decomposition (SVD), which takes the input X and finds the top k eigenvectors
– fast implementations available, e.g., Matlab's svd
SVD
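A sketch of the SVD route mentioned above (again illustrative, not the course's implementation): the right singular vectors of the mean-centered data matrix are the eigenvectors of the covariance matrix, so the d x d covariance matrix never has to be formed explicitly.

```python
import numpy as np

def pca_svd(X, k):
    """Sketch of PCA via SVD of the mean-centered data matrix."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions; the singular values s relate to the
    # covariance eigenvalues via lambda_i = s_i**2 / (n - 1).
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:k].T                          # top-k principal components as columns
    Z = X_centered @ W                    # reduced-dimensionality representation
    eigvals = s ** 2 / (len(X) - 1)
    return W, Z, eigvals[:k] / eigvals.sum()
```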
Applying PCA
• The full set of PCs comprises a new orthogonal basis for the feature space, whose axes are aligned with the directions of maximum variance in the original data.

• Projection of the original data onto the first k PCs gives a reduced-dimensionality representation of the data.

• Transforming the reduced-dimensionality projection back into the original space gives a reduced-dimensionality reconstruction of the original data.

• The reconstruction will have some error.
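A minimal sketch of the projection and reconstruction described above, assuming W holds the top-k PCs as columns and mu the per-feature means (e.g., from the pca() sketch earlier):

```python
import numpy as np

def project_and_reconstruct(X, W, mu):
    """Project onto the first k PCs, then map back to the original space."""
    Z = (X - mu) @ W                       # reduced-dimensionality representation
    X_hat = Z @ W.T + mu                   # reconstruction in the original feature space
    mse = np.mean((X - X_hat) ** 2)        # mean squared reconstruction error
    return X_hat, mse
```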


PCA
(figure: original data; mean-centered data with PCs overlaid)

PCA example
(figure: original data projected into full PC space; original data reconstructed using only a single PC)
Dimensionality reduction with PCA
• In high-dimensional problems, data usually lies near a linear subspace, since noise introduces only small variability
• Only keep the data projections onto the principal components with large eigenvalues
• Components of lesser significance can be ignored
• You might lose some information, but if the discarded eigenvalues are small, you don't lose much

Slide from Aarthi Singh
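One common way to pick how many components to keep (a heuristic sketch, not something prescribed on the slides) is the smallest k whose eigenvalues capture a target fraction of the total variance:

```python
import numpy as np

def choose_k(eigvals, target=0.95):
    """Smallest k such that the top-k eigenvalues capture `target` of the variance."""
    ratios = np.sort(eigvals)[::-1] / np.sum(eigvals)   # per-PC variance proportions, descending
    cumulative = np.cumsum(ratios)                      # cumulative variance captured
    return int(min(np.searchsorted(cumulative, target) + 1, len(eigvals)))
```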
Eigenfaces
(figure: input images and the corresponding eigenfaces, i.e., eigenvectors)
Reconstruction 1
In the following figure, each new image from left to right corresponds to using one additional principal component for reconstruction.

The figure becomes recognizable around the 7th or 8th image, but it is not perfect.
Reconstruction 1
In this next image, we show a similar picture, but with each additional face representing an additional 8 principal components.

You can see that it takes a rather large number of images before the picture looks totally correct.
Source: https://www.cs.princeton.edu/~cdecoro/eigenfaces/
Reconstruction 2
However, in this next image, we show images where the dataset excludes all those images with either glasses or different lighting conditions.

The point to keep in mind is that each new image represents one new principal component. As you can see, the image converges extremely quickly.
Original Image
Reconstruction Error vs PCA Dimensions
PCA Compression: 144D => 60D
PCA Compression: 144D => 16D
PCA Compression: 144D => 6D
PCA Compression: 144D => 3D
PCA: a useful preprocessing step
• Helps reduce computational complexity

• Can help supervised learning

• PCA can also be seen as noise reduction

• Caveats:
– Directions of greatest variance may not be the most informative (i.e., may not have the greatest classification power).
Problematic Dataset for PCA
PCA summary
Acknowledgements
• Aarthi Singh, Andrew Ng, Barnabás Póczos
