Dimensionality Reduction
Dimensionality Reduction: Why

Curse of Dimensionality: a large number of input features may lead to poor performance.
Dimensionality reduction: reduce the number of redundant and noisy features.
Feature Selection
Filtering and wrapper-based methods
Greedy forward selection: add a feature $x_i$ if adding it decreases the error.
Stop:
▪ If adding any feature does not decrease the error, or
▪ If the decrease in error is too small, or
▪ If we have reached the desired performance level
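As a concrete illustration, here is a minimal sketch of greedy wrapper-based forward selection in Python (the least-squares model, the validation split, and the tolerance `tol` are illustrative assumptions, not from the slides):

```python
import numpy as np

def forward_select(X_train, y_train, X_val, y_val, tol=1e-4):
    """Greedy forward feature selection (wrapper method, illustrative sketch)."""
    def val_error(cols):
        # Fit least squares on the selected columns, score on held-out data.
        w, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
        return np.mean((X_val[:, cols] @ w - y_val) ** 2)

    selected = []
    remaining = list(range(X_train.shape[1]))
    best_err = np.inf
    while remaining:
        errs = {j: val_error(selected + [j]) for j in remaining}
        j_best = min(errs, key=errs.get)
        if best_err - errs[j_best] <= tol:   # no (or too small) decrease: stop
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best_err = errs[j_best]
    return selected
```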
Feature Extraction Methods
Singular Value Decomposition (SVD)
Use Case: Information Retrieval
[Figure: a query "Newton was a good physicist" is matched against a document collection; the document "Feynman was a good physicist." receives high similarity, while "Newton was a British physicist" receives low similarity.]

Each document $\vec{D_i}$ and the query $\vec{Q}$ are represented as term vectors, and documents are ranked by cosine similarity:

$$\cos(\vec{Q}, \vec{D_i}) = \frac{\vec{Q} \cdot \vec{D_i}}{\lVert \vec{Q} \rVert\, \lVert \vec{D_i} \rVert}$$

These vectors are very high-dimensional
o hundreds of millions of dimensions
and each is a very sparse vector
o most entries are zero.
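To make the cosine formula concrete, here is a tiny Python example over an assumed 4-term vocabulary (the vocabulary and the binary weights are made up for illustration):

```python
import numpy as np

def cosine(q, d):
    """Cosine similarity between query and document term vectors."""
    return q @ d / (np.linalg.norm(q) * np.linalg.norm(d))

# Assumed toy vocabulary: [newton, good, physicist, british]
q  = np.array([1.0, 1.0, 1.0, 0.0])   # query: "Newton ... good physicist"
d1 = np.array([0.0, 1.0, 1.0, 0.0])   # "Feynman was a good physicist"
d2 = np.array([1.0, 0.0, 1.0, 1.0])   # "Newton was a British physicist"
print(cosine(q, d1))   # ~0.82 (high)
print(cosine(q, d2))   # ~0.67 (low)
```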
[Figure: two term clouds. Left (synonymy): auto/car, bonnet/hood, boot/trunk, lorry, engine, emissions, tyres, make, model. Right (polysemy): "model" and "make" in a different sense, alongside hidden, Markov, normalize, learning, emissions.]

Synonymy: documents that use different words for the same concept will have small cosine similarity but are related.
Polysemy: documents that share a word used in different senses will have large cosine similarity but are not truly related.
Motivating Example

            D1  D2  D3  D4  D5  D6
automobile   1   1   0   1   0   0
car          1   0   1   1   0   0
model        1   1   1   2   1   1
learning     0   0   0   1   1   1
Motivating Example

Query: $Q$ = "car model", i.e. query vector $q = (0, 1, 1, 0)$ over (automobile, car, model, learning).

            D1  D2  D3  D4  D5  D6    q
             R   R   R   R  NR  NR
automobile   1   1   0   1   0   0    0
car          1   0   1   1   0   0    1
model        1   1   1   2   1   1    1
learning     0   0   0   1   1   1    0
q · D        2   1   2   3   1   1

The relevant document D2 (which says "automobile" instead of "car") scores 1, no better than the non-relevant documents D5 and D6.
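The scores above are just a matrix-vector product; a quick check in Python:

```python
import numpy as np

# Rows: automobile, car, model, learning; columns: D1..D6
C = np.array([[1, 1, 0, 1, 0, 0],
              [1, 0, 1, 1, 0, 0],
              [1, 1, 1, 2, 1, 1],
              [0, 0, 0, 1, 1, 1]])
q = np.array([0, 1, 1, 0])   # query "car model"
print(q @ C)                 # [2 1 2 3 1 1]
```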
LSA: Motivating Example

Let us do this with Linear Algebra. Suppose we could fill in the entries the documents "should" have, treating car and automobile as one concept:

            D1  D2  D3  D4  D5  D6    q
             R   R   R   R  NR  NR
automobile   1   1   1   1   0   0    0
car          1   1   1   1   0   0    1
model        1   1   1   2   1   1    1
learning     0   0   0   1   1   1    0
q · D        2   2   2   3   1   1

Now every relevant document scores higher than every non-relevant one.
Bit of Linear Algebra: Rank

Let $C$ be an $m \times n$ matrix. Its rank is the number of linearly independent rows (or columns).

            D1  D2  D3  D4  D5  D6
automobile   1   1   0   1   0   0
car          1   0   1   1   0   0
model        1   1   1   2   1   1
learning     0   0   0   1   1   1

The smoothed matrix has rank 2: every column is a combination of $x_1 = (1, 1, 1, 0)^\top$ and $x_2 = (0, 0, 1, 1)^\top$:

            x1  x1  x1  x1+x2  x2  x2       x1  x2
automobile   1   1   1    1     0   0        1   0
car          1   1   1    1     0   0        1   0
model        1   1   1    2     1   1        1   1
learning     0   0   0    1     1   1        0   1

[Figure: with axes car, automobile, learning, documents D1, D2, D3 sit at (1,1,0), D4 at (1,1,1), and D5, D6 at (0,0,1); in the 2-dimensional concept space ($x_1$, $x_2$) they map to (1,0), (1,1), and (0,1) respectively.]
Matrix Factorization

The rank-2 matrix factors exactly into a term-concept matrix times a concept-document matrix:

$$\begin{bmatrix} 1 & 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 2 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix}$$
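A quick numpy check that the factorization is exact and the matrix really has rank 2 (the factor names W and H are my own, for illustration):

```python
import numpy as np

C = np.array([[1, 1, 1, 1, 0, 0],
              [1, 1, 1, 1, 0, 0],
              [1, 1, 1, 2, 1, 1],
              [0, 0, 0, 1, 1, 1]])
W = np.array([[1, 0], [1, 0], [1, 1], [0, 1]])   # terms x concepts
H = np.array([[1, 1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1, 1]])               # concepts x documents
assert (W @ H == C).all()
print(np.linalg.matrix_rank(C))   # 2
```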
Criteria for Approximation?

Eigenvalues: scalars $\lambda$ such that $A x = \lambda x$ for some non-zero vector $x$.
Eigenvectors: the corresponding non-zero vectors $x$.
Eigenvector Decomposition

$$A = \begin{bmatrix} 6 & 5 \\ 5 & 6 \end{bmatrix}, \qquad \lambda_1 = 11,\ \lambda_2 = 1, \qquad x_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix},\ x_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

$$A = \begin{bmatrix} 6 & 5 \\ 5 & 6 \end{bmatrix} = \underbrace{\frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}}_{U} \underbrace{\begin{bmatrix} 11 & 0 \\ 0 & 1 \end{bmatrix}}_{\Lambda} \underbrace{\frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}}_{U^\top}$$
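This decomposition can be verified numerically (note np.linalg.eigh returns eigenvalues in ascending order, so the columns come out in the order $\lambda_2, \lambda_1$):

```python
import numpy as np

A = np.array([[6.0, 5.0],
              [5.0, 6.0]])
lam, U = np.linalg.eigh(A)       # eigh: for symmetric matrices
print(lam)                       # [ 1. 11.]
print(U)                         # columns proportional to (1,-1) and (1,1), scaled by 1/sqrt(2)
print(U @ np.diag(lam) @ U.T)    # recovers A
```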
Singular Value Decomposition

Any $m \times n$ matrix $C$ of rank $r$ can be written as $C = U \Sigma V^\top$, with $U$ and $V$ orthogonal. Let $\Sigma$ be an $m \times n$ matrix
• with $\Sigma_{ii} = \sigma_i$ for $1 \le i \le r$ and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$, zero otherwise:

$$\Sigma_{m \times n} = \begin{bmatrix} \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_r)_{\,r \times r} & 0_{\,m \times (n-r)} \\ 0_{\,(m-r) \times n} & \end{bmatrix}$$
Dimensionality Reduction

$$\underbrace{C}_{m \times n} = \underbrace{U}_{m \times m}\, \underbrace{\Sigma}_{m \times n}\, \underbrace{V^\top}_{n \times n}$$

The columns of $U$ are eigenvectors of $C C^\top$; the columns of $V$ are eigenvectors of $C^\top C$.

Keeping only the top $r$ singular values gives

$$C \approx \underbrace{U}_{m \times r}\, \underbrace{\Sigma}_{r \times r}\, \underbrace{V^\top}_{r \times n}$$
Singular Value Decomposition (SVD)

The reduced form $C \approx U' \Sigma' V'^\top$ keeps only the first $r$ columns of $U$ and $V$ and the top-left $r \times r$ block of $\Sigma$.

$C$: input term-document matrix
  $m \times n$ matrix ($m$ terms, $n$ documents)
$U$: left singular matrix
  $m \times r$ matrix ($m$ terms, $r$ concepts)
$\Sigma$: singular values
  $r \times r$ diagonal matrix (strength of each concept)
$V$: right singular matrix
  $n \times r$ matrix ($n$ documents, $r$ concepts)
Singular Value Decomposition (SVD)

$$C \approx U \Sigma V^\top = \sum_i \sigma_i\, u_i \circ v_i^\top$$

[Figure: $C$ ($m \times n$) drawn as $U$ ($m \times r$) times $\Sigma$ ($r \times r$) times $V^\top$ ($r \times n$).]

Equivalently, $C$ is a weighted sum of rank-1 outer products:

$$C \approx \sigma_1 \boldsymbol{u_1} \boldsymbol{v_1}^\top + \sigma_2 \boldsymbol{u_2} \boldsymbol{v_2}^\top + \cdots$$
Low Rank Approximation

The problem: given an $m \times n$ matrix $C$ and a positive integer $k$, find another matrix $C_k$ of rank at most $k$ that minimizes the Frobenius norm of $C - C_k$.

Low Rank Approximation from SVD

Steps:
1. Given $C$, find its SVD: $C = U \Sigma V^\top$
2. Keep only the top $k$ singular values: set $\sigma_{k+1}, \dots, \sigma_r$ to zero in $\Sigma$, giving $\Sigma_k$
3. Compute $C_k = U \Sigma_k V^\top$
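A minimal numpy sketch of these steps, applied to the term-document matrix from the motivating example:

```python
import numpy as np

def low_rank_approx(C, k):
    """Rank-k approximation of C via truncated SVD (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

C = np.array([[1., 1., 0., 1., 0., 0.],
              [1., 0., 1., 1., 0., 0.],
              [1., 1., 1., 2., 1., 1.],
              [0., 0., 0., 1., 1., 1.]])
print(np.round(low_rank_approx(C, 2), 2))   # rank-2 reconstruction of C
```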
What did we do?

[Figure: $C \approx U_k \Sigma_k V_k^\top$. The $k$ columns of $U_k$ are hidden or latent concepts; each column of $\Sigma_k V_k^\top$ ($k \times n$) is a lower-dimensional representation of a document in latent space.]
Term-Document Matrix

docid   text
d1      ship ocean voyage
d2      ocean boat
d3      ship
d4      voyage trip
d5      voyage
d6      trip

Term-document matrix:

         d1  d2  d3  d4  d5  d6
ship      1   0   1   0   0   0
boat      0   1   0   0   0   0
ocean     1   1   0   0   0   0
voyage    1   0   0   1   1   0
trip      0   0   0   1   0   1
Singular Value Decomposition

[Figure: the term-document matrix factored as $C = U * \Sigma * V^\top$.]
numpy
  numpy.linalg.svd:
  https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html
scipy
  scipy.linalg.svd:
  https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.svd.html
  scipy.sparse.linalg.svds:
  https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.svds.html
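For example, decomposing the toy term-document matrix above with numpy.linalg.svd (choosing k = 2 latent dimensions for illustration):

```python
import numpy as np

# Rows: ship, boat, ocean, voyage, trip; columns: d1..d6
C = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
print(np.round(s, 2))                  # singular values, largest first
k = 2
doc_latent = np.diag(s[:k]) @ Vt[:k]   # k-dim document representations
print(np.round(doc_latent, 2))         # one column per document
```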
SVD: Summary
A Toy Example

Naïve Experimenter:
How many dimensions are important for measurement?
Which dimensions are important?

Experimental Setup:
Measure a ball's position in 3D
Three cameras (120 Hz) record the movement of the system
$$\vec{X} = \begin{bmatrix} x_A \\ y_A \\ x_B \\ y_B \\ x_C \\ y_C \end{bmatrix} \qquad (x_A, y_A\text{: camera A};\ x_B, y_B\text{: camera B};\ x_C, y_C\text{: camera C})$$

The naïve basis reflects the way the data has been collected.
A Toy Example

Consider the naïve basis in which each sample $\vec{X}$ is recorded:
$$B = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \qquad (m\text{-dimensional; here } m = 6)$$

Each row vector $b_i$ is an orthonormal basis vector. Each data point can be trivially expressed as a linear combination of $\{b_i\}$.
Change of Basis: Core PCA Idea

Change of Basis: is there another basis, which is a linear combination of the original basis, that best re-expresses the original dataset?

$$Y = P X$$

Here $X$ is the $6 \times 72000$ data matrix (10 mins of recording at 120 Hz), each column of $X$ is one sample, each row of $P$ is a new basis vector, and each column of $Y$ is the corresponding sample expressed in the new basis.
Change of Basis: Core PCA Idea

Interpretation of Change of Basis: the matrix $P$ performs a linear transform from $X$ to $Y$.

$$P X = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_m \end{bmatrix} \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}, \qquad Y = \begin{bmatrix} p_1 \cdot x_1 & \cdots & p_1 \cdot x_n \\ \vdots & \ddots & \vdots \\ p_m \cdot x_1 & \cdots & p_m \cdot x_n \end{bmatrix}, \qquad y_i = \begin{bmatrix} p_1 \cdot x_i \\ p_2 \cdot x_i \\ \vdots \\ p_m \cdot x_i \end{bmatrix}$$
Change of Basis: Core PCA Idea

[Figure: a data point $x_i$ drawn in the standard basis $b_1 = [1,0,0]$, $b_2 = [0,1,0]$, $b_3 = [0,0,1]$ and in a new basis $p_1, p_2, p_3$; the coordinates of $y_i$ are the projections $p_1 \cdot x_i$, $p_2 \cdot x_i$, $p_3 \cdot x_i$.]
Change of Basis: Core PCA Idea

Principal Components of $X$: to choose good basis vectors, look at the covariance between measurement types. For two mean-centred measurement vectors,

$$a = [a_1\ a_2\ \dots\ a_n], \qquad b = [b_1\ b_2\ \dots\ b_n], \qquad \sigma^2_{ab} = \frac{1}{n-1}\, a b^\top$$
Each row of $X$ holds all $n$ measurements for a particular measurement type:

$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}, \qquad C_X = \frac{1}{n-1} X X^\top = \begin{bmatrix} \sigma^2_{1} & \sigma^2_{12} & \cdots & \sigma^2_{1m} \\ \sigma^2_{21} & \sigma^2_{2} & \cdots & \sigma^2_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma^2_{m1} & \sigma^2_{m2} & \cdots & \sigma^2_{m} \end{bmatrix}$$
Covariance Matrix

Two Goals:
1. Minimize redundancy: covariances between different measurement types should vanish, i.e. the off-diagonal entries of $C_Y$ should be $0$.
2. Maximize signal: the variances on the diagonal of $C_Y$ should be large and sorted in decreasing order.

Both goals are met if $P$ turns $C_X$ into a diagonal $C_Y$: DIAGONALIZATION.
Diagonalizing $C_X$

$C_X$ is a symmetric matrix, and every symmetric matrix $A$ can be diagonalized by an orthogonal matrix of its eigenvectors:

$$A = E \Lambda E^\top, \qquad E = \begin{bmatrix} e_1 & e_2 & \cdots & e_m \end{bmatrix}$$
Solving PCA with Linear Algebra

$$A = E \Lambda E^\top$$

Assume: pick the new basis $P = E^\top$, the (transposed) eigenvectors of $X X^\top$. Then

$$A = P^\top \Lambda P$$
Solving PCA with Linear Algebra

With $P = E^\top$:

$$C_Y = \frac{1}{n-1} \Lambda, \qquad y_i = \begin{bmatrix} p_1 \cdot x_i \\ p_2 \cdot x_i \\ \vdots \\ p_m \cdot x_i \end{bmatrix}$$

$C_Y$ is diagonal, as desired: the rows of $P$ (the eigenvectors) are the principal components.
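A compact sketch of this recipe on synthetic data (the toy data generation is an assumption for illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
# m = 6 redundant measurement types, n = 1000 samples
t = rng.normal(size=1000)
X = np.vstack([t, 2 * t, -t, 3 * t, t, 0.5 * t]) + 0.05 * rng.normal(size=(6, 1000))
X -= X.mean(axis=1, keepdims=True)   # mean-centre each row

lam, E = np.linalg.eigh(X @ X.T)     # eigendecompose X X^T (ascending order)
P = E[:, ::-1].T                     # rows of P = principal components, largest first
Y = P @ X                            # data in the new basis
C_Y = Y @ Y.T / (X.shape[1] - 1)
print(np.round(C_Y, 3))              # ~diagonal; variance concentrated in row 0
```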
SVD and PCA

SVD over a (mean-centred) data matrix:

$$X = U \Sigma V^\top$$

Data Transform:

$$Y = U^\top X$$

$$C_Y = \frac{1}{n-1} \Lambda, \qquad \Lambda = \Sigma^2,\ \lambda_i = \sigma_i^2 \implies C_Y = \frac{1}{n-1} \Sigma^2$$

The left singular vectors of $X$ are the eigenvectors of $X X^\top$, so PCA can be computed directly from the SVD of the data matrix.
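A quick numerical check of this equivalence (random data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 500))
X -= X.mean(axis=1, keepdims=True)

lam, E = np.linalg.eigh(X @ X.T)                   # PCA route: eigenvalues of X X^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # SVD route

print(np.allclose(np.sort(lam), np.sort(s**2)))    # True: Lambda = Sigma^2
Y = U.T @ X
C_Y = Y @ Y.T / (X.shape[1] - 1)
print(np.allclose(np.diag(C_Y), s**2 / (X.shape[1] - 1)))  # True: C_Y = Sigma^2 / (n-1)
```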
PCA: Summary
[Figure: two scatter plots of the same two-class data (o and +), one projected along the PCA direction, one along the LDA direction.]

PCA: Unsupervised
LDA: Supervised
LDA Intuition

1) Centres of clusters should be far apart: high between-class variance
2) Projected points within each class should stay close together: low within-class variance

Objective: to find the direction $w$ such that the projected data points are well separated.
Objective (Restated)
Maximize the distance between the projected class means
Minimize the within-class variance
Projections

Projection of a data point $x$: $w^\top x$
Projection of class mean $\mu_1$: $w^\top \mu_1$
Projection of class mean $\mu_2$: $w^\top \mu_2$
Projection of covariance $\Sigma_1$: $w^\top \Sigma_1 w$
Projection of covariance $\Sigma_2$: $w^\top \Sigma_2 w$
Revisiting the Goals

Maximize the distance between the projected class means:

$$\bigl(w^\top (\mu_1 - \mu_2)\bigr) \bigl((\mu_1 - \mu_2)^\top w\bigr) = w^\top (\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top w = w^\top S_B w$$

where $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top$ is the between-class covariance:

$$\max_w\ w^\top S_B w$$
Revisiting the Goals

Minimize the within-class variance of the projected data:

$$w^\top \Sigma_1 w + w^\top \Sigma_2 w = w^\top (\Sigma_1 + \Sigma_2) w = w^\top S_w w, \qquad \min_w\ w^\top S_w w$$

Combining both goals:

$$\max_w\ \frac{w^\top S_B w}{w^\top S_w w} \qquad \text{(Rayleigh Quotient)}$$
Constrained Optimization

$$\max_w\ w^\top S_B w \quad \text{such that} \quad w^\top S_w w = 1$$
Lagrangian: $L(w, \lambda) = w^\top S_B w - \lambda\,(w^\top S_w w - 1)$

$$\frac{\partial L}{\partial w} = 0 \implies 2 S_B w - 2 \lambda S_w w = 0 \implies S_B w = \lambda S_w w \implies S_w^{-1} S_B w = \lambda w$$
Constrained Optimization

$$S_w^{-1} S_B w = \lambda w$$

$w$ is an eigenvector of $S_w^{-1} S_B$. Compute $w^*$ as the eigenvector of $S_w^{-1} S_B$ with the largest eigenvalue.

https://content.iospress.com/articles/ai-communications/aic729
LDA Example (Two Class)

$$X = \begin{bmatrix} 4 & 1 & c_1 \\ 2 & 4 & c_1 \\ 2 & 3 & c_1 \\ 3 & 6 & c_1 \\ 4 & 4 & c_1 \\ 9 & 10 & c_2 \\ 6 & 8 & c_2 \\ 9 & 5 & c_2 \\ 8 & 7 & c_2 \\ 10 & 8 & c_2 \end{bmatrix}, \qquad \mu_1 = [3.00\ \ 3.60], \quad \mu_2 = [8.40\ \ 7.60]$$

$$S_B = \begin{bmatrix} 29.16 & 21.60 \\ 21.60 & 16.00 \end{bmatrix}$$

$$S_{w1} = \begin{bmatrix} 0.80 & -0.40 \\ -0.40 & 2.64 \end{bmatrix}, \quad S_{w2} = \begin{bmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{bmatrix}, \quad S_w = S_{w1} + S_{w2} = \begin{bmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{bmatrix}$$

https://content.iospress.com/articles/ai-communications/aic729
LDA Example (Two Class)

$$S_w^{-1} S_B\, w = \lambda w \implies \bigl|\, S_w^{-1} S_B - \lambda I \,\bigr| = 0$$

$$\begin{vmatrix} 11.89 - \lambda & 8.81 \\ 5.08 & 3.76 - \lambda \end{vmatrix} = 0 \implies \lambda = 15.65$$

$$\begin{bmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = 15.65 \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \implies w^* = [-0.91\ \ -0.39]^\top$$

https://content.iospress.com/articles/ai-communications/aic729
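The worked example can be reproduced in a few lines of numpy (within-class scatter normalized by class size n, which is what reproduces the slide's numbers; the recovered eigenvector may differ from $w^*$ in sign and rounding):

```python
import numpy as np

X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
d = (mu1 - mu2).reshape(-1, 1)
S_B = d @ d.T                                     # between-class scatter
S_w = ((X1 - mu1).T @ (X1 - mu1) / len(X1)
       + (X2 - mu2).T @ (X2 - mu2) / len(X2))     # within-class scatter

lam, W = np.linalg.eig(np.linalg.inv(S_w) @ S_B)  # solve S_w^{-1} S_B w = lambda w
w = W[:, np.argmax(lam.real)].real
print(np.round(max(lam.real), 2))                 # 15.65
print(np.round(w / np.linalg.norm(w), 2))         # +-[0.92 0.39]
```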
PCA vs LDA

LDA: Summary