Lecture 9 - Data Reduction
Data Preprocessing - Data Reduction
Data Preprocessing
Data Quality
Data Cleaning
Data Integration
Data Reduction
Data Reduction Strategies
Data compression
Data Reduction: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful (a small numerical sketch after this slide illustrates the effect)
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
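The sparsity effect above can be checked empirically. The following Python sketch (an illustration added here, not part of the original slides; the sample sizes and dimensions are arbitrary) draws random points in a unit hypercube and shows how the contrast between the nearest and farthest neighbour of a query point shrinks as dimensionality grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Distance concentration: as dimensionality grows, the gap between the
# nearest and the farthest neighbour of a query point shrinks relative to
# the nearest distance, so distance-based methods lose discriminative power.
for d in [2, 10, 100, 1000]:
    X = rng.random((500, d))                      # 500 points in the unit hypercube
    q = rng.random(d)                             # a query point
    dists = np.linalg.norm(X - q, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")
```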
Visualization Problem
Not easy to visualize multivariate data
- 1D: dot
[Figure: data plotted on Original Variable A vs. Original Variable B, with the principal component directions PC 1 and PC 2 overlaid]
PCA:
Orthogonal projection of the data onto a lower-dimensional linear space that...
• maximizes the variance of the projected data
[Figures: a 2D Gaussian dataset, its 1st PCA axis, and its 2nd PCA axis]
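As a rough stand-in for the figures, the sketch below generates a synthetic 2D Gaussian dataset (the covariance values are made up for illustration), recovers the PCA axes from the sample covariance matrix, and confirms that no random direction captures more projected variance than the 1st PCA axis:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic correlated 2D Gaussian data (illustrative covariance values).
cov = np.array([[3.0, 1.5],
                [1.5, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=2000)

# PCA axes = eigenvectors of the sample covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)]              # 1st PCA axis

print("variance along 1st PCA axis:", np.var(X @ pc1))
for _ in range(3):
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)                        # a random unit direction
    print("variance along a random direction:", np.var(X @ w))
```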
Principal component analysis
• Principal component analysis (PCA) is a procedure that uses the correlations between the variables to identify which combinations of variables capture the most information about the dataset
• Orthogonal/Orthonormal
• Two vectors v1 and v2 for which <v1, v2> = 0 holds are said to be orthogonal
• How much do the dimensions vary from the mean with respect to each other?
• Let A be an n×n square matrix and x an n×1 column vector. Then a (right) eigenvector of A is a nonzero vector x such that

      A x = λ x        (λ: eigenvalue, x: eigenvector)

• Procedure for finding the eigenvalues (the λ's): solve the characteristic equation

      det(A − λI) = 0

  (see the numerical check below)
• PCA finds the projection direction w for which Var(w^T X) is maximal
[Figure: two candidate projection directions compared, labelled "Good" and "Better"]
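As a numerical check of the eigenvalue/eigenvector definition above, this minimal sketch (the 2×2 matrix is made up for illustration) verifies A x = λ x for each eigenpair returned by NumPy:

```python
import numpy as np

# A small symmetric matrix, e.g. a toy covariance matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig solves det(A - lambda*I) = 0.
eigvals, eigvecs = np.linalg.eig(A)

for lam, x in zip(eigvals, eigvecs.T):       # columns of eigvecs are the eigenvectors
    print(lam, np.allclose(A @ x, lam * x))  # check the defining relation A x = lambda x
```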
            | v(x1)       c(x1,x2)   ...   c(x1,xp) |
 Cov(X) =   | c(x1,x2)    v(x2)      ...   c(x2,xp) |
            |   ...         ...      ...     ...    |
            | c(x1,xp)    c(x2,xp)   ...   v(xp)    |

where v(xi) is the variance of variable xi and c(xi,xj) is the covariance between xi and xj
PCA algorithm
(based on sample covariance matrix)
• Given data {x1, …, xm}, compute the covariance matrix

      Cov = (1/m) Σ_{i=1}^{m} (x_i − x̄)(x_i − x̄)^T,   where   x̄ = (1/m) Σ_{i=1}^{m} x_i
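A direct transcription of this formula (a sketch; storing the data as rows of a NumPy array is my convention, not the lecture's):

```python
import numpy as np

def sample_covariance(X):
    """Covariance of data X whose rows are x1..xm, with the 1/m normalization above."""
    m = X.shape[0]
    x_bar = X.mean(axis=0)                  # sample mean
    centered = X - x_bar                    # subtract the mean from every row
    return (centered.T @ centered) / m      # (1/m) * sum_i (x_i - x_bar)(x_i - x_bar)^T
```

For the same row layout, this agrees with np.cov(X, rowvar=False, bias=True), which also normalizes by m.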
PCA – zero mean
• Suppose we are given x1, x2, ..., xM, each an N × 1 vector
  (N: # of features, M: # of data points)

Step 1: compute the sample mean

      x̄ = (1/M) Σ_{i=1}^{M} x_i

Step 2: subtract the mean from each vector

      Φ_i = x_i − x̄

Step 3: compute the sample covariance matrix Σx

      Σx = (1/M) Σ_{i=1}^{M} (x_i − x̄)(x_i − x̄)^T = (1/M) Σ_{i=1}^{M} Φ_i Φ_i^T = (1/M) A A^T

      where A = [Φ1 Φ2 ... ΦM], i.e., the columns of A are the Φ_i (A is an N × M matrix)
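The three steps translate directly into code. A minimal sketch, assuming the data vectors are stored as the columns of an N × M NumPy array (matching the layout of A above):

```python
import numpy as np

def pca_covariance(X):
    """X: N x M array whose columns are the data vectors x1..xM."""
    N, M = X.shape
    x_bar = X.mean(axis=1, keepdims=True)   # Step 1: sample mean (N x 1)
    A = X - x_bar                           # Step 2: columns are Phi_i = x_i - x_bar
    Sigma_x = (A @ A.T) / M                 # Step 3: covariance as (1/M) * A A^T (N x N)
    return x_bar, A, Sigma_x
```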
PCA - Steps
Step 4: compute the eigenvalues/eigenvectors of Σx

      Σx u_i = λ_i u_i,   where we assume λ_1 ≥ λ_2 ≥ ... ≥ λ_N

Note: most software packages return the eigenvalues (and corresponding eigenvectors)
in decreasing order; if not, you can explicitly put them in this order.
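With NumPy this step looks as follows (a sketch; np.linalg.eigh returns the eigenvalues of a symmetric matrix in ascending order, so they are re-sorted explicitly to get the decreasing order assumed above):

```python
import numpy as np

def sorted_eigendecomposition(Sigma_x):
    """Step 4: eigenvalues/eigenvectors of the covariance matrix, largest eigenvalue first."""
    eigvals, eigvecs = np.linalg.eigh(Sigma_x)    # Sigma_x is symmetric, so eigh applies
    order = np.argsort(eigvals)[::-1]             # indices sorted by decreasing eigenvalue
    return eigvals[order], eigvecs[:, order]      # columns of eigvecs are u_1..u_N
```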
Example
• Compute the PCA of the following dataset:
(1,2),(3,3),(3,5),(5,4),(5,6),(6,5),(8,7),(9,8)
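One way to check this example numerically (an illustrative script following the steps above, not the worked solution from the slides):

```python
import numpy as np

# The example dataset, one point per row.
X = np.array([[1, 2], [3, 3], [3, 5], [5, 4],
              [5, 6], [6, 5], [8, 7], [9, 8]], dtype=float)

x_bar = X.mean(axis=0)                       # Step 1: sample mean, (5, 5) for this data
A = (X - x_bar).T                            # Step 2: columns are the mean-subtracted points
Sigma_x = (A @ A.T) / X.shape[0]             # Step 3: sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma_x)   # Step 4: eigenvalues/eigenvectors
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("eigenvalues:", eigvals)
print("principal axes (columns):\n", eigvecs)
```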
Example (cont’d)
• The eigenvectors are the solutions of the systems:

      Σx u_i = λ_i u_i

• Dimensionality reduction: keep the smallest K that satisfies the following inequality:

      ( Σ_{i=1}^{K} λ_i ) / ( Σ_{i=1}^{N} λ_i ) > T,   where T is a threshold (e.g., 0.9)
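In code, this criterion amounts to a cumulative sum over the eigenvalues sorted in decreasing order (a sketch; the threshold name T follows the slide):

```python
import numpy as np

def choose_k(eigvals, T=0.9):
    """Smallest K whose leading eigenvalues explain at least a fraction T of the total variance."""
    eigvals = np.sort(eigvals)[::-1]                  # ensure decreasing order
    explained = np.cumsum(eigvals) / eigvals.sum()    # cumulative variance ratio
    return int(np.searchsorted(explained, T) + 1)     # first K reaching the threshold
```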
Data Normalization