
Lecture 9 - Data Reduction



Data Preprocessing
- Data Reduction
Data Preprocessing

• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization

2
Data Reduction Strategies

• Data reduction: Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Why data reduction? A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.
• Data reduction strategies
  • Dimensionality reduction, e.g., remove unimportant attributes
  • Numerosity reduction (some simply call it: Data Reduction)
  • Data compression

3
Data Reduction Strategies

• Data reduction strategies
  • Dimensionality reduction, e.g., remove unimportant attributes
    • Wavelet transforms
    • Principal Components Analysis (PCA)
    • Feature subset selection, feature creation
  • Numerosity reduction (some simply call it: Data Reduction)
    • Regression and Log-Linear Models
    • Histograms, clustering, sampling
    • Data cube aggregation
  • Data compression

4
Data Reduction: Dimensionality Reduction

• Curse of dimensionality
  • When dimensionality increases, data becomes increasingly sparse
  • Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• Dimensionality reduction
  • Avoids the curse of dimensionality
  • Helps eliminate irrelevant features and reduce noise
  • Reduces the time and space required in data mining
  • Allows easier visualization
• Dimensionality reduction techniques
  • Wavelet transforms
  • Principal Component Analysis
  • Supervised and nonlinear techniques (e.g., feature selection)

5
Visualization Problem
• It is not easy to visualize multivariate data
  • 1D: dot
  • 2D: bivariate plot (i.e., the X-Y plane)
  • 3D: X-Y-Z plot
  • 4D: ternary plot with a color code / tetrahedron
  • 5D, 6D, etc.: ???
Motivation

• Given data points in d dimensions


• Convert them to data points in r<d dimensions
• With minimal loss of information
Basics of PCA
• PCA is useful when we need to extract meaningful information from multivariate data sets.
• The technique works by reducing the dimensionality of the data.


What is a Principal Component?

• A principal component can be defined as a linear combination of optimally weighted observed variables.
What are the new axes?

[Figure: PC1 and PC2 axes drawn over the original variables A and B]

• Orthogonal directions of greatest variance in data


• Projections along PC1 discriminate the data most along any one axis
Principal Component Analysis

PCA: orthogonal projection of the data onto a lower-dimensional linear space that...
• maximizes the variance of the projected data (purple line)
• minimizes the mean squared distance between each data point and its projection (sum of blue lines)

14
The Principal Components
• Vectors originating from the center of mass
• Principal component #1 points in the direction of the largest variance.
• Each subsequent principal component…
  • is orthogonal to the previous ones, and
  • points in the direction of the largest variance of the residual subspace

15
2D Gaussian dataset

16
1st PCA axis

17
2nd PCA axis

18
Principal component analysis
• Principal component analysis (PCA) is a procedure that uses the correlations between the variables to identify which combinations of variables capture the most information about the dataset.

• Mathematically, it determines the eigenvectors of the covariance matrix and sorts them by importance according to their corresponding eigenvalues.
Basics for Principal Component Analysis

• Orthogonal/Orthonormal

• Standard deviation, Variance, Covariance

• The Covariance matrix

• Eigenvalues and Eigenvectors


Covariance

• Standard deviation and variance are 1-dimensional

• How much do the dimensions vary from the mean with respect to each other?

• Covariance is measured between 2 dimensions:

  $cov(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$

• We can easily see that if X = Y, we end up with the variance


Covariance Matrix

• Let X = (X₁, ..., X_p)ᵀ be a random vector.

• Then the covariance matrix of X, denoted by Cov(X), is the matrix whose (i, j)-th entry is $cov(X_i, X_j)$.

• The diagonals of Cov(X) are the variances $var(X_i)$.

• In matrix notation, $Cov(X) = E[(X - E[X])(X - E[X])^T]$.

• The covariance matrix is symmetric (see the sketch below).
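As an illustration (not from the slides), here is a minimal numpy sketch that builds a covariance matrix for a small made-up dataset and checks the two properties above: the diagonal holds the variances and the matrix is symmetric.

```python
import numpy as np

# Made-up data: 3 variables (rows) observed 5 times (columns).
X = np.array([[2.0, 4.0, 6.0, 8.0, 10.0],
              [1.0, 3.0, 2.0, 5.0, 4.0],
              [9.0, 7.0, 5.0, 3.0, 1.0]])

# Covariance matrix with the 1/n convention (bias=True divides by n).
C = np.cov(X, bias=True)

print(C)
print(np.allclose(C, C.T))                     # symmetric
print(np.allclose(np.diag(C), X.var(axis=1)))  # diagonal entries are the variances
```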


Orthogonality/Orthonormality

[Figure: the unit vectors v1 = (1, 0) and v2 = (0, 1) plotted in the plane]

Example: ⟨v1, v2⟩ = ⟨(1, 0), (0, 1)⟩ = 0

• Two vectors v1 and v2 for which <v1,v2>=0 holds are said to be orthogonal

• Unit vectors which are orthogonal are said to be orthonormal.
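A quick numpy check of these two definitions (illustrative only):

```python
import numpy as np

v1 = np.array([1.0, 0.0])
v2 = np.array([0.0, 1.0])

print(np.dot(v1, v2))                          # 0.0 -> v1 and v2 are orthogonal
print(np.linalg.norm(v1), np.linalg.norm(v2))  # both 1.0 -> they are also orthonormal
```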


Eigenvalues/Eigenvectors

• Let A be an n×n square matrix and x an n×1 column vector. Then a (right) eigenvector of A is a nonzero vector x such that

  $A x = \lambda x$,   where $\lambda$ is the eigenvalue and $x$ is the eigenvector.

Procedure:
• Find the eigenvalues: solve the characteristic equation $\det(A - \lambda I) = 0$ for the lambdas.
• Find the corresponding eigenvectors (see the sketch below).

Transformation

• We are looking for a transformation of the data matrix X (p×n) such that

  $Y = a^T X = a_1 X_1 + a_2 X_2 + \dots + a_p X_p$

Transformation

What is a reasonable choice for the weights $a$?

Remember: we wanted a transformation that maximizes information. That means it should capture the variance in the data.

Maximize the variance of the projection of the observations on the Y variables!

Find $a$ such that $Var(a^T X)$ is maximal.

The matrix $C = Var(X)$ is the covariance matrix of the $X_i$ variables.
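To make the goal concrete, here is a small illustration of my own: among unit-length weight vectors a, the projection variance Var(aᵀX) = aᵀCa is largest when a points along the leading eigenvector of C.

```python
import numpy as np

rng = np.random.default_rng(0)
# 2 variables, 500 observations (columns), drawn from a made-up distribution.
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 1], [1, 1]], size=500).T

C = np.cov(X, bias=True)
eigvals, eigvecs = np.linalg.eigh(C)
a_best = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue

for _ in range(3):                             # random unit-length directions
    a = rng.normal(size=2)
    a /= np.linalg.norm(a)
    print("random a:      ", a @ C @ a)
print("leading eigvec:", a_best @ C @ a_best)  # the largest value = max eigenvalue
```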


Transformation
Can we intuitively see that in a picture?

[Figure: two candidate projection directions, one labeled "Good" and one labeled "Better"]

$$Cov(X) = \begin{pmatrix} v(x_1) & c(x_1, x_2) & \cdots & c(x_1, x_p) \\ c(x_1, x_2) & v(x_2) & \cdots & c(x_2, x_p) \\ \vdots & \vdots & \ddots & \vdots \\ c(x_1, x_p) & c(x_2, x_p) & \cdots & v(x_p) \end{pmatrix}$$
PCA algorithm
(based on sample covariance matrix)
• Given data {x₁, …, x_m}, compute the covariance matrix Σ:

  $\Sigma = \frac{1}{m}\sum_{i=1}^{m}(x_i - \bar{x})(x_i - \bar{x})^T$   where   $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$

• PCA basis vectors = the eigenvectors of Σ

• Larger eigenvalue ⇒ more important eigenvector (a sketch follows below)

29
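A minimal numpy sketch of this algorithm; the function name and return convention are my own, not from the lecture.

```python
import numpy as np

def pca_basis(X):
    """PCA basis vectors from the sample covariance matrix.

    X: m x d array (m data points, d features).
    Returns the eigenvalues in decreasing order and the matching
    eigenvectors as columns.
    """
    x_bar = X.mean(axis=0)              # sample mean
    Phi = X - x_bar                     # centered data
    Sigma = (Phi.T @ Phi) / len(X)      # covariance matrix, 1/m convention
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]   # larger eigenvalue = more important
    return eigvals[order], eigvecs[:, order]
```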
PCA – zero mean
• Suppose we are given x₁, x₂, ..., x_M (N × 1 vectors), where N = number of features and M = number of data points.

Step 1: compute the sample mean

  $\bar{x} = \frac{1}{M}\sum_{i=1}^{M} x_i$

Step 2: subtract the sample mean (i.e., center the data at zero)

  $\Phi_i = x_i - \bar{x}$

Step 3: compute the sample covariance matrix Σx

  $\Sigma_x = \frac{1}{M}\sum_{i=1}^{M}(x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{M}\sum_{i=1}^{M}\Phi_i \Phi_i^T = \frac{1}{M} A A^T$

  where A = [Φ₁ Φ₂ ... Φ_M] is the N × M matrix whose columns are the Φᵢ.

30
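As a quick sanity check of my own, the (1/M)·AAᵀ form above matches numpy's biased covariance estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 10))          # N = 3 features, M = 10 data points (columns)

x_bar = X.mean(axis=1, keepdims=True)
A = X - x_bar                         # the columns of A are the centered vectors Phi_i
Sigma_x = (A @ A.T) / X.shape[1]      # (1/M) * A A^T

print(np.allclose(Sigma_x, np.cov(X, bias=True)))  # True
```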
PCA - Steps
Step 4: compute the eigenvalues/eigenvectors of Σx

  $\Sigma_x u_i = \lambda_i u_i$,   where we assume $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_N$

Note: most software packages return the eigenvalues (and corresponding eigenvectors) in decreasing order; if not, you can explicitly put them in this order.

Since Σx is symmetric, u₁, u₂, …, u_N form an orthogonal basis in Rᴺ and we can represent any x ∈ Rᴺ as:

  $x - \bar{x} = \sum_{i=1}^{N} y_i u_i = y_1 u_1 + y_2 u_2 + \dots + y_N u_N$

  where   $y_i = \frac{(x - \bar{x})^T u_i}{u_i^T u_i} = (x - \bar{x})^T u_i$   if $\|u_i\| = 1$

i.e., this is just a "change" of basis: the coordinates $(x_1, \dots, x_N)$ of x become the coordinates $(y_1, \dots, y_N)$ (see the sketch below).

Note: most software packages normalize the uᵢ to unit length to simplify calculations; if not, you can explicitly normalize them.

31
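A small numpy sketch of this change of basis (my own illustration with made-up data): projecting onto the unit eigenvectors and mapping back recovers x exactly when all N components are kept.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))          # 100 points, N = 4 features
x_bar = X.mean(axis=0)

Sigma = ((X - x_bar).T @ (X - x_bar)) / len(X)
eigvals, U = np.linalg.eigh(Sigma)     # columns of U are unit-length eigenvectors u_i
U = U[:, np.argsort(eigvals)[::-1]]    # decreasing eigenvalue order

x = X[0]
y = U.T @ (x - x_bar)                  # y_i = (x - x_bar)^T u_i
x_back = x_bar + U @ y                 # change of basis back; lossless when K = N
print(np.allclose(x, x_back))          # True
```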
PCA - Steps
Step 5: dimensionality reduction step – approximate x using only the first K eigenvectors (K << N), i.e., those corresponding to the K largest eigenvalues, where K is a parameter:

  $\hat{x} = \bar{x} + \sum_{i=1}^{K} y_i u_i$   (a sketch follows below)

32
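Continuing in the same illustrative style, truncating to the first K eigenvectors gives the approximation x̂ (names and data are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))          # N = 4 features
x_bar = X.mean(axis=0)

Sigma = ((X - x_bar).T @ (X - x_bar)) / len(X)
eigvals, U = np.linalg.eigh(Sigma)
U = U[:, np.argsort(eigvals)[::-1]]

K = 2                                  # keep only the K most important eigenvectors
x = X[0]
y = U[:, :K].T @ (x - x_bar)           # K projection coefficients y_1, ..., y_K
x_hat = x_bar + U[:, :K] @ y           # approximate reconstruction of x
print(np.linalg.norm(x - x_hat))       # reconstruction error; zero only when K = N
```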
Example
• Compute the PCA of the following dataset:

(1,2),(3,3),(3,5),(5,4),(5,6),(6,5),(8,7),(9,8)

• Compute the sample covariance matrix (see the sketch below):

• The eigenvalues can be computed by finding the roots of the characteristic polynomial:

33
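An illustrative numpy computation for this dataset (my own code; it uses the 1/M convention from the earlier slides, so the lecture's numbers may differ by a constant factor if it divides by M−1 instead):

```python
import numpy as np

X = np.array([(1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8)], dtype=float)

x_bar = X.mean(axis=0)                      # (5.0, 5.0)
Phi = X - x_bar
Sigma = (Phi.T @ Phi) / len(X)              # [[6.25, 4.25], [4.25, 3.5]]

# Eigenvalues are the roots of the characteristic polynomial det(Sigma - lambda*I) = 0.
eigvals, eigvecs = np.linalg.eigh(Sigma)
print(eigvals)                              # eigenvalues in ascending order
print(eigvecs)                              # corresponding unit-length eigenvectors (columns)
```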
Example (cont’d)
• The eigenvectors are the solutions of the systems:

  $\Sigma_x u_i = \lambda_i u_i$

Note: if uᵢ is a solution, then c·uᵢ is also a solution for any c ≠ 0.

Eigenvectors can be normalized to unit length using:

  $\hat{v}_i = \frac{v_i}{\|v_i\|}$
34
Choosing the projection dimension K?

• K is typically chosen based on how much information (variance) we want to preserve:

  Choose the smallest K that satisfies   $\frac{\sum_{i=1}^{K} \lambda_i}{\sum_{i=1}^{N} \lambda_i} \geq T$,   where T is a threshold (e.g., 0.9).

• If T = 0.9, for example, we “preserve” 90% of the information (variance) in the data.

• If K = N, then we “preserve” 100% of the information in the data (i.e., it is just a “change” of basis and $\hat{x} = x$). A sketch follows below.

35
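A minimal sketch of this rule (my own helper; eigvals is assumed to hold the eigenvalues λ₁ ≥ λ₂ ≥ … ≥ λ_N):

```python
import numpy as np

def choose_k(eigvals, T=0.9):
    """Smallest K whose leading eigenvalues capture at least a fraction T of the variance."""
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratios, T) + 1)

eigvals = np.array([5.0, 2.5, 1.5, 0.7, 0.3])  # made-up eigenvalues, decreasing
print(choose_k(eigvals, T=0.9))                # -> 3, since (5.0 + 2.5 + 1.5) / 10.0 = 0.9
```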
Data Normalization

• The principal components depend on the units used to measure the original variables as well as on the range of values they assume.

• Data should always be normalized prior to using PCA.

• A common normalization method is to transform all the data to have zero mean and unit standard deviation:

  $\frac{x_i - \mu}{\sigma}$,   where μ and σ are the mean and standard deviation of the i-th feature xᵢ (a sketch follows below)

36
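A short illustrative snippet of this zero-mean, unit-variance normalization applied before PCA (variable names and data are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
# Two features measured on very different scales.
X = rng.normal(loc=[10.0, 200.0], scale=[1.0, 50.0], size=(100, 2))

mu = X.mean(axis=0)
sigma = X.std(axis=0)
Z = (X - mu) / sigma              # each feature now has zero mean and unit std

print(Z.mean(axis=0).round(6))    # approximately [0, 0]
print(Z.std(axis=0).round(6))     # [1, 1]
```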
