Data Mining
Lecture # 9
Data Preprocessing
(Ch # 3)
Data Preprocessing
Dimensionality reduction is a part of Data
Preprocessing
Data preprocessing has the following four
major steps:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data Transformation and Discretization
Data Reduction
Obtain a reduced representation of the data set that is
much smaller in volume, yet closely
maintains the integrity of the original data.
Different strategies are:
• Dimensionality Reduction
• Numerosity reduction
• Data Compression
Dimensionality Reduction
(DR)
The process of reducing the number of random
variables or attributes under consideration.
Two very common methods are Wavelet
Transforms and Principal Components
Analysis (PCA)
Numerosity Reduction
Replace the original data volume by alternative,
smaller forms of data representation.
• Parametric methods: regression and log-linear models
• Nonparametric methods: histograms, clustering, sampling and data cube aggregation
Data Compression
Transformations are applied to obtain a reduced or
compressed representation of the original data.
• Lossless: the original data can be reconstructed
from the compressed data without any information loss.
• Lossy: only an approximation of the original data
can be reconstructed.
DR: Principal Components Analysis
(PCA)
Why PCA?
PCA is a useful statistical technique that has
found applications in:
• Face recognition
• Image compression
• Reducing the dimensionality of data
PCA Goal:
Removing Dimensional Redundancy
The major goal of PCA in Data Science and Machine
Learning is to remove the “dimensional redundancy”
from data.
What does that mean?
A typical dataset contains several dimensions (variables) that
may or may not correlate.
Dimensions that correlate vary together.
The information represented by a set of dimensions with high
correlation can be extracted by studying just one dimension
that represents the whole set.
Hence the goal is to reduce the dimensions of a dataset to a
smaller set of representative dimensions that do not correlate.
PCA Goal:
Removing Dimensional Redundancy
[Figure: a data set with 12 dimensions, Dim 1 through Dim 12.]
Analyzing 12-dimensional data is challenging!
PCA Goal:
Removing Dimensional Redundancy
[Figure: the same 12 dimensions, Dim 1 through Dim 12.]
But some dimensions represent redundant
information. Can we "reduce" these?
PCA Goal:
Removing Dimensional Redundancy
Let's assume we have a "PCA black box" that
can reduce the correlating dimensions.
Pass the 12-dimensional data set through the
black box to get a three-dimensional data set.
PCA Goal:
Removing Dimensional Redundancy
Given an appropriate reduction,
analyzing the reduced data set
is much more efficient than
analyzing the original "redundant" data.
[Figure: the 12 dimensions (Dim 1 ... Dim 12) enter the PCA black box
and come out as three dimensions (Dim A, Dim B, Dim C).]
Pass the 12-dimensional data set through the
black box to get a three-dimensional data set.
Mathematics inside PCA Black box: Bases
Let's now give the "black box" a mathematical form.
In linear algebra, the dimensions of a space are described by a linearly
independent set of vectors, called a basis, that spans the space;
i.e., each point in that space is a linear combination of the basis vectors.
E.g., consider the simplest example: the standard basis of $\mathbb{R}^n$,
consisting of the coordinate axes.
Every point in $\mathbb{R}^3$ is a linear combination of the standard basis of $\mathbb{R}^3$:
$e_1 = (1, 0, 0), \quad e_2 = (0, 1, 0), \quad e_3 = (0, 0, 1)$
$(2, 3, 3) = 2\,(1, 0, 0) + 3\,(0, 1, 0) + 3\,(0, 0, 1)$
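A quick numerical check of this linear combination (a small sketch; the variable names are just illustrative):

```python
import numpy as np

e1, e2, e3 = np.eye(3)            # standard basis vectors of R^3
point = 2 * e1 + 3 * e2 + 3 * e3  # linear combination of the basis
print(point)                      # [2. 3. 3.]
```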
PCA Goal: Change of Basis
Assume X is the 6-dimensional data set given as input.
PCA looks for a new basis, i.e. a change of basis, in which to express this data.

Mean, Standard Deviation and Variance
The mean of a data set,
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n},$
doesn't tell us a lot about the data set;
different data sets can have the same mean.
The standard deviation (SD) measures the spread of the data in a data set:
$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
Variance is another measure of
the spread of data in a data set. It is
almost identical to SD (it is the square of the SD):
$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
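A small numeric sketch of these three quantities, using the sample versions with n - 1 in the denominator as in the formulas above (the data values are taken from the worked example later in the lecture):

```python
import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])

mean = x.mean()
std  = x.std(ddof=1)   # sample standard deviation, divides by n - 1
var  = x.var(ddof=1)   # sample variance, divides by n - 1

print(mean, std, var)  # var == std ** 2
```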
Covariance
SD and variance are 1-dimensional measures.
1-D data sets could be:
• Heights of all the people in the room
• Salaries of employees in a company
• Marks in a quiz
However, many data sets have more than one dimension.
Our aim is to find any relationship between different dimensions,
e.g. finding the relationship between students' results and their hours of study.
Covariance is used to measure the relationship between two dimensions:
$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
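A brief sketch of computing this sample covariance with NumPy; the hours/marks values here are hypothetical, used only to illustrate the formula.

```python
import numpy as np

hours = np.array([9, 15, 25, 14, 10, 18, 0, 16, 5, 19])    # hypothetical study hours
marks = np.array([39, 56, 93, 61, 50, 75, 32, 85, 42, 70]) # hypothetical marks

# Manual sample covariance, matching the formula above
cov_manual = ((hours - hours.mean()) * (marks - marks.mean())).sum() / (len(hours) - 1)

# np.cov returns the full 2x2 covariance matrix; cov(H, M) is an off-diagonal entry
cov_np = np.cov(hours, marks)[0, 1]

print(cov_manual, cov_np)  # both positive: hours and marks increase together
```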
Covariance Interpretation
We have a data set of students' study hours (H) and marks
achieved (M).
We find cov(H, M).
The exact value of the covariance is not as important as its sign
(i.e. positive or negative):
+ve: both dimensions increase together
-ve: as one dimension increases, the other decreases
Zero: there exists no relationship between the two dimensions
Covariance Matrix
Covariance is always measured between 2 dimensions.
What if we have a data set with more than 2 dimensions?
We have to calculate more than one covariance measurement.
E.g. from a 3-dimensional data set (dimensions x, y, z)
we could calculate cov(x,y), cov(x,z) and cov(y,z).
Covariance Matrix
We can use the covariance matrix to hold the
covariances of all the possible pairs of dimensions.
Since cov(a,b) = cov(b,a),
the matrix is symmetrical about the main diagonal.
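A small sketch showing the covariance matrix of a 3-dimensional data set with NumPy; the data is random and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(50, 3))   # 50 samples of a 3-dimensional data set (x, y, z)

# rowvar=False: each column is a dimension, each row is an observation
C = np.cov(data, rowvar=False)

print(C.shape)              # (3, 3): one entry per pair of dimensions
print(np.allclose(C, C.T))  # True: symmetric, since cov(a, b) == cov(b, a)
```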
Eigenvectors and Eigenvalues
More formally defined:
Let A be an n×n matrix. A nonzero vector v that satisfies
$A v = \lambda v$
for some scalar $\lambda$ is called an eigenvector of A,
and $\lambda$ is the eigenvalue of A corresponding to the
eigenvector v.
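A minimal NumPy sketch verifying the definition $A v = \lambda v$ on a small example matrix; the matrix itself is arbitrary.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)   # columns of `eigenvectors` are the v's

for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    lam = eigenvalues[i]
    print(np.allclose(A @ v, lam * v))         # True: A v == lambda v
```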
Principal Components Analysis
(PCA)
PCA is a method to identify a
new set of predictors, as linear
combinations of the original
ones, that captures the
'maximum amount' of
variance in the observed data.
It is also a technique for identifying patterns in data,
and for expressing data in such a way as to
highlight similarities and differences.
PCA is used to reduce the dimensionality of data
without losing the integrity of the information.
Principal Components Analysis
(PCA)
Definition
Principal Components Analysis (PCA) produces a list of
p principal components (Y1, . . . , Yp) such that:
• Each Yi is a linear combination of the original
predictors, and its vector norm is 1.
• The Yi's are pairwise orthogonal.
• The Yi's are ordered in decreasing order of the
amount of observed variance they capture.
That is, the observed data shows more variance
in the direction of Y1 than in the direction of Y2.
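A brief sketch checking these three properties with scikit-learn's PCA; the random 5-dimensional data is an assumption for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))

pca = PCA().fit(X)
components = pca.components_                  # rows are the principal directions Yi

print(np.linalg.norm(components, axis=1))     # each close to 1 (unit norm)
print(np.allclose(components @ components.T,
                  np.eye(5)))                 # True: pairwise orthogonal
print(pca.explained_variance_)                # ordered from largest to smallest
```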
Step 2: Subtract the mean from each of the data points
Step 1 & Step 2 (the column means are mean(X1) = 1.81, mean(X2) = 1.91):

 X1     X2    X1 - mean   X2 - mean   (X1 - mean)^2   (X2 - mean)^2
 2.5    2.4      0.69        0.49        0.4761          0.2401
 0.5    0.7     -1.31       -1.21        1.7161          1.4641
 2.2    2.9      0.39        0.99        0.1521          0.9801
 1.9    2.2      0.09        0.29        0.0081          0.0841
 3.1    3.0      1.29        1.09        1.6641          1.1881
 2.3    2.7      0.49        0.79        0.2401          0.6241
 2.0    1.6      0.19       -0.31        0.0361          0.0961
 1.0    1.1     -0.81       -0.81        0.6561          0.6561
 1.5    1.6     -0.31       -0.31        0.0961          0.0961
 1.1    0.9     -0.71       -1.01        0.5041          1.0201
Sum: 18.1   19.1     0           0           5.549           6.449
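A sketch of steps 1 and 2 in NumPy, using the data from the table above:

```python
import numpy as np

# Step 1: the example data (columns X1, X2)
data = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

# Step 2: subtract the mean of each column from every data point
mean = data.mean(axis=0)        # [1.81, 1.91]
adjusted = data - mean          # mean-adjusted data, each column now sums to 0

print(adjusted.sum(axis=0))     # ~[0, 0]
```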
Step 3: Calculate the covariance matrix
$\text{cov} = \begin{pmatrix} 0.616555556 & 0.615444444 \\ 0.615444444 & 0.716555556 \end{pmatrix}$
Step 4: Calculate the eigenvalues and
eigenvectors of the covariance matrix
using the following equation:
$\det(\text{cov} - \lambda I) = 0$
where $\lambda$ is an eigenvalue and $I$ is the identity matrix;
each eigenvector $v$ then satisfies $(\text{cov} - \lambda I)\, v = 0$.
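Continuing the sketch, steps 3 and 4 in NumPy; the printed values should match the covariance matrix above up to rounding.

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
adjusted = data - data.mean(axis=0)

# Step 3: covariance matrix (rowvar=False -> columns are the dimensions)
C = np.cov(adjusted, rowvar=False)

# Step 4: eigenvalues and eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(C)

print(C)            # approx [[0.6166, 0.6154], [0.6154, 0.7166]]
print(eigenvalues)  # approx [0.0491, 1.2840] (order is not guaranteed)
```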
What does this all mean?
[Figure: plot of the mean-adjusted data points with the two eigenvectors of the covariance matrix drawn over them.]
Conclusion
Eigenvectors give us information about the
patterns in the data.
Looking at the graph on the previous slide, see
how one of the eigenvectors goes through the
middle of the points.
The second eigenvector tells us about another,
weaker pattern in the data.
So by finding the eigenvectors of the covariance
matrix we are able to extract the lines that
characterize the data.
Step 5: Choosing components and forming a feature
vector
The eigenvector with the highest eigenvalue is the
principal component of the data set.
In our example, the eigenvector with the
largest eigenvalue was the one that pointed
down the middle of the data.
So, once the eigenvectors are found, the
next step is to order them by eigenvalue,
highest to lowest.
This gives the components in order of
significance.
Cont’d
Now, here comes the idea of dimensionality
reduction and data compression:
You can decide to ignore the components of
least significance.
You do lose some information, but if the
eigenvalues are small you don't lose much.
More formally stated (see next slide):
Cont’d
We have n dimensions,
so we will find n eigenvectors.
But if we choose only the first p eigenvectors,
then the final data set has only p dimensions (see the sketch below).
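A sketch of step 5: ordering the eigenvectors by eigenvalue and keeping the top p (here p = 1, as in "Choice-2" on the next slide). The numeric values are the approximate results of the worked example; the sign of each eigenvector is arbitrary.

```python
import numpy as np

# Eigenvalues and eigenvectors as computed in step 4 (approximate values)
eigenvalues = np.array([0.0490834, 1.2840277])
eigenvectors = np.array([[-0.7352, -0.6778],
                         [ 0.6778, -0.7352]])   # columns correspond to the eigenvalues

# Order components by eigenvalue, highest first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep only the first p components to form the feature vector
p = 1
feature_vector = eigenvectors[:, :p]
print(feature_vector)   # approx [[-0.6778], [-0.7352]]
```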
Step 6: Deriving the new dataset
Now, we have chosen the components (eigenvectors) that
we want to keep. We can write them in the form of a matrix
of vectors (a feature vector). In our example we have two
eigenvectors, so we have two choices:
Choice-1: with two eigenvectors
$\begin{pmatrix} 0.7351 & 0.6778 \\ 0.6778 & 0.7351 \end{pmatrix}$
Choice-2: with one eigenvector, i.e. the first eigenvector only
$\begin{pmatrix} 0.6778 \\ 0.7351 \end{pmatrix}$
Cont'd
To obtain the final data set, we multiply the
transpose of the chosen feature vector with the
transpose of the mean-adjusted data matrix, i.e.
$\text{FinalData} = \text{FeatureVector}^{T} \times \text{MeanAdjustedData}^{T}$
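A sketch of step 6, projecting the mean-adjusted example data onto the chosen component (continuing the earlier sketches; as before, the sign of the component is arbitrary).

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
adjusted = data - data.mean(axis=0)

# Feature vector for Choice-2: keep only the principal eigenvector
feature_vector = np.array([[0.6778],
                           [0.7351]])

# FinalData = FeatureVector^T x MeanAdjustedData^T -> one row per kept component
final_data = feature_vector.T @ adjusted.T

print(final_data.shape)   # (1, 10): each original point is now a single value
```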
Summary of PCA Steps
Step 2: Subtract the mean
Step 3: Calculate the covariance matrix
Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix
Step 5: Choose components and form a feature vector