
CS06504

Data Mining
Lecture # 9
Data Preprocessing
(Ch # 3)
Data Preprocessing
• Dimensionality reduction is a part of data preprocessing.
• Data preprocessing has the following four major steps:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation and discretization
Data Reduction
• Obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
• Different strategies are:
  • Dimensionality reduction
  • Numerosity reduction
  • Data compression
Dimensionality Reduction (DR)
• Process of reducing the number of random attributes under consideration.
• Two very common methods are Wavelet Transforms and Principal Components Analysis (PCA).
Numerosity Reduction
• Replace the original data volume by alternative, smaller forms of data representation.
• Regression and log-linear models (parametric methods).
• Histograms, clustering, sampling and data cube aggregation (nonparametric methods).
Data Compression
• Transformations are applied to obtain a reduced or compressed representation of the data.
• Lossless: the original data can be reconstructed from the compressed data without any information loss.
• Lossy: only an approximation of the original data can be reconstructed.
DR: Principal Components Analysis (PCA)
• Why PCA?
• PCA is a useful statistical technique that has found applications in:
  • Face recognition
  • Image compression
  • Reducing the dimension of data
PCA Goal: Removing Dimensional Redundancy
• The major goal of PCA in data science and machine learning is to remove the "dimensional redundancy" from data.
• What does that mean?
  • A typical dataset contains several dimensions (variables) that may or may not correlate.
  • Dimensions that correlate vary together.
  • The information represented by a set of dimensions with high correlation can be extracted by studying just one dimension that represents the whole set.
  • Hence the goal is to reduce the dimensions of a dataset to a smaller set of representative dimensions that do not correlate.
PCA Goal: Removing Dimensional Redundancy
[Figure: a 12-dimensional data set (Dim 1 through Dim 12). Analyzing 12-dimensional data is challenging!]
PCA Goal: Removing Dimensional Redundancy
[Figure: the same 12 dimensions, but some dimensions represent redundant information. Can we "reduce" these?]
PCA Goal: Removing Dimensional Redundancy
[Figure: assume we have a "PCA black box" that can reduce the correlating dimensions. Pass the 12-dimensional data set through the black box to get a three-dimensional data set.]
PCA Goal: Removing Dimensional Redundancy
[Figure: the 12-dimensional data set (Dim 1 through Dim 12) passes through the PCA black box and comes out as a three-dimensional data set (Dim A, Dim B, Dim C). Given an appropriate reduction, analyzing the reduced dataset is much more efficient than analyzing the original "redundant" data.]
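As an illustration of the "black box" idea (not part of the original slides), the sketch below uses scikit-learn's PCA on a synthetic 12-dimensional data set and keeps the top three components; the data and variable names are made up for demonstration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 12-dimensional data: 3 independent signals plus
# 9 noisy linear combinations of them (so dimensions correlate).
rng = np.random.default_rng(0)
signals = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 9))
X = np.hstack([signals, signals @ mixing + 0.05 * rng.normal(size=(200, 9))])

# The "PCA black box": project the 12-d data onto 3 principal components.
pca = PCA(n_components=3)
Y = pca.fit_transform(X)

print(Y.shape)                              # (200, 3)
print(pca.explained_variance_ratio_.sum())  # close to 1.0 -> little information lost
```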
Mathematics inside PCA Black box: Bases
• Let's now give the "black box" a mathematical form.
• In linear algebra, the dimensions of a space are described by a linearly independent set called a "basis" that spans the space, i.e. each point in that space is a linear combination of the basis set.
• E.g. consider the simplest example: the standard basis of R^n, consisting of the coordinate axes. Every point in R^3 is a linear combination of the standard basis vectors of R^3:

$$e_1 = (1,0,0), \quad e_2 = (0,1,0), \quad e_3 = (0,0,1)$$

$$(2,3,3) = 2\,(1,0,0) + 3\,(0,1,0) + 3\,(0,0,1)$$
PCA Goal: Change of Basis
• Assume X is the 6-dimensional data set given as input, with data points as rows and dimensions as columns:

$$X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & x_{14} & x_{15} & x_{16} \\
x_{21} & x_{22} & x_{23} & x_{24} & x_{25} & x_{26} \\
x_{31} & x_{32} & x_{33} & x_{34} & x_{35} & x_{36} \\
x_{41} & x_{42} & x_{43} & x_{44} & x_{45} & x_{46} \\
x_{51} & x_{52} & x_{53} & x_{54} & x_{55} & x_{56}
\end{bmatrix}$$

• A naïve basis for X is the standard basis for R^6, and hence BX = X (with B the identity).
• Here, we want to find a new (reduced) basis P such that PX = Y.
• Y will be the resultant reduced data set.
PCA Goal
• Change of Basis: PX = Y

$$\begin{bmatrix}
p_{11} & p_{12} & \cdots & p_{1m} \\
p_{21} & p_{22} & \cdots & p_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
p_{m1} & p_{m2} & \cdots & p_{mm}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}
=
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$$

• QUESTION: What is a good choice for P?
  • Let's park this question for now and revisit it after studying some related concepts.
Background Stats/Maths
• Mean and standard deviation
• Variance and covariance
• Covariance matrix
• Eigenvectors and eigenvalues
Mean and Standard Deviation
• Mean:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

  • It doesn't tell us a lot about a data set on its own; different data sets can have the same mean.
• Standard deviation (SD) of a data set is a measure of how spread out the data is:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$$

• Variance is another measure of the spread of data in a data set. It is almost identical to SD:

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$
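A quick numerical illustration (not from the slides), using NumPy with the same n−1 (sample) convention as the formulas above; the data values are made up:

```python
import numpy as np

X = np.array([2.5, 0.5, 2.2, 1.9, 3.1])  # hypothetical sample

mean = X.mean()
std = X.std(ddof=1)   # ddof=1 -> divide by (n - 1), matching the formula above
var = X.var(ddof=1)

print(mean, std, var)
print(np.isclose(var, std**2))  # True: variance is the square of the SD
```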
Covariance
• SD and variance are 1-dimensional.
• 1-D data sets could be:
  • Heights of all the people in the room
  • Salary of each employee in a company
  • Marks in the quiz
• However, many datasets have more than one dimension.
• Our aim is to find any relationship between different dimensions.
  • E.g. finding the relationship between students' results and their hours of study.
• Covariance is used to measure the relationship between two dimensions:

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$
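For instance (an illustrative sketch with hypothetical hours/marks values), the sample covariance can be computed directly from the formula or with np.cov:

```python
import numpy as np

H = np.array([2, 4, 6, 8, 10])      # hypothetical hours of study
M = np.array([50, 55, 65, 70, 80])  # hypothetical marks achieved

# Direct application of the formula (n - 1 in the denominator).
cov_hm = ((H - H.mean()) * (M - M.mean())).sum() / (len(H) - 1)

# np.cov returns the full 2x2 covariance matrix; entry [0, 1] is cov(H, M).
print(cov_hm, np.cov(H, M)[0, 1])   # both positive: H and M increase together
```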
Covariance Interpretation
• Suppose we have a data set of students' study hours (H) and marks achieved (M), and we compute cov(H, M).
• The exact value of the covariance is not as important as its sign (i.e. positive or negative):
  • Positive: both dimensions increase together.
  • Negative: as one dimension increases, the other decreases.
  • Zero: there exists no relationship between the two dimensions.
Covariance Matrix
• Covariance is always measured between 2 dimensions.
• What if we have a data set with more than 2 dimensions?
• We have to calculate more than one covariance measurement.
• E.g. from a 3-dimensional data set (dimensions x, y, z) we could calculate cov(x,y), cov(x,z) and cov(y,z).
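A small sketch (with made-up x, y, z data, not from the slides) showing that np.cov computes all of these pairwise covariances at once:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)   # correlated with x
z = rng.normal(size=100)           # roughly independent of x and y

C = np.cov(np.vstack([x, y, z]))   # rows are dimensions -> 3x3 matrix of covariances
print(C[0, 1], C[0, 2], C[1, 2])   # cov(x,y), cov(x,z), cov(y,z)
print(np.allclose(C, C.T))         # True: cov(a,b) = cov(b,a), so the matrix is symmetric
```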
Covariance Matrix
• We can use a covariance matrix to hold the covariances of all the possible pairs of dimensions.
• Since cov(a,b) = cov(b,a), the matrix is symmetric about the main diagonal:

$$C = \begin{bmatrix}
\mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\
\mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\
\mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z)
\end{bmatrix}$$
Eigenvectors
• Consider the two multiplications between a matrix and a vector:

$$\begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 3 \end{bmatrix} = \begin{bmatrix} 11 \\ 5 \end{bmatrix}
\qquad
\begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 3 \\ 2 \end{bmatrix} = \begin{bmatrix} 12 \\ 8 \end{bmatrix} = 4 \begin{bmatrix} 3 \\ 2 \end{bmatrix}$$

• In the first example the resulting vector is not an integer multiple of the original vector.
• Whereas in the second example, the resulting vector is 4 times the original vector.
Eigenvectors and Eigenvalues
• More formally defined:
  • Let A be an n×n matrix. A nonzero vector v that satisfies Av = λv for some scalar λ is called an eigenvector of the matrix A, and λ is the eigenvalue of A corresponding to the eigenvector v.
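As a quick check of the example above (a sketch, not part of the slides), NumPy's eigen-decomposition recovers λ = 4 with an eigenvector proportional to (3, 2):

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])
v = np.array([3.0, 2.0])

print(A @ v, 4 * v)              # [12. 8.] equals 4 * (3, 2): v is an eigenvector

vals, vecs = np.linalg.eig(A)
print(vals)                      # eigenvalues: 4 and -1
print(vecs[:, np.argmax(vals)])  # unit-norm eigenvector proportional to (3, 2)
```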
Principal Components Analysis (PCA)
• PCA is a method to identify a new set of predictors, as linear combinations of the original ones, that capture the 'maximum amount' of variance in the observed data.
• It is a technique for identifying patterns in data.
• It is also used to express data in such a way as to highlight similarities and differences.
• PCA is used to reduce the dimension of data without losing the integrity of the information.
Principal Components Analysis (PCA)
Definition
• Principal Components Analysis (PCA) produces a list of p principal components (Y1, . . . , Yp) such that:
  • Each Yi is a linear combination of the original predictors, and its vector norm is 1.
  • The Yi's are pairwise orthogonal.
  • The Yi's are ordered in decreasing order of the amount of captured observed variance. That is, the observed data shows more variance in the direction of Y1 than in the direction of Y2.
• To perform dimensionality reduction we select the top m principal components of PCA as our new predictors and express our observed data in terms of them (see the sketch below).
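A small sketch (illustrative, with synthetic data) checking these properties with scikit-learn: the component vectors have unit norm, are pairwise orthogonal, and their captured variances come out in decreasing order.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))  # correlated synthetic predictors

pca = PCA().fit(X)
V = pca.components_                                  # rows are the directions Y1..Yp

print(np.allclose(np.linalg.norm(V, axis=1), 1.0))   # each component has norm 1
print(np.allclose(V @ V.T, np.eye(V.shape[0])))      # pairwise orthogonal
print(pca.explained_variance_)                       # decreasing captured variance
```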
The Intuition Behind PCA
• The top PCA components capture most of the variation (the interesting features) of the data.
• Each component is a linear combination of the original predictors; we visualize them as vectors in the feature space.
The Intuition Behind PCA
• Transforming our observed data means projecting our dataset onto the space defined by the top m PCA components; these components are our new predictors.
Using PCA for Regression
• PCA is easy to use in Python, so how do we then use it for regression modeling in a real-life problem?
• If we use all p of the new Yj, then we have not improved the dimensionality. Instead, we select the first M PCA variables, Y1, ..., YM, to use as predictors in a regression model.
• The choice of M is important and can vary from application to application. It depends on various things, like how collinear the predictors are, how truly related they are to the response, etc.
• What would be the best way to check for a specified problem? Train and test! (See the sketch below.)
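A minimal sketch of this idea (principal components regression) using scikit-learn; the synthetic data, the choice M = 3, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic example: 10 collinear predictors, response built from a few directions.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 4)) @ rng.normal(size=(4, 10))
y = X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Keep the first M = 3 principal components as predictors, then fit a linear model.
model = make_pipeline(PCA(n_components=3), LinearRegression())
model.fit(X_train, y_train)

print(model.score(X_test, y_test))  # held-out R^2: "train and test" to choose M
```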
Step by Step
• Step 1: We need to have some data for PCA.
• Step 2: Subtract the mean from each of the data points.
Step 1 & Step 2
Original data (X1, X2), mean-adjusted data, and squared deviations (mean of X1 = 1.81, mean of X2 = 1.91):

 X1    X2   | X1-1.81  X2-1.91 | (X1-1.81)^2  (X2-1.91)^2
 2.5   2.4  |   0.69     0.49  |   0.4761       0.2401
 0.5   0.7  |  -1.31    -1.21  |   1.7161       1.4641
 2.2   2.9  |   0.39     0.99  |   0.1521       0.9801
 1.9   2.2  |   0.09     0.29  |   0.0081       0.0841
 3.1   3.0  |   1.29     1.09  |   1.6641       1.1881
 2.3   2.7  |   0.49     0.79  |   0.2401       0.6241
 2.0   1.6  |   0.19    -0.31  |   0.0361       0.0961
 1.0   1.1  |  -0.81    -0.81  |   0.6561       0.6561
 1.5   1.6  |  -0.31    -0.31  |   0.0961       0.0961
 1.1   0.9  |  -0.71    -1.01  |   0.5041       1.0201
------------+------------------+--------------------------
18.1  19.1  |   0        0     |   5.549        6.449   (column sums)
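The same centering step in NumPy (a sketch reusing the data from the table above):

```python
import numpy as np

data = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

mean = data.mean(axis=0)        # [1.81, 1.91]
centered = data - mean          # Step 2: subtract the mean from each data point

print(mean)
print(centered.sum(axis=0))     # ~[0, 0]: the centered columns sum to zero
```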
Step 3: Calculate the covariance matrix
• Calculate the covariance matrix:
  • Var(x1) = 5.549/9 = 0.61656
  • Var(x2) = 6.449/9 = 0.71656
  • Cov(x1,x2) = 5.539/9 = 0.6154
• The non-diagonal elements in the covariance matrix are positive, so the x1 and x2 variables increase together.

$$\mathrm{cov} = \begin{bmatrix} 0.616555556 & 0.615444444 \\ 0.615444444 & 0.716555556 \end{bmatrix}$$
Step 4: Calculate the eigenvalues and eigenvectors of the covariance matrix
• The eigenvalues are found using the characteristic equation

$$\det(A - \lambda I) = 0$$

  where A is the covariance matrix and I is the identity matrix.
• Solving this equation gives a second-degree (quadratic) equation in λ. Solving that quadratic gives two values of λ; these are the eigenvalues of A.
• Now we find the eigenvectors by solving, for each eigenvalue, the equation

$$A v = \lambda v$$

• Since the covariance matrix is square, we can calculate its eigenvectors and eigenvalues, with the constraint that each eigenvector is of unit length.
• Solving the equation for both eigenvalues in turn gives the two eigenvectors; their numerical values are used when forming the feature vector in Step 6 below.
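As a numerical check of Steps 3 and 4 (a sketch, not part of the slides), NumPy reproduces the covariance matrix and its eigen-decomposition:

```python
import numpy as np

data = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

# Step 3: np.cov divides by (n - 1) = 9; columns are the variables x1, x2.
C = np.cov(data, rowvar=False)
print(C)                  # ~[[0.6166, 0.6154], [0.6154, 0.7166]]

# Step 4: eigh handles symmetric matrices; eigenvalues come back in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(C)
print(eigenvalues)        # roughly [0.049, 1.284]
print(eigenvectors)       # columns are unit eigenvectors (components ~0.678 and ~0.735)
```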
What does this all mean?
[Figure: scatter plot of the mean-adjusted data points with the two eigenvectors overlaid.]
Conclusion
• Eigenvectors give us information about the patterns in the data.
• Looking at the plot on the previous slide, see how one of the eigenvectors goes through the middle of the points.
• The second eigenvector tells us about another, weaker pattern in the data.
• So by finding the eigenvectors of the covariance matrix we are able to extract lines that characterize the data.
Step 5: Choosing components and forming a feature vector
• The eigenvector with the highest eigenvalue is the principal component of the data set.
• In our example, the eigenvector with the largest eigenvalue is the one that points down the middle of the data.
• So, once the eigenvectors are found, the next step is to order them by eigenvalue, highest to lowest.
• This gives the components in order of significance.
Cont'd
• Now here comes the idea of dimensionality reduction and data compression.
• You can decide to ignore the components of least significance.
• You do lose some information, but if the eigenvalues are small you don't lose much.
• More formally stated (see next slide):
Cont'd
• We have n dimensions, so we will find n eigenvectors.
• But if we choose only the first p eigenvectors, then the final dataset has only p dimensions.
Step 6: Deriving the new dataset
• Now we have chosen the components (eigenvectors) that we want to keep, and we can write them in the form of a matrix of vectors (the feature vector).
• In our example we have two eigenvectors, so we have two choices:

Choice-1: with two eigenvectors

$$\begin{bmatrix} -0.7351 & -0.6778 \\ 0.6778 & -0.7351 \end{bmatrix}$$

Choice-2: with one eigenvector, i.e. the first (principal) eigenvector only

$$\begin{bmatrix} -0.6778 \\ -0.7351 \end{bmatrix}$$
Cont'd
• To obtain the final dataset we multiply the chosen feature vector, transposed, with the transpose of the (mean-adjusted) data matrix:

$$\text{FinalData} = \text{FeatureVector}^{T} \times \text{MeanAdjustedData}^{T}$$

• Using the first eigenvector alone, we get the first principal component.
• The final dataset will have the data items in columns and the dimensions along the rows.
• So we have the original data set represented in a transformed form.
[Figure: the original data set represented using the two eigenvectors as axes.]
[Figure: the original data set restored using only one eigenvector.]
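A sketch of this projection and reconstruction in NumPy (continuing the worked example; the variable names are mine):

```python
import numpy as np

data = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])
mean = data.mean(axis=0)
adjusted = (data - mean).T                   # dimensions along rows, items in columns

C = np.cov(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(C)
principal = eigenvectors[:, [np.argmax(eigenvalues)]]   # keep only the top eigenvector

final = principal.T @ adjusted               # FinalData = FeatureVector^T x MeanAdjustedData^T
restored = (principal @ final).T + mean      # approximate reconstruction from one component

print(final.shape)      # (1, 10): one dimension, ten data items
print(restored[:3])     # close to the first three original points
```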
PCA – Mathematical Working
• The naïve basis (I) of the input matrix (X) spans a large dimensional space.
• A change of basis (P) is required so that X can be projected onto a lower-dimensional space containing only the significant dimensions.
• A properly selected P will generate a projection Y.
• Use this P to project the correlation matrix; lessening the number of eigenvectors in P gives a reduced-dimension projection.
PCA Procedure
• Step 1: Get data.
• Step 2: Subtract the mean.
• Step 3: Calculate the covariance matrix.
• Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix.
• Step 5: Choose components and form a feature vector.
A compact sketch of the whole procedure follows below.
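A minimal end-to-end sketch of these steps in NumPy (my own summary, not code from the lecture), assuming the input rows are data points and the columns are variables:

```python
import numpy as np

def pca(data, n_components):
    """Steps 2-5: center the data, build the covariance matrix, take the
    top eigenvectors as the feature vector, and project the data onto them."""
    mean = data.mean(axis=0)                         # Step 2: subtract the mean
    adjusted = data - mean
    cov = np.cov(adjusted, rowvar=False)             # Step 3: covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # Step 4: eigen-decomposition
    order = np.argsort(eigenvalues)[::-1]            # Step 5: order by significance
    feature_vector = eigenvectors[:, order[:n_components]]
    return adjusted @ feature_vector, feature_vector, mean

# Example with the 2-d data set from the worked example (Step 1: get data).
data = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])
reduced, fv, mean = pca(data, n_components=1)
print(reduced.shape)   # (10, 1): ten points expressed in one principal dimension
```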