Principal Component Analysis (PCA)
ANISHA M. LAL
Dimensionality Reduction
and Feature Construction
• Principal Component Analysis (PCA) is an unsupervised
linear transformation technique that is widely used across
different fields, most prominently for feature extraction and
dimensionality reduction.
• PCA is used to reduce the dimensionality of data without much loss of
information.
• Dimensionality reduction is a process through which we can
visualize high-dimensional data by reducing the number of
dimensions.
• It is used in machine learning, signal processing, and image
compression, among other applications.
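To make this concrete, here is a minimal sketch of dimensionality reduction with PCA, assuming scikit-learn is available; the 5-dimensional data is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 features (synthetic)

pca = PCA(n_components=2)              # keep the 2 strongest components
X_reduced = pca.fit_transform(X)       # shape: (100, 2)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)   # fraction of variance each component retains
```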
PCA
• Variance and covariance of attributes $A_1 = \{x_i\}$ and $A_2 = \{y_i\}$:

$$\mathrm{var}(A_1) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}, \qquad \mathrm{cov}(A_1, A_2) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
• Covariance matrix for two attributes H and M:

$$\begin{pmatrix} \mathrm{var}(H) & \mathrm{cov}(H, M) \\ \mathrm{cov}(H, M) & \mathrm{var}(M) \end{pmatrix} = \begin{pmatrix} 47.7 & 104.5 \\ 104.5 & 370 \end{pmatrix}$$
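As a quick illustration of the formula and the matrix layout above, the sketch below computes a 2×2 covariance matrix with NumPy; the H and M values are hypothetical and are not the data behind the matrix shown.

```python
import numpy as np

# Hypothetical hours-studied (H) and marks (M) values, for illustration only.
H = np.array([9.0, 15.0, 25.0, 14.0, 10.0, 18.0])
M = np.array([39.0, 56.0, 93.0, 61.0, 50.0, 75.0])

# cov(H, M) exactly as in the formula: sum((x_i - mean)(y_i - mean)) / (n - 1)
n = len(H)
cov_hm = np.sum((H - H.mean()) * (M - M.mean())) / (n - 1)

# np.cov assembles the full matrix [[var(H), cov(H,M)], [cov(H,M), var(M)]]
C = np.cov(H, M)
print(cov_hm)
print(C)
```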
PCA Algorithm
1. Given the original data set S = {x1, ..., xk}, produce a new set by
subtracting the mean of attribute Ai from each xi.
2. Calculate the covariance matrix.
3. Calculate the (unit) eigenvectors and eigenvalues of the
covariance matrix.
4. Order the eigenvectors by eigenvalue, highest to lowest, and
construct the new feature vector.
5. Derive the new data set:
TransformedData = RowFeatureVector × RowDataAdjust
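A from-scratch sketch of these five steps in NumPy follows; the function name `pca` and the synthetic test data are my own choices for illustration, not part of the original slides.

```python
import numpy as np

def pca(S, n_components):
    """PCA via eigendecomposition of the covariance matrix.

    S: (k, d) array, one row per sample x_i.
    Returns the transformed data, the feature vector, and the mean.
    """
    # Step 1: subtract the mean of each attribute A_i.
    mean = S.mean(axis=0)
    data_adjust = S - mean

    # Step 2: covariance matrix (rowvar=False: columns are attributes).
    C = np.cov(data_adjust, rowvar=False)

    # Step 3: unit eigenvectors and eigenvalues (eigh, since C is symmetric).
    eigenvalues, eigenvectors = np.linalg.eigh(C)

    # Step 4: order eigenvectors by eigenvalue, highest to lowest,
    # and keep the top n_components as the feature vector.
    order = np.argsort(eigenvalues)[::-1]
    feature_vector = eigenvectors[:, order[:n_components]]

    # Step 5: TransformedData = RowFeatureVector x RowDataAdjust.
    transformed = feature_vector.T @ data_adjust.T
    return transformed.T, feature_vector, mean

rng = np.random.default_rng(1)
S = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 3))  # correlated 3-D data
T, W, mu = pca(S, n_components=2)
print(T.shape)  # (50, 2)
```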
PCA
1. Given the original data set S = {x1, ..., xk}, produce a new set by
subtracting the mean of attribute Ai from each xi.

One eigenvalue/eigenvector pair of the covariance matrix:

$$\lambda_2 = .0490833989, \qquad v_2 = \begin{pmatrix} -.735178956 \\ .677873399 \end{pmatrix}$$

Keeping both eigenvectors (ordered by eigenvalue) as columns:

$$\mathrm{FeatureVector}_1 = \begin{pmatrix} -.677873399 & -.735178956 \\ -.735178956 & .677873399 \end{pmatrix}$$

Keeping only the principal eigenvector:

$$\mathrm{FeatureVector}_2 = \begin{pmatrix} -.677873399 \\ -.735178956 \end{pmatrix}$$
5. Derive the new data set.

$$\mathrm{RowFeatureVector}_1 = \begin{pmatrix} -.677873399 & -.735178956 \\ -.735178956 & .677873399 \end{pmatrix}$$

This gives the original data in terms of the chosen components (eigenvectors), i.e., along these axes.

$$\mathrm{RowDataAdjust} = \begin{pmatrix} .69 & -1.31 & .39 & .09 & 1.29 & .49 & .19 & -.81 & -.31 & -.71 \\ .49 & -1.21 & .99 & .29 & 1.09 & .79 & -.31 & -.81 & -.31 & -1.01 \end{pmatrix}$$
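The sketch below replays this worked example with NumPy, using the mean-adjusted data above; it recovers the eigenvalues quoted earlier (about .04908 and 1.28403) and applies step 5. The solver's eigenvector sign conventions may differ from the printed ones, which only flips the sign of the corresponding transformed component.

```python
import numpy as np

# RowDataAdjust from the example above (rows are the two attributes).
row_data_adjust = np.array([
    [.69, -1.31, .39, .09, 1.29, .49,  .19, -.81, -.31,  -.71],
    [.49, -1.21, .99, .29, 1.09, .79, -.31, -.81, -.31, -1.01],
])

C = np.cov(row_data_adjust)                  # 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)
print(eigenvalues)                           # ~ [0.0490834, 1.2840277]

# Eigenvectors as rows, ordered by eigenvalue, highest first.
row_feature_vector = eigenvectors[:, ::-1].T

transformed_data = row_feature_vector @ row_data_adjust
print(transformed_data)                      # data expressed along the new axes
```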
Reconstructing the original data
We did:

$$\mathrm{TransformedData} = \mathrm{RowFeatureVector} \times \mathrm{RowDataAdjust}$$

so we can do

$$\mathrm{RowDataAdjust} = \mathrm{RowFeatureVector}^{-1} \times \mathrm{TransformedData} = \mathrm{RowFeatureVector}^{T} \times \mathrm{TransformedData}$$

(the inverse equals the transpose because the rows of RowFeatureVector are orthonormal unit eigenvectors), and

$$\mathrm{RowDataOriginal} = \mathrm{RowDataAdjust} + \mathrm{OriginalMean}$$
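In code the inversion is just a transpose; this continues the previous sketch. The attribute means (1.81, 1.91) are illustrative placeholders, since the slides do not give the original means.

```python
import numpy as np

# Continuing the previous sketch: undo the transform.
row_data_back = row_feature_vector.T @ transformed_data   # == RowDataAdjust

# Add back the per-attribute means (hypothetical values here, as the
# original means are not given in the slides).
original_mean = np.array([[1.81], [1.91]])
row_data_original = row_data_back + original_mean
print(row_data_original)
```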
Advantages of PCA
• Removes correlated features.
• Improves algorithm performance by reducing the number of
dimensions: the training time of learning algorithms drops
significantly with fewer features.
• Reduces overfitting: overfitting often occurs when there are too
many variables in the dataset, so PCA helps by reducing the
number of features.
• Improves visualization: it is very hard to visualize and understand
data in high dimensions. PCA transforms high-dimensional data
into low-dimensional data (e.g., two dimensions) so that it can be
visualized easily, as sketched below.
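As one possible illustration of the visualization point, this sketch projects the 4-dimensional Iris dataset onto its first two principal components, assuming scikit-learn and matplotlib are available.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)          # 4-dimensional data
X_2d = PCA(n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)   # one point per sample, colored by class
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```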
Disadvantages of PCA
• The independent variables become less interpretable: each
principal component is a linear combination of the original features.
• Data standardization is a must for PCA: principal components are
biased towards features with high variance, which can give misleading
results. PCA is affected by scale, so scale the features in your data
before applying PCA (see the sketch after this list).
• Categorical features require encoding, as PCA works only on
numerical data.
• Information is lost when data is spread across different
structures/shapes: although principal components try to capture the
maximum variance among the features in a dataset, if the number of
principal components is not selected with care, some information may
be lost compared with the original list of features.
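A minimal sketch of the standardization point, assuming scikit-learn is available: the two synthetic features differ in scale by a factor of 100, and StandardScaler puts them on an equal footing before PCA.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = np.column_stack([
    rng.normal(scale=1.0, size=200),       # small-scale feature
    rng.normal(scale=100.0, size=200),     # large-scale feature
])

# Without scaling, the first component would align almost entirely
# with the large-scale feature, regardless of any real structure.
X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
X_pca = PCA(n_components=1).fit_transform(X_std)
```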