Principal Component Analysis Concepts

Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of large data sets by transforming the data to a new set of variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. PCA involves computing the eigenvectors and eigenvalues of the covariance matrix to determine the principal components that best explain the variance in the data.


PCA

Principal Component Analysis Concepts


[email protected]
QU1HPBT85A

Proprietary content. ©Great


ThisLearning.
file is meantAll
forRights
personalReserved. Unauthorized use oronly.
use by [email protected] distribution prohibited
Sharing or publishing the contents in part or full is liable for legal action.
Principal Component Analysis

1. Main idea: seek the most accurate representation of the data in a lower dimensional space

2. Example in 2-D: project the data onto a 1-D subspace (a line) with minimal projection error

3. In both pictures above, the data points (black dots) are projected onto a line, but the second line is closer to the actual points (smaller projection error) than the first

4. Notice that the good line to use for projection lies in the direction of largest variance

Ref: http://www.cs.haifa.ac.il/~rita/uml_course/add_mat/PCA.pdf
PCA Pt 2

5. After the data is projected onto the best line, the coordinate system must be transformed to get a 1-D representation for the vector y

6. Note that the new data y has the same variance as the old data x in the direction of the green line

7. PCA preserves the largest variances in the data

Ref: http://www.cs.haifa.ac.il/~rita/uml_course/add_mat/PCA.pdf
PCA Pt 3

8. In general, PCA on n dimensions results in a new set of n dimensions. The one that captures the maximum variance in the underlying data is principal component 1; principal component 2 is orthogonal to it

9. Example in 2-D: project the data onto a 1-D subspace (a line) with minimal projection error

Ref: http://www.cs.haifa.ac.il/~rita/uml_course/add_mat/PCA.pdf
Mechanics of Principal Component Analysis


http://setosa.io/ev/principal-component-analysis/
Principal Component Analysis steps

1. Begin by standardizing the data: subtract from each dimension its mean, which shifts the data points to the origin, i.e. the data is centered at the origin

2. Generate the covariance matrix / correlation matrix across all the dimensions

3. Perform eigen decomposition, that is, compute the eigen vectors (the principal components) and the corresponding eigen values (the magnitudes of variance captured)

4. Sort the eigen pairs in descending order of eigen values and select the one with the largest value. This is the first principal component, which captures the maximum information from the original data (see the code sketch after the reference below)

Ref: http://www.cs.haifa.ac.il/~rita/uml_course/add_mat/PCA.pdf
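A minimal NumPy sketch of these four steps; the random data matrix and all variable names here are illustrative, not taken from the original lab:

import numpy as np

# Step 1: centre the data so every dimension has zero mean
X = np.random.rand(100, 4)                     # stand-in data: 100 samples, 4 dimensions
X_centered = X - X.mean(axis=0)

# Step 2: covariance matrix across the dimensions (rowvar=False: columns are variables)
cov_matrix = np.cov(X_centered, rowvar=False)

# Step 3: eigen decomposition; the eigen vectors are the principal components
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)

# Step 4: sort the eigen pairs in descending order of eigen values
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
print('PC1 captures %.1f%% of the total variance' % (100 * eig_vals[0] / eig_vals.sum()))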
Principal Component Analysis (Performance issues)

1. PCA effectiveness depends on the scales of the attributes. If attributes have different scales, PCA will pick the variable with the highest variance rather than picking attributes based on correlation (see the sketch after this list)

2. Changing the scales of the variables can change the PCA

3. Interpreting PCA can become challenging in the presence of discrete data

4. Presence of skew in the data, with a long thick tail, can impact the effectiveness of the PCA (related to point 1)

5. PCA assumes a linear relationship between attributes. It is ineffective when the relationships are non-linear
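A small scikit-learn sketch of point 1 (the synthetic data is illustrative): without scaling, PCA latches onto the attribute with the largest variance even though it is pure noise.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(0, 1, 500)                      # informative attribute, unit scale
x2 = x1 + rng.normal(0, 0.1, 500)               # strongly correlated with x1
noise = rng.normal(0, 1000, 500)                # uninformative, but huge scale
X = np.column_stack([x1, x2, noise])

pca = PCA(n_components=1).fit(X)
print(pca.components_)                          # ~[0, 0, 1]: the large-scale noise dominates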

Lab-3 Principal Component Analysis on iris data set

Description – Explore the iris data set and perform PCA

The data set is winequality-red.csv

[email protected]
QU1HPBT85A

Sol: PCA-iris.ipynb
Principal Component Analysis (Signal to noise ratio)

Signal – all valid values for a variable, spanning from its minimum to its maximum on each axis (X min to X max, Y min to Y max in the figure). Represents valid data.

Noise – the spread of data points around the best fit line. For a given value of x there are multiple values of y (some on the line and some around it); this spread is due to random factors.

Signal to Noise Ratio – variance of the signal / variance of the noise. The greater the SNR, the better the model will be (see the sketch after this slide).

import pandas as pd
import matplotlib.pyplot as plt

X_std_df = pd.DataFrame(X_std)                  # X_std: standardized data from the lab
axes = pd.plotting.scatter_matrix(X_std_df)     # pairwise scatter plots of the dimensions
plt.tight_layout()
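As a rough numeric illustration of this definition (synthetic data, assumed for the example): fit a line, then compare the variance of the fitted values (signal) to the variance of the residuals (noise).

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2 * x + rng.normal(0, 1.5, 200)             # linear signal plus random spread

slope, intercept = np.polyfit(x, y, 1)          # best fit line
y_fit = slope * x + intercept
snr = np.var(y_fit) / np.var(y - y_fit)         # variance of signal / variance of noise
print('SNR = %.1f' % snr)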

Principal Component Covariance Matrix

1. Variance is measured within the dimensions and covariance is measured among the dimensions

2. Express the total variance (variance within and cross variance between dimensions) as a matrix

3. The covariance matrix is a mathematical representation of the total variance of individual dimensions and across dimensions

Covariance matrix for three dimensions x, y and z:
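The matrix pictured on the slide did not survive extraction; it is presumably the standard covariance matrix

C = \begin{bmatrix} \mathrm{var}(x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\ \mathrm{cov}(y,x) & \mathrm{var}(y) & \mathrm{cov}(y,z) \\ \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{var}(z) \end{bmatrix}

with the variances of x, y and z on the diagonal and the pairwise covariances off the diagonal.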

# cov_matrix: the covariance matrix above, e.g. np.cov(X_std, rowvar=False)
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)


Improving SNR through PCA (Scaling the dimensions)

1. The mean is subtracted from all the points on both dimensions, i.e. (xi – x̄) and (yi – ȳ)

2. The dimensions are transformed using algebra into a new set of dimensions

3. The transformation is a rotation of the axes in mathematical space: in the figure, the 1st principal component lies along the direction of largest variance and the 2nd principal component is orthogonal to it

from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)       # centre and scale each dimension
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)

PCA (Calculating total variance: covariance and variance)

4. Multiplying the two matrices produces the matrix of total variance, also called the covariance matrix (a square, symmetric matrix)
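The figure for this step is missing; the two matrices are presumably the centered data matrix X (n samples × d dimensions) and its transpose, which give the usual sample covariance matrix

C = \frac{1}{n-1}\, X^{\top} X

a d × d square, symmetric matrix.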

[email protected]
QU1HPBT85A

Improving SNR through PCA (Principal components)

5. The original data points are now represented by the red dots on the new dimensions

6. The projection also introduces an error of representation (the vertical red lines from the blue dots to the corresponding red dots on the new dimension); in the figure, the spread along the new axis (X min to X max) is the signal and the spread off the axis is the noise

7. The axis rotation is done such that the new dimension captures the maximum variance in the data points and also reduces the total error of representation

print('Eigen Vectors \n%s' % eig_vecs)
print('\n Eigen Values \n%s' % eig_vals)

Properties of principal components and their covariance matrix

8. Thus, to find the principal components we need to obtain a diagonal matrix from the original covariance matrix

9. For this we have to transform the matrix A into a new matrix B such that the covariance matrix of B is a diagonal matrix (refer to part 2, bullet 5); a numerical check follows below
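A minimal numerical check of this property (toy data, variable names illustrative): projecting the data A onto the eigenvectors of its covariance matrix yields B whose covariance matrix is diagonal.

import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))   # correlated toy data
A = A - A.mean(axis=0)                                    # centre it

eig_vals, eig_vecs = np.linalg.eig(np.cov(A, rowvar=False))
B = A @ eig_vecs                                          # transform A to B

print(np.round(np.cov(B, rowvar=False), 6))               # diagonal: covariances are ~0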

PCA for dimensionality reduction

1. PCA can also be used to reduce dimensions

2. Arrange all eigen vectors along with their corresponding eigen values in descending order of eigen values

3. Plot a cumulative eigen value graph as shown below

4. Eigen vectors with an insignificant contribution to the total eigen values can be removed from the analysis (e.g. eigen vectors 6 and 7 below)
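The cumulative graph itself is missing; a sketch of how it is typically produced (the eigen values here are illustrative placeholders, matplotlib assumed):

import numpy as np
import matplotlib.pyplot as plt

eig_vals = np.array([4.2, 2.5, 1.1, 0.6, 0.4, 0.15, 0.05])   # sorted descending
cumulative = np.cumsum(eig_vals) / eig_vals.sum()

plt.step(range(1, len(cumulative) + 1), cumulative, where='mid')
plt.xlabel('Number of eigen vectors kept')
plt.ylabel('Cumulative share of total eigen value')
plt.show()

# keep only components that together cover, say, 95% of the variance
k = int(np.searchsorted(cumulative, 0.95)) + 1
print('Keep the first %d components' % k)                    # drops eigen vectors 6 and 7 here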

END

Thanks
[email protected]
QU1HPBT85A

Proprietary content. ©Great


ThisLearning.
file is meantAll
forRights
personalReserved. Unauthorized use oronly.
use by [email protected] distribution prohibited
Sharing or publishing the contents in part or full is liable for legal action.
