
Dimensionality Reduction

(Principal Component Analysis)


By
Dr. Mausumi Maitra
Professor
Dept. of Information Technology
Govt. College of Engg. and Ceramic Technology
PCA

• Principal Component Analysis is an unsupervised learning algorithm that is used for dimensionality reduction in Machine Learning.

• It is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation.

• These new transformed features are called the Principal Components.

• It is one of the popular tools used for exploratory data analysis and predictive modeling.
PCA works by considering the variance of each attribute: an attribute with high variance gives a good split between the classes, so attributes that contribute little variance can be dropped, and the dimensionality is reduced.

Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. It is a feature extraction technique, so it keeps the important variables and drops the least important ones.

The PCA algorithm is based on some mathematical concepts such as:

• Variance and Covariance

• Eigenvalues and Eigenvectors


Some common terms used in the PCA algorithm:

• Dimensionality: It is the number of features or variables present in the given dataset. More simply, it is the number of columns present in the dataset.

• Correlation: It signifies how strongly two variables are related to each other, i.e., if one changes, the other variable also changes. The correlation value ranges from -1 to +1: -1 occurs if the variables are inversely proportional to each other, and +1 indicates that the variables are directly proportional to each other.

• Orthogonal: It means that the variables are not correlated to each other, and hence the correlation between the pair of variables is zero.

• Eigenvectors: If there is a square matrix M and a non-zero vector v, then v is an eigenvector of M if Mv is a scalar multiple of v.

• Covariance Matrix: A matrix containing the covariance between each pair of variables is called the covariance matrix (a short sketch illustrating these terms follows this list).
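
As a rough illustration of these terms, the NumPy sketch below (with made-up numbers, not data from the slides) computes a correlation, builds a covariance matrix, and checks the eigenvector condition Mv = λv:

import numpy as np

# Two made-up variables: y rises with x, so their correlation is close to +1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

corr = np.corrcoef(x, y)[0, 1]           # correlation, always between -1 and +1
cov_matrix = np.cov(np.stack([x, y]))    # 2 x 2 covariance matrix of the pair

# Eigenvector condition: v is an eigenvector of M if M @ v is a scalar multiple of v.
M = cov_matrix
eigenvalues, eigenvectors = np.linalg.eig(M)
v, lam = eigenvectors[:, 0], eigenvalues[0]
print(corr)                              # close to +1 for this example
print(np.allclose(M @ v, lam * v))       # True: M v equals lambda times v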

Properties of PCA

As described above, the transformed new features, i.e., the output of PCA, are the Principal Components. The number of these PCs is either equal to or less than the number of original features present in the dataset. Some important properties of these principal components are given below:

• Each principal component must be a linear combination of the original features.

• These components are orthogonal, i.e., the correlation between a pair of components is zero.

• The importance of each component decreases when going from 1 to n: the 1st PC has the most importance and the nth PC has the least importance (a short sketch checking these properties follows this list).
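
A small scikit-learn sketch on a made-up random dataset can be used to check these properties empirically; PCA, components_, and explained_variance_ are standard scikit-learn names, while the dataset itself is invented for illustration:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # made-up dataset: 100 samples, 4 features
X[:, 1] += 2 * X[:, 0]               # introduce correlation between two features

pca = PCA().fit(X)

# Orthogonality: the pairwise dot products of the components form the identity matrix.
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(4)))

# Decreasing importance: the explained variance is already sorted from largest to smallest.
print(pca.explained_variance_)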
Algorithm Steps:

1. Getting the dataset

First, we take the input dataset and divide it into two subparts X and Y, where X is the training set and Y is the validation set.

2. Representing data in a structure

Now we represent our dataset in a structure, such as a two-dimensional matrix of the independent variable X. Here each row corresponds to a data item and each column corresponds to a feature. The number of columns is the dimensionality of the dataset.
3. Standardizing the data

In this step, we standardize our dataset. Within a particular column, the features with high variance are more important compared to the features with lower variance. If the importance of features is to be independent of the variance of the features, we divide each data item in a column by the standard deviation of that column (after centering the column on its mean). The resulting matrix is named Z.
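
A minimal NumPy sketch of this standardization step (the small matrix X below is a made-up example, not data from the slides):

import numpy as np

# Made-up dataset X: rows are data items, columns are features (step 2).
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# Standardize: center each column on its mean and divide by its standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0)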

4. Calculating the Covariance of Z

To calculate the covariance of Z, we take the matrix Z and transpose it. After the transpose, we multiply it by Z. The output matrix is the covariance matrix of Z.
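
Continuing the same made-up example, the covariance of Z can be computed as the transpose of Z multiplied by Z, scaled by n - 1:

import numpy as np

# Z is the standardized matrix from step 3 (same made-up example repeated here).
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance of Z: transpose of Z multiplied by Z, scaled by n - 1.
cov_Z = (Z.T @ Z) / (Z.shape[0] - 1)
print(cov_Z)                         # symmetric d x d covariance matrix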
5. Calculating the Eigenvalues and Eigenvectors

Now we need to calculate the eigenvalues and eigenvectors of the resultant covariance matrix of Z. The eigenvectors of the covariance matrix are the directions of the axes with the most information (variance), and the corresponding eigenvalues measure how much variance lies along each of those directions.
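
A minimal sketch of this step, assuming cov_Z is the covariance matrix from step 4 (a made-up symmetric example here); np.linalg.eigh is used because a covariance matrix is symmetric:

import numpy as np

# cov_Z stands for the covariance matrix from step 4 (made-up symmetric example).
cov_Z = np.array([[1.25, 1.17],
                  [1.17, 1.25]])

# eigh is the NumPy routine for symmetric matrices such as a covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov_Z)
# Each column of `eigenvectors` is an axis direction; the matching eigenvalue
# measures how much variance lies along that direction.
print(eigenvalues)
print(eigenvectors)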

6. Sorting the Eigenvectors

In this step, we take all the eigenvalues and sort them in decreasing order, i.e., from largest to smallest, and simultaneously sort the eigenvectors accordingly in the matrix P of eigenvectors. The resultant sorted matrix is named P*.
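
A minimal sketch of the sorting step with made-up eigenvalues and eigenvectors; P_star below stands for the sorted matrix P*:

import numpy as np

# Eigenvalues and eigenvectors from step 5 (made-up values for illustration;
# each column of `eigenvectors` is one eigenvector).
eigenvalues = np.array([0.05, 2.45])
eigenvectors = np.array([[-0.707, 0.707],
                         [ 0.707, 0.707]])

# Sort the eigenvalues in decreasing order and reorder the eigenvector columns to match.
order = np.argsort(eigenvalues)[::-1]
P_star = eigenvectors[:, order]      # the sorted matrix P*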
7. Calculating the new features or Principal Components

Here we calculate the new features. To do this, we multiply the standardized matrix Z by the matrix P* (Z* = Z P*). In the resultant matrix Z*, each observation is a linear combination of the original features, and each column of Z* is independent of (uncorrelated with) the others.
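
A minimal sketch of the projection step, repeating the earlier made-up example so it runs on its own; Z_star below stands for Z*:

import numpy as np

# Z and P* carried over from the earlier steps (made-up example repeated here).
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh((Z.T @ Z) / (Z.shape[0] - 1))
P_star = eigenvectors[:, np.argsort(eigenvalues)[::-1]]

# New features: each row of Z* is a linear combination of the original features,
# and the columns of Z* (the principal components) are uncorrelated with each other.
Z_star = Z @ P_star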

8. Removing less important features from the new dataset

The new feature set has been obtained, so we decide here what to keep and what to remove: we keep only the relevant or important features in the new dataset, and the unimportant features are removed.
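
The whole procedure (steps 3 through 8) can be collected into one short NumPy sketch; the function name pca and the parameter n_keep (how many principal components to retain) are hypothetical names used for illustration, and in practice sklearn.decomposition.PCA performs the same computation:

import numpy as np

def pca(X, n_keep):
    # Step 3: standardize each column.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 4: covariance matrix of Z.
    cov_Z = (Z.T @ Z) / (Z.shape[0] - 1)
    # Step 5: eigenvalues and eigenvectors of the covariance matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(cov_Z)
    # Step 6: sort eigenvectors by decreasing eigenvalue to obtain P*.
    P_star = eigenvectors[:, np.argsort(eigenvalues)[::-1]]
    # Step 7: project onto the principal components.
    Z_star = Z @ P_star
    # Step 8: keep only the most important components.
    return Z_star[:, :n_keep]

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca(X, n_keep=1))              # the two original features reduced to one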
Applications of Principal Component Analysis

• PCA is mainly used as a dimensionality reduction technique in various AI applications such as computer vision, image compression, etc.

• It can also be used for finding hidden patterns when the data has high dimensionality.

• Some fields where PCA is used are finance, data mining, psychology, etc.
