
A Step-by-Step Explanation of Principal Component Analysis (PCA)

Learn how to use a PCA when working with large data sets.

Written by Zakaria Jaadi

Image: Shutterstock / Built In

Updated by Brennan Whitfield | Feb 23, 2024 | Reviewed by Sadrach Pierre

Principal component analysis (PCA) is a dimensionality reduction method used to simplify a large data set into a smaller set while still maintaining significant patterns and trends.

Principal component analysis can be broken down into five steps. I’ll go through each step,
providing logical explanations of what PCA is doing and simplifying mathematical concepts
such as standardization, covariance, eigenvectors and eigenvalues without focusing on how to
compute them.

How Do You Do a Principal Component Analysis?

1. Standardize the range of continuous initial variables.
2. Compute the covariance matrix to identify correlations.
3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components.
4. Create a feature vector to decide which principal components to keep.
5. Recast the data along the principal components axes.
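In practice, all five steps are wrapped by standard libraries. Here is a minimal sketch of the whole pipeline, assuming scikit-learn and NumPy are available; the small matrix X is a made-up example rather than data from this article:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 6 observations of 3 variables.
X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 2.1],
    [2.2, 2.9, 0.9],
    [1.9, 2.2, 1.2],
    [3.1, 3.0, 0.3],
    [2.3, 2.7, 1.1],
])

# Step 1: standardize so each variable contributes equally.
X_std = StandardScaler().fit_transform(X)

# Steps 2-5: PCA computes the covariance structure, its eigenvectors and
# eigenvalues, keeps the requested number of components, and recasts the
# data onto the new axes.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)                    # (6, 2)
print(pca.explained_variance_ratio_)  # share of variance per component
```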

First, some basic (and brief) background is necessary for context.

What Is Principal Component Analysis?


Principal component analysis, or PCA, is a dimensionality reduction method that is often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

What Are Principal Components?


Principal components are new variables that are constructed as linear combinations or
mixtures of the initial variables. These combinations are done in such a way that the new
variables (i.e., principal components) are uncorrelated and most of the information within the
initial variables is squeezed or compressed into the first components. So, the idea is that 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on, until you have something like what is shown in the scree plot below.
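To make this "squeezing" of information concrete, here is a small sketch (scikit-learn and NumPy assumed; the correlated random data is hypothetical) that prints the kind of decreasing variance percentages a scree plot visualizes:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical 10-dimensional data built from only 3 underlying signals,
# so most of the variance ends up in the first few components.
base = rng.normal(size=(200, 3))
noise = 0.1 * rng.normal(size=(200, 10))
X = base @ rng.normal(size=(3, 10)) + noise

pca = PCA().fit(X)

# Percentage of variance carried by each of the 10 principal components;
# the values decrease, which is exactly what a scree plot shows.
print(np.round(100 * pca.explained_variance_ratio_, 2))
```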

Geometrically speaking, principal components represent the directions of the data that explain a maximal amount of variance, that is to say, the lines that capture most of the information in the data. The relationship between variance and information here is that the larger the variance carried by a line, the larger the dispersion of the data points along it; and the larger the dispersion along a line, the more information it carries. To put all this simply, just think of principal components as new axes that provide the best angle from which to see and evaluate the data, so that the differences between the observations are more visible.

How PCA Constructs the Principal Components


As there are as many principal components as there are variables in the data, principal
components are constructed in such a manner that the first principal component accounts for
the largest possible variance in the data set. For example, let's assume that the scatter plot of our data set is as shown below. Can we guess the first principal component? Yes, it's approximately the line that matches the purple marks, because it goes through the origin and it's the line along which the projection of the points (red dots) is the most spread out. Or, mathematically speaking, it's the line that maximizes the variance, i.e., the average of the squared distances from the projected points (red dots) to the origin.

The second principal component is calculated in the same way, with the condition that it is
uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for
the next highest variance.

This continues until a total of p principal components have been calculated, equal to the
original number of variables.
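The following NumPy sketch, using an invented 2-D point cloud, checks both claims numerically: projections onto the first principal component have the largest variance of any direction tried, and the second principal component is perpendicular to the first.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical, elongated 2-D point cloud, centered at the origin.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = X - X.mean(axis=0)

# Eigenvectors of the covariance matrix, ordered by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
pc1, pc2 = eigvecs[:, order[0]], eigvecs[:, order[1]]

# Variance of the projections onto a handful of candidate directions.
for angle in np.linspace(0.0, np.pi, 7):
    d = np.array([np.cos(angle), np.sin(angle)])
    print(f"variance along {np.round(d, 2)}: {(X @ d).var():.3f}")

print("variance along PC1:", (X @ pc1).var())  # largest of all directions
print("PC1 . PC2 =", float(pc1 @ pc2))         # ~0, i.e. perpendicular
```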

Step-by-Step Explanation of PCA

Step 1: Standardization
The aim of this step is to standardize the range of the continuous initial variables so that each
one of them contributes equally to the analysis.

More specifically, the reason why it is critical to perform standardization prior to PCA is that the latter is quite sensitive to the variances of the initial variables. That is, if there are large differences between the ranges of the initial variables, the variables with larger ranges will dominate over those with small ranges (for example, a variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1), which will lead to biased results. So, transforming the data to comparable scales can prevent this problem.

Mathematically, this can be done by subtracting the mean and dividing by the standard
deviation for each value of each variable.

Once the standardization is done, all the variables will be transformed to the same scale.
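As a concrete sketch of this z-score transformation (NumPy assumed; the toy matrix X, with one wide-range and one narrow-range variable, is hypothetical):

```python
import numpy as np

# Hypothetical data: one variable ranging roughly 0-100, one roughly 0-1.
X = np.array([
    [85.0, 0.21],
    [40.0, 0.55],
    [63.0, 0.80],
    [12.0, 0.05],
    [97.0, 0.42],
])

# z = (value - mean) / standard deviation, applied column by column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0).round(10))  # ~0 for each variable
print(X_std.std(axis=0))             # 1 for each variable
```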

Step 2: Covariance Matrix Computation
The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables x, y and z, the covariance matrix is a 3×3 matrix of this form:

Covariance Matrix for 3-Dimensional Data:

Cov(x,x)  Cov(x,y)  Cov(x,z)
Cov(y,x)  Cov(y,y)  Cov(y,z)
Cov(z,x)  Cov(z,y)  Cov(z,z)

Since the covariance of a variable with itself is its variance (Cov(a,a) = Var(a)), in the main diagonal (top left to bottom right) we actually have the variances of each initial variable. And since the covariance is commutative (Cov(a,b) = Cov(b,a)), the entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and the lower triangular portions are equal.
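Both properties are easy to verify numerically. A short sketch, assuming NumPy and a hypothetical standardized data set with three variables:

```python
import numpy as np

# Hypothetical standardized data: 5 observations of 3 variables x, y, z.
X_std = np.array([
    [ 1.2, -0.8,  0.5],
    [-0.3,  0.4, -1.1],
    [ 0.7,  1.0,  0.9],
    [-1.4, -0.9, -0.2],
    [-0.2,  0.3, -0.1],
])

# 3 x 3 covariance matrix; rowvar=False means columns are the variables.
cov = np.cov(X_std, rowvar=False)

print(cov.round(3))
print(np.allclose(cov, cov.T))          # True: Cov(a, b) = Cov(b, a)
print(np.allclose(np.diag(cov),
                  X_std.var(axis=0, ddof=1)))  # diagonal holds the variances
```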

What do the covariances that we have as entries of the matrix tell us about the
correlations between the variables?

It’s actually the sign of the covariance that matters:

If positive: the two variables increase or decrease together (correlated).

If negative: one increases when the other decreases (inversely correlated).

Now that we know that the covariance matrix is no more than a table that summarizes the correlations between all the possible pairs of variables, let's move to the next step.

Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
Eigenvectors and eigenvalues are the linear algebra concepts we need to compute from the covariance matrix in order to determine the principal components of the data. The eigenvectors of the covariance matrix are the directions of the axes where there is the most variance (the most information), and these are what we call the principal components. Eigenvalues are simply the coefficients attached to the eigenvectors, and they give the amount of variance carried by each principal component.

By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you get the
principal components in order of significance.

Principal Component Analysis Example:

Let’s suppose that our data set is 2-dimensional with 2 variables x,y and that the eigenvectors
and eigenvalues of the covariance matrix are as follows:

Principal Component Analysis Example

If we rank the eigenvalues in descending order, we get λ1>λ2, which means that the
eigenvector that corresponds to the first principal component (PC1) is v1 and the one that
corresponds to the second principal component (PC2) is v2.

After obtaining the principal components, to compute the percentage of variance (information) accounted for by each component, we divide the eigenvalue of each component by the sum of the eigenvalues. If we apply this to the example above, we find that PC1 and PC2 carry 96 percent and 4 percent of the variance of the data, respectively.
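Here is a short sketch of that computation with NumPy. The 2×2 covariance matrix below is hypothetical; its entries were chosen only so that the resulting split comes out close to the 96 percent / 4 percent mentioned above:

```python
import numpy as np

# Hypothetical 2 x 2 covariance matrix for variables x and y.
cov = np.array([
    [0.617, 0.615],
    [0.615, 0.717],
])

# eigh returns eigenvalues in ascending order for symmetric matrices,
# so reverse to rank them from highest to lowest.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Percentage of variance (information) carried by PC1 and PC2.
explained = 100 * eigvals / eigvals.sum()
print(explained.round(1))   # roughly [96.3, 3.7]
```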

Step 4: Create a Feature Vector

In this step, we choose whether to keep all of the components or discard those of lesser significance (those with low eigenvalues), and form with the remaining ones a matrix of vectors called the feature vector. So, the feature vector is simply a matrix that has as columns the eigenvectors of the components we decide to keep. This makes it the first step toward dimensionality reduction, because if we choose to keep only p eigenvectors (components) out of n, the final data set will have only p dimensions.
Principal Component Analysis Example:

Continuing with the example from the previous step, we can either form a feature vector with
both of the eigenvectors v1 and v2:

Principal Component Analysis eigenvectors
Or discard the eigenvector v2, which is the one of lesser significance, and form a feature vector
with v1 only:

Principal Component Analysis feature vector
Discarding the eigenvector v2 will reduce dimensionality by 1 and will consequently cause a loss of information in the final data set. But given that v2 carries only 4 percent of the information, the loss is not important: we will still have the 96 percent of the information that is carried by v1.

So, as we saw in the example, it's up to you to choose whether to keep all the components or discard the ones of lesser significance, depending on what you are looking for. If you just want to describe your data in terms of new, uncorrelated variables (the principal components) without seeking to reduce dimensionality, leaving out the less significant components is not needed.
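Continuing the hypothetical example from step 3 (NumPy assumed), forming the feature vector is just a matter of keeping the ranked eigenvectors you want as columns of a matrix:

```python
import numpy as np

# Hypothetical eigenvectors (columns) already ranked by eigenvalue,
# as produced in the step 3 sketch: first column is v1, second is v2.
eigvecs = np.array([
    [0.678, -0.735],
    [0.735,  0.678],
])

# Keep both components: the feature vector is the full 2 x 2 matrix.
feature_vector_full = eigvecs[:, :2]

# Or discard v2 and keep only v1: a 2 x 1 feature vector,
# reducing the final data set to one dimension.
feature_vector_reduced = eigvecs[:, :1]

print(feature_vector_full.shape)      # (2, 2)
print(feature_vector_reduced.shape)   # (2, 1)
```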

Step 5: Recast the Data Along the Principal Components Axes

In this last step, the aim is to use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes to the ones represented by the principal components (hence the name principal component analysis). This can be done by multiplying the standardized original data set (observations in rows) by the feature vector (kept eigenvectors in columns).
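A final sketch tying the steps together (NumPy assumed; the data is again invented): standardize, decompose the covariance matrix, keep the first eigenvector as the feature vector, and multiply to recast the data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical raw data: 100 observations of 2 correlated variables.
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.6], [0.0, 0.4]])

# Step 1: standardize.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Steps 2-3: covariance matrix and its eigendecomposition, ranked.
eigvals, eigvecs = np.linalg.eigh(np.cov(X_std, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]

# Step 4: feature vector keeping only the first principal component.
feature_vector = eigvecs[:, :1]

# Step 5: recast the data along the principal component axes.
final_data = X_std @ feature_vector

print(final_data.shape)   # (100, 1): one dimension instead of two
```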

An overview of principal component analysis (PCA). | Video: Visually Explained

