Principal Component Analysis (PCA) Explained
Learn how to use a PCA when working with large data sets.
Updated by Brennan Whitfield | Feb 23, 2024. Reviewed by Sadrach Pierre.
Principal component analysis (PCA) is a dimensionality reduction method used to simplify a large data set into a smaller set while still maintaining significant patterns and trends.
Principal component analysis can be broken down into five steps. I’ll go through each step,
providing logical explanations of what PCA is doing and simplifying mathematical concepts
such as standardization, covariance, eigenvectors and eigenvalues without focusing on how to
compute them.
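As a preview of where these five steps lead in practice, here is a minimal sketch using NumPy and scikit-learn (the data set X is hypothetical; scikit-learn's PCA class handles the covariance, eigendecomposition and projection steps internally, so only the standardization is done explicitly):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical data set: 100 observations, 5 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Step 1: standardize so every variable contributes equally
X_std = StandardScaler().fit_transform(X)

# Steps 2-5: covariance, eigendecomposition, component selection and projection
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)  # share of variance kept by each component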
Geometrically speaking, principal components represent the directions of the data that explain a maximal amount of variance, that is to say, the lines that capture most of the information in the data. The relationship between variance and information here is that the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more information it has. To put all this simply, just think of principal components as new axes that provide the best angle to see and evaluate the data, so that the differences between the observations are more visible.
The first principal component is the axis along which the variance of the projected data is greatest. The second principal component is calculated in the same way, with the condition that it is uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for the next highest variance.
This continues until a total of p principal components have been calculated, equal to the
original number of variables.
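To make this orthogonality concrete, a small sketch (hypothetical three-variable data, assuming scikit-learn is available) can verify that the returned components are mutually perpendicular unit vectors:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))

pca = PCA().fit(X)           # keeps all p = 3 components
C = pca.components_          # one component (unit vector) per row

# The dot product of any two distinct components is (numerically) zero
print(np.round(C @ C.T, 6))  # ≈ identity matrix: orthonormal directions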
Step 1: Standardization
The aim of this step is to standardize the range of the continuous initial variables so that each
one of them contributes equally to the analysis.
More specifically, the reason why it is critical to perform standardization prior to PCA is that the latter is quite sensitive to the variances of the initial variables. That is, if there are large differences between the ranges of the initial variables, the variables with larger ranges will dominate over those with small ranges (for example, a variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1), which will lead to biased results. So, transforming the data to comparable scales can prevent this problem.
Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value of each variable: z = (value − mean) / standard deviation.
Once the standardization is done, all the variables will be transformed to the same scale.
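A minimal sketch of the standardization step with NumPy (the data in X is hypothetical):

import numpy as np

# Hypothetical data: rows are observations, columns are the initial variables
X = np.array([[0.0, 10.0], [1.0, 80.0], [2.0, 35.0], [3.0, 95.0]])

# z = (value - mean) / standard deviation, applied per variable (column)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ≈ 0 for every variable
print(X_std.std(axis=0))   # = 1 for every variable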
Step 2: Covariance Matrix Computation
The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables x, y and z, the covariance matrix is a 3×3 matrix of this form:
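In other words, for the three variables x, y and z it collects every pairwise covariance:

Cov(x,x)  Cov(x,y)  Cov(x,z)
Cov(y,x)  Cov(y,y)  Cov(y,z)
Cov(z,x)  Cov(z,y)  Cov(z,z)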
Since the covariance of a variable with itself is its variance (Cov(a,a)=Var(a)), in the main diagonal (top left to bottom right) we actually have the variances of each initial variable. And
since the covariance is commutative (Cov(a,b)=Cov(b,a)), the entries of the covariance matrix
are symmetric with respect to the main diagonal, which means that the upper and the lower
triangular portions are equal.
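As a sketch (the data in X_std is hypothetical), NumPy's cov function builds this matrix directly from the standardized data and exhibits exactly these properties:

import numpy as np

rng = np.random.default_rng(2)
X_std = rng.normal(size=(50, 3))    # standardized data: 50 observations of x, y, z

# rowvar=False: each column is a variable, each row an observation
cov = np.cov(X_std, rowvar=False)

print(cov.shape)                    # (3, 3): one entry per pair of variables
print(np.allclose(cov, cov.T))      # True: Cov(a, b) = Cov(b, a)
print(np.diag(cov))                 # the variances Var(x), Var(y), Var(z)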
What do the covariances that we have as entries of the matrix tell us about the correlations between the variables?
It's actually the sign of the covariance that matters: if it is positive, the two variables increase or decrease together (they are correlated); if it is negative, one increases when the other decreases (they are inversely correlated).
Now that we know that the covariance matrix is nothing more than a table that summarizes the correlations between all the possible pairs of variables, let's move to the next step.
Step 3: Compute the Eigenvectors and Eigenvalues of the Covariance Matrix to Identify the Principal Components
Eigenvectors and eigenvalues are the linear algebra concepts behind principal components: the eigenvectors of the covariance matrix are actually the directions of the axes where there is the most variance (most information), and these are what we call principal components. Eigenvalues are simply the coefficients attached to the eigenvectors, which give the amount of variance carried in each principal component.
By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you get the
principal components in order of significance.
Let’s suppose that our data set is 2-dimensional with 2 variables x,y and that the eigenvectors
and eigenvalues of the covariance matrix are as follows:
If we rank the eigenvalues in descending order, we get λ1>λ2, which means that the
eigenvector that corresponds to the first principal component (PC1) is v1 and the one that
corresponds to the second principal component (PC2) is v2.
After having the principal components, to compute the percentage of variance (information)
accounted for by each component, we divide the eigenvalue of each component by the sum of
eigenvalues. If we apply this on the example above, we find that PC1 and PC2 carry
respectively 96 percent and 4 percent of the variance of the data.
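A minimal sketch of this ranking and the variance percentages with NumPy (the covariance matrix below is a hypothetical toy example, not the one from the article's figure):

import numpy as np

cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])             # hypothetical 2x2 covariance matrix

# eigh is suited to symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Rank from highest to lowest eigenvalue to get PC1, PC2, ...
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]    # column i is the eigenvector of PC(i+1)

# Percentage of variance carried by each component
explained = eigenvalues / eigenvalues.sum()
print(explained)                         # [0.95 0.05] for this toy matrix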
Step 4: Create a Feature Vector
The feature vector is simply a matrix that has as columns the eigenvectors of the components we decide to keep. This makes it the first step toward dimensionality reduction, because if we choose to keep only p eigenvectors (components) out of n, the final data set will have only p dimensions.
Principal Component Analysis Example:
Continuing with the example from the previous step, we can either form a feature vector with both of the eigenvectors v1 and v2, or discard the eigenvector v2, which is the one of lesser significance, and form a feature vector with v1 only.
Discarding the eigenvector v2 will reduce dimensionality by 1, and will consequently cause a loss of information in the final data set. But given that v2 was carrying only 4 percent of the information, the loss is not significant, and we will still have the 96 percent of the information that is carried by v1.
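Continuing the NumPy sketch (the eigenvector values below are hypothetical stand-ins for v1 and v2), forming the feature vector amounts to keeping the first k columns of the sorted eigenvector matrix:

import numpy as np

# Columns already sorted by descending eigenvalue (from the previous step)
eigenvectors = np.array([[0.7, -0.7],
                         [0.7,  0.7]])   # hypothetical values for v1 and v2

k = 1                                    # keep only PC1
feature_vector = eigenvectors[:, :k]     # 2 x 1 matrix with v1 as its only column
print(feature_vector)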
So, as we saw in the example, it's up to you to choose whether to keep all the components or discard the ones of lesser significance, depending on what you are looking for. If you just want to describe your data in terms of new variables (principal components) that are uncorrelated, without seeking to reduce dimensionality, leaving out the less significant components is not needed.
Step 5: Recast the Data Along the Principal Components Axes
In this last step, the aim is to use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes to the ones represented by the principal components (hence the name principal component analysis). This can be done by multiplying the transpose of the standardized original data set by the transpose of the feature vector.
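Putting the sketch together (X_std and feature_vector are assumed from the earlier steps, with hypothetical values), the recasting is a single matrix product; multiplying the standardized data by the feature vector is equivalent, row for row, to the transposed product described above:

import numpy as np

rng = np.random.default_rng(3)
X_std = rng.normal(size=(50, 2))            # standardized data from step 1
feature_vector = np.array([[0.7], [0.7]])   # hypothetical feature vector (v1 only)

# Each row of X_final is an observation expressed along the kept principal axes
X_final = X_std @ feature_vector
print(X_final.shape)                         # (50, 1): dimensionality reduced to 1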