PCA Project Advanced Statistics
Ankit Sharma
TABLE OF CONTENTS:
Executive Summary
Introduction
Data Description
Sample of the Dataset
Q2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
Q2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
Q2.3 Comment on the comparison between the covariance and the correlation matrices from this data. [on scaled data]
Q2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?
Q2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]
Q2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features.
Q2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
Q2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
Q2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis?
The End
LIST OF FIGURES:
Fig 8 Cumulative Explained Variance vs Number of Features
Fig 9 Rectangle Heat Map
LIST OF TABLES:
Table 1 Dataset Sample
Table 2 Dataset Info & Is Null Function
Table 3 Dataset Head After Scaling
Table 4 Covariance Matrix
Table 5 Correlation Matrix
EXECUTIVE SUMMARY
The dataset Education - Post 12th Standard.csv contains information on various colleges. A Principal Component Analysis is performed on this data for the case study according to the instructions given.
INTRODUCTION
Brief introduction to PCA: PCA is a technique that can be used to transform a set of possibly correlated observations into a set of orthogonal vectors called principal components (PCs). One way to think of PCs is as a means of explaining the variance in the data.
According to Wikipedia: "Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components."
DATA DESCRIPTION
1) NAMES: NAMES OF VARIOUS UNIVERSITIES AND COLLEGES
2) APPS: NUMBER OF APPLICATIONS RECEIVED
3) ACCEPT: NUMBER OF APPLICATIONS ACCEPTED
4) ENROLL: NUMBER OF NEW STUDENTS ENROLLED
5) TOP10PERC: PERCENTAGE OF NEW STUDENTS FROM THE TOP 10% OF THEIR HIGHER SECONDARY CLASS
6) TOP25PERC: PERCENTAGE OF NEW STUDENTS FROM THE TOP 25% OF THEIR HIGHER SECONDARY CLASS
7) F.UNDERGRAD: NUMBER OF FULL-TIME UNDERGRADUATE STUDENTS
8) P.UNDERGRAD: NUMBER OF PART-TIME UNDERGRADUATE STUDENTS
9) OUTSTATE: OUT-OF-STATE TUITION
10) ROOM.BOARD: COST OF ROOM AND BOARD
11) BOOKS: ESTIMATED BOOK COSTS FOR A STUDENT
12) PERSONAL: ESTIMATED PERSONAL SPENDING FOR A STUDENT
13) PHD: PERCENTAGE OF FACULTY WITH PH.D.'S
14) TERMINAL: PERCENTAGE OF FACULTY WITH A TERMINAL DEGREE
15) S.F.RATIO: STUDENT/FACULTY RATIO
16) PERC.ALUMNI: PERCENTAGE OF ALUMNI WHO DONATE
17) EXPEND: INSTRUCTIONAL EXPENDITURE PER STUDENT
18) GRAD.RATE: GRADUATION RATE
SAMPLE OF THE DATASET:
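Below is a minimal sketch of how the sample (Table 1) and the info and null checks (Table 2) could be produced, assuming a pandas workflow; the file name comes from the brief, and the local path and variable names are assumptions.

import pandas as pd

# Load the college dataset named in the executive summary
df = pd.read_csv("Education - Post 12th Standard.csv")

print(df.head())          # Table 1: sample of the dataset
df.info()                 # Table 2: column types and non-null counts
print(df.isnull().sum())  # Table 2: missing-value check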
Q2.1) Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?
UNIVARIATE ANALYSIS:
Distribution plots and box plots were drawn for each numeric variable: Accept, Enroll, Top10perc, Top25perc, F.Undergrad, P.Undergrad, Outstate, Room.Board, Books, Personal, PhD, Terminal, S.F.Ratio, perc.alumni, Expend, and Grad.Rate. (Plots omitted.)
CONCLUSION: From the distribution plots and box plots we see that the majority of the variables are positively skewed, with outliers present in the data. Top25perc is an exception: it is normally distributed and has no outliers. Outstate is also normally distributed and has only one outlier. Room.Board, S.F.Ratio, perc.alumni, and Grad.Rate are roughly normally distributed as well, whereas PhD and Terminal are negatively skewed.
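A minimal sketch of these univariate plots, assuming seaborn and matplotlib and the df loaded earlier:

import matplotlib.pyplot as plt
import seaborn as sns

# One distribution plot (skew) and one box plot (outliers) per variable
for col in df.select_dtypes(include="number").columns:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(df[col], kde=True, ax=axes[0])
    sns.boxplot(x=df[col], ax=axes[1])
    fig.suptitle(col)
    plt.show()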
MULTIVARIATE ANALYSIS:
According to Wikipedia: Multivariate analysis is based on the statistical principle of multivariate statistics, which involves observation and analysis of more than one statistical outcome variable at a time. In design and analysis, the technique is used to perform trade studies across multiple dimensions while taking into account the effects of all variables on the responses of interest. Uses for multivariate analysis include:
- Design for capability
- Inverse design, where any variable can be treated as an independent variable
- Analysis of Alternatives, the selection of concepts to fulfil a customer need
- Analysis of concepts with respect to changing scenarios
- Identification of critical design drivers and correlations across hierarchical levels
Multivariate analysis can be complicated by the desire to include physics-based analysis to calculate the effects of variables for a hierarchical "system-of-systems." Often, studies that wish to use multivariate analysis are stalled by the dimensionality of the problem.
For performing the multivariate analysis, we will use a pair plot and a heat map to understand the relationships between all the continuous variables in the dataset.
PAIR PLOT:
The pair plot lets us see the pairwise patterns and trends between the variables.
HEAT MAP:
We see that the variable pairs most highly positively correlated with each other are Apps & Accept and Enroll & Accept. We also see a negative correlation between Outstate & Personal.
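The pair plot and heat map can be reproduced with a sketch along these lines, assuming seaborn and a Names column as per the data description:

import matplotlib.pyplot as plt
import seaborn as sns

num_df = df.drop(columns=["Names"])  # keep only the continuous variables

sns.pairplot(num_df)                 # pairwise scatter plots and trends
plt.show()

plt.figure(figsize=(12, 8))
sns.heatmap(num_df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()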
Q2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
Yes, scaling is necessary for PCA in this case. Scaling is one of the data pre-processing steps performed before applying machine learning algorithms to a dataset. Most supervised and unsupervised learning methods make decisions based on the data given to them, and the algorithms often compute distances between data points to draw better inferences. Since the variables here are measured on very different scales (application counts in the thousands, percentages, and dollar amounts), unscaled features with large variances would dominate the principal components.
We dropped the categorical variable before performing the scaling, and below is the result. A z-score tells us how many standard deviations a point lies from the mean, and in which direction. After applying the z-score function, all the variables are on a common scale.
SCALED DATA
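A minimal sketch of the z-score scaling, assuming scipy's zscore function (sklearn's StandardScaler would give the same result) and the num_df from the previous step:

from scipy.stats import zscore

# z = (x - mean) / std, applied column by column
scaled_df = num_df.apply(zscore)
print(scaled_df.head())  # Table 3: dataset head after scaling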
Q2.3 Comment on the comparison between the covariance and the correlation matrices from this data. [on scaled data]
Covariance and correlation are two mathematical concepts quite commonly used in business statistics. Both determine the relationship and measure the dependency between two random variables. Despite some similarities between these two terms, they are different from each other. Correlation is when a change in one item may result in a change in another item; it is considered the best tool for measuring and expressing the quantitative relationship between two variables in a formula. Covariance, on the other hand, is when two items vary together.
Link for the table: Difference Between Covariance and Correlation (with Comparison Chart) - Key Differences
Scaling, in general, means representing the dataset in one common unit. Because each z-score scaled variable has unit variance, the covariance matrix of the scaled data and the correlation matrix are essentially identical; this is the main observation when comparing Table 4 and Table 5.
Table 5. Correlation Matrix
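The two matrices (Tables 4 and 5) can be computed and compared with a short sketch like this, on the scaled data from above:

cov_matrix = scaled_df.cov()    # Table 4: covariance matrix
corr_matrix = scaled_df.corr()  # Table 5: correlation matrix

# On z-score scaled data the two agree up to the n/(n-1) sample
# normalization, so the largest entry-wise difference is tiny
print((cov_matrix - corr_matrix).abs().max().max())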
Q2.4 Check the dataset for outliers before and after scaling. What insight do you derive
here?
We plot box plots to identify the outliers in each variable, both before and after scaling. (Box plots omitted.) Scaling changes only the units, not the shape of each distribution, so the same observations appear as outliers after scaling as before: scaling does not remove outliers.
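A minimal sketch of the before/after box plots, reusing num_df and scaled_df from the earlier steps:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(14, 8))
num_df.boxplot(ax=axes[0], rot=90)     # outliers before scaling
axes[0].set_title("Before scaling")
scaled_df.boxplot(ax=axes[1], rot=90)  # same outliers after scaling
axes[1].set_title("After scaling")
plt.tight_layout()
plt.show()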
Q2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]
Eigenvalues: an eigenvalue of a matrix A is a scalar λ such that the equation Av = λv has a nontrivial solution v. Two properties of eigenvalues:
1) Eigenvectors corresponding to distinct eigenvalues are linearly independent.
2) A singular matrix has at least one zero eigenvalue.
(Printed eigenvalues and eigenvectors from Sklearn PCA; the full eigenvector matrix, one row per principal component, is omitted here.)
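A minimal sketch of the extraction with Sklearn, where explained_variance_ holds the eigenvalues and components_ the eigenvectors:

from sklearn.decomposition import PCA

pca = PCA(n_components=17)  # one component per scaled feature
pca.fit(scaled_df)

print("Eigenvalues:\n", pca.explained_variance_)
print("Eigenvectors:\n", pca.components_)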
Q2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features.
Requisites before performing PCA:
Bartlett's Test of Sphericity
Bartlett's test of sphericity tests the null hypothesis that the variables in the data are uncorrelated. If the p-value is small, we can reject the null hypothesis and conclude that there is at least one pair of correlated variables in the data, hence PCA is recommended.
Performing Bartlett's test of sphericity, the p-value is 0.0, so we reject H0.
KMO Test
The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (MSA) is an index used to examine how appropriate PCA is.
Generally, if the MSA is less than 0.5, PCA is not recommended, since no meaningful reduction is expected. On the other hand, an MSA greater than 0.7 is expected to provide a considerable reduction in dimension and the extraction of meaningful components.
Performing the KMO test, the result is 0.8131, so we can proceed with the PCA analysis.
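Both requisite tests can be run with a short sketch, assuming the factor_analyzer package is installed:

from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

chi_square, p_value = calculate_bartlett_sphericity(scaled_df)
print("Bartlett p-value:", p_value)   # 0.0 here, so we reject H0

kmo_per_variable, kmo_overall = calculate_kmo(scaled_df)
print("Overall MSA:", kmo_overall)    # 0.8131 here, so PCA is appropriate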
The eigenvectors are available as pca.components_, and from them we can compute the correlation between the components and the original features.
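A minimal sketch of the export asked for in Q2.6, labelling the eigenvector matrix with the original feature names (pc_labels and components_df are illustrative names):

import pandas as pd

pc_labels = ["PC" + str(i + 1) for i in range(pca.n_components_)]
components_df = pd.DataFrame(pca.components_,
                             index=pc_labels,
                             columns=scaled_df.columns)
print(components_df.round(2))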
Scree plot:
A scree plot can be drawn as a line graph or a bar diagram. It plots the eigenvalues against the component number and is useful for deciding how many components to retain in PCA (Principal Component Analysis) and FA (Factor Analysis). The scree plot is also known as the scree test.
Visually, we can observe a steep drop in the variance explained as the number of PCs increases.
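A minimal sketch of the scree plot, reusing the fitted pca object:

import matplotlib.pyplot as plt

plt.plot(range(1, len(pca.explained_variance_) + 1),
         pca.explained_variance_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()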
Q2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values
with two places of decimals only). [hint: write the linear equation of PC in terms of
eigenvectors and corresponding features].
The first PC is a linear combination of the scaled features, with the entries of the first eigenvector as coefficients: PC1 = w1*Apps + w2*Accept + ... + w17*Grad.Rate, where each weight wi is taken from the first eigenvector and rounded to two decimal places.
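A short sketch that prints this equation with two-decimal coefficients, reusing the fitted pca object (the actual weights come from the data):

# Build "PC1 = w1*feature1 + w2*feature2 + ..." from the first eigenvector
terms = ["{:+.2f}*{}".format(w, feat)
         for w, feat in zip(pca.components_[0], scaled_df.columns)]
print("PC1 =", " ".join(terms))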
Q2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide
on the optimum number of principal components? What do the eigenvectors indicate?
Considering the cumulative explained variance: the first principal component explains 32% of the variance in the data; together the first two explain 58%, the first three 65%, the first four 71%, the first five 76%, and the first six 81%. We therefore retain components up to roughly 80% of the cumulative variance, which in this case gives 6 components.
The eigenvalues tell us how much variance each component captures, while the eigenvectors indicate the directions of maximum variance: the entries of each eigenvector are the weights of the original features in the corresponding principal component.
Fig 8: Cumulative Explained variance vs number of features
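The cumulative values behind Fig 8 can be reproduced with a sketch like this:

import numpy as np
import matplotlib.pyplot as plt

cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var.round(2))  # ~0.32, 0.58, 0.65, 0.71, 0.76, 0.81, ...

plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
plt.axhline(0.80, linestyle="--")  # the ~80% retention threshold
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.show()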
Q2.9 Explain the business implication of using the Principal Component Analysis for this
case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the
Principal Components Obtained]
This case study is about an education dataset that contains the names of various colleges and universities along with details about them. To understand the dataset better, we performed univariate and multivariate analysis, which gives us an understanding of the variables. From the univariate analysis we can understand the distribution of each variable, its skew, and the patterns in the dataset. From the multivariate analysis we can understand the correlation between variables; it shows that several variables are highly correlated with each other. Scaling standardizes the variables onto one common scale. For this business case, 6 principal components capture the bulk of the variance in the dataset, so further analysis can work with these 6 uncorrelated components instead of the original 17 correlated features, with little loss of information.
Fig 9: Rectangle heat Map
← THE END →