
EDUCATION - POST 12TH STANDARD

DATA ANALYSIS USING PCA

Name: Ankit Sharma
PGP-DSBA Online, June '21
Date: 12/09/2021
TABLE OF CONTENTS:

Executive Summary
Introduction
Data Description
Sample of the Dataset
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
2.3 Comment on the comparison between the covariance and the correlation matrices from this data. [on scaled data]
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?
2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA print both]
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features.
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]
THE END
LIST OF FIGURES:
Fig 1: Correlation Plot of original data
Fig 2: Univariate Analysis
Fig 3: Pair Plot
Fig 4: Heat Map
Fig 5: Box plot before scaling
Fig 6: Box plot after scaling
Fig 7: Scree Plot
Fig 8: Cumulative Explained Variance vs Number of Features
Fig 9: Rectangle Heat Map
Fig 10: Heat Map

LIST OF TABLES:
Table 1: Dataset Sample
Table 2: Dataset Info & Is Null Function
Table 3: Dataset head after scaling
Table 4: Covariance Matrix
Table 5: Correlation Matrix
EXECUTIVE SUMMARY
The dataset Education - Post 12th Standard.csv contains information on various colleges. This report performs a Principal Component Analysis for this case study according to the instructions given.

INTRODUCTION
PCA is a technique that transforms a set of possibly correlated observations into a set of orthogonal vectors called principal components (PCs). One way to think of PCs is as a means of explaining the variance in the data.

According to Wikipedia: "Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components."

DATA DESCRIPTION
1) NAMES: NAMES OF VARIOUS UNIVERSITY AND COLLEGES
2) APPS: NUMBER OF APPLICATIONS RECEIVED
3) ACCEPT: NUMBER OF APPLICATIONS ACCEPTED
4) ENROLL: NUMBER OF NEW STUDENTS ENROLLED
5) TOP10PERC: PERCENTAGE OF NEW STUDENTS FROM TOP 10% OF HIGHER SECONDARY
CLASS
6) TOP25PERC: PERCENTAGE OF NEW STUDENTS FROM TOP 25% OF HIGHER SECONDARY
CLASS
7) F. UNDERGRAD: NUMBER OF FULL-TIME UNDERGRADUATE STUDENTS
8) P. UNDERGRAD: NUMBER OF PART-TIME UNDERGRADUATE STUDENTS
9) OUTSTATE: OUT-OF-STATE TUITION CHARGED BY THE COLLEGE OR UNIVERSITY
10) ROOM.BOARD: COST OF ROOM AND BOARD
11) BOOKS: ESTIMATED BOOK COSTS FOR A STUDENT
12) PERSONAL: ESTIMATED PERSONAL SPENDING FOR A STUDENT
13) PHD: PERCENTAGE OF FACULTIES WITH PH.D.’S
14) TERMINAL: PERCENTAGE OF FACULTIES WITH TERMINAL DEGREE
15) S.F. RATIO: STUDENT/FACULTY RATIO
16) PERC.ALUMNI: PERCENTAGE OF ALUMNI WHO DONATE
17) EXPEND: THE INSTRUCTIONAL EXPENDITURE PER STUDENT
18) GRAD.RATE: GRADUATION RATE

SAMPLE OF THE DATASET:

Table 1. Dataset Sample

Dataset has 18 columns/variables with 777 rows.

Q2.1) Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?

Table 2. Dataset Info & Is Null Function

 The dataset has 777 rows and 18 columns.
 The dataset has object, float64 and int64 values; Names alone is categorical.
 We also checked for duplicates and found none.
 The dataset has no missing values.

CORRELATION PLOT- “for the original/unchanged data”

Fig 1: Correlation Plot of original data


UNIVARIATE ANALYSIS:
The term univariate analysis refers to the analysis of one variable. You can remember this because
the prefix “uni” means “one.” The purpose of univariate analysis is to understand the distribution of
values for a single variable.
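The univariate plots below can be produced along these lines (a synthetic stand-in column is used here, since the actual dataset is not reproduced; the report likely used seaborn's distribution and box plots, but plain matplotlib shows the same idea):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
apps = rng.lognormal(mean=7.0, sigma=1.0, size=777)  # stand-in for the Apps column

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(apps, bins=30)         # distribution plot shows skew
ax1.set_title("Apps: distribution")
ax2.boxplot(apps, vert=False)   # box plot exposes outliers
ax2.set_title("Apps: box plot")
fig.savefig("apps_univariate.png")
```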
[Distribution and box plots for each variable: Apps, Accept, Enroll, Top10perc, Top25perc, F.Undergrad, P.Undergrad, Outstate, Room.Board, Books, Personal, PhD, Terminal, S.F.Ratio, perc.alumni, Expend, Grad.Rate]

Fig 2: Univariate Analysis

CONCLUSION: From the distribution and box plots we find that the majority of the variables are positively skewed, with outliers present in the data. Top25perc is normally distributed and has no outliers. Outstate is also normally distributed, with only one outlier present. Room.Board, S.F.Ratio, perc.alumni and Grad.Rate are likewise roughly normally distributed, whereas PhD and Terminal are negatively skewed.

MULTIVARIATE ANALYSIS:
According to Wikipedia: multivariate analysis is based on the statistical principle of multivariate statistics, which involves the observation and analysis of more than one statistical outcome variable at a time. The technique is used to perform trade studies across multiple dimensions while taking into account the effects of all variables on the responses of interest; its uses range from design studies and analysis of alternatives to the identification of critical design drivers and correlations across hierarchical levels.

For performing the multivariate analysis, we will use Pair plot and Heat Map to understand the
relationship between all the continuous values in the dataset.

Below are the two Plots to understand the relationship:

PAIR PLOT

HEAT MAP

PAIR PLOT:

Fig 3: Pair Plot

The pair plot helps us understand the patterns and trends between each pair of variables.

HEAT MAP:

Fig 4: Heat Map

The heat map gives us the correlation between each pair of continuous variables.

We see that the variable pairs most highly positively correlated with each other are Apps & Accept and Enroll & Accept. We also see a negative correlation between Outstate & Personal.

Q2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
Yes, scaling is necessary for PCA. Scaling of the data is one of the data pre-processing steps performed before applying machine learning algorithms to a dataset. Most supervised and unsupervised learning methods make decisions according to the datasets applied to them, and the algorithms often calculate distances between data points to draw inferences from the data. Because the variables here are measured in very different units (counts, percentages, costs), unscaled variables with large ranges would dominate those distances.

We dropped the categorical variable (Names) before performing the scaling, and the result is shown below. The z-score tells us how many standard deviations a point lies away from the mean, and in which direction. All the variables have been scaled with the z-score function.
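A minimal sketch of this z-score scaling with scikit-learn (the two columns are illustrative stand-ins, not the actual dataset):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the numeric college columns
df = pd.DataFrame({"Apps": [1660.0, 2186.0, 1428.0, 417.0],
                   "Top10perc": [23.0, 16.0, 22.0, 60.0]})

# StandardScaler applies the z-score (x - mean) / std column by column,
# so every variable ends up with mean 0 and unit variance
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(scaled.round(3))
```

The same transform could equally be done with scipy.stats.zscore; the point is only that every column lands on a common scale.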

SCALED DATA

Table 3. Dataset head after scaling

Q2.3 Comment on the comparison between the covariance and the correlation matrices from this data. [on scaled data]
Covariance and correlation are two mathematical concepts commonly used in business statistics. Both measure the relationship and the dependency between two random variables, but despite their similarities they differ: covariance indicates how two variables vary together, while correlation standardizes that relationship to a fixed [-1, 1] scale, which makes it the better tool for measuring and expressing the quantitative relationship between two variables. On z-scored data, where every variable has unit variance, the covariance matrix and the correlation matrix become numerically identical.

Reference: Difference Between Covariance and Correlation (with Comparison Chart) - Key Differences

Scaling, in general, means representing the dataset on one common unit/scale.
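The comparison can be demonstrated directly: on z-scored data the covariance matrix coincides with the correlation matrix (synthetic data used for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Columns on wildly different scales, mimicking counts, ratios and costs
df = pd.DataFrame(rng.normal(size=(200, 3)) * [1000.0, 0.1, 50.0],
                  columns=["Apps", "S.F.Ratio", "Books"])

z = (df - df.mean()) / df.std()          # z-score scaling
print(np.allclose(z.cov(), df.corr()))   # True: covariance of scaled data == correlation
```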

Table 4. Covariance Matrix

Table 5. Correlation Matrix

Q2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?
We plot box plots to identify the outliers in each variable.

Before Scaling:

Fig 5: Box plot before scaling


After Scaling:

Fig 6: Box plot after scaling

Outliers are present in the data in both scenarios.

Scaling does not remove outliers; it only represents the dataset on one common scale.
To treat outliers, we have to take additional steps: either removing them, or capping them at the IQR-based lower and upper limits.
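The IQR-based capping mentioned above can be sketched as follows (illustrative series; the 1.5×IQR fences are the usual box-plot convention, and `cap_outliers_iqr` is a hypothetical helper, not from the report):

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series) -> pd.Series:
    """Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # 100 is a clear outlier
print(cap_outliers_iqr(s).tolist())         # [1.0, 2.0, 3.0, 4.0, 7.0]
```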

Q2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA print both]
Eigenvalues: An eigenvalue of a matrix A is a scalar λ such that the equation Av = λv has a nontrivial solution v. Two properties of eigenvalues: eigenvectors with distinct eigenvalues are linearly independent, and singular matrices have a zero eigenvalue.

Eigenvectors: An eigenvector of a square matrix is a non-zero vector which, when the matrix is multiplied by it, equals a scalar multiple of that vector.
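The defining relation Av = λv can be checked numerically on a small symmetric matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

# Each column v of `eigenvectors` satisfies A v = lambda v
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)

print(sorted(round(x, 6) for x in eigenvalues.tolist()))  # [1.0, 3.0]
```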
Eigenvectors:
[[-2.48765602e-01 3.31598227e-01 -6.30921033e-02 2.81310530e-01
-5.74140964e-03 -1.62374420e-02 -4.24863486e-02 -1.03090398e-01
-9.02270802e-02 5.25098025e-02 -3.58970400e-01 4.59139498e-01
-4.30462074e-02 1.33405806e-01 -8.06328039e-02 -5.95830975e-01
2.40709086e-02]
[-2.07601502e-01 3.72116750e-01 -1.01249056e-01 2.67817346e-01
-5.57860920e-02 7.53468452e-03 -1.29497196e-02 -5.62709623e-02
-1.77864814e-01 4.11400844e-02 5.43427250e-01 -5.18568789e-01
5.84055850e-02 -1.45497511e-01 -3.34674281e-02 -2.92642398e-01
-1.45102446e-01]
[-1.76303592e-01 4.03724252e-01 -8.29855709e-02 1.61826771e-01
5.56936353e-02 -4.25579803e-02 -2.76928937e-02 5.86623552e-02
-1.28560713e-01 3.44879147e-02 -6.09651110e-01 -4.04318439e-01
6.93988831e-02 2.95896092e-02 8.56967180e-02 4.44638207e-01
1.11431545e-02]
[-3.54273947e-01 -8.24118211e-02 3.50555339e-02 -5.15472524e-02
3.95434345e-01 -5.26927980e-02 -1.61332069e-01 -1.22678028e-01
3.41099863e-01 6.40257785e-02 1.44986329e-01 -1.48738723e-01
8.10481404e-03 6.97722522e-01 1.07828189e-01 -1.02303616e-03
3.85543001e-02]
[-3.44001279e-01 -4.47786551e-02 -2.41479376e-02 -1.09766541e-01
4.26533594e-01 3.30915896e-02 -1.18485556e-01 -1.02491967e-01
4.03711989e-01 1.45492289e-02 -8.03478445e-02 5.18683400e-02
2.73128469e-01 -6.17274818e-01 -1.51742110e-01 -2.18838802e-02
-8.93515563e-02]
[-1.54640962e-01 4.17673774e-01 -6.13929764e-02 1.00412335e-01
4.34543659e-02 -4.34542349e-02 -2.50763629e-02 7.88896442e-02
-5.94419181e-02 2.08471834e-02 4.14705279e-01 5.60363054e-01
8.11578181e-02 9.91640992e-03 5.63728817e-02 5.23622267e-01
5.61767721e-02]
[-2.64425045e-02 3.15087830e-01 1.39681716e-01 -1.58558487e-01
-3.02385408e-01 -1.91198583e-01 6.10423460e-02 5.70783816e-01
5.60672902e-01 -2.23105808e-01 -9.01788964e-03 -5.27313042e-02
-1.00693324e-01 2.09515982e-02 -1.92857500e-02 -1.25997650e-01
-6.35360730e-02]
[-2.94736419e-01 -2.49643522e-01 4.65988731e-02 1.31291364e-01
-2.22532003e-01 -3.00003910e-02 1.08528966e-01 9.84599754e-03
-4.57332880e-03 1.86675363e-01 -5.08995918e-02 1.01594830e-01
-1.43220673e-01 3.83544794e-02 3.40115407e-02 1.41856014e-01
-8.23443779e-01]
[-2.49030449e-01 -1.37808883e-01 1.48967389e-01 1.84995991e-01
-5.60919470e-01 1.62755446e-01 2.09744235e-01 -2.21453442e-01
2.75022548e-01 2.98324237e-01 -1.14639620e-03 -2.59293381e-02
3.59321731e-01 3.40197083e-03 5.84289756e-02 6.97485854e-02
3.54559731e-01]
[-6.47575181e-02 5.63418434e-02 6.77411649e-01 8.70892205e-02
1.27288825e-01 6.41054950e-01 -1.49692034e-01 2.13293009e-01
-1.33663353e-01 -8.20292186e-02 -7.72631963e-04 2.88282896e-03
-3.19400370e-02 -9.43887925e-03 6.68494643e-02 -1.14379958e-02
-2.81593679e-02]
[ 4.25285386e-02 2.19929218e-01 4.99721120e-01 -2.30710568e-01
2.22311021e-01 -3.31398003e-01 6.33790064e-01 -2.32660840e-01
-9.44688900e-02 1.36027616e-01 1.11433396e-03 -1.28904022e-02
1.85784733e-02 -3.09001353e-03 -2.75286207e-02 -3.94547417e-02
-3.92640266e-02]
[-3.18312875e-01 5.83113174e-02 -1.27028371e-01 -5.34724832e-01
-1.40166326e-01 9.12555212e-02 -1.09641298e-03 -7.70400002e-02
-1.85181525e-01 -1.23452200e-01 -1.38133366e-02 2.98075465e-02
-4.03723253e-02 -1.12055599e-01 6.91126145e-01 -1.27696382e-01
2.32224316e-02]
[-3.17056016e-01 4.64294477e-02 -6.60375454e-02 -5.19443019e-01
-2.04719730e-01 1.54927646e-01 -2.84770105e-02 -1.21613297e-02
-2.54938198e-01 -8.85784627e-02 -6.20932749e-03 -2.70759809e-02
5.89734026e-02 1.58909651e-01 -6.71008607e-01 5.83134662e-02
1.64850420e-02]
[ 1.76957895e-01 2.46665277e-01 -2.89848401e-01 -1.61189487e-01
7.93882496e-02 4.87045875e-01 2.19259358e-01 -8.36048735e-02
2.74544380e-01 4.72045249e-01 2.22215182e-03 -2.12476294e-02
-4.45000727e-01 -2.08991284e-02 -4.13740967e-02 1.77152700e-02
-1.10262122e-02]
[-2.05082369e-01 -2.46595274e-01 -1.46989274e-01 1.73142230e-02
2.16297411e-01 -4.73400144e-02 2.43321156e-01 6.78523654e-01
-2.55334907e-01 4.22999706e-01 1.91869743e-02 3.33406243e-03
1.30727978e-01 -8.41789410e-03 2.71542091e-02 -1.04088088e-01
1.82660654e-01]
[-3.18908750e-01 -1.31689865e-01 2.26743985e-01 7.92734946e-02
-7.59581203e-02 -2.98118619e-01 -2.26584481e-01 -5.41593771e-02
-4.91388809e-02 1.32286331e-01 3.53098218e-02 -4.38803230e-02
-6.92088870e-01 -2.27742017e-01 -7.31225166e-02 9.37464497e-02
3.25982295e-01]
[-2.52315654e-01 -1.69240532e-01 -2.08064649e-01 2.69129066e-01
1.09267913e-01 2.16163313e-01 5.59943937e-01 -5.33553891e-03
4.19043052e-02 -5.90271067e-01 1.30710024e-02 -5.00844705e-03
-2.19839000e-01 -3.39433604e-03 -3.64767385e-02 6.91969778e-02
1.22106697e-01]]
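A sketch of how such a matrix is produced with scikit-learn, on random stand-in data with 5 columns rather than the 17 scaled college variables: `explained_variance_` holds the eigenvalues and `components_` holds the eigenvectors (one unit-length vector per row).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))               # stand-in for the scaled dataset
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
eigenvalues = pca.explained_variance_       # eigenvalues of the covariance matrix
eigenvectors = pca.components_              # one unit-length eigenvector per row
print(eigenvalues)
print(eigenvectors)
```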

Q2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features.
Requisites before performing PCA:

 Standardization: all variables must be on the same scale; since the data has already been z-scored, this requirement is met.
 Bartlett test of Sphericity
 KMO Test

Bartlett test of Sphericity

Bartlett's test of sphericity tests the hypothesis that the variables in the data are uncorrelated.

 H0: All variables in the data are uncorrelated.
 Ha: At least one pair of variables in the data is correlated.

If the null hypothesis cannot be rejected, then PCA is not advisable. If the p-value is small, we can reject the null hypothesis and conclude that at least one pair of variables in the data is correlated, hence PCA is recommended.

Performing the Bartlett test of sphericity gives a p-value of 0.0, so we reject H0.
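The report does not show the code behind this result; a hand-rolled sketch of the statistic (a chi-square test on the log-determinant of the correlation matrix; `bartlett_sphericity` is a hypothetical helper, and libraries such as factor_analyzer provide an equivalent):

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(X):
    """Test H0: the correlation matrix of X is the identity."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    dof = p * (p - 1) / 2.0
    return chi2, stats.chi2.sf(chi2, dof)

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Four columns sharing a common driver, so they are strongly correlated
X = np.hstack([base + 0.3 * rng.normal(size=(200, 1)) for _ in range(4)])

chi2, p_value = bartlett_sphericity(X)
print(p_value < 0.05)  # True: the columns are correlated, so PCA is appropriate
```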

KMO Test

The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (MSA) is an index used to examine how appropriate PCA is.

Generally, if MSA is less than 0.5, PCA is not recommended, since no reduction is expected. On the other hand, MSA > 0.7 is expected to provide a considerable reduction in the dimension and extraction of meaningful components.

Performing the KMO test gives a result of 0.8131, so we can proceed with the PCA analysis.
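A sketch of the overall KMO/MSA statistic, computed from squared correlations versus squared partial correlations (`kmo` is a hypothetical helper; factor_analyzer's calculate_kmo is the usual library route):

```python
import numpy as np

def kmo(X):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    R = np.corrcoef(X, rowvar=False)
    inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)     # partial correlation matrix
    np.fill_diagonal(partial, 0.0)
    r_off = R - np.eye(R.shape[0])      # correlations with the diagonal zeroed
    r2, a2 = (r_off ** 2).sum(), (partial ** 2).sum()
    return r2 / (r2 + a2)

rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 2))                        # two shared drivers
X = latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(500, 6))
print(round(kmo(X), 3))
```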

The component loadings are obtained from pca.components_; the correlation between the components and the original features is shown below.

Scree plot:

A scree plot may be drawn as a line graph or a bar diagram. It plots the eigenvalues (variance explained) of the components in descending order and is useful in deciding how many components to retain in PCA (Principal Component Analysis) and FA (Factor Analysis). The scree plot is also known as the scree test.

Fig 7: Scree Plot

Visually, we can observe a steep drop in the variance explained as the number of PCs increases.

We will proceed with 6 components here.

Q2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values
with two places of decimals only). [hint: write the linear equation of PC in terms of
eigenvectors and corresponding features].
Reading the first column of the eigenvector matrix above (values rounded to two decimal places, features standardized):

PC1 = -0.25*Apps - 0.21*Accept - 0.18*Enroll - 0.35*Top10perc - 0.34*Top25perc - 0.15*F.Undergrad - 0.03*P.Undergrad - 0.29*Outstate - 0.25*Room.Board - 0.06*Books + 0.04*Personal - 0.32*PhD - 0.32*Terminal + 0.18*S.F.Ratio - 0.21*perc.alumni - 0.32*Expend - 0.25*Grad.Rate

Q2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide
on the optimum number of principal components? What do the eigenvectors indicate?

Adding up all the eigenvalues (explained variance ratios) accounts for 100% of the variance.

Cumulatively, the first principal component explains 32% of the variance in the data, the first two components 58%, the first three 65%, the first four 71%, the first five 76% and the first six 81%.

Thus, we retain components until roughly 80% of the variance is covered; as described above, that gives 6 components in this case.

The eigenvectors indicate the direction of each principal component: they give the weights (loadings) with which the original features combine to form that component.
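The 80% rule above can be sketched as code (synthetic data with a few underlying drivers; the 0.80 threshold mirrors the report's choice):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
latent = rng.normal(size=(400, 3))                        # 3 underlying drivers
X = latent @ rng.normal(size=(3, 10)) + 0.3 * rng.normal(size=(400, 10))
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of PCs whose cumulative explained variance reaches 80%
n_components = int(np.searchsorted(cumulative, 0.80) + 1)
print(n_components, cumulative.round(2))
```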

Fig 8: Cumulative Explained variance vs number of features

Q2.9 Explain the business implication of using the Principal Component Analysis for this
case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the
Principal Components Obtained]
This case study concerns an education dataset containing the names of various colleges and universities along with various details about them. To understand the dataset, we performed univariate and multivariate analysis, which gave us an understanding of the variables: the distribution of the data, its skew and its patterns. From the multivariate analysis we understood the correlation between variables, with several variables highly correlated with each other. Scaling standardizes the variables onto one scale. Six PCA components suffice for this business case, capturing most (about 81%) of the variance of the dataset; any further analysis, such as clustering or regression, can therefore work with these six uncorrelated components in place of the original 17 correlated features.

Fig 9: Rectangle heat Map

Fig 10: Heat Map


← THE END →

