Project Advance Stats - Abhishek
Abhishek Gautam
PGP-DSBA
Date 17th Oct 2021
1.1 Answer
Null Hypothesis 𝐻0: The mean salary is the same across all three categories of education (Doctorate, Bachelors, HS-Grad).
Alternate Hypothesis 𝐻1: The mean salary is different in at least one category of education.
Null Hypothesis 𝐻0: The mean salary is the same across all 4 categories of occupation (Specialty, Sales, Adm-clerical, Exec-managerial).
Alternate Hypothesis 𝐻1: The mean salary is different in at least one category of occupation.
1.2 Answer
Since the p-value = 1.257709e-08 is less than the significance level (alpha = 0.05), we reject the null
hypothesis and conclude that there is a significant difference in the mean salaries for at least one category
of education.
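A one-way ANOVA of this kind can be sketched with scipy's f_oneway; the three salary samples below are illustrative stand-ins, not the assignment's actual data.

```python
# Sketch of a one-way ANOVA across education levels.
# The three salary samples are made-up illustrations.
from scipy.stats import f_oneway

doctorate = [95000, 110000, 105000, 120000, 99000]
bachelors = [60000, 72000, 65000, 70000, 68000]
hs_grad = [40000, 45000, 42000, 48000, 43000]

f_stat, p_value = f_oneway(doctorate, bachelors, hs_grad)

alpha = 0.05
if p_value < alpha:
    print("Reject H0: at least one education level has a different mean salary")
else:
    print("Fail to reject H0")
```

With clearly separated groups like these, the p-value falls well below alpha and H0 is rejected, mirroring the conclusion above.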
1.3 One-way ANOVA for 'Occupation'
Since the p-value = 0.458508 is greater than the significance level (alpha = 0.05), we fail to reject the null
hypothesis and conclude that there is no significant difference in the mean salaries across the 4 categories
of occupation.
1.4
1.5
From the above interaction plot we can make out that:
o Adm-clerical salaries for Bachelors and Doctorates are comparable.
o Sales salaries for Bachelors and Doctorates are about the same.
o Prof-specialty salaries for HS-grads and Bachelors differ a bit.
o Exec-managerial salaries for HS-grads and Doctorates are slightly higher.
From the above plot we can figure out that people with educational level:
o Doctorates: are in higher salary brackets, mostly in Prof-specialty, Exec-managerial or Sales
roles; very few are doing Adm-clerical jobs.
o Bachelors: fall in the mid income range, mostly working as Exec-managers, Adm-clerks or in
Sales, but very few are found in Prof-specialty roles.
o HS-grads: are in low income brackets, mostly doing Prof-specialty or Adm-clerical work;
a few are in Sales but hardly any in Exec-managerial roles.
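An interaction plot like the one these observations describe can be sketched with pandas and matplotlib; the salary values below are illustrative stand-ins, not the assignment's data.

```python
# Sketch of an Education x Occupation interaction plot.
import matplotlib
matplotlib.use("Agg")                     # render without a display
import pandas as pd

df = pd.DataFrame({
    "Education":  ["HS-grad", "Bachelors", "Doctorate"] * 4,
    "Occupation": (["Adm-clerical"] * 3 + ["Sales"] * 3
                   + ["Prof-specialty"] * 3 + ["Exec-managerial"] * 3),
    "Salary": [35, 55, 58, 40, 60, 62, 45, 50, 95, 50, 80, 110],
})

# Mean salary per (Occupation, Education) cell; one line per occupation.
means = df.groupby(["Occupation", "Education"])["Salary"].mean().unstack()
ax = means.T.plot(marker="o", title="Education x Occupation interaction")
ax.figure.savefig("interaction_plot.png")
```

Non-parallel lines in such a plot hint at an interaction between the two factors.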
1.6
Null Hypothesis H0: The mean salary is equal across every combination of occupation type and education level.
Alternate Hypothesis H1: At least one combination of occupation type and education level has a different mean salary.
Alpha = 0.05
Since the interaction plot shows some interaction between the two treatments, we introduce an interaction
term while performing the two-way ANOVA.
                              df    sum_sq   mean_sq         F    PR(>F)
C(Education):C(Occupation)     6  3.63E+10  6.06E+09  8.519815  2.23E-05
Residual                      29  2.06E+10  7.11E+08       NaN       NaN
Since the p-value (2.23e-05) is less than the significance level (0.05), we reject the null hypothesis: the interaction between education and occupation has a significant effect on salary.
1.7
2.1
2.2 Answer
PCA calculates a new projection of our dataset. Without scaling, variables measured on larger scales
dominate the total variance, which skews the PCA towards high-magnitude features. If we normalize the
data, all variables have the same standard deviation and therefore the same weight, so PCA finds the
relevant axes. Scaling also speeds up gradient descent and other iterative algorithms.
Scaling of data:
Z = (value - mean) / standard deviation
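A minimal sketch of this z-score formula, computed both by hand and with sklearn's StandardScaler; the two-column array is illustrative.

```python
# Sketch of Z = (value - mean) / standard deviation, by hand and via sklearn.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0, 200.0],
              [12.0, 260.0],
              [14.0, 320.0]])

# Manual z-score; ddof=0 matches StandardScaler's population std.
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))  # True: both give mean 0, std 1 per column
```

After scaling, every column has mean 0 and standard deviation 1, which is exactly what the post-scaling describe output below reports.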
Statistical description:
Observations:
After scaling, the standard deviation is 1.0 for all variables.
Post scaling, the difference between the Q1 (25%) value and the minimum is smaller than in the original
dataset for most of the variables.
2.3 Answer
Both covariance and correlation matrices measure the relationship and the dependency between two
variables.
Covariance indicates the direction of the linear relationship between variables, while correlation
measures both the strength and direction of that linear relationship.
Correlation is the scaled form of covariance: covariance is affected by a change of scale, whereas
correlation is not.
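A small sketch illustrating the scale claim: rescaling one variable changes the covariance but leaves the correlation untouched (toy vectors, not the assignment data).

```python
# Sketch: covariance depends on units, correlation does not.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

cov_before = np.cov(x, y)[0, 1]
corr_before = np.corrcoef(x, y)[0, 1]

y_scaled = y * 1000                       # same data in smaller units

cov_after = np.cov(x, y_scaled)[0, 1]
corr_after = np.corrcoef(x, y_scaled)[0, 1]

print(cov_after / cov_before)             # 1000.0: covariance scales with units
print(np.isclose(corr_after, corr_before))  # True: correlation is unchanged
```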
Heatmap
Observations:
Highest correlation is seen among:
o Enroll with F.Undergrad
o Enroll with Accept
o Apps with Accept
Least correlation is observed between the S.F.Ratio variable and: Expend, Outstate, Grad.Rate,
perc.alumni, Room.Board and Top10perc.
2.4
After standardization, a box plot is drawn again on the scaled data and the describe function is used.
Scaling makes little difference in terms of outlier reduction.
Box plot with scaled data:
Outliers influence the computation of the empirical mean and standard deviation; the standard scaler
therefore cannot guarantee balanced feature scales in the presence of outliers.
Scaled data:
2.5 Perform PCA
In the table below we can see that the first principal component explains 33.12% of the variance in our
dataset, while the first seven components together capture 70.12% of the variance.
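A sketch of how these percentages are read off after fitting PCA; random standardized data stands in for the assignment's 17-feature dataset.

```python
# Sketch: fit PCA on standardized data and inspect the explained variance.
# Random data (100 samples x 17 features) stands in for the real dataset.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 17))
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)

# Fraction of total variance captured by each component, in decreasing order.
ratios = pca.explained_variance_ratio_
print(ratios[:3])
print(ratios.sum())  # all 17 components together account for the full variance
```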
Heatmap: correlation matrix between components and features
2.6 Answer
In PCA, given a mean-centered dataset X with n samples and p variables, the first principal component PC1
is given by the linear combination of the original variables X_1, X_2, ..., X_p:
PC_1 = w_{11} X_1 + w_{12} X_2 + ... + w_{1p} X_p
The first principal component is the component that retains the maximum variance of the data; w_1 is the
eigenvector, with the largest eigenvalue, of the covariance matrix
E = X^T X / (N - 1)
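A quick numerical check of this definition on random stand-in data: the variance of the first principal component equals the largest eigenvalue of E.

```python
# Sketch: the first PC direction is the top eigenvector of E = X^T X / (N - 1).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
Xc = X - X.mean(axis=0)                  # mean-center the data

E = Xc.T @ Xc / (len(Xc) - 1)            # covariance matrix
eigvals, eigvecs = np.linalg.eigh(E)     # eigh returns ascending eigenvalues

w1 = eigvecs[:, -1]                      # direction of maximum variance
pc1_scores = Xc @ w1                     # PC_1 = w_11*X_1 + ... + w_1p*X_p

# The variance of the projected scores equals the largest eigenvalue.
print(np.isclose(pc1_scores.var(ddof=1), eigvals[-1]))
```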
2.7 Answer
Observations:
The plot visually shows how much of the variance is explained by how many principal components.
In the plot below we see that the 1st PC explains 33.13% of the variance, the first two PCs together
explain 57.19%, and so on.
Effectively we can capture most of the variance (i.e. 90%) by analysing 9 principal components instead of
all 17 variables (attributes) in the dataset.
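The component count for a target variance (e.g. 90%) can be found from the cumulative explained-variance ratio; a sketch on correlated stand-in data:

```python
# Sketch: choose the number of principal components reaching ~90% variance.
# Correlated stand-in data so that a few components dominate.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
base = rng.normal(size=(100, 5))
noise = 0.1 * rng.normal(size=(100, 5))
X = np.hstack([base, base + noise])      # 10 features, ~5 independent directions

cum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
n_components = int(np.argmax(cum >= 0.90) + 1)   # first k with cumulative >= 90%
print(n_components)
```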
PCA uses the eigenvectors of the covariance matrix to figure out how we should rotate the data. Because
rotation is a kind of linear transformation, the new dimensions are linear combinations of the old ones.
The eigenvectors (principal components) determine the directions, or axes, along which the linear
transformation acts, stretching or compressing input vectors. They are the lines of change that represent
the action of the larger matrix, the very "line" in linear transformation.
2.8
Business implication of using PCA
PCA is used for EDA and for building predictive models on continuous variables.
Dimensionality reduction is done by projecting each data point onto the first few principal components,
obtaining lower-dimensional data while preserving as much of the data's variation as possible. The more
components we keep, the more of the total variance is preserved.
The first principal component can equivalently be defined as the direction that maximizes the variance of
the projection.