Project Advance Stats - Abhishek
Abhishek Gautam
PGP-DSBA
Date 17th Oct 2021
1.1 Answer
Null Hypothesis 𝐻0: The mean salary is the same across all three categories of education (Doctorate, Bachelors, HS-Grad).
Alternate Hypothesis 𝐻1: The mean salary is different in at least one category of education.
Null Hypothesis 𝐻0: The mean salary is the same across all 4 categories of occupation (Specialty, Sales, Adm-clerical, Exec-managerial).
Alternate Hypothesis 𝐻1: The mean salary is different in at least one category of occupation.
1.2 Answer
Since the p-value = 1.257709e-08 is less than the significance level (alpha = 0.05), we reject the null
hypothesis and conclude that there is a significant difference in the mean salaries for at least one category
of education.
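A one-way ANOVA of this kind can be sketched with scipy's f_oneway; the three salary samples below are illustrative stand-ins, not the assignment's actual data.

```python
# Sketch of a one-way ANOVA across education levels.
# The three salary samples are made-up illustrations.
from scipy.stats import f_oneway

doctorate = [95000, 110000, 105000, 120000, 99000]
bachelors = [60000, 72000, 65000, 70000, 68000]
hs_grad = [40000, 45000, 42000, 48000, 43000]

f_stat, p_value = f_oneway(doctorate, bachelors, hs_grad)

alpha = 0.05
if p_value < alpha:
    print("Reject H0: at least one education level has a different mean salary")
else:
    print("Fail to reject H0")
```

With clearly separated groups like these, the p-value falls well below alpha and H0 is rejected, mirroring the conclusion above.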
1.3 One-way ANOVA for 'Occupation'
Since the p-value = 0.458508 is greater than the significance level (alpha = 0.05), we fail to reject the null
hypothesis and conclude that there is no significant difference in the mean salaries across the 4 categories
of occupation.
1.4
1.5
From the above interaction plot we can make out that:
o Adm-clerical salaries for Bachelors and Doctorates are comparable.
o Sales salaries for Bachelors and Doctorates are about the same.
o Prof-specialty salaries for HS-grads and Bachelors differ a bit.
o Exec-managerial salaries for HS-grads and Doctorates are slightly higher.
From the above plot we can figure out that people with educational level:
o Doctorates: are in higher salary brackets, mostly in Prof-specialty, Exec-managerial or Sales
roles; very few are doing Adm-clerical jobs.
o Bachelors: fall in the mid income range, mostly working as Exec-managers, Adm-clerks or in
Sales, but very few are found in Prof-specialty roles.
o HS-grads: are in low income brackets, mostly doing Prof-specialty or Adm-clerical work;
a few are in Sales but hardly any in Exec-managerial roles.
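An interaction plot like the one these observations describe can be sketched with pandas and matplotlib; the salary values below are illustrative stand-ins, not the assignment's data.

```python
# Sketch of an Education x Occupation interaction plot.
import matplotlib
matplotlib.use("Agg")                     # render without a display
import pandas as pd

df = pd.DataFrame({
    "Education":  ["HS-grad", "Bachelors", "Doctorate"] * 4,
    "Occupation": (["Adm-clerical"] * 3 + ["Sales"] * 3
                   + ["Prof-specialty"] * 3 + ["Exec-managerial"] * 3),
    "Salary": [35, 55, 58, 40, 60, 62, 45, 50, 95, 50, 80, 110],
})

# Mean salary per (Occupation, Education) cell; one line per occupation.
means = df.groupby(["Occupation", "Education"])["Salary"].mean().unstack()
ax = means.T.plot(marker="o", title="Education x Occupation interaction")
ax.figure.savefig("interaction_plot.png")
```

Non-parallel lines in such a plot hint at an interaction between the two factors.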
1.6
Null Hypothesis H0: The mean salary is equal across every combination of occupation type and education level.
Alternate Hypothesis H1: At least one combination of occupation type and education level has a different mean salary.
Alpha = 0.05
Since the interaction plot shows some interaction between the two treatments, we introduce an interaction
term while performing the two-way ANOVA.
                              df    sum_sq   mean_sq         F    PR(>F)
C(Education):C(Occupation)     6  3.63E+10  6.06E+09  8.519815  2.23E-05
Residual                      29  2.06E+10  7.11E+08       NaN       NaN
Since the p-value (2.23e-05) is less than the significance level (0.05), we reject the null hypothesis: the interaction between education and occupation has a significant effect on salary.
1.7
2.1
2.2 Answer
PCA calculates a new projection of our dataset. Without scaling, variables measured on larger scales
dominate the total variance, which skews the PCA towards high-magnitude features. If we normalize the
data, all variables have the same standard deviation and therefore the same weight, so PCA finds the
relevant axes. Scaling also speeds up gradient descent and other iterative algorithms.
Scaling of data:
Z = (value - mean) / standard deviation
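A minimal sketch of this z-score formula, computed both by hand and with sklearn's StandardScaler; the two-column array is illustrative.

```python
# Sketch of Z = (value - mean) / standard deviation, by hand and via sklearn.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0, 200.0],
              [12.0, 260.0],
              [14.0, 320.0]])

# Manual z-score; ddof=0 matches StandardScaler's population std.
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))  # True: both give mean 0, std 1 per column
```

After scaling, every column has mean 0 and standard deviation 1, which is exactly what the post-scaling describe output below reports.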
Statistical description:
Observations:
After scaling, the standard deviation is 1.0 for all variables.
Post scaling, the difference between the Q1 (25%) value and the minimum is smaller than in the original
dataset for most of the variables.
2.3 Answer
Both covariance and correlation matrices measure the relationship and the dependency between two
variables.
Covariance indicates the direction of the linear relationship between variables, while correlation
measures both the strength and direction of that linear relationship.
Correlation is the scaled form of covariance: covariance is affected by a change of scale, whereas
correlation is not.
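A small sketch illustrating the scale claim: rescaling one variable changes the covariance but leaves the correlation untouched (toy vectors, not the assignment data).

```python
# Sketch: covariance depends on units, correlation does not.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

cov_before = np.cov(x, y)[0, 1]
corr_before = np.corrcoef(x, y)[0, 1]

y_scaled = y * 1000                       # same data in smaller units

cov_after = np.cov(x, y_scaled)[0, 1]
corr_after = np.corrcoef(x, y_scaled)[0, 1]

print(cov_after / cov_before)             # 1000.0: covariance scales with units
print(np.isclose(corr_after, corr_before))  # True: correlation is unchanged
```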
Heatmap
Observations:
Highest correlation is seen among:
o Enroll with F.Undergrad
o Enroll with Accept
o Apps with Accept
Least correlation is observed between the S.F.Ratio variable and: Expend, Outstate, Grad.Rate,
perc.alumni, Room.Board and Top10perc.
2.4
After standardization, a box plot is drawn again on the scaled data and the describe function is used.
Scaling makes little difference in terms of outlier reduction.
Box plot with scaled data:
Outliers influence the computation of the empirical mean and standard deviation; the standard scaler
therefore cannot guarantee balanced feature scales in the presence of outliers.
Scaled data:
2.5 Perform PCA
In the table below we can see that the first principal component explains 33.12% of the variance in our
dataset, while the first seven components together capture 70.12% of the variance.
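A sketch of how these percentages are read off after fitting PCA; random standardized data stands in for the assignment's 17-feature dataset.

```python
# Sketch: fit PCA on standardized data and inspect the explained variance.
# Random data (100 samples x 17 features) stands in for the real dataset.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 17))
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)

# Fraction of total variance captured by each component, in decreasing order.
ratios = pca.explained_variance_ratio_
print(ratios[:3])
print(ratios.sum())  # all 17 components together account for the full variance
```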
Heatmap: correlation matrix between components and features
2.6 Answer
In PCA, given a mean-centered dataset X with n samples and p variables, the first principal component PC1
is given by the linear combination of the original variables X_1, X_2, ..., X_p:
PC_1 = w_{11} X_1 + w_{12} X_2 + ... + w_{1p} X_p
The first principal component is the component that retains the maximum variance of the data; w_1 is the
eigenvector, with the largest eigenvalue, of the covariance matrix
E = X^T X / (N - 1)
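A quick numerical check of this definition on random stand-in data: the variance of the first principal component equals the largest eigenvalue of E.

```python
# Sketch: the first PC direction is the top eigenvector of E = X^T X / (N - 1).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
Xc = X - X.mean(axis=0)                  # mean-center the data

E = Xc.T @ Xc / (len(Xc) - 1)            # covariance matrix
eigvals, eigvecs = np.linalg.eigh(E)     # eigh returns ascending eigenvalues

w1 = eigvecs[:, -1]                      # direction of maximum variance
pc1_scores = Xc @ w1                     # PC_1 = w_11*X_1 + ... + w_1p*X_p

# The variance of the projected scores equals the largest eigenvalue.
print(np.isclose(pc1_scores.var(ddof=1), eigvals[-1]))
```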
2.7 Answer
Observations:
The plot visually shows how much of the variance is explained by how many principal components.
In the plot below we see that the 1st PC explains 33.13% of the variance, the first two PCs together
explain 57.19%, and so on.
Effectively we can capture most of the variance (i.e. 90%) by analysing 9 principal components instead of
all 17 variables (attributes) in the dataset.
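The component count for a target variance (e.g. 90%) can be found from the cumulative explained-variance ratio; a sketch on correlated stand-in data:

```python
# Sketch: choose the number of principal components reaching ~90% variance.
# Correlated stand-in data so that a few components dominate.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
base = rng.normal(size=(100, 5))
noise = 0.1 * rng.normal(size=(100, 5))
X = np.hstack([base, base + noise])      # 10 features, ~5 independent directions

cum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
n_components = int(np.argmax(cum >= 0.90) + 1)   # first k with cumulative >= 90%
print(n_components)
```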
PCA uses the eigenvectors of the covariance matrix to figure out how we should rotate the data. Because
rotation is a kind of linear transformation, the new dimensions are linear combinations of the old ones.
The eigenvectors (principal components) determine the directions, or axes, along which the linear
transformation acts, stretching or compressing input vectors. They are the lines of change that represent
the action of the larger matrix, the very "line" in linear transformation.
2.8
Business implication of using PCA
PCA is used for EDA and for building predictive models on continuous variables.
Dimensionality reduction is done by projecting each data point onto the first few principal components,
obtaining lower-dimensional data while preserving as much of the data's variation as possible. The more
components we keep, the more of the total variance is preserved.
The first principal component can equivalently be defined as the direction that maximizes the variance of
the projection.