Data Mining Project
By :- AKSHAY PANKAR
DSBA GREAT LEARNING
Contents
PART 1 – CLUSTERING
Ads24x7, a digital marketing company, has collected a large amount of data on different
types of ads through its marketing intelligence team. To effectively utilize this data, the
company has decided to segment the ads into homogeneous groups based on their
provided features.
Objective
The task is to use clustering procedures to segment the ads into distinct categories that
will allow Ads24x7 to better understand the different types of ads, their features, and
how they can be targeted towards specific audiences. The main objective is to create a
more efficient and targeted digital marketing strategy for Ads24x7's clients.
Q1.1 Clustering: Read the data and perform basic analysis such as printing a few
rows (head and tail), info, data summary, null values, duplicate values, etc.
Head of Data –
Tail of Data –
Data Info -
Observations:
Q1.2 Treat missing values in CPC, CTR and CPM using the formula given. You
may refer to the Bank_KMeans Solution File to understand the coding behind
treating the missing values using a specific formula. You basically have to create
a user-defined function and then call the function for imputing.
The missing values in the 'CTR', 'CPM', and 'CPC' columns will be treated.
After treating the missing values, we observe the following:
● The columns have a mix of data types: object, int64, and float64.
● The 'Timestamp' column is of object type and needs to be converted to date-time
format.
● The 'InventoryType', 'Ad Type', 'Platform', 'Device Type', and 'Format' columns are of
object type and contain categorical data.
● The 'Ad - Length', 'Ad- Width', 'Ad Size', 'Available_Impressions',
'Matched_Queries', 'Impressions', and 'Clicks' columns are of int64 type and contain
numerical data.
● The 'Spend', 'Fee', 'Revenue', 'CTR', 'CPM', and 'CPC' columns are of float64
type and contain numerical data.
● The 'CTR', 'CPM', and 'CPC' columns are derived columns that can be calculated
from other columns in the dataset.
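Because 'CTR', 'CPM', and 'CPC' are derived columns, the imputation asked for in Q1.2 can be written as a user-defined function. A minimal sketch on toy data, assuming the standard ad-metric formulas (CTR = Clicks/Impressions × 100, CPM = Spend/Impressions × 1000, CPC = Spend/Clicks); the exact formulas should be taken from the assignment brief:

```python
import numpy as np
import pandas as pd

def impute_derived(df):
    """Fill missing CTR, CPM and CPC from their defining formulas (assumed)."""
    df = df.copy()
    df['CTR'] = df['CTR'].fillna(df['Clicks'] / df['Impressions'] * 100)
    df['CPM'] = df['CPM'].fillna(df['Spend'] / df['Impressions'] * 1000)
    df['CPC'] = df['CPC'].fillna(df['Spend'] / df['Clicks'])
    return df

# Toy rows: one missing value per derived column
ads = pd.DataFrame({
    'Clicks': [50, 20], 'Impressions': [1000, 400], 'Spend': [10.0, 8.0],
    'CTR': [np.nan, 5.0], 'CPM': [np.nan, 20.0], 'CPC': [0.2, np.nan],
})
ads = impute_derived(ads)
print(ads[['CTR', 'CPM', 'CPC']])  # no NaNs remain
```

Calling the function once treats all three columns; `fillna` only touches the rows that were actually missing, so already-present values are preserved.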
(Data summary and univariate distribution plots of the numerical variables appear here.)
From the graphs, the following observations were made:
Ad Length: The length of the ad in pixels. The mean length is 385 pixels, with a
standard deviation of 233.
Ad Width: The width of the ad in pixels. The mean width is 338 pixels, with a standard
deviation of 203.
Ad Size: The total size of the ad in pixels, calculated as the product of the length and
width. The mean ad size is 96,674 pixels, with a standard deviation of 61,538.
Available Impressions: The total number of times the ad could have been displayed.
The mean number of available impressions is 2.4 million, with a standard deviation of
4.7 million.
Matched Queries: The number of times the ad was displayed to a user. The mean
number of matched queries is 1.3 million, with a standard deviation of 2.5 million.
Impressions: The number of times the ad was actually displayed to a user. The mean
number of impressions is 1.2 million, with a standard deviation of 2.4 million.
Clicks: The number of times the ad was clicked on by a user. The mean number of
clicks is 10,679, with a standard deviation of 17,353.
Spend: The total amount of money spent on the ad. The mean spend is $2,706.63, with
a standard deviation of $4,067.93.
Fee: The fee charged for displaying the ad. The mean fee is $0.335, with a standard
deviation of $0.032.
Q1.5 Perform z-score scaling and discuss how it affects the speed of the
algorithm.
The data has been scaled using the z-score technique, which transforms the
values to have a mean of 0 and a standard deviation of 1.
The resulting DataFrame shows the scaled values for each of the numerical
columns.
After applying z-score scaling to the numerical variables, we can observe that all
the variables have been brought to a similar range, with a mean of zero and a
standard deviation of one.
The minimum and maximum values of each variable are transformed to negative
and positive values, respectively, indicating that the variables' original scale has
been replaced by a standard scale.
This standardization allows us to compare and analyse variables with different
units and scales without introducing bias into the analysis.
Based on the time taken to cluster the data, clustering the unscaled data was
faster than clustering the scaled data. This is likely because scaling involves
additional computations, which add overhead before the clustering process.
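The scaling step can be sketched as follows; the two toy columns stand in for the actual numerical ad features, and scikit-learn's StandardScaler implements exactly the z-score transform described above:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
df = pd.DataFrame({
    'Clicks': rng.integers(100, 50_000, 500).astype(float),
    'Spend': rng.uniform(10, 9_000, 500),
})
# z-score: subtract the column mean, divide by the column standard deviation
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(scaled.describe().loc[['mean', 'std']])  # means ~0, stds ~1
```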
The dendrogram here displays the distance between each point of the dataset;
the vertical height of a line indicates the distance between the clusters it merges.
The distances are computed using the "ward" linkage method, and the resulting
dendrogram is plotted using the dendrogram() function from the same module.
Overall, it is a visual representation of the hierarchical clustering process on the
scaled data, allowing an easy understanding of the clustering structure and the
distances between different points in the dataset.
The above dendrogram is truncated to show only the last 80 merged clusters, and the
color threshold is set to 10.
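A minimal sketch of the dendrogram construction described above, with synthetic data standing in for the scaled ad features:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # stand-in for the scaled features

Z = linkage(X, method='ward')        # Ward minimises within-cluster variance
dendrogram(Z, truncate_mode='lastp', p=80, color_threshold=10)
plt.title('Ward dendrogram (last 80 merges)')
plt.savefig('dendrogram.png')
```

`truncate_mode='lastp'` with `p=80` is what produces the "last 80 merged clusters" view mentioned above.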
(b) Make Elbow plot (up to n=10) and identify optimum number of clusters for
k-means algorithm.
Using WSS, we make the elbow plot (up to n=10) and identify the optimum
number of clusters for the k-means algorithm. We can see that the WSS value
decreases as the number of clusters increases, but the rate of decrease slows
down after the third cluster. Based on the elbow method, the optimal number of
clusters is 3 or 4.
o We can also see from the plot that there is a consistent dip from 2 to 8 and there
doesn't seem to be a single sharp 'elbow'; we could reasonably choose any value
from 2 to 8 as our number of clusters.
o So, let's look at another method to get a 'second opinion from the maths': a
plot of silhouette scores to see how the score varies with k.
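The WSS computation behind the elbow plot can be sketched as below; `make_blobs` stands in for the scaled ad data, and `inertia_` is scikit-learn's name for the within-cluster sum of squares:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)
wss = []
for k in range(1, 11):               # elbow plot up to n=10
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss.append(km.inertia_)          # within-cluster sum of squares
print([round(w, 1) for w in wss])    # plot k vs wss and look for the bend
```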
(c) Print silhouette scores for up to 10 clusters and identify optimum number
of clusters.
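A sketch of the silhouette computation for k = 2 to 10 (synthetic blobs stand in for the scaled data; the silhouette score is only defined for k ≥ 2):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)
scores = {}
for k in range(2, 11):               # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get) # k with the highest average silhouette
print(best_k, round(scores[best_k], 3))
```

Scores close to 1 indicate well-separated clusters; the k that maximises the average silhouette is taken as the optimum.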
(d) Profile the ads based on optimum number of clusters using silhouette
score and your domain understanding
[Hint: Group the data by clusters and take sum or mean to identify trends in
clicks, spend, revenue, CPM, CTR, & CPC based on Device Type. Make bar
plots.]
We have added two new columns, 'cluster_1' and 'cluster_2', to the dataframe and
assigned the cluster labels obtained from K-Means clustering to each row. The
resulting dataframe has two additional columns indicating the cluster assignment for each
observation.
Here we are analyzing the clusters obtained through K-Means clustering. By grouping the data
by cluster and taking the mean of each column, we can get insights into the characteristics of
each cluster. We have also included the cluster count to understand the distribution of data
among clusters.
Following DataFrame df_clust_2 contains the mean values for each numeric column for each
cluster, as well as the frequency of each cluster.
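The grouping behind a frame like df_clust_2 can be sketched like this; the toy frame and column names are illustrative, not the actual project data:

```python
import pandas as pd

# Toy frame standing in for the clustered ads data (columns assumed)
df = pd.DataFrame({
    'cluster_1': [0, 0, 1, 1, 2],
    'Device Type': ['Mobile', 'Desktop', 'Mobile', 'Mobile', 'Desktop'],
    'Clicks': [120, 80, 400, 380, 950],
    'Spend': [10.0, 8.0, 55.0, 52.0, 200.0],
    'Revenue': [15.0, 11.0, 70.0, 66.0, 260.0],
})
# Mean of each metric per cluster, plus the cluster frequency
profile = df.groupby('cluster_1').agg(
    freq=('Clicks', 'size'), clicks=('Clicks', 'mean'),
    spend=('Spend', 'mean'), revenue=('Revenue', 'mean'),
)
# Break the same metrics down by Device Type, as the hint suggests
by_device = df.groupby(['cluster_1', 'Device Type'])['Spend'].mean()
print(profile)
print(by_device)                     # feed these into bar plots for profiling
```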
PART 2 – PCA
PCA FH (FT): Primary census abstract for female headed households excluding
institutional households (India & States/UTs - District Level), Scheduled tribes - 2011
PCA for Female Headed Household Excluding Institutional Household. The Indian
Census has the reputation of being one of the best in the world. The first Census in
India was conducted in the year 1872. This was conducted at different points of time in
different parts of the country. In 1881 a Census was taken for the entire country
simultaneously. Since then, Census has been conducted every ten years, without a
break. Thus, the Census of India 2011 was the fifteenth in this unbroken series since
1872, the seventh after independence and the second census of the third millennium
and twenty-first century. The census has continued uninterrupted despite
several adversities like wars, epidemics, natural calamities, political unrest, etc. The
Census of India is conducted under the provisions of the Census Act 1948 and the
Census Rules, 1990. The Primary Census Abstract, which is an important publication of the
2011 Census gives basic information on Area, Total Number of Households, Total
Population, Scheduled Castes, Scheduled Tribes Population, Population in the age
group 0-6, Literates, Main Workers and Marginal Workers classified by the four
broad industrial categories, namely, (i) Cultivators, (ii) Agricultural Laborers, (iii)
Household Industry Workers, and (iv) Other Workers and also Non-Workers. The
characteristics of the Total Population include Scheduled Castes, Scheduled Tribes,
Institutional and Houseless Population and are presented by sex and rural-urban
residence. Census 2011 covered 35 States/Union Territories, 640 districts, 5,924
sub-districts, 7,935 Towns and 6,40,867 Villages.
The data collected has so many variables thus making it difficult to find useful details
without using Data Science Techniques. You are tasked to perform detailed EDA and
identify Optimum Principal Components that explains the most variance in data. Use
Sklearn only.
Q2.1 Read the data and perform basic checks like checking head, info, summary,
nulls, and duplicates, etc.
The data has been read; it has 640 rows and 61 columns.
There are no duplicates or null values present in the data.
(i) Which state has highest gender ratio and which has the lowest?
(ii) Which district has the highest & lowest gender ratio? (Example Questions).
Pick 5 variables out of the given 24 variables below for EDA: No_HH, TOT_M,
TOT_F, M_06, F_06, M_SC, F_SC, M_ST, F_ST, M_LIT, F_LIT, M_ILL, F_ILL,
TOT_WORK_M, TOT_WORK_F, MAINWORK_M, MAINWORK_F, MAIN_CL_M,
MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M, MAIN_HH_F, MAIN_OT_M,
MAIN_OT_F
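The gender-ratio questions in (i) and (ii) can be sketched on a toy frame as below. Here Gender_Ratio is assumed to be TOT_F / TOT_M; some census publications express it as females per 1000 males, so adjust the definition as needed:

```python
import pandas as pd

# Illustrative slice; column names follow the 2011 PCA dataset
df = pd.DataFrame({
    'State': ['A', 'A', 'B'], 'Name': ['D1', 'D2', 'D3'],
    'TOT_M': [1000, 800, 1200], 'TOT_F': [1100, 700, 1150],
})
df['Gender_Ratio'] = df['TOT_F'] / df['TOT_M']   # assumed definition

# State-level ratio: aggregate counts first, then divide
state = df.groupby('State')[['TOT_F', 'TOT_M']].sum()
state_ratio = state['TOT_F'] / state['TOT_M']
print('state high/low:', state_ratio.idxmax(), state_ratio.idxmin())
print('district high/low:',
      df.loc[df['Gender_Ratio'].idxmax(), 'Name'],
      df.loc[df['Gender_Ratio'].idxmin(), 'Name'])
```

Summing TOT_F and TOT_M before dividing gives the correct state-level ratio; averaging district ratios directly would weight small and large districts equally.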
Q2.3 PCA: We choose not to treat outliers for this case. Do you think that treating
outliers for this case is necessary?
Outliers are data points that deviate significantly from the majority of the data.
Outliers can have an impact on the results of data analysis, including PCA.
PCA is a linear transformation technique that is sensitive to the scale and
distribution of the variables in the dataset. Outliers can introduce noise and bias into
the principal components, leading to inaccurate results. Therefore, it is essential to handle
outliers before applying PCA.
Listed below are the variables with multiple outliers, which need to be treated in order to
obtain better results before applying PCA.
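Outlier counts like those referred to above are typically found with the 1.5×IQR whisker rule used by boxplots; a minimal sketch:

```python
import pandas as pd

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Boolean mask of points outside the 1.5 * IQR whiskers."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])  # 100 is an obvious outlier
print(int(iqr_outliers(s).sum()))  # → 1
```

Applying the function column-wise (`df.apply(iqr_outliers).sum()`) gives an outlier count per variable.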
Q2.4 PCA: Scale the Data using z-score method. Does scaling have any impact on
outliers? Compare boxplots before and after scaling and comment.
Yes, scaling has an impact on outliers, particularly for standardization (z-score
scaling).
When scaling the data using standardization, each variable is centered by subtracting
its mean and then scaled by dividing by its standard deviation. This scaling method
assumes that the data is normally distributed, and is particularly sensitive to outliers, as
they can have a large effect on the mean and standard deviation.
Scaling can have an impact on outliers, particularly when using standardization, and it is
important to consider the distribution of the data and the impact of outliers when
choosing a scaling method.
(Boxplots of the variables before and after scaling appear here.)
Q2.5 PCA: Perform all the required steps for PCA (use sklearn only): create the
covariance matrix and get the eigenvalues and eigenvectors.
We set n_components to capture 80% of the variance and extracted the eigenvalues and eigenvectors.
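The steps above can be sketched as follows, with synthetic correlated data standing in for the scaled census features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(640, 10)) @ rng.normal(size=(10, 10))  # correlated toy data
Xs = StandardScaler().fit_transform(X)

cov = np.cov(Xs, rowvar=False)              # covariance matrix of scaled data
eig_vals, eig_vecs = np.linalg.eigh(cov)    # eigh: for symmetric matrices

pca = PCA(n_components=0.80).fit(Xs)        # keep PCs covering 80% variance
print(pca.n_components_, round(pca.explained_variance_ratio_.sum(), 3))
```

Passing a float to `n_components` tells sklearn to retain the smallest number of components whose cumulative explained variance reaches that fraction; the retained `explained_variance_` values match the largest eigenvalues of the covariance matrix.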
Eigenvalues and eigenvectors –
The original output printed a 12 × 58 array under the label "Eigen values"; that array is in
fact the matrix of eigenvectors (one row of component loadings per retained principal
component, one column per feature). A truncated excerpt:
array([[ 1.49129005e-01,  1.59181102e-01,  1.58166916e-01, ..., -1.08161271e-02],
       [-1.17629148e-01, -7.93089477e-02, -9.43697520e-02, ..., -4.97511375e-02],
       ...,
       [-2.22632112e-02, -5.11919789e-02, -4.55203603e-02, ..., -3.40755623e-02]])
Q2.6 PCA: Identify the optimum number of PCs (for this project, take at least 90%
explained variance). Show Scree plot.
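The scree-plot step can be sketched as below, picking the smallest number of PCs whose cumulative explained variance reaches 90% (synthetic data stands in for the census features):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = StandardScaler().fit_transform(
    rng.normal(size=(640, 20)) @ rng.normal(size=(20, 20)))
pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
n_opt = int(np.argmax(cum >= 0.90)) + 1     # first PC count reaching 90%

plt.plot(range(1, len(cum) + 1), cum, marker='o')
plt.axhline(0.90, linestyle='--')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.savefig('scree.png')
print(n_opt)
```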
Q2.7 PCA: Compare PCs with Actual Columns and identify which is explaining
most variance. Write inferences about all the Principal components in terms of
actual variables.
The variance of each variable in the original data is compared with its variance in the
data projected onto the retained PCs:

Variable           Variance (original data)   Variance (PC data)
MARG_OT_F          16,934,799.55              6,964,990.25
MARGWORK_3_6_M     1,524,536,773.23           999,362,020.70
MARGWORK_3_6_F     6,884,088,307.60           3,922,398,437.64
MARG_CL_3_6_M      36,238,072.03              23,141,590.35
MARG_CL_3_6_F      71,698,106.28              61,466,518.37
MARG_AL_3_6_M      820,182.50                 393,819.64
MARG_AL_3_6_F      6,232,719.53               1,685,197.92
MARG_HH_3_6_M      9,361,068.86               5,708,106.62
MARG_HH_3_6_F      28,469,064.46              23,547,104.01
MARG_OT_3_6_M      128,686.18                 34,777.79
MARG_OT_3_6_F      810,046.47                 217,622.85
MARGWORK_0_3_M     9,223,152.65               4,233,767.60
MARGWORK_0_3_F     11,074,498.65              4,672,596.82
MARG_CL_0_3_M      2,219,227.10               999,702.94
MARG_CL_0_3_F      7,777,275.35               4,403,767.75
MARG_AL_0_3_M      205,514.07                 34,925.86
MARG_AL_0_3_F      1,249,125.31               153,846.88
MARG_HH_0_3_M      581,526.72                 182,287.20
MARG_HH_0_3_F      2,513,423.20               1,304,802.89
MARG_OT_0_3_M      11,641.90                  3,013.39
MARG_OT_0_3_F      95,939.40                  24,418.61
NON_WORK_M         372,836.25                 139,886.96
NON_WORK_F         828,480.83                 260,929.51
Gender_Ratio       0.0525                     0.0521

For every variable, the variance in the original data is higher than in the PC data.