
Data Mining Project Report

By: Akshay Pankar
DSBA, Great Learning

PART 1 – CLUSTERING

Ads24x7, a digital marketing company, has collected a large amount of data on different
types of ads through its marketing intelligence team. To effectively utilize this data, the
company has decided to segment the ads into homogeneous groups based on their
provided features.

Objective

The task is to use clustering procedures to segment the ads into distinct categories that
will allow Ads24x7 to better understand the different types of ads, their features, and
how they can be targeted towards specific audiences. The main objective is to create a
more efficient and targeted digital marketing strategy for Ads24x7's clients.

Q1.1 Clustering: Read the data and perform basic analysis such as printing a few rows (head and tail), info, data summary, null values, duplicate values, etc.
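A minimal sketch of these basic checks, producing the outputs shown below (the CSV file name is an assumption):

    import pandas as pd

    # Load the ads dataset (file name is illustrative)
    df = pd.read_csv("ads_data.csv")

    print(df.head())              # first few rows
    print(df.tail())              # last few rows
    df.info()                     # column types and non-null counts
    print(df.describe().T)        # summary statistics
    print(df.isnull().sum())      # null values per column
    print(df.duplicated().sum())  # number of duplicate rows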
Head of Data –


Tail of Data –

Data Info -

Observations:

The following observations were made:

● The dataset has 23,066 entries and 19 columns.
● There are no missing values in any column except 'CTR', 'CPM', and 'CPC'.
● The data types of the columns are integers, floats, and objects (strings).
● The 'CTR', 'CPM', and 'CPC' columns each have 4,726 missing values, which is about 20% of the data.

Q1.2 Treat missing values in CPC, CTR and CPM using the formula given. You may refer to the Bank_KMeans Solution File to understand the coding behind treating the missing values using a specific formula. You basically have to create a user-defined function and then call the function for imputing.
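A minimal sketch of the user-defined imputation function, assuming the standard definitions CTR = (Clicks / Impressions) × 100, CPM = (Spend / Impressions) × 1000 and CPC = Spend / Clicks (the exact formula sheet is not reproduced here):

    def impute_derived(df):
        """Fill missing CTR, CPM and CPC from their defining formulas."""
        df["CTR"] = df["CTR"].fillna(df["Clicks"] / df["Impressions"] * 100)
        df["CPM"] = df["CPM"].fillna(df["Spend"] / df["Impressions"] * 1000)
        df["CPC"] = df["CPC"].fillna(df["Spend"] / df["Clicks"])
        return df

    df = impute_derived(df)
    print(df[["CTR", "CPM", "CPC"]].isnull().sum())  # should now be all zero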

The missing values in the 'CTR', 'CPM', and 'CPC' columns have been treated. After treating the missing values, we observe the following:


● The columns have a mix of data types: object, int64, and float64.
● The 'Timestamp' column is of object type and needs to be converted to date-time format.
● The 'InventoryType', 'Ad Type', 'Platform', 'Device Type', and 'Format' columns are of object type and contain categorical data.
● The 'Ad - Length', 'Ad- Width', 'Ad Size', 'Available_Impressions', 'Matched_Queries', 'Impressions', and 'Clicks' columns are of int64 type and contain numerical data.
● The 'Spend', 'Fee', 'Revenue', 'CTR', 'CPM', and 'CPC' columns are of float64 type and contain numerical data.
● The 'CTR', 'CPM', and 'CPC' columns are derived columns that can be calculated from other columns in the dataset.


Dividing the data into two different frames

Q1.3 Check if there are any outliers.
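Boxplots make outliers directly visible; a minimal sketch over the numerical columns:

    import matplotlib.pyplot as plt

    num_cols = df.select_dtypes(include="number").columns
    df[num_cols].plot(kind="box", subplots=True, layout=(4, 4),
                      figsize=(16, 12), sharex=False, sharey=False)
    plt.tight_layout()
    plt.show()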


Q1.4 Do you think treating outliers is necessary for K-Means clustering? Based
on your judgment decide whether to treat outliers and if yes, which method to
employ. (As an analyst your judgment may be different from another analyst).

[Boxplots and distribution plots of the numerical variables]
From the graphs, the following observations were made:

● Most of the variables have skewed distributions.
● The distribution of percentage revenue is highly right-skewed, with many outliers.
● 'Ad- Width' is left-skewed with no outliers, and 'Ad Size' is right-skewed with very few outliers.
● The 'Spend' column is right-skewed with numerous outliers, which means heavy amounts have been invested in marketing cost.
● The distributions of almost all variables are highly right-skewed, and these variables have outliers on the right side.
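Since K-Means centroids are pulled by extreme values, one defensible treatment is to cap each variable at its IQR whiskers instead of dropping rows; a sketch:

    def cap_outliers(series):
        # Cap values beyond the 1.5 * IQR whiskers at the whisker limits
        q1, q3 = series.quantile([0.25, 0.75])
        iqr = q3 - q1
        return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    for col in num_cols:
        df[col] = cap_outliers(df[col])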

Descriptive Stats for observing scale issues between variables

The following observations were made:

Ad Length: The length of the ad in pixels. The mean length is 385 pixels, with a
standard deviation of 233.

Ad Width: The width of the ad in pixels. The mean width is 338 pixels, with a standard
deviation of 203.

Ad Size: The total size of the ad in pixels, calculated as the product of the length and
width. The mean ad size is 96,674 pixels, with a standard deviation of 61,538.

Available Impressions: The total number of times the ad could have been displayed.
The mean number of available impressions is 2.4 million, with a standard deviation of
4.7 million.

Matched Queries: The number of search queries that were matched to the ad. The mean
number of matched queries is 1.3 million, with a standard deviation of 2.5 million.

Impressions: The number of times the ad was actually displayed to a user. The mean
number of impressions is 1.2 million, with a standard deviation of 2.4 million.

Clicks: The number of times the ad was clicked on by a user. The mean number of
clicks is 10,679, with a standard deviation of 17,353.

Spend: The total amount of money spent on the ad. The mean spend is $2,706.63, with
a standard deviation of $4,067.93.

Fee: The fee charged for displaying the ad. The mean fee is $0.335, with a standard
deviation of $0.032.

Checking for correlations among variables

Q1.5 Perform z-score scaling and discuss how it affects the speed of the
algorithm.


 The data has been scaled using the z-score technique, which transforms the values to have a mean of 0 and a standard deviation of 1.
 The resulting DataFrame shows the scaled values for each of the numerical columns.
 After applying z-score scaling to the numerical variables, we can observe that all the variables lie in a similar range, with a mean of zero and a standard deviation of one.
 The minimum and maximum values of each variable are transformed to negative and positive values, respectively, indicating that the variables' original scale has been mapped to a standard scale.
 This standardization allows us to compare and analyse variables with different units and scales without bias in the analysis.
 Based on the time taken to cluster the data, clustering the unscaled data appeared faster than clustering the scaled data. This is likely because scaling involves additional computations, which can slow down the overall clustering process.
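A sketch of the scaling step and the timing comparison described above (the timing harness is illustrative):

    import time
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    scaler = StandardScaler()
    scaled = pd.DataFrame(scaler.fit_transform(df[num_cols]), columns=num_cols)

    for name, data in [("unscaled", df[num_cols]), ("scaled", scaled)]:
        start = time.time()
        KMeans(n_clusters=3, n_init=10, random_state=42).fit(data)
        print(f"{name}: {time.time() - start:.2f} s")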

Q1.6 Perform clustering and do the following:

(a) Perform Hierarchical clustering by constructing a Dendrogram using WARD linkage and Euclidean distance.


 The dendrogram here displays the distance between each point of the dataset; the vertical height of a line indicates the distance between the clusters it merges.
 The distances between points are computed using the "ward" linkage method, and the resulting dendrogram is plotted using the dendrogram() function from the same module.
 Overall, it is a visual representation of the hierarchical clustering process on the scaled data, allowing an easy understanding of the clustering structure and the distances between different points in the dataset.

The above dendrogram is truncated to show only the last 80 merged clusters, and the color threshold is set to 10.
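A sketch of the dendrogram construction with scipy, matching the settings described above:

    from scipy.cluster.hierarchy import linkage, dendrogram

    wardlink = linkage(scaled, method="ward", metric="euclidean")
    dendrogram(wardlink, truncate_mode="lastp", p=80, color_threshold=10)
    plt.title("Ward / Euclidean dendrogram (last 80 merges)")
    plt.show()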

(b) Make Elbow plot (up to n=10) and identify optimum number of clusters for
k-means algorithm.


By using WSS, we make the elbow plot (up to n=10) and identify the optimum number of clusters for the k-means algorithm. The WSS value decreases as the number of clusters increases, but the rate of decrease slows down after the third cluster. Based on the elbow method, we can conclude that the optimal number of clusters is 3 or 4.
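A sketch of the WSS (inertia) loop behind the elbow plot:

    wss = []
    for k in range(1, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(scaled)
        wss.append(km.inertia_)

    plt.plot(range(1, 11), wss, marker="o")
    plt.xlabel("Number of clusters (k)")
    plt.ylabel("WSS (inertia)")
    plt.show()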

The following observations were made:

o We can see from the plot that there is a consistent dip from 2 to 8, and there doesn't seem to be a clear 'elbow' here. We could choose any value from 2 to 8 as our number of clusters.
o So, let's look at another method to get a 'second opinion from maths': a plot of Silhouette scores to see how the score varies with k.

(c) Print silhouette scores for up to 10 clusters and identify optimum number
of clusters.
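A sketch of the silhouette loop (the score is defined only for k >= 2):

    from sklearn.metrics import silhouette_score

    for k in range(2, 11):
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(scaled)
        print(f"k={k}: silhouette score = {silhouette_score(scaled, labels):.4f}")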

(d) Profile the ads based on optimum number of clusters using silhouette
score and your domain understanding
[Hint: Group the data by clusters and take sum or mean to identify trends in
clicks, spend, revenue, CPM, CTR, & CPC based on Device Type. Make bar
plots.]

We have added two new columns, 'cluster_1' and 'cluster_2', to the dataframe and assigned the cluster labels obtained from K-Means clustering to each row. The resulting dataframe has two additional columns indicating the cluster assignment for each observation.

Here we analyze the clusters obtained through K-Means clustering. By grouping the data by cluster and taking the mean of each column, we can get insights into the characteristics of each cluster. We have also included the cluster count to understand the distribution of data among clusters.

The following DataFrame df_clust_2 contains the mean value of each numeric column for each cluster, as well as the frequency of each cluster.
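A sketch of the profiling step; the choice of k = 3 for the final labels is an assumption based on the elbow and silhouette discussion above:

    # Assign K-Means labels to each row (k = 3 is an assumption)
    df["cluster_2"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(scaled)

    # Mean performance metrics per cluster and device type
    profile = (df.groupby(["cluster_2", "Device Type"])
                 [["Clicks", "Spend", "Revenue", "CPM", "CTR", "CPC"]]
                 .mean())
    print(profile)

    # Bar plot of mean spend by cluster and device type
    profile["Spend"].unstack().plot(kind="bar", figsize=(10, 5))
    plt.ylabel("Mean Spend")
    plt.show()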


Q1.7 Clustering: Conclude the project by providing a summary of your learnings.

o In this project, we explored the application of clustering algorithms to segment ads based on their performance metrics. We began by performing basic analysis of the data, identifying missing values, outliers, and duplicates, and cleaning the data accordingly.
o We then performed z-score scaling to standardize the data and improve the performance of the clustering algorithms. We applied both Hierarchical and K-Means clustering to identify clusters of ads with similar performance metrics, and used the Elbow method and Silhouette scores to determine the optimal number of clusters for K-Means.
o Finally, we profiled the ads based on the identified clusters and identified trends in the performance metrics for each cluster based on device type. We found that different types of devices have different performance characteristics, which should be taken into account when designing and targeting ads.
o Overall, this project helped us understand the application of clustering algorithms in advertising analytics and how they can be used to segment ads based on their performance characteristics, which can help optimize ad targeting and improve the overall performance of ad campaigns.

PART 2 – PCA

PCA FH (FT): Primary census abstract for female headed households excluding
institutional households (India & States/UTs - District Level), Scheduled tribes - 2011
PCA for Female Headed Household Excluding Institutional Household. The Indian
Census has the reputation of being one of the best in the world. The first Census in
India was conducted in the year 1872. This was conducted at different points of time in

different parts of the country. In 1881 a Census was taken for the entire country
simultaneously. Since then, Census has been conducted every ten years, without a
break. Thus, the Census of India 2011 was the fifteenth in this unbroken series since
1872, the seventh after independence and the second census of the third millennium
and twenty first century. The census has been uninterruptedly continued despite of
several adversities like wars, epidemics, natural calamities, political unrest, etc. The
Census of India is conducted under the provisions of the Census Act 1948 and the
Census Rules, 1990. The Primary Census Abstract, an important publication of the 2011 Census, gives basic information on Area, Total Number of Households, Total
Population, Scheduled Castes, Scheduled Tribes Population, Population in the age
group 0-6, Literates, Main Workers and Marginal Workers classified by the four
broad industrial categories, namely, (i) Cultivators, (ii) Agricultural Laborers, (iii)
Household Industry Workers, and (iv) Other Workers and also Non-Workers. The
characteristics of the Total Population include Scheduled Castes, Scheduled Tribes,
Institutional and Houseless Population and are presented by sex and rural-urban
residence. Census 2011 covered 35 States/Union Territories, 640 districts, 5,924 sub-districts, 7,935 Towns and 6,40,867 Villages.
The data collected has so many variables thus making it difficult to find useful details
without using Data Science Techniques. You are tasked to perform detailed EDA and
identify Optimum Principal Components that explains the most variance in data. Use
Sklearn only.

Q2.1 Read the data and perform basic checks like checking head, info, summary,
nulls, and duplicates, etc.

The data has been read; there are 640 rows and 61 columns. There are no duplicates and no null values present in the data.

Q2.2 PCA: Perform detailed Exploratory analysis by creating certain questions like

(i) Which state has the highest gender ratio and which has the lowest?

State with the highest gender ratio: Andhra Pradesh (2.28324963845265)
State with the lowest gender ratio: Lakshadweep (1.1519925134523903)


(ii) Which district has the highest & lowest gender ratio? (Example Questions).
Pick 5 variables out of the given 24 variables below for EDA: No_HH, TOT_M,
TOT_F, M_06, F_06, M_SC, F_SC, M_ST, F_ST, M_LIT, F_LIT, M_ILL, F_ILL,
TOT_WORK_M, TOT_WORK_F, MAINWORK_M, MAINWORK_F, MAIN_CL_M,
MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M, MAIN_HH_F, MAIN_OT_M,
MAIN_OT_F

District with the highest gender ratio: Krishna (2.28324963845265)
District with the lowest gender ratio: Lakshadweep (1.1519925134523903)
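A sketch of this EDA, assuming the gender ratio is computed as TOT_F / TOT_M and that the state and district name columns are called 'State' and 'Name' (the file name, both column names, and the per-state aggregation are assumptions):

    # Load the census data (file name is illustrative)
    cens = pd.read_csv("PCA_census.csv")

    cens["Gender_Ratio"] = cens["TOT_F"] / cens["TOT_M"]

    by_state = cens.groupby("State")["Gender_Ratio"].max()
    print("State with highest gender ratio:", by_state.idxmax(), by_state.max())
    print("State with lowest gender ratio :", by_state.idxmin(), by_state.min())

    by_district = cens.set_index("Name")["Gender_Ratio"]
    print("District with highest gender ratio:", by_district.idxmax(), by_district.max())
    print("District with lowest gender ratio :", by_district.idxmin(), by_district.min())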

Q2.3 PCA: We choose not to treat outliers for this case. Do you think that treating
outliers for this case is necessary?
Outliers are data points that deviate significantly from the majority of the data. Outliers can have an impact on the results of data analysis, including PCA.
PCA is a linear transformation technique that is sensitive to the scale and distribution of the variables in the dataset. Outliers can introduce noise and bias into the principal components, leading to inaccurate results. Therefore, it is essential to handle outliers before applying PCA.
Listed below are the variables with multiple outliers, which need to be treated in order to obtain better results before applying PCA.


Q2.4 PCA: Scale the Data using z-score method. Does scaling have any impact on
outliers? Compare boxplots before and after scaling and comment.
Yes, scaling has an impact on outliers, particularly standardization (z-score scaling).
When scaling the data using standardization, each variable is centered by subtracting its mean and then scaled by dividing by its standard deviation. This method assumes that the data is roughly normally distributed and is particularly sensitive to outliers, as they can have a large effect on the mean and standard deviation.
Scaling can therefore affect outliers, and it is important to consider the distribution of the data and the impact of outliers when choosing a scaling method.
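A sketch that produces the before/after boxplot comparison shown below:

    from sklearn.preprocessing import StandardScaler

    num = cens.select_dtypes(include="number")
    scaled_cens = pd.DataFrame(StandardScaler().fit_transform(num),
                               columns=num.columns)

    fig, axes = plt.subplots(2, 1, figsize=(18, 10))
    num.boxplot(ax=axes[0], rot=90)
    axes[0].set_title("Before scaling")
    scaled_cens.boxplot(ax=axes[1], rot=90)
    axes[1].set_title("After z-score scaling")
    plt.tight_layout()
    plt.show()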

Below is the boxplot before scaling -


Boxplot after scaling, which shows no outliers –


Q2.5 PCA: Perform all the required steps for PCA (use sklearn only): create the covariance matrix, get eigenvalues and eigenvectors.

We set n_components to capture 80% of the variance and extracted the eigenvalues and eigenvectors.
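A sketch of these steps, using NumPy for the covariance matrix and eigen-decomposition, and sklearn's PCA for the equivalent decomposition:

    import numpy as np
    from sklearn.decomposition import PCA

    # Covariance matrix of the scaled data
    cov_matrix = np.cov(scaled_cens.T)

    # Eigen decomposition of the covariance matrix
    eig_values, eig_vectors = np.linalg.eig(cov_matrix)

    # Equivalent via sklearn: keep enough components for 80% variance
    pca = PCA(n_components=0.80, random_state=42)
    pca.fit(scaled_cens)
    print(pca.explained_variance_)   # eigenvalues
    print(pca.components_)           # eigenvectors, one row per component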
Eigen vectors (one row per principal component) –
array([[ 1.49129005e-01, 1.59181102e-01, 1.58166916e-01,
1.56388017e-01, 1.56859170e-01, 1.43365873e-01,
1.43507327e-01, 1.87496844e-02, 1.77600288e-02,
1.55147091e-01, 1.45405735e-01, 1.54603427e-01,
1.58250887e-01, 1.54067644e-01, 1.42408676e-01,
1.41912833e-01, 1.25601902e-01, 1.11747735e-01,
8.29639439e-02, 1.19237553e-01, 8.99412457e-02,
1.41886859e-01, 1.33830045e-01, 1.22740385e-01,
1.16780183e-01, 1.56686577e-01, 1.48631737e-01,
8.82222690e-02, 6.51879111e-02, 1.27293814e-01,
1.15801592e-01, 1.45429324e-01, 1.42317170e-01,
1.50903895e-01, 1.47991889e-01, 1.57936905e-01,
1.55833676e-01, 1.57668085e-01, 1.49432315e-01,
9.48411921e-02, 6.71786485e-02, 1.28197497e-01,
1.13864915e-01, 1.45171087e-01, 1.41044397e-01,
1.50950287e-01, 1.47512106e-01, 1.43026130e-01,
1.33743253e-01, 6.30287066e-02, 5.67899639e-02,
1.19131171e-01, 1.12988673e-01, 1.42200240e-01,
1.41385301e-01, 1.47652321e-01, 1.42056754e-01,
-1.08161271e-02],
[-1.17629148e-01, -7.93089477e-02, -9.43697520e-02,
-1.86121716e-02, -1.26971671e-02, -7.86228878e-02,

-8.74012976e-02, 6.46274427e-02, 6.23493865e-02,
-1.05417518e-01, -1.33626942e-01, -7.78006862e-03,
-2.27379144e-02, -1.20520888e-01, -7.95233624e-02,
-1.66496624e-01, -1.45785406e-01, 4.36296567e-02,
9.28399872e-02, -5.53770642e-02, -7.74273408e-02,
-9.98591643e-02, -1.13965760e-01, -2.03116554e-01,
-2.07431281e-01, 7.99186006e-02, 1.06377463e-01,
2.72331598e-01, 2.75015276e-01, 1.56125177e-01,
1.31185775e-01, 4.32512077e-02, 7.66413559e-03,
-7.17352328e-02, -8.83043960e-02, -4.26916045e-02,
-9.14983655e-02, 6.70492795e-02, 8.70437521e-02,
2.62025691e-01, 2.66074303e-01, 1.49307198e-01,
1.16536500e-01, 3.90491802e-02, -2.70108580e-03,
-7.59974814e-02, -1.00893617e-01, 1.37835848e-01,
1.64647169e-01, 2.82824314e-01, 2.87902730e-01,
1.82310758e-01, 1.74319743e-01, 5.50891934e-02,
3.60334733e-02, -4.76495234e-02, -4.06070429e-02,
-4.97511375e-02],
[ 1.03822578e-01, -4.27536980e-02, 2.94134831e-02,
-8.03312178e-02, -7.43392778e-02, -4.49612686e-02,
1.73470129e-02, 3.06643248e-01, 3.23994466e-01,
-3.22798769e-02, 1.52856387e-03, -6.05440191e-02,
6.96802378e-02, -7.36637988e-03, 1.92927188e-01,
1.07131420e-02, 2.04256550e-01, 8.56592702e-03,
1.85456859e-01, 1.90325063e-01, 3.29953285e-01,
-1.05872543e-01, 2.30998521e-02, -2.72230958e-02,
7.29476183e-02, -6.39749708e-02, 1.13024037e-01,
-8.34736991e-02, -1.38981120e-02, 5.48498600e-02,
2.44553373e-01, -1.41905614e-01, -8.58820058e-02,
-1.21162032e-01, -3.99739705e-02, -6.95300276e-02,
-5.73534269e-02, -5.75747682e-02, 1.30598841e-01,
-7.72916128e-02, 2.70372019e-03, 6.14034733e-02,
2.65204347e-01, -1.40680952e-01, -8.25770929e-02,
-1.21281050e-01, -4.63455381e-02, -8.93930548e-02,
5.15446581e-02, -9.55888150e-02, -6.29923811e-02,
1.72207200e-02, 1.59447048e-01, -1.40222287e-01,
-9.13028882e-02, -1.13760290e-01, -8.85559819e-03,
3.37652979e-01],
[ 7.16658376e-02, 5.28133730e-02, 6.77806310e-02,
2.94235963e-02, 1.70983874e-02, 9.13352585e-03,
1.32490641e-02, 7.02930271e-02, 5.88320768e-02,
8.95871807e-02, 1.26267973e-01, -3.65018324e-02,
-1.77856447e-02, 6.65061943e-02, 9.96370945e-02,
9.53969110e-02, 1.19240640e-01, 6.65018604e-02,
2.48280102e-01, -1.42161359e-01, -4.66921960e-02,
-1.75905647e-02, -4.47679076e-02, 1.47695874e-01,
1.52747791e-01, -7.26650696e-02, 1.26652188e-02,
1.64344027e-01, 2.86987459e-01, -2.55656109e-01,
-1.69270325e-01, -1.57288026e-01, -1.41733452e-01,
3.14449389e-02, 6.72322728e-02, 4.08764171e-02,
4.87910703e-02, -8.62144866e-02, 1.37308500e-02,
1.38320873e-01, 2.93361633e-01, -2.56173589e-01,
-1.60593823e-01, -1.56029481e-01, -1.33532600e-01,
2.98287943e-02, 6.69993381e-02, -8.93028222e-03,

9.37221445e-03, 2.17085624e-01, 2.46599486e-01,
-2.41269819e-01, -1.96514027e-01, -1.56802982e-01,
-1.57855601e-01, 3.49231100e-02, 6.53096824e-02,
5.94677878e-02],
[-1.30350671e-02, -4.80760394e-02, -2.55554772e-02,
-8.58679794e-02, -8.36975609e-02, -1.65016033e-01,
-1.52310284e-01, 3.50162904e-01, 3.51440444e-01,
-1.83307736e-02, 2.83407556e-02, -1.13574644e-01,
-1.16456566e-01, -3.28759342e-02, -2.81029323e-02,
-5.68595464e-02, -6.77880069e-02, -3.27759869e-01,
-2.75470830e-01, -2.75053110e-01, -2.20792210e-01,
-5.63753701e-02, -1.10014504e-02, 6.10879272e-02,
9.75052483e-02, 7.18575697e-02, 7.81297116e-02,
-1.03449498e-02, -5.45499192e-02, -4.96468049e-02,
-2.51979332e-02, 2.18313693e-02, 6.63710735e-02,
1.50876968e-01, 1.95395986e-01, -6.25852416e-02,
-2.16348377e-02, 6.41190511e-02, 6.14735728e-02,
-6.85801556e-03, -6.27032285e-02, -6.19101459e-02,
-4.12400868e-02, 1.89709897e-02, 6.25134963e-02,
1.38511730e-01, 1.74460018e-01, 1.04598875e-01,
1.23838699e-01, -1.00788782e-02, -2.90134736e-02,
1.82012044e-02, 4.73711425e-02, 3.17262356e-02,
7.45443540e-02, 1.96938109e-01, 2.57274769e-01,
1.55227809e-01],
[-1.03076462e-01, 7.07752970e-02, -1.93773950e-02,
1.12916196e-01, 1.16221693e-01, -7.68183675e-03,
-8.32992282e-02, 4.83142268e-01, 4.45702232e-01,
3.02368507e-02, -3.91456262e-02, 1.63705825e-01,
1.82156553e-02, 6.96351282e-02, -1.00990641e-01,
8.25253162e-02, -7.14896701e-02, 2.93453012e-01,
4.44872133e-02, 1.03137657e-01, -4.70633743e-02,
6.74239464e-02, -9.88888637e-02, 2.19687342e-02,
-3.91241303e-02, 1.19502380e-02, -1.21164648e-01,
-1.23940721e-02, -4.40632573e-02, 4.73341340e-02,
-6.03010901e-02, 4.50611285e-02, -7.98717861e-02,
1.79883609e-02, -5.27577154e-02, 7.14046618e-02,
2.39609608e-02, 2.02142730e-02, -1.00423431e-01,
-2.68284612e-03, -3.31331052e-02, 5.11536826e-02,
-4.60304222e-02, 4.81568837e-02, -6.40896223e-02,
1.85559628e-02, -3.73467025e-02, -2.60927957e-02,
-1.83995774e-01, -2.80594285e-02, -7.07917881e-02,
3.19446654e-02, -1.12268055e-01, 3.23066512e-02,
-1.21053112e-01, 1.88931938e-02, -1.14629515e-01,
-4.44059027e-01],
[ 5.35371888e-02, 9.99234589e-02, 8.08885270e-02,
1.34049144e-01, 1.22287648e-01, 5.40113107e-02,
3.39758422e-02, -1.07019479e-01, -9.97370898e-02,
9.46314221e-02, 9.44724933e-02, 9.28113508e-02,
2.11429454e-02, 5.81676887e-02, -5.21791600e-02,
4.30577887e-02, -7.99258312e-02, -2.13148404e-01,
-2.67563460e-01, 7.48028027e-03, -7.71282665e-02,
-1.25950741e-01, -3.63618982e-01, 7.80315014e-02,
1.88249724e-02, 8.69974671e-02, 6.31394892e-02,
-3.57756468e-02, -4.65371420e-02, 9.88712123e-02,

1.00027753e-01, -1.59433396e-01, -3.57112197e-01,
2.39212446e-02, -1.68609661e-02, 1.32555585e-01,
1.32347594e-01, 8.29376616e-02, 3.95827221e-02,
-4.18817639e-02, -6.18296312e-02, 9.31842603e-02,
8.01404066e-02, -1.63493447e-01, -3.78031456e-01,
2.17252724e-02, -1.92879238e-02, 1.09333880e-01,
1.35961493e-01, -1.13613996e-02, 2.37826456e-03,
1.24830738e-01, 1.60150077e-01, -1.41955506e-01,
-2.79443056e-01, 2.83337740e-02, 5.08788422e-03,
-9.59758855e-02],
[-9.50669722e-02, -1.20875222e-01, -1.22602615e-01,
-1.59578663e-01, -1.69327603e-01, 1.86246485e-01,
1.77047539e-01, -8.32106795e-02, -9.55705649e-02,
-1.06172430e-01, -1.28841243e-01, -1.19324539e-01,
-6.18082322e-02, -4.15267648e-02, -4.81004405e-02,
-4.25388192e-02, -6.02973254e-02, 3.97891546e-01,
1.69329556e-01, 3.28697028e-02, 6.77855155e-02,
-1.15474985e-01, -2.09704787e-01, -4.69605669e-02,
-5.45326270e-02, 2.10101621e-02, 7.52081151e-02,
-6.33462810e-02, -2.08098867e-02, 1.11930331e-02,
6.42754932e-02, 3.68089815e-03, -1.11770845e-01,
1.73595912e-01, 2.63963764e-01, -1.74321611e-01,
-1.40003335e-01, 2.54754114e-02, 9.17806670e-02,
-4.26672332e-02, 1.94290078e-04, 2.27276480e-02,
9.58332272e-02, 5.31930758e-03, -9.74708959e-02,
1.63213971e-01, 2.32849453e-01, 9.52110793e-04,
9.53530378e-03, -1.12522301e-01, -7.84665461e-02,
-5.09895414e-02, -4.12308252e-02, 2.75605160e-03,
-1.37157251e-01, 2.25771449e-01, 3.41284783e-01,
-1.30214655e-01],
[ 1.55552590e-02, 4.06712235e-02, 4.24705781e-02,
1.59258427e-01, 1.89586145e-01, -4.21624461e-01,
-4.05989214e-01, -8.22303843e-02, -8.66280513e-02,
-1.78213451e-02, -2.59643771e-03, 1.52993067e-01,
4.24121337e-02, -4.46321203e-02, 3.02636576e-02,
-8.14004744e-02, 5.56377959e-02, 3.52014125e-02,
2.69531657e-01, 8.31790996e-02, 1.15156685e-01,
-2.52081281e-01, -2.00876592e-01, -1.79180651e-01,
-1.03049044e-01, 1.25583692e-01, 1.44092530e-02,
-9.84071329e-03, -4.68571580e-02, -1.34015195e-02,
-1.08990078e-01, 1.10007311e-01, -6.11864015e-02,
6.87257641e-02, 5.79359609e-02, 1.17297154e-01,
3.94617952e-02, 1.38038905e-01, 3.04574778e-02,
3.40704049e-02, -8.36506919e-03, 3.27777159e-03,
-9.13021970e-02, 9.56621739e-02, -7.81886822e-02,
7.24503293e-02, 6.39322229e-02, 4.23293637e-02,
-3.28501335e-02, -1.30048155e-01, -1.57544539e-01,
-8.51532881e-02, -1.80141799e-01, 1.53067147e-01,
-8.67860675e-04, 4.22490625e-02, 4.06653740e-02,
2.53731310e-01],
[-1.16154547e-02, 1.69931911e-02, 2.34056965e-02,
3.59396129e-02, 3.84264683e-02, 3.66975815e-01,
3.71204765e-01, 9.84065776e-02, 1.51127636e-01,
7.49086705e-03, -2.53698579e-03, 3.18798793e-02,

3.84519993e-02, -5.99185025e-02, -8.08987120e-02,
-8.36423857e-02, -8.74190997e-02, 3.02420847e-02,
1.40082684e-01, -1.08973045e-01, -1.47832727e-01,
1.27476264e-01, -1.27516059e-01, -1.61181309e-01,
-1.42594248e-01, 3.40004797e-02, -6.23297311e-02,
-2.16455562e-02, -7.54239463e-02, -7.19204890e-03,
-1.24893713e-01, 1.51880247e-01, -1.07820987e-01,
-1.62794852e-02, -1.00531308e-01, 7.08154160e-02,
6.27815590e-02, 1.60203259e-02, -8.74627395e-02,
-3.75985823e-02, -8.70930738e-02, -1.61329373e-02,
-1.45954500e-01, 1.37009370e-01, -1.19445332e-01,
-3.14923432e-02, -1.05734036e-01, 1.22055246e-01,
9.45826638e-03, 3.71154019e-02, -2.38276663e-02,
2.33745666e-02, -6.30823550e-02, 1.85620711e-01,
-8.00613141e-02, 4.44304676e-02, -5.35095397e-02,
5.46940310e-01],
[ 8.15787714e-02, -4.65403428e-02, -8.49192732e-03,
-1.43097000e-01, -1.25828943e-01, -2.44380093e-02,
2.08967226e-02, 1.63795718e-02, 9.38114541e-03,
-2.84142661e-04, 1.32437563e-01, -1.51030418e-01,
-1.40208510e-01, 6.10183634e-02, -9.60141617e-03,
8.49880715e-02, 7.36236534e-02, 1.73267745e-01,
-2.39946333e-01, 3.74739756e-01, 2.08542969e-01,
-2.20420715e-01, -3.89478080e-02, 7.21032276e-02,
1.17969496e-01, 6.61667705e-02, -1.59270488e-01,
2.58025863e-01, -1.29044064e-01, 1.06449302e-01,
-2.13878855e-01, -4.94068965e-02, -2.99001302e-03,
2.44013011e-03, -2.31278199e-02, -1.29100793e-01,
1.29495463e-02, 4.53847216e-02, -2.11418197e-01,
2.54854832e-01, -1.80285273e-01, 9.46311073e-02,
-2.37193962e-01, -6.20480160e-02, -2.62827863e-02,
-4.93321578e-03, -3.51132556e-02, 1.73155631e-01,
3.19666734e-02, 1.98075457e-01, 2.63574297e-02,
1.63810302e-01, -8.38075836e-02, -6.18828279e-03,
7.94148993e-02, 5.20213641e-02, 1.47648209e-02,
1.30554311e-01],
[-2.22632112e-02, -5.11919789e-02, -4.55203603e-02,
-1.43803865e-01, -1.41457728e-01, -1.07858701e-01,
-9.79479377e-02, -4.79534382e-02, -4.85698493e-02,
3.04581990e-02, 8.02524670e-02, -2.21559326e-01,
-1.63364994e-01, 2.63275641e-02, 5.18639072e-02,
4.15583741e-02, 1.35248844e-01, -1.47473903e-01,
1.15096210e-01, -1.44143433e-01, 1.70497977e-01,
2.50222886e-01, -7.72883636e-02, 2.24177635e-01,
2.57825861e-01, 4.52636890e-02, -9.10242266e-02,
2.48736330e-03, 7.58293277e-02, 6.70062116e-02,
8.76942354e-02, 2.81620525e-01, -1.94718981e-01,
1.02662109e-01, -1.39761067e-01, -1.02466087e-01,
-8.39689117e-02, 5.11374423e-02, -5.82251595e-02,
1.80138886e-02, 8.81134118e-02, 5.37350906e-02,
9.80767907e-02, 2.81807696e-01, -2.06051331e-01,
1.01353708e-01, -1.16695989e-01, 9.29584133e-03,
-1.56701095e-01, -4.97850513e-02, 3.64887291e-02,
7.02920599e-02, 4.06869614e-02, 2.68386889e-01,

-1.47926430e-01, 1.20742110e-01, -2.03340685e-01,
-3.40755623e-02]])

Eigen values –

array([35.65291214,  7.65920732,  4.17002442,  2.78757727,  1.9439828 ,
        1.22067748,  1.15089397,  0.46807127,  0.4019356 ,  0.37038528,
        0.31911354,  0.2356875 ])

Q2.6 PCA: Identify the optimum number of PCs (for this project, take at least 90%
explained variance). Show Scree plot.
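A sketch of the scree plot with the cumulative explained variance and the 90% cut-off:

    pca_full = PCA().fit(scaled_cens)
    cum_var = np.cumsum(pca_full.explained_variance_ratio_)

    plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
    plt.axhline(0.90, linestyle="--")
    plt.xlabel("Number of principal components")
    plt.ylabel("Cumulative explained variance")
    plt.show()

    n_pcs = int(np.argmax(cum_var >= 0.90)) + 1
    print("PCs needed for 90% variance:", n_pcs)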

Q2.7 PCA: Compare PCs with Actual Columns and identify which is explaining
most variance. Write inferences about all the Principal components in terms of
actual variables.

Below is the comparison of the PCs with the actual columns –


Variance of No_HH in original data: 2317017260.221425
Variance of No_HH in PC data: 1545119510.4332726
No_HH has more variance in the original data.
Variance of TOT_M in original data: 5385286471.3963175
Variance of TOT_M in PC data: 3628118144.18022
TOT_M has more variance in the original data.
Variance of TOT_F in original data: 12905122966.875473
Variance of TOT_F in PC data: 8492460052.918953
TOT_F has more variance in the original data.
Variance of M_06 in original data: 132270859.08106142
Variance of M_06 in PC data: 85630037.23629673
M_06 has more variance in the original data.
Variance of F_06 in original data: 128284948.6109541
Variance of F_06 in PC data: 80708649.2398582
F_06 has more variance in the original data.
Variance of M_SC in original data: 208120241.67792508
Variance of M_SC in PC data: 148886190.95335525
M_SC has more variance in the original data.
Variance of F_SC in original data: 472101104.4734909
Variance of F_SC in PC data: 335449338.8973726
F_SC has more variance in the original data.
Variance of M_ST in original data: 98261005.6703612
Variance of M_ST in PC data: 36224177.488896675
M_ST has more variance in the original data.
Variance of F_ST in original data: 252037897.742322
Variance of F_ST in PC data: 100354462.5835347
F_ST has more variance in the original data.
Variance of M_LIT in original data: 3125959685.415858
Variance of M_LIT in PC data: 1922249859.0465574
M_LIT has more variance in the original data.
Variance of F_LIT in original data: 5630680464.515241
Variance of F_LIT in PC data: 2504016996.260278
F_LIT has more variance in the original data.
Variance of M_ILL in original data: 393054624.2347335
Variance of M_ILL in PC data: 292145867.748893
M_ILL has more variance in the original data.
Variance of F_ILL in original data: 2219982831.7586427
Variance of F_ILL in PC data: 1889253830.4546127
F_ILL has more variance in the original data.
Variance of TOT_WORK_M in original data: 1326382711.0744288
Variance of TOT_WORK_M in PC data: 825131816.7763596
TOT_WORK_M has more variance in the original data.
Variance of TOT_WORK_F in original data: 1383271712.548395
Variance of TOT_WORK_F in PC data: 943814103.398083
TOT_WORK_F has more variance in the original data.
Variance of MAINWORK_M in original data: 991048052.0472517
Variance of MAINWORK_M in PC data: 563004708.6180522
MAINWORK_M has more variance in the original data.
Variance of MAINWORK_F in original data: 899895764.3333243
Variance of MAINWORK_F in PC data: 466734420.3221283
MAINWORK_F has more variance in the original data.
Variance of MAIN_CL_M in original data: 22459656.172239304
Variance of MAIN_CL_M in PC data: 17339318.558186468
MAIN_CL_M has more variance in the original data.
Variance of MAIN_CL_F in original data: 28370139.90588572
Variance of MAIN_CL_F in PC data: 17482571.60708124
MAIN_CL_F has more variance in the original data.
Variance of MAIN_AL_M in original data: 40953702.21024061
Variance of MAIN_AL_M in PC data: 29431717.762595408
MAIN_AL_M has more variance in the original data.
Variance of MAIN_AL_F in original data: 165489895.05318603
Variance of MAIN_AL_F in PC data: 64004559.732755624
MAIN_AL_F has more variance in the original data.
Variance of MAIN_HH_M in original data: 1634926.2453442924
Variance of MAIN_HH_M in PC data: 540228.8787961748
MAIN_HH_M has more variance in the original data.
Variance of MAIN_HH_F in original data: 10108676.23810397
Variance of MAIN_HH_F in PC data: 1018366.2480340382
MAIN_HH_F has more variance in the original data.
Variance of MAIN_OT_M in original data: 679565695.6845046
Variance of MAIN_OT_M in PC data: 198602858.0175068
MAIN_OT_M has more variance in the original data.
Variance of MAIN_OT_F in original data: 359944462.71701664
Variance of MAIN_OT_F in PC data: 89177780.68848272
MAIN_OT_F has more variance in the original data.
Variance of MARGWORK_M in original data: 54919833.48047486
Variance of MARGWORK_M in PC data: 32774941.699457083
MARGWORK_M has more variance in the original data.
Variance of MARGWORK_F in original data: 120922452.04424638
Variance of MARGWORK_F in PC data: 98247236.3759657
MARGWORK_F has more variance in the original data.
Variance of MARG_CL_M in original data: 1720155.1328638555
Variance of MARG_CL_M in PC data: 662782.2069395136
MARG_CL_M has more variance in the original data.
Variance of MARG_CL_F in original data: 12706559.199704092
Variance of MARG_CL_F in PC data: 2884399.8993323003
MARG_CL_F has more variance in the original data.
Variance of MARG_AL_M in original data: 14300163.566116458
Variance of MARG_AL_M in PC data: 7793035.755467512
MARG_AL_M has more variance in the original data.
Variance of MARG_AL_F in original data: 45885400.09604851
Variance of MARG_AL_F in PC data: 35635650.68925422
MARG_AL_F has more variance in the original data.
Variance of MARG_HH_M in original data: 214056.02576046556
Variance of MARG_HH_M in PC data: 58099.00839920158
MARG_HH_M has more variance in the original data.
Variance of MARG_HH_F in original data: 1436925.3548488815
Variance of MARG_HH_F in PC data: 385618.03239218844
MARG_HH_F has more variance in the original data.
Variance of MARG_OT_M in original data: 13027709.314062513
Variance of MARG_OT_M in PC data: 6012057.140851038
MARG_OT_M has more variance in the original data.

Variance of MARG_OT_F in original data: 16934799.552501462
Variance of MARG_OT_F in PC data: 6964990.246931101
MARG_OT_F has more variance in the original data.
Variance of MARGWORK_3_6_M in original data: 1524536773.226562
Variance of MARGWORK_3_6_M in PC data: 999362020.696375
MARGWORK_3_6_M has more variance in the original data.
Variance of MARGWORK_3_6_F in original data: 6884088307.6025715
Variance of MARGWORK_3_6_F in PC data: 3922398437.6359544
MARGWORK_3_6_F has more variance in the original data.
Variance of MARG_CL_3_6_M in original data: 36238072.03114251
Variance of MARG_CL_3_6_M in PC data: 23141590.35160169
MARG_CL_3_6_M has more variance in the original data.
Variance of MARG_CL_3_6_F in original data: 71698106.27726665
Variance of MARG_CL_3_6_F in PC data: 61466518.3740084
MARG_CL_3_6_F has more variance in the original data.
Variance of MARG_AL_3_6_M in original data: 820182.5043793997
Variance of MARG_AL_3_6_M in PC data: 393819.641918526
MARG_AL_3_6_M has more variance in the original data.
Variance of MARG_AL_3_6_F in original data: 6232719.529645966
Variance of MARG_AL_3_6_F in PC data: 1685197.9156564507
MARG_AL_3_6_F has more variance in the original data.
Variance of MARG_HH_3_6_M in original data: 9361068.857861383
Variance of MARG_HH_3_6_M in PC data: 5708106.623576284
MARG_HH_3_6_M has more variance in the original data.
Variance of MARG_HH_3_6_F in original data: 28469064.456337955
Variance of MARG_HH_3_6_F in PC data: 23547104.009145148
MARG_HH_3_6_F has more variance in the original data.
Variance of MARG_OT_3_6_M in original data: 128686.18450704271
Variance of MARG_OT_3_6_M in PC data: 34777.787790982
MARG_OT_3_6_M has more variance in the original data.
Variance of MARG_OT_3_6_F in original data: 810046.4717429585
Variance of MARG_OT_3_6_F in PC data: 217622.8541079817
MARG_OT_3_6_F has more variance in the original data.
Variance of MARGWORK_0_3_M in original data: 9223152.65312011
Variance of MARGWORK_0_3_M in PC data: 4233767.603638496
MARGWORK_0_3_M has more variance in the original data.
Variance of MARGWORK_0_3_F in original data: 11074498.645830909
Variance of MARGWORK_0_3_F in PC data: 4672596.819874803
MARGWORK_0_3_F has more variance in the original data.
Variance of MARG_CL_0_3_M in original data: 2219227.0994498287
Variance of MARG_CL_0_3_M in PC data: 999702.9436619718
MARG_CL_0_3_M has more variance in the original data.
Variance of MARG_CL_0_3_F in original data: 7777275.346478849
Variance of MARG_CL_0_3_F in PC data: 4403767.748062782
MARG_CL_0_3_F has more variance in the original data.
Variance of MARG_AL_0_3_M in original data: 205514.0674858172
Variance of MARG_AL_0_3_M in PC data: 34925.857689894794
MARG_AL_0_3_M has more variance in the original data.
Variance of MARG_AL_0_3_F in original data: 1249125.3126736197
Variance of MARG_AL_0_3_F in PC data: 153846.88290232338
MARG_AL_0_3_F has more variance in the original data.
Variance of MARG_HH_0_3_M in original data: 581526.7179088405
Variance of MARG_HH_0_3_M in PC data: 182287.19814269672
MARG_HH_0_3_M has more variance in the original data.

Variance of MARG_HH_0_3_F in original data: 2513423.2002738593
Variance of MARG_HH_0_3_F in PC data: 1304802.8919440494
MARG_HH_0_3_F has more variance in the original data.
Variance of MARG_OT_0_3_M in original data: 11641.897865316923
Variance of MARG_OT_0_3_M in PC data: 3013.3870671948357
MARG_OT_0_3_M has more variance in the original data.
Variance of MARG_OT_0_3_F in original data: 95939.39665248436
Variance of MARG_OT_0_3_F in PC data: 24418.612360621977
MARG_OT_0_3_F has more variance in the original data.
Variance of NON_WORK_M in original data: 372836.2517581193
Variance of NON_WORK_M in PC data: 139886.95884132895
NON_WORK_M has more variance in the original data.
Variance of NON_WORK_F in original data: 828480.8333235511
Variance of NON_WORK_F in PC data: 260929.50860475318
NON_WORK_F has more variance in the original data.
Variance of Gender_Ratio in original data: 0.05246821628445864
Variance of Gender_Ratio in PC data: 0.05208232410490937
Gender_Ratio has more variance in the original data.
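One way to reproduce a comparison like the one above; this assumes 'PC data' means the data reconstructed from the retained components via inverse_transform on the unscaled columns (the exact method behind the printed output is not shown):

    pca_k = PCA(n_components=n_pcs).fit(num)
    recon = pd.DataFrame(pca_k.inverse_transform(pca_k.transform(num)),
                         columns=num.columns)

    for col in num.columns:
        v_orig, v_pc = num[col].var(), recon[col].var()
        src = "original" if v_orig > v_pc else "PC"
        print(f"{col}: original={v_orig:.2f}, PC={v_pc:.2f} "
              f"-> more variance in the {src} data")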

Q2.8 PCA: Write linear equation for first PC.


The linear equation of the first PC expresses it as a weighted sum of the (scaled) original variables:

PC1_i = Σ_j w[j] · X[i, j]

where w is the loading (eigenvector) of the first principal component. Its first five weights are:

[3.4954678040932663e-16, 0.7261820937414998, -0.6737913400601855, 0.06678957154740779, -0.1191803252287208]
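A sketch that prints PC1 as an explicit linear combination of the scaled variables, truncated to the first five terms:

    first_pc = pca.components_[0]
    terms = [f"({w:+.4f} * {col})" for w, col in zip(first_pc, scaled_cens.columns)]
    print("PC1 =", " + ".join(terms[:5]), "+ ...")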
