Data Mining Project DSBA PCA Report
List of Tables
Table 1: Dataset head
Table 2: Dataset info
Table 3: Dataset Summary
Table 4: Covariance Matrix (part)
Table 5: PCs with actual columns
List of Equations
Equation 1: Linear Equation for First PC
PCA FH (FT): Primary census abstract for female headed households excluding institutional households
(India & States/UTs - District Level), Scheduled tribes - 2011
The Indian Census has the reputation of being one of the best in the world. The first Census in India was
conducted in the year 1872. This was conducted at different points of time in different parts of the
country. In 1881 a Census was taken for the entire country simultaneously. Since then, Census has been
conducted every ten years, without a break. Thus, the Census of India 2011 was the fifteenth in this
unbroken series since 1872, the seventh after independence and the second census of the third
millennium and twenty first century. The census has been uninterruptedly continued despite several
adversities like wars, epidemics, natural calamities, political unrest, etc. The Census of India is conducted
under the provisions of the Census Act 1948 and the Census Rules, 1990. The Primary Census Abstract,
which is an important publication of the 2011 Census, gives basic information on Area, Total Number of
Households, Total Population, Scheduled Castes, Scheduled Tribes Population, Population in the age
group 0-6, Literates, Main Workers and Marginal Workers classified by the four broad industrial
categories, namely, (i) Cultivators, (ii) Agricultural Laborers, (iii) Household Industry Workers, and (iv)
Other Workers and also Non-Workers. The characteristics of the Total Population include Scheduled
Castes, Scheduled Tribes, Institutional and Houseless Population and are presented by sex and residence
(rural-urban). Census 2011 covered 35 States/Union Territories containing 640 districts which in turn
contained 5,924 sub-districts, 7,935 towns and 6,40,867 villages.
The data collected has so many variables that it is difficult to find useful details without using Data
Science techniques. You are tasked to perform a detailed EDA and identify the optimum Principal Components
that explain the most variance in the data. (Use sklearn only.)
We will start analyzing the data by going through basic steps such as:
1. Check head
2. Check info
3. Check summary
4. Check nulls
5. Check duplicates
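The five checks above can be sketched with pandas as follows. This is a minimal illustration, not the report's own code: the small DataFrame is a stand-in built from a few of the rows and columns shown in the dataset head, and in practice `df` would be loaded from the census file.

```python
import pandas as pd

# Stand-in for the census data (assumption: the real file would be loaded
# with a pandas reader instead of being typed in by hand).
df = pd.DataFrame({
    "State": ["Jammu & Kashmir"] * 5,
    "Area Name": ["Kupwara", "Badgam", "Leh(Ladakh)", "Kargil", "Punch"],
    "No_HH": [7707, 6218, 4452, 1320, 11654],
    "TOT_M": [23388, 19585, 6546, 2784, 20591],
    "TOT_F": [29796, 23102, 10964, 4206, 29981],
})

print(df.head())                        # 1. first rows
df.info()                               # 2. dtypes and non-null counts
print(df.describe().T)                  # 3. summary statistics (numeric columns)
null_counts = df.isnull().sum()         # 4. missing values per column
dup_count = int(df.duplicated().sum())  # 5. fully duplicated rows

print("Total nulls:", int(null_counts.sum()))
print("Duplicate rows:", dup_count)
```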
State Code  Dist.Code  State            Area Name    No_HH  TOT_M  TOT_F  M_06  F_06
1           1          Jammu & Kashmir  Kupwara       7707  23388  29796  5862  6196
1           2          Jammu & Kashmir  Badgam        6218  19585  23102  4482  3733
1           3          Jammu & Kashmir  Leh(Ladakh)   4452   6546  10964  1082  1018
1           4          Jammu & Kashmir  Kargil        1320   2784   4206   563   677
1           5          Jammu & Kashmir  Punch        11654  20591  29981  5157  4587
(not all columns are shown)
int64 59
object 2
The dataset has 640 rows and 61 columns: 59 columns have an integer data type (int64) and 2
columns have the object data type.
Checking summary:
We can see there are 640 districts (as per 2011). On average there are about 52 thousand
households in each district, but the count ranges from 350 to over 3 lakh. We will explore more
in the EDA section.
Checking Nulls
Checking Duplicates
Perform a detailed exploratory analysis of the variables. Since the number of variables is
very large, you are asked to choose any 5 variables from the 20 important variables listed
below.
No_HH, TOT_M, TOT_F, M_06, F_06, M_SC, F_SC, M_ST, F_ST, M_LIT, F_LIT, M_ILL,
F_ILL, TOT_WORK_M, TOT_WORK_F, MAINWORK_M, MAINWORK_F, MAIN_CL_M,
MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M, MAIN_HH_F, MAIN_OT_M,
MAIN_OT_F
Example Question:
While exploring the variables, it is recommended that you focus on the insights possible from each of
the variables. Also provide a small discussion based on the plots or tables.
1. Which state has highest gender ratio and which has the lowest?
The state of Andhra Pradesh has the highest female-to-male ratio (1.89) according to the 2011 census data,
that is, 1.89 females per male. The Union Territory of Lakshadweep has the lowest gender ratio, 1.15.
Among the states, Haryana & Uttar Pradesh have the lowest gender ratio (F to M).
• Krishna District of Andhra Pradesh has the highest female-to-male ratio of 2.28.
• Badgam District of Jammu & Kashmir has the lowest female-to-male ratio of 1.17.
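A ratio of this kind can be computed along the following lines. This is a hedged sketch on made-up numbers: the column names `State`, `Area Name`, `TOT_M` and `TOT_F` follow the dataset, but the values and the state labels "A" and "B" are purely illustrative.

```python
import pandas as pd

# Illustrative data only; not taken from the census file.
df = pd.DataFrame({
    "State": ["A", "A", "B", "B"],
    "Area Name": ["a1", "a2", "b1", "b2"],
    "TOT_M": [100, 200, 150, 50],
    "TOT_F": [150, 450, 180, 70],
})

# State level: sum the populations first, then take the ratio, so that
# large districts carry more weight than small ones.
state_tot = df.groupby("State")[["TOT_M", "TOT_F"]].sum()
state_ratio = (state_tot["TOT_F"] / state_tot["TOT_M"]).sort_values(ascending=False)

# District level: ratio per row.
df["F_to_M"] = df["TOT_F"] / df["TOT_M"]
top_district = df.loc[df["F_to_M"].idxmax(), "Area Name"]
low_district = df.loc[df["F_to_M"].idxmin(), "Area Name"]

print(state_ratio)
print("Highest:", top_district, "Lowest:", low_district)
```

Summing before dividing is a design choice; averaging the district ratios instead would give a different, unweighted state figure.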
The map below shows the gender ratio by state. ‘Telangana’ appears white because the data
is for 2011 and Telangana was created in 2014. You can look for older shape files for India
(from before 2011) to avoid this.
According to the data, northern states have lower gender ratios in general.
2. Analysis of Literacy
Kerala is at the top while Rajasthan & Bihar are at the bottom.
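One way to rank states by literacy is sketched below. The report does not say how literacy was measured for this ranking, so the rate used here, literates divided by total population, is an assumption, and the numbers are illustrative only.

```python
import pandas as pd

# Illustrative data only; column names follow the dataset.
df = pd.DataFrame({
    "State": ["A", "A", "B"],
    "TOT_M": [100, 100, 200],
    "TOT_F": [100, 100, 200],
    "M_LIT": [80, 90, 100],
    "F_LIT": [70, 80, 60],
})

state = df.groupby("State")[["TOT_M", "TOT_F", "M_LIT", "F_LIT"]].sum()

# Assumed definition: share of literates in the total population.
state["lit_rate"] = (state["M_LIT"] + state["F_LIT"]) / (state["TOT_M"] + state["TOT_F"])
ranked = state["lit_rate"].sort_values(ascending=False)
print(ranked)
```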
Uttar Pradesh has the largest ‘non-working’ population according to the 2011 data. Kerala has the
second-largest ‘non-working’ female population, after Uttar Pradesh.
Daman & Diu and Dadra & Nagar Haveli have the smallest non-working populations for both
Females & Males.
Uttar Pradesh has the largest SC/ST population. The SC population is also
significantly larger than the ST population according to the 2011 data, and there are more SC
females than males.
There can be more exploration on this data based on your personal interest. For example – take one
state or UT and dig deeper. You can create Instagram/ LinkedIn template based infographics and share
them on your Social Profile to build network. You are strictly forbidden to share this project on any
public or private forum.
We choose not to treat outliers for this case. Do you think that treating outliers for this
case is necessary?
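With outliers left untreated, the next step the report refers to is scaling. A minimal sketch, assuming z-score standardisation with sklearn's `StandardScaler` (the report does not name the scaler it used) and random stand-in values for three of the numeric columns:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Random stand-in values; the real input would be the numeric census columns.
rng = np.random.default_rng(0)
num = pd.DataFrame({
    "No_HH": rng.integers(350, 300_000, size=50),
    "TOT_M": rng.integers(400, 500_000, size=50),
    "TOT_F": rng.integers(600, 600_000, size=50),
})

# Each column is centred on 0 and rescaled to unit (population) standard deviation.
scaled = pd.DataFrame(StandardScaler().fit_transform(num), columns=num.columns)
print(scaled.describe().T[["mean", "std"]])
```

Scaling matters here because PCA is variance-based: without it, the columns with the largest raw counts would dominate the components.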
After scaling,
Perform all the required steps for PCA (use sklearn only)
Bartlett's test of sphericity tests the hypothesis that the variables are uncorrelated in the population. If
the null hypothesis cannot be rejected, then PCA is not advisable.
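The test statistic can be computed directly from the correlation matrix. The sketch below is an assumption about how one might do it with NumPy and SciPy, not the report's own code; `bartlett_sphericity` is a hypothetical helper name and the data are synthetic. It uses the standard chi-square approximation, with statistic -(n - 1 - (2p + 5)/6) * ln|R| and p(p - 1)/2 degrees of freedom, where n is the number of rows, p the number of variables and R the correlation matrix.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test of sphericity (hypothetical helper, chi-square approximation)."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # slogdet is numerically safer than log(det(R)) for near-singular R.
    _, logdet = np.linalg.slogdet(R)
    stat = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) / 2
    p_value = chi2.sf(stat, dof)
    return stat, p_value

# Synthetic data: four variables sharing one common factor, so they are correlated.
rng = np.random.default_rng(0)
common = rng.normal(size=(200, 1))
X = common + 0.5 * rng.normal(size=(200, 4))

stat, p_value = bartlett_sphericity(X)
print(f"chi-square = {stat:.1f}, p-value = {p_value:.3g}")
```

A small p-value rejects the hypothesis of uncorrelated variables, which is the case where PCA is worth attempting.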
KMO Test
The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (MSA) is an index used to examine how
appropriate PCA is. Generally, if MSA is less than 0.5, PCA is not recommended, since no reduction is
expected. On the other hand, MSA > 0.7 is expected to provide a considerable reduction in the dimension
and extraction of meaningful components.
MSA = 0.80349
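For reference, the overall MSA compares squared correlations with squared partial correlations. The sketch below is a NumPy illustration on synthetic data, not the code that produced the value above; `kmo_overall` is a hypothetical helper name, and it assumes the correlation matrix is invertible.

```python
import numpy as np

def kmo_overall(X):
    """Overall KMO measure of sampling adequacy (hypothetical helper)."""
    R = np.corrcoef(X, rowvar=False)
    R_inv = np.linalg.inv(R)  # assumes R is non-singular
    # Partial correlations from the inverse correlation matrix.
    d = np.sqrt(np.diag(R_inv))
    partial = -R_inv / np.outer(d, d)
    off_diag = ~np.eye(R.shape[0], dtype=bool)
    r2 = np.sum(R[off_diag] ** 2)
    q2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + q2)

# Synthetic data: four variables driven by one common factor.
rng = np.random.default_rng(0)
common = rng.normal(size=(200, 1))
X = common + 0.5 * rng.normal(size=(200, 4))

msa = kmo_overall(X)
print(f"MSA = {msa:.3f}")
```

When partial correlations are small relative to the plain correlations, the ratio approaches 1, which indicates that the variables share enough common variance for PCA to reduce the dimension.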
Covariance Matrix
            No_HH  TOT_M  TOT_F  M_06  F_06  M_SC   F_SC   M_ST   F_ST
No_HH       1      0.92   0.97   0.8   0.8   0.78   0.83   0.15   0.17
TOT_M       0.92   1      0.98   0.95  0.95  0.84   0.83   0.09   0.09
TOT_F       0.97   0.98   1      0.91  0.91  0.82   0.83   0.12   0.13
M_06        0.8    0.95   0.91   1     1     0.78   0.75   0.06   0.04
F_06        0.8    0.95   0.91   1     1     0.77   0.74   0.07   0.05
M_SC        0.78   0.84   0.82   0.78  0.77  1      0.99   -0.05  -0.05
F_SC        0.83   0.83   0.83   0.75  0.74  0.99   1      -0.01  -0.01
M_ST        0.15   0.09   0.12   0.06  0.07  -0.05  -0.01  1      0.99
F_ST        0.17   0.09   0.13   0.04  0.05  -0.05  -0.01  0.99   1
M_LIT       0.93   0.99   0.99   0.91  0.91  0.82   0.82   0.09   0.09
F_LIT       0.93   0.93   0.96   0.83  0.83  0.72   0.73   0.1    0.1
M_ILL       0.76   0.91   0.86   0.95  0.95  0.8    0.76   0.08   0.07
F_ILL       0.86   0.89   0.89   0.86  0.87  0.83   0.85   0.14   0.15
TOT_WORK_M  0.94   0.97   0.97   0.86  0.85  0.83   0.82   0.12   0.12
TOT_WORK_F  0.93   0.81   0.88   0.68  0.69  0.71   0.78   0.27   0.29
MAINWORK_M  0.93   0.93   0.94   0.79  0.79  0.78   0.78   0.11   0.11
MAINWORK_F  0.89   0.75   0.82   0.59  0.59  0.65   0.71   0.23   0.25
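The diagonal of 1s in the table is consistent with a covariance matrix computed on standardised data, where it coincides with the correlation matrix of the original variables. A small sketch of that relationship, on synthetic stand-in columns and assuming `StandardScaler` was the scaler:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins: two correlated columns and one unrelated column.
rng = np.random.default_rng(1)
base = rng.normal(size=200)
num = pd.DataFrame({
    "No_HH": base + 0.3 * rng.normal(size=200),
    "TOT_M": base + 0.3 * rng.normal(size=200),
    "M_ST": rng.normal(size=200),
})

scaled = StandardScaler().fit_transform(num)

# ddof=0 matches the population standard deviation used by StandardScaler.
cov = pd.DataFrame(np.cov(scaled, rowvar=False, ddof=0),
                   index=num.columns, columns=num.columns)
corr = num.corr()

print(cov.round(2))
```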
Identify the optimum number of PCs (for this project, the optimum number is based on
the explanation of at least 90% of variance)
Since the number of variables is large and value of MSA is 0.8, it is expected that a few components will
be enough to explain 90% of variation in the data.
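The 90% rule can be applied to the cumulative explained-variance ratio from sklearn's `PCA`. A minimal sketch on synthetic data (12 variables generated from 3 hidden factors, an assumption made purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 12 observed variables built from 3 latent factors plus noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 12))
X = factors @ mixing + 0.1 * rng.normal(size=(300, 12))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)  # keep all components

cum_var = np.cumsum(pca.explained_variance_ratio_)
# First position where the cumulative share reaches 90%, as a 1-based count.
n_components = int(np.argmax(cum_var >= 0.90) + 1)

print(cum_var.round(3))
print("Components needed for 90%:", n_components)
```

As an alternative, `PCA(n_components=0.90, svd_solver="full")` asks sklearn to pick the number of components for a target variance share directly.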
Compare PCs with actual variables and identify which is explaining most variance. Try to
explain the PCs in terms of the original variables
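A table of PCs against the original columns can be built from `pca.components_`. The sketch below uses synthetic stand-in columns (three "size" variables and two "tribe" variables, an assumption for illustration); `pc_table` is a hypothetical name. Note that `components_` holds the eigenvector weights, not correlations, although for standardised inputs the two are proportional within each PC.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins: two independent groups of correlated columns.
rng = np.random.default_rng(0)
size = rng.normal(size=300)
tribe = rng.normal(size=300)
num = pd.DataFrame({
    "No_HH": size + 0.2 * rng.normal(size=300),
    "TOT_M": size + 0.2 * rng.normal(size=300),
    "TOT_F": size + 0.2 * rng.normal(size=300),
    "M_ST": tribe + 0.2 * rng.normal(size=300),
    "F_ST": tribe + 0.2 * rng.normal(size=300),
})

X_scaled = StandardScaler().fit_transform(num)
pca = PCA(n_components=2).fit(X_scaled)

# Rows are original variables, columns are principal components.
pc_table = pd.DataFrame(pca.components_.T, index=num.columns, columns=["PC1", "PC2"])

# Variables with the largest absolute weight on PC1.
top_pc1 = pc_table["PC1"].abs().sort_values(ascending=False).head(3).index.tolist()

print(pc_table.round(2))
print("Top PC1 variables:", top_pc1)
```

Sorting by absolute weight is what allows each PC to be described in terms of the original variables.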
Observations:
The first Principal Component is positively correlated with Number of Households, Total Male & Female
population, Literacy & Illiteracy numbers among M & F, Number of SC Males & Females, Working
population, etc. It explains the most variance in the data, about 56%.
The second Principal Component is correlated with Marginal Cultivator Male/Female population and
Marginal Agriculture (Male & Female) population, etc. It explains about 14% of the variation in
the data.
The third Principal Component explains about 7% of the variation in the data. It correlates positively with
Marginal Agriculture 0-3 Female, and 3-6 M & F population.
The fourth Principal Component correlates positively with Marginal Households Male and Marginal Other
(0-3, 3-6) Workers Male population. It explains about 6% of the variation in the data.
The fifth Principal Component explains about 4% of the variation in the data. It is positively correlated with
Scheduled Tribes population (Male & Female) and Non-working Male & Female population.
The sixth Principal Component explains about 3% of the variation in the data. It is positively correlated with Female
Marginal Other workers (0-3, 3-6) and Main & Marginal Households Female population.
Overall, the first 6 PCs explain 90% of the variation in the data. Each PC correlates with a different set of
variables, showing how different aspects of the population contribute to the variation in the data.
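Each PC is a linear combination of the scaled variables, which is the form a linear equation for the first PC takes. The report's actual coefficients are not reproduced in the text, so the sketch below only shows how such an equation could be written out and checked, using synthetic stand-in columns:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for three correlated census columns.
rng = np.random.default_rng(0)
base = rng.normal(size=100)
num = pd.DataFrame({
    name: base + 0.3 * rng.normal(size=100)
    for name in ["No_HH", "TOT_M", "TOT_F"]
})

X_scaled = StandardScaler().fit_transform(num)
pca = PCA(n_components=1).fit(X_scaled)

# PC1 = w1 * x1 + w2 * x2 + ... on the scaled variables.
weights = pca.components_[0]
equation = "PC1 = " + " + ".join(
    f"({w:.2f} * {col})" for w, col in zip(weights, num.columns)
)
print(equation)

# The equation reproduces sklearn's own scores for the first component.
scores_manual = X_scaled @ weights
scores_sklearn = pca.transform(X_scaled)[:, 0]
```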
Code: