
GRADED PROJECT

DATA MINING

Submitted By - UMENDRA PRATAP SINGH SOLANKI


Submission Date – 24/04/2023
CONTENT
1. Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. The inferences drawn from this should be properly documented.

2. Scale the variables and write the inference for using the type of scaling function for this case
study.

3. Comment on the comparison between covariance and the correlation matrix after scaling.

4. Check the dataset for outliers before and after scaling. Draw your inferences from this
exercise.

5. Build the covariance matrix and compute its eigenvalues and eigenvectors.

6. Write the explicit form of the first PC (in terms of Eigen Vectors).

7. Discuss the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate? Perform PCA
and export the data of the Principal Component scores into a data frame.

8. Mention the business implication of using the Principal Component Analysis for this case
study.

9. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA, etc, etc)

10. Do you think scaling is necessary for clustering in this case? Justify

11. Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them.

12. Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and find the silhouette score.

13. Describe cluster profiles for the clusters defined. Recommend different priority-based
actions that need to be taken for different clusters on the basis of their vulnerability situations
according to their Economic and Health Conditions.
DATA DICTIONARY-

Table 1 DATA DICTIONARY

PCA:
Problem statement:
The ‘Hair Salon.csv’ dataset contains various variables used for the
context of Market Segmentation. This particular case study is based
on various parameters of a salon chain of hair products. You are
expected to do Principal Component Analysis for this case study
according to the instructions given in the rubric.
Dataset for Part 1: Hair Salon.csv
Descriptive statistics
 Shape: the dataset contains 100 rows and 13 columns.
 The top five rows of the data frame are shown below.

Table 2 top five rows of dataset

 In our dataset one column is of integer type and 12 columns are of float type.

Table 3 info of the data frame

 The five-point summary (minimum, 25%, 50%, 75%, maximum) of all the features is available,
along with the mean, median and standard deviation of each feature, as shown below.

Table 4 statistic information about the feature

 There are no duplicate rows present in the data.


 There are no missing values present in the dataset (a code sketch of these checks is given below).
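
A minimal sketch of the checks listed above, assuming the file name 'Hair Salon.csv' given in the problem statement:

```python
import pandas as pd

df = pd.read_csv("Hair Salon.csv")

print(df.shape)               # (100, 13): rows and columns
df.info()                     # data types: 1 int column, 12 float columns
print(df.describe().T)        # five-point summary, mean and std of each feature
print(df.duplicated().sum())  # 0 duplicate rows
print(df.isnull().sum())      # 0 missing values per column
```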

EXPLORATORY DATA ANALYSIS-

UNIVARIATE ANALYSIS-

 The data are not normally distributed about the centre.


 All the features show some skewness.
 We see that outliers are present in four features, which also affect the data:
 Ecom, SalesFImage, OrdBilling and DelSpeed.
 BOXPLOTS AND HISTPLOTS OF THE FEATURES.

Figure 1 HISTPLOT AND BOXPLOT OF ALL FEATURES


MULTIVARIATE ANALYSIS –

 PAIRPLOT

Figure 2 PAIRPLOT
 HEATMAP

Figure 3 heatmap

 Some independent features are correlated with each other.
 Correlations between features:
 TechSup and CompRes have a correlation of 0.87
 OrdBilling and CompRes have a correlation of 0.76
 SalesFImage and Ecom have a correlation of 0.79
 TechSup and WartyClaim have a correlation of 0.80
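
A minimal plotting sketch for the univariate and multivariate views above, assuming df is the data frame loaded earlier and that the identifier column is named ID:

```python
import matplotlib.pyplot as plt
import seaborn as sns

num_cols = df.select_dtypes(include="number").columns.drop("ID", errors="ignore")

# Univariate view: histogram (with KDE) and boxplot for every numeric feature
for col in num_cols:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(df[col], kde=True, ax=axes[0])
    sns.boxplot(x=df[col], ax=axes[1])
    fig.suptitle(col)
    plt.show()

# Multivariate view: pairplot and correlation heat map
sns.pairplot(df[num_cols])
plt.show()
sns.heatmap(df[num_cols].corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```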
SCALING –

 For scaling we first drop the ID feature before scaling the numeric values, as it does not add
any value in model building.
 We use the z-score for scaling (a code sketch is given after Table 6 below).
 With the z-score, the mean of each feature tends to zero
 and the standard deviation tends to one.
 It is a standardization technique based on the mean and standard deviation of each feature.
 The top five rows of the scaled data are shown below.

Table 5 TOP 5 ROWS OF SCALED DATA

Descriptive statistics of scaled data

Table 6 DESCRIPTIVE STATISTIC OF SCALED DATA
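
A minimal sketch of the scaling step, assuming df is the data frame loaded earlier; scipy's zscore is used here, though sklearn's StandardScaler would give the same result:

```python
from scipy.stats import zscore

# Drop the ID column (it adds no value for modelling), then standardise
features = df.drop(columns=["ID"])
scaled_df = features.apply(zscore)   # each column now has mean ~0 and std ~1

print(scaled_df.head())        # Table 5: top five rows of the scaled data
print(scaled_df.describe().T)  # Table 6: means ~0, standard deviations ~1
```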


COVARIANCE AND CORRELATION MATRIX

Correlation matrix-

 A correlation matrix gives a snapshot of the relationships between more than two variables in a
tabular format.
 It shows how the independent variables are correlated with each other.
 Some independent features are correlated with each other.
 Correlations between features:
 TechSup and CompRes have a correlation of 0.87
 OrdBilling and CompRes have a correlation of 0.76
 SalesFImage and Ecom have a correlation of 0.79
 TechSup and WartyClaim have a correlation of 0.80

Table 7 CORRELATION
Table 8 CORELATION MATRIX

Covariance matrix-

 covariance matrix will explain the relationship between any two features in the data. This
process is used in collinear variables. A positive covariance value shows a direct relationship
(both variables increase or decrease).

 A positive covariance indicates that the two variables have a positive relationship whereas
negative covariance shows that they have a negative relationship. If two elements do not vary
together then they will display a zero covariance.

Table 9 COVARIANCE MATRIX


Table 10 COVARIANCE TABLE
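
A short sketch of the comparison, assuming scaled_df is the z-scored data from the previous section: on standardized data the covariance matrix coincides with the correlation matrix up to the sample-size factor n/(n-1).

```python
import numpy as np

corr_matrix = scaled_df.corr()   # correlation of the scaled features (Table 8)
cov_matrix = scaled_df.cov()     # covariance of the scaled features (Table 9)

# With unit-variance (z-scored) features the two matrices agree almost exactly;
# the tiny difference comes from the n vs n-1 divisor used by the two estimators.
print(np.allclose(cov_matrix.values, corr_matrix.values, atol=0.02))
```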
OUTLIERS DETECTION –

 For outlier detection we make boxplots of all the features.

We see that outliers are present in four features:

 Ecom
 SalesFImage
 OrdBilling
 DelSpeed

Figure 4 boxplot of features with outlier

 Then we treat the outliers with the help of the interquartile range (IQR).


 We define a function which returns the upper and lower limits used to detect outliers for each
feature (a code sketch is given at the end of this subsection).
 Cap and floor the values beyond the outlier boundaries.
Check to verify whether the outliers have been treated.

Figure 5 boxplot of feature after outlier treatment

 We see that all the outliers are removed.
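
A minimal sketch of the IQR-based capping described above, assuming df is the data frame loaded earlier and using the conventional 1.5 × IQR whiskers:

```python
def outlier_limits(series, k=1.5):
    """Return the lower and upper IQR-based outlier boundaries for one feature."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

treated = df.drop(columns=["ID"]).copy()
for col in treated.columns:
    lower, upper = outlier_limits(treated[col])
    # Cap and floor the values that fall beyond the outlier boundaries
    treated[col] = treated[col].clip(lower=lower, upper=upper)

# Verify that no feature still has points outside its boundaries
for col in treated.columns:
    lower, upper = outlier_limits(treated[col])
    assert treated[col].between(lower, upper).all()
```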


After outlier treatment we scale the data.

 We use the z-score for scaling, in which the mean of each feature tends to zero, as can be seen
in the boxplots.
 Scaling only changes the scale of the boxplots.
 We see that in our boxplots the mean tends to zero and the features are brought onto a common scale.

Figure 6 boxplot of scaled features after outlier treatment


EIGEN VALUES AND EIGEN VECTORS –

WHEN WE APPLY PCA

 Extract eigen vectors

Figure 7 EIGEN VECTORS


EIGEN VALUES-

Check the eigen values –

Figure 8 EIGEN VALUES
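
A minimal sketch of how the covariance matrix and its eigen decomposition can be obtained with NumPy, assuming scaled_df is the z-scored data frame from the scaling step:

```python
import numpy as np

# Covariance matrix of the scaled data (variables in rows after transposing)
cov_matrix = np.cov(scaled_df.T)

# Eigen decomposition; eigh is appropriate for symmetric matrices
eigen_values, eigen_vectors = np.linalg.eigh(cov_matrix)

# Sort in decreasing order of eigenvalue so that PC1 comes first
order = np.argsort(eigen_values)[::-1]
eigen_values = eigen_values[order]
eigen_vectors = eigen_vectors[:, order]

print(eigen_values)    # Figure 8: eigenvalues
print(eigen_vectors)   # Figure 7: eigenvectors (one per column)
```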

THE EXPLICIT FORM OF THE FIRST PC (IN TERMS OF EIGEN VECTORS).

 Check the explained variance for each PC.


 Explained variance = (eigen value of each PC)/(sum of eigen values of all PCs).

Figure 9 variance for each pc

Create a data frame containing the loadings or coefficients of all PCs.

Table 11 coefficient of all pcs
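
As a generic illustration of what the explicit form looks like (the actual coefficients must be read from the first row of Table 11; the symbols Z1, ..., Z12 simply stand for the 12 scaled features):

PC1 = e11 * Z1 + e12 * Z2 + ... + e1,12 * Z12

where (e11, ..., e1,12) is the eigenvector corresponding to the largest eigenvalue. Equivalently, assuming the sorted eigen_vectors array and scaled_df from the earlier sketches, the PC1 scores can be computed as:

```python
# Project the scaled data onto the first (largest-eigenvalue) eigenvector
pc1_coefficients = eigen_vectors[:, 0]           # loadings of PC1 (first row of Table 11)
pc1_scores = scaled_df.values @ pc1_coefficients
```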


Figure 10 Scree plot

We use the explained variance ratio to find a cut-off for selecting the number of PCs.

Figure 11 variance ratio

Number of PCs on the basis of cumulative explained variance (a code sketch is given below Table 12).

Table 12 PCs on the basis of cumulative explained variance
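
A short sketch of how the cumulative eigenvalue curve can be turned into a choice of components, continuing from the eigen decomposition above; the 90% retention threshold is an illustrative assumption, not something fixed by the rubric:

```python
import numpy as np

explained_variance = eigen_values / eigen_values.sum()
cumulative_variance = np.cumsum(explained_variance)
print(cumulative_variance)

# Smallest number of PCs whose cumulative explained variance reaches 90%
n_components = int(np.argmax(cumulative_variance >= 0.90) + 1)
print(n_components)
```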


How much the original features matter to each PC, considering only the absolute values of the loadings.

Figure 12 original features matter to each PC

Heat map of features and PCs

 We compare how the original features influence the various PCs.

Figure 13 Heat map of features and PCs


original scaled features

Figure 14 original scaled features

Transformed scaled data -

Figure 15 fit_transformed scaled data


Now we check for the presence of correlations among the PCs using a heat map.

There is no correlation between the PCs.

Figure 16 heat map of pcs
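
A minimal sketch of performing PCA with scikit-learn and exporting the Principal Component scores to a data frame, assuming scaled_df and n_components from the earlier sketches; the output file name pc_scores.csv is only illustrative:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=n_components)
pc_scores = pca.fit_transform(scaled_df)

# Export the Principal Component scores into a data frame (and optionally to CSV)
pc_df = pd.DataFrame(pc_scores, columns=[f"PC{i+1}" for i in range(n_components)])
pc_df.to_csv("pc_scores.csv", index=False)

# The PCs are orthogonal, so their correlation heat map is essentially an identity matrix
sns.heatmap(pc_df.corr(), annot=True, fmt=".2f")
plt.show()
```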


CLUSTERING
PROBLEM STATEMENT

The State_wise_Health_income.csv dataset is about the Health and Economic conditions in the
different States of a country. The goal is to group the States based on how similar their situations are,
so that these groups can be provided to the government and appropriate measures can be taken to
improve their Health and Economic conditions.

2.1. Read the data and do exploratory data analysis. Describe the data briefly (check the null values,
data types, shape, EDA, etc.).
2.2. Do you think scaling is necessary for clustering in this case? Justify.
2.3. Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using a
dendrogram and briefly describe them.
2.4. Apply K-Means clustering on scaled data and determine the optimum clusters. Apply the elbow
curve and find the silhouette score.
2.5. Describe cluster profiles for the clusters defined. Recommend different priority-based actions that
need to be taken for different clusters on the basis of their vulnerability situations according to their
Economic and Health conditions.

Data Dictionary for State_wise_Health_income Dataset:

1. States: names of States

2. Health_indeces1: A composite index rolls several related measures (indicators) into a single
score that provides a summary of how the health system is performing in the State.

3. Health_indeces2: A composite index rolls several related measures (indicators) into a single
score that provides a summary of how the health system is performing in certain areas of the
States.

4. Per_capita_income-Per capita income (PCI) measures the average income earned per person
in a given area (city, region, country, etc.) in a specified year. It is calculated by dividing the
area's total income by its total population.

5. GDP: GDP provides an economic snapshot of a country/state, used to estimate the size of an
economy and growth rate.
Descriptive statistics
 Shape: the dataset contains 297 rows and 6 columns.
 The top five rows of the data frame are shown below.

Table 13 TOP FIVE ROWS OF THE DATA FRAME

 In our dataset five columns are of integer type and one column is of object type.

Table 14 INFO OF DATASET

 The five-point summary (minimum, 25%, 50%, 75%, maximum) of all the features is available,
along with the mean, median and standard deviation of each feature, as shown in the descriptive
statistics below.

Table 15 DESCRIPTIVE STATISTIC OF DATA

 There are no duplicate rows present in the data.


 There are no missing values present in the dataset.
EXPLORATORY DATA ANALYSIS-

UNIVARIATE ANALYSIS-

 In our data frame, two features have outliers.


 BOXPLOTS AND HISTPLOTS OF THE FEATURES.

Figure 17 BOXPLOTS AND HISTPLOTS OF THE FEATURES.


Skewness –

 GDP = 0.829
 Per_capita_income = 0.823
 Health_indeces1 = 0.715
 Health_indeces2 = -0.173

MULTIVARIATE ANALYSIS -

 PAIRPLOT-

Figure 18 pair plot


 HEAT MAP -

Figure 19 heat map

 Some features are correlated with each other:


 Health_indeces1 and GDP = 0.91
 Health_indeces2 and GDP = 0.87
 Health_indeces1 and Health_indeces2 = 0.87
 Health_indeces2 and Per_capita_income = 0.81
 Health_indeces1 and Per_capita_income = 0.67
Outlier detection -

Figure 20 boxplot with outliers

 Outliers are present in two features:


 Health_indeces1
 Per_capita_income

Since outliers are present, we have to treat them.

 Then we treat the outliers with the help of the interquartile range (IQR).


 We define a function which returns the upper and lower limits used to detect outliers for each
feature.
 Cap and floor the values beyond the outlier boundaries.

Check to verify if outliers have been treated.

Figure 21 boxplot after outlier treatment

SCALING –

 For scaling we first drop the 'Unnamed: 0' and 'States' features before scaling the numeric
values, as they do not add any value in model building.
 We use min-max scaling here (a code sketch is given after Table 17 below).
 Min-max scaling rescales each feature to the [0, 1] range.
 Scaling is needed because clustering is a distance-based technique that uses Euclidean distance.
 We build a scaler object using MinMaxScaler.
 We then fit the scaler to the dataset and transform it.
 The top five rows of the scaled data are shown below.

Table 16 top five rows of scaled data

 Descriptive statistics of scaled data-

Table 17 Descriptive statistic of scaled data
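
A minimal sketch of the scaling step for the clustering data, using the file name and dropped columns mentioned above:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

states = pd.read_csv("State_wise_Health_income.csv")

# Drop the identifier columns; they add no value for clustering
X = states.drop(columns=["Unnamed: 0", "States"])

scaler = MinMaxScaler()                              # rescales each feature to [0, 1]
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

print(X_scaled.head())          # Table 16: top five rows of the scaled data
print(X_scaled.describe().T)    # Table 17: every feature now lies between 0 and 1
```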


HIERARCHICAL CLUSTERING –

 We use scipy.cluster.hierarchy here and import dendrogram and linkage (a code sketch is given
at the end of this section).


 In average linkage clustering, the distance between two clusters is defined as the average of the
distances between all pairs of objects, where each pair is made up of one object from each
group.
 By using the ward linkage method, we get the output of the hierarchical clustering in the form
of a dendrogram.
 The ward linkage method is used to minimize the variance within clusters in a hierarchical
approach. Maximum (complete) linkage uses the maximum distance between the clusters' data
points, and average linkage uses the average distance between the clusters' data points.

Table 18 dendrogram

 The mean values under the two linkage methods differ, with a lot of variation in the cluster
frequencies.
 We prefer ward linkage here because it performs significantly better.
 By setting p equal to 10 and truncate_mode equal to 'lastp', we get the dendrogram below
showing the last 10 merged clusters.
Table 19 dendrogram

 Using method 1, in which we apply ward linkage with criterion equal to 'maxclust', we get the
array of cluster labels below.

Table 20 Shows the clusters of all Rows in form of array.

 In method 2 we apply ward linkage with a distance value of 23 and criterion equal to 'distance',
which gives the array below.
Table 21 Top 5 values of clusters

 Now the data frame contains a cluster column, and the table above shows the top five rows.
 The three-cluster solution gives a pattern based on high, medium and low GDP per capita.
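
A minimal sketch of the hierarchical clustering steps above, assuming X_scaled and states come from the previous sketch; the column name H_Cluster and the choice of 3 clusters for the 'maxclust' cut are illustrative, following the three-group solution described above:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

wardlink = linkage(X_scaled, method="ward")   # ward linkage on the scaled data

# Truncated dendrogram showing only the last 10 merged clusters (p=10, 'lastp')
dendrogram(wardlink, truncate_mode="lastp", p=10)
plt.show()

# Method 1: cut the tree into a fixed number of clusters
clusters_max = fcluster(wardlink, 3, criterion="maxclust")

# Method 2: cut the tree at a distance threshold of 23
clusters_dist = fcluster(wardlink, 23, criterion="distance")

states["H_Cluster"] = clusters_max            # attach the labels to the original data
print(states.head())
```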

K-MEANS CLUSTERING – K-means clustering is an iterative method of partitioning the data into K
predefined, distinct, non-overlapping subgroups, also known as clusters, in which each data point
belongs to a single group. Data points within a cluster are kept as similar as possible, while the
distance between different clusters is kept as large as possible.

Working steps of k-means algorithm-

 Specify number of clusters K.


 Initialize centroids by first shuffling the dataset and then randomly selecting K data points for
the centroids without replacement.
 Keep iterating until there is no change to the centroids, i.e. assignment of data points to
clusters isn’t changing.
 Compute the sum of the squared distance between data points and all centroids.
 Assign each data point to the closest cluster (centroid).
 Compute the centroids for the clusters by taking the average of all the data points that belong
to each cluster.
 After applying k-means we get the array of cluster labels (a code sketch is given after the
insights below).

Table 22 labels array

 K means inertia = 27.39819099491779


 K means inertia (n clusters = 3, after fitting data) = 14.862515314114027
 K means inertia (n clusters = 4, after fitting data) = 10.351754230441658
 k means inertia (n clusters = 1, after fitting data) = 75.27232015177754
 k means inertia (n clusters = 5, after fitting data) = 8.312798417750045
 k means inertia (n clusters = 6, after fitting data) = 6.810758830310194
Table 23 wss

Table 24 scree plot


Table 25 final output label of k mean cluster in our data set

 Insights – From the above graph, the optimal number of clusters is either 3 or 4. We will go
forward with 3 clusters.
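
A minimal sketch of the elbow computation and the final 3-cluster fit, assuming X_scaled and states from the earlier sketches; random_state=1 and the column name K_Cluster are illustrative choices:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, random_state=1, n_init=10).fit(X_scaled)
    wss.append(km.inertia_)          # within-cluster sum of squares for each k

# Elbow curve: the drop in WSS flattens around k = 3 or 4
plt.plot(range(1, 7), wss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WSS (inertia)")
plt.show()

# Final model with the chosen k = 3
labels = KMeans(n_clusters=3, random_state=1, n_init=10).fit_predict(X_scaled)
states["K_Cluster"] = labels
```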

Silhouette Method – For each data point we compute the silhouette coefficient, which measures how
close the point is to its own cluster compared with the other clusters.

 Silhouette score is 0.54068

Table 26 silhouette values


Now, we add the silhouette width to the K-Means dataset. Silhouette width is a measure between -1
and +1, with values close to +1 indicating a very good cluster assignment (a code sketch is given below).
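
A short sketch of the silhouette computation, assuming X_scaled, labels and states from the K-means sketch above; the column name sil_width is illustrative:

```python
from sklearn.metrics import silhouette_score, silhouette_samples

# Overall silhouette score of the 3-cluster solution (reported above as ~0.54)
print(silhouette_score(X_scaled, labels))

# Silhouette width of every observation, between -1 and +1
states["sil_width"] = silhouette_samples(X_scaled, labels)
print(states["sil_width"].min())   # a positive minimum means no point is badly assigned
```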

 Observations – Based on the above cluster solutions, the 3-cluster solution seems to be the best
fit, as it differentiates the three clusters as:
 High GDP per capita area
 Medium GDP per capita area
 Low GDP per capita area

Cluster Group Profiles-

Cluster 1: High GDP per capita Areas

- These are the areas which have the highest growth rate.

- The health and economic conditions in these areas are excellent.

- Per capita income in these areas is very high.

Cluster 2: Low GDP per capita Areas

- These are the areas which have very low growth rate.

- The health and economic conditions are not good in these areas.

- Per capita income in these areas is very low.

Cluster 3: Medium GDP per capita Areas

- These are the areas which have an average growth rate.

- The health and economic conditions in these areas are adequate.

- Per capita income in these areas is average.

Recommendations for each cluster profile:

The main features that affect the Health and Economic conditions are workforce and productivity. The
higher these attributes, the higher the GDP per capita, and thus the better the Health and Economic
conditions.
