Graded Project
DATA MINING
2. Scale the variables and write the inference for using the type of scaling function for this case
study.
3. Comment on the comparison between covariance and the correlation matrix after scaling.
4. Check the dataset for outliers before and after scaling. Draw your inferences from this
exercise.
6. Write the explicit form of the first PC (in terms of Eigen Vectors).
7. Discuss the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate? Perform PCA
and export the data of the Principal Component scores into a data frame.
8. Mention the business implication of using the Principal Component Analysis for this case
study.
9. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, data types, shape, EDA, etc.)
10. Do you think scaling is necessary for clustering in this case? Justify.
11. Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them.
12. Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and find the silhouette score.
13. Describe cluster profiles for the clusters defined. Recommend different priority-based
actions that need to be taken for different clusters on the basis of their vulnerability situations
according to their Economic and Health Conditions.
DATA DICTIONARY-
PCA:
Problem statement:
The ‘Hair Salon.csv’ dataset contains various variables used for the
context of Market Segmentation. This particular case study is based
on various parameters of a salon chain of hair products. You are
expected to do Principal Component Analysis for this case study
according to the instructions given in the rubric.
Dataset for Part 1: Hair Salon.csv
Descriptive statistics
Shape: the dataset has 100 rows and 13 columns.
The top five rows of the data frame are as follows.
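The checks listed in the rubric (shape, data types, null values, first rows) could be run along the following lines. This is a minimal sketch, assuming pandas is available; the inline CSV, its values, and the helper name summarise are made up for illustration and stand in for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV: three rows with made-up values.
SAMPLE_CSV = """ID,Ecom,TechSup,CompRes
1,3.9,2.5,5.9
2,2.7,5.1,7.2
3,3.4,5.6,
"""

def summarise(df):
    """Collect the basic facts an EDA section usually reports."""
    return {
        "shape": df.shape,                          # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),  # data type per column
        "nulls": df.isnull().sum().to_dict(),       # missing values per column
        "duplicates": int(df.duplicated().sum()),   # fully duplicated rows
    }

df = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = summarise(df)
print(summary)
print(df.head())        # top five rows (only three exist in this sample)
print(df.describe().T)  # count, mean, std and quartiles per column
```

For the actual dataset, the io.StringIO argument would be replaced by the path to the CSV file.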
UNIVARIATE ANALYSIS-
PAIRPLOT
Figure 2 PAIRPLOT
HEATMAP
Figure 3 heatmap
Some of the independent features are correlated with each other.
Correlation between features:
TechSup and CompRes have a correlation of 0.87
OrdBilling and CompRes have a correlation of 0.76
SalesFImage and Ecom have a correlation of 0.79
TechSup and Wartyclaim have a correlation of 0.80
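A list of strongly correlated pairs like the one above can be pulled out of the correlation matrix programmatically rather than read off the heatmap. The sketch below uses NumPy only and synthetic data; the function name strong_pairs and the 0.75 threshold are assumptions chosen for illustration.

```python
import numpy as np

def strong_pairs(X, names, threshold=0.75):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)      # columns are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):   # upper triangle only, no self-pairs
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], round(float(corr[i, j]), 2)))
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Synthetic data: b tracks a closely, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.3 * rng.normal(size=200)
c = rng.normal(size=200)
X = np.column_stack([a, b, c])

pairs = strong_pairs(X, ["a", "b", "c"])
print(pairs)
```

With a pandas data frame, df.corr() returns the same matrix that np.corrcoef computes here.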
SCALING –
For scaling, we first drop the ID feature before scaling the numeric values, as it will not
add any value in model building.
We use z-score scaling.
After z-score scaling, the mean of each feature is zero.
The standard deviation of each feature is one.
It is a weight-based technique.
The top five rows of the scaled data are as follows.
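The effect of z-score scaling described above can be checked on a small example. This is a sketch using NumPy with made-up values; it uses the population standard deviation (ddof=0), which is an assumption about the scaler used in the project.

```python
import numpy as np

def zscore(X):
    """Column-wise z-score: subtract the mean, divide by the standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)  # population std (ddof=0)

# Hypothetical features on two very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0],
              [6.0, 400.0]])
Z = zscore(X)
print(Z.mean(axis=0))  # approximately 0 for every column
print(Z.std(axis=0))   # approximately 1 for every column
```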
Correlation matrix-
A correlation matrix gives a snapshot, in tabular form, of the pairwise relationships among
more than two variables.
It shows how the independent features are correlated with each other.
Table 7 CORRELATION
Table 8 CORRELATION MATRIX
Covariance matrix-
The covariance matrix describes the relationship between every pair of features in the data,
which helps to identify collinear variables.
A positive covariance indicates a direct relationship (the two variables increase or decrease
together), whereas a negative covariance indicates an inverse relationship. If two variables do
not vary together, their covariance is zero.
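The comparison between the covariance and correlation matrices after scaling can be verified numerically: once every feature has unit variance, the covariance matrix of the scaled data matches the correlation matrix of the original data. The sketch below uses synthetic data and assumes the population convention (ddof=0, bias=True) throughout; with the sample convention the two differ by a factor of n/(n-1).

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic features on different scales; the second depends on the first.
x = rng.normal(5.0, 1.0, size=100)
y = 50.0 * x + rng.normal(0.0, 20.0, size=100)
X = np.column_stack([x, y])

Z = (X - X.mean(axis=0)) / X.std(axis=0)         # z-score with population std

cov_raw = np.cov(X, rowvar=False, bias=True)     # dominated by the large-scale feature
cov_scaled = np.cov(Z, rowvar=False, bias=True)  # covariance after scaling
corr = np.corrcoef(X, rowvar=False)              # correlation of the raw data

print(np.round(cov_raw, 2))
print(np.round(cov_scaled, 3))
print(np.round(corr, 3))
```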
[Boxplots: Ecom, SalesFImage, OrdBilling, DelSpeed]
We use z-score scaling, under which the mean of each feature becomes zero, as the boxplots
show.
Scaling changes only the scale of the boxplots, so the outliers seen before scaling remain after it.
After scaling, the boxplots are centred on zero and line up on a common scale.
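The outlier check before and after scaling can also be done numerically. The sketch below applies the usual boxplot rule (1.5 times the interquartile range beyond the quartiles) to a made-up feature; the function name iqr_outlier_mask and the values are illustrative assumptions.

```python
import numpy as np

def iqr_outlier_mask(x):
    """Flag points more than 1.5 * IQR outside the quartiles (the boxplot rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Hypothetical feature with one low and one high extreme value.
x = np.array([4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 15.0, -3.0])
z = (x - x.mean()) / x.std()          # z-score scaled copy

mask_raw = iqr_outlier_mask(x)
mask_scaled = iqr_outlier_mask(z)
print(x[mask_raw])                    # values flagged before scaling
print(int(mask_raw.sum()), int(mask_scaled.sum()))
```

Because z-score scaling is a linear transformation, the same points are flagged in both cases.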
We use the explained variance ratio to find a cut-off for selecting the number of PCs.
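One way to turn the variance ratio into a cut-off is to take the smallest number of PCs whose cumulative ratio reaches a chosen threshold. The sketch below does PCA by eigen-decomposition with NumPy on synthetic data; the 0.80 threshold and the helper names pca_eig and n_components_for are assumptions, not values taken from the project.

```python
import numpy as np

def pca_eig(X):
    """PCA via eigen-decomposition of the covariance matrix of z-scored data."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    cov = np.cov(Z, rowvar=False, bias=True)
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = eigvals / eigvals.sum()          # explained variance ratio
    scores = Z @ eigvecs                     # PC scores, one column per PC
    return eigvals, eigvecs, ratio, scores

def n_components_for(ratio, cutoff=0.80):
    """Smallest number of PCs whose cumulative variance ratio reaches the cutoff."""
    return int(np.searchsorted(np.cumsum(ratio), cutoff) + 1)

# Synthetic data: four features driven by two underlying factors plus noise.
rng = np.random.default_rng(2)
f1, f2 = rng.normal(size=(2, 300))
X = np.column_stack([f1, f1, f2, f2]) + 0.1 * rng.normal(size=(300, 4))

eigvals, eigvecs, ratio, scores = pca_eig(X)
k = n_components_for(ratio)
print(np.round(np.cumsum(ratio), 3), k)
# Explicit form of the first PC in terms of its eigenvector entries.
print(" + ".join(f"{w:.2f}*x{i + 1}" for i, w in enumerate(eigvecs[:, 0])))
```

The scores array is what would be exported to a data frame as the Principal Component scores.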
The State_wise_Health_income.csv dataset describes the health and economic conditions in
different States of a country. Group the States based on how similar their situations are, so that
these groups can be provided to the government and appropriate measures can be taken to improve
their health and economic conditions.
2.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, data types, shape, EDA, etc.)
2.2. Do you think scaling is necessary for clustering in this case? Justify.
2.3. Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them.
2.4. Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and find the silhouette score.
2.5. Describe cluster profiles for the clusters defined. Recommend different priority-based
actions that need to be taken for different clusters on the basis of their vulnerability situations
according to their Economic and Health Conditions.
2. Health_indeces1: A composite index rolls several related measures (indicators) into a single
score that provides a summary of how the health system is performing in the State.
3. Health_indeces2: A composite index rolls several related measures (indicators) into a single
score that provides a summary of how the health system is performing in certain areas of the
States.
4. Per_capita_income: Per capita income (PCI) measures the average income earned per person
in a given area (city, region, country, etc.) in a specified year. It is calculated by dividing the
area's total income by its total population.
5. GDP: GDP provides an economic snapshot of a country/state, used to estimate the size of an
economy and growth rate.
Descriptive statistics
Shape: the dataset has 297 rows and 6 columns.
The top five rows of the data frame are as follows.
UNIVARIATE ANALYSIS-
GDP = 0.829
Per_capita_income = 0.823
Health_indeces1 = 0.715
Health_indeces2 = -0.173
MULTIVARIATE ANALYSIS -
PAIRPLOT-
SCALING –
For scaling, we first drop the 'Unnamed: 0' and 'States' features before scaling the numeric
values, as they will not add any value in model building.
We use min-max scaling.
Unlike z-score scaling, which gives each feature a mean of zero and a standard deviation of one,
min-max scaling rescales each feature to the range 0 to 1.
Clustering is a distance-based technique, and we use Euclidean distance here, so the features
need to be on a comparable scale.
We build a scaler model using the min-max scaler, fit it to the dataset, and transform the data.
The top five rows of the scaled data are as follows.
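The min-max transformation itself is short enough to write out. The sketch below uses NumPy and made-up rows; it mirrors what a min-max scaler with the default 0 to 1 range does, and the function name min_max_scale is an illustrative assumption.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column to [0, 1] using (x - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

# Hypothetical rows: an index-like column and an income-like column.
X = np.array([[400.0, 1000.0],
              [500.0, 3000.0],
              [650.0, 9000.0]])
S = min_max_scale(X)
print(S)
```

After scaling, both columns contribute on the same 0 to 1 scale to any Euclidean distance computed between rows.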
Table 18 dendrogram
The mean values under the two linkage methods differ, with a lot of variation in the frequency
of cluster 4.
We prefer Ward linkage here because it performs significantly better.
Setting p equal to 10, with the truncation mode set to show only the last p merged clusters,
gives the dendrogram below with 10 clusters.
Table 19 dendrogram
In method 1, we use Ward linkage with the criterion set to max-cluster, which gives the array
below.
In method 2, we use Ward linkage with a threshold value of 23 and the criterion set to
distance, which gives the array below.
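The two ways of cutting the tree described above can be sketched with SciPy's hierarchical clustering functions. The data here is synthetic, and the distance threshold of 10 is chosen for this toy data only; it is not the value of 23 used on the project data.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(3)
# Synthetic stand-in for the scaled data: three well-separated groups of 20 points.
centres = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + 0.3 * rng.normal(size=(20, 2)) for c in centres])

Z = linkage(X, method="ward")   # Ward linkage on Euclidean distances

# Truncated dendrogram: keep only the last p = 10 merged clusters.
# no_plot=True returns the layout data without drawing anything.
tree = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)

# Method 1: ask for a fixed maximum number of clusters.
labels_max = fcluster(Z, t=3, criterion="maxclust")

# Method 2: cut the tree at a distance threshold (toy value, see lead-in).
labels_dist = fcluster(Z, t=10.0, criterion="distance")

print(np.bincount(labels_max)[1:])    # cluster sizes (labels start at 1)
print(np.bincount(labels_dist)[1:])
```

Either label array can then be added as a new column to the data frame.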
Table 21 Top 5 values of clusters
The data frame now includes the cluster labels, and the table above shows its first five
rows.
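Once each row carries a cluster label, a cluster profile is just the size and the per-feature mean of each group. The sketch below uses NumPy with made-up rows and labels; the function name cluster_profile and all values are illustrative assumptions.

```python
import numpy as np

def cluster_profile(X, labels):
    """Return {label: (size, per-feature mean)} for each cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    return {
        int(k): (int((labels == k).sum()), X[labels == k].mean(axis=0))
        for k in np.unique(labels)
    }

# Hypothetical unscaled rows (per capita income, GDP) with cluster labels.
X = np.array([[900.0, 9000.0],
              [1100.0, 11000.0],
              [4000.0, 52000.0],
              [6000.0, 48000.0],
              [2000.0, 20000.0]])
labels = np.array([1, 1, 3, 3, 2])

profile = cluster_profile(X, labels)
for k, (size, means) in profile.items():
    print(k, size, means)
```

With a pandas data frame, grouping by the cluster column and taking the mean does the same job. Profiles are easier to interpret when computed on the unscaled values.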
The three-cluster solution gives a pattern based on high, medium, and low GDP per capita.
K-MEANS CLUSTERING - This is an iterative method of partitioning the data into K predefined,
distinct, non-overlapping subgroups, also known as clusters. Each data point belongs to exactly
one cluster. Points within a cluster are as similar as possible, while different clusters are as
far apart as possible.
Insights- From the above graph, the optimal number of clusters is either 3 or 4. We go
forward with 3 clusters.
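The values behind an elbow curve are the within-cluster sums of squares (WCSS) for a range of K. The sketch below is a from-scratch NumPy version of the standard K-Means iteration on synthetic data, not the library routine used in the project; the farthest-point initialisation is a simple deterministic stand-in chosen here so the example is reproducible.

```python
import numpy as np

def kmeans_wcss(X, k, n_iter=100):
    """Lloyd's algorithm with farthest-point initialisation; returns (labels, WCSS)."""
    # Deterministic start: the first point, then repeatedly the point
    # farthest from all centres chosen so far.
    centres = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(d)])
    centres = np.array(centres)

    for _ in range(n_iter):
        # Assignment step: nearest centre by Euclidean distance.
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: move each centre to the mean of its points
        # (an empty cluster keeps its previous centre).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(k)])
        converged = np.allclose(new, centres)
        centres = new
        if converged:
            break
    wcss = float(((X - centres[labels]) ** 2).sum())
    return labels, wcss

# Synthetic data: three well-separated groups of 30 points.
rng = np.random.default_rng(4)
true_centres = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(30, 2)) for c in true_centres])

wcss = [kmeans_wcss(X, k)[1] for k in range(1, 7)]
for k, w in enumerate(wcss, start=1):
    print(k, round(w, 1))
```

Plotting these values against K gives the elbow curve; the elbow is the K after which the drop in WCSS flattens out.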
Silhouette Method- Here we compute the silhouette coefficient for each data point. It
measures how close the point is to its own cluster compared with the other clusters.
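For a single point, the silhouette coefficient is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below is a from-scratch NumPy version on synthetic data, written for illustration only; it assumes at least two clusters and scores singleton clusters as 0.

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean silhouette coefficient over all points (needs at least two clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue                               # singleton cluster: score stays 0
        a = dist[i, own].sum() / (n_own - 1)       # mean distance within own cluster
        b = min(dist[i, labels == c].mean()        # mean distance to nearest other cluster
                for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Synthetic data: three well-separated groups of 30 points.
rng = np.random.default_rng(5)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(30, 2)) for c in centres])

good = np.repeat([0, 1, 2], 30)   # one label per true group
bad = np.tile([0, 1, 2], 30)      # labels that ignore the groups
score_good = silhouette_score(X, good)
score_bad = silhouette_score(X, bad)
print(round(score_good, 3), round(score_bad, 3))
```

A score close to 1 indicates compact, well-separated clusters, while a score near 0 or below indicates overlapping or mis-assigned clusters.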
Observations - Based on the above cluster solutions, the 3-cluster solution seems to be the
best fit, as it separates the data into the following 3 clusters:
High GDP per capita area
Medium GDP per capita area
Low GDP per capita area
- These are the areas which have the highest growth rate.
- These are the areas which have very low growth rate.
- The health and economic conditions are not good in these areas.
The main features that affect the Health and Economic conditions are workforce and productivity.
The higher these attributes, the higher the GDP per capita, and thus the better the Health and
Economic conditions.