Machine Learning Guided Project
Contents:
- Apply PCA with the desired number of components = 2
- Create a new DataFrame with the PCA results
Question 1: Define the problem and perform an Exploratory Data Analysis
- Problem definition, questions to be answered
- Data background and contents
- Univariate analysis
- Bivariate analysis
- Insights based on EDA
Sl_No and Customer_Key can be dropped as they are not needed for model training
There is no target variable here, so we need to build an unsupervised learning model
Avg_Credit_Limit has a much higher magnitude than the other columns, so we need to scale the features
All columns are of int datatype
No nulls observed
Ignore the Sl_No and Customer_Key columns
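The preprocessing described above (drop the ID columns, scale everything) could be sketched as follows; the exact column names are taken from the report, but the function signature is an assumption:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop ID columns and standardize the remaining features."""
    # Sl_No and Customer Key carry no signal for clustering
    features = df.drop(columns=["Sl_No", "Customer Key"], errors="ignore")
    # Avg_Credit_Limit dominates in magnitude, so scale every column
    scaled = StandardScaler().fit_transform(features)
    return pd.DataFrame(scaled, columns=features.columns)
```

Scaling matters here because K-means uses Euclidean distance: without it, Avg_Credit_Limit (thousands) would dwarf visit counts (single digits).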
"Avg_Credit_Limit" ranges from 3K to 200K with an average of 34.5K, but 50% of customers have less than 18K, so it is heavily right skewed. It has a standard deviation of 34K
"Total_Credit_Cards" ranges from 1 to 10 cards with an average of ~5, and 50% of customers have fewer than 5. It has a standard deviation of ~5
"Total_visits_bank" ranges from 0 to 5 visits with an average of ~2, and 50% of customers have fewer than 2. It has a standard deviation of ~2
"Total_visits_online" ranges from 0 to 15 visits with an average of ~3, and 50% of customers have fewer than 2; 75% of customers have fewer than 4 online visits, so it is heavily right skewed. It has a standard deviation of ~3
"Total_calls_made" ranges from 0 to 10 calls with an average of ~4, and 50% of customers made fewer than 3, so it is slightly right skewed. It has a standard deviation of ~3
No null or NA values in the data
No duplicates found based on serial number
However, 5 sets of rows share the same customer key while differing in every other column, so it is unclear whether these are true duplicates
For now, no duplicate rows are deleted
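The duplicate check described above could look like this; the "Customer Key" column name follows the report, and the helper itself is a sketch:

```python
import pandas as pd

def customer_key_duplicates(df: pd.DataFrame):
    """Count exact duplicate rows and list rows that repeat only the key."""
    # Exact duplicate rows across all columns (none were found in the report)
    exact = int(df.duplicated().sum())
    # Rows that share a Customer Key but may differ in other columns
    key_dups = df[df.duplicated(subset=["Customer Key"], keep=False)]
    return exact, key_dups
```

Keeping `keep=False` returns every member of each duplicated-key group, which is what lets us inspect whether the other columns actually match.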
Univariate Analysis:
Avg credit limit is heavily right skewed
There are more customers with lower avg credit limits (<20K)
Credit cards, online visits, and total calls have multiple peaks
The majority of customers have 4 credit cards
The majority of customers visited the bank twice
The majority of customers visited online twice
The majority of customers made <=4 calls to the bank
Avg credit limit has a lot of outliers
Online visits also has outliers
The majority of customers have 4 credit cards, followed by 6 and 7; very few have >=8 cards
~160 customers visited the bank twice
The majority of customers visited online twice, and most customers made <=5 online visits
The majority of customers made <=4 calls to the bank
Bivariate Analysis:
One interesting observation: calls made to the bank decrease as the number of credit cards increases, and customers with fewer credit cards (1-3) call the bank frequently
Visits to the bank are higher for customers with 4 to 7 credit cards
Online visits are higher for customers with more credit cards (8 to 10)
There are more customers below a 10K avg credit limit
22% of customers have 4 credit cards and ~3% have 10 credit cards
24% of customers visited the bank twice and 15% never visited the bank; 85% visited at least once
~22% of customers never used online banking; 78% visited online banking at least once
~15% of customers never called the bank; 85% called at least once
Analyzing customers using different modes of communication:
7 customers never contacted the bank through calls or branch visits and used only online banking
These customers have more credit cards and more online visits
30 customers never contacted the bank through calls and never used online banking, but visited the bank
We need the demographics of these customers to know why they are not using call or online services and prefer only visiting the bank
0 customers contacted the bank only through calls; every caller also visited the bank or used online banking
60 customers never contacted the bank through calls, but visited the bank and used online banking
93 customers never visited the bank, but contacted through calls and used online banking
114 customers never used online banking, but contacted the bank through calls and visited the bank
We can target these customers to use online banking services, but we should also find the reason behind their not using online services. We do not have information on customers' age, educational background, or region (whether internet services are available)
356 customers used all 3 banking options (calls / online / visits), which is more than half of our customer base
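The channel counts above can be reproduced with boolean masks over the visit and call columns; the column names follow the report, and the grouping below is a sketch (it covers four of the seven combinations discussed):

```python
import pandas as pd

def channel_segments(df: pd.DataFrame) -> dict:
    """Count customers by which contact channels they used at least once."""
    calls = df["Total_calls_made"] > 0
    bank = df["Total_visits_bank"] > 0
    online = df["Total_visits_online"] > 0
    return {
        "online_only": int((online & ~calls & ~bank).sum()),
        "bank_only": int((bank & ~calls & ~online).sum()),
        "calls_only": int((calls & ~bank & ~online).sum()),
        "all_three": int((calls & bank & online).sum()),
    }
```

On the actual dataset, "calls_only" would come out as 0, matching the observation above.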
After treatment, no outliers are observed
From the above "Elbow Method" graph, we can find the possible number of clusters; the elbow bend is at 3
We can choose K=3 or K=4
At K=3 we have an inertia of 50.42, and at K=4 we have 34.23
We will decide K based on the silhouette score
This visualization is much richer than the previous one: it confirms that k = 3 is a very good choice, and it also shows that while k = 4 is not quite as good, it is still much better than k = 6, 7, 8, 9... This was not visible when comparing inertias
At K=3, we have a silhouette score of 0.53
We will take K=3
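The elbow/silhouette sweep described above can be sketched like this; the data here is synthetic (three well-separated blobs), not the report's dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sweep_k(X, k_values):
    """Fit K-means for each k; collect (inertia, silhouette score)."""
    results = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        # Inertia always falls as k grows; silhouette peaks at the "right" k
        results[k] = (km.inertia_, silhouette_score(X, km.labels_))
    return results
```

Plotting inertia against k gives the elbow curve; plotting the silhouette scores gives the richer comparison discussed above.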
The vertical dashed lines represent the silhouette score for each number of clusters. When most of the instances in a cluster have a lower coefficient than this score (i.e., when many instances stop short of the dashed line, ending to the left of it), the cluster is rather bad, since its instances are much too close to other clusters
When k = 3, the clusters look good: most instances extend beyond the dashed line, to the right and closer to 1.0
Cluster Analysis:
Cluster 0: Customers with a medium credit limit (medium number of credit cards), few calls made, more bank visits, and very few online visits
Cluster 1: Customers with a lower total credit limit (fewer credit cards), more calls to the bank, medium bank visits, and medium online visits
Cluster 2: Customers with a higher total credit limit (higher number of credit cards), fewer calls to the bank, very few bank visits, and the highest online visits
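Profiles like the three above are typically built by attaching the cluster labels to the original (unscaled) features and averaging per cluster; this helper is a sketch of that step:

```python
import pandas as pd

def profile_clusters(df: pd.DataFrame, labels) -> pd.DataFrame:
    """Mean of each feature per cluster, plus cluster sizes."""
    profiled = df.assign(cluster=labels)
    summary = profiled.groupby("cluster").mean()
    # Cluster sizes help judge whether a segment is worth targeting
    summary["size"] = profiled["cluster"].value_counts().sort_index()
    return summary
```

Profiling on the unscaled columns keeps the means in interpretable units (credit limit in currency, visits in counts).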
Question 4: Applying Hierarchical Clustering
- Apply Hierarchical clustering with different linkage methods
- Plot dendrograms for each linkage method
- Figure out the appropriate number of clusters
- Cluster Profiling
Answer:
Cluster 0: Customers with a low credit limit, fewer credit cards, medium bank visits, medium online visits, and the highest calls to the bank
Cluster 1: Customers with the highest avg credit limit, the highest number of credit cards, very few bank visits, the highest online visits, and fewer calls to the bank
Cluster 2: Customers with a medium credit limit, a medium number of credit cards (~5-6), the highest average bank visits, very few online visits, and medium calls to the bank
Here the data is distributed into 3 clusters: Cluster 0 has 218 data points, Cluster 1 has 50 data points, and Cluster 2 has 392 data points
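The hierarchical step above (linkage, dendrogram, cut into flat clusters) can be sketched with SciPy; the blob data here is synthetic:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def hierarchical_labels(X, n_clusters=3, method="average"):
    """Cut an average-linkage dendrogram into n_clusters flat clusters."""
    Z = linkage(X, method=method)   # Z is also what dendrogram(Z) would plot
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

Trying `method="single"`, `"complete"`, `"average"`, and `"ward"` and plotting each `dendrogram(Z)` is how the different linkage methods in the question are compared.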
Question 5: K-means vs Hierarchical Clustering
- Compare clusters obtained from K-means and Hierarchical clustering techniques
Answer: Hierarchical clustering can't handle big data well, but K-means clustering can. This is because the time complexity of K-means is linear, i.e. O(n), while that of hierarchical clustering is quadratic, i.e. O(n^2)
In K-means clustering, since we start with a random choice of centroids, the results produced by running the algorithm multiple times might differ, while results are reproducible in hierarchical clustering
K-means is found to work well when the shape of the clusters is hyperspherical (like a circle in 2D or a sphere in 3D)
K-means clustering requires prior knowledge of K, i.e. the number of clusters you want to divide your data into. In hierarchical clustering, you can stop at whatever number of clusters you find appropriate by interpreting the dendrogram
With a large number of variables, K-means computes faster
The result of K-means is unstructured, but that of hierarchical clustering is more interpretable and informative
It is easier to determine the number of clusters from hierarchical clustering's dendrogram
Customers with a high credit limit tend to visit online. Hence they can be targeted for online campaigns and coupons, and products and services can be offered to them accordingly
Customers with a comparatively low credit limit visit the bank more often; they can either be shown the benefits of online banking or be catered to with in-bank offers, services, and flyers
Customers with low credit limits use the online platform less frequently; they can be marketed the benefits of online banking, and customer call-center reps can be made aware of promotions and offers so that they can target this segment
Based on how the bank wants to promote its products and services, a segment of customers can be targeted, as we know their preferred mode of communication with the bank
Answer: By reducing the original 4 dimensions to 3, we are able to explain 95% of the variance
With K=3, the cluster error (inertia) is reduced from 50.42 (without PCA, with 4 columns) to 46.29 (with PCA, with 3 columns)
We can choose K=3 or K=4
At K=3 we have an inertia of 46.29
With K=3, the silhouette score increased from 0.53 (without PCA) to 0.73 (with PCA)
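The "95% variance retained" figure above comes from summing the explained-variance ratios; this sketch shows the computation on synthetic data built from two latent factors (the report's actual percentages come from its own dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

def variance_retained(X, n_components):
    """Cumulative explained-variance ratio of the first n components."""
    pca = PCA(n_components=n_components).fit(X)
    return pca.explained_variance_ratio_.sum()
```

Sweeping `n_components` and stopping once this value crosses the chosen threshold (here 0.95) is the standard way to pick the reduced dimension.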
Question 8: Interpretation of Principal Components
- Provide a clear interpretation of the principal components in terms of the original features
- Explain how each principal component contributes to the variation in the data
Answer:
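The interpretation asked for here is read off the component loadings: each row of `pca.components_` gives the weight of every original feature in that component. This is a sketch; the feature names are taken from the report, the data is synthetic:

```python
import pandas as pd
from sklearn.decomposition import PCA

def component_loadings(X: pd.DataFrame, n_components=3) -> pd.DataFrame:
    """Loadings of each original feature on each principal component."""
    pca = PCA(n_components=n_components).fit(X)
    return pd.DataFrame(pca.components_,
                        index=[f"PC{i + 1}" for i in range(n_components)],
                        columns=X.columns)
```

A feature with a large absolute loading on PC1 drives most of the variation along that component; the sign tells whether it moves with or against the other high-loading features.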
We will consider k=3 to be the optimal number of clusters. This is because:
Question 10: Visualization
- Plot the data points in the reduced PCA space
- Interpret the plot and discuss any insights gained from it
Answer: By reducing the original 4 dimensions to 3, we are able to explain 95% of the variance:
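The reduced-space plot can be produced by projecting onto the first two components and coloring by cluster label; this sketch computes the coordinates and labels that `plt.scatter` would take (synthetic data, assumed parameter names):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def pca_scatter_data(X, n_clusters=3):
    """2-D PCA coordinates plus cluster labels, ready for plt.scatter."""
    coords = PCA(n_components=2).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=42).fit_predict(coords)
    # e.g. plt.scatter(coords[:, 0], coords[:, 1], c=labels)
    return coords, labels
```

Well-separated colored groups in this plot are the visual counterpart of the high silhouette score reported after PCA.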
Question 11: Dimensionality Reduction Impact
- Analyze the impact of PCA-based dimensionality reduction on the performance of the clustering algorithm used earlier
- Compare the clustering results before and after applying PCA and discuss any improvements or changes in the identified customer segments
Answer: Compare K-means clusters with Hierarchical clusters
K-means silhouette score is 0.529
Agglomerative clustering with average linkage has a silhouette score of 0.532
From the above analysis of the clusters formed by each model, K-means and agglomerative clustering with average linkage yielded very similar silhouette scores
Each cluster from each model (K-means or the linkage models) also has a similar distribution of customers, except for single linkage
K-means and agglomerative clustering with the average method have similar customer counts in each cluster
K-means customer distribution: 382, 228, 50
Agglomerative clustering with average linkage: 218, 50, 392
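The side-by-side silhouette comparison above can be sketched like this; the data is synthetic, and on clean blobs both models recover the same clusters:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

def compare_models(X, n_clusters=3):
    """Silhouette score for K-means vs average-linkage agglomerative."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X)
    agg = AgglomerativeClustering(n_clusters=n_clusters,
                                  linkage="average").fit(X)
    return (silhouette_score(X, km.labels_),
            silhouette_score(X, agg.labels_))
```

Near-identical scores, as in the report (0.529 vs 0.532), suggest both algorithms are finding essentially the same segmentation.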
Cluster 1: Customers with the highest avg credit limit, the highest number of credit cards, very few bank visits, the highest online visits, and fewer calls to the bank
Cluster 2: Customers with a low credit limit, fewer credit cards, medium bank visits, medium online visits, and the highest calls to the bank
Cluster 3: Customers with a medium credit limit, a medium number of credit cards (~5-6), the highest average bank visits, very few online visits, and medium calls to the bank
Linkage with average distance has a cophenetic correlation of 0.82, outperforming the other SciPy linkage methods
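The cophenetic correlation cited above measures how faithfully the dendrogram distances reproduce the original pairwise distances; a sketch of the SciPy computation (synthetic data):

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

def cophenetic_corr(X, method="average"):
    """Correlation between original and dendrogram (cophenetic) distances."""
    Z = linkage(X, method=method)
    corr, _ = cophenet(Z, pdist(X))
    return corr
```

Running this for each linkage method and keeping the highest value is how average linkage was selected here.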
Cluster 1 of K-means is similar to Cluster 2 of the SciPy average linkage: these are the customers with the lowest credit limits (min to avg 12K) and the fewest credit cards (1-3). They also have fewer in-person bank visits and moderate online visits, but the highest number of calls. This group has a considerable number of customers, over 200
THANK YOU