
Machine Learning Project Report - Guided
Contents:

Question 1: Define the problem and perform an Exploratory Data Analysis
- Problem definition, questions to be answered - Data background and contents - Univariate analysis - Bivariate analysis - Insights based on EDA

Question 2: Data preprocessing
- Prepare the data for analysis - Feature engineering - Missing value treatment - Outlier treatment - Duplicate observations check

Question 3: Applying K-means Clustering
- Apply K-means Clustering - Plot the Elbow curve - Check Silhouette Scores - Figure out appropriate number of clusters - Cluster Profiling

Question 4: Applying Hierarchical Clustering
- Apply Hierarchical clustering with different linkage methods - Plot dendrograms for each linkage method - Figure out appropriate number of clusters - Cluster Profiling

Question 5: K-means vs Hierarchical Clustering
- Compare clusters obtained from K-means and Hierarchical clustering techniques

Question 6: Actionable Insights & Recommendations
- Conclude with the key takeaways for the business - What would be your recommendations to the business?

Question 7: PCA Transformation
- Apply PCA with the desired number of components = 2 - Create a new DataFrame with the PCA results

Question 8: Interpretation of Principal Components
- Provide a clear interpretation of the principal components in terms of the original features - Explain how each principal component contributes to the variation in the data

Question 9: Variance Explanation
- Report the cumulative explained variance of the retained principal components - Discuss how much of the total variance is captured by the selected principal components

Question 10: Visualization
- Plot the data points in the reduced PCA space - Interpret the plot and discuss any insights gained from it

Question 11: Dimensionality Reduction Impact
- Analyze the impact of PCA-based dimensionality reduction on the performance of the clustering algorithm used earlier - Compare the clustering results before and after applying PCA and discuss any improvements or changes in the identified customer segments

Question 1: Define the problem and perform an Exploratory Data Analysis
- Problem definition, questions to be answered - Data background and contents - Univariate
analysis - Bivariate analysis - Insights based on EDA

Answer: The dataset has 660 observations (rows) and 7 variables (features).

 Sl_No and Customer_Key can be dropped, as they are not needed for model training
 There is no target variable, so we need to build an unsupervised learning model
 Avg_Credit_Limit has a much higher magnitude than the other columns, so the features need to be scaled
 All columns are of int datatype
 No null or NA values observed
 "Avg_Credit_Limit" ranges from 3K to 200K with a mean of 34.5K, but 50% of customers have less than 18K, so it is heavily right-skewed. Its standard deviation is ~34K
 "Total_Credit_Cards" ranges from 1 to 10 cards with a mean of ~5; 50% of customers have fewer than 5. Its standard deviation is ~5
 "Total_visits_bank" ranges from 0 to 5 visits with a mean of ~2; 50% of customers have fewer than 2. Its standard deviation is ~2
 "Total_visits_online" ranges from 0 to 15 visits with a mean of ~3; 50% of customers have fewer than 2 visits and 75% have fewer than 4, so it is heavily right-skewed. Its standard deviation is ~3
 "Total_calls_made" ranges from 0 to 10 calls with a mean of ~4; 50% of customers made fewer than 3 calls, so it is slightly right-skewed. Its standard deviation is ~3
 No duplicates found based on serial number
 However, 5 pairs of rows share the same customer key; apart from the customer key, every other column differs, so it is unclear whether these are true duplicates
 For now, no duplicate rows are deleted
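The checks above can be sketched in pandas. The commented-out file name and the sample values below are hypothetical stand-ins; only the column names and overall shape come from the report:

```python
import pandas as pd

# Hypothetical path; the report does not state the actual file name.
# df = pd.read_csv("Credit_Card_Customer_Data.csv")
# For illustration, build a tiny frame with the columns the report describes.
df = pd.DataFrame({
    "Sl_No": [1, 2, 3],
    "Customer_Key": [87073, 38414, 17341],
    "Avg_Credit_Limit": [100000, 50000, 3000],
    "Total_Credit_Cards": [2, 3, 7],
    "Total_visits_bank": [1, 0, 5],
    "Total_visits_online": [1, 10, 0],
    "Total_calls_made": [0, 9, 4],
})

print(df.shape)                                   # observations x variables
print(df.isnull().sum().sum())                    # total missing values
print(df.duplicated().sum())                      # fully duplicated rows
print(df.duplicated(subset="Customer_Key").sum()) # duplicates on the key only

# Drop the ID columns: they carry no signal for clustering.
df = df.drop(columns=["Sl_No", "Customer_Key"])
print(df.describe().T[["mean", "std", "min", "50%", "max"]])
```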

Univariate Analysis:

 Avg credit limit is heavily right-skewed
 Most customers have lower average credit limits (<20K)
 Credit cards, online visits and total calls show multiple peaks
 The majority of customers have 4 credit cards, followed by 6 and 7; very few have >=8 cards
 ~160 customers visited the bank twice, the most common count
 The most common number of online visits is two, and most customers have <=5 online visits
 The majority of customers made <=4 calls to the bank
 Avg credit limit has a lot of outliers; online visits also has outliers

Bivariate Analysis:

 One interesting observation is that calls made to the bank decrease as the number of credit cards increases; customers with fewer credit cards (1-3) call the bank frequently
 Bank visits are higher for customers with 4 to 7 credit cards
 Online visits are higher for customers with more credit cards (8 to 10)

Question 2: Data preprocessing


Prepare the data for analysis - Feature engineering - Missing value treatment - Outlier
treatment - Duplicate observations check
Answer: Feature Engineering:

 Out of 660 customers:
 there are 110 distinct avg credit limits
 there are 10 distinct total credit card counts
 Most customers are below a 10K avg credit limit
 22% of customers have 4 credit cards and ~3% have 10 credit cards
 24% of customers visited the bank twice and 15% never visited; 85% visited at least once
 ~22% of customers never used online banking; 78% used it at least once
 ~15% of customers never called the bank; 85% called at least once
 Analyzing customers by mode of communication:
 7 customers never called or visited the bank and used only online banking; these customers have more credit cards and more online visits
 30 customers never called and never used online banking but visited the bank; we would need customer demographics to understand why they prefer branch visits over calls or online services
 0 customers contacted the bank only by phone (every caller also visited the bank or used online banking)
 60 customers never called but visited the bank and used online banking
 93 customers never visited the bank but called and used online banking
 114 customers never used online banking but called and visited the bank; we can target these customers with online banking services, though we should first understand why they avoid online channels. We lack information on customer age, education, or region (e.g. whether internet services are available)
 356 customers used all 3 banking channels (calls / online / visits), which is more than half of the customer base
 After outlier treatment, no outliers are observed

 +ve correlation (0.61) between credit limit and number of cards

 +ve correlation (0.55) between credit limit and online visits
 -ve correlation (-0.41) between credit limit and calls made
 -ve correlation (-0.65) between calls made and total credit cards
 -ve correlation (-0.55) between online visits and bank visits
 -ve correlation (-0.51) between calls made and bank visits
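The preprocessing described above (outlier treatment followed by scaling) might look like the sketch below. The report does not state its exact method, so IQR capping and StandardScaler are assumptions, and the data is a synthetic stand-in:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in for two of the skewed customer columns.
df = pd.DataFrame({
    "Avg_Credit_Limit": rng.lognormal(10, 1, 200),
    "Total_visits_online": rng.poisson(3, 200).astype(float),
})

# IQR capping: one common way to treat outliers without dropping rows.
for col in df.columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Scale so Avg_Credit_Limit's magnitude does not dominate the distance metric.
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(scaled.mean().round(6).tolist())       # ~0 per column
print(scaled.std(ddof=0).round(6).tolist())  # ~1 per column
```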

Question 3: Applying K-means Clustering


- Apply K-means Clustering - Plot the Elbow curve - Check Silhouette Scores - Figure out
appropriate number of clusters - Cluster Profiling
Answer:

 From the "Elbow Method" graph we can find the possible number of clusters; the elbow bends at 3
 We can choose K=3 or K=4
 At K=3 the inertia is 50.42 and at K=4 it is 34.23
 We will decide K based on the silhouette score

 This visualization is much richer than the previous one: it confirms that k=3 is a very good choice, and shows that k=4, while not as good, is still much better than k=6, 7, 8, 9... This was not visible when comparing inertias.
 At K=3, the silhouette score is 0.53
 We will take K=3

 The vertical dashed line marks the average silhouette score for each number of clusters. When most of the instances in a cluster have a lower coefficient than this score (i.e., many instances stop short of the dashed line, ending to its left), the cluster is rather bad, since its instances are much too close to other clusters.
 When k=3 the clusters look good: most instances extend beyond the dashed line, to the right and closer to 1.0
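The elbow curve and silhouette comparison can be reproduced along these lines; make_blobs stands in for the scaled customer features, so the exact scores will differ from the report's:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled customer features (3 latent groups).
X, _ = make_blobs(n_samples=660, centers=3, n_features=5,
                  cluster_std=1.0, random_state=42)

inertias, sil_scores = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_                      # for the elbow curve
    sil_scores[k] = silhouette_score(X, km.labels_)

# The elbow is where inertia's rate of decrease drops sharply;
# the silhouette score peaks at the best-separated k.
best_k = max(sil_scores, key=sil_scores.get)
print(best_k)
```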

Cluster Analysis:

 Cluster 0: customers with a medium credit limit (medium number of credit cards), few calls made, more bank visits and very few online visits
 Cluster 1: customers with a lower total credit limit (fewer credit cards), more calls to the bank, medium bank visits and medium online visits
 Cluster 2: customers with a higher total credit limit (more credit cards), fewer calls made to the bank, very few bank visits and the highest online visits

 By reducing the original 4 dimensions to 3, we are able to explain 95% of the variance
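Cluster profiling as described above is typically a groupby over the cluster labels. This sketch uses synthetic data with the report's column names, so the segment means are illustrative only; real profiling would use the original (unscaled) columns so the means are interpretable:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=660, centers=3, n_features=5, random_state=42)
cols = ["Avg_Credit_Limit", "Total_Credit_Cards", "Total_visits_bank",
        "Total_visits_online", "Total_calls_made"]
df = pd.DataFrame(X, columns=cols)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
df["cluster"] = labels

# Profile: per-cluster mean of each feature plus segment size.
profile = df.groupby("cluster").mean()
profile["count"] = df["cluster"].value_counts().sort_index()
print(profile.round(2))
```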

Question 4: Applying Hierarchical Clustering
- Apply Hierarchical clustering with different linkage methods - Plot dendrograms for each
linkage method - Figure out appropriate number of clusters - Cluster Profiling
Answer:

 Cluster 0: customers with a low credit limit, fewer credit cards, medium bank visits, medium online visits and the highest calls to the bank
 Cluster 1: customers with the highest avg credit limit, the most credit cards, very few bank visits, the highest online visits and fewer calls made to the bank
 Cluster 2: customers with a medium credit limit, a medium number of credit cards (~5-6), the highest average bank visits, very few online visits and medium calls to the bank
 The data is distributed across 3 clusters: cluster 0 has 218 data points, cluster 1 has 50 and cluster 2 has 392.
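A sketch of the hierarchical clustering step with the different linkage methods, using SciPy on synthetic stand-in data (the dendrogram call is left commented for notebook use):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=100, centers=3, n_features=5, random_state=42)

# Try several linkage methods; the report plots a dendrogram for each.
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    # from scipy.cluster.hierarchy import dendrogram; dendrogram(Z)
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```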

Question 5: K-means vs Hierarchical Clustering
Compare clusters obtained from K-means and Hierarchical clustering techniques

Answer: Hierarchical clustering can't handle big data well, but K-means can: the time complexity of K-means is linear in the number of points, i.e. O(n), while that of hierarchical clustering is quadratic, i.e. O(n²).

 In K-means clustering, since we start with a random choice of centroids, the results produced by running the algorithm multiple times might differ, while results are reproducible in hierarchical clustering.
 K-means is found to work well when the clusters are hyperspherical (like a circle in 2D or a sphere in 3D).
 K-means requires prior knowledge of K, i.e. the number of clusters to divide the data into, while in hierarchical clustering you can stop at whatever number of clusters you find appropriate by interpreting the dendrogram.
 With a large number of variables, K-means computes faster.
 The result of K-means is unstructured, while that of hierarchical clustering is more interpretable and informative.
 It is easier to determine the number of clusters from hierarchical clustering's dendrogram.
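One way to quantify the comparison above is to fit both algorithms on the same data and measure how well their partitions agree. This sketch uses synthetic well-separated data, so the agreement will be higher than on real customer features:

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=660, centers=3, n_features=5, random_state=42)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)

# Adjusted Rand index: 1.0 means the two partitions are identical
# up to a relabeling of cluster ids.
ari = adjusted_rand_score(km_labels, hc_labels)
print(round(ari, 3))
```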

Question 6: Actionable Insights & Recommendations


Conclude with the key takeaways for the business - What would be your recommendations
to the business?
Answer: As we see, there are three (3) segments of customers, each with a preferred channel of communication with the bank, so it is recommended that products are marketed to each segment through its preferred channel. Additional services can also be offered based on how customers connect with the bank and on their spending pattern, which can be deduced from the average credit limit.

 Customers with high credit limits tend to visit online, so they can be targeted with online campaigns and coupons, and products and services can be offered to them accordingly
 Customers with comparatively low credit limits visit the bank more often, so they can either be shown the benefits of online banking or be catered to with in-branch offers, services and flyers
 Customers with low credit limits use the online platform less frequently; they can be marketed the benefits of online banking, and call-center reps can be made aware of promotions and offers so that they can target this segment

Based on how the bank wants to promote its products and services, a segment of customers can be targeted, as we know their preferred mode of communication with the bank.

Question 7: PCA Transformation


- Apply PCA with the desired number of components = 2 - Create a new DataFrame with the
PCA results

Answer: By reducing the original 4 dimensions to 3, we are able to explain 95% of the variance.

 With K=3, the cluster error (inertia) is reduced from 50.42 (without PCA, with 4 columns) to 46.29 (with PCA, with 3 columns)
 We can choose K=3 or K=4
 At K=3 the inertia is 46.29

 With K=3, the silhouette score increases from 0.53 (without PCA) to 0.73 (with PCA)
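The PCA transformation the question asks for (2 components, results collected in a new DataFrame) can be sketched as follows, with make_blobs standing in for the scaled customer data:

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features.
X, _ = make_blobs(n_samples=660, centers=3, n_features=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Question 7 asks for exactly 2 components.
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

# New DataFrame holding the PCA results, as the question requires.
df_pca = pd.DataFrame(X_pca, columns=["PC1", "PC2"])
print(df_pca.shape)
print(pca.explained_variance_ratio_.round(3))
```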

Question 8: Interpretation of Principal Components
- Provide a clear interpretation of the principal components in terms of the original features.
- Explain how each principal component contributes to the variation in the data.
Answer:

We will consider k=3 to be the optimal number of clusters. This is because:

 The elbow at k=3 was steepest, with a large drop in inertia
 The silhouette score was highest for k=3 by far
 The gap statistic implies k=3 is best
 The Davies-Bouldin score also showed k=3 as best
 The Calinski-Harabasz index highlighted k=3 as best
 The cluster distribution for k=3 was subjectively very distinct, especially compared to k=5, which was another consideration
 The PCA plots all show three distinct groups with little overlap
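To interpret the components in terms of the original features, the usual approach is to inspect the loadings matrix (pca.components_): the sign and magnitude of each entry show how strongly each original feature contributes to that component. The column names below come from the report, but the data is a synthetic stand-in:

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cols = ["Avg_Credit_Limit", "Total_Credit_Cards", "Total_visits_bank",
        "Total_visits_online", "Total_calls_made"]
X, _ = make_blobs(n_samples=660, centers=3, n_features=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)

# Loadings: each row shows how each original feature contributes
# to that principal component (sign gives direction).
loadings = pd.DataFrame(pca.components_, columns=cols, index=["PC1", "PC2"])
print(loadings.round(3))

# Features with large |loading| on PC1 drive the most variation.
top_pc1 = loadings.loc["PC1"].abs().idxmax()
print(top_pc1)
```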

Question 9: Variance Explanation
- Report the cumulative explained variance of the retained principal components. - Discuss
how much of the total variance is captured by the selected principal components.
Answer:

For 90% variance explained, the number of components looks to be 3.

Explained variance = 88.0%
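The cumulative explained variance reported above comes from PCA's explained_variance_ratio_; a sketch on stand-in data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=660, centers=3, n_features=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)  # keep all components to inspect the spectrum
cumvar = np.cumsum(pca.explained_variance_ratio_)
print(cumvar.round(3))

# Smallest number of components whose cumulative variance reaches 90%.
n_90 = int(np.argmax(cumvar >= 0.90)) + 1
print(n_90)
```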

Question 10: Visualization
- Plot the data points in the reduced PCA space - Interpret the plot and discuss any insights
gained from it.

Answer: By reducing the original 4 dimensions to 3, we are able to explain 95% of the variance:

 660 rows × 3 columns
 With K=3, the cluster error (inertia) is reduced from 50.42 (without PCA, with 4 columns) to 46.29 (with PCA, with 3 columns)
 We can choose K=3 or K=4
 At K=3 the inertia is 46.29
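The reduced-space plot can be produced along these lines (synthetic stand-in data; the Agg backend is used only so the sketch runs headless, and the output file name is arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; a notebook would render inline
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features.
X, _ = make_blobs(n_samples=660, centers=3, n_features=5, random_state=42)
X_pca = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_pca)

# Scatter in the reduced PCA space, colored by cluster label.
fig, ax = plt.subplots()
ax.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap="viridis", s=12)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("Customer segments in PCA space")
fig.savefig("pca_clusters.png")
print(X_pca.shape)
```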

Question 11: Dimensionality Reduction Impact
- Analyze the impact of PCA-based dimensionality reduction on the performance of the
clustering algorithm used earlier. - Compare the clustering results before and after applying
PCA and discuss any improvements or changes in the identified customer segments
Answer: Compare K-means clusters with Hierarchical clusters


Compare KMeans with sklearn AgglomerativeClustering (average linkage)

 KMeans silhouette score is 0.529
 AgglomerativeClustering with average linkage silhouette score is 0.532
 From the cluster analysis of each model, KMeans and agglomerative clustering with average linkage yielded very similar silhouette scores
 Each cluster from each model (KMeans or the linkage models) also has a similar distribution of customers, except for single linkage
 KMeans and agglomerative clustering with average linkage have similar customer counts per cluster:
 KMeans customer distribution: 382, 228, 50
 Agglomerative with average linkage: 218, 50, 392

Compare KMeans with SciPy linkage (average distance)

Linkage_Avg: 50, 218, 392

 Cluster 1: customers with the highest avg credit limit, the most credit cards, very few bank visits, the highest online visits and fewer calls made to the bank
 Cluster 2: customers with a low credit limit, fewer credit cards, medium bank visits, medium online visits and the highest calls to the bank
 Cluster 3: customers with a medium credit limit, a medium number of credit cards (~5-6), the highest average bank visits, very few online visits and medium calls to the bank
 Linkage with average distance has a cophenetic correlation of 0.82, outperforming the other SciPy linkage methods
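The cophenetic correlation comparison across linkage methods can be sketched with scipy.cluster.hierarchy.cophenet; on this synthetic stand-in the winning method and scores will differ from the report's:

```python
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=200, centers=3, n_features=5, random_state=42)
dists = pdist(X)  # condensed pairwise distance matrix

# Cophenetic correlation: how faithfully each dendrogram preserves
# the original pairwise distances (higher is better).
scores = {}
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    c, _ = cophenet(Z, dists)
    scores[method] = round(float(c), 3)
print(scores)
```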

 Cluster 2 of K-means is similar to Cluster 1 of SciPy average linkage: customers in this cluster have high credit limits (100K to avg 140K) and the most credit cards (8-10). They have the fewest in-person bank visits, the most online visits, and the fewest calls made. This is the smallest cluster, with around 50 customers.

 Cluster 1 of K-means is similar to Cluster 2 of SciPy average linkage: these customers have the lowest credit limits (min to avg 12K) and the fewest credit cards (1-3). They also have fewer in-person bank visits and moderate online visits, but the highest number of calls. This group has a considerable number of customers, over 200.

 Cluster 0 of K-means is similar to Cluster 3 of SciPy average linkage: this group has moderate credit limits (20K to avg ~34K) and a considerable number of credit cards (4-6). They have the most in-person bank visits, the fewest online visits, and fewer phone calls to the bank. This is the largest group, with over 380 customers.

Thank You

