Machine Learning Guided Project
Contents:
- Apply PCA with the desired number of components = 2
- Create a new DataFrame with the PCA results
Question 1: Define the problem and perform an Exploratory Data Analysis
- Problem definition, questions to be answered
- Data background and contents
- Univariate analysis
- Bivariate analysis
- Insights based on EDA
Sl_No and Customer_Key can be dropped as they are not needed for model training
There is no target variable here, so we need to build an unsupervised learning model
Avg_Credit_Limit has a much higher magnitude than the other columns, so we need to scale the features
All columns are of int datatype
No nulls observed
Ignore the Sl_No and Customer_Key columns
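The preprocessing described above (drop the ID columns, scale everything) could be sketched as follows; the exact column names are taken from the report, but the function signature is an assumption:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop ID columns and standardize the remaining features."""
    # Sl_No and Customer Key carry no signal for clustering
    features = df.drop(columns=["Sl_No", "Customer Key"], errors="ignore")
    # Avg_Credit_Limit dominates in magnitude, so scale every column
    scaled = StandardScaler().fit_transform(features)
    return pd.DataFrame(scaled, columns=features.columns)
```

Scaling matters here because K-means uses Euclidean distance: without it, Avg_Credit_Limit (thousands) would dwarf visit counts (single digits).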
"Avg_Credit_Limit" ranges from 3K to 200K with an average of 34.5K, but 50% of customers have less than 18K, so it is heavily right skewed. It has a standard deviation of 34K
"Total_Credit_Cards" ranges from 1 to 10 cards with an average of ~5, and 50% of customers have fewer than 5. It has a standard deviation of ~5
"Total_visits_bank" ranges from 0 to 5 visits with an average of ~2, and 50% of customers have fewer than 2. It has a standard deviation of ~2
"Total_visits_online" ranges from 0 to 15 visits with an average of ~3, and 50% of customers have fewer than 2; 75% of customers have fewer than 4 online visits, so it is heavily right skewed. It has a standard deviation of ~3
"Total_calls_made" ranges from 0 to 10 calls with an average of ~4, and 50% of customers made fewer than 3, so it is slightly right skewed. It has a standard deviation of ~3
No null or NA values in the data
No duplicates found based on serial number
However, 5 sets of rows share the same customer key while differing in every other column, so it is unclear whether these are true duplicates
For now, no duplicate rows are deleted
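The duplicate check described above could look like this; the "Customer Key" column name follows the report, and the helper itself is a sketch:

```python
import pandas as pd

def customer_key_duplicates(df: pd.DataFrame):
    """Count exact duplicate rows and list rows that repeat only the key."""
    # Exact duplicate rows across all columns (none were found in the report)
    exact = int(df.duplicated().sum())
    # Rows that share a Customer Key but may differ in other columns
    key_dups = df[df.duplicated(subset=["Customer Key"], keep=False)]
    return exact, key_dups
```

Keeping `keep=False` returns every member of each duplicated-key group, which is what lets us inspect whether the other columns actually match.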
Univariate Analysis:
Avg credit limit is heavily right skewed
There are more customers with lower avg credit limits (<20K)
Credit cards, online visits, and total calls have multiple peaks
The majority of customers have 4 credit cards
The majority of customers visited the bank twice
The majority of customers visited online twice
The majority of customers made <=4 calls to the bank
Avg credit limit has a lot of outliers
Online visits also has outliers
The majority of customers have 4 credit cards, followed by 6 and 7; very few have >=8 cards
~160 customers visited the bank twice
The majority of customers visited online twice, and most customers made <=5 online visits
The majority of customers made <=4 calls to the bank
Bivariate Analysis:
One interesting observation: calls made to the bank decrease as the number of credit cards increases, and customers with fewer credit cards (1-3) call the bank frequently
Visits to the bank are higher for customers with 4 to 7 credit cards
Online visits are higher for customers with more credit cards (8 to 10)
There are more customers below a 10K avg credit limit
22% of customers have 4 credit cards and ~3% have 10 credit cards
24% of customers visited the bank twice and 15% never visited the bank; 85% visited at least once
~22% of customers never used online banking; 78% visited online banking at least once
~15% of customers never called the bank; 85% called at least once
Analyzing customers using different modes of communication:
7 customers never contacted the bank through calls or branch visits and used only online banking
These customers have more credit cards and more online visits
30 customers never contacted the bank through calls and never used online banking, but visited the bank
We need the demographics of these customers to know why they are not using call or online services and prefer only visiting the bank
0 customers contacted the bank only through calls; every caller also visited the bank or used online banking
60 customers never contacted the bank through calls, but visited the bank and used online banking
93 customers never visited the bank, but contacted through calls and used online banking
114 customers never used online banking, but contacted the bank through calls and visited the bank
We can target these customers to use online banking services, but we should also find the reason behind their not using online services. We do not have information on customers' age, educational background, or region (whether internet services are available)
356 customers used all 3 banking options (calls / online / visits), which is more than half of our customer base
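The channel counts above can be reproduced with boolean masks over the visit and call columns; the column names follow the report, and the grouping below is a sketch (it covers four of the seven combinations discussed):

```python
import pandas as pd

def channel_segments(df: pd.DataFrame) -> dict:
    """Count customers by which contact channels they used at least once."""
    calls = df["Total_calls_made"] > 0
    bank = df["Total_visits_bank"] > 0
    online = df["Total_visits_online"] > 0
    return {
        "online_only": int((online & ~calls & ~bank).sum()),
        "bank_only": int((bank & ~calls & ~online).sum()),
        "calls_only": int((calls & ~bank & ~online).sum()),
        "all_three": int((calls & bank & online).sum()),
    }
```

On the actual dataset, "calls_only" would come out as 0, matching the observation above.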
After treatment, no outliers are observed
From the above "Elbow Method" graph, we can find the possible number of clusters; the elbow bend is at 3
We can choose K=3 or K=4
At K=3 we have an inertia of 50.42, and at K=4 we have 34.23
We will decide K based on the silhouette score
This visualization is much richer than the previous one: it confirms that k = 3 is a very good choice, and it also shows that while k = 4 is not quite as good, it is still much better than k = 6, 7, 8, 9... This was not visible when comparing inertias
At K=3, we have a silhouette score of 0.53
We will take K=3
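The elbow/silhouette sweep described above can be sketched like this; the data here is synthetic (three well-separated blobs), not the report's dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sweep_k(X, k_values):
    """Fit K-means for each k; collect (inertia, silhouette score)."""
    results = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        # Inertia always falls as k grows; silhouette peaks at the "right" k
        results[k] = (km.inertia_, silhouette_score(X, km.labels_))
    return results
```

Plotting inertia against k gives the elbow curve; plotting the silhouette scores gives the richer comparison discussed above.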
The vertical dashed lines represent the silhouette score for each number of clusters. When most of the instances in a cluster have a lower coefficient than this score (i.e., when many instances stop short of the dashed line, ending to the left of it), the cluster is rather bad, since its instances are much too close to other clusters
When k = 3, the clusters look good: most instances extend beyond the dashed line, to the right and closer to 1.0
Cluster Analysis:
Cluster 0: Customers with a medium credit limit (medium number of credit cards), few calls made, more bank visits, and very few online visits
Cluster 1: Customers with a lower total credit limit (fewer credit cards), more calls to the bank, medium bank visits, and medium online visits
Cluster 2: Customers with a higher total credit limit (higher number of credit cards), fewer calls to the bank, very few bank visits, and the highest online visits
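Profiles like the three above are typically built by attaching the cluster labels to the original (unscaled) features and averaging per cluster; this helper is a sketch of that step:

```python
import pandas as pd

def profile_clusters(df: pd.DataFrame, labels) -> pd.DataFrame:
    """Mean of each feature per cluster, plus cluster sizes."""
    profiled = df.assign(cluster=labels)
    summary = profiled.groupby("cluster").mean()
    # Cluster sizes help judge whether a segment is worth targeting
    summary["size"] = profiled["cluster"].value_counts().sort_index()
    return summary
```

Profiling on the unscaled columns keeps the means in interpretable units (credit limit in currency, visits in counts).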
Question 4: Applying Hierarchical Clustering
- Apply Hierarchical clustering with different linkage methods
- Plot dendrograms for each linkage method
- Figure out the appropriate number of clusters
- Cluster Profiling
Answer:
Cluster 0: Customers with a low credit limit, fewer credit cards, medium bank visits, medium online visits, and the highest calls to the bank
Cluster 1: Customers with the highest avg credit limit, the highest number of credit cards, very few bank visits, the highest online visits, and fewer calls to the bank
Cluster 2: Customers with a medium credit limit, a medium number of credit cards (~5-6), the highest average bank visits, very few online visits, and medium calls to the bank
Here the data is distributed into 3 clusters: Cluster 0 has 218 data points, Cluster 1 has 50 data points, and Cluster 2 has 392 data points
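The hierarchical step above (linkage, dendrogram, cut into flat clusters) can be sketched with SciPy; the blob data here is synthetic:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def hierarchical_labels(X, n_clusters=3, method="average"):
    """Cut an average-linkage dendrogram into n_clusters flat clusters."""
    Z = linkage(X, method=method)   # Z is also what dendrogram(Z) would plot
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

Trying `method="single"`, `"complete"`, `"average"`, and `"ward"` and plotting each `dendrogram(Z)` is how the different linkage methods in the question are compared.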
Question 5: K-means vs Hierarchical Clustering
- Compare clusters obtained from K-means and Hierarchical clustering techniques
Answer: Hierarchical clustering can't handle big data well, but K-means clustering can. This is because the time complexity of K-means is linear, i.e. O(n), while that of hierarchical clustering is quadratic, i.e. O(n^2)
In K-means clustering, since we start with a random choice of centroids, the results produced by running the algorithm multiple times might differ, while results are reproducible in hierarchical clustering
K-means is found to work well when the shape of the clusters is hyperspherical (like a circle in 2D or a sphere in 3D)
K-means clustering requires prior knowledge of K, i.e. the number of clusters you want to divide your data into. In hierarchical clustering, you can stop at whatever number of clusters you find appropriate by interpreting the dendrogram
With a large number of variables, K-means computes faster
The result of K-means is unstructured, but that of hierarchical clustering is more interpretable and informative
It is easier to determine the number of clusters from hierarchical clustering's dendrogram
Customers with a high credit limit tend to visit online. Hence they can be targeted for online campaigns and coupons, and products and services can be offered to them accordingly
Customers with a comparatively low credit limit visit the bank more often; they can either be shown the benefits of online banking or be catered to with in-bank offers, services, and flyers
Customers with low credit limits use the online platform less frequently; they can be marketed the benefits of online banking, and customer call-center reps can be made aware of promotions and offers so that they can target this segment
Based on how the bank wants to promote its products and services, a segment of customers can be targeted, as we know their preferred mode of communication with the bank
Answer: By reducing the original 4 dimensions to 3, we are able to explain 95% of the variance
With K=3, the cluster error (inertia) is reduced from 50.42 (without PCA, with 4 columns) to 46.29 (with PCA, with 3 columns)
We can choose K=3 or K=4
At K=3 we have an inertia of 46.29
With K=3, the silhouette score increased from 0.53 (without PCA) to 0.73 (with PCA)
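The "95% variance retained" figure above comes from summing the explained-variance ratios; this sketch shows the computation on synthetic data built from two latent factors (the report's actual percentages come from its own dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

def variance_retained(X, n_components):
    """Cumulative explained-variance ratio of the first n components."""
    pca = PCA(n_components=n_components).fit(X)
    return pca.explained_variance_ratio_.sum()
```

Sweeping `n_components` and stopping once this value crosses the chosen threshold (here 0.95) is the standard way to pick the reduced dimension.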
Question 8: Interpretation of Principal Components
- Provide a clear interpretation of the principal components in terms of the original features
- Explain how each principal component contributes to the variation in the data
Answer:
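The interpretation asked for here is read off the component loadings: each row of `pca.components_` gives the weight of every original feature in that component. This is a sketch; the feature names are taken from the report, the data is synthetic:

```python
import pandas as pd
from sklearn.decomposition import PCA

def component_loadings(X: pd.DataFrame, n_components=3) -> pd.DataFrame:
    """Loadings of each original feature on each principal component."""
    pca = PCA(n_components=n_components).fit(X)
    return pd.DataFrame(pca.components_,
                        index=[f"PC{i + 1}" for i in range(n_components)],
                        columns=X.columns)
```

A feature with a large absolute loading on PC1 drives most of the variation along that component; the sign tells whether it moves with or against the other high-loading features.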
We will consider k=3 to be the optimal number of clusters. This is because:
Question 10: Visualization
- Plot the data points in the reduced PCA space
- Interpret the plot and discuss any insights gained from it
Answer: By reducing the original 4 dimensions to 3, we are able to explain 95% of the variance:
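The reduced-space plot can be produced by projecting onto the first two components and coloring by cluster label; this sketch computes the coordinates and labels that `plt.scatter` would take (synthetic data, assumed parameter names):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def pca_scatter_data(X, n_clusters=3):
    """2-D PCA coordinates plus cluster labels, ready for plt.scatter."""
    coords = PCA(n_components=2).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=42).fit_predict(coords)
    # e.g. plt.scatter(coords[:, 0], coords[:, 1], c=labels)
    return coords, labels
```

Well-separated colored groups in this plot are the visual counterpart of the high silhouette score reported after PCA.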
Question 11: Dimensionality Reduction Impact
- Analyze the impact of PCA-based dimensionality reduction on the performance of the clustering algorithm used earlier
- Compare the clustering results before and after applying PCA and discuss any improvements or changes in the identified customer segments
Answer: Compare K-means clusters with Hierarchical clusters
K-means silhouette score is 0.529
Agglomerative clustering with average linkage has a silhouette score of 0.532
From the above analysis of the clusters formed by each model, K-means and agglomerative clustering with average linkage yielded very similar silhouette scores
Each cluster from each model (K-means or the linkage models) also has a similar distribution of customers, except for single linkage
K-means and agglomerative clustering with the average method have similar customer counts in each cluster
K-means customer distribution: 382, 228, 50
Agglomerative clustering with average linkage: 218, 50, 392
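The side-by-side silhouette comparison above can be sketched like this; the data is synthetic, and on clean blobs both models recover the same clusters:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

def compare_models(X, n_clusters=3):
    """Silhouette score for K-means vs average-linkage agglomerative."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X)
    agg = AgglomerativeClustering(n_clusters=n_clusters,
                                  linkage="average").fit(X)
    return (silhouette_score(X, km.labels_),
            silhouette_score(X, agg.labels_))
```

Near-identical scores, as in the report (0.529 vs 0.532), suggest both algorithms are finding essentially the same segmentation.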
Cluster 1: Customers with the highest avg credit limit, the highest number of credit cards, very few bank visits, the highest online visits, and fewer calls to the bank
Cluster 2: Customers with a low credit limit, fewer credit cards, medium bank visits, medium online visits, and the highest calls to the bank
Cluster 3: Customers with a medium credit limit, a medium number of credit cards (~5-6), the highest average bank visits, very few online visits, and medium calls to the bank
Linkage with average distance has a cophenetic correlation of 0.82, outperforming the other SciPy linkage methods
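The cophenetic correlation cited above measures how faithfully the dendrogram distances reproduce the original pairwise distances; a sketch of the SciPy computation (synthetic data):

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

def cophenetic_corr(X, method="average"):
    """Correlation between original and dendrogram (cophenetic) distances."""
    Z = linkage(X, method=method)
    corr, _ = cophenet(Z, pdist(X))
    return corr
```

Running this for each linkage method and keeping the highest value is how average linkage was selected here.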
Cluster 1 of K-means is similar to Cluster 2 of the SciPy average linkage: these are the customers with the lowest credit limits (min to avg 12K) and the fewest credit cards (1-3). They also have fewer in-person bank visits and moderate online visits, but the highest number of calls. This group has a considerable number of customers, over 200
THANK YOU