Data Mining Project
5/23/2021
KARTHIKEYAN M
Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers to its customers.
They collected a sample that summarizes the activities of users during the past few months. You are
given the task to identify the segments based on credit card usage.
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate,
and multivariate analysis).
1.2 Do you think scaling is necessary for clustering in this case? Justify
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and
silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for
different clusters.
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate,
and multivariate analysis).
All columns are of float or integer data type.
The mean and median of each column are roughly equal, which indicates that the distributions are approximately symmetric with little skew. The distribution plots are also consistent with approximately normal data.
There are no duplicate values present in the dataset.
Outliers are present in only two columns, and they affect very few data points, so no outlier treatment has been applied.
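The checks described above (data types, mean versus median, duplicates, outliers) can be sketched as follows. This is a minimal illustration, not the code used in the project: the DataFrame is a synthetic stand-in, the column names `spending` and `credit_limit` are hypothetical, and the 1.5 × IQR rule is assumed as the outlier definition.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the bank dataset; column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "spending": rng.normal(15, 3, 200),
    "credit_limit": rng.normal(3.5, 0.5, 200),
})
df.loc[0, "spending"] = 40.0  # inject one obvious outlier for the demonstration


def iqr_outlier_counts(frame: pd.DataFrame) -> pd.Series:
    """Count values lying more than 1.5 * IQR outside the quartiles, per numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return outside.sum()


# One row per column: data type, mean, median and skewness side by side.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "mean": df.mean(),
    "median": df.median(),
    "skew": df.skew(),
})
n_duplicates = int(df.duplicated().sum())  # number of fully duplicated rows
outliers = iqr_outlier_counts(df)

print(summary)
print("duplicate rows:", n_duplicates)
print(outliers)
```

A mean close to the median together with a skewness near zero supports the symmetry reading; it does not by itself prove normality.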
1.2 Do you think scaling is necessary for clustering in this case? Justify
Yes, scaling is required.
The descriptive statistics show that the measures of central tendency are close to each other; even so, scaling (normalizing) the data is advisable to obtain good-quality clusters. Clustering here relies on Euclidean distance, which is very sensitive to differences in the magnitude of the variables, so a variable with a larger range would otherwise dominate the distance. This makes scaling critical.
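The effect of scaling on Euclidean distance can be illustrated with a small sketch. The four rows and the meaning given to the two columns are hypothetical, and z-score scaling via scikit-learn's `StandardScaler` is an assumption; the report does not state which scaler was used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [spending, probability of full payment].
X = np.array([
    [12.0, 0.85],
    [19.0, 0.88],
    [12.5, 0.91],
    [18.5, 0.84],
])


def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance between two observations."""
    return float(np.linalg.norm(a - b))


# z-score scaling: each column ends up with mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)

# Rows 0 and 2 have similar spending but different payment probabilities.
raw_gap = euclidean(X[0], X[2])                    # dominated by the first column
scaled_gap = euclidean(X_scaled[0], X_scaled[2])   # second column now contributes too

print(round(raw_gap, 3), round(scaled_gap, 3))
```

Before scaling, the small-valued column barely affects the distance; after scaling, a difference in that column counts as much as a comparable relative difference in the other.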
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them
The average linkage method is used here for hierarchical clustering.
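A minimal sketch of average-linkage clustering with SciPy is shown below. The data are synthetic, and cutting the tree into 3 clusters is an assumption made for illustration; in practice the cut is chosen by inspecting the dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(42)
# Synthetic stand-in for the scaled data: three well-separated groups of 30 points.
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
X_scaled = np.vstack([c + rng.normal(0, 0.5, size=(30, 2)) for c in centers])

# Average linkage; the Euclidean metric is SciPy's default.
Z = linkage(X_scaled, method="average")

# no_plot=True returns the tree layout without drawing it;
# truncate_mode="lastp" keeps only the last p merged clusters for readability.
tree = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)

# Cut the tree into at most 3 flat clusters and count the members of each.
labels = fcluster(Z, t=3, criterion="maxclust")
cluster_ids, frequency = np.unique(labels, return_counts=True)

print(dict(zip(cluster_ids.tolist(), frequency.tolist())))
```

Dropping `no_plot=True` draws the dendrogram with matplotlib, and long vertical branches in that plot suggest where to cut.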
Choosing a suitable cut-off for the dendrogram:
Cluster Frequency:
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and
silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.
Creating clusters using KMeans: forming 3 clusters with K = 3.
Choosing K: Elbow Method
From the elbow plot, the WSS (within-cluster sum of squares) does not change much beyond K = 3, so 3 is selected as the optimal number of clusters.
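The elbow computation can be sketched as follows, assuming scikit-learn's `KMeans`, whose `inertia_` attribute is the WSS. The data are synthetic with three built-in groups, so the flattening after K = 3 is a property of this toy example rather than a reproduction of the project's plot.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic stand-in for the scaled data: three groups of 40 points.
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
X_scaled = np.vstack([c + rng.normal(0, 0.5, size=(40, 2)) for c in centers])

k_values = list(range(1, 11))
wss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_scaled)
    wss.append(model.inertia_)  # within-cluster sum of squares for this K

# Reduction in WSS gained by each additional cluster (K-1 -> K).
drops = [wss[i - 1] - wss[i] for i in range(1, len(wss))]

for k, value in zip(k_values, wss):
    print(k, round(value, 2))
```

Plotting `wss` against `k_values` gives the elbow curve; the elbow is the K after which the entries in `drops` become small.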
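The silhouette score is positive, which means that, on average, observations lie closer to their own cluster than to the nearest neighbouring cluster; the clustering is therefore reasonable.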
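A minimal sketch of the silhouette calculation with scikit-learn follows. The data are synthetic and K = 3 is taken from the text; the resulting score belongs to this toy example only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(2)
# Synthetic stand-in for the scaled data: three groups of 40 points.
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
X_scaled = np.vstack([c + rng.normal(0, 0.5, size=(40, 2)) for c in centers])

labels = KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(X_scaled)

# Average silhouette width over all observations, in the range [-1, 1].
score = silhouette_score(X_scaled, labels)

# Per-observation widths; negative values flag points that sit closer
# to a neighbouring cluster than to their own.
per_point = silhouette_samples(X_scaled, labels)

print(round(float(score), 3), round(float(per_point.min()), 3))
```

Checking the minimum of `per_point` is a useful complement to the average, because a positive average can still hide individual observations that are poorly assigned.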
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for
different clusters.
Cluster Profiling – Hierarchical
KMeans – Clustering:
They are already spending a lot, hence
• These customers pay their bills on time, so discounts are likely to encourage them to spend more.
• Because they pay on time, their credit limit can be increased, which they are likely to welcome.
• These customers are not spending much, so promotional offers may attract them.
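Recommendations like these rest on a per-cluster profile, which can be computed as in the sketch below. The six customers, the column names and the cluster labels are all hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical customers with cluster labels already attached.
df = pd.DataFrame({
    "spending": [19.0, 18.5, 12.0, 11.5, 14.5, 14.0],
    "credit_limit": [3.7, 3.6, 2.9, 2.8, 3.3, 3.2],
    "cluster": [1, 1, 2, 2, 3, 3],
})

# Mean of every variable within each cluster.
profile = df.groupby("cluster").mean()

# Number of customers in each cluster, aligned on the cluster label.
profile["frequency"] = df["cluster"].value_counts().sort_index()

print(profile)
```

Reading the profile row by row shows which cluster spends most and which has the lowest limits, and that comparison is what the promotional strategy for each cluster is based on.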
Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The management decides to
collect data from the past few years. You are assigned the task to make a model which predicts the claim
status and provide recommendations to management. Use CART, RF & ANN and compare the models'
performances in train and test sets.
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate,
and multivariate analysis).
2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial
Neural Network
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each
model.
2.4 Final Model: Compare all the models and write an inference which model is best/optimized.
2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate,
and multivariate analysis).
There are no null values. The dataset contains a few categorical columns and a few integer columns.
The duration column has extreme minimum and maximum values.
There are 139 duplicate rows; however, because there is no primary key to confirm that they refer to the same record, the duplicates are not removed.
Outliers are present in continuous variables such as age, commission, duration, and sales; however, the CART model can handle outliers, so they are not treated.
The heat map shows that sales and commission are positively correlated.
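The correlation matrix behind such a heat map can be sketched as follows. The values are synthetic: `Commission` is deliberately generated from `Sales` so that the positive correlation appears, and the column names are assumptions based on the variables mentioned in the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
sales = rng.gamma(shape=2.0, scale=30.0, size=300)

# Synthetic stand-in for the continuous insurance variables.
df = pd.DataFrame({
    "Age": rng.integers(18, 80, size=300),
    "Sales": sales,
    "Commission": 0.25 * sales + rng.normal(0, 3, size=300),  # built to track Sales
})

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr(method="pearson")

print(corr.round(2))
```

Passing `corr` to a plotting call such as `seaborn.heatmap(corr, annot=True)` produces the heat map itself.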
2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial
Neural Network
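A minimal sketch of the split and the three classifiers is given below, using scikit-learn. Everything specific is an assumption: the data are synthetic, the 70/30 stratified split is illustrative, and the hyperparameters are placeholders rather than the tuned values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded insurance data (about 30% positive class).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.7, 0.3], random_state=0,
)

# Assumed 70/30 split, stratified so both parts keep the same claim rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Fit the scaler on the training part only; the neural network needs scaled
# inputs, and the tree-based models are unaffected by the scaling.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "CART": DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0),
    "ANN": MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    scores[name] = (model.score(X_train_s, y_train), model.score(X_test_s, y_test))
    print(name, round(scores[name][0], 3), round(scores[name][1], 3))
```

Limiting tree depth and leaf size is one common way to keep train and test accuracy close together; a grid search over such parameters would be the usual next step.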
CART Model:
Feature importance as per the Random Forest model:
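Feature importances can be read from a fitted random forest as sketched below. The data and the names `feature_0` to `feature_5` are hypothetical; only the first three synthetic features carry signal, so they are expected to rank highest in this toy example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = [f"feature_{i}" for i in range(6)]  # hypothetical names

# With shuffle=False the informative features are the first three columns;
# the remaining three are pure noise.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances; they are normalised to sum to 1.
importance = pd.Series(rf.feature_importances_, index=feature_names)
importance = importance.sort_values(ascending=False)

print(importance)
```

Impurity-based importances tend to favour variables with many distinct values, so they are best read as a ranking rather than as exact contributions.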
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each
model.
CART Model:
CART Conclusion:
Train Data: AUC: 81%, Accuracy: 78%, Precision: 66%, F1-score: 62%
Test Data: AUC: 79%, Accuracy: 78%, Precision: 65%, F1-score: 62%
The model has good accuracy, and the training and test results are almost the same, which suggests that it is not overfitting.
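Metrics of this kind (accuracy, confusion matrix, ROC curve, ROC_AUC score, classification report) can be produced along the following lines. This is a sketch on synthetic data with a placeholder decision tree; the helper `evaluate` is a hypothetical name and the numbers it prints are not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    roc_auc_score, roc_curve,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded insurance data.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.7, 0.3], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
cart = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)


def evaluate(model, X_part, y_part):
    """Collect the metrics named in the question for one data partition."""
    pred = model.predict(X_part)
    prob = model.predict_proba(X_part)[:, 1]  # probability of the positive class
    fpr, tpr, _ = roc_curve(y_part, prob)     # points for plotting the ROC curve
    return {
        "accuracy": accuracy_score(y_part, pred),
        "auc": roc_auc_score(y_part, prob),
        "confusion_matrix": confusion_matrix(y_part, pred),
        "roc_points": (fpr, tpr),
        "report": classification_report(y_part, pred),
    }


train_result = evaluate(cart, X_train, y_train)
test_result = evaluate(cart, X_test, y_test)

print("train accuracy:", round(train_result["accuracy"], 3))
print("test accuracy:", round(test_result["accuracy"], 3))
print("test AUC:", round(test_result["auc"], 3))
print(test_result["confusion_matrix"])
print(test_result["report"])
```

Running the same helper on the train and test partitions of every model gives directly comparable figures, and the gap between the two partitions indicates how much the model overfits.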
Agency_Code is the most important variable for predicting the target variable.
Random Forest Conclusion:
Train Data: AUC: 86%, Accuracy: 80%, Precision: 72%, F1-score: 66%
Test Data: AUC: 82%, Accuracy: 78%, Precision: 68%, F1-score: 62%
The model has good accuracy, and the training and test results are close to each other.
ANN Conclusion:
Train Data: AUC: 82%, Accuracy: 78%, Precision: 68%, F1-score: 59%
Test Data: AUC: 80%, Accuracy: 77%, Precision: 67%, F1-score: 57%
The model has good accuracy, similar to the other models, and the training and test results are almost the same.
2.4 Final Model: Compare all the models and write an inference which model is best/optimized.
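For all three models, the training and test results are roughly equal within each model.
Random Forest performs slightly better than the other two models: it has the highest training accuracy (80%) and the highest AUC on both training (86%) and test (82%) data, although its test accuracy (78%) matches that of CART.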
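The comparison can be made explicit by tabulating the reported figures and computing the train-test gap for each metric. The numbers below are copied from the results above; attributing the second and third sets of figures to Random Forest and ANN follows the order in which the models are introduced and is an assumption here. The helper name `train_test_gap` is hypothetical.

```python
# Reported metrics in percent, copied from the model conclusions.
reported = {
    "CART": {
        "train": {"auc": 81, "accuracy": 78, "precision": 66, "f1": 62},
        "test": {"auc": 79, "accuracy": 78, "precision": 65, "f1": 62},
    },
    "Random Forest": {  # assumed label for the second set of figures
        "train": {"auc": 86, "accuracy": 80, "precision": 72, "f1": 66},
        "test": {"auc": 82, "accuracy": 78, "precision": 68, "f1": 62},
    },
    "ANN": {  # assumed label for the third set of figures
        "train": {"auc": 82, "accuracy": 78, "precision": 68, "f1": 59},
        "test": {"auc": 80, "accuracy": 77, "precision": 67, "f1": 57},
    },
}


def train_test_gap(metrics: dict) -> dict:
    """Train minus test, per metric; large positive gaps hint at overfitting."""
    return {name: metrics["train"][name] - metrics["test"][name] for name in metrics["train"]}


gaps = {model: train_test_gap(metrics) for model, metrics in reported.items()}

# Model with the highest AUC on the test data.
best_test_auc = max(reported, key=lambda model: reported[model]["test"]["auc"])

for model, gap in gaps.items():
    print(model, gap)
print("highest test AUC:", best_test_auc)
```

On these figures Random Forest leads on test AUC but also shows the largest train-test gaps, so the choice between it and CART depends on how much weight is given to stability versus ranking quality.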
2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations?
Online conversion is high: about 90% of the records show insurance sold through the online channel.
Most sales happen through agencies, which is also an important factor, and claims likewise come through agencies, so the agency workflow needs to be understood before drawing a conclusion.
The models show that agency code is an important variable. There are four agency codes: EPX, C2B, CWT, and JZI.
Of these, the JZI agency has not performed well in sales, so its performance should be reviewed; if it remains unsatisfactory, the firm could consider working with other agencies.