Data Mining Project Shivani Pandey

This document summarizes a data mining project report on customer segmentation and insurance claim prediction. The report covers clustering bank customers based on credit card usage, compares decision tree, random forest and neural network models for insurance claim prediction, and concludes that the decision tree model performs best, with higher accuracy, AUC score, recall, precision and F1 score. Key business insights are that the online sales channel and travel agencies are more effective, and that claims are higher in Asia. The recommendations are to improve customer benefits and claims processing and to collect more detailed data.


DATA MINING

PROJECT REPORT

Name: Shivani Pandey
Batch: PGP DSBA, Dec 21
Date: 24/04/2022
Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers to its customers. They collected a sample that summarizes the activities of users during the past few months. You are given the task to identify the segments based on credit card usage.

1.1 Read the data and do exploratory data analysis. Describe the data briefly
Answer:
• There are 7 variables and 210 records.
• The data has no missing records based on initial analysis.
• All the variables are of numeric type.
• The data looks well aligned and clean.
• The means of spending and advance payments are almost equal; the same holds for max spent in single shopping and current balance, and for credit limit and min payment amount.
• There are no outliers in the data set except for the min payment amount variable.
• The variables are right skewed except for probability of full payment.
• Strong positive correlations are seen between the following pairs:
  spending & advance_payments (0.99)
  advance_payments & current_balance (0.97)
  credit_limit & spending (0.97)
  spending & current_balance (0.94)
  credit_limit & advance_payments (0.94)
  max_spent_in_single_shopping & current_balance (0.93)
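The checks above can be sketched as follows. This is a minimal illustration on a stand-in DataFrame with made-up values; with the real dataset, pd.read_csv(...) would supply the 210 records and 7 variables.

```python
# Minimal EDA sketch on a stand-in DataFrame (illustrative values only);
# with the real dataset, pd.read_csv(...) would load 210 records x 7 variables.
import pandas as pd

df = pd.DataFrame({
    "spending": [19.9, 15.3, 18.1, 10.8, 14.2, 13.0],
    "advance_payments": [16.9, 14.8, 16.0, 12.7, 14.1, 13.7],
    "min_payment_amt": [2.2, 5.2, 3.8, 4.8, 2.5, 3.3],
})

print(df.shape)              # records x variables
print(df.isnull().sum())     # missing-value check
print(df.describe().T)       # mean/std/min/max per variable
print(df.corr().round(2))    # pairwise correlations
```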
1.2 Do you think scaling is necessary for clustering in this case? Justify

Answer:

• Scaling is a technique to standardize the independent features present in the data to a fixed range. It is performed during data pre-processing to handle highly varying magnitudes, values or units.

• Standardization prevents variables with larger scales from dominating how clusters are defined.

• In this dataset, spending and advance_payments have higher values and may therefore get more weightage.

• Scaling brings all the variables into the same relative range, so scaling should be performed on the dataset before clustering.

• Plots of the data before and after scaling are shown below.

• I have used z-score to standardize the data to the same relative scale of roughly -3 to +3.
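A minimal sketch of the z-score standardization step (the column names are illustrative stand-ins for the dataset's variables):

```python
# Z-score standardization sketch: each column is transformed to
# (x - mean) / std, giving zero mean and unit variance per variable.
import pandas as pd
from scipy.stats import zscore

df = pd.DataFrame({
    "spending": [19.9, 15.3, 18.1, 10.8, 14.2],
    "advance_payments": [16.9, 14.8, 16.0, 12.7, 14.1],
})

scaled = df.apply(zscore)           # standardize column by column
print(scaled.mean().round(6))       # ~0 for every column
print(scaled.std(ddof=0).round(6))  # 1 for every column
```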
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using the dendrogram and briefly describe them
Answer:
In hierarchical clustering:
Records are sequentially grouped into clusters, based on distances between records and distances between clusters.
Hierarchical clustering also produces a useful graphical display of the clustering process and results, called a dendrogram.
Strengths of hierarchical clustering are:

No assumptions on the number of clusters

Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level

Hierarchical clustering may correspond to meaningful taxonomies.

We perform clustering using the ward linkage method for both 3 and 4 clusters.

We can go for 3 clusters, as the records can then be categorized into three groups of high, medium and low with respect to the variables max_spent_in_single_shopping and probability_of_full_payment.
cluster frequency
1 70
2 67
3 73

We can also use the average linkage method


Observation
• Both methods give almost similar means, with minor variation, which is expected.
• For cluster grouping based on the dendrogram, 3 or 4 clusters look good. After further analysis, I went with the 3-cluster solution from hierarchical clustering for this dataset.
• In real life, more variable values could have been captured - tenure, BALANCE_FREQUENCY, balance, purchases, instalment purchases, and others.
• The three-cluster solution gives a pattern of high/medium/low spending with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).
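The ward-linkage clustering and the 3-cluster dendrogram cut can be sketched as below; synthetic blobs stand in for the scaled feature matrix of 210 records:

```python
# Hierarchical (Ward) clustering sketch; X stands in for the scaled data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=210, centers=3, random_state=42)

Z = linkage(X, method="ward")                    # merge history
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters

print(np.bincount(labels)[1:])  # records per cluster
# dendrogram(Z) can be rendered with matplotlib to choose the cut level
```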

1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score.

Answer:

o A non-hierarchical approach to forming good clusters is to pre-specify a desired number of clusters, k.

o We assign each record to one of the k clusters according to its distance from each cluster, so as to minimize a measure of dispersion within the cluster.

o The 'means' in K-means refers to averaging of the data, that is, finding the centroid.

o K-means clustering is widely used in large-dataset applications.

We apply K-means clustering for 3 and 4 clusters and choose 3 clusters, since the silhouette score for 3 clusters (0.40) is greater than that for 4 clusters (0.327).
Based on the given dataset, the 3-cluster solution makes sense based on the spending pattern (High, Medium, Low).
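The K-means comparison for k = 3 and k = 4 can be sketched as below; synthetic blobs stand in for the scaled data, and the inertia values per k are what the elbow curve plots:

```python
# K-means sketch: inertia per k feeds the elbow curve, and the silhouette
# score compares cluster quality for k = 3 vs k = 4.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=210, centers=3, random_state=1)

scores = {}
for k in (3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    scores[k] = silhouette_score(X, km.labels_)
    print(k, round(km.inertia_, 1), round(scores[k], 3))
```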
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters
Group 1: High Spending
• Giving more reward points might increase their purchases.
• Offering discounts/offers on the next transaction upon full payment.
• Increasing their credit limit and giving them premium benefits.
• Giving more loan options with lower interest rates and premium credit card options.

Group 3: Medium Spending

• Promote premium cards/loyalty cards to increase transactions.

• Giving them new offers so they become more loyal and spend considerably more.

• Lowering interest rates on loans.

Group 2: Low Spending

• Giving them payment reminders and tying up with local grocery stores and phone/gas/electricity providers for cashbacks.

• Discounts on instant and early payments.


Problem 2: CART-RF-ANN
An insurance firm providing tour insurance is facing higher claim frequency. The management decides to collect data from the past few years. You are assigned the task to make a model which predicts the claim status and provide recommendations to management. Use CART, RF & ANN and compare the models' performances on train and test sets.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do a null value condition check, and write an inference on it.

Answer:
• There are 10 variables, with a total of 3000 records. There are no missing values.
• Age, commission, duration and sales are continuous/numeric; the others are categorical variables.
• There are 9 independent variables, and the target variable is Claimed.
• The descriptive statistics show that duration has a negative minimum, which indicates an incorrect entry.
• There are 139 duplicate rows in the dataset.
• There are 5 categorical variables: agency code, type, channel, product name, destination.
• We convert the categorical variables into numeric codes.
• There are outliers present in the data set for the continuous variables.
• Correlation is high between commission and sales (0.77).
• From the pair plot it is evident that the continuous variables are right skewed.
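A minimal sketch of these checks and the categorical-to-code conversion on stand-in data (the real column names come from the insurance dataset; the stand-in frame deliberately contains one duplicate row):

```python
# Null check, duplicate count, and categorical encoding sketch.
import pandas as pd

df = pd.DataFrame({
    "Age":     [35, 41, 35, 29],
    "Channel": ["Online", "Offline", "Online", "Online"],
    "Claimed": ["Yes", "No", "Yes", "No"],
})

print(df.isnull().sum())      # null-value check
print(df.duplicated().sum())  # number of duplicate rows
for col in df.select_dtypes("object"):
    df[col] = pd.Categorical(df[col]).codes  # encode categoricals as ints
print(df.dtypes)
```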

2.2 Data Split: Split the data into test and train, build classification models CART, Random Forest, Artificial Neural Network
Answer:

Extracting the target column into separate vectors for the training set and test set, and performing standardization to scale the data.
Splitting the data into train and test:

Test data is 30%; train data is 70%.

Building decision tree classifier


Building random forest classifier
Building artificial neural network
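The steps above can be sketched as follows on synthetic data; make_classification stands in for the insurance features and the Claimed target, and the hyperparameters shown are illustrative assumptions:

```python
# 70/30 split plus the three classifiers: CART, Random Forest, ANN.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=9, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=7, stratify=y)

scaler = StandardScaler().fit(X_train)  # fit scaling on train only
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

cart = DecisionTreeClassifier(max_depth=5, random_state=7).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=200, random_state=7).fit(X_train, y_train)
ann = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500,
                    random_state=7).fit(X_train_s, y_train)
print(cart.score(X_test, y_test), rf.score(X_test, y_test),
      ann.score(X_test_s, y_test))
```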
2.3 Performance Metrics: Check the performance of predictions on train and test sets using accuracy, confusion matrix, ROC curve plots and the ROC_AUC score for each model

Answer:

Performance metric for CART

1. For train dataset


2. For test dataset
For detailed code, please refer to the Python notebook.

Performance metric for random forest


For train data set
For test data
Performance metric for Artificial neural network

For train data


For test data
Training and test set results are almost similar, and with the overall measures high, the model is a good model.
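The metrics named above can be computed as in this sketch, where a decision tree on synthetic data stands in for each of the fitted models:

```python
# Accuracy, confusion matrix, and ROC_AUC for a fitted classifier.
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)
model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]   # class-1 probabilities for ROC
acc = accuracy_score(y_te, pred)
cm = confusion_matrix(y_te, pred)
auc = roc_auc_score(y_te, proba)
fpr, tpr, _ = roc_curve(y_te, proba)      # plot fpr vs tpr for the ROC curve
print(round(acc, 3), round(auc, 3))
print(cm)
```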
2.4 Final Model: Compare all the models and write an inference on which model is best/optimized

             CART    CART    Random Forest  Random Forest  Neural Network  Neural Network
             Train   Test    Train          Test           Train           Test
Accuracy     0.79    0.76    0.78           0.75           0.78            0.76
AUC          0.83    0.79    0.82           0.80           0.82            0.80
Recall       0.54    0.43    0.46           0.36           0.49            0.40
Precision    0.70    0.72    0.70           0.73           0.68            0.73
F1 Score     0.61    0.54    0.55           0.48           0.57            0.52

I am selecting CART, as it has better accuracy, AUC score, recall, precision and F1 score with respect to the other models.

2.5 Inference: Based on the whole analysis, what are the business insights and recommendations?

Answer:

• The data needs to be more streamlined, and more unstructured data can be added.
• The online channel seems to be the more convenient platform for insurance sales.
• People in Asia have claimed more than those in other regions.
• Another interesting fact is that more sales come via travel agencies.

Recommendations:

• Give more benefit options and put emphasis on customer satisfaction.

• The claims cycle should be reduced.

• I strongly recommend collecting more real-time unstructured data, and past data if possible. This can be understood by looking at the insurance data, drawing relations between variables such as day of the incident, time and age group, and associating them with external information such as location, behaviour patterns, weather information, airline/vehicle types, etc.

• Streamlining online experiences benefits customers, leading to an increase in conversions, which subsequently raises profits.

• As per the data, 90% of insurance is sold through the online channel.

• Another interesting fact is that almost all offline business has a claim associated with it; we need to find out why.

• The JZI agency resources need training to pick up sales, as they are at the bottom; we should run a promotional marketing campaign or evaluate whether to tie up with an alternate agency.

• Also, since the model gives around 80% accuracy, when a customer books airline tickets or plans, the insurance can be cross-sold based on the claim data pattern.

• Another interesting fact is that more sales happen via agencies than airlines, yet the trend shows claims are processed more at airlines. We may need to deep dive into the process to understand the workflow and why.

• Key performance indicators (KPIs) for insurance claims are: reduce claims cycle time, increase customer satisfaction, combat fraud, optimize claims recovery, and reduce claim-handling costs. Insights gained from data and AI-powered analytics could expand the boundaries of insurability, extend existing products, and give rise to new risk-transfer solutions in areas like non-damage business interruption and reputational damage.

-------------------- The End -------------------------------------------------------
