Data Mining Project Shivani Pandey

This document summarizes a data mining project report on customer segmentation and insurance claim prediction. The report covers clustering bank customers based on credit card usage, compares decision tree, random forest and neural network models for insurance claim prediction, and concludes that the decision tree model performs best, with higher accuracy, AUC score, recall, precision and F1 score. Key business insights are that the online sales channel and travel agencies are more effective, and that claims are higher in Asia. The recommendations are to improve customer benefits and claims processing and to collect more detailed data.


DATA MINING

PROJECT REPORT

Name: Shivani Pandey
Batch: PGP DSBA, Dec 21
Date: 24/04/2022
Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers to its customers. They collected a sample that summarizes the activities of users during the past few months. You are given the task to identify the segments based on credit card usage.

1.1 Read the data and do exploratory data analysis. Describe the data briefly
Answer:
• There are 7 variables and 210 records.
• The data has no missing records based on initial analysis.
• All the variables are of numeric type.
• The data looks well aligned and clean.
• The means of spending and advance payments are almost equal; the same holds for max spent in single shopping and current balance, and for credit limit and min payment amount.
• There are no outliers in the data set except for the min payment amount variable.
• The variables are right skewed except for probability of full payment.
• Strong positive correlations are seen between the following pairs:
  spending & advance_payments (0.99)
  advance_payments & current_balance (0.97)
  credit_limit & spending (0.97)
  spending & current_balance (0.94)
  credit_limit & advance_payments (0.94)
  max_spent_in_single_shopping & current_balance (0.93)
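The checks above can be sketched as follows. This is a minimal illustration on a stand-in DataFrame with made-up values; with the real dataset, pd.read_csv(...) would supply the 210 records and 7 variables.

```python
# Minimal EDA sketch on a stand-in DataFrame (illustrative values only);
# with the real dataset, pd.read_csv(...) would load 210 records x 7 variables.
import pandas as pd

df = pd.DataFrame({
    "spending": [19.9, 15.3, 18.1, 10.8, 14.2, 13.0],
    "advance_payments": [16.9, 14.8, 16.0, 12.7, 14.1, 13.7],
    "min_payment_amt": [2.2, 5.2, 3.8, 4.8, 2.5, 3.3],
})

print(df.shape)              # records x variables
print(df.isnull().sum())     # missing-value check
print(df.describe().T)       # mean/std/min/max per variable
print(df.corr().round(2))    # pairwise correlations
```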
1.2 Do you think scaling is necessary for clustering in this case? Justify

Answer:

• Scaling is a technique to standardize the independent features present in the data to a fixed range. It is performed during data pre-processing to handle highly varying magnitudes, values or units.

• Standardization prevents variables with larger scales from dominating how clusters are defined.

• In this dataset, spending and advance_payments have higher values and may therefore get more weightage.

• Scaling brings all the variables into the same relative range, so scaling should be performed on the dataset before clustering.

• Plots of the data before and after scaling are shown below.

• I have used z-score to standardize the data to the same relative scale of roughly -3 to +3.
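A minimal sketch of the z-score standardization step (the column names are illustrative stand-ins for the dataset's variables):

```python
# Z-score standardization sketch: each column is transformed to
# (x - mean) / std, giving zero mean and unit variance per variable.
import pandas as pd
from scipy.stats import zscore

df = pd.DataFrame({
    "spending": [19.9, 15.3, 18.1, 10.8, 14.2],
    "advance_payments": [16.9, 14.8, 16.0, 12.7, 14.1],
})

scaled = df.apply(zscore)           # standardize column by column
print(scaled.mean().round(6))       # ~0 for every column
print(scaled.std(ddof=0).round(6))  # 1 for every column
```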
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using the dendrogram and briefly describe them
Answer:
In hierarchical clustering:
Records are sequentially grouped into clusters, based on distances between records and distances between clusters.
Hierarchical clustering also produces a useful graphical display of the clustering process and results, called a dendrogram.
Strengths of hierarchical clustering are:

No assumptions on the number of clusters

Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level

Hierarchical clustering may correspond to meaningful taxonomies.

We perform clustering using the ward linkage method for both 3 and 4 clusters.

We can go for 3 clusters, as the records can then be categorized into three groups of high, medium and low with respect to the variables max_spent_in_single_shopping and probability_of_full_payment.
cluster frequency
1 70
2 67
3 73

We can also use the average linkage method


Observation
• Both methods give almost similar means, with minor variation, which is expected.
• For cluster grouping based on the dendrogram, 3 or 4 clusters look good. After further analysis, I went with the 3-cluster solution from hierarchical clustering for this dataset.
• In real life, more variable values could have been captured - tenure, BALANCE_FREQUENCY, balance, purchases, instalment purchases, and others.
• The three-cluster solution gives a pattern of high/medium/low spending with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).
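The ward-linkage clustering and the 3-cluster dendrogram cut can be sketched as below; synthetic blobs stand in for the scaled feature matrix of 210 records:

```python
# Hierarchical (Ward) clustering sketch; X stands in for the scaled data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=210, centers=3, random_state=42)

Z = linkage(X, method="ward")                    # merge history
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters

print(np.bincount(labels)[1:])  # records per cluster
# dendrogram(Z) can be rendered with matplotlib to choose the cut level
```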

1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score.

Answer:

o A non-hierarchical approach to forming good clusters is to pre-specify a desired number of clusters, k.

o We assign each record to one of the k clusters according to its distance from each cluster, so as to minimize a measure of dispersion within the cluster.

o The 'means' in K-means refers to averaging of the data, that is, finding the centroid.

o K-means clustering is widely used in large-dataset applications.

We apply K-means clustering for 3 and 4 clusters and choose 3 clusters, since the silhouette score for 3 clusters (0.40) is greater than that for 4 clusters (0.327).
Based on the given dataset, the 3-cluster solution makes sense based on the spending pattern (High, Medium, Low).
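The K-means comparison for k = 3 and k = 4 can be sketched as below; synthetic blobs stand in for the scaled data, and the inertia values per k are what the elbow curve plots:

```python
# K-means sketch: inertia per k feeds the elbow curve, and the silhouette
# score compares cluster quality for k = 3 vs k = 4.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=210, centers=3, random_state=1)

scores = {}
for k in (3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    scores[k] = silhouette_score(X, km.labels_)
    print(k, round(km.inertia_, 1), round(scores[k], 3))
```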
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters
Group 1: High Spending
• Giving more reward points might increase their purchases.
• Offering discounts/offers on the next transaction upon full payment.
• Increasing their credit limit and giving them premium benefits.
• Giving more loan options with lower interest rates and premium credit card options.

Group 3: Medium Spending

• Promote premium cards/loyalty cards to increase transactions.

• Giving them new offers so they become more loyal and spend considerably more.

• Lowering interest rates on loans.

Group 2: Low Spending

• Giving them payment reminders and tying up with local grocery stores and phone/gas/electricity providers for cashbacks.

• Discounts on instant and early payments.


Problem 2: CART-RF-ANN
An insurance firm providing tour insurance is facing higher claim frequency. The management decides to collect data from the past few years. You are assigned the task to make a model which predicts the claim status and provide recommendations to management. Use CART, RF & ANN and compare the models' performances on train and test sets.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do a null value condition check, and write an inference on it.

Answer:
• There are 10 variables, with a total of 3000 records. There are no missing values.
• Age, commission, duration and sales are continuous/numeric; the others are categorical variables.
• There are 9 independent variables, and the target variable is Claimed.
• The descriptive statistics show that duration has a negative minimum, which indicates an incorrect entry.
• There are 139 duplicate rows in the dataset.
• There are 5 categorical variables: agency code, type, channel, product name, destination.
• We convert the categorical variables into numeric codes.
• There are outliers present in the data set for the continuous variables.
• Correlation is high between commission and sales (0.77).
• From the pair plot it is evident that the continuous variables are right skewed.
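A minimal sketch of these checks and the categorical-to-code conversion on stand-in data (the real column names come from the insurance dataset; the stand-in frame deliberately contains one duplicate row):

```python
# Null check, duplicate count, and categorical encoding sketch.
import pandas as pd

df = pd.DataFrame({
    "Age":     [35, 41, 35, 29],
    "Channel": ["Online", "Offline", "Online", "Online"],
    "Claimed": ["Yes", "No", "Yes", "No"],
})

print(df.isnull().sum())      # null-value check
print(df.duplicated().sum())  # number of duplicate rows
for col in df.select_dtypes("object"):
    df[col] = pd.Categorical(df[col]).codes  # encode categoricals as ints
print(df.dtypes)
```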

2.2 Data Split: Split the data into test and train, build classification models CART, Random Forest, Artificial Neural Network
Answer:

Extracting the target column into separate vectors for the training set and test set, and performing standardization to scale the data.
Splitting the data into train and test:

Test data is 30%; train data is 70%.

Building decision tree classifier


Building random forest classifier
Building artificial neural network
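The steps above can be sketched as follows on synthetic data; make_classification stands in for the insurance features and the Claimed target, and the hyperparameters shown are illustrative assumptions:

```python
# 70/30 split plus the three classifiers: CART, Random Forest, ANN.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=9, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=7, stratify=y)

scaler = StandardScaler().fit(X_train)  # fit scaling on train only
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

cart = DecisionTreeClassifier(max_depth=5, random_state=7).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=200, random_state=7).fit(X_train, y_train)
ann = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500,
                    random_state=7).fit(X_train_s, y_train)
print(cart.score(X_test, y_test), rf.score(X_test, y_test),
      ann.score(X_test_s, y_test))
```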
2.3 Performance Metrics: Check the performance of predictions on train and test sets using accuracy, confusion matrix, ROC curve plots and the ROC_AUC score for each model

Answer:

Performance metric for CART

1. For train dataset


2. For test dataset
For detailed code, please refer to the Python notebook.

Performance metric for random forest


For train data set
For test data
Performance metric for Artificial neural network

For train data


For test data
Training and test set results are almost similar, and with the overall measures high, the model is a good model.
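The metrics named above can be computed as in this sketch, where a decision tree on synthetic data stands in for each of the fitted models:

```python
# Accuracy, confusion matrix, and ROC_AUC for a fitted classifier.
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)
model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]   # class-1 probabilities for ROC
acc = accuracy_score(y_te, pred)
cm = confusion_matrix(y_te, pred)
auc = roc_auc_score(y_te, proba)
fpr, tpr, _ = roc_curve(y_te, proba)      # plot fpr vs tpr for the ROC curve
print(round(acc, 3), round(auc, 3))
print(cm)
```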
2.4 Final Model: Compare all the models and write an inference on which model is best/optimized

             CART    CART    Random Forest  Random Forest  Neural Network  Neural Network
             Train   Test    Train          Test           Train           Test
Accuracy     0.79    0.76    0.78           0.75           0.78            0.76
AUC          0.83    0.79    0.82           0.80           0.82            0.80
Recall       0.54    0.43    0.46           0.36           0.49            0.40
Precision    0.70    0.72    0.70           0.73           0.68            0.73
F1 Score     0.61    0.54    0.55           0.48           0.57            0.52

I am selecting CART, as it has better accuracy, AUC score, recall, precision and F1 score with respect to the other models.

2.5 Inference: Based on the whole analysis, what are the business insights and recommendations?

Answer:

• The data needs to be more streamlined, and more unstructured data can be added.
• The online channel seems to be the more convenient platform for insurance sales.
• People in Asia have claimed more than those in other regions.
• Another interesting fact is that more sales come via travel agencies.

Recommendations:

• Give more benefit options and put emphasis on customer satisfaction.

• The claims cycle should be reduced.

• I strongly recommend collecting more real-time unstructured data, and past data if possible. This can be understood by looking at the insurance data, drawing relations between variables such as day of the incident, time and age group, and associating them with external information such as location, behaviour patterns, weather information, airline/vehicle types, etc.

• Streamlining online experiences benefits customers, leading to an increase in conversions, which subsequently raises profits.

• As per the data, 90% of insurance is sold through the online channel.

• Another interesting fact is that almost all offline business has a claim associated with it; we need to find out why.

• The JZI agency resources need training to pick up sales, as they are at the bottom; we should run a promotional marketing campaign or evaluate whether to tie up with an alternate agency.

• Also, since the model gives around 80% accuracy, when a customer books airline tickets or plans, the insurance can be cross-sold based on the claim data pattern.

• Another interesting fact is that more sales happen via agencies than airlines, yet the trend shows claims are processed more at airlines. We may need to deep dive into the process to understand the workflow and why.

• Key performance indicators (KPIs) for insurance claims are: reduce claims cycle time, increase customer satisfaction, combat fraud, optimize claims recovery, and reduce claim-handling costs. Insights gained from data and AI-powered analytics could expand the boundaries of insurability, extend existing products, and give rise to new risk-transfer solutions in areas like non-damage business interruption and reputational damage.

-------------------- The End -------------------------------------------------------
