Data Mining Project Shivani Pandey
PROJECT REPORT
1.1 Read the data and do exploratory data analysis. Describe the data briefly
Solution.
Answer:
• There are 7 variables and 210 records.
• The data has no missing records, based on initial analysis.
• All the variables are numeric.
• The data looks clean and well aligned.
• The means of spending and advance payments are almost equal, and the same holds for max spent in single shopping and current balance, along with credit limit and min payment amt.
• There are no outliers in the data set except in the min payment amount variable.
• The variables are right-skewed except for probability of full payment.
• Strong positive correlation is seen between the following pairs:
- spending & advance_payments (0.99)
- advance_payments & current_balance (0.97)
- credit_limit & spending (0.97)
- max_spent_in_single_shopping & current_balance (0.932)
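The correlation figures quoted in this answer are Pearson coefficients. As a minimal sketch of how one such coefficient is computed, assuming two columns held as plain Python lists (the values below are made up for illustration and are not rows from the project data):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only.
spending = [12.0, 14.5, 16.0, 18.5, 20.0]
advance_payments = [13.1, 14.2, 15.0, 16.3, 17.0]

r = pearson(spending, advance_payments)
print(round(r, 3))
```

A value close to +1 means the two variables rise together almost linearly, which is the pattern reported for spending and advance_payments.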
1.2 Do you think scaling is necessary for clustering in this case? Justify
Answer:
• Scaling is a technique to standardize the independent features in the data to a fixed range. It is performed during data pre-processing to handle highly varying magnitudes, values, or units.
• Standardization prevents variables with larger scales from dominating how clusters are defined.
• In this dataset, spending and advance_payments have higher values than the other variables and may therefore get more weightage in the distance calculation.
• Scaling brings all the values into the same relative range, so scaling should be performed on the dataset before clustering.
• The plot of the data before and after scaling is shown below.
• I have used zscore to standardize the data to the same relative scale, with most values falling roughly between -3 and +3.
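To make the z-score step concrete, here is a minimal standard-library sketch. It assumes the population standard deviation (the default of `scipy.stats.zscore`, which is presumably the `zscore` referred to above), and the two columns are made-up values chosen only to show different magnitudes:

```python
import statistics

def zscore(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Made-up columns on very different scales (illustration only).
spending = [11.2, 14.8, 15.3, 18.9, 20.1]
probability_of_full_payment = [0.82, 0.85, 0.87, 0.88, 0.91]

scaled_spending = zscore(spending)
scaled_probability = zscore(probability_of_full_payment)

# After scaling, both columns have mean ~0 and standard deviation ~1,
# so neither dominates a Euclidean distance.
print([round(v, 2) for v in scaled_spending])
print([round(v, 2) for v in scaled_probability])
```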
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and
briefly describe them
Answer:
In hierarchical clustering:
Records are sequentially grouped to create clusters, based on distances between records and distances between clusters.
Hierarchical clustering also produces a useful graphical display of the clustering process and results, called a dendrogram.
Strengths of Hierarchical Clustering are:
We perform clustering using the Ward linkage method, for both 3 and 4 clusters.
We can go for 3 clusters, as the records can then be categorized into three groups of high, medium and low with respect to the variables
max_spent_in_single_shopping and probability_of_full_payment.
cluster frequency
1 70
2 67
3 73
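The Ward linkage idea can be sketched without any library: at each step, merge the two clusters whose union gives the smallest increase in within-cluster sum of squares. The code below is a dependency-free illustration on made-up points, not the code used for this report. In practice, `scipy.cluster.hierarchy` (`linkage` with `method="ward"`, `dendrogram`, `fcluster`) would be the usual choice, and that is an assumption about tooling rather than something the report states.

```python
def ward_clusters(points, k):
    """Greedy agglomerative clustering with Ward's merge criterion."""
    clusters = [[p] for p in points]  # start with one cluster per record

    def centroid(cluster):
        return [sum(col) / len(cluster) for col in zip(*cluster)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        ca, cb = centroid(a), centroid(b)
        dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * dist2

    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Made-up, already-scaled points forming three visible groups.
low = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1)]
medium = [(5.0, 5.0), (5.1, 5.2), (5.2, 4.9)]
high = [(10.0, 0.0), (10.2, 0.1), (9.9, 0.2)]
points = low + medium + high

clusters = ward_clusters(points, k=3)
for number, members in enumerate(clusters, start=1):
    print(number, len(members))
```

Printing the cluster number and its size gives the same kind of frequency table as shown above.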
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette
score.
Answer:
o We assign each record to one of the k clusters according to its distance from each cluster centroid, so as to minimize a
measure of dispersion within the cluster.
o The ‘means’ in K-means refers to averaging of the data; that is, finding the centroid.
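The two steps described above (assign each record to the nearest centroid, then recompute each centroid as the mean of its records) can be sketched as follows. This is a simplified illustration with made-up points and hand-picked starting centroids. A library implementation such as `sklearn.cluster.KMeans` adds better initialization and convergence checks.

```python
import math

def kmeans(points, initial_centroids, iterations=10):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each record goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its records.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Made-up points and starting centroids (illustration only).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, initial_centroids=[(0.0, 0.0), (6.0, 6.0)])
print(labels)
```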
We apply k-means clustering for 3 and 4 clusters and choose 3 clusters, since the silhouette score for 3 clusters (0.40) is greater
than that for 4 clusters (0.327).
Based on the current dataset, a 3-cluster solution also makes sense in terms of the spending pattern (High, Medium, Low).
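For reference, the silhouette coefficient of a record is (b - a) / max(a, b), where a is its mean distance to the other records in its own cluster and b is the smallest mean distance to the records of any other cluster. The score used for the comparison is the mean over all records. The sketch below implements that definition on made-up points and made-up labelings. It does not reproduce the 0.40 and 0.327 figures, which come from the project data.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all records (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other records in the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the records of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Made-up points forming three visible groups.
points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
    (5.0, 5.0), (5.1, 5.2), (5.2, 4.9),
    (10.0, 0.0), (10.2, 0.1), (9.9, 0.2),
]
labels_3 = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels_4 = [0, 0, 3, 1, 1, 1, 2, 2, 2]  # splits one natural group

score_3 = silhouette_score(points, labels_3)
score_4 = silhouette_score(points, labels_4)
print(round(score_3, 3), round(score_4, 3))
```

Splitting a natural group lowers the score, which is the same reasoning used above to prefer 3 clusters over 4.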
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for
different clusters
Group 1: High Spending
• Giving more reward points might increase their purchases.
- Offering a discount or offer on the next transaction upon full payment.
- Increasing their credit limit and giving them premium benefits.
- Giving more loan options with lower interest rates, along with premium credit card options.
• Giving them new offers so they become more loyal and spend considerably more.
• Giving them payment reminders and tying up with local grocery stores and phone/gas/electricity providers for cashbacks.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, write an
inference on it?
Answer:
• There are 10 variables and 3000 records. There are no missing values.
• Age, commission, duration and sales are continuous (numeric); the others are categorical variables.
• There are 9 independent variables, and the target variable is Claimed. When we perform the descriptive
statistics, we see that duration has a negative minimum, which indicates an incorrect entry.
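The two checks mentioned here (a null count per column and a search for impossible values such as a negative duration) can be sketched as below. The records and the column names `Age`, `Duration` and `Sales` are hypothetical and made up for illustration. One missing value is included only to show what the check reports.

```python
# Hypothetical records (illustration only, not rows from the project data).
records = [
    {"Age": 36, "Duration": 34, "Sales": 20.0},
    {"Age": 28, "Duration": -1, "Sales": 9.5},
    {"Age": 45, "Duration": 10, "Sales": None},
]

# Null check: count missing values per column.
missing = {
    column: sum(row[column] is None for row in records)
    for column in records[0]
}

# Sanity check: a duration cannot be negative, so flag such rows.
negative_duration = [
    index
    for index, row in enumerate(records)
    if row["Duration"] is not None and row["Duration"] < 0
]

print(missing)
print(negative_duration)
```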
2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial Neural
Network
Answer:
Extracting the target column into separate vectors for the training set and test set, and performing standardization to
scale the data.
Splitting the data into train and test sets.
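A minimal sketch of the split, assuming a 70/30 ratio and a fixed random seed (the report does not state either, so both are assumptions). The helper name `split_train_test` is hypothetical. With scikit-learn, `sklearn.model_selection.train_test_split` does the same job.

```python
import random

def split_train_test(rows, labels, test_fraction=0.3, seed=1):
    """Shuffle row indices once, then slice into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_rows = [rows[i] for i in train_idx]
    test_rows = [rows[i] for i in test_idx]
    train_labels = [labels[i] for i in train_idx]
    test_labels = [labels[i] for i in test_idx]
    return train_rows, test_rows, train_labels, test_labels

# Made-up features and a made-up binary target (illustration only).
rows = [[i, i * 2] for i in range(10)]
labels = [i % 2 for i in range(10)]

train_rows, test_rows, train_labels, test_labels = split_train_test(rows, labels)
print(len(train_rows), len(test_rows))
```

Features and target are sliced with the same indices, so each row stays paired with its own label.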
Answer:
I am selecting CART as it has better accuracy, AUC score, recall, precision and F1 score than the other models.
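For reference, accuracy, precision, recall and F1 all follow from the confusion-matrix counts. The sketch below computes them for made-up predictions. It does not reproduce the scores of the models in this report, and AUC is left out because it needs predicted probabilities rather than class labels.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for a binary target (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels and predictions (illustration only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Computing the same dictionary for each model on the test set gives a like-for-like basis for the comparison described above.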
2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations
Answer:
• The data needs to be more streamlined, and more unstructured data can be added.
• The online channel seems to be a more convenient platform for insurance sales.
• People in Asia have claimed more than those in other regions.
• Another interesting fact is that sales are higher via travel agencies.
• Recommendations:
• Give more options of benefits and put emphasis on customer satisfaction.
• I strongly recommend that we collect more real-time unstructured data, and past data if possible.
• This is understood by looking at the insurance data by drawing relations between different variables such as day of the incident,
time, age group, and associating it with other external information such as location, behaviour patterns, weather information,
airline/vehicle types, etc.
• Streamlining online experiences benefitted customers, leading to an increase in conversions, which subsequently raised profits.
• Another interesting fact is that almost all the offline business has a claim associated with it; we need to find out why.
• We need to train the JZI agency staff to pick up sales, as they are at the bottom. We should run a promotional marketing campaign or
evaluate whether we need to tie up with an alternate agency.
• The model gives about 80% accuracy, so when a customer books airline tickets or plans, we can cross-sell the insurance
based on the claim data pattern.
• Another interesting fact is that more sales happen via Agency than Airlines, while the trend shows that more claims are processed at Airlines.
We may need to deep dive into the process to understand the workflow and the reason for this.
• Key performance indicators (KPIs) of insurance claims are:
- Reduce claims cycle time
- Increase customer satisfaction
- Combat fraud
- Optimize claims recovery
- Reduce claim handling costs
• Insights gained from data and AI-powered analytics could expand the boundaries of insurability, extend existing products, and give rise
to new risk transfer solutions in areas like non-damage business interruption and reputational damage.