SMDM Project Report

This document provides a detailed analysis of an automotive customer dataset. It examines the dataset variables, checks for data quality issues, explores relationships between variables, and draws insights. It aims to help devise an improved marketing strategy by segmenting customers. The analysis looks at the dataset thoroughly using descriptive statistics and data visualizations to understand customer demographics and purchase behavior.

Uploaded by

Abhishek Abhi
SMDM PROJECT REPORT

DSBA
Table of Contents
Problem 1
A. What is the important technical information about the dataset that a database administrator would be interested in? (Hint: Information about the size of the dataset and the nature of the variables)
B. Take a critical look at the data and do a preliminary analysis of the variables. Do a quality check of the data so that the variables are consistent. Are there any discrepancies present in the data?
C. Explore all the features of the data separately by using appropriate visualizations and draw insights that can be utilized by the business.
D. Understanding the relationships among the variables in the dataset is crucial for every analytical project. Perform analysis on the data fields to gain deeper insights. Comment on your understanding of the data.
E. Employees working on the existing marketing campaign have made the following remarks. Based on the data and your analysis state whether you agree or disagree with their observations. Justify your answer based on the data available.
a. Steve Roger says “Men prefer SUV by a large margin, compared to the women”
b. Ned Stark believes that a salaried person is more likely to buy a Sedan.
c. Sheldon Cooper does not believe any of them; he claims that a salaried male is an easier target for a SUV sale over a Sedan Sale.
F. From the given data, comment on the amount spent on purchasing automobiles across the following categories. Comment on how a Business can utilize the results from this exercise. Give justification along with presenting metrics/charts used for arriving at the conclusions.
a. Gender
b. Personal_loan
G. From the current data set comment if having a working partner leads to the purchase of a higher-priced car.
H. The main objective of this analysis is to devise an improved marketing strategy to send targeted information to different groups of potential buyers present in the data. For the current analysis use the Gender and Marital_status fields to arrive at groups with similar purchase history.
Problem 2
Problem 1
A. What is the important technical information about the dataset that a database administrator
would be interested in?
Looking at the DataFrame information, we can infer the following.
Shape of the dataset:
• Total rows = 1581
• Total columns = 14
Data types of the 14 variables:
Sl No   Object            Int64               Float64
1       Gender            Age                 Partner_salary
2       Profession        Salary
3       Marital_status    Total_salary
4       Education         Price
5       Personal_loan     No_of_Dependents
6       House_loan
7       Partner_working
8       Make

The 14 variables split into 9 categorical and 5 continuous variables.

Sl No   Categorical         Continuous
1       Gender              Age
2       Profession          Salary
3       Marital_status      Total_salary
4       Education           Price
5       Personal_loan       Partner_salary
6       House_loan
7       Partner_working
8       Make
9       No_of_Dependents
• Gender and Partner_salary have some missing (null) values that need to be imputed.
• The size of the dataset is 173.0+ KB. The size can be reduced by converting the
No_of_Dependents variable into object type.
• There are no duplicates in the dataset.
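
The checks above can be reproduced with a few pandas calls. The block below is a minimal sketch on a small hypothetical stand-in DataFrame (the values are invented for illustration; only the column names come from the report), assuming the analysis was done with pandas.

```python
import pandas as pd

# Hypothetical stand-in for the automobile dataset (the real one has 1581 rows).
df = pd.DataFrame({
    "Gender": ["Male", "Femal", None, "Female"],
    "Age": [30, 41, 25, 36],
    "Partner_salary": [0.0, 25000.0, None, 40000.0],
    "No_of_Dependents": [2, 3, 0, 1],
})

n_rows, n_cols = df.shape                       # number of rows and columns
dtypes = df.dtypes.astype(str).to_dict()        # storage type of each column
null_counts = df.isna().sum().to_dict()         # missing values per column
n_duplicates = int(df.duplicated().sum())       # fully duplicated rows
memory_bytes = int(df.memory_usage(deep=True).sum())  # approximate size in bytes

print(n_rows, n_cols)
print(dtypes)
print(null_counts)
print(n_duplicates, memory_bytes)
```

On the real dataset, the same calls would be expected to report the shape, data types, null counts and duplicate count described above.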

B. Take a critical look at the data and do a preliminary analysis of the variables. Do a quality
check of the data so that the variables are consistent. Are there any discrepancies present in the
data?
A detailed look at the data gives the following information:
• Looking at the unique values of the categorical variables, only Gender has incorrect
entries. It has 4 unique non-null values, including the misspellings Femal and Femle, which
need to be replaced with the correct value, Female. The other categorical variables have
correct unique values.
Gender
array(['Male', 'Femal', 'Female', nan, 'Femle'], dtype=object)
Profession
array(['Business', 'Salaried'], dtype=object)
Marital_status
array(['Married', 'Single'], dtype=object)
Education
array(['Post Graduate', 'Graduate'], dtype=object)
No_of_dependents
array([4, 3, 2, 1, 0], dtype=object)
Personal_loan
array(['No', 'Yes'], dtype=object)
House_loan
array(['No', 'Yes'], dtype=object)
Partner_working
array(['Yes', 'No'], dtype=object)
Make
array(['SUV', 'Sedan', 'Hatchback'], dtype=object)
The Gender and Partner_salary columns have some NULL values that need imputation.
For the Gender variable, 53 values are missing, which is about 3% of the total.
Dropping these records does not make sense, because we would lose the values of the
other variables for the same records. Imputing with Other/Unknown would create a new
category with too few records to analyze reliably. No other variable can tell us the
gender of these records, so we impute the missing values with the mode.
Gender:
Male 76%
Female 21%
NULL 3%
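
The Gender clean-up described above (fixing the misspelled labels, then imputing nulls with the mode) can be sketched as follows. The sample series is hypothetical; only the labels Femal, Femle, Female and Male come from the report.

```python
import pandas as pd

# Hypothetical sample reproducing the misspellings reported above.
gender = pd.Series(["Male", "Femal", "Female", None, "Femle", "Male", "Male", "Male"])

# Map the misspelled labels to the correct category.
gender = gender.replace({"Femal": "Female", "Femle": "Female"})

# Impute the remaining missing values with the most frequent category (the mode).
mode_value = gender.mode()[0]
gender = gender.fillna(mode_value)

print(gender.value_counts())
```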

For the Partner_salary variable, 106 values are missing, which is 6.7% of the total.
This variable depends on Marital_status and Partner_working.
• For Marital_status = Single, we first check whether any record has Partner_working = Yes.
Such records would be bad data: we would need to check with the business whether
Marital_status is wrong, or, if it is correct, set Partner_working = No and
Partner_salary = 0. Luckily, the dataset has no such records.
• For Marital_status = Single, the only valid value is Partner_working = No, and
Partner_salary should be 0. We can impute missing Partner_salary values with 0 for
these records.
• For Marital_status = Married and Partner_working = No, Partner_salary should also be 0.
We can impute missing Partner_salary values with 0 for these records.
• For Marital_status = Married and Partner_working = Yes, Partner_salary should have a
non-zero value. Missing values in these records can be imputed using the formula
Partner_salary = Total_salary - Salary.
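
The imputation rules for Partner_salary can be sketched in pandas as below. The rows are hypothetical and chosen to cover each case; the column names and the formula Partner_salary = Total_salary - Salary are the ones stated in the report.

```python
import pandas as pd

# Hypothetical rows covering the cases listed above.
df = pd.DataFrame({
    "Marital_status": ["Single", "Married", "Married", "Married"],
    "Partner_working": ["No", "No", "Yes", "Yes"],
    "Salary": [50000, 60000, 70000, 55000],
    "Partner_salary": [None, None, None, 30000.0],
    "Total_salary": [50000, 60000, 110000, 85000],
})

# Sanity check: a single person should never have a working partner.
bad_rows = df[(df["Marital_status"] == "Single") & (df["Partner_working"] == "Yes")]

missing = df["Partner_salary"].isna()
not_working = df["Partner_working"] == "No"

# Partner not working (single or married): partner salary is 0.
df.loc[missing & not_working, "Partner_salary"] = 0

# Partner working: recover the value as Total_salary - Salary.
working_missing = missing & ~not_working
df.loc[working_missing, "Partner_salary"] = (
    df.loc[working_missing, "Total_salary"] - df.loc[working_missing, "Salary"]
)

print(df)
```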
Next, we check the continuous variables for outliers. Total_salary has outliers; since they
make up only 1.7% of the records, they can be treated using the IQR method.
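
A minimal sketch of IQR-based outlier treatment follows. It assumes the common convention of capping values at the fences Q1 - 1.5 × IQR and Q3 + 1.5 × IQR; the report does not state which variant was used, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical total salaries with one extreme value.
total_salary = pd.Series([30000, 60000, 70000, 80000, 90000, 100000, 400000])

q1 = total_salary.quantile(0.25)
q3 = total_salary.quantile(0.75)
iqr = q3 - q1

# Fences: values outside [lower, upper] are treated as outliers (assumed 1.5 x IQR rule).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (total_salary < lower) | (total_salary > upper)
outlier_share = is_outlier.mean() * 100  # percentage of outlying records

# Cap (winsorize) the outliers at the fences instead of dropping the rows.
capped = total_salary.clip(lower=lower, upper=upper)

print(lower, upper, round(outlier_share, 1))
print(capped.tolist())
```

Capping keeps every record, which matters here because the other columns of those rows are still valid.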

Skewness of the continuous variables:

Skewness of Age: 0.8930870865867485
Skewness of Salary: -0.011570808595835032
Skewness of Partner_salary: 0.4410686067568632
Skewness of Total_salary: 0.4244118094953136
Skewness of Price: 0.740873956667395

Except for Salary, which is almost symmetric (skewness close to 0), all the variables are
positively (right) skewed. Age is the most right-skewed, followed by Price; Partner_salary and
Total_salary are only mildly right-skewed.
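
As an illustration of how such skewness values are obtained, the sketch below computes the sample skewness of a hypothetical right-skewed age series with pandas and compares it with a symmetric one.

```python
import pandas as pd

# Hypothetical ages: most values are low, with a few much older buyers (right tail).
ages = pd.Series([22, 23, 24, 25, 26, 27, 28, 30, 45, 54])
symmetric = pd.Series([1, 2, 3, 4, 5])

age_skew = ages.skew()        # positive: long right tail
sym_skew = symmetric.skew()   # zero for a symmetric sample

# For a right-skewed sample the mean is usually pulled above the median.
print(age_skew, ages.mean(), ages.median())
print(sym_skew)
```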

Describing the dataset (before outlier treatment):

• Age: the mean age is 32 and the median is 29; the minimum and maximum are 22 and 54.
As expected for automobile buyers, no record has an age below 18.
• Salary: the mean salary is approx. 60392 and the median is 59500. The two are very close,
so the distribution is roughly symmetric. The minimum and maximum salary are 30000 and 99300.

• Partner_salary: the minimum and maximum are 0 and 80500, with a mean of 19233 and a
median of 25100. The mean lies below the median, although the skewness coefficient
reported above is positive (0.44).

• Total_salary: the minimum and maximum are 30000 and 171000, with a mean of 79625 and a
median of 78000. The mean is slightly above the median, consistent with the mild right skew.

• Price: the minimum and maximum price of the automobile are 18000 and 70000. The mean is
35597 and the median is 31000; the mean is well above the median, consistent with the
right skew seen above.

After treating the outliers:

• Total_salary: the minimum remains 30000, but the maximum is now adjusted to 149000. The
mean is now 79398 and the median remains 78000, so the distribution is closer to symmetric.
C. Explore all the features of the data separately by using appropriate visualizations and draw
insights that can be utilized by the business.

First, for the categorical columns we can plot countplots/pie charts for better understanding.
• Gender: most of the customers in the dataset are male.

• Profession: more salaried people than business people are buying cars. So, the marketing
strategy can focus on salaried people for higher sales.
• Marital Status: married buyers outnumber single buyers, which suggests that married
people are more likely to buy cars. The marketing strategy can therefore focus more on
married people for higher sales/profit.

• Education: more Post Graduate than Graduate people are buying cars. So, we can focus the
marketing strategy on Post Graduate people to sell more cars.
• No_of_Dependents: buyers with 2 or 3 dependents outnumber buyers with 0, 1 or 4
dependents. The suggested marketing strategy is to focus on people with 2 or 3 dependents.

• Personal_loan: buyers are split almost equally between those with and those without a
personal loan, so this variable can be ignored when devising the marketing strategy. In
combination with other variables, however, it might tell a different story and give more
insight.
• House_loan: people with no house loan buy more cars than people with a house loan. The
suggestion to the business is to focus the marketing strategy on people with no house loan
in order to increase sales.

• Partner_working: the initial inference is that people with a working partner are more
likely to buy a car than people who are single or married with a non-working partner. So
the marketing strategy should focus on people with a working partner.
• Make: the most popular car is the Sedan, followed by the Hatchback and the SUV. So the
marketing strategy should focus on selling Sedans first and then Hatchbacks, as they are
the customers' more likely choices. Including other variables can change this picture.

For the continuous columns we can plot histograms/boxplots for better understanding.

Salary: the plot shows no outliers, and the mean and median are close to each other, with a
very slight negative (left) skew. The average salary is about 60000, and the marketing strategy
can segment customers around it, e.g. promoting higher-priced cars to people with salary > 60000
and lower-priced cars to people with salary < 60000. The distribution is also multi-modal.
Partner Salary: partner salary ranges from 0 to about 80000. The median (approx. 25000) is
greater than the mean (approx. 19000), while the skewness coefficient is positive. We can't
deduce anything concrete here; Total Salary is the better variable for devising the marketing
strategy. The distribution is bimodal.

Total Salary: the mean and median are close to each other, but the distribution is right-skewed.
Price: the Price variable is right-skewed and bimodal.

Age: the Age variable is right-skewed.
D. Understanding the relationships among the variables in the dataset is crucial for every
analytical project. Perform analysis on the data fields to gain deeper insights. Comment on your
understanding of the data.
First, we look at the pairplot and the correlation of the continuous variables.

Pairplot: there is a linear relationship for the following pairs:
Age and Salary
Age and Total Salary
Age and Price
Salary and Total Salary
Price and Salary
Partner Salary and Total Salary
Total Salary and Price
The correlation matrix shows the same relationships, and a heatmap makes them easier to read.

All the continuous variables are positively correlated, and the following pairs are highly
correlated:
Age and Price
Age and Salary
Salary and Total Salary
Partner Salary and Total Salary
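
A correlation matrix like the one discussed above can be computed with `DataFrame.corr()`, which uses the Pearson coefficient by default. The sketch below uses hypothetical salary values; Total_salary is built as Salary + Partner_salary, so it correlates strongly with both.

```python
import pandas as pd

# Hypothetical salaries; Total_salary is the sum of the two components.
df = pd.DataFrame({
    "Salary": [40000, 50000, 60000, 70000, 80000],
    "Partner_salary": [0, 20000, 0, 30000, 40000],
})
df["Total_salary"] = df["Salary"] + df["Partner_salary"]

corr = df.corr()  # pairwise Pearson correlation of the numeric columns

print(corr.round(2))
```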
Next, we analyze Gender against Make. Males buy more cars than females. Males prefer Sedans
and Hatchbacks the most, while females prefer SUVs and Sedans the most. The marketing
strategy could focus on these groups to boost sales of their preferred vehicles, and discounts
or promotional offers could be rolled out to attract more buyers from the other groups.
Adding Profession to the picture, we can drill down further: among males, Sedans are preferred
most by salaried males and Hatchbacks by business males. Among females, SUVs and Sedans are
preferred by both salaried and business females, with SUVs clearly the most popular. Business
females do not buy Hatchbacks at all.
Adding Marital Status as well, married males in both professions are the highest buyers of
Sedans and Hatchbacks, and married females in both professions are the highest buyers of SUVs
and Sedans. So the marketing strategy could be to send SMS/ads for Sedans and Hatchbacks to
males, and SMS/ads for Sedans and SUVs to females.
From the plot, we can infer that males with 2 or 3 dependents and females with 1 or 2
dependents buy the most cars. The marketing strategy here could focus on customers with 1 to 3
dependents.
From the plot, we can infer that people buying SUVs have a higher average salary than people
buying Sedans or Hatchbacks. People with a salary below approx. $125000 mostly prefer Sedans
and Hatchbacks.
From the plot, we can infer that single people mostly buy lower-priced cars than married
people. Married people whose Total Salary ranges from $50000 to $110000 still buy lower-priced
cars, in the range of $18000 to $35000.
E. Employees working on the existing marketing campaign have made the following remarks.
Based on the data and your analysis state whether you agree or disagree with their observations.
Justify your answer based on the data available.

E1) Steve Roger says “Men prefer SUV by a large margin, compared to the women”

We made a crosstab of Gender and Make to get the counts.
It shows that the SUV count for females is higher than for males. The countplot shows the
same.
So, the statement made by Steve Roger is false.
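
A Gender by Make crosstab of this kind can be built with `pd.crosstab`. The sketch below uses hypothetical purchases and also shows row-normalized shares, which compare preferences fairly when one gender has many more buyers than the other.

```python
import pandas as pd

# Hypothetical purchases: 6 male and 4 female buyers.
df = pd.DataFrame({
    "Gender": ["Male"] * 6 + ["Female"] * 4,
    "Make": ["Sedan", "Sedan", "Sedan", "Hatchback", "Hatchback", "SUV",
             "SUV", "SUV", "SUV", "Sedan"],
})

# Raw counts per Gender and Make.
counts = pd.crosstab(df["Gender"], df["Make"])

# Share of each Make within a gender (each row sums to 1).
shares = pd.crosstab(df["Gender"], df["Make"], normalize="index")

print(counts)
print(shares.round(2))
```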
E2) Ned Stark believes that a salaried person is more likely to buy a Sedan.

The counts show that salaried people buy more Sedans than business people. The countplot
shows the same.

So, the statement made by Ned Stark is true.


E3) Sheldon Cooper does not believe any of them; he claims that a salaried male is an easier
target for a SUV sale over a Sedan Sale.

The counts show that salaried males buy more Sedans than SUVs. The plot shows the same.

So, the statement made by Sheldon Cooper is false.
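
The checks behind E2 and E3 amount to filtering and counting. A minimal sketch on hypothetical rows (only the column names and category labels come from the report):

```python
import pandas as pd

# Hypothetical purchases.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female", "Male"],
    "Profession": ["Salaried", "Salaried", "Salaried", "Business", "Salaried", "Salaried"],
    "Make": ["Sedan", "Sedan", "SUV", "Hatchback", "SUV", "Hatchback"],
})

# E2: who buys the Sedans, by profession?
sedans_by_profession = df.loc[df["Make"] == "Sedan", "Profession"].value_counts()

# E3: what do salaried males buy?
salaried_males = df[(df["Gender"] == "Male") & (df["Profession"] == "Salaried")]
make_counts = salaried_males["Make"].value_counts()

print(sedans_by_profession)
print(make_counts)
```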


F. From the given data, comment on the amount spent on purchasing automobiles across the
following categories. Comment on how a Business can utilize the results from this exercise.
Give justification along with presenting metrics/charts used for arriving at the conclusions.
F1) Gender
We plotted a bar graph of Gender vs. Price. From it, we can infer that on average females
spend more than males when buying cars.
F2) Personal_loan
We plotted a bar graph of Personal Loan vs. Price. From it, we can infer that on average
people with no personal loan spend more on cars than people with a personal loan.
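
The averages behind such bar graphs can be computed with a groupby. The sketch below uses hypothetical prices; it is meant to show the calculation, not to reproduce the report's figures.

```python
import pandas as pd

# Hypothetical purchases.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Personal_loan": ["Yes", "No", "Yes", "No"],
    "Price": [30000, 34000, 45000, 51000],
})

# Average, median and number of purchases per gender.
price_by_gender = df.groupby("Gender")["Price"].agg(["mean", "median", "count"])

# Average price by personal-loan status.
price_by_loan = df.groupby("Personal_loan")["Price"].mean()

print(price_by_gender)
print(price_by_loan)
```

Reporting the count next to the mean helps the business judge how much weight a group average deserves.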
G. From the current data set comment if having a working partner leads to the purchase of a
higher-priced car.
From the plot, we can see that the median price is the same for both groups, but the
average spend is slightly greater for the group where the partner is not working. So, we can
conclude that having a working partner does not lead to the purchase of a higher-priced car.
However, only married people can have a working partner; singles never do. Restricting the
dataset to records with Marital_status = Married therefore gives a more accurate picture.
Within the married group, the median price is slightly higher for the non-working-partner
group than for the working-partner group, and the average spend is also slightly greater for
the group where the partner is not working.
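
The married-only comparison can be sketched as a filter followed by a groupby. The rows below are hypothetical and only illustrate the steps.

```python
import pandas as pd

# Hypothetical purchases; singles never have a working partner.
df = pd.DataFrame({
    "Marital_status": ["Married", "Married", "Married", "Married", "Single", "Single"],
    "Partner_working": ["Yes", "Yes", "No", "No", "No", "No"],
    "Price": [30000, 32000, 31000, 35000, 20000, 22000],
})

# Keep only married customers so that singles do not distort the "No" group.
married = df[df["Marital_status"] == "Married"]

# Median and mean price by partner's working status.
summary = married.groupby("Partner_working")["Price"].agg(["median", "mean"])

print(summary)
```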
H. The main objective of this analysis is to devise an improved marketing strategy to send
targeted information to different groups of potential buyers present in the data. For the current
analysis use the Gender and Marital_status - fields to arrive at groups with similar purchase
history.

From the plot, we can see that Married Males purchase far more Sedans and Hatchbacks than
the other groups.
SUVs are most popular among Married Females.
Single Females buy the fewest cars of any kind.
After Married Males, Married Females are the next-largest buyers of Sedans.
The marketing strategy could be:
Most marketing for Sedans and Hatchbacks can target Married Males, as these are their
preferred choices; likewise, SUV marketing can target Married Females.
Targeted SMS or YouTube ads can be sent to these groups to attract sales. The mode of Make
within each group supports the same conclusion.
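
One way to obtain the most frequent Make for each Gender and Marital_status group is a two-level crosstab followed by `idxmax`. This is a sketch on hypothetical rows, not necessarily the method used in the report.

```python
import pandas as pd

# Hypothetical purchases.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female", "Male", "Female"],
    "Marital_status": ["Married", "Married", "Married", "Married", "Married",
                       "Married", "Single", "Single"],
    "Make": ["Sedan", "Sedan", "Hatchback", "SUV", "SUV", "Sedan", "Hatchback", "Sedan"],
})

# Purchase counts per (Gender, Marital_status) group and Make.
group_counts = pd.crosstab([df["Gender"], df["Marital_status"]], df["Make"])

# Most frequent Make (the mode) within each group.
top_make = group_counts.idxmax(axis=1)

print(group_counts)
print(top_make)
```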
Problem 2
Framing an Analytics Problem: analyse the dataset and list down the top 5 important variables,
along with the business justifications.
Looking at the problem statement and the variables, I believe the following are the important
variables:

card_type
cc_active30
cc_active60
cc_active90
annual_income_at_source
other_bank_cc_holding
bank_vintage
T+1_month_activity
T+2_month_activity
T+3_month_activity
T+6_month_activity
T+12_month_activity
Transactor_revolver
avg_spends_l3m
Occupation_at_source
cc_limit

As per my business understanding, the following are the most important variables:
1. cc_active30
2. hotlist_flag
3. Tth month, T+1_month_activity
4. bank_vintage
5. Transactor_revolver
6. avg_spends_l3m
7. Occupation
8. cc_limit

1. What is active_30? What does it represent? How is it different from cc_active30?
(Similarly active_60, active_90 and cc_active60, cc_active90.)

active stands for account activity on, for example, a savings or current account. It
describes the customer's transaction activity over the recent past.
There are 3 such variables in the data: active_30, active_60 and active_90.
active_30 stands for account activity within the last 30 days.
0 = 5978, 1 = 2470
active_60 stands for account activity within the last 60 days.
0 = 4268, 1 = 4180
active_90 stands for account activity within the last 90 days.
0 = 3024, 1 = 5424

cc_active stands for credit card activity. It shows the customer's credit card spend over
the recent past, i.e. whether they have used the credit card recently or not.
There are 3 such variables in the data: cc_active30, cc_active60 and cc_active90.
cc_active30 stands for credit card usage within the last 30 days.
0 = 6048, 1 = 2400
cc_active60 stands for credit card usage within the last 60 days.
0 = 4355, 1 = 4093
cc_active90 stands for credit card usage within the last 90 days.
0 = 3106, 1 = 5342
What do 0 and 1 denote in active_30, cc_active30, T+1_month_activity, etc.?

1 denotes active and 0 denotes not active.

2. What is hotlist_flag? What does it represent?

Hotlist:
If there is any problem with a card, it is blocked, or hotlisted.
The need for this usually arises when someone loses or misplaces their card. Hotlisted cards
cannot be used for transactions any more.
Flagging:
Transactions that are inconsistent with a customer’s known financial profile, or that lack a
clear source or business purpose, may be considered suspicious by banks.
A large number of online purchases in a short period of time, or multiple purchases in rapid
succession, is also likely to get a credit card account flagged.
In the given data:
N = 8410, Y = 38.
"Y" means that the card is hotlisted, so 38 cardholders in the data have a hotlisted card.

3. What are the Tth month, T+1 month, etc.?

a. Tth month is the current month.
0 = 7508, 1 = 940.
This shows the credit card activity of the customers in the current month: 7508 customers
have no credit card activity and 940 customers have credit card activity. This helps the bank
predict and allocate the right offers or credit card for each customer in order to increase
credit card spend.
b. T+1 is the next month.
0 = 8043, 1 = 405.
This shows the credit card activity of the customers in the next month:
8043 customers have no credit card activity and 405 customers have credit card activity.

c. T+2 is 2 months after the current month.
0 = 7769, 1 = 679.
This shows the credit card activity of the customers 2 months after the current month:
7769 customers have no credit card activity and 679 customers have credit card activity.

d. T+6 is 6 months after the current month.
0 = 8373, 1 = 75.
This shows the credit card activity of the customers 6 months after the current month:
8373 customers have no credit card activity and 75 customers have credit card activity.

e. T+12 is 12 months after the current month.
0 = 8368, 1 = 80.
This shows the credit card activity of the customers 12 months after the current month:
8368 customers have no credit card activity and 80 customers have credit card activity.
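
To compare the months directly, the active counts above can be converted into activity rates over the 8448 customers. The short sketch below uses only the counts reported above; the dictionary keys are labels chosen for illustration.

```python
# Customers with credit card activity (flag = 1) in each month, from the counts above.
active = {"T": 940, "T+1": 405, "T+2": 679, "T+6": 75, "T+12": 80}
total_customers = 8448  # 0-count plus 1-count is 8448 for every month

# Activity rate in percent, rounded to one decimal place.
activity_rate = {month: round(100 * n / total_customers, 1) for month, n in active.items()}

print(activity_rate)
```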

4. bank_vintage:
Vintage with the bank (in months) as of the Tth month.
The term 'vintage' refers to the month or quarter in which the account was opened (or the
loan was granted).
5. Transactor_revolver:
Revolver: a customer who carries balances over from one month to the next.
Transactor: a customer who pays off their balances in full every month.
T = 7153, R = 1295.
Out of the 8448 account holders in the bank, 7153 are transactors and 1295 are revolvers.
So about 85% of the customers are transactors, who pay their balances in full every month
without carrying any dues forward, and the remaining 15% are revolvers, who carry part of
their balance over to the next month.
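
The 85% / 15% split follows directly from the counts; a quick check in plain Python:

```python
# Counts reported above: T = transactor, R = revolver.
counts = {"T": 7153, "R": 1295}
total = sum(counts.values())

# Share of each group in percent, rounded to one decimal place.
shares = {group: round(100 * n / total, 1) for group, n in counts.items()}

print(total, shares)
```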

6. avg_spends_l3m:
The average credit card spend of the customer over the last 3 months.
7. Occupation:
Occupation recorded at the time of the credit card application.
Salaried = 3918
Self Employed = 2175
Retired = 1089
Student = 621
Housewife = 384
0 = 261
The counts above show the customers' occupations at the time they applied for the credit card.
8. cc_limit (current credit card limit)
This variable gives the current credit card limit available to the customer.
A lower available limit means that the customer has spent more on the credit card and is
likely to spend more in the near future, so the right credit card could be suggested to
such customers going forward. Customers with a high available limit are not spending
much through their credit card, and a card that gives them more advantages/offers could
be suggested in order to increase their credit card spend.
