SMDM Project Report
DSBA
Table of Contents
Problem 1
A. What is the important technical information about the dataset that a database administrator would be
interested in? (Hint: Information about the size of the dataset and the nature of the variables)
B. Take a critical look at the data and do a preliminary analysis of the variables. Do a quality check of the data
so that the variables are consistent. Are there any discrepancies present in the data?
C. Explore all the features of the data separately by using appropriate visualizations and draw insights that
can be utilized by the business.
D. Understanding the relationships among the variables in the dataset is crucial for every analytical project.
Perform analysis on the data fields to gain deeper insights. Comment on your understanding of the data.
E. Employees working on the existing marketing campaign have made the following remarks. Based on the
data and your analysis state whether you agree or disagree with their observations. Justify your answer based
on the data available.
a. Steve Roger says “Men prefer SUV by a large margin, compared to the women”
b. Ned Stark believes that a salaried person is more likely to buy a Sedan.
c. Sheldon Cooper does not believe any of them; he claims that a salaried male is an easier target for a SUV
sale over a Sedan sale.
F. From the given data, comment on the amount spent on purchasing automobiles across the following
categories. Comment on how a Business can utilize the results from this exercise. Give justification along with
presenting metrics/charts used for arriving at the conclusions.
a. Gender
b. Personal_loan
G. From the current data set comment if having a working partner leads to the purchase of a higher-priced
car.
H. The main objective of this analysis is to devise an improved marketing strategy to send targeted
information to different groups of potential buyers present in the data. For the current analysis use the Gender
and Marital_status fields to arrive at groups with similar purchase history.
Problem 2
Problem 1
A. What is the important technical information about the dataset that a database administrator
would be interested in?
From the DataFrame summary, we can infer the following important information.
Shape of the dataset:
Total rows = 1581
Total columns = 14
Data types of the 14 variables:
Sl No   Object            Int64               Float64
1       Gender            Age                 Partner_salary
2       Profession        Salary
3       Marital_status    Total_salary
4       Education         Price
5       Personal_loan     No_of_Dependents
6       House_loan
7       Partner_working
8       Make
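The shape and data types above can be read off a pandas DataFrame roughly as follows. This is a
minimal sketch: the frame `sample` is hypothetical stand-in data (the report does not name its
input file), and it carries only four of the 14 columns.

```python
import pandas as pd

# Hypothetical stand-in rows; the real dataset has 1581 rows and 14 columns.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [53, 41, 30],
    "Partner_salary": [70700.0, 0.0, None],   # None becomes NaN in a float column
    "Make": ["SUV", "Sedan", "Hatchback"],
})

n_rows, n_cols = sample.shape   # (number of rows, number of columns)
print("Total rows =", n_rows)
print("Total columns =", n_cols)
print(sample.dtypes)            # one dtype per column
```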
B. Take a critical look at the data and do a preliminary analysis of the variables. Do a quality
check of the data so that the variables are consistent. Are there any discrepancies present in the
data?
A detailed look at the data gives the following important information -
The unique values of the categorical variables show that only the Gender variable has a
problem: it has 4 distinct non-null values, including the misspellings Femal and Femle, which
need to be replaced with the correct value, Female. All other categorical variables have
correct unique values.
Gender
array(['Male', 'Femal', 'Female', nan, 'Femle'], dtype=object)
Profession
array(['Business', 'Salaried'], dtype=object)
Marital_status
array(['Married', 'Single'], dtype=object)
Education
array(['Post Graduate', 'Graduate'], dtype=object)
No_of_dependents
array([4, 3, 2, 1, 0], dtype=object)
Personal_loan
array(['No', 'Yes'], dtype=object)
House_loan
array(['No', 'Yes'], dtype=object)
Partner_working
array(['Yes', 'No'], dtype=object)
Make
array(['SUV', 'Sedan', 'Hatchback'], dtype=object)
The Gender and Partner_salary columns have some NULL values which need
imputation.
For the Gender variable, 53 values are missing, which is about 3% of the total.
Dropping these records does not make sense, as we would lose the values of the other
variables for the same records. Imputing with Other/Unknown would create a new
category with very little data, so any analysis of that category would be unreliable.
No other variable tells us the Gender of a record where it is missing. So, we impute
the missing values with the mode.
Gender:
Male 76%
Female 21%
NULL 3%
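The Gender clean-up described above (correcting the misspellings, then filling the missing values
with the mode) could look like the sketch below. The series `gender` is a small hypothetical
sample, not the report's data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample containing the misspellings reported above.
gender = pd.Series(["Male", "Femal", "Female", np.nan, "Femle", "Male", "Male", "Male"])

# Map the misspelled labels to the correct category.
gender = gender.replace({"Femal": "Female", "Femle": "Female"})

# Fill the remaining missing values with the most frequent category (the mode).
most_frequent = gender.mode()[0]
gender = gender.fillna(most_frequent)

print(gender.value_counts())
```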
For the Partner_salary variable, 106 values are missing, which is 6.7% of the total.
This variable depends on Marital_status and Partner_working.
For Marital_status = Single, we first need to check whether any record has
Partner_working = Yes. Any such record is bad data, and we would need to check with the
business whether Marital_status is wrong in the dataset; if Marital_status is correct,
the record needs to be corrected to Partner_working = No and Partner_salary = 0.
Luckily, we don’t have any such records in the dataset.
For Marital_status = Single, the only valid value is Partner_working = No.
For such records, Partner_salary should be 0. We can impute Partner_salary with
0 where it is missing.
For Marital_status = Married and Partner_working = No, Partner_salary
should also be 0. We can impute Partner_salary with 0 for such records where it is
missing.
For Marital_status = Married and Partner_working = Yes, Partner_salary
should have a non-zero value. For missing values in such records, we can impute using the
formula Partner_salary = Total_salary – Salary.
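The Partner_salary rules above can be sketched in pandas as follows. The four rows in `df` are
hypothetical and were chosen so that each rule is exercised once. The column names are the ones
used in the report.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering the cases discussed above.
df = pd.DataFrame({
    "Marital_status":  ["Single", "Married", "Married", "Married"],
    "Partner_working": ["No",     "No",      "Yes",     "Yes"],
    "Salary":          [50000,    60000,     70000,     65000],
    "Total_salary":    [50000,    60000,     110000,    90000],
    "Partner_salary":  [np.nan,   np.nan,    np.nan,    25000.0],
})

# Consistency check: a single person should not have a working partner.
inconsistent = df[(df["Marital_status"] == "Single") & (df["Partner_working"] == "Yes")]

missing = df["Partner_salary"].isna()
not_working = df["Partner_working"] == "No"

# Partner not working (single or married): partner salary is 0.
df.loc[missing & not_working, "Partner_salary"] = 0

# Partner working: recover the value as Total_salary - Salary.
fill = missing & ~not_working
df.loc[fill, "Partner_salary"] = df.loc[fill, "Total_salary"] - df.loc[fill, "Salary"]

print(df)
```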
Next, we check whether any continuous variable has outliers. Total_salary does.
Since the outliers make up only 1.7% of the values, they can be treated using the IQR method.
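A minimal sketch of IQR-based outlier detection is given below. The values in `total_salary` are
hypothetical, and capping at the whisker limits is only one possible treatment; the report does
not say which treatment was applied.

```python
import pandas as pd

# Hypothetical Total_salary values; the last one is deliberately extreme.
total_salary = pd.Series([30000, 60000, 70000, 78000, 85000, 95000, 171000])

q1 = total_salary.quantile(0.25)
q3 = total_salary.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker limits

is_outlier = (total_salary < lower) | (total_salary > upper)
outlier_pct = 100 * is_outlier.mean()            # share of outliers in percent

# Assumed treatment: cap values at the whisker limits.
capped = total_salary.clip(lower=lower, upper=upper)

print(lower, upper, round(outlier_pct, 1))
```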
All continuous variables except Salary are positively (right) skewed. Age is the most
right skewed, followed by Price and Total_salary.
Total_salary – Minimum and maximum are 30000 and 171000 respectively, with a mean of
79625 and a median of 78000. The mean is above the median, which is consistent with right skew.
Price – Minimum and maximum price of the automobile are 18000 and 70000
respectively. The mean is 35597 and the median is 31000. The two are fairly close, but the
mean is pulled above the median and the distribution is highly right skewed.
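Skewness and the mean/median comparison used above can be computed as in this sketch. The
`price` values are hypothetical and only illustrate a right-skewed shape.

```python
import pandas as pd

# Hypothetical right-skewed prices: most values are low, a few are high.
price = pd.Series([18000, 22000, 25000, 28000, 31000, 33000, 55000, 70000])

skewness = price.skew()                      # > 0 indicates right (positive) skew
mean, median = price.mean(), price.median()

print(round(skewness, 2), mean, median)
```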
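C. Explore all the features of the data separately by using appropriate visualizations and draw
insights that can be utilized by the business.
First, for the categorical columns we can plot countplots/pie charts for better understanding.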
Gender – Most of the people in the given dataset are male.
Profession – More salaried people than business people are buying cars. So, we can build a
marketing strategy which focuses on salaried people for higher sales.
Marital Status – There are more married people than single people among the buyers, so we
can infer that married people are more likely to buy cars. Hence, the marketing strategy can
focus more on married people for higher sales/profit.
Education – More Post Graduate people than Graduate people are buying cars. So, we can
focus our marketing strategy on Post Graduate people to sell more cars and increase sales.
No_of_Dependents – From the plot, we can infer that people with 2 or 3
dependents are more likely to buy cars than people with 0, 1 or 4 dependents.
So, the suggested marketing strategy is to focus on people with 2 or 3 dependents.
Personal_loan – From the plot, we can see that people with and without a personal loan
are almost equally likely to buy a car. So we can ignore this variable in our marketing
strategy, although in combination with other variables it might tell a different story
and give more insight.
House_loan – From the plot, we can infer that people with no house loan are more likely to
buy cars than people with a house loan. So, the suggestion to the business is to focus the
marketing strategy on people with no house loan in order to increase sales.
Partner_working – From the plot, the initial inference is that people with a working
partner are more likely to buy a car than people who are either single or married with no
working partner. So the marketing strategy should focus on people with a working partner
to increase sales.
Make – From the plot we can see that the most popular car is the Sedan,
followed by the Hatchback and the SUV. So the marketing strategy should focus on selling Sedans
first and then Hatchbacks, as they are the more likely choice of the customer. However, including
other variables can affect the analysis and tell a different story.
Salary – From the plot, we can see that there are no outliers and that the mean and median are
close to each other, with a slight negative (left) skew. The average salary is 60000,
and we can segment customers around it in our marketing strategy, e.g.
higher-priced cars for people with salary > 60000 and lower-priced cars for people
with salary < 60000. The graph is also multi-modal.
Partner Salary – From the plot, we can see that partner salary ranges from 0 to
80000 and that the median (approx. 25000) is greater than the mean (approx. 19000). A median
above the mean would normally suggest left skew, but because the graph is bimodal the
mean-median comparison alone is not a reliable guide to the direction of skew. We can’t deduce
anything concrete from this variable by itself; Total Salary is the better variable for
devising the marketing strategy.
Total Salary – Here, we can see that the mean and median are close to each other, but the
distribution is right skewed.
Price – From the plot we can infer that the Price variable is highly right skewed and bimodal.
Age – From the plot we can infer that Age is highly right skewed.
D. Understanding the relationships among the variables in the dataset is crucial for every
analytical project. Perform analysis on the data fields to gain deeper insights. Comment on your
understanding of the data.
First, we look at the pairplot and the correlation of the continuous variables.
Pairplot – There is a linear relationship for the following pairs –
Age and Salary
Age and Total Salary
Age and Price
Salary and Total Salary
Price and Salary
Partner Salary and Total Salary
Total Salary and Price
The correlation matrix shows the same relationships, and they are easier to observe in the
heatmap.
All the continuous variables are positively correlated, and the following pairs are
highly correlated –
Age and Price
Age and Salary
Salary and Total Salary
Partner Salary and Total Salary
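The correlation matrix behind such a heatmap can be computed as sketched below. The six rows in
`df` are hypothetical, and the 0.8 threshold for "highly correlated" is an assumed cut-off for
illustration, not a value taken from the report.

```python
import pandas as pd

# Hypothetical numeric columns; Total_salary = Salary + Partner_salary here.
df = pd.DataFrame({
    "Age":            [25, 30, 35, 40, 45, 50],
    "Salary":         [50000, 55000, 62000, 70000, 76000, 85000],
    "Partner_salary": [0, 25000, 0, 30000, 40000, 35000],
    "Price":          [20000, 24000, 30000, 38000, 50000, 62000],
})
df["Total_salary"] = df["Salary"] + df["Partner_salary"]

corr = df.corr()   # Pearson correlation by default

# List each pair of variables once if its correlation exceeds the threshold.
threshold = 0.8    # assumed cut-off for "highly correlated"
pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(pairs)
```

The same matrix can be passed to a heatmap function of a plotting library to get the visual form.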
Next, we analyse Gender against Make. Males buy far more cars than females.
Males prefer Sedans and Hatchbacks the most, while females prefer SUVs and Sedans the most.
So, the marketing strategy could focus on these groups to boost sales of the specified
vehicles, and discounts or promotional offers could be rolled out to attract more buyers
from the other groups.
Adding Profession to the picture, we can drill down further: among males, Sedans are
preferred most by salaried males and Hatchbacks by business males. Among females, SUVs and
Sedans are preferred by both salaried and business females, with SUVs clearly the most
popular. Business females don’t buy Hatchbacks at all.
Adding Marital Status as well, we can conclude that among males,
married males in both professions are the highest buyers of Sedans and Hatchbacks. Among
females, married females in both professions are the highest buyers of SUVs and Sedans. So, the
marketing strategy could be to send SMS/ads for Sedans and Hatchbacks to males
and SMS/ads for Sedans and SUVs to females.
From the plot of dependents, we can infer that among males, those with 2 or 3 dependents buy
more cars, and among females, those with 1 or 2 dependents buy more cars. The marketing
strategy here could focus on customers with 1 to 3 dependents.
From the plot, we can infer that people buying SUVs have a higher average salary
than people buying Sedans or Hatchbacks. People with a salary below approx. $125000
mostly prefer Sedans and Hatchbacks.
From the plot, we can also infer that single people buy lower-priced cars
more often than married people. Married people whose Total Salary ranges from $50000 to
$110000 are still buying lower-priced cars in the range of $18000 to $35000.
E. Employees working on the existing marketing campaign have made the following remarks.
Based on the data and your analysis state whether you agree or disagree with their observations.
Justify your answer based on the data available.
E1) Steve Roger says “Men prefer SUV by a large margin, compared to the women”
We made a crosstab of Gender and Make to get the counts.
The crosstab clearly shows that the SUV count for females is higher than for
males. The countplot shows the same.
So, the statement made by Steve Roger is false.
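A crosstab of this kind can be built with `pd.crosstab`, as sketched below on hypothetical
purchases (not the report's data). The second table uses `normalize="index"` to show each
gender's share per Make.

```python
import pandas as pd

# Hypothetical purchases, not the report's data.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female", "Female", "Female"],
    "Make":   ["Sedan", "Hatchback", "SUV", "Sedan", "SUV", "SUV", "Sedan"],
})

counts = pd.crosstab(df["Gender"], df["Make"])                     # raw counts
shares = pd.crosstab(df["Gender"], df["Make"], normalize="index")  # row-wise shares

print(counts)
print(shares.round(2))
```

Since males make up about three quarters of the dataset, row-wise shares are a useful complement
to raw counts when comparing preferences between the two groups.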
E2) Ned Stark believes that a salaried person is more likely to buy a Sedan.
The counts clearly show that salaried persons buy more
Sedans than business persons. The countplot shows the same.
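E3) Sheldon Cooper does not believe any of them; he claims that a salaried male is an easier
target for a SUV sale over a Sedan sale.
The counts clearly show that salaried males buy more
Sedans than SUVs, which does not support Sheldon Cooper's claim. The plot shows the same.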
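The comparison for salaried males amounts to filtering on two conditions and counting by Make.
The sketch below uses a hypothetical frame `df`.

```python
import pandas as pd

# Hypothetical purchases, not the report's data.
df = pd.DataFrame({
    "Gender":     ["Male", "Male", "Male", "Male", "Female", "Male"],
    "Profession": ["Salaried", "Salaried", "Salaried", "Business", "Salaried", "Salaried"],
    "Make":       ["Sedan", "Sedan", "SUV", "Hatchback", "SUV", "Hatchback"],
})

# Keep only salaried males, then count purchases per Make.
salaried_male = df[(df["Gender"] == "Male") & (df["Profession"] == "Salaried")]
make_counts = salaried_male["Make"].value_counts()

print(make_counts)
```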
H. The main objective of this analysis is to devise an improved marketing strategy to send
targeted information to different groups of potential buyers present in the data. For the current
analysis use the Gender and Marital_status fields to arrive at groups with similar purchase history.
From the plot, we can easily see that Married Males purchase far more
Sedans and Hatchbacks than the other groups.
SUVs are most popular among Married Females.
Single Females buy the fewest cars of any kind.
After Married Males, Married Females buy the most Sedans.
Marketing strategy could be –
Most marketing for Sedans and Hatchbacks can target Married Males, as
these are their preferred choice. The same goes for SUVs and Married Females.
Targeted SMS or YouTube ads can be sent to these groups to attract sales. The mode value
of Make for each group leads to the same conclusion.
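Group sizes and the most frequent Make per Gender and Marital_status group can be obtained with
a `groupby`, as in this sketch on hypothetical rows.

```python
import pandas as pd

# Hypothetical purchases, not the report's data.
df = pd.DataFrame({
    "Gender":         ["Male", "Male", "Male", "Female", "Female", "Female", "Male", "Female"],
    "Marital_status": ["Married", "Married", "Married", "Married", "Married", "Married",
                       "Single", "Single"],
    "Make":           ["Sedan", "Sedan", "Hatchback", "SUV", "SUV", "Sedan",
                       "Hatchback", "Sedan"],
})

groups = df.groupby(["Gender", "Marital_status"])

group_size = groups.size()                                  # buyers per group
top_make = groups["Make"].agg(lambda s: s.mode().iloc[0])   # most frequent Make per group

print(group_size)
print(top_make)
```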
Problem 2
Framing an Analytics Problem: Analyse the dataset and list down the top 5 important variables,
along with the business justifications.
Looking at the problem statement and the variables, I believe the following are the important
variables –
card_type
cc_active30
cc_active60
cc_active90
annual_income_at_source
other_bank_cc_holding
bank_vintage
T+1_month_activity
T+2_month_activity
T+3_month_activity
T+6_month_activity
T+12_month_activity
Transactor_revolver
avg_spends_l3m
Occupation_at_source
cc_limit
So, as per my business understanding, the following are the most important variables.
1. cc_active30
2. hotlist_flag
3. Tth month, T+1_month_activity
4. bank_vintage
5. Transactor_revolver
6. avg_spends_l3m
7. Occupation
8. cc_limit
active stands for account activity; the account can be a savings or a current account. It tells
us about the customer's transactions over the recent past.
There are 3 such variables in the data:
1. active_30, 2. active_60, 3. active_90.
active_30 stands for account activity within the last 30 days.
0 = 5978, 1 = 2470
active_60 stands for account activity within the last 60 days.
0 = 4268, 1 = 4180
active_90 stands for account activity within the last 90 days.
0 = 3024, 1 = 5424
cc_active stands for credit card activity. It tells us about the
customer's credit card spend over the recent past,
i.e. whether they have used the credit card recently or not.
There are 3 such credit card variables in the data:
1. cc_active30, 2. cc_active60, 3. cc_active90.
cc_active30 stands for credit card usage within the last 30 days.
0 = 6048, 1 = 2400
cc_active60 stands for credit card usage within the last 60 days.
0 = 4355, 1 = 4093
cc_active90 stands for credit card usage within the last 90 days.
0 = 3106, 1 = 5342
In active_30, cc_active30, T+1_month_activity, etc., 0 denotes no activity and 1 denotes activity
in the given period.
a. T is the current month.
0 = 7508, 1 = 940.
This shows the credit card activity of the customers in the current month. It will
help the bank to predict and allocate the right offers/credit card for each customer to increase
credit card spends,
i.e. 7508 customers have no credit card activity and 940 customers have credit card activity.
b. T+1 is the next month.
0 = 8043, 1 = 405.
This shows the credit card activity of the customers in the next month,
i.e. 8043 customers have no credit card activity and 405 customers have credit card activity.
4. bank_vintage:
Vintage with the bank (in months) as of the Tth month.
The term 'vintage' refers to the month or quarter in which the account was opened (or the loan
was granted).
5. Transactor_revolver:
Revolver: Customer who carries balances over from one month to the next.
Transactor: Customer who pays off their balances in full every month.
T = 7153, R = 1295.
Of the 8448 account holders in the bank, 7153 are transactors and 1295
are revolvers. This means about 85% of the customers are transactors, who
pay their dues in full every month without carrying a balance, and the remaining 15% of the
customers are revolvers, who carry part of their balance over to the next month.
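The 85% / 15% split follows directly from the counts above, as this short check shows. Only the
two counts are taken from the report; the variable names are illustrative.

```python
# Counts taken from the figures above.
counts = {"T": 7153, "R": 1295}

total = sum(counts.values())
shares = {k: round(100 * v / total, 1) for k, v in counts.items()}  # percent of customers

print(total, shares)
```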
6. avg_spends_l3m:
The average credit card spend of the customer over the last 3 months.
7. Occupation:
Occupation recorded at the time of credit card application
Salaried = 3918
Self Employed = 2175
Retired = 1089
Student = 621
Housewife = 384
0 = 261
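From the above counts, we can see how the customers' occupations were distributed at the time of
applying for the credit card: Salaried customers form the largest group, followed by Self
Employed. The 261 records with the value 0 appear to have no occupation recorded.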
8. cc_limit / Current credit card limit
This variable gives the current credit card limit available to the customer.
A low available limit means that the customer has spent more on the credit card
and is likely to spend more in the near future. So, for these customers, the right credit card
could be suggested going forward. Customers with a high available limit
are not spending much through the credit card, and a card which gives them more
advantages/offers could be suggested in order to increase their spends on the credit card.