Project-Predictive Modeling-Rajendra M Bhat

The document describes using linear regression and logistic regression models to analyze diamond and tourism customer data. For the diamond data, linear regression was used to predict price based on attributes like carat, color, clarity, cut, depth and table. Key predictors of higher price were larger carat, clarity grades VVS1, VVS2, VS1, VS2, and color grades D, E, F, G. For the tourism data, both logistic regression and LDA were used to predict if a customer would opt for a holiday package based on attributes like nationality, education, salary, age, and family status. The models showed foreigners and those without young children were more likely to opt for packages. Logistic regression performed slightly better with

Uploaded by

Rajendra Bhat

DSBA

Project 4 - Predictive Modeling
Rajendra M Bhat
Problem 1: Linear Regression

You are hired by Gem Stones Co. Ltd., a cubic zirconia manufacturer. You are
provided with a dataset containing the prices and other attributes of almost
27,000 cubic zirconia stones (cubic zirconia is an inexpensive diamond alternative
with many of the same qualities as a diamond). The company earns different
profits in different price slots. You have to help the company predict the price of
a stone on the basis of the details given in the dataset, so that it can distinguish
between more profitable and less profitable stones and improve its profit share.
Also, provide them with the best 5 attributes that are most important.

1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check
the null values, Data types, shape, EDA). Perform Univariate and Bivariate
Analysis.
Brief description as below.

variable    count  unique  top    freq   mean     std      min   25%     50%    75%      max
Unnamed: 0  26967  NaN     NaN    NaN    13484    7784.85  1     6742.5  13484  20225.5  26967
carat       26967  NaN     NaN    NaN    0.798    0.478    0.2   0.4     0.7    1.05     4.5
cut         26967  5       Ideal  10816  NaN      NaN      NaN   NaN     NaN    NaN      NaN
color       26967  7       G      5661   NaN      NaN      NaN   NaN     NaN    NaN      NaN
clarity     26967  8       SI1    6571   NaN      NaN      NaN   NaN     NaN    NaN      NaN
depth       26270  NaN     NaN    NaN    61.745   1.413    50.8  61      61.8   62.5     73.6
table       26967  NaN     NaN    NaN    57.456   2.232    49    56      57     59       79
x           26967  NaN     NaN    NaN    5.73     1.129    0     4.71    5.69   6.55     10.23
y           26967  NaN     NaN    NaN    5.734    1.166    0     4.71    5.71   6.54     58.9
z           26967  NaN     NaN    NaN    3.538    0.721    0     2.9     3.52   4.04     31.8
price       26967  NaN     NaN    NaN    3939.52  4024.86  326   945     2375   5360     18818

Info
0 Unnamed: 0 26967 non-null int64
1 carat 26967 non-null float64
2 cut 26967 non-null object
3 color 26967 non-null object
4 clarity 26967 non-null object
5 depth 26270 non-null float64
6 table 26967 non-null float64
7 x 26967 non-null float64
8 y 26967 non-null float64
9 z 26967 non-null float64
10 price 26967 non-null int64

Count of Null values


Unnamed: 0 0
carat 0
cut 0
color 0
clarity 0
depth 697
table 0
x 0
y 0
z 0
price 0

The data set has 26,967 rows and 11 variables. The row-number column (Unnamed: 0)
carries no information for analysis and is deleted. Excluding it, the data set has 3
categorical variables and 7 numerical variables, i.e. 10 variables available for analysis.
Price is the dependent variable and the other 9 are independent (predictor) variables.

There are 697 null values in the variable ‘depth’.


Univariate analysis
All numerical variables have outliers. Treating them could change the characteristics of
the data set and of the model itself; therefore, the outliers are left untreated.
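
The outlier statement can be checked against the descriptive statistics reported earlier. The sketch below assumes the common 1.5 x IQR boxplot rule; the report does not say which rule was actually used, so the rule and the helper names are assumptions for illustration.

```python
# Minimum, quartiles and maximum copied from the descriptive statistics table.
# Tuple order: (min, Q1, Q3, max)
summary = {
    "carat": (0.2, 0.4, 1.05, 4.5),
    "depth": (50.8, 61.0, 62.5, 73.6),
    "table": (49.0, 56.0, 59.0, 79.0),
    "x": (0.0, 4.71, 6.55, 10.23),
    "y": (0.0, 4.71, 6.54, 58.9),
    "z": (0.0, 2.9, 4.04, 31.8),
    "price": (326.0, 945.0, 5360.0, 18818.0),
}


def iqr_fences(q1, q3, k=1.5):
    """Return the lower and upper fences of the k * IQR rule (assumed rule)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def has_outliers(stats):
    """True if the minimum or maximum falls outside the fences."""
    lowest, q1, q3, highest = stats
    lower, upper = iqr_fences(q1, q3)
    return lowest < lower or highest > upper


flags = {name: has_outliers(stats) for name, stats in summary.items()}
for name, flagged in flags.items():
    print(f"{name:>6}: values beyond the 1.5*IQR fences = {flagged}")
```

Under this rule every numerical variable has at least one extreme value outside its fences, which is consistent with the statement above.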

Bivariate analysis
The variables x, y, z and carat are highly correlated with one another, and each is
also correlated with price (the dependent variable).

1.2 Impute null values if present, also check for the values which are equal to zero. Do
they have any meaning or do we need to change them or drop them? Do you think
scaling is necessary in this case?

There are 697 null values in the variable depth; they were imputed with the mean of
depth. After deleting duplicates, the data were checked for zero values. There are
zero values in x, y and z, as shown below.
       carat  cut      color  clarity  depth  table  x     y     z    price
5821   0.71   Good     F      SI2      64.1   60.0   0.00  0.00  0.0  2130
6034   2.02   Premium  H      VS2      62.7   53.0   8.02  7.95  0.0  18207
10827  2.20   Premium  H      SI1      61.2   59.0   8.42  8.37  0.0  17265
12498  2.18   Premium  H      SI2      59.4   61.0   8.49  8.45  0.0  12631
12689  1.10   Premium  G      SI2      63.0   59.0   6.50  6.47  0.0  3696
17506  1.14   Fair     G      VS1      57.5   67.0   0.00  0.00  0.0  6381
18194  1.01   Premium  H      I1       58.1   59.0   6.66  6.60  0.0  3167
23758  1.12   Premium  G      I1       60.4   59.0   6.71  6.67  0.0  2383
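
The imputation and the zero check described above can be sketched as follows. This is a minimal illustration on a small hypothetical array, not the code file that accompanies the report; the array values and variable names are assumptions.

```python
import numpy as np

# Hypothetical miniature of the data: depth has one missing value and the
# dimension columns contain some zeros.
depth = np.array([62.7, np.nan, 61.2, 59.4, 63.0])
x = np.array([8.02, 0.00, 8.42, 8.49, 6.50])
y = np.array([7.95, 0.00, 8.37, 8.45, 6.47])
z = np.array([0.0, 0.0, 0.0, 5.1, 4.0])

# Replace missing depth values with the mean of the observed depth values.
depth_mean = np.nanmean(depth)
depth_imputed = np.where(np.isnan(depth), depth_mean, depth)

# Positions of rows in which any physical dimension is recorded as zero.
zero_rows = np.flatnonzero((x == 0) | (y == 0) | (z == 0))

print("imputed depth:", depth_imputed)
print("rows with a zero dimension:", zero_rows.tolist())
```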

Since x, y and z are physical dimensions, a value of zero has no meaning. Because
these variables are also highly correlated with carat, they have been dropped from
further analysis. Scaling is not necessary for a linear regression model: rescaling a
predictor changes the size of its coefficient but not the fitted values, so leaving the
variables unscaled does not affect model performance.
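
The scaling argument can be illustrated with a small experiment. The sketch below fits ordinary least squares with an intercept on synthetic data, once on raw predictors and once on standardised predictors, and compares the fitted values. The data, the random seed and the helper name `fit_predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictors on very different scales (hypothetical data).
n = 200
X = np.column_stack([
    rng.uniform(0.2, 4.5, n),    # carat-like
    rng.uniform(50.0, 74.0, n),  # depth-like
    rng.uniform(49.0, 79.0, n),  # table-like
])
y = 9000 * X[:, 0] - 20 * X[:, 1] - 25 * X[:, 2] + rng.normal(0, 500, n)


def fit_predict(features, target):
    """Fit OLS with an intercept; return the coefficients and fitted values."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef, design @ coef


# Standardise each column to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

coef_raw, pred_raw = fit_predict(X, y)
coef_scaled, pred_scaled = fit_predict(X_scaled, y)

# Largest difference between the two sets of fitted values.
max_gap = np.max(np.abs(pred_raw - pred_scaled))
print("slopes (raw):   ", coef_raw[1:])
print("slopes (scaled):", coef_scaled[1:])
print("largest difference in fitted values:", max_gap)
```

The slopes differ between the two fits, but the fitted values agree up to floating-point error, so R-squared and RMSE are unchanged by scaling.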

1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into
train and test (70:30). Apply Linear regression. Performance Metrics: Check the
performance of Predictions on Train and Test sets using Rsquare, RMSE.

Encoding, splitting of the data and fitting of the linear regression are provided in the
code file. The intercept and the coefficients associated with the variables are as under.

Intercept -788.707596
carat 8929.506754
depth -18.479610
table -24.244721
cut_Fair -741.925579
cut_Good -161.216457
cut_Ideal 102.622727
cut_Premium 6.516229
cut_Very_Good 5.295484
color_D 728.863187
color_E 536.512066
color_F 411.410887
color_G 207.384263
color_H -277.915384
color_I -767.550287
color_J -1627.412329
clarity_I1 -3774.371840
clarity_IF 1502.938422
clarity_SI1 -349.355690
clarity_SI2 -1324.032133
clarity_VS1 613.366177
clarity_VS2 296.592423
clarity_VVS1 1172.482746
clarity_VVS2 1073.672299
dtype: float64
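
To show how these coefficients turn into a price, the sketch below evaluates the fitted equation for one stone. The stone's attributes are hypothetical and the function name `predict_price` is an assumption; only the coefficient values are taken from the listing above.

```python
# Coefficients copied from the fitted model above (only those needed here).
intercept = -788.707596
coef = {
    "carat": 8929.506754,
    "depth": -18.479610,
    "table": -24.244721,
    "cut_Ideal": 102.622727,
    "color_G": 207.384263,
    "clarity_VS2": 296.592423,
}


def predict_price(carat, depth, table, cut, color, clarity):
    """Intercept + numeric terms + one dummy coefficient per category."""
    return (
        intercept
        + coef["carat"] * carat
        + coef["depth"] * depth
        + coef["table"] * table
        + coef[f"cut_{cut}"]
        + coef[f"color_{color}"]
        + coef[f"clarity_{clarity}"]
    )


# Hypothetical stones: depth 61.8, table 57, Ideal cut, colour G, clarity VS2.
one_carat = predict_price(1.00, 61.8, 57.0, "Ideal", "G", "VS2")
two_carat = predict_price(2.00, 61.8, 57.0, "Ideal", "G", "VS2")
print(round(one_carat, 2), round(two_carat - one_carat, 2))
```

For this hypothetical stone the prediction comes to roughly 6,223. Moving from 1 to 2 carats with everything else fixed adds exactly the carat coefficient, which is why carat dominates the price.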

R-squared for training data = 0.916
R-squared for test data = 0.919
RMSE for training data = 1151
RMSE for test data = 1159
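
For reference, the two metrics reported above can be computed as in the sketch below. The price and prediction lists are hypothetical and serve only to exercise the formulas; they are not values from the project.

```python
import math


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Square root of the mean squared prediction error."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


# Hypothetical prices and predictions, for illustration only.
actual = [326.0, 945.0, 2375.0, 5360.0, 18818.0]
predicted = [500.0, 1000.0, 2500.0, 5000.0, 17500.0]

print("R-squared:", round(r_squared(actual, predicted), 3))
print("RMSE:", round(rmse(actual, predicted), 1))
```

RMSE is expressed in the units of price, so an RMSE of about 1,150 should be read against a mean price of about 3,940.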

1.4 Inference: Based on these predictions, what are the business insights and
recommendations?

Carat is the dominant factor in deciding the price of a diamond: the higher the carat,
the higher the price. Carat is a measure of weight, which is directly correlated with the
physical dimensions (x, y, z). Diamonds with clarity IF and colour D command a higher
price. Clarity VVS1, VVS2, VS1, VS2 and colour E, F, G also have a positive effect on
the price of the diamond. In terms of cut, Ideal, Premium and Very Good fetch a better
price.

It is advisable to avoid diamonds of cut ‘Fair’ and ‘Good’. Colours H, I and J lower the
price, as do clarity grades I1, SI2 and SI1; these should be avoided.

Using these parameters, higher-priced diamonds can be selected and lower-priced ones
avoided, for better marketability and profit.

Problem 2: Logistic Regression and LDA

You are hired by a tour and travel agency which deals in selling holiday packages.
You are provided details of 872 employees of a company. Among these employees,
some opted for the package and some didn't. You have to help the company in
predicting whether an employee will opt for the package or not on the basis of the
information given in the data set. Also, find out the important factors on the basis of
which the company will focus on particular employees to sell their packages.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis.

The data set has 872 rows. The row-number column (Unnamed: 0) cannot be used for
analysis and is deleted. Excluding it, the data set has 2 object variables and 5 numerical
variables, i.e. 7 variables available for analysis. ‘Holliday_Package’ is the dependent
variable and the other 6 are independent (predictor) variables. There are no null values
or duplicate rows in the data set.

Univariate and bivariate analysis
An employee who is a foreigner and has no young children has a good chance of opting
for the Holiday Package. The independent variables are not strongly correlated with one
another. Salary has some outliers.

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA
(linear discriminant analysis).

Logistic Regression and LDA are carried out in the code file.
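
As a rough illustration of the 70:30 split, the sketch below shuffles the 872 row indices and sets aside 30% for testing. The seed, the helper name `train_test_indices` and the choice to round the test size up are assumptions; the code file may use a library routine instead.

```python
import math
import random


def train_test_indices(n_rows, test_fraction=0.30, seed=1):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_fraction)  # test size rounded up (assumed)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_indices(872)
print(len(train_idx), len(test_idx))
```

With 872 rows this gives 610 training rows and 262 test rows, the same totals that appear in the support column of the classification reports.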

Logistic Regression coefficients

variable            coef
foreign             1.266482
educ                0.060348
Salary             -0.000016
no_older_children  -0.048943
age                -0.057072
no_young_children  -1.348832

LDA coefficients

variable            coef
foreign             1.320602
educ                0.058604
Salary             -0.000014
no_older_children  -0.037567
age                -0.057795
no_young_children  -1.282791
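
The logistic regression coefficients are easier to read as odds ratios. The sketch below exponentiates the logistic regression coefficients reported above. It assumes that `foreign` is encoded as 1 for a foreigner and 0 otherwise, which the report does not state explicitly.

```python
import math

# Logistic regression coefficients copied from the report.
logit_coef = {
    "foreign": 1.266482,
    "educ": 0.060348,
    "Salary": -0.000016,
    "no_older_children": -0.048943,
    "age": -0.057072,
    "no_young_children": -1.348832,
}

# exp(coefficient) is the multiplicative change in the odds of opting for the
# package for a one-unit increase in the predictor, other predictors held fixed.
odds_ratio = {name: math.exp(beta) for name, beta in logit_coef.items()}

for name, ratio in sorted(odds_ratio.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18} {ratio:.3f}")
```

Read this way, being a foreigner multiplies the odds of opting for the package by roughly 3.5, while each additional young child multiplies them by roughly 0.26. The Salary coefficient looks tiny only because salary is measured in single currency units.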

2.3 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
each model Final Model: Compare Both the models and write inference which model
is best/optimized.

Logistic Regression

Classification Report of the training data:

              precision    recall  f1-score   support

           0       0.67      0.74      0.71       329
           1       0.66      0.58      0.62       281

    accuracy                           0.67       610
   macro avg       0.67      0.66      0.66       610
weighted avg       0.67      0.67      0.66       610

Classification Report of the test data:

              precision    recall  f1-score   support

           0       0.65      0.77      0.71       142
           1       0.65      0.52      0.58       120

    accuracy                           0.65       262
   macro avg       0.65      0.64      0.64       262
weighted avg       0.65      0.65      0.65       262

[Figure: plots for Training Data and Test Data]

Area under the curve = 0.735

Accuracy for training data = 0.67 and accuracy for test data = 0.65
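
The area under the ROC curve can be read as the probability that a randomly chosen employee who opted for the package receives a higher predicted score than a randomly chosen employee who did not. The sketch below computes AUC directly from that definition. The labels and scores are hypothetical, and the pairwise loop is chosen for clarity, not speed.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = opted for the package) and predicted probabilities.
labels = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.55, 0.70, 0.20, 0.60]
auc = roc_auc(labels, scores)
print("AUC:", auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so values around 0.73 indicate moderate separation between the two groups.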

Linear discriminant analysis


Classification Report of the training data:

              precision    recall  f1-score   support

           0       0.67      0.74      0.70       329
           1       0.65      0.58      0.61       281

    accuracy                           0.66       610
   macro avg       0.66      0.66      0.66       610
weighted avg       0.66      0.66      0.66       610

Classification Report of the test data:

              precision    recall  f1-score   support

           0       0.64      0.77      0.70       142
           1       0.64      0.49      0.56       120

    accuracy                           0.64       262
   macro avg       0.64      0.63      0.63       262
weighted avg       0.64      0.64      0.63       262

[Figure: confusion matrix for both training and test data]

AUC for the training data: 0.733
AUC for the test data: 0.714

Accuracy for training data = 0.66 and accuracy for test data = 0.64

The accuracy score on both training and test data is higher for Logistic Regression (LR)
than for LDA. The data set has outliers in ‘salary’, and LR is a more robust predictor in
the presence of outliers. Therefore, it is recommended to use Logistic Regression (LR).
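
As a consistency check on the reported figures, overall accuracy equals the support-weighted average of the per-class recalls. The sketch below recomputes accuracy from the rounded recalls and supports in the classification reports. Because the inputs are rounded to two decimals, agreement is only expected to within about 0.01.

```python
# (recall class 0, support class 0, recall class 1, support class 1,
#  reported accuracy), copied from the classification reports.
reports = {
    "LR train": (0.74, 329, 0.58, 281, 0.67),
    "LR test": (0.77, 142, 0.52, 120, 0.65),
    "LDA train": (0.74, 329, 0.58, 281, 0.66),
    "LDA test": (0.77, 142, 0.49, 120, 0.64),
}


def accuracy_from_recall(recall0, n0, recall1, n1):
    """Accuracy = correctly classified rows / all rows."""
    return (recall0 * n0 + recall1 * n1) / (n0 + n1)


implied = {name: accuracy_from_recall(*vals[:4]) for name, vals in reports.items()}
for name, vals in reports.items():
    print(f"{name:<10} implied {implied[name]:.3f}  reported {vals[4]:.2f}")
```

The check also shows how close the two models are: their accuracies differ by about one percentage point on both training and test data.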

2.4 Inference: Based on these predictions, what are the insights and
recommendations? Please explain and summarise the various steps performed in this
project. There should be proper business interpretation and actionable insights present.

An employee who is a foreigner and has no young children has a good chance of opting
for the Holiday Package. A special offer can be designed to encourage domestic employees
to opt for the Holiday Package.

Many high-salary employees are not opting for the Holiday Package; the company can focus
on high-salary employees to sell the Holiday Package. Employees with older children are
not opting for the Holiday Package. The age of the employee is not a material factor in
opting for the Holiday Package.

The coefficients from both models show that opting for the Holiday Package has a strong
negative relation with the number of young children. Holiday packages can be modified to
be friendly to infants and young children, to attract more employees who have young
children.
