Project-Predictive Modeling-Rajendra M Bhat

The document describes using linear regression and logistic regression models to analyze diamond and tourism customer data. For the diamond data, linear regression was used to predict price based on attributes like carat, color, clarity, cut, depth and table. Key predictors of higher price were larger carat, clarity grades VVS1, VVS2, VS1, VS2, and color grades D, E, F, G. For the tourism data, both logistic regression and LDA were used to predict if a customer would opt for a holiday package based on attributes like nationality, education, salary, age, and family status. The models showed foreigners and those without young children were more likely to opt for packages. Logistic regression performed slightly better with

Uploaded by

Rajendra Bhat


DSBA

Project 4 - Predictive Modeling
Rajendra M Bhat
Problem 1: Linear Regression

You are hired by Gem Stones Co Ltd, a cubic zirconia manufacturer. You are
provided with a dataset containing the prices and other attributes of almost
27,000 cubic zirconia stones (cubic zirconia is an inexpensive diamond alternative
with many of the same qualities as a diamond). The company earns different
profits in different price slots. You have to help the company predict the price of
a stone on the basis of the details given in the dataset, so that it can distinguish
between more profitable and less profitable stones and improve its profit share.
Also, provide them with the best 5 attributes that are most important.

1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check
the null values, Data types, shape, EDA). Perform Univariate and Bivariate
Analysis.
Brief description as below.
            count  unique  top    freq   mean     std      min   25%     50%    75%      max
Unnamed: 0  26967  NaN     NaN    NaN    13484    7784.85  1     6742.5  13484  20225.5  26967
carat       26967  NaN     NaN    NaN    0.798    0.478    0.2   0.4     0.7    1.05     4.5
cut         26967  5       Ideal  10816  NaN      NaN      NaN   NaN     NaN    NaN      NaN
color       26967  7       G      5661   NaN      NaN      NaN   NaN     NaN    NaN      NaN
clarity     26967  8       SI1    6571   NaN      NaN      NaN   NaN     NaN    NaN      NaN
depth       26270  NaN     NaN    NaN    61.745   1.413    50.8  61      61.8   62.5     73.6
table       26967  NaN     NaN    NaN    57.456   2.232    49    56      57     59       79
x           26967  NaN     NaN    NaN    5.73     1.129    0     4.71    5.69   6.55     10.23
y           26967  NaN     NaN    NaN    5.734    1.166    0     4.71    5.71   6.54     58.9
z           26967  NaN     NaN    NaN    3.538    0.721    0     2.9     3.52   4.04     31.8
price       26967  NaN     NaN    NaN    3939.52  4024.86  326   945     2375   5360     18818

Info
0 Unnamed: 0 26967 non-null int64
1 carat 26967 non-null float64
2 cut 26967 non-null object
3 color 26967 non-null object
4 clarity 26967 non-null object
5 depth 26270 non-null float64
6 table 26967 non-null float64
7 x 26967 non-null float64
8 y 26967 non-null float64
9 z 26967 non-null float64
10 price 26967 non-null int64

Count of Null values

Unnamed: 0 0
carat 0
cut 0
color 0
clarity 0
depth 697
table 0
x 0
y 0
z 0
price 0

The data set has 26,967 rows and 11 variables. The column holding the row number
(Unnamed: 0) cannot be used for analysis and needs to be deleted. Excluding it, the data set
has 3 categorical variables and 7 numerical variables, i.e. 10 variables available for analysis.
Price is the dependent variable and the other 9 are independent (predictor) variables.

There are 697 ‘Null Values’ in the variable ‘depth’.
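The report's code file is not reproduced here, so the following is only a minimal sketch of how the checks above (shape, data types, summary statistics, null counts, duplicates) could be run with pandas. The small data frame is invented for illustration; only the column names are taken from the report.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real file; the values are invented for illustration.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4, 5],
    "carat": [0.30, 0.71, 1.01, 0.42, 2.02],
    "cut": ["Ideal", "Good", "Premium", "Ideal", "Ideal"],
    "depth": [62.1, 64.1, np.nan, 61.6, 62.7],
    "price": [499, 2130, 3167, 779, 18207],
})

# The row-number column carries no information, so it is dropped first.
df = df.drop(columns=["Unnamed: 0"])

summary = df.describe(include="all")       # numeric and categorical columns together
null_counts = df.isnull().sum()            # missing values per column
n_duplicates = int(df.duplicated().sum())  # fully duplicated rows

print(df.shape)
df.info()
print(summary)
print(null_counts)
print("duplicate rows:", n_duplicates)
```

On the real data the same calls would be applied to the frame loaded from the project file.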

Univariate analysis
All numerical variables have outliers. Treating them may change the characteristics of the
data set and of the model itself; therefore, the outliers have not been treated.

Bivariate analysis
The variables x, y, z and carat are highly correlated with one another, and they are
also correlated with price (the dependent variable).
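A correlation matrix is one way to see this pattern. The sketch below uses synthetic data in which the dimensions are assumed to grow with the cube root of the weight; that relationship, the constants and the noise levels are illustrative assumptions, not values from the project data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
carat = rng.uniform(0.2, 2.5, size=n)
side = 6.5 * np.cbrt(carat)  # assumed: dimensions scale with cube root of weight

toy = pd.DataFrame({
    "carat": carat,
    "x": side + rng.normal(0, 0.05, n),
    "y": side + rng.normal(0, 0.05, n),
    "z": 0.62 * side + rng.normal(0, 0.05, n),
})
# Synthetic price driven mainly by carat, plus noise.
toy["price"] = 8900 * toy["carat"] + rng.normal(0, 1000, n)

corr = toy.corr()  # pairwise Pearson correlations
print(corr.round(2))
```

Pairs with correlations close to 1 carry largely the same information, which is the reason given later for dropping x, y and z.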

1.2 Impute null values if present, also check for the values which are equal to zero. Do
they have any meaning or do we need to change them or drop them? Do you think
scaling is necessary in this case?

The 697 null values in the variable depth were imputed with the mean of depth. After
deleting duplicates, the data was checked for zero values. There are zero values in
x, y and z.
       carat  cut      color  clarity  depth  table  x     y     z    price
5821   0.71   Good     F      SI2      64.1   60.0   0.00  0.00  0.0  2130
6034   2.02   Premium  H      VS2      62.7   53.0   8.02  7.95  0.0  18207
10827  2.20   Premium  H      SI1      61.2   59.0   8.42  8.37  0.0  17265
12498  2.18   Premium  H      SI2      59.4   61.0   8.49  8.45  0.0  12631
12689  1.10   Premium  G      SI2      63.0   59.0   6.50  6.47  0.0  3696
17506  1.14   Fair     G      VS1      57.5   67.0   0.00  0.00  0.0  6381
18194  1.01   Premium  H      I1       58.1   59.0   6.66  6.60  0.0  3167
23758  1.12   Premium  G      I1       60.4   59.0   6.71  6.67  0.0  2383

Since x, y and z are physical dimensions, a value of zero has no meaning. Because
these variables are highly correlated with carat, they have been dropped from
further analysis. Scaling is not necessary for a linear regression model: rescaling
the predictors changes the size of the coefficients but not the predictions,
R-square or RMSE.
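The three steps discussed here (mean imputation, the zero-value check, and the claim about scaling) can be sketched as follows. The data frame is a made-up miniature, and the least-squares fit is done directly with NumPy rather than with whatever library the code file used.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; not the project data.
toy = pd.DataFrame({
    "carat": [0.71, 2.02, 1.10, 0.30, 0.90, 1.50],
    "depth": [64.1, np.nan, 63.0, 61.8, np.nan, 60.4],
    "x": [0.00, 8.02, 6.50, 4.30, 6.10, 7.30],
    "y": [0.00, 7.95, 6.47, 4.33, 6.15, 7.25],
    "z": [0.00, 0.00, 4.10, 2.66, 3.80, 4.40],
    "price": [2130, 18207, 3696, 500, 3900, 9800],
})

# Mean imputation of the missing depth values.
toy["depth"] = toy["depth"].fillna(toy["depth"].mean())

# Rows where at least one dimension is recorded as zero.
zero_rows = toy[(toy[["x", "y", "z"]] == 0).any(axis=1)]


def r_squared(X, y):
    """R-square of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    centred = y - y.mean()
    return 1 - (resid @ resid) / (centred @ centred)


X = toy[["carat", "depth"]].to_numpy(dtype=float)
y = toy["price"].to_numpy(dtype=float)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score scaling

r2_raw = r_squared(X, y)
r2_scaled = r_squared(X_scaled, y)
print(len(zero_rows), r2_raw, r2_scaled)
```

The two R-square values agree up to floating-point error, because standardising is an invertible linear change of the predictors and the fitted values of a least-squares model with an intercept do not depend on it.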

1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into
train and test (70:30). Apply Linear regression. Performance Metrics: Check the
performance of Predictions on Train and Test sets using Rsquare, RMSE.

Encoding, splitting of the data and fitting the linear regression are provided in the
code file. The intercept and the coefficients associated with the variables are as follows.

Intercept -788.707596
carat 8929.506754
depth -18.479610
table -24.244721
cut_Fair -741.925579
cut_Good -161.216457
cut_Ideal 102.622727
cut_Premium 6.516229
cut_Very_Good 5.295484
color_D 728.863187
color_E 536.512066
color_F 411.410887
color_G 207.384263
color_H -277.915384
color_I -767.550287
color_J -1627.412329
clarity_I1 -3774.371840
clarity_IF 1502.938422
clarity_SI1 -349.355690
clarity_SI2 -1324.032133
clarity_VS1 613.366177
clarity_VS2 296.592423
clarity_VVS1 1172.482746
clarity_VVS2 1073.672299
dtype: float64

R-square for training data = 0.916
R-square for test data = 0.919
RMSE for training data = 1151
RMSE for test data = 1159
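Since the code file is not included, here is a hedged sketch of the encode, split, fit and score sequence using pandas and scikit-learn on synthetic data. Two choices are assumptions of this sketch rather than the report's method: `drop_first=True` removes one level per categorical variable (the coefficient list above shows every level), and `random_state=1` is an arbitrary seed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the price formula is invented for illustration.
rng = np.random.default_rng(1)
n = 1000
toy = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.5, n),
    "depth": rng.normal(61.7, 1.4, n),
    "table": rng.normal(57.5, 2.2, n),
    "cut": rng.choice(["Fair", "Good", "Very Good", "Premium", "Ideal"], n),
    "color": rng.choice(list("DEFGHIJ"), n),
})
toy["price"] = (9000 * toy["carat"]
                + 300 * (toy["cut"] == "Ideal")
                + rng.normal(0, 1000, n))

# One-hot encode the string columns (one level dropped per variable).
X = pd.get_dummies(toy.drop(columns="price"), columns=["cut", "color"],
                   drop_first=True, dtype=float)
y = toy["price"]

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

model = LinearRegression().fit(X_train, y_train)


def score(X_part, y_part):
    """Return (R-square, RMSE) of the fitted model on one split."""
    pred = model.predict(X_part)
    rmse = float(np.sqrt(mean_squared_error(y_part, pred)))
    return r2_score(y_part, pred), rmse


r2_train, rmse_train = score(X_train, y_train)
r2_test, rmse_test = score(X_test, y_test)
print(f"train: R2={r2_train:.3f} RMSE={rmse_train:.0f}")
print(f"test:  R2={r2_test:.3f} RMSE={rmse_test:.0f}")
```

Similar train and test scores, as in the figures reported above, suggest the model is not overfitting.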

1.4 Inference: Basis on these predictions, what are the business insights and
recommendations.

Carat is the dominant factor in deciding the price of a stone: the higher the carat,
the higher the price. Carat is a measure of weight, which is directly correlated with
the physical dimensions (x, y, z). Stones with clarity IF and colour D have the highest
prices. Clarity VVS1, VVS2, VS1, VS2 and colour E, F, G also have a positive effect on
price. In terms of cut, Ideal, Premium and Very Good fetch a better price.

It is advisable to avoid stones of cut Fair and Good. Stones of colour H, I and J and
of clarity I1, SI2 and SI1 have lower prices and should be avoided.

Using these parameters, higher-priced stones can be selected and lower-priced stones
avoided, for better marketability and profit.
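To make the coefficients concrete, the fitted equation can be evaluated by hand for a single stone. The sketch below uses the intercept and coefficients reported in section 1.3; the example stone (1.0 carat, depth 61.8, table 57, cut Ideal, colour G, clarity VS2) and the helper name `predict_price` are hypothetical choices for illustration.

```python
# Intercept and coefficients as reported in section 1.3.
coef = {
    "Intercept": -788.707596, "carat": 8929.506754,
    "depth": -18.479610, "table": -24.244721,
    "cut_Fair": -741.925579, "cut_Good": -161.216457,
    "cut_Ideal": 102.622727, "cut_Premium": 6.516229,
    "cut_Very_Good": 5.295484,
    "color_D": 728.863187, "color_E": 536.512066, "color_F": 411.410887,
    "color_G": 207.384263, "color_H": -277.915384, "color_I": -767.550287,
    "color_J": -1627.412329,
    "clarity_I1": -3774.371840, "clarity_IF": 1502.938422,
    "clarity_SI1": -349.355690, "clarity_SI2": -1324.032133,
    "clarity_VS1": 613.366177, "clarity_VS2": 296.592423,
    "clarity_VVS1": 1172.482746, "clarity_VVS2": 1073.672299,
}


def predict_price(carat, depth, table, cut, color, clarity):
    """Evaluate the linear equation for one stone (hypothetical helper)."""
    return (coef["Intercept"]
            + coef["carat"] * carat
            + coef["depth"] * depth
            + coef["table"] * table
            + coef[f"cut_{cut}"]
            + coef[f"color_{color}"]
            + coef[f"clarity_{clarity}"])


price = predict_price(1.0, 61.8, 57, "Ideal", "G", "VS2")
heavier = predict_price(2.0, 61.8, 57, "Ideal", "G", "VS2")
print(round(price, 2), round(heavier - price, 2))
```

For this stone the equation gives a price of roughly 6,223, and each additional carat adds the carat coefficient (about 8,930) with everything else held fixed, which is why carat dominates the other attributes.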

Problem 2: Logistic Regression and LDA

You are hired by a tour and travel agency which deals in selling holiday packages.
You are provided details of 872 employees of a company. Among these employees,
some opted for the package and some didn't. You have to help the company in
predicting whether an employee will opt for the package or not on the basis of the
information given in the data set. Also, find out the important factors on the basis of
which the company will focus on particular employees to sell their packages.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis.

The data set has 872 rows. The column holding the row number (Unnamed: 0) cannot be used
for analysis and needs to be deleted. Excluding it, the data set has 2 object variables and
5 numerical variables, i.e. 7 variables available for analysis. ‘Holliday_Package’ is the dependent
variable and the other 6 are independent (predictor) variables. There are no null values and no
duplicate rows in the data set.

Univariate and bivariate analysis
If an employee is a foreigner and has no young children, the chances of opting for the Holiday
Package are good. The independent variables are not correlated with one another. Salary has
some outliers.

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA
(linear discriminant analysis).

Logistic Regression and LDA done in the code file

Logistic Regression coefficients

                    coef
foreign             1.266482
educ                0.060348
Salary             -0.000016
no_older_children  -0.048943
age                -0.057072
no_young_children  -1.348832

LDA coefficients

                    coef
foreign             1.320602
educ                0.058604
Salary             -0.000014
no_older_children  -0.037567
age                -0.057795
no_young_children  -1.282791
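As the code file is not shown, the following is a minimal sketch of fitting both classifiers with scikit-learn on unscaled, encoded data. The data is synthetic (only some column names are borrowed from the report), the yes/no coding of `foreign` and `Holliday_Package` is an assumption, and `max_iter=1000` and `random_state=1` are arbitrary choices of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic employees; the generating formula below is invented.
rng = np.random.default_rng(2)
n = 872
toy = pd.DataFrame({
    "foreign": rng.choice(["no", "yes"], n),
    "educ": rng.integers(1, 22, n),
    "age": rng.integers(20, 63, n),
    "no_young_children": rng.integers(0, 4, n),
})
logit = 0.5 + 1.3 * (toy["foreign"] == "yes") - 1.3 * toy["no_young_children"]
prob = 1 / (1 + np.exp(-logit))
toy["Holliday_Package"] = np.where(rng.random(n) < prob, "yes", "no")

# Encode the string columns as 0/1; no scaling is applied.
X = toy.drop(columns="Holliday_Package").copy()
X["foreign"] = (X["foreign"] == "yes").astype(int)
y = (toy["Holliday_Package"] == "yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
}
coefs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    coefs[name] = pd.Series(model.coef_[0], index=X.columns)

print(pd.DataFrame(coefs).round(4))
```

With 872 rows, a 70:30 split gives 610 training and 262 test rows, matching the support totals in the classification reports of section 2.3.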

2.3 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
each model Final Model: Compare Both the models and write inference which model
is best/optimized.

Logistic Regression

Classification Report of the training data:

              precision  recall  f1-score  support
0             0.67       0.74    0.71      329
1             0.66       0.58    0.62      281
accuracy                         0.67      610
macro avg     0.67       0.66    0.66      610
weighted avg  0.67       0.67    0.66      610

Classification Report of the test data:

              precision  recall  f1-score  support
0             0.65       0.77    0.71      142
1             0.65       0.52    0.58      120
accuracy                         0.65      262
macro avg     0.65       0.64    0.64      262
weighted avg  0.65       0.65    0.65      262
[ROC curve plots: Training Data, Test Data]

Area under the curve = 0.735

Accuracy for training data = 0.67 and accuracy for test data = 0.65
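The reported accuracy can be cross-checked against the per-class figures, because overall accuracy equals the support-weighted average of the per-class recalls. The sketch below applies that identity to the Logistic Regression reports above; since the published recalls are rounded to two decimals, agreement is only expected to within about 0.01.

```python
# Per-class recall, support and reported accuracy from the reports above.
reports = {
    "train": {"recall": (0.74, 0.58), "support": (329, 281), "accuracy": 0.67},
    "test": {"recall": (0.77, 0.52), "support": (142, 120), "accuracy": 0.65},
}


def implied_accuracy(recall, support):
    """Accuracy implied by per-class recall: correct predictions / all rows."""
    correct = sum(r * s for r, s in zip(recall, support))
    return correct / sum(support)


implied = {
    split: implied_accuracy(rep["recall"], rep["support"])
    for split, rep in reports.items()
}
for split, value in implied.items():
    print(split, round(value, 3), "reported:", reports[split]["accuracy"])
```

Both implied values fall within rounding distance of the reported accuracies, so the tables are internally consistent.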

Linear discriminant analysis

Classification Report of the training data:

              precision  recall  f1-score  support
0             0.67       0.74    0.70      329
1             0.65       0.58    0.61      281
accuracy                         0.66      610
macro avg     0.66       0.66    0.66      610
weighted avg  0.66       0.66    0.66      610

Classification Report of the test data:

              precision  recall  f1-score  support
0             0.64       0.77    0.70      142
1             0.64       0.49    0.56      120
accuracy                         0.64      262
macro avg     0.64       0.63    0.63      262
weighted avg  0.64       0.64    0.63      262

[Confusion matrix plots for both training and test data]

AUC for the training data: 0.733

AUC for the test data: 0.714

Accuracy for training data = 0.66 and accuracy for test data = 0.64
The accuracy score on both the training and the test data is higher for Logistic Regression (LR)
than for LDA, although only by about one percentage point (0.67 vs 0.66 on training, 0.65 vs 0.64
on test). The data set has outliers in ‘Salary’, and LR is a more robust predictor in the presence
of outliers. Therefore, it is recommended to use Logistic Regression (LR).
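The metrics compared above (accuracy, confusion matrix, ROC curve and AUC) can be computed for both models on both splits along the following lines. This is a sketch on synthetic data with default scikit-learn settings, not the project's code; the features, seed and noise level are invented.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Synthetic binary problem with a noisy linear signal.
rng = np.random.default_rng(3)
X = rng.normal(size=(872, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.5, size=872) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

models = {
    "logistic": LogisticRegression(),
    "lda": LinearDiscriminantAnalysis(),
}
splits = {"train": (X_train, y_train), "test": (X_test, y_test)}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    for split, (X_s, y_s) in splits.items():
        pred = model.predict(X_s)              # hard class labels
        prob = model.predict_proba(X_s)[:, 1]  # probability of class 1
        results[(name, split)] = {
            "accuracy": accuracy_score(y_s, pred),
            "confusion": confusion_matrix(y_s, pred),
            "auc": roc_auc_score(y_s, prob),
        }

# Points of the test-set ROC curve for the logistic model (for plotting).
fpr, tpr, thresholds = roc_curve(
    y_test, models["logistic"].predict_proba(X_test)[:, 1])

for key, res in results.items():
    print(key, round(res["accuracy"], 3), round(res["auc"], 3))
```

Comparing the train and test rows for each model shows both the gap between the models and how much each one loses out of sample.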

2.4 Inference: Basis on these predictions, what are the insights and
recommendations. Please explain and summarise the various steps performed in this
project. There should be proper business interpretation and actionable insights present.

If an employee is a foreigner and has no young children, the chances of opting for the
Holiday Package are good. A special offer can be designed to encourage domestic employees
to opt for the Holiday Package.

Many high-salary employees are not opting for the Holiday Package; the company can focus on
high-salary employees to sell it. Employees with older children are not opting for the
Holiday Package. The age of the employee is not a material factor in opting for the Holiday Package.

The coefficients from both models show that opting for the Holiday Package has a strong
negative relation with the number of young children. Holiday packages can be modified to be
friendly to infants and young children, to attract more employees who have young children.
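One way to read the size of these effects is to convert the logistic regression coefficients from section 2.2 into odds ratios by exponentiating them. The sketch below does this for the reported coefficients; it assumes that `foreign` was encoded as 1 for foreigners and that each ratio is read with the other predictors held fixed. The 10,000-unit step for Salary is an arbitrary illustrative choice.

```python
import math

# Logistic regression coefficients as reported in section 2.2.
logit_coef = {
    "foreign": 1.266482,
    "educ": 0.060348,
    "Salary": -0.000016,
    "no_older_children": -0.048943,
    "age": -0.057072,
    "no_young_children": -1.348832,
}

# exp(coef) is the multiplicative change in the odds of opting for the
# package for a one-unit increase in the predictor.
odds_ratio = {name: math.exp(value) for name, value in logit_coef.items()}

# Salary is measured in single currency units, so a larger step is easier
# to interpret; 10,000 is an illustrative choice.
salary_per_10k = math.exp(logit_coef["Salary"] * 10_000)

for name, ratio in odds_ratio.items():
    print(f"{name:20s} {ratio:.3f}")
print(f"{'Salary (per 10,000)':20s} {salary_per_10k:.3f}")
```

Under these assumptions the odds for a foreign employee are about 3.5 times those of a domestic employee, and each additional young child multiplies the odds by roughly 0.26, which matches the direction of the recommendations above.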
