Project-Predictive Modeling-Rajendra M Bhat
Project 4 - Predictive Modeling
Rajendra M Bhat
Problem 1: Linear Regression
You are hired by a company, Gem Stones Co Ltd, which is a cubic zirconia
manufacturer. You are provided with a dataset containing the prices and other
attributes of almost 27,000 cubic zirconia stones (an inexpensive diamond alternative
with many of the same qualities as a diamond). The company earns different
profits in different price slots. You have to help the company predict the price of
a stone on the basis of the details given in the dataset, so it can distinguish between
more profitable and less profitable stones and improve its profit share.
Also, provide them with the 5 attributes that are most important.
1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check
the null values, Data types, shape, EDA). Perform Univariate and Bivariate
Analysis.
Brief description as below.
            count  unique  top    freq   mean     std      min   25%     50%    75%      max
Unnamed: 0  26967  NaN     NaN    NaN    13484    7784.85  1     6742.5  13484  20225.5  26967
carat       26967  NaN     NaN    NaN    0.798    0.478    0.2   0.4     0.7    1.05     4.5
cut         26967  5       Ideal  10816  NaN      NaN      NaN   NaN     NaN    NaN      NaN
color       26967  7       G      5661   NaN      NaN      NaN   NaN     NaN    NaN      NaN
clarity     26967  8       SI1    6571   NaN      NaN      NaN   NaN     NaN    NaN      NaN
depth       26270  NaN     NaN    NaN    61.745   1.413    50.8  61      61.8   62.5     73.6
x           26967  NaN     NaN    NaN    5.73     1.129    0     4.71    5.69   6.55     10.23
y           26967  NaN     NaN    NaN    5.734    1.166    0     4.71    5.71   6.54     58.9
z           26967  NaN     NaN    NaN    3.538    0.721    0     2.9     3.52   4.04     31.8
price       26967  NaN     NaN    NaN    3939.52  4024.86  326   945     2375   5360     18818
Info
0 Unnamed: 0 26967 non-null int64
1 carat 26967 non-null float64
2 cut 26967 non-null object
3 color 26967 non-null object
4 clarity 26967 non-null object
5 depth 26270 non-null float64
6 table 26967 non-null float64
7 x 26967 non-null float64
8 y 26967 non-null float64
9 z 26967 non-null float64
10 price 26967 non-null int64
The data set has 26,967 rows and 11 variables. The row-number column (Unnamed: 0)
carries no information and is dropped. Of the remaining 10 variables, 3 are categorical
(cut, color, clarity) and 7 are numerical. Price is the dependent variable and the other
9 are independent (predictor) variables. Only depth has missing values (697 of them).
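The checks described above can be sketched in pandas as follows. This is a minimal illustration rather than the code file used for the report: the four-row frame is a hypothetical stand-in for the real data set, and the variable names (`summary`, `null_counts`, `categorical`, `numerical`) are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real file (the report's data has 26,967 rows).
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4],
    "carat": [0.30, 0.33, 0.90, 0.42],
    "cut": ["Ideal", "Premium", "Very Good", "Ideal"],
    "color": ["E", "G", "E", "F"],
    "clarity": ["SI1", "IF", "VVS2", "VS1"],
    "depth": [62.1, 60.8, np.nan, 61.6],
    "table": [58.0, 58.0, 60.0, 56.0],
    "x": [4.27, 4.42, 6.04, 4.82],
    "y": [4.29, 4.46, 6.12, 4.80],
    "z": [2.66, 2.70, 3.78, 2.96],
    "price": [499, 984, 6289, 1082],
})

# The row-number column carries no information, so drop it first.
df = df.drop(columns=["Unnamed: 0"])

summary = df.describe(include="all").T      # same layout as the description table
null_counts = df.isnull().sum()             # missing values per column
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()

print(df.shape)
print(null_counts[null_counts > 0])
print("categorical:", categorical)
print("numerical:", numerical)
```

On the real data the same calls would be run on the frame returned by `pd.read_csv`; the file name is not given in this report, so it is left out here.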
Bivariate analysis
The variables x, y, z and carat are highly correlated with one another, and each of
them is also correlated with price (the dependent variable).
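A correlation matrix is one way to check this. The sketch below uses synthetic data, not the report's data set: the toy stones are generated under the assumption that the linear dimensions grow roughly with the cube root of the weight, and the 0.9 threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stones: dimensions follow the cube root of carat plus a little noise.
carat = rng.uniform(0.2, 2.5, size=n)
size = carat ** (1 / 3)
toy = pd.DataFrame({
    "carat": carat,
    "x": 6.4 * size + rng.normal(0, 0.05, n),
    "y": 6.4 * size + rng.normal(0, 0.05, n),
    "z": 4.0 * size + rng.normal(0, 0.05, n),
})
toy["price"] = 4000 * carat ** 1.5 + rng.normal(0, 300, n)

corr = toy.corr()  # Pearson correlation between every pair of columns
print(corr.round(2))

# List the predictor pairs whose absolute correlation exceeds the threshold.
threshold = 0.9
predictors = ["carat", "x", "y", "z"]
high_pairs = [
    (a, b)
    for i, a in enumerate(predictors)
    for b in predictors[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```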
1.2 Impute null values if present, also check for the values which are equal to zero. Do
they have any meaning or do we need to change them or drop them? Do you think
scaling is necessary in this case?
There are 697 null values in the variable depth; they were imputed with the mean of
depth. After deleting duplicates, the data was checked for zero values. The rows below
have a zero in x, y or z.
       carat  cut      color  clarity  depth  table  x     y     z    price
5821   0.71   Good     F      SI2      64.1   60.0   0.00  0.00  0.0  2130
6034   2.02   Premium  H      VS2      62.7   53.0   8.02  7.95  0.0  18207
10827  2.20   Premium  H      SI1      61.2   59.0   8.42  8.37  0.0  17265
12498  2.18   Premium  H      SI2      59.4   61.0   8.49  8.45  0.0  12631
12689  1.10   Premium  G      SI2      63.0   59.0   6.50  6.47  0.0  3696
17506  1.14   Fair     G      VS1      57.5   67.0   0.00  0.00  0.0  6381
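The three steps described above (mean imputation, duplicate removal, zero check) can be sketched as follows. The five-row frame is a hypothetical example built for illustration; only the order of the steps is taken from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two missing depths, one exact duplicate, three rows with a zero dimension.
df = pd.DataFrame({
    "carat": [0.71, 2.02, 0.30, 0.30, 1.10],
    "depth": [64.1, 62.7, np.nan, np.nan, 63.0],
    "x": [0.00, 8.02, 4.27, 4.27, 6.50],
    "y": [0.00, 7.95, 4.29, 4.29, 6.47],
    "z": [0.0, 0.0, 2.66, 2.66, 0.0],
})

# 1. Impute missing depth with the column mean.
df["depth"] = df["depth"].fillna(df["depth"].mean())

# 2. Drop exact duplicate rows.
df = df.drop_duplicates()

# 3. Find rows where any physical dimension is zero.
dims = ["x", "y", "z"]
zero_rows = df[(df[dims] == 0).any(axis=1)]
print(zero_rows)
```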
Since x, y and z are physical dimensions, a value of zero has no meaning. Because these
variables are also highly correlated with carat, they have been dropped from further
analysis. Scaling is not necessary for an ordinary least squares linear regression:
rescaling the predictors changes the size of the coefficients but not the predictions,
R-square or RMSE.
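The point about scaling can be checked numerically. The sketch below fits ordinary least squares with NumPy on synthetic data (the two predictors and the coefficients used to generate the target are invented for illustration), once on the raw predictors and once on standardized ones, and compares the fitted values.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Two synthetic predictors on very different scales, and a noisy linear target.
X = np.column_stack([rng.uniform(0.2, 4.5, n), rng.uniform(50, 74, n)])
y = -800 + 9000 * X[:, 0] - 20 * X[:, 1] + rng.normal(0, 500, n)

def fit_ols(features, target):
    """Least-squares fit with an intercept; returns coefficients, fitted values and R-square."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    fitted = design @ beta
    ss_res = np.sum((target - fitted) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return beta, fitted, 1 - ss_res / ss_tot

# Standardize each predictor to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

beta_raw, fitted_raw, r2_raw = fit_ols(X, y)
beta_std, fitted_std, r2_std = fit_ols(X_scaled, y)

print("same predictions:", np.allclose(fitted_raw, fitted_std))
print("R-square raw vs scaled:", round(r2_raw, 6), round(r2_std, 6))
print("slopes raw:", beta_raw[1:].round(2), "slopes scaled:", beta_std[1:].round(2))
```

Scaling would still matter for regularized or distance-based models, which are not used in this problem.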
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into
train and test (70:30). Apply Linear regression. Performance Metrics: Check the
performance of Predictions on Train and Test sets using Rsquare, RMSE.
Encoding, splitting of the data and fitting the linear regression are provided in the code
file. The intercept and the coefficients of the variables are as follows.
Intercept -788.707596
carat 8929.506754
depth -18.479610
table -24.244721
cut_Fair -741.925579
cut_Good -161.216457
cut_Ideal 102.622727
cut_Premium 6.516229
cut_Very_Good 5.295484
color_D 728.863187
color_E 536.512066
color_F 411.410887
color_G 207.384263
color_H -277.915384
color_I -767.550287
color_J -1627.412329
clarity_I1 -3774.371840
clarity_IF 1502.938422
clarity_SI1 -349.355690
clarity_SI2 -1324.032133
clarity_VS1 613.366177
clarity_VS2 296.592423
clarity_VVS1 1172.482746
clarity_VVS2 1073.672299
dtype: float64
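A sketch of how a fit of this kind could be produced with scikit-learn is given below. It is an assumption-laden illustration, not the code file referred to above: the data is synthetic, the price formula is invented, and one dummy column is kept for every level only because the coefficient table above lists every level (passing `drop_first=True` to `get_dummies` is a common alternative).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 300

# Synthetic stand-in for the cleaned data (x, y, z already dropped).
toy = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.5, n),
    "depth": rng.normal(61.7, 1.4, n),
    "table": rng.normal(57.5, 2.2, n),
    "cut": rng.choice(["Fair", "Good", "Very Good", "Premium", "Ideal"], n),
    "color": rng.choice(list("DEFGHIJ"), n),
    "clarity": rng.choice(["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"], n),
})
toy["price"] = 9000 * toy["carat"] - 800 + rng.normal(0, 400, n)  # invented relationship

# One-hot encode the string columns.
X = pd.get_dummies(toy.drop(columns="price"), columns=["cut", "color", "clarity"], dtype=float)
y = toy["price"]

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

model = LinearRegression().fit(X_train, y_train)

# R-square and RMSE on both sets.
scores = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X_part)
    scores[name] = {
        "r2": r2_score(y_part, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_part, pred))),
    }
    print(name, scores[name])

coefficients = pd.Series(model.coef_, index=X.columns)
print(coefficients.round(2))
```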
1.4 Inference: Based on these predictions, what are the business insights and
recommendations?
Carat is the dominant factor in the price of a stone: the higher the carat, the higher
the price. Carat is a measure of weight and is directly correlated with the physical
dimensions (x, y, z). Stones with clarity IF and colour D have a higher price.
Clarity VVS1, VVS2, VS1 and VS2 and colours E, F and G also have a positive effect on
price. In terms of cut, Ideal, Premium and Very Good fetch a better price.
It is advisable to avoid stones of cut Fair and Good. Colours H, I and J lower the
price, as do clarity I1, SI2 and SI1, so these should be avoided.
Using these parameters, higher-priced stones can be selected and lower-priced ones
avoided, for better marketability and profit.
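To make the reading of the coefficients concrete, the sketch below plugs a subset of the reported coefficients into the linear equation for one example stone. The stone itself is hypothetical (1.0 carat, depth 61.8, table 57.0, Ideal cut, colour G, clarity VS2), and `predict_price` is a helper written here for illustration, not part of the code file.

```python
# Coefficients copied from the table in 1.3 (only the levels used below).
coef = {
    "Intercept": -788.707596,
    "carat": 8929.506754,
    "depth": -18.479610,
    "table": -24.244721,
    "cut_Ideal": 102.622727,
    "color_G": 207.384263,
    "color_J": -1627.412329,
    "clarity_VS2": 296.592423,
}

def predict_price(carat, depth, table, cut, color, clarity):
    """Evaluate the fitted linear equation for a single stone."""
    return (
        coef["Intercept"]
        + coef["carat"] * carat
        + coef["depth"] * depth
        + coef["table"] * table
        + coef[f"cut_{cut}"]
        + coef[f"color_{color}"]
        + coef[f"clarity_{clarity}"]
    )

# Hypothetical stone, then the same stone with colour J instead of G.
base = predict_price(1.0, 61.8, 57.0, "Ideal", "G", "VS2")
worse_colour = predict_price(1.0, 61.8, 57.0, "Ideal", "J", "VS2")

print(round(base, 2))
print(round(base - worse_colour, 2))  # price difference attributable to colour alone
```

For this example the prediction works out to roughly 6,223, and moving from colour G to colour J lowers it by roughly 1,835, which is simply the difference between the two colour coefficients.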
Problem 2: Logistic Regression and LDA
You are hired by a tour and travel agency which deals in selling holiday packages.
You are provided details of 872 employees of a company. Among these employees,
some opted for the package and some didn't. You have to help the company in
predicting whether an employee will opt for the package or not on the basis of the
information given in the data set. Also, find out the important factors on the basis of
which the company will focus on particular employees to sell their packages.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis.
The data set has 872 rows. The row-number column (Unnamed: 0) carries no information and is
dropped. Excluding it, the data set has 2 object variables and 5 numerical variables,
i.e. 7 variables available for analysis. ‘Holliday_Package’ is the dependent variable and
the other 6 are independent (predictor) variables. There are no null values or duplicate
rows in the data set.
Univariate and bivariate analysis.
Employees who are foreigners and employees without young children are more likely to opt for
the Holiday Package. The independent variables are not strongly correlated with one another.
Salary has some outliers.
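Observations of this kind can be checked with a few pandas calls. The sketch below uses eight invented rows that only mimic the column names of the data set, and the 1.5 x IQR rule for outliers is an assumption, since the text does not say how the Salary outliers were identified.

```python
import pandas as pd

# Invented rows with the same column names as the data set.
toy = pd.DataFrame({
    "Holliday_Package": ["no", "yes", "no", "yes", "yes", "no", "no", "yes"],
    "Salary": [48412, 37207, 58022, 66503, 28146, 51000, 43000, 39000],
    "age": [30, 45, 46, 31, 44, 52, 28, 39],
    "educ": [8, 8, 9, 11, 12, 10, 9, 7],
    "no_young_children": [1, 0, 0, 2, 0, 2, 1, 0],
    "no_older_children": [1, 1, 0, 0, 2, 0, 1, 2],
    "foreign": ["no", "no", "no", "no", "yes", "no", "no", "yes"],
})

# Null and duplicate checks.
n_nulls = int(toy.isnull().sum().sum())
n_duplicates = int(toy.duplicated().sum())
print(n_nulls, n_duplicates)

# Share of employees opting in, within each group (rows sum to 1).
by_foreign = pd.crosstab(toy["foreign"], toy["Holliday_Package"], normalize="index")
has_young = (toy["no_young_children"] > 0).rename("has_young_children")
by_young = pd.crosstab(has_young, toy["Holliday_Package"], normalize="index")
print(by_foreign)
print(by_young)

# Flag Salary outliers with the 1.5 x IQR rule (an assumed criterion).
q1, q3 = toy["Salary"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["Salary"] < q1 - 1.5 * iqr) | (toy["Salary"] > q3 + 1.5 * iqr)]
print(len(outliers))
```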
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA
(linear discriminant analysis).
Logistic Regression coefficient
coef
foreign 1.266482
educ 0.060348
Salary -0.000016
no_older_children -0.048943
age -0.057072
no_young_children -1.348832
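Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below does this for the coefficients listed directly above, assuming that table is the logistic regression fit and that foreign is encoded as 1 for yes and 0 for no (neither is stated explicitly in the text).

```python
import math

# Coefficients copied from the table above.
logit_coef = {
    "foreign": 1.266482,
    "educ": 0.060348,
    "Salary": -0.000016,
    "no_older_children": -0.048943,
    "age": -0.057072,
    "no_young_children": -1.348832,
}

# Odds ratio for a one-unit increase in each variable, other variables held fixed.
odds_ratio = {name: math.exp(b) for name, b in logit_coef.items()}
for name, ratio in odds_ratio.items():
    print(f"{name:20s} {ratio:.3f}")

# Salary is measured in small units, so a 10,000 increase is easier to interpret.
salary_per_10k = math.exp(logit_coef["Salary"] * 10_000)
print(f"Salary (+10,000)     {salary_per_10k:.3f}")
```

Under these assumptions, being a foreigner multiplies the odds of opting in by about 3.5, while each additional young child multiplies them by about 0.26.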
LDA coefficient
coef
foreign 1.320602
educ 0.058604
Salary -0.000014
no_older_children -0.037567
age -0.057795
no_young_children -1.282791
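Coefficient tables like the two above could be obtained along the following lines with scikit-learn. This is a sketch under stated assumptions: the data is synthetic (the generating formula and all value ranges are invented), the `newton-cg` solver is chosen here only because the data is left unscaled, and the exact settings used for the report are not known.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 400

# Synthetic employees with the same column names as the data set.
toy = pd.DataFrame({
    "Salary": rng.normal(47000, 15000, n).round(),
    "age": rng.integers(20, 63, n),
    "educ": rng.integers(1, 22, n),
    "no_young_children": rng.integers(0, 4, n),
    "no_older_children": rng.integers(0, 7, n),
    "foreign": rng.choice(["no", "yes"], n, p=[0.75, 0.25]),
})
# Invented outcome: foreigners more likely, young children and age less likely.
log_odds = (3.0 + 1.3 * (toy["foreign"] == "yes")
            - 1.3 * toy["no_young_children"] - 0.05 * toy["age"])
opted = rng.random(n) < 1 / (1 + np.exp(-log_odds))
toy["Holliday_Package"] = np.where(opted, "yes", "no")

# Encode the string columns as 0/1; the data is deliberately not scaled.
X = toy.drop(columns="Holliday_Package")
X["foreign"] = (X["foreign"] == "yes").astype(int)
y = (toy["Holliday_Package"] == "yes").astype(int)

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=7)

logit_model = LogisticRegression(solver="newton-cg", max_iter=10000).fit(X_train, y_train)
lda_model = LinearDiscriminantAnalysis().fit(X_train, y_train)

coef_table = pd.DataFrame(
    {"logistic": logit_model.coef_[0], "lda": lda_model.coef_[0]},
    index=X.columns,
)
print(coef_table.round(6))
```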
2.3 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
each model. Final Model: Compare both the models and write an inference on which
model is best/optimized.
Logistic Regression
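The metrics requested in 2.3 can be computed for both models with scikit-learn as sketched below. The data comes from `make_classification` and is purely synthetic, so the printed numbers say nothing about the holiday package data; `evaluate` is a helper written here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary problem with 872 rows and 6 predictors.
X, y = make_classification(n_samples=872, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

def evaluate(model, X_part, y_part):
    """Accuracy, confusion matrix, ROC AUC and ROC curve points for a fitted model."""
    pred = model.predict(X_part)
    prob = model.predict_proba(X_part)[:, 1]   # probability of the positive class
    fpr, tpr, _ = roc_curve(y_part, prob)
    return {
        "accuracy": accuracy_score(y_part, pred),
        "confusion_matrix": confusion_matrix(y_part, pred),
        "roc_auc": roc_auc_score(y_part, prob),
        "roc_points": (fpr, tpr),              # can be passed to a plotting library
    }

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    for split, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
        results[(name, split)] = evaluate(model, X_part, y_part)
        r = results[(name, split)]
        print(name, split, round(r["accuracy"], 3), round(r["roc_auc"], 3))
        print(r["confusion_matrix"])
```

Comparing train and test values side by side for each model is what shows whether either one overfits, and the test AUC is a reasonable single number for choosing between them.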
2.4 Inference: Based on these predictions, what are the insights and
recommendations? Please explain and summarise the various steps performed in this
project. There should be proper business interpretation and actionable insights present.
Employees who are foreigners and employees without young children are more likely to opt for
the Holiday Package. A special offer could be designed to encourage domestic employees to opt
for the Holiday Package.
Many high-salary employees are not opting for the Holiday Package, so the company can focus on
high-salary employees when selling it. Employees with older children are slightly less likely to
opt for the Holiday Package (a small negative coefficient in both models). The age of the
employee is not material to opting for the Holiday Package.
The coefficients from both models show that opting for the Holiday Package has a strong
negative relation with the number of young children. Holiday packages can be modified to be
friendly to infants and young children, to attract more employees with young children.