Project Predictive Modeling PDF
Problem Statement
You are hired by a company Gem Stones co ltd, which is a cubic zirconia manufacturer. You
are provided with the dataset containing the prices and other attributes of almost 27,000
cubic zirconia (which is an inexpensive diamond alternative with many of the same
qualities as a diamond). The company is earning different profits on different price slots.
You have to help the company in predicting the price for the stone based on the details
given in the dataset so it can distinguish between higher profitable stones and lower
profitable stones to have a better profit share. Also, provide them with the best 5 attributes
that are most important.
Dataset for Problem 1: cubic_zirconia.csv
Problem 1.1:
● Read the data and do exploratory data analysis. Describe the data briefly. (Check the
null values, Data types, shape, EDA). Perform Univariate and Bivariate Analysis.
Solution:
Initial Information and shape of the data:
Top 10 values:
Bottom 10 values:
Insights:
● The variables are both numeric and categorical: float, int, and object data
types are present
● There are 11 variables and 26967 records
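The initial inspection described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the rows are hypothetical stand-ins for cubic_zirconia.csv, and only a subset of the 11 columns is shown.

```python
import pandas as pd

# Hypothetical stand-in rows; in practice: df = pd.read_csv("cubic_zirconia.csv")
df = pd.DataFrame({
    "carat": [0.30, 0.33, 0.90, 0.42],
    "cut": ["Ideal", "Premium", "Very Good", "Ideal"],
    "color": ["E", "G", "E", "F"],
    "clarity": ["SI1", "IF", "VVS2", "VS1"],
    "depth": [62.1, 60.8, None, 61.6],
    "price": [499, 984, 6289, 1082],
})

print(df.shape)           # (rows, columns)
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # null count per column
print(df.head(10))        # top rows
print(df.tail(10))        # bottom rows

# Separate numeric and categorical columns by dtype
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, categorical_cols)
```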
Insights:
● The target variable will be: price
● Looking at the statistical analysis it is seen that:
○ Categorical data - cut, color, and clarity
○ Continuous data - carat, depth, table, x, y, z, and price
Checking for duplicate records:
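A duplicate check of this kind might look like the sketch below. The small frame is hypothetical; whether duplicates were dropped in the original analysis is not stated here, so the drop step is shown only as an option.

```python
import pandas as pd

# Hypothetical rows; the second row repeats the first exactly
df = pd.DataFrame({
    "carat": [0.30, 0.30, 0.90],
    "cut": ["Ideal", "Ideal", "Premium"],
    "price": [499, 499, 6289],
})

n_dupes = int(df.duplicated().sum())  # rows identical to an earlier row
print(n_dupes)

# Optional: keep only the first occurrence of each duplicated row
deduped = df.drop_duplicates().reset_index(drop=True)
print(len(deduped))
```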
● There are 5 distinct cut categories.
● The preferred cut is Ideal
Insights:
● Carat is positively skewed. Most of the data lies in the range 0 to 1.
● Depth is approximately normally distributed and ranges from 55 to 65.
● Table is positively skewed. Most of the values lie between 55 and 65.
● x is positively skewed and ranges from 4 to 8. x represents the length of the
cubic zirconia in mm.
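Skewness statements like the ones above can be checked numerically. The sketch below uses made-up values with a long right tail (standing in for carat) and a symmetric series for contrast; it is an illustration, not output from the project data.

```python
import pandas as pd

# Hypothetical right-tailed values, loosely resembling a carat column
carat = pd.Series([0.3, 0.3, 0.4, 0.5, 0.7, 1.0, 2.5], name="carat")
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="symmetric")

carat_skew = carat.skew()          # positive: tail extends to the right
symmetric_skew = symmetric.skew()  # zero for a perfectly symmetric sample

print(carat_skew, symmetric_skew)
```

A positive value indicates a longer right tail, a negative value a longer left tail, and a value near zero a roughly symmetric distribution.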
Representation of CUT with Price, in the order of Fair, Good, Very Good, Premium, Ideal
Representation of clarity, in the order of FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3
Representation of clarity with Price, in the order of FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1,
I2, I3
Correlation Matrix:
Insight:
● Multicollinearity is present in the dataset.
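One way to spot multicollinearity from a correlation matrix is to list the feature pairs whose absolute correlation exceeds a threshold. The sketch below uses synthetic data in which y and z are built from x; the 0.9 threshold and the generated values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(4, 8, 200)

# Synthetic dimensions: y and z depend strongly on x, depth does not
df = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(0, 0.05, 200),
    "z": 0.62 * x + rng.normal(0, 0.05, 200),
    "depth": rng.normal(61.7, 1.4, 200),
})

corr = df.corr()
print(corr.round(2))

# Collect each pair once (upper triangle) if |correlation| > 0.9
cols = list(corr.columns)
high_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > 0.9
]
print(high_pairs)
```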
Problem 1.2:
● Impute null values if present, also check for the values which are equal to zero. Do
they have any meaning or do we need to change them or drop them? Do you think
scaling is necessary in this case?
Solution:
Checking for null values:
● We can see that there are 697 null values in the depth column
● To handle them, we use median imputation, as depth is a continuous variable.
Please refer to the Jupyter notebook for the code
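A minimal sketch of median imputation, together with a count of zero values, is given below. The frame is hypothetical, and how zero values were finally treated in the notebook is not shown here, so the sketch only counts them.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two missing depth values and one zero in x
df = pd.DataFrame({
    "depth": [61.0, 62.5, np.nan, 60.0, np.nan, 63.0],
    "x": [4.3, 0.0, 5.1, 6.2, 4.8, 5.5],
})

# median() skips NaN by default
median_depth = df["depth"].median()
df["depth"] = df["depth"].fillna(median_depth)

# Count zeros in a dimension column; a zero length is not physically meaningful
zero_counts = (df[["x"]] == 0).sum()

print(median_depth)
print(df["depth"].isnull().sum())
print(zero_counts)
```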
After median imputation:
Depth
Table
X
Price
Outlier Treatment:
Please see the attached Jupyter notebook for the outlier treatment code. The graphs after
outlier treatment are given below:
Carat
Depth
Table
Y
Price
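The outlier treatment itself lives in the notebook and is not reproduced in this report. As an illustration only, the sketch below assumes IQR-based capping, a common choice: values beyond 1.5 times the interquartile range from the quartiles are clipped to the fence values. The function name cap_outliers_iqr and the sample prices are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (assumed method, not confirmed)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical prices with one extreme value
price = pd.Series([300.0, 500.0, 700.0, 900.0, 1100.0, 1300.0, 1500.0, 18000.0])
capped = cap_outliers_iqr(price)

print(capped.max())  # the extreme value is pulled down to the upper fence
```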
Problem 1.3:
● Encode the data (having string values) for Modelling. Data Split: Split the data into
train and test (70:30). Apply Linear regression. Performance Metrics: Check the
performance of Predictions on Train and Test sets using R-square, RMSE.
Solution:
Step 1: Convert categorical to dummy variables in data
A linear regression model cannot take categorical values directly. Hence, we encode the
categorical columns numerically by converting them to dummy variables. This turns each
categorical variable into a series of dichotomous (0/1) variables.
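Dummy encoding can be sketched with pandas as follows. The frame is hypothetical, and drop_first=True is an assumption: it drops one baseline level per variable, which would be consistent with a model that has terms such as cut_Good and cut_Ideal but no cut_Fair.

```python
import pandas as pd

# Hypothetical rows covering all five cut levels
df = pd.DataFrame({
    "carat": [0.30, 0.50, 0.70, 0.90, 1.10],
    "cut": ["Fair", "Good", "Ideal", "Premium", "Very Good"],
})

# One 0/1 column per level, dropping the first level (Fair) as the baseline
encoded = pd.get_dummies(df, columns=["cut"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```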
Step 2: Data Split: Split the data into train and test (70:30)
We will drop the id column.
Root Mean Square Error (RMSE) for training and test data
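The 70:30 split, the linear regression fit, and the R-square and RMSE checks can be sketched as below. The data is synthetic (one carat-like feature with a noisy linear price), so the printed scores do not correspond to the project's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic data: price grows roughly linearly with a carat-like feature
X = rng.uniform(0.2, 2.0, size=(500, 1))
y = 4000 * X[:, 0] + rng.normal(0, 300, 500)

# 70:30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LinearRegression().fit(X_train, y_train)

scores = {}
for name, features, target in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(features)
    scores[name] = {
        "r2": r2_score(target, pred),
        "rmse": float(np.sqrt(mean_squared_error(target, pred))),
    }

print(scores)
```

Comparing the train and test values side by side is a quick check for overfitting: similar scores suggest the model generalizes to unseen data.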
Statsmodels
Insights:
● Multicollinearity is still present in the dataset. As a common rule of
thumb, a VIF below 5 is considered acceptable.
● The statsmodels summary shows that some features do not add value to the
model; removing them should reduce the VIF values and lead to a better
linear regression model.
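The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes it by hand with NumPy on synthetic data; it is an illustration of the definition, not the notebook's implementation, and the helper name vif is hypothetical.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factor per column: 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1 - residuals.var() / target.var()
        factors.append(1 / (1 - r2))
    return np.array(factors)

rng = np.random.default_rng(3)
x = rng.uniform(4, 8, 300)
y = x + rng.normal(0, 0.1, 300)   # nearly a copy of x
table = rng.normal(57, 2, 300)    # unrelated to x and y

vifs = vif(np.column_stack([x, y, table]))
print(vifs.round(2))  # columns: x, y, table
```

In this synthetic case the two near-duplicate columns get very large VIF values while the unrelated column stays close to 1.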
Model Score
Price prediction
If higher accuracy is needed, dropping the depth column in a further iteration
may give better results.
Here’s the formula:
(-0.66) * Intercept + (1.1) * carat + (-0.02) * table + (-0.38) * x + (0.39) * y + (-0.15) * z + (-0.01)
* cut_Good + (0.04) * cut_Ideal + (0.04) * cut_Premium + (-0.05) * color_E + (-0.06) * color_F
+ (-0.1) * color_G + (-0.21) * color_H + (-0.32) * color_I + (-0.47) * color_J + (1.02) * clarity_IF +
(0.66) * clarity_SI1 + (0.45) * clarity_SI2 + (0.86) * clarity_VS1 + (0.79) * clarity_VS2 + (0.96) *
clarity_VVS1 + (0.95) * clarity_VVS2 +
Problem 1.4:
● Inference: Based on these predictions, what are the business insights and
recommendations?
Solution:
Overview:
In this business case study, we are expected to help Gem Stones co ltd predict
the price of cubic zirconia and understand the different price ranges. It should be
noted that jewelry pricing is a vaguely understood process that typically results in very high
retail prices and then shockingly low resale prices. Below are the insights from the
exploratory data analysis and the linear regression model built from the dataset.
Business insights:
● While examining the unique categorical values during the exploratory data analysis, it
was observed that the Ideal cut generated profit for the company.
● In terms of the colors, H, I, J turned out to be profit-generating.
● Similarly, the clarity levels VS1, VS2, and SI1 were the most profitable among all, and SI2
turned out to be the costliest in terms of price
● When comparing cut and color, J was found to be the most profitable; similarly,
VVS2 was the most profitable when comparing cut and clarity.
● In the linear regression model, 94% of the variance in price is explained on the
training set
Business Recommendations:
● To grow revenue from product sales, the focus should be on customer preference,
market preference, the highest-selling items, and the factors driving customer
demand.
● In terms of cut, customer preference and sales favor the Ideal, Premium, and
Very Good cuts. Hence, these high-selling products should be the prime focus of the
marketing campaigns.
● Marketing ads can focus on cut perfection, customer acceptance, quality, and
pricing.
● The best 5 attributes that are most important are: cut, carat, y (width of the
stone), clarity (VS1, VS2, SI1), and price.
------------------------------------------------------------------------------------------------------------------------------------
Problem Statement
You are hired by a tour and travel agency which deals in selling holiday packages. You are
provided details of 872 employees of a company. Among these employees, some opted for
the package and some didn't. You have to help the company in predicting whether an
employee will opt for the package or not based on the information given in the data set.
Also, find out the important factors based on which the company will focus on particular
employees to sell their packages.
Dataset for Problem 2: Holiday_Package.csv
Problem 2.1:
● Data Ingestion: Read the dataset. Do the descriptive statistics and do a null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis.
Solution:
Getting the basic info of the data:
Top 10 values:
Bottom 10 values:
Insights:
● The variables are both numeric and categorical: int and object data types
are present
● There are 8 variables and 872 records
● This result indicates that 45% of the employees have an interest in getting
the holiday package.
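A class proportion like this is typically read from normalized value counts of the target column. The sketch below uses a hypothetical 20-row target named holiday_package with a 45% "yes" share to mirror the figure above; the real column name and counts may differ.

```python
import pandas as pd

# Hypothetical target column: 9 "yes" and 11 "no" out of 20 employees
opted = pd.Series(["yes"] * 9 + ["no"] * 11, name="holiday_package")

# Share of each class rather than raw counts
share = opted.value_counts(normalize=True)
print(share)
```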
Insights:
● Salary is positively skewed. The data range is 0 to 50,000.
● Age is approximately normally distributed and ranges from 30 to 50.
● Education has slightly negative skewness. The distribution range is between
7.5 and 12.5
● The distribution of the number of young children (younger than 7 years)
has the least value.
● The number of older children shows no skewness and ranges between 0 and 2
Skewness Measurement:
Foreign:
● It can be observed that employees with a salary below 150,000 tend to
opt for the holiday package
Insights:
● The holiday package is preferred by employees with a salary below 50,000
● Employees with a salary below 50,000 are mostly aged 30 to 50. Hence, this age
group has actively opted for the holiday package
● Employees aged 50 to 60 have shown a tendency not to opt for the holiday
package.
Correlation Check
Age
Educ
No_young_children
No_older_children
Outlier Treatment:
Please see the attached Jupyter notebook for the outlier treatment code. The graphs after
outlier treatment are given below:
Salary
Age
Educ
No_young_children
No_older_children
Problem 2.2:
● Do not scale the data. Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA
(linear discriminant analysis).
Solution:
Logistic Regression Model:
Step 1: Convert categorical to dummy variables in data
Step 2: Data Split: Split the data into train and test (70:30).
● Please refer to the attached Jupyter notebook for the model code.
● Grid search is used to find the optimal hyperparameters of a model, i.e. those that result in
the most ‘accurate’ predictions.
● Here, grid search selected the liblinear solver, which is well suited to small datasets.
● The penalty and tolerance were also found using this method
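A grid search over logistic regression hyperparameters can be sketched with scikit-learn as below. The data is synthetic, and the grid (solver, regularization strength C, and tolerance) is an assumption for illustration; the notebook's actual grid, including the penalty options, is not shown in this report.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(42)

# Synthetic binary target driven by the first two of three features
X = rng.normal(size=(300, 3))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Hypothetical search grid
param_grid = {
    "solver": ["liblinear", "lbfgs"],
    "C": [0.1, 1.0, 10.0],
    "tol": [1e-4, 1e-3],
}

grid = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=3, scoring="accuracy"
)
grid.fit(X_train, y_train)

test_accuracy = grid.score(X_test, y_test)
print(grid.best_params_)
print(test_accuracy)
```

After fitting, best_params_ holds the selected combination and the grid object itself predicts with the refitted best model.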
Step 2: Data Split: Split the data into train and test (70:30):
Problem 2.3:
● Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each
model. Final Model: Compare both models and write an inference on which model is
best/optimized.
Solution:
Logistic Regression Model:
Prediction on the training set
Precision, Recall and F1 Scores of Train and Test data of Logistic Regression Model
Metric       Train   Test
Precision    0.65    0.69
Recall       0.45    0.45
F1           0.53    0.55
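The F1 score is the harmonic mean of precision and recall, so the reported values can be cross-checked against each other. The sketch below applies the formula to the rounded training precision (0.65) and recall (0.45); the helper name f1_from is hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reported training values for the logistic regression model
train_f1 = f1_from(0.65, 0.45)
print(round(train_f1, 2))  # 0.53
```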
Comparing the Logistic Regression and Linear Discriminant Analysis (LDA) Models - based on
their respective Precision, Recall and F1 Scores of Train and Test data
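A side-by-side comparison of the two models can be sketched as follows. The data is synthetic and both models use default settings, so the numbers are illustrative only and do not reproduce the scores reported in this document.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Synthetic binary classification data
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.7, size=400) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=7
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
        "confusion_matrix": confusion_matrix(y_test, pred),
    }

for name, metrics in results.items():
    print(name, {k: v for k, v in metrics.items() if k != "confusion_matrix"})
    print(metrics["confusion_matrix"])
```

The same loop can be run on the training split to compare train and test scores for each model.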
Model Section:
Problem 2.4:
● Inference: Based on these predictions, what are the insights and recommendations?
Solution:
Overview:
In this business case, a travel agency is looking for statistically sound predictions of
whether employees are likely to opt for the holiday package. The given dataset has
economic, behavioral, and age-group information about the employees. Based on the
exploratory data analysis and models built, below are the business insights.
Insights:
● The comparison of employee age range, salary, and holiday package
preference shows that the holiday package is preferred by employees with a salary
below 50,000.
● Employees with a salary below 50,000 are mostly aged 30 to 50. Hence, this age
group has actively opted for the holiday package
● Employees aged 50 to 60 have shown a tendency not to opt for the holiday
package.
● On the other hand, employees with a salary of more than 150,000 are also not
opting for the holiday package. This salary range has employees from both 30 - 50
and 50 - 60 age groups
● The holiday package is also not preferred by the employees having young children.
● Employees with older children are opting for the package normally.
● We can see three major groups where strategy implementation is needed to grow
the holiday package subscription:
○ Old age group - Employees aged between 50-60
○ Elite group - Salary range more than 150,000
○ Employees with small children
Business Recommendations:
The business recommendations can be given based on the target segments created. Let's
look into these one by one:
-------------------------------------------------------------------------------------------------------------------------------------