Group1 - DS Assignment 1

Decision Sciences – II
Assignment 1
THE PROFESSOR PROPOSES

Submitted To: Prof. Trilochan Sastry

Submitted By: Group 1
J Sathvika 1911104
Yatin Maini 1911115
Alok Ranjan 1911078
Pavan Chandra 1911126
Kritika Saini 1911131

Executive Summary
Using the data given in the case “The Professor Proposes,” we analyse how the price of a
diamond depends on several variables. We first classify the variables as continuous or
categorical and find that only one variable, carat, is continuous. The remaining variables
are categorical and must be coded as dummy variables before the regression can be run, as
can be seen in the spreadsheets used for the modeling. We proceed first with simple linear
regression and obtain the following model:

Price = -200.484 + 2864.733*Carat

with R2 = 0.856, i.e., carat alone explains about 86% of the variation in price.

Further, we proceed to do a stepwise multiple regression to account for the effect of other
variables as well. We find that, in all, 11 variables are significant and we construct the model
as follows:

Price = -392.58 + 4152.202*Carat - 1680.414*I2 - 848.398*I1 - 352.347*EGL
        - 963.728*Very light yellow - 533.472*Faint yellow - 224.557*Near colorless
        - 521.283*SI3 + 203.865*VS1 - 255.977*Fair symmetry - 121.673*SI2

with R2 = 0.965. Carat and VS1 (Clarity) have positive coefficients and so raise the
predicted price. Each of the other variables is a dummy with a negative coefficient, so a
diamond in that category is priced lower than an otherwise identical diamond in the baseline
category.
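
As a minimal sketch of how this model is applied, the Python snippet below evaluates the
equation for a given diamond. The function name predict_price and the example diamond
(1.0 carat, EGL certified, SI2 clarity) are hypothetical and chosen only for illustration;
the coefficients are the ones reported in the model above.

```python
# Coefficients of the stepwise regression model reported in the text.
INTERCEPT = -392.58
CARAT_COEFFICIENT = 4152.202
DUMMY_COEFFICIENTS = {
    "I2": -1680.414,
    "I1": -848.398,
    "EGL": -352.347,
    "Very light yellow": -963.728,
    "Faint yellow": -533.472,
    "Near colorless": -224.557,
    "SI3": -521.283,
    "VS1": 203.865,
    "Fair symmetry": -255.977,
    "SI2": -121.673,
}


def predict_price(carat, active_dummies=()):
    """Return the model's predicted price for a diamond.

    carat: weight of the diamond in carats.
    active_dummies: names of the dummy variables that equal 1 for this diamond.
    """
    price = INTERCEPT + CARAT_COEFFICIENT * carat
    for name in active_dummies:
        price += DUMMY_COEFFICIENTS[name]
    return price


# Hypothetical example diamond (not the professor's): 1.0 carat, EGL, SI2.
example_price = predict_price(1.0, ["EGL", "SI2"])
print(round(example_price, 3))
```

Any dummy that is not listed for a diamond is treated as 0, which corresponds to the
baseline level of that attribute.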

Using this model, we estimate the price of the diamond that caught the professor's eye at
$2689.30.

Problem Description
A professor sets out to buy a diamond that meets several specifications. Once he finds such
a diamond, he wants to check whether the quoted price is fair. For this purpose, he conducts
some research online and obtains data on the following parameters:

Colour, Carat, Cut, Clarity, Certification, Polish, Symmetry: 7 variables, of which Carat is
continuous and the others are categorical.
We have not considered Wholesaler as a variable in our analysis, as each wholesaler sold
diamonds in a different carat size category; hence the choice of wholesaler on its own
did not influence price.

Analysis Pedagogy
First of all, we converted the categorical variables into binary dummy variables, taking care
to avoid the collinearity that redundant dummies would introduce. We then used SPSS to
construct the regression model using the stepwise method. We ultimately found 11 significant
variables, including the dummy variables. They are listed below:

Carat, I2 (Clarity), I1 (Clarity), EGL (Certification), Very Light Yellow (Colour), Faint
Yellow (Colour), Near Colourless (Colour), SI3 (Clarity), VS1 (Clarity), Fair (Symmetry),
and SI2 (Clarity).
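
The dummy coding step can be sketched in Python as follows. This is only an illustration
and not the SPSS procedure we used: the function name dummy_code is hypothetical, and the
list of certification levels (GIA, IGI, EGL) with GIA as the baseline is an assumption made
for the example. Dropping the baseline level is what avoids the redundant, perfectly
collinear dummy.

```python
def dummy_code(value, levels, baseline):
    """Code one categorical value as 0/1 dummies, omitting the baseline level.

    With k levels this yields k - 1 dummies; a diamond at the baseline level
    has all of them equal to 0, so no dummy is a linear combination of the rest.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {level: int(value == level) for level in levels if level != baseline}


# Assumed levels for illustration only; the baseline choice is also an assumption.
certification_levels = ["GIA", "IGI", "EGL"]

print(dummy_code("EGL", certification_levels, baseline="GIA"))
print(dummy_code("GIA", certification_levels, baseline="GIA"))
```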

Simple Linear Regression


Since Carat alone gives an R2 of 0.856, the highest among the variables, we run a simple
linear regression of Price on Carat. The ANOVA table and the line fit plot are given below:

             Sum of Squares     Df    Mean Square       F         Significance
Regression   519687583.407        1   519687583.4       2612.8    1.0237E-186
Residual      87117937.537      438      198899.4008
Total        606805520.943      439

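The entries of the ANOVA table are linked by simple arithmetic, which the short Python
sketch below reproduces as a consistency check. The variable names are our own; the sums of
squares and degrees of freedom are the ones read from the table.

```python
# Sums of squares and degrees of freedom read from the ANOVA table.
ss_regression = 519687583.407
ss_residual = 87117937.537
df_regression = 1
df_residual = 438

# Total sum of squares is the sum of the explained and unexplained parts.
ss_total = ss_regression + ss_residual

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F statistic and coefficient of determination.
f_statistic = ms_regression / ms_residual
r_squared = ss_regression / ss_total

print(f"R2 = {r_squared:.3f}")          # about 0.856
print(f"F  = {f_statistic:.1f}")        # about 2612.8
```
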
[Figure: Carat Line Fit Plot, showing Price and Predicted Price (y-axis, 0 to 5000) against Carat (x-axis, 0 to 1.8)]

A few comments on the plot: despite the good R2 value and the strong significance, the plot
reveals that the data are quite heteroskedastic, i.e., the spread of prices around the fitted
line is not constant across carat values, and the observations are “clustered.”

Let us next look at the normal probability plot. The plot is quite linear, so normality
seems to be a valid assumption.

[Figure: Carat Residual Plot, showing Residuals (y-axis, -2000 to 1500) against Carat (x-axis, 0 to 1.8)]
[Figure: Normal Probability Plot, showing Price (y-axis, 0 to 3500) against Sample Percentile (x-axis, 0 to 120)]

The residual plot reinforces our observations from the plot of the data. For further
analysis, however, we need to use the other significant variables as well; that is, we need
to do multiple regression.

Multiple Linear Regression
As mentioned above, we next ran a multiple linear regression with price as the dependent
variable and the 11 significant variables as independent variables. The model summary that
we obtained is as follows:

The significance of the regression is inferred using the following hypothesis test:

H_0: All the Betas are 0
H_1: At least one of the Betas is non-zero

The significance can be assessed using the t-test for the individual coefficients and the
F test for the regression as a whole.

1. T-test:
The t values obtained are listed in Table 1. From the same table, we also find that the
p values are less than 0.05 for all 11 coefficients, which are thus significant.
2. F test:
The F value for the final regression model is 1060.334. With 11 and 428 degrees of freedom
for the numerator and denominator, the p-value is far below 0.05 and thus the overall
regression is significant.

As the extremely small p-value shows, the F test is highly significant. Further, the R2
value is 0.965, signifying that the model explains about 96.5% of the variation in price.

Table 1: Coefficients of the final regression model
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)

Variable            B            Std. Error   Beta      t          Sig.
(Constant)           -392.580    28.196       -         -13.923    1.31E-36
Carat                4152.202    50.363        1.341     82.446    8.56E-265
I2                  -1680.414    59.981       -0.349    -28.016    7.44E-99
I1                   -848.398    43.033       -0.281    -19.715    4.96E-62
EGL                  -352.347    36.167       -0.133     -9.742    2.17E-20
V.Light Yellow       -963.728    71.627       -0.134    -13.455    1.15E-34
Faint Yellow         -533.472    32.992       -0.192    -16.170    3.03E-46
Near Colourless      -224.557    26.136       -0.095     -8.592    1.61E-16
SI3                  -521.283    60.131       -0.105     -8.669    9.08E-17
VS1                   203.865    45.941        0.044      4.438    1.15E-05
Fair symmetry        -255.977    52.404       -0.046     -4.885    1.46E-06
SI2                  -121.673    30.631       -0.045     -3.972    8.35E-05

From the above table we can infer that Carat, which has the largest standardized coefficient
(Beta = 1.341) and the largest t value (82.446), has the maximum impact on Price.
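
Each t value in Table 1 is the coefficient B divided by its standard error. The Python
sketch below recomputes these ratios from the tabulated B and Std. Error values as a check;
the names TABLE_1 and t_ratio are our own, and small differences from the reported t values
are expected because the inputs are rounded to three decimals.

```python
# (B, Std. Error, reported t) for each term, as read from Table 1.
TABLE_1 = {
    "(Constant)": (-392.580, 28.196, -13.923),
    "Carat": (4152.202, 50.363, 82.446),
    "I2": (-1680.414, 59.981, -28.016),
    "I1": (-848.398, 43.033, -19.715),
    "EGL": (-352.347, 36.167, -9.742),
    "V.Light Yellow": (-963.728, 71.627, -13.455),
    "Faint Yellow": (-533.472, 32.992, -16.170),
    "Near Colourless": (-224.557, 26.136, -8.592),
    "SI3": (-521.283, 60.131, -8.669),
    "VS1": (203.865, 45.941, 4.438),
    "Fair symmetry": (-255.977, 52.404, -4.885),
    "SI2": (-121.673, 30.631, -3.972),
}


def t_ratio(b, std_error):
    """t statistic of a coefficient: the estimate divided by its standard error."""
    return b / std_error


for name, (b, se, reported_t) in TABLE_1.items():
    print(f"{name:16s} computed t = {t_ratio(b, se):8.3f}   reported t = {reported_t:8.3f}")
```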

Determination of offer price


Using the coefficients obtained, along with the values of the independent variables for the
diamond the professor is looking at, we get a predicted price of $2689.3. Recall that the
price quoted was $3100. To decide whether this price is indeed fair, we have to see whether
it falls within the confidence interval (C.I.) formed using the standard error of the
regression.

Using the table obtained, we find that the standard error of the regression is

S = SQRT(MSE) = 224.017

The C.I. at the 5% level of significance (two-tailed), with 428 degrees of freedom, is thus

y^ - t(0.05, 428) * S = 2248.98963 and y^ + t(0.05, 428) * S = 3129.61037

The quoted price of $3100 falls inside this interval and is hence fair at the level of
confidence we are looking at.
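
The interval check can be reproduced with a few lines of Python. This is a sketch under one
stated assumption: the critical value t(0.05, 428) is taken as roughly 1.9655, which is the
value implied by the bounds above and is not printed in the regression output. The variable
names are our own.

```python
predicted_price = 2689.3   # model prediction quoted in the text
standard_error = 224.017   # S = SQRT(MSE) from the regression output
t_critical = 1.9655        # ASSUMED two-tailed 5% critical value for 428 df
quoted_price = 3100.0      # price quoted to the professor

# Half-width of the interval, then its lower and upper bounds.
margin = t_critical * standard_error
lower = predicted_price - margin
upper = predicted_price + margin

# The quoted price is judged fair if it lies inside the interval.
is_within = lower <= quoted_price <= upper
print(f"Interval: [{lower:.2f}, {upper:.2f}]")
print(f"Quoted price inside interval: {is_within}")
```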

Plots
The normal probability plot is as follows:

Eyeballing the plot, we find the normal plot roughly coinciding with the 45° line. This
implies that the fit is quite good.

We next look at the histogram of the regression standardized residuals, with price as the
dependent variable, as can be seen below.

Observations:

In the collected data, there is no sample with the following attributes:

1) Light yellow or yellow colour
2) Poor cut, polish, or symmetry
3) FL, IF, or I3 clarity

Recommendations
Using our model, we find that the price which should be quoted for the diamond is $2689.30.

Limitations of the modeling process


Eyeballing the plot for the simple regression, we find that a linear model may not be the
best fit.
