Group1 - DS Assignment 1
Decision Sciences – II
Assignment 1
THE PROFESSOR PROPOSES
Further, we proceed to do a stepwise multiple regression to account for the effect of other
variables as well. We find that, in all, 11 variables are significant and we construct the model
as follows:
The model has R² = 0.965. Looking at the variables, we find that Carat and VS1 (Clarity)
increase the price, while each of the other variables decreases it.
Using this model, we find that the diamond that caught the eye of the professor is
worth $2689.30.
Problem Description
A professor sets out to buy a diamond that meets several specifications. Once he finds
such a diamond, he wants to check whether the quoted price is fair.
For this purpose, he does some research online and obtains data on the following
parameters:
Colour, Carat, Cut, Clarity, Certification, Polish and Symmetry: 7 variables, of which Carat is
continuous and the others are categorical.
We have not considered Wholesaler as a variable in our analysis, as each wholesaler was
selling diamonds in a different carat size category; hence, the choice of wholesaler on its own
did not influence the price.
Analysis Pedagogy
First, we converted the categorical variables into binary dummy variables, ensuring that
collinearity due to redundancy is avoided. We then used SPSS to construct the regression
model using the stepwise regression method. We ultimately found that there are 11 significant
variables, including the dummy variables. They are listed below:
Carat, I2 (Clarity), I1 (Clarity), EGL (Certification), Very Light Yellow (Colour), Faint
Yellow (Colour), Near Colourless (Colour), SI3 (Clarity), VS1 (Clarity), Fair (Symmetry),
and SI2 (Clarity).
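The dummy coding described above can be sketched as follows. This is a minimal illustration in Python, not the SPSS procedure itself; the function name `dummy_encode`, the sample clarity values and the choice of VS2 as the baseline level are assumptions made only for this example. Leaving one level of each categorical variable without a column is what avoids redundancy (perfect collinearity) among the dummies.

```python
def dummy_encode(values, baseline):
    """Map each non-baseline level to a 0/1 indicator column.

    The baseline level gets no column of its own: a row whose
    indicators are all 0 belongs to the baseline.
    """
    levels = sorted(set(values) - {baseline})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical clarity grades for five diamonds (not taken from the data set).
clarity = ["VS1", "SI2", "I1", "VS2", "SI2"]

# "VS2" is an assumed baseline level, chosen only for this example.
dummies = dummy_encode(clarity, baseline="VS2")
for level, column in dummies.items():
    print(level, column)
```

With k levels in a categorical variable, the sketch produces k - 1 columns, so no column can be written as a combination of the others and the intercept.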
              Sum of Squares    Df     Mean Square     F         Significance
Regression    519687583.407     1      519687583.4     2612.8    1.0237E-186
Residual      87117937.537      438    198899.4008
Total         606805520.943     439
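The entries of the ANOVA table are tied together by simple arithmetic, which can be checked with a short sketch. The sums of squares and degrees of freedom are as read from the table (which, with 1 regression degree of freedom, appears to be the regression of Price on Carat alone); the variable names are ours.

```python
# Sums of squares and degrees of freedom as read from the ANOVA table.
ss_regression = 519687583.407
ss_residual = 87117937.537
df_regression = 1
df_residual = 438

# Mean square = sum of squares / degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of the two mean squares; R squared is the share of the
# total sum of squares that the regression explains.
f_statistic = ms_regression / ms_residual
r_squared = ss_regression / (ss_regression + ss_residual)

print(f"Mean square (residual): {ms_residual:.4f}")
print(f"F statistic: {f_statistic:.1f}")
print(f"R squared: {r_squared:.3f}")
```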
[Figure: Carat Line Fit Plot. Price and Predicted Price (0 to 5000) against Carat (0 to 1.8).]
A few comments on the plot: despite a good R² value and strong significance, the plot
reveals that the data are quite heteroskedastic, i.e., the standard deviation of the price is
not constant across carat values but is “clustered.”
Let us next look at the normal probability plot. The plot is quite linear, so normality
appears to be a valid assumption.
[Figures: residual plot (residuals against Carat) and normal probability plot (Price against Sample Percentile).]
The residual plot reinforces our observations from the line fit plot. For further analysis,
however, we need to use the other significant variables as well; namely, we need to do
multiple regression.
Multiple Linear Regression
As has been mentioned in the problem statement, we next did a multiple linear regression
analysis with price as the dependent variable and the other 11 variables as independent
variables. The model summary that we obtained is as follows:
The significance of the regression is inferred using the following hypothesis test:
H_0: All the Betas are 0
H_1: At least one of the Betas is non-zero
The significance can be assessed using the t-test and the F test.
1. t-test:
The t values obtained are listed in Table 1. From the same table, we also find that the
p-values are less than 0.05 for all the 11 coefficients, and they are thus significant.
2. F test:
The F value for the final regression model is 1060.334. With 11 and 428 degrees of
freedom for the numerator and denominator, we find that the p-value is <<0.05 and
thus, the overall regression is significant.
As the extremely small p-value shows, the F test is highly significant. Further, the R² value
is 0.965, i.e., the model explains about 96.5% of the variation in price, signifying a good
fit.
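The reported F value and R² can be cross-checked against each other. Since F = (SSR / k) / (SSE / df_residual) for a regression with k predictors, it follows that R² = k·F / (k·F + df_residual). A short sketch of this arithmetic, using only the figures quoted above (the variable names are ours):

```python
f_statistic = 1060.334   # F value of the final regression model
df_regression = 11       # numerator degrees of freedom (k predictors)
df_residual = 428        # denominator degrees of freedom

# k * F equals SSR / SSE scaled by df_residual, so the ratio below
# is SSR / (SSR + SSE), i.e. R squared.
explained = f_statistic * df_regression
r_squared = explained / (explained + df_residual)

print(f"R squared implied by F: {r_squared:.3f}")
```

The implied value agrees with the reported R² to three decimal places.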
Table 1: Coefficients
                   Unstandardized Coefficients    Standardized Coefficients
Variable           B             Std. Error       Beta                         t           Sig.
(Constant)         -392.580      28.196                                        -13.923     1.31E-36
Carat              4152.202      50.363           1.341                        82.446      8.56E-265
I2                 -1680.414     59.981           -0.349                       -28.016     7.44E-99
I1                 -848.398      43.033           -0.281                       -19.715     4.96E-62
EGL                -352.347      36.167           -0.133                       -9.742      2.17E-20
V.Light Yellow     -963.728      71.627           -0.134                       -13.455     1.15E-34
Faint Yellow       -533.472      32.992           -0.192                       -16.170     3.03E-46
Near Colourless    -224.557      26.136           -0.095                       -8.592      1.61E-16
SI3                -521.283      60.131           -0.105                       -8.669      9.08E-17
VS1                203.865       45.941           0.044                        4.438       1.15E-05
Fair symmetry      -255.977      52.404           -0.046                       -4.885      1.46E-06
From the above table we can infer that Carat, which has the largest standardized coefficient
(Beta = 1.341) and the largest t value, has the maximum impact on Price.
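The t values in the coefficients table can be reproduced by dividing each coefficient B by its standard error. The sketch below does this for the rows as read from the table; the dictionary layout and names are ours, and the SPSS output remains the authoritative source.

```python
# (B, Std. Error, reported t) as read from the coefficients table.
coefficients = {
    "(Constant)": (-392.580, 28.196, -13.923),
    "Carat": (4152.202, 50.363, 82.446),
    "I2": (-1680.414, 59.981, -28.016),
    "I1": (-848.398, 43.033, -19.715),
    "EGL": (-352.347, 36.167, -9.742),
    "V.Light Yellow": (-963.728, 71.627, -13.455),
    "Faint Yellow": (-533.472, 32.992, -16.170),
    "Near Colourless": (-224.557, 26.136, -8.592),
    "SI3": (-521.283, 60.131, -8.669),
    "VS1": (203.865, 45.941, 4.438),
    "Fair symmetry": (-255.977, 52.404, -4.885),
}

# t = B / Std. Error for every row.
recomputed = {name: b / se for name, (b, se, _) in coefficients.items()}

for name, (_, _, reported_t) in coefficients.items():
    print(f"{name:16s} t = {recomputed[name]:8.3f} (reported {reported_t:8.3f})")

# The row with the largest |t| is the most strongly significant one.
strongest = max(recomputed, key=lambda name: abs(recomputed[name]))
print("Largest |t|:", strongest)
```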
Using the table obtained, we find that the standard error for the regression is
S = SQRT(MSE) = 224.017
We thus find that the quoted price indeed falls within the interval and is hence fair at the
level of confidence we are looking at.
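The interval itself can be approximated as follows. This is a rough sketch and not the exact calculation: it assumes a 95% confidence level, uses the normal critical value 1.96 in place of the t value for 428 degrees of freedom, and ignores the extra term for the uncertainty in the estimated coefficients, so the resulting interval is slightly too narrow. The helper `is_fair` is hypothetical and only illustrates the comparison.

```python
predicted_price = 2689.30   # model prediction quoted in this report
standard_error = 224.017    # S = SQRT(MSE) from this report
z_critical = 1.96           # ASSUMPTION: 95% level, normal approximation

# Approximate interval: prediction plus or minus z * S.
margin = z_critical * standard_error
lower = predicted_price - margin
upper = predicted_price + margin


def is_fair(quoted_price):
    """Hypothetical helper: True if the quote lies inside the interval."""
    return lower <= quoted_price <= upper


print(f"Approximate 95% prediction interval: ({lower:.2f}, {upper:.2f})")
```

A quoted price can then be checked with `is_fair(quoted_price)`; any value passed in is the reader's own input, as the quoted price is not restated here.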
Plots
The normal probability plot is as follows:
We next look at the histogram of the regression standardized residual (dependent variable:
Price), as can be seen below.
Observations:
Recommendations
Using our model, we find that the price which should be quoted for the diamond is $2689.30.