Practice Question 15.12.2023 2
Practice exercises
1. The district manager of Jasons, a large discount retail chain, is investigating why
certain stores in her region are performing better than others. She believes that three factors
are related to total sales: the number of competitors in the region, the population in the
surrounding area, and the amount spent on advertising. From her district, she selects a sample
of 30 stores. For each store she gathered the following information.
The sample data was run on computer and the following results were obtained
Analysis of Variance
Model        Sum of Squares   df   Mean Square
Regression   3050             3    1016.67
Residual     2200             26   84.62
Total        5250             29
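As a quick check on the figures in this table, the mean squares and the overall F statistic can be recomputed directly from the sums of squares. The short Python sketch below is illustrative only; the critical value quoted in the comment is taken from standard F tables.

```python
# Recompute the ANOVA quantities for the store-sales model
# (n = 30 stores, k = 3 predictors).
ssr, sse, k, n = 3050.0, 2200.0, 3, 30

msr = ssr / k            # regression mean square = 1016.67
mse = sse / (n - k - 1)  # residual mean square  = 84.62
f_stat = msr / mse       # overall F statistic

print(round(f_stat, 2))  # compare with F(0.05, 3, 26) ≈ 2.98
```

Since the statistic is far above the tabled critical value, the regression as a whole would be judged significant.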
2. Consider the following multiple regression model:
Y = β0 + β1X1 + β2X2 + β3X3 + ε
where
Y = number of hours of television watched last week
X1 = age
X2 = number of years of education
X3 = income (in RM’000)
The sample data was run on computer and the following results were obtained.
Analysis of Variance
Model        Sum of Squares   df   Mean Square
Regression   227              3    75.7
Residual     426              21   20.3
Total        653              24
a) Test the overall validity of the model using the 5% level of significance.
b) Which variable affects the number of hours of television watched the most?
c) Is there sufficient evidence that the number of hours of television watched and age are
linearly related? (Use the 1% level of significance).
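For part (a), the overall F statistic follows directly from the ANOVA table. A minimal Python sketch of the calculation; the critical value F(0.05, 3, 21) ≈ 3.07 in the comment is taken from standard F tables and is an assumption of this sketch, not part of the question.

```python
# Overall F test for the television-viewing model
# (n = 25 respondents, k = 3 predictors).
ssr, sse, k, n = 227.0, 426.0, 3, 25

f_stat = (ssr / k) / (sse / (n - k - 1))
print(round(f_stat, 2))  # ≈ 3.73; compare with F(0.05, 3, 21) ≈ 3.07
```

Since the statistic exceeds the tabled critical value, the model as a whole would be judged valid at the 5% level.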
There are many regression techniques that can be used to determine which set of independent variables contains the most important predictors of the dependent variable, Y. The researcher first identifies the dependent variable (response) Y and the set of potentially important independent variables X1, X2, ..., Xk, where k is generally large. (Note: this set of variables could include both first-order and higher-order terms as well as interactions.) The data are then entered into the computer software, and a variable selection method is chosen.
1. Stepwise Regression
This is the most widely used variable selection method.
Step 1
First, all possible one-variable models of the form Y = β0 + β1Xi are formed. The independent variable that has the largest t-value, F-value or r-value and is significant is selected as the best one-variable predictor of Y. Using the example regarding the 26 fast food restaurants selling fried chicken, the F values of all three independent variables are as follows.
Fx1 = 0.621
Fx2 = 37.991
Fx3 = 54.395
The critical value is Fc = Fα,1,n−k−1 = F0.05,1,24 = 4.26.
Since Fx3 is the largest and is significant, x3 will be the first variable to be selected.
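Step 1 amounts to a simple argmax over the one-variable F values, subject to significance. A minimal sketch (the dictionary layout is my own choice; the values are those quoted above):

```python
# Step 1 of stepwise selection: choose the predictor with the largest
# one-variable F value, provided it exceeds the critical value.
f_values = {"x1": 0.621, "x2": 37.991, "x3": 54.395}
f_crit = 4.26  # F(0.05, 1, 24)

best = max(f_values, key=f_values.get)
selected = best if f_values[best] > f_crit else None
print(selected)  # x3
```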
Step 2
All possible two-variable models containing the variable selected in step 1 are now formed, and the partial-F value for the second variable in each model is computed.
The critical value is Fc = Fα,1,n−k−1 = F0.05,1,23 = 4.28.
Since Fx2|x3 is the largest and > 4.28, x2 is selected.
Step 3
Stepwise will now go back and check whether the variable selected in step 1 is still significant given that the variable selected in step 2 is already in the model. This is because the estimated value of β1 will change in a two-variable model, and we need to see whether it is still significant. Hence, we need to work out the partial-F value for the variable selected in step 1 given that the variable selected in step 2 is in the model.
The critical value is Fc = Fα,1,n−k−1 = F0.05,1,23 = 4.28.
Since Fx3|x2 > 4.28, we can conclude that x3 is significant and cannot be dropped from the model.
If the variable selected in step 1 is found to be not significant given that the second selected variable is in the model, it is dropped from the model, and stepwise proceeds to construct all possible two-variable models containing the second variable, searching for the independent variable with a β parameter that yields the most significant partial-F value.
Step 4
The stepwise regression procedure now checks for a third independent variable to include in the model with the two variables already selected. That is, we are seeking the best model of the form Ŷ = β0 + β1X1 + β2X2 + β3Xi. To do this, all possible three-variable models are formed with the two selected variables and each of the remaining (k − 2) independent variables, and the partial-F value for the third variable is computed and tested for significance. The third variable with the largest significant partial-F value is then selected.
Fx1|x2,x3 = MSR(x1|x2,x3) / MSE(x1,x2,x3)
= {[SSR(x1,x2,x3) − SSR(x2,x3)] / 1} / MSE(x1,x2,x3)
= (176732.665 − 176503.756) / 90.07
= 2.5415
The critical value is Fc = Fα,1,n−k−1 = F0.05,1,22 = 4.30.
Since Fx1|x2,x3 < 4.30, x1 is not significant and hence it will not be selected.
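The partial-F computation in this step is a one-liner once the sums of squares are known; the sketch below just replays the arithmetic quoted above.

```python
# Partial-F for adding x1 to a model that already contains x2 and x3.
ssr_full = 176732.665     # SSR(x1, x2, x3)
ssr_reduced = 176503.756  # SSR(x2, x3)
mse_full = 90.07          # MSE(x1, x2, x3)

partial_f = (ssr_full - ssr_reduced) / 1 / mse_full  # 1 df for the added term
print(round(partial_f, 4))  # ≈ 2.5415, below F(0.05, 1, 22) = 4.30
```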
Stepwise will then recheck the partial-F values of the variables already in the model, given that the third selected variable has been entered. If any of these variables is found to be not significant, it is dropped from the model and a search is made for other significant independent variables. This procedure is continued until no further independent variable can be found that yields a significant partial-F value.
2. Forward Selection
The forward selection method is similar to the stepwise method. Variables are selected one at a time based on their importance in explaining the dependent variable. The only difference is that the forward selection technique does not recheck the partial-F values of the predictor variables that were entered into the model in an earlier step.
Step 1
First, all possible one-variable models of the form Y = β0 + β1Xi are formed. The independent variable that has the largest t-value, F-value or r-value and is significant is selected as the best one-variable predictor of Y. Using the example regarding the 26 fast food restaurants selling fried chicken, the F values of all three independent variables are as follows.
Fx1 = 0.621
Fx2 = 37.991
Fx3 = 54.395
The critical value is Fc = Fα,1,n−k−1 = F0.05,1,24 = 4.26.
Since Fx3 is the largest and is significant, x3 will be the first variable to be selected.
Step 2
The forward selection method will now form all possible two-variable models Ŷ = β0 + β1X1 + β2Xi, each containing the variable that was selected in step 1 and one of the remaining (k − 1) variables. The partial-F value for the second variable in each of the two-variable models is then computed. The variable with the largest significant partial-F value will be the second variable selected.
The critical value is Fc = Fα,1,n−k−1 = F0.05,1,23 = 4.28.
Since Fx2|x3 is the largest and > 4.28, x2 is selected.
Step 3
The forward selection procedure now checks for a third independent variable to include in the model with the two variables already selected. That is, we are seeking the best model of the form Ŷ = β0 + β1X1 + β2X2 + β3Xi. To do this, all possible three-variable models are formed with the two selected variables and each of the remaining (k − 2) independent variables, and the partial-F value for the third variable is computed and tested for significance. The third variable with the largest significant partial-F value is then selected.
The critical value is Fc = Fα,1,n−k−1 = F0.05,1,22 = 4.30.
Since Fx1|x2,x3 < 4.30, x1 is not significant and hence it will not be selected.
The forward selection method will now check for a fourth independent variable to include in the model based on the most significant partial-F value. This procedure is continued until no further independent variable can be found that yields a significant partial-F value.
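The forward-selection loop described above can be sketched in a few lines of code. The sketch below is illustrative only: it fits ordinary least squares with numpy on synthetic data, and the function name forward_select and the fixed critical value are my own choices, not part of the original example.

```python
# A minimal forward-selection sketch using ordinary least squares.
import numpy as np

def sse(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_select(X, y, f_crit):
    n, k = X.shape
    selected, remaining = [], list(range(k))
    sse_current = float(((y - y.mean()) ** 2).sum())  # intercept-only SSE
    while remaining:
        # Partial-F for each candidate, given the variables already chosen
        scores = {}
        for j in remaining:
            cols = selected + [j]
            sse_new = sse(X[:, cols], y)
            df_resid = n - len(cols) - 1
            scores[j] = (sse_current - sse_new) / (sse_new / df_resid)
        best = max(scores, key=scores.get)
        if scores[best] < f_crit:
            break  # no remaining candidate is significant
        selected.append(best)
        remaining.remove(best)
        sse_current = sse(X[:, selected], y)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 5 + 4 * X[:, 2] + 2 * X[:, 1] + rng.normal(scale=0.5, size=60)
print(forward_select(X, y, f_crit=4.0))  # strongest predictors enter first
```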
3. Backward Elimination
Unlike the first two methods, the backward elimination method begins with the full model and then drops the least significant variable, one at a time at each step, based on the partial-F values.
Step 1
This method begins by constructing a regression model with all potential independent variables. Assuming that there are k potential independent variables, the model Ŷ = β0 + β1X1 + β2X2 + ... + βkXk is first constructed. The variable with the smallest partial-F value, t-value or r-value that is not significant is identified and dropped from the model.
Fx1|x2,x3 = MSR(x1|x2,x3) / MSE(x1,x2,x3)
= (176732.665 − 176503.756) / 90.070
= 2.5415
The critical value is Fc = F0.05,1,22 = 4.30.
Since Fx1|x2,x3 is the smallest and not significant, x1 is dropped from the model.
(Note: if all the variables are significant, the procedure stops and the model in this step is taken as the best model.)
Step 2
The backward elimination method will now form a model with the (k − 1) remaining variables, and the variable with the smallest non-significant partial-F value is then dropped.
The critical value is Fc = Fα,1,n−k−1 = F0.05,1,23 = 4.28.
Since both partial-F values are > 4.28, neither x2 nor x3 can be dropped from the model.
This process is repeated until no further non-significant independent variables can be found.
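The same machinery gives a compact sketch of backward elimination. Again this is an illustrative sketch on synthetic data; the helper names and the fixed critical value are my own assumptions, not the textbook calculation.

```python
# A minimal backward-elimination sketch using ordinary least squares.
import numpy as np

def sse(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def backward_eliminate(X, y, f_crit):
    n = len(y)
    kept = list(range(X.shape[1]))
    while kept:
        sse_full = sse(X[:, kept], y)
        mse_full = sse_full / (n - len(kept) - 1)
        # Partial-F for each variable currently in the model
        scores = {j: (sse(X[:, [i for i in kept if i != j]], y) - sse_full)
                     / mse_full
                  for j in kept}
        worst = min(scores, key=scores.get)
        if scores[worst] >= f_crit:
            break  # every remaining variable is significant; stop
        kept.remove(worst)
    return kept

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = 3 + 4 * X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.5, size=60)
print(backward_eliminate(X, y, f_crit=4.0))
```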
Practice Exercises
1. A real estate agent wanted to predict the selling price of single-storey houses in a new housing development area in Banting. In order to use a multiple regression model, he gathered the prices of 15 recently sold houses. He also collected data on three independent variables which he felt might affect the price of the houses. The variables collected are as follows:
Model 1
             Sum of Squares   df   Mean Square   F        Sig.
Regression   4034.414         1    4034.414      23.885   .000(a)
Residual     2195.822         13   168.909
Total        6230.236         14
a Predictors: (Constant), Size
b Dependent Variable: Price

Model 2
             Sum of Squares   df   Mean Square   F        Sig.
Regression   1690.364         1    1690.364      4.840    .046(a)
Residual     4539.872         13   349.221
Total        6230.236         14
a Predictors: (Constant), Age
b Dependent Variable: Price

Model 3
             Sum of Squares   df   Mean Square   F        Sig.
Regression   864.026          1    864.026       2.093    .172(a)
Residual     5366.210         13   412.785
Total        6230.236         14
a Predictors: (Constant), Lotsize
b Dependent Variable: Price

Model 4
             Sum of Squares   df   Mean Square   F        Sig.
Regression   4341.373         2    2170.686      13.790   .001(a)
Residual     1888.863         12   157.405
Total        6230.236         14
a Predictors: (Constant), Age, Size
b Dependent Variable: Price

Model 5
             Sum of Squares   df   Mean Square   F        Sig.
Regression   5704.027         2    2852.014      65.039   .000(a)
Residual     526.209          12   43.851
Total        6230.236         14
a Predictors: (Constant), Lotsize, Size
b Dependent Variable: Price

Model 6
             Sum of Squares   df   Mean Square   F        Sig.
Regression   4259.516         2    2129.758      12.968   .001(a)
Residual     1970.720         12   164.227
Total        6230.236         14
a Predictors: (Constant), Lotsize, Age
b Dependent Variable: Price

Model 7
             Sum of Squares   df   Mean Square   F        Sig.
Regression   5707.438         3    1902.479      40.029   .000(a)
Residual     522.798          11   47.527
Total        6230.236         14
a Predictors: (Constant), Lotsize, Size, Age
b Dependent Variable: Price
Using the stepwise selection procedure, determine the best set of independent variables to explain the selling price of single-storey houses in Banting.
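One way to organise the arithmetic for this exercise is to tabulate the SSR and MSE of each fitted model and compute the partial-F values step by step. The sketch below is a worked check of the stepwise logic using the figures from Models 1–7; the critical values quoted in the comments (F(0.05,1,12) ≈ 4.75 and F(0.05,1,11) ≈ 4.84) are taken from standard F tables and are assumptions of this sketch.

```python
# Stepwise walkthrough for the Banting house-price data, using the
# SSR and MSE values from Models 1-7 above.
ssr = {("Size",): 4034.414, ("Age",): 1690.364, ("Lotsize",): 864.026,
       ("Age", "Size"): 4341.373, ("Lotsize", "Size"): 5704.027,
       ("Lotsize", "Age"): 4259.516, ("Lotsize", "Size", "Age"): 5707.438}
mse = {("Age", "Size"): 157.405, ("Lotsize", "Size"): 43.851,
       ("Lotsize", "Size", "Age"): 47.527}

def partial_f(full, reduced):
    """Partial-F for the variable added when going from reduced to full."""
    return (ssr[full] - ssr[reduced]) / mse[full]

# Step 1: Size has the largest one-variable F (23.885) and is significant.
# Step 2: candidate second variables, given Size is in the model
f_age = partial_f(("Age", "Size"), ("Size",))
f_lot = partial_f(("Lotsize", "Size"), ("Size",))
print(round(f_age, 2), round(f_lot, 2))  # Lotsize wins (vs F(0.05,1,12) ≈ 4.75)

# Step 3: recheck Size given Lotsize is now in the model
f_size = partial_f(("Lotsize", "Size"), ("Lotsize",))
print(round(f_size, 2))                  # Size stays

# Step 4: try adding Age to the Size + Lotsize model
f_age3 = partial_f(("Lotsize", "Size", "Age"), ("Lotsize", "Size"))
print(round(f_age3, 3))                  # below F(0.05,1,11) ≈ 4.84, so stop
```

The procedure therefore stops with Size and Lotsize as the selected subset.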
2. A real estate company evaluates vacancy rates, square footage, rental rates, and operating expenses for commercial properties in a large metropolitan area in order to provide clients with quantitative information upon which to make rental decisions. The data below are taken from 81 suburban properties that are the newest, best located, most attractive and most expensive for five geographic areas. Shown here are the age (X1), operating expenses (X2), vacancy rates (X3), total square footage (X4) and rental rates (Y). The following table shows the values of SSR and SSE when the listed variables are included in the regression model.
Using the forward selection method, determine the best subset of variables to predict rental
rates of these commercial properties.
3. Three variables are being considered to measure the kidney function of young adults. The data collected for this study are creatinine clearance (y), creatinine concentration (x1), age (x2) and weight (x3). A regression analysis was performed and the following outputs were obtained.
Model 1
             Sum of Squares   df   Mean Square   F        Sig.
Regression   24.920           3    8.307         52.870   .001(a)
Residual     0.628            4    0.157
Total        25.548           7
a Predictors: (Constant), x3, x2, x1
b Dependent Variable: y

Model 2
             Sum of Squares   df   Mean Square   F        Sig.
Regression   24.875           2    12.438        92.303   .000(a)
Residual     0.674            5    0.135
Total        25.548           7
a Predictors: (Constant), x1, x2
b Dependent Variable: y

Model 3
             Sum of Squares   df   Mean Square   F        Sig.
Regression   23.304           2    11.652        25.949   .002(a)
Residual     2.245            5    0.449
Total        25.548           7
a Predictors: (Constant), x1, x3
b Dependent Variable: y

Model 4
             Sum of Squares   df   Mean Square   F        Sig.
Regression   23.365           2    11.683        26.754   .002(a)
Residual     2.183            5    0.437
Total        25.548           7
a Predictors: (Constant), x2, x3
b Dependent Variable: y

Model 5
             Sum of Squares   df   Mean Square   F        Sig.
Regression   22.981           1    22.981        53.695   .000(a)
Residual     2.568            6    0.428
Total        25.548           7
a Predictors: (Constant), x1
b Dependent Variable: y

Model 6
             Sum of Squares   df   Mean Square   F        Sig.
Regression   19.366           1    19.366        18.793   .005(a)
Residual     6.183            6    1.030
Total        25.548           7
a Predictors: (Constant), x2
b Dependent Variable: y

Model 7
             Sum of Squares   df   Mean Square   F        Sig.
Regression   20.917           1    20.917        27.093   .002(a)
Residual     4.632            6    0.772
Total        25.548           7
a Predictors: (Constant), x3
b Dependent Variable: y
Using the backward elimination procedure, determine the most appropriate regression model.
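The arithmetic for this exercise can be organised the same way as before. The sketch below is a worked check of the backward-elimination logic using the figures from Models 1–7; the critical values quoted in the comments (F(0.05,1,4) ≈ 7.71 and F(0.05,1,5) ≈ 6.61) are taken from standard F tables and are assumptions of this sketch.

```python
# Backward-elimination walkthrough for the kidney-function data,
# using the SSR and MSE values from Models 1-7 above.
ssr = {("x1", "x2", "x3"): 24.920, ("x1", "x2"): 24.875,
       ("x1", "x3"): 23.304, ("x2", "x3"): 23.365,
       ("x1",): 22.981, ("x2",): 19.366, ("x3",): 20.917}
mse_full, mse_x1x2 = 0.157, 0.135  # MSE of Models 1 and 2

# Step 1: partial-F for dropping each variable from the full model
full = ("x1", "x2", "x3")
f_drop = {"x1": (ssr[full] - ssr[("x2", "x3")]) / mse_full,
          "x2": (ssr[full] - ssr[("x1", "x3")]) / mse_full,
          "x3": (ssr[full] - ssr[("x1", "x2")]) / mse_full}
print({v: round(f, 2) for v, f in f_drop.items()})
# x3 has the smallest partial-F, below F(0.05, 1, 4) ≈ 7.71, so x3 is dropped.

# Step 2: recheck x1 and x2 in the two-variable model
f_x1 = (ssr[("x1", "x2")] - ssr[("x2",)]) / mse_x1x2
f_x2 = (ssr[("x1", "x2")] - ssr[("x1",)]) / mse_x1x2
print(round(f_x1, 2), round(f_x2, 2))
# Both exceed F(0.05, 1, 5) ≈ 6.61, so the final model keeps x1 and x2.
```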