
STA 450: FUNDAMENTALS OF REGRESSION ANALYSIS

Practice exercises

1. The district manager of Jasons, a large discount retail chain, is investigating why
certain stores in her region are performing better than others. She believes that three factors
are related to total sales: the number of competitors in the region, the population in the
surrounding area, and the amount spent on advertising. From her district, she selects a sample
of 30 stores. For each store she gathered the following information.

Y = total sales last year (RM’000)
X1 = number of competitors in the region
X2 = population of the region (in millions)
X3 = advertising expense (in RM’000)

The sample data were run on a computer and the following results were obtained:

Predictor    Coefficient   Standard Error   t-ratio
Constant     14.00         7.00             2.00
X1           -1.00         0.70             -1.43
X2           30.00         5.20             5.77
X3           0.20          0.08             2.50

Analysis of Variance
Model        Sum of Squares   df   Mean Square
Regression   3050             3    1016.67
Residual     2200             26   84.62
Total        5250             29

a) Write down the regression equation.


b) Interpret the coefficient of X1.
c) What is the estimated sales for a store which has four competitors, a regional
population of 400,000 and an advertising expense of RM30,000?
d) Determine the standard error of the estimate.
e) Determine the coefficient of multiple determination and interpret its meaning.
f) Test the significance of the model using the 5% level of significance.
g) Conduct tests of hypotheses to determine which of the independent variables have
significant regression coefficients. Use the 5% level of significance.
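A quick way to check the arithmetic in parts (c) through (f) is a short Python sketch; the coefficients and sums of squares are copied from the printout above, and the variable names are illustrative.

```python
import math

# Figures copied from the regression printout for question 1
b0, b1, b2, b3 = 14.00, -1.00, 30.00, 0.20
ssr, sse, sst = 3050, 2200, 5250
df_reg, df_res = 3, 26

mse = sse / df_res                   # residual mean square
y_hat = b0 + b1*4 + b2*0.4 + b3*30   # 4 competitors, 0.4 million people, RM30,000 advertising
s_e = math.sqrt(mse)                 # standard error of the estimate
r2 = ssr / sst                       # coefficient of multiple determination
f_stat = (ssr / df_reg) / mse        # overall F statistic (regression MS over residual MS)

print(y_hat)          # 28.0, i.e. estimated sales of RM28,000
print(round(s_e, 3))  # 9.199
print(round(r2, 4))
print(round(f_stat, 3))
```

Note that X2 must be entered in millions (400,000 people is 0.4) and X3 in RM’000, matching the units defined for the model.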

2. A statistician wanted to determine whether the demographic variables age, education, and income influence the number of hours of television watched per week. A random sample of 25 adults was selected and the following multiple regression model was constructed.

Y = β0 + β1X1 + β2X2 + β3X3 + ε
where
Y= number of hours of television watched last week
X1 = age
X2 = number of years of education
X3 = income (in RM’000)

The sample data were run on a computer and the following results were obtained:

Predictor    Coefficient   Standard Error   t-ratio
Constant     22.3          10.7             2.08
X1           0.41          0.19             2.16
X2           -0.29         0.13             -2.23
X3           -0.12         0.03             -4.00

Analysis of Variance
Model        Sum of Squares   df   Mean Square
Regression   227              3    75.7
Residual     426              21   20.3
Total        653              24

a) Test the overall validity of the model using the 5% level of significance.

b) Which variable affects the number of hours of television watched the most?

c) Is there sufficient evidence that the number of hours of television watched and age are linearly related? (Use the 1% level of significance.)

d) At the 1% level of significance, can we conclude that as the number of years of education increases, the number of hours of television watched decreases?

e) Estimate the number of hours of television watched by a 40-year-old housewife who has 15 years of education.

5.6 VARIABLE SELECTION METHODS

There are many regression techniques that can be used to determine which set of independent variables best predicts the dependent variable, Y. The researcher first identifies the dependent variable (response) Y and the set of potentially important independent variables X1, X2, ..., Xk, where k is generally large. (Note: this set of variables could include both first-order and higher-order terms as well as interactions.) The data are then entered into the computer software, and the variable selection method is selected.

1. Stepwise Regression
This is the most widely used variable selection method.

Step 1

First, all possible one-variable models of the form Y = β0 + β1Xi are formed. The independent variable with the largest t-value, F-value, or r value that is significant is selected as the best one-variable predictor of Y. Using the example of the 26 fast-food restaurants selling fried chicken, the F values of the three independent variables are as follows.

Fx1 = 0.621
Fx2 = 37.991
Fx3 = 54.395

The critical value is Fc = Fα,1,n−k−1 = F0.05,1,24 = 4.26

Since Fx3 is the largest and is significant, x3 is the first variable selected.

Step 2

Stepwise now forms all possible two-variable models Y = β0 + β1X1 + β2Xi containing the variable selected in step 1 and each of the remaining (k − 1) variables. The partial-F value for the second variable in each two-variable model is then computed. The variable with the largest significant partial-F value is selected as the second variable.

F(x1 | x3) = MSR(x1 | x3) / MSE(x1, x3)
           = [SSR(x1, x3) − SSR(x3)] / 1 / MSE(x1, x3)
           = (126888.997 − 124002.038) / 2253.27
           = 1.2812

F(x2 | x3) = MSR(x2 | x3) / MSE(x2, x3)
           = [SSR(x2, x3) − SSR(x3)] / 1 / MSE(x2, x3)
           = (176503.756 − 124002.038) / 96.106
           = 546.2897

The critical value is Fc = Fα,1,n−k−1 = F0.05,1,23 = 4.28

Since F(x2 | x3) is larger and exceeds 4.28, x2 is selected.

Step 3

Stepwise now goes back and checks whether the variable selected in step 1 is still significant given that the variable selected in step 2 is in the model. This is because the value of β1 changes in a two-variable model, so we need to see whether it remains significant. Hence, we work out the partial-F value for the variable selected in step 1 given that the variable selected in step 2 is in the model.

F(x3 | x2) = MSR(x3 | x2) / MSE(x2, x3)
           = (176503.756 − 109524.089) / 1 / 96.106
           = 696.94

The critical value is Fc = Fα,1,n−k−1 = F0.05,1,23 = 4.28

Since F(x3 | x2) > 4.28, we conclude that x3 is significant and cannot be dropped from the model.

If the variable selected in step 1 is found to be not significant given that the second selected variable is in the model, it is dropped, and stepwise proceeds to construct all possible two-variable models with the second variable in the model, searching for the independent variable with a β parameter that yields the most significant partial-F value.

Step 4

The stepwise regression procedure now checks for a third independent variable to include in the model with the two variables already selected. That is, we seek the best model of the form Y = β0 + β1X1 + β2X2 + β3Xi. To do this, all possible three-variable models are formed with the two selected variables and each of the remaining (k − 2) independent variables, and the partial-F value for the third variable is computed and tested for significance. The third variable with the largest significant partial-F value is then selected.
F(x1 | x2, x3) = MSR(x1 | x2, x3) / MSE(x1, x2, x3)
               = [SSR(x1, x2, x3) − SSR(x2, x3)] / 1 / MSE(x1, x2, x3)
               = (176732.665 − 176503.756) / 90.07
               = 2.5415

The critical value is Fc = Fα,1,n−k−1 = F0.05,1,22 = 4.30

Since F(x1 | x2, x3) < 4.30, x1 is not significant and hence is not selected.

The best model using the stepwise regression method is y = β0 + β1x2 + β2x3.

Stepwise then rechecks the partial-F values of the variables already in the model given that the third selected variable is present. If any variable is found to be not significant, it is dropped from the model and a search is made for other significant independent variables. This procedure continues until no further independent variable yields a significant partial-F value.
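The partial-F comparison that drives each stepwise step can be sketched in a few lines; the helper function is generic, and the hard-coded sums of squares are copied from the worked example above.

```python
def partial_f(ssr_full, ssr_reduced, mse_full):
    """Partial-F for one added variable: [SSR(full) - SSR(reduced)] / 1, divided by MSE(full)."""
    return (ssr_full - ssr_reduced) / mse_full

# Step 2 of the fried-chicken example: which variable joins x3?
f_x1_given_x3 = partial_f(126888.997, 124002.038, 2253.27)
f_x2_given_x3 = partial_f(176503.756, 124002.038, 96.106)

f_crit = 4.28  # F(0.05, 1, 23)
print(round(f_x1_given_x3, 4))  # 1.2812 -> below 4.28, so x1 does not enter
print(round(f_x2_given_x3, 4))  # 546.2897 -> above 4.28, so x2 enters the model
```

The same function serves steps 3 and 4 by swapping in the appropriate full-model and reduced-model sums of squares.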

2. Forward Selection Method

The forward selection method is similar to the stepwise method. Variables are selected one at a time based on their importance in explaining the dependent variable. The only difference is that the forward selection technique does not recheck the partial-F values of predictor variables entered into the model in earlier steps.
Step 1

First, all possible one-variable models of the form Y = β0 + β1Xi are formed. The independent variable with the largest t-value, F-value, or r value that is significant is selected as the best one-variable predictor of Y. Using the example of the 26 fast-food restaurants selling fried chicken, the F values of the three independent variables are as follows.

Fx1 = 0.621
Fx2 = 37.991
Fx3 = 54.395

The critical value is Fc = Fα,1,n−k−1 = F0.05,1,24 = 4.26

Since Fx3 is the largest and is significant, x3 is the first variable selected.

Step 2

The forward selection method now forms all possible two-variable models Y = β0 + β1X1 + β2Xi containing the variable selected in step 1 and each of the remaining (k − 1) variables. The partial-F value for the second variable in each two-variable model is then computed. The variable with the largest significant partial-F value is selected as the second variable.

F(x1 | x3) = MSR(x1 | x3) / MSE(x1, x3)
           = [SSR(x1, x3) − SSR(x3)] / 1 / MSE(x1, x3)
           = (126888.997 − 124002.038) / 2253.27
           = 1.2812

F(x2 | x3) = MSR(x2 | x3) / MSE(x2, x3)
           = [SSR(x2, x3) − SSR(x3)] / 1 / MSE(x2, x3)
           = (176503.756 − 124002.038) / 96.106
           = 546.2897

The critical value is Fc = Fα,1,n−k−1 = F0.05,1,23 = 4.28

Since F(x2 | x3) is larger and exceeds 4.28, x2 is selected.

Step 3
The forward selection procedure now checks for a third independent variable to include in the model with the two variables already selected. That is, we seek the best model of the form Y = β0 + β1X1 + β2X2 + β3Xi. To do this, all possible three-variable models are formed with the two selected variables and each of the remaining (k − 2) independent variables, and the partial-F value for the third variable is computed and tested for significance. The third variable with the largest significant partial-F value is then selected.

F(x1 | x2, x3) = MSR(x1 | x2, x3) / MSE(x1, x2, x3)
               = [SSR(x1, x2, x3) − SSR(x2, x3)] / 1 / MSE(x1, x2, x3)
               = (176732.665 − 176503.756) / 90.07
               = 2.5415

The critical value is Fc = Fα,1,n−k−1 = F0.05,1,22 = 4.30

Since F(x1 | x2, x3) < 4.30, x1 is not significant and hence is not selected.

The best model using the forward selection method is y = β0 + β1x2 + β2x3.

The forward selection method now checks for a fourth independent variable to include in the model based on the most significant partial-F value. This procedure continues until no further independent variable yields a significant partial-F value.
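The forward pass described above can be sketched as a greedy loop over the partial-F values already computed for the fried-chicken example. The numbers are copied from the worked steps, and the function name is illustrative.

```python
def forward_select(rounds):
    """rounds: list of (F_by_candidate, critical_F), one per pass.

    Each pass admits the candidate with the largest F value, provided it is
    significant; the loop stops as soon as no candidate clears the critical F.
    """
    selected = []
    for f_by_candidate, f_crit in rounds:
        best = max(f_by_candidate, key=f_by_candidate.get)
        if f_by_candidate[best] < f_crit:
            break  # no remaining candidate is significant: stop
        selected.append(best)
    return selected

rounds = [
    ({"x1": 0.621, "x2": 37.991, "x3": 54.395}, 4.26),  # one-variable models
    ({"x1": 1.2812, "x2": 546.2897}, 4.28),             # given x3 in the model
    ({"x1": 2.5415}, 4.30),                             # given x2, x3 in the model
]
print(forward_select(rounds))  # ['x3', 'x2'] -> x1 never enters
```

Unlike stepwise, this loop never revisits a variable once it has entered, which is exactly the difference noted at the start of this method.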

3. The Backward Elimination Method

Unlike the first two methods, the backward elimination method begins with the full
model and then drops variables that are least significant one at a time at each step based on
their partial-F values.

Step 1

This method begins by constructing a regression model with all potential independent variables. Assuming there are k potential independent variables, the model Y = β0 + β1X1 + β2X2 + ... + βkXk is first constructed. The variable with the smallest partial-F value, t-value, or r value that is not significant is identified and dropped from the model.
F(x1 | x2, x3) = MSR(x1 | x2, x3) / MSE(x1, x2, x3)
               = (176732.665 − 176503.756) / 90.070
               = 2.5415

F(x2 | x1, x3) = MSR(x2 | x1, x3) / MSE(x1, x2, x3)
               = (176732.665 − 126888.997) / 90.070
               = 553.388

F(x3 | x1, x2) = MSR(x3 | x1, x2) / MSE(x1, x2, x3)
               = (176732.665 − 109628.322) / 90.070
               = 745.024

The critical value is Fc = F0.05,1,22 = 4.30

Since F(x1 | x2, x3) is the smallest and not significant, x1 is dropped from the model.

(*Note: If all the variables are significant, the procedure stops and the model in this step is
considered as the best model)

Step 2

The backward elimination method will now form a model with the (k – 1) remaining
variables and the variable with the smallest non-significant partial-F value is then dropped.

F(x3 | x2) = MSR(x3 | x2) / MSE(x2, x3)
           = (176503.756 − 109524.089) / 96.106
           = 696.94

F(x2 | x3) = MSR(x2 | x3) / MSE(x2, x3)
           = [SSR(x2, x3) − SSR(x3)] / 1 / MSE(x2, x3)
           = (176503.756 − 124002.038) / 96.106
           = 546.2897

The critical value is Fc = Fα,1,n−k−1 = F0.05,1,23 = 4.28

Since both partial-F values exceed 4.28, neither x2 nor x3 can be dropped from the model.

The best model using the backward elimination method is y = β0 + β1x2 + β2x3.

This process is repeated until no further non-significant independent variables can be found.
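The backward passes above can be sketched the same way, dropping the weakest non-significant variable at each pass. The partial-F values are copied from the worked steps, and the function name is illustrative.

```python
def backward_eliminate(passes):
    """passes: list of (partial_F_by_variable, critical_F), one per pass.

    Each pass drops the variable with the smallest partial-F value if it fails
    the significance test; the loop stops once every remaining variable is
    significant.
    """
    dropped = []
    for partial_f, f_crit in passes:
        weakest = min(partial_f, key=partial_f.get)
        if partial_f[weakest] < f_crit:
            dropped.append(weakest)
        else:
            break  # all remaining variables are significant: stop
    return dropped

passes = [
    ({"x1": 2.5415, "x2": 553.388, "x3": 745.024}, 4.30),  # full model
    ({"x2": 546.2897, "x3": 696.94}, 4.28),                # after dropping x1
]
print(backward_eliminate(passes))  # ['x1'] -> the best model keeps x2 and x3
```

This reproduces the conclusion above: x1 is eliminated in the first pass, and both x2 and x3 survive the second.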
Practice Exercises

1. A real estate agent wanted to predict the selling price of single-storey houses in a new
housing development area in Banting. In order to use a multiple regression model, he
gathered the prices of 15 recently sold houses. He also collected data on three independent
variables which he felt may affect the price of the houses. The variables collected are as
follows:

Y = Price = Selling price of house (RM’000)
X1 = Size = House size (’00 sq. feet)
X2 = Age = Age of house (years)
X3 = Lotsize = Lot size of house (’000 sq. feet)

The estimation results of all possible models are given below.

Model 1
             Sum of Squares   df   Mean Square   F        Sig.
Regression   4034.414         1    4034.414      23.885   .000(a)
Residual     2195.822         13   168.909
Total        6230.236         14
a Predictors: (Constant), Size
b Dependent Variable: Price

Model 2
             Sum of Squares   df   Mean Square   F        Sig.
Regression   1690.364         1    1690.364      4.840    .046(a)
Residual     4539.872         13   349.221
Total        6230.236         14
a Predictors: (Constant), Age
b Dependent Variable: Price

Model 3
             Sum of Squares   df   Mean Square   F        Sig.
Regression   864.026          1    864.026       2.093    .172(a)
Residual     5366.210         13   412.785
Total        6230.236         14
a Predictors: (Constant), Lotsize
b Dependent Variable: Price

Model 4
             Sum of Squares   df   Mean Square   F        Sig.
Regression   4341.373         2    2170.686      13.790   .001(a)
Residual     1888.863         12   157.405
Total        6230.236         14
a Predictors: (Constant), Age, Size
b Dependent Variable: Price

Model 5
             Sum of Squares   df   Mean Square   F        Sig.
Regression   5704.027         2    2852.014      65.039   .000(a)
Residual     526.209          12   43.851
Total        6230.236         14
a Predictors: (Constant), Lotsize, Size
b Dependent Variable: Price

Model 6
             Sum of Squares   df   Mean Square   F        Sig.
Regression   4259.516         2    2129.758      12.968   .001(a)
Residual     1970.720         12   164.227
Total        6230.236         14
a Predictors: (Constant), Lotsize, Age
b Dependent Variable: Price

Model 7
             Sum of Squares   df   Mean Square   F        Sig.
Regression   5707.438         3    1902.479      40.029   .000(a)
Residual     522.798          11   47.527
Total        6230.236         14
a Predictors: (Constant), Lotsize, Size, Age
b Dependent Variable: Price

Using the stepwise selection procedure, determine the best set of independent variables to explain the selling price of single-storey houses in Banting.
2. A real estate company evaluates vacancy rates, square footage, rental rates, and operating expenses for commercial properties in a large metropolitan area in order to provide clients with quantitative information upon which to make rental decisions. The data below are taken from 81 suburban properties that are the newest, best located, most attractive, and most expensive across five geographic areas. Shown here are age (X1), operating expenses (X2), vacancy rates (X3), total square footage (X4), and rental rates (Y). The following table shows the values of SSR and SSE when the listed variables are included in the regression model.

Variables        SSR       SSE
X1               14.819    221.739
X2               40.503    196.054
X3               1.047     235.511
X4               67.775    168.782
X1, X4           110.050   126.508
X2, X3           54.332    182.226
X3, X4           67.905    168.652
X1, X2, X3       137.907   98.650
X1, X2, X3, X4   138.327   98.231

Using the forward selection method, determine the best subset of variables to predict rental
rates of these commercial properties.

3. Three variables are being considered to measure the kidney function of young adults. The data collected for this study are creatinine clearance (y), creatinine concentration (x1), age (x2), and weight (x3). A regression analysis was performed and the following outputs were obtained.

Model 1
             Sum of Squares   df   Mean Square   F        Sig.
Regression   24.920           3    8.307         52.870   .001(a)
Residual     0.628            4    0.157
Total        25.548           7
a Predictors: (Constant), x3, x2, x1
b Dependent Variable: y

Model 2
             Sum of Squares   df   Mean Square   F        Sig.
Regression   24.875           2    12.438        92.303   .000(a)
Residual     0.674            5    0.135
Total        25.548           7
a Predictors: (Constant), x1, x2
b Dependent Variable: y

Model 3
             Sum of Squares   df   Mean Square   F        Sig.
Regression   23.304           2    11.652        25.949   .002(a)
Residual     2.245            5    0.449
Total        25.548           7
a Predictors: (Constant), x1, x3
b Dependent Variable: y

Model 4
             Sum of Squares   df   Mean Square   F        Sig.
Regression   23.365           2    11.683        26.754   .002(a)
Residual     2.183            5    0.437
Total        25.548           7
a Predictors: (Constant), x2, x3
b Dependent Variable: y

Model 5
             Sum of Squares   df   Mean Square   F        Sig.
Regression   22.981           1    22.981        53.695   .000(a)
Residual     2.568            6    0.428
Total        25.548           7
a Predictors: (Constant), x1
b Dependent Variable: y

Model 6
             Sum of Squares   df   Mean Square   F        Sig.
Regression   19.366           1    19.366        18.793   .005(a)
Residual     6.183            6    1.030
Total        25.548           7
a Predictors: (Constant), x2
b Dependent Variable: y

Model 7
             Sum of Squares   df   Mean Square   F        Sig.
Regression   20.917           1    20.917        27.093   .002(a)
Residual     4.632            6    0.772
Total        25.548           7
a Predictors: (Constant), x3
b Dependent Variable: y

Perform the Backward Elimination Process to find the most appropriate regression model.
