ch16 Solutions
Textbook Exercises:
1. Consider the following data for two variables, X and Y.
b. Use the results from part (a) to test for a significant relationship between
X and Y. Use α = 0.05.
c. Develop a scatter diagram for the data. Does the scatter diagram suggest an
estimated regression equation of the form ŷ = b0 + b1x + b2x²? Explain.
e. Refer to part (d). Is the relationship between X, X², and Y significant? Use α = 0.05.
4. The table below lists the number of people (millions) living with HIV globally
(www.avert.org/global-hiv-and-aids-statistics) from 2013 to 2017
Year    Number of people living with HIV (millions)
2013 35.2
2014 35.9
2015 36.7
2016 36.7
2017 36.9
a. Plot the data, letting x = 0 correspond to the year 2013. Find a linear estimated
regression equation ŷ = b0 + b1x that models the data.
b. Plot the function on the graph with the data and determine how well the graph fits the
data.
5. In working further with the problem of exercise 4, statisticians suggested the use of the
following curvilinear estimated regression equation.
a. Use the data of exercise 4 to determine estimated regression equation.
b. Use α = 0.01 to test for a significant relationship.
a. Develop scatter diagrams for these data, treating LifeExp as the dependent variable.
b. Does a simple linear model appear to be appropriate? Explain.
c. Estimate simple regression equations for the data accordingly. Which do you prefer
and why?
a. Develop scatter diagrams for these data with pack and media as potential
independent variables.
b. Does a simple or multiple linear regression model appear to be appropriate?
c. Develop an estimated regression equation for the data you believe will best explain
the relationship between these variables.
8.
In Europe the number of Internet users varies widely from country to country. In 1999, 44.3
per cent of all Swedes used the Internet, while in France the audience was less than 10 per
cent. The disparities are expected to persist even though Internet usage is expected to grow
dramatically over the next several years. The following table shows the percentage of
Internet users in 2011 and in 2018 for selected European countries. (www.internetworldstats.com/)
2011 2018
Austria 74.8 87.9
Belgium 81.4 94.4
Denmark 89.0 96.9
Finland 88.6 94.3
France 77.2 92.6
Germany 82.7 96.2
Ireland 66.8 92.7
Netherlands 89.5 95.9
Norway 97.2 99.2
Spain 65.6 92.6
Sweden 92.9 96.7
Switzerland 84.2 91.0
UK 84.5 94.7
a. Develop a scatter diagram of the data using the 2011 Internet user percentage
as the independent variable. Does a simple linear regression model appear to
be appropriate? Discuss.
b. Develop an estimated multiple regression equation with x = the 2011 Internet
user percentage and x² as the two independent variables.
c. Consider the nonlinear relationship shown by equation (16.6). Use logarithms
to develop an estimated regression equation for this model.
d. Do you prefer the estimated regression equation developed in part (b) or part
(c)? Explain.
For this estimated regression equation SST = 1550 and SSE = 520.
For this estimated regression equation SST = 1550 and SSE = 100.
b. Use an F test and a 0.05 level of significance to determine whether X2 and X3
contribute significantly to the model.
For this estimated regression equation SST = 1805 and SSR = 1760.
12. Failure data obtained in the course of the development of a silver-zinc battery for a NASA
programme were analyzed by Sidik, Leibecki and Bozek in 1980. Relevant variables were as
follows:
Adopting ln(y) as the response variable, a number of regression models were estimated for the
data using MINITAB:
a. Explain this computer output, carrying out any additional tests you think necessary
or appropriate.
b. Is the first model significantly better than the second?
c. Which model do you prefer and why?
13. A section of MINITAB output from an analysis of data relating to truck exhaust
emissions under different atmospheric conditions (Hare and Bradow, 1977) is as follows:
Variables used in this analysis are defined as follows:
Nox Nitrous oxides, NO and NO2, (grams/km)
Humi Humidity (grains H2O/lbm dry air)
Temp Temperature (°F)
HT humi × temp
a. Provide a descriptive summary of this information, carrying out any further calculations or
statistical tests you think relevant or necessary.
b. It has been argued that the inclusion of quadratic terms
HH = humi × humi
TT = temp × temp
on the right-hand side of the model will lead to a significantly improved R-square outcome.
Details of the revised analysis are shown below. Is the claim justified?
14. Brownlee (1965) presents stack loss data for a chemical plant involving 21 observations
on four variables, namely:
Airflow: Flow of cooling air
Temp: Cooling Water Inlet Temperature
Acid: Concentration of acid [per 1000, minus 500]
Loss: Stack loss (the dependent variable) is 10 times the percentage of the ingoing
ammonia to the plant that escapes from the absorption column unabsorbed; that
is, an (inverse) measure of the over-all efficiency of the plant
a. Develop the estimated regression equation using all of the independent variables.
b. Did the estimated regression equation developed in part (a) provide a good fit?
Explain.
c. Develop a scatter diagram showing Delay as a function of Finished. What does this
scatter diagram indicate about the relationship between Delay and Finished?
d. On the basis of your observations about the relationship between Delay and Finished,
develop an alternative estimated regression equation to the one developed in (a) to
explain as much of the variability in Delay as possible.
16. In a study of car ownership in 24 countries, data (OECD, 1982) have been collected on the
following variables:
Selective results from a linear modelling analysis (ao is the dependent variable) are as
follows:
a. Which of the various model options considered here do you prefer and why?
b. Corresponding stepwise output from MINITAB terminates after two stages, gdp
being the first independent variable selected and pr the second. How does this latest
information reconcile with that summarized earlier?
c. Does it alter in any way your inferences for (a)? If so, why, and if not, why not?
17. In a regression analysis of data from a cloud-seeding experiment (Hand et al, 1994)
relevant variables are defined thus:
x1 = 1, seeding or 0, no seeding
x2 = number of days since the experiment began
x3 = seeding suitability factor
x4 = percent cloud cover
x5 = total rainfall on target area one hour before seeding
x6 = 1, moving radar echo or 2, a stationary radar echo
y = amount of rain (cubic metres × 10^7) that fell in target area for a 6 hour
period on each day seeding was suitable
Sample correlations (p-values in parentheses):
x2 vs x1: 0.030 (0.888)
x3 vs x1: 0.177 (0.408)
x3 vs x2: 0.451 (0.027)
MODEL 1
Method
Analysis of Variance
Model Summary
Coefficients
Regression Equation
x1 x6
0 1 y = 6.82 - 0.0321 x2 - 0.911 x3 + 0.006 x4 + 1.84 x5
R Large residual
Durbin-Watson Statistic
MODEL 2
Regression Analysis: y versus x2, x3, x4, x5, x1x5, x1x3, x1x4, x1, x6, x1x6
Method
Analysis of Variance
Model Summary
Regression Equation
x1 x6
0 1 y = -0.14 - 0.0447 x2 + 0.373 x3 + 0.402 x4 + 3.84 x5 - 2.21 x1x5 - 3.19 x1x3
- 0.501 x1x4
R Large residual
Analysis of Variance
SOURCE DF SS MS F p
Regression 1 362.13 362.13 6.85 0.059
Residual Error 4 211.37 52.84
Total 5 573.50
b. Since the p-value corresponding to F = 6.85 is 0.059 > α = 0.05, the relationship
is not significant.
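As an illustrative aside (a sketch, not part of the original solution), the F ratio in the ANOVA table can be reproduced from the sums of squares alone:

```python
# Minimal sketch: overall F statistic for a regression with k predictors,
# F = MSR / MSE with numerator df = k and denominator df = n - k - 1.

def anova_f(ssr: float, sse: float, k: int, n: int) -> float:
    msr = ssr / k            # mean square due to regression
    mse = sse / (n - k - 1)  # mean square error
    return msr / mse

# Values from the ANOVA table above: SSR = 362.13, SSE = 211.37, k = 1, n = 6
print(round(anova_f(362.13, 211.37, 1, 6), 2))  # 6.85
```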
c.
-
40+ *
-
Y - * *
- *
-
30+
-
-
-
- *
20+
-
-
-
- *
10+
------+---------+---------+---------+---------+---------+X
20.0 25.0 30.0 35.0 40.0 45.0
Analysis of Variance
SOURCE DF SS MS F p
Regression 2 541.85 270.92 25.68 0.013
Residual Error 3 31.65 10.55
Total 5 573.50
Analysis of Variance
SOURCE DF SS MS F p
Regression 1 59.39 59.39 4.76 0.117
Residual Error 3 37.41 12.47
Total 4 96.80
The high p-value (0.117) indicates a weak relationship; note that 61.4% of the
variability in y has been explained by x.
Analysis of Variance
SOURCE DF SS MS F p
Regression 2 93.529 46.765 28.60 0.034
Residual Error 2 3.271 1.635
Total 4 96.800
Analysis of Variance
SOURCE DF SS MS F p
Regression 1 18.461 18.461 4.37 0.075
Residual Error 7 29.539 4.220
Total 8 48.000
-
- *
-
1.2+ *
-
-
- *
- * *
0.0+
-
- * *
-
-
-1.2+
- * *
-
-
+---------+---------+---------+---------+---------+------YHAT
3.0 4.0 5.0 6.0 7.0 8.0
Analysis of Variance
SOURCE DF SS MS F p
Regression 1 0.010501 0.010501 4.19 0.080
Residual Error 7 0.017563 0.002509
Total 8 0.028064
- *
-
-
-
1.0+ *
- *
-
-
- *
0.0+ *
-
-
- * *
-
-1.0+
- * *
-
--+---------+---------+---------+---------+---------+----YHAT
0.140 0.160 0.180 0.200 0.220 0.240
4. a./b.
The proposed linear function looks a fairly good fit from the plot above. The
high R2 of 86.13% appears to corroborate this viewpoint.
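As a check (a sketch, not the original MINITAB run), the straight-line fit and the R-square of 86.13% can be reproduced from the five data points with plain least squares:

```python
# Least-squares straight line for the HIV data of exercise 4 (x = 0 for 2013).

def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx                 # slope
    b0 = my - b1 * mx              # intercept
    sst = sum((b - my) ** 2 for b in y)
    ssr = b1 * sxy                 # regression sum of squares for a line
    return b0, b1, ssr / sst       # intercept, slope, R-square

x = [0, 1, 2, 3, 4]
y = [35.2, 35.9, 36.7, 36.7, 36.9]  # millions living with HIV, 2013-2017
b0, b1, r2 = fit_line(x, y)
print(round(b0, 2), round(b1, 2), round(100 * r2, 2))  # 35.44 0.42 86.13
```

So the fitted trend is ŷ = 35.44 + 0.42x, with 86.13% of the variation explained.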
Analysis of Variance
SOURCE DF SS MS F p
Regression 2 36643 18322 73.15 0.003
Residual Error 3 751 250
Total 5 37395
a. Since the linear relationship was significant (Exercise 4), this relationship must
be significant. Note also that since the p-value of 0.003 < α = 0.05, we can
reject H0.
b. The fitted value is 1302.01, with a standard deviation of 9.93. The 95%
confidence interval is 1270.41 to 1333.61; the 95% prediction interval is
1242.55 to 1361.47.
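The interval arithmetic behind these figures can be sketched as follows (assuming the usual formulas, an MSE of 751/3 from the ANOVA table above, and the two-sided 5% t value of 3.182 on 3 df; small rounding differences from the quoted figures are expected):

```python
# Sketch: 95% confidence and prediction intervals from the fitted value,
# its standard error, and the residual mean square (assumed MSE = 751/3).
import math

t = 3.182                      # t(0.025) with 3 residual df
fit, se_fit, mse = 1302.01, 9.93, 751 / 3

ci = (fit - t * se_fit, fit + t * se_fit)      # CI for the mean response
se_pred = math.sqrt(se_fit ** 2 + mse)         # prediction standard error
pi = (fit - t * se_pred, fit + t * se_pred)    # PI for an individual value

print([round(v, 1) for v in ci])  # ~ [1270.4, 1333.6]
print([round(v, 1) for v in pi])  # ~ [1242.6, 1361.4]
```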
[Scatter diagrams (omitted): LifeExp plotted in turn against People.per.TV, People.per.Dr, LifeExp.Male and LifeExp.Female.]
The latter equation looks marginally better from an R square point of view.
However, neither model looks particularly impressive against their respective
scattergrams shown earlier.
b. Yes, the scatter diagrams in (a) suggest that a regression model is likely to hold.
c. From the selective SPSS (stepwise regression) output below, a number of
significant regression models can be found for the data:
b.
Regression Analysis: y versus x, x^2
Analysis of Variance
Model Summary
Regression Equation
R Large residual
Though the overall regression is significant according to the F value in the ANOVA table, the
coefficients for x and x² in the preceding output are not significant (respective
p-values of 0.133 and 0.093 are both > α = 0.05).
c.
Regression Analysis: lny versus x
Analysis of Variance
Coefficients
Regression Equation
R Large residual
From the above, the regression of ln(y) on x yields a significant fit (p-value < 0.05)
for both regression coefficients.
d. The estimated regression equation in part (c) is preferred even though it has a lower
R-square of 47.82%, because the multiple regression model in (b) seems to be suffering
from multicollinearity.
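The log transformation used in part (c) can be illustrated with a small hypothetical dataset (not the exercise values): an exponential model E(y) = exp(b0 + b1x) becomes a straight line after taking logs, so ordinary least squares on ln(y) recovers the parameters.

```python
# Hypothetical illustration: fit ln(y) = b0 + b1*x by least squares and
# recover the parameters of exactly exponential data.
import math

x = [1, 2, 3, 4, 5]
y = [math.exp(0.5 + 0.3 * a) for a in x]    # exact exponential data

ly = [math.log(v) for v in y]               # transform to the log scale
n = len(x)
mx, my = sum(x) / n, sum(ly) / n
b1 = sum((a - mx) * (l - my) for a, l in zip(x, ly)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx
print(round(b0, 3), round(b1, 3))  # 0.5 0.3
```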
Since 49.52 > 4.24, we reject H0 and conclude that x1 is significant.
b.
Since 48.3 > 3.42, the addition of variables x2 and x3 is statistically significant.
F = 440/1.8 = 244.44
Since 244.44 > 2.76, variables x1 and x4 contribute significantly to the model.
d.
Since 15.28 > 3.39 we conclude that x1 and x3 contribute significantly to the
model.
Analysis of Variance
Source DF SS MS F P
Regression 2 96157 48079 270.41 0.000
Residual Error 21 3734 178
Total 23 99891
Source DF Seq SS
W 1 94659
M 1 1499
Unusual Observations
Ominously, the VIFs above are both greater than 10, indicating a potential
multicollinearity problem. The corresponding correlation results below support this
assessment.
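As a quick numerical aside (a sketch, not the original output): with just two predictors, each VIF reduces to 1/(1 − r²) for their sample correlation r, so a correlation of about 0.95 is already enough to breach the usual threshold of 10.

```python
# Two-predictor case: VIF = 1 / (1 - r^2), where r is the correlation
# between the two predictors.

def vif_two_predictors(r: float) -> float:
    return 1.0 / (1.0 - r * r)

print(round(vif_two_predictors(0.95), 1))  # 10.3 -- past the warning level of 10
```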
b. The Minitab output is shown below:
Analysis of Variance
Source DF SS MS F P
Regression 3 96193 32064 173.43 0.000
Residual Error 20 3698 185
Total 23 99891
Source DF Seq SS
W 1 94659
M 1 1499
M*W 1 36
Unusual Observations
Step 1 2
Constant 126.6 126.6
W 3.92 2.21
T-Value 19.95 3.61
P-Value 0.000 0.002
M 0.030
T-Value 2.90
P-Value 0.009
S 15.4 13.3
R-Sq 94.76 96.26
R-Sq(adj) 94.52 95.91
Mallows Cp 8.3 2.2
Because the interaction term M*W has not been loaded into the model in the
latter Stepwise Regression we deduce it does not contribute significantly to the
model.
12. For the first model featuring the five predictors x1, x2, x3, x4 and x5, the significant F
ratio from the ANOVA table (p-value = 0.005 < α = 0.05) suggests that the overall model is
a significant fit to the data. Yet none of the individual t tests associated with the
regression slopes is significant except that for x4 (p-value = 0.005 < α = 0.05).
From the VIFs, which are all close to 1, multicollinearity would not appear to be a
problem for the data. The R-square of 66.3% indicates that the multiple regression model
explains 66.3% of the variation in the response variable, which might be regarded as quite
favourable. On the downside, the model suffers from a single outlier according to
MINITAB, but for a sample of size 20 this does not seem unreasonable. The Durbin-
Watson statistic is 1.72, but for a two-sided Durbin-Watson test the relevant dL and dU
values (based on n = 20 and k = 5 predictors) are 0.70 and 1.87. As dL < 1.72 < dU, we
deduce the test is inconclusive.
The second model is a simple regression with just x4 as the predictor. The model is
significant according to both the overall F test and the specific t test associated with the
regression slope for x4. As would be expected, the R-square value has dropped, in this
case to 51.7%. Again, there is an outlier (observation 12 now instead of
observation 1 previously), but with a standardized residual of -2.07 this does
not look too serious.
To check whether the earlier five-predictor model is a significant improvement on this one-predictor
model, a partial F test can be undertaken. The relevant calculation using equation (16.11) is
as follows (note that p = 5, q = 1, n = 20):

F = [(23.002 - 16.032)/(5 - 1)] / [16.032/14] = 1.7425/1.1451 = 1.52

As 1.52 is below the 5% critical value of the F(4, 14) distribution (approximately 3.11),
the five-predictor model is not a significant improvement on the one-predictor model.
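The partial F calculation can be written as a small reusable sketch (using the SSE figures quoted above for the full five-predictor and the reduced one-predictor models):

```python
# Partial F test: F = [(SSE_reduced - SSE_full)/(p - q)] / [SSE_full/(n - p - 1)].

def partial_f(sse_reduced: float, sse_full: float, p: int, q: int, n: int) -> float:
    return ((sse_reduced - sse_full) / (p - q)) / (sse_full / (n - p - 1))

# Exercise 12: SSE = 23.002 with x4 only, SSE = 16.032 with all five predictors
print(round(partial_f(23.002, 16.032, p=5, q=1, n=20), 2))  # 1.52
```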
13. a. For the first model, the F ratio from the ANOVA table (p-value = 0.000 < α = 0.05) is
highly significant, which suggests the overall model offers a significant fit to the data.
Ignoring the constant, t tests for the regression slopes corresponding to the humi, temp
and HT variables are all significant (each has a p-value < 5%). The R-square (coefficient of
determination) of 71.5% is favourable and suggests the multiple regression model
explains the variation in the response quite well. There is one outlier but, given the
sample size is 44, this does not seem to be especially problematic. Observation 6 is
categorized as influential and this should be investigated further. The Durbin-Watson
statistic is 1.63, and for a two-sided Durbin-Watson test the relevant dL and dU values
(based on n = 44 and k = 3 predictors) are 1.29 and 1.58. As 1.58 = dU < 1.63 < 4 - dU
= 2.42, we deduce no evidence of first-order serial correlation of residuals is present.
b. Again according to the F ratio details provided, the second model too is significant.
However from corresponding t tests only the slopes for humi and HH can be considered
significantly different from zero. In this case however the R square is an impressive
79.7%. There are two outliers and one (different) influential observation with this model.
The outliers do not look serious but as before the influential observation should be
investigated. The Durbin-Watson statistic is 1.78. For n = 44 and k = 5, dL and dU are 1.29
and 1.58 respectively. As 1.58 = dU <1.78 < 4 - dU = 2.42 we deduce there is no problem
with residuals suffering from first order serial correlation.
To check whether this five-predictor model is a significant improvement on the earlier three-predictor
model, a partial F test can be undertaken. The relevant calculation using equation (16.11) is
as follows (p = 5, q = 3, n = 44):

F = [(0.14166 - 0.100887)/(5 - 3)] / [0.100887/38] = 0.020387/0.0026549 = 7.68

Since 7.68 exceeds the 5% critical value of the F(2, 38) distribution (about 3.24), the
quadratic terms do significantly improve the model, so the claim appears justified.
15. a. From the correlation matrix provided, the sales response is significantly correlated
with all of the three predictors listed. However, the attract variable is also significantly
correlated with that for airplay suggesting potential problems of multicollinearity if both
variables are fitted together in a linear model.
Regression Analysis: Delay versus Industry, Public, Quality, Finished
Method
Analysis of Variance
Model Summary
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 83.98 3.83 21.91 0.000
Industry
1 9.94 2.95 3.36 0.002 1.17
Public
1 1.85 3.46 0.54 0.596 1.27
Quality
2 -3.70 4.21 -0.88 0.387 1.73
3 -16.70 4.28 -3.90 0.000 1.61
4 -12.96 4.53 -2.86 0.008 1.59
5 -8.63 3.89 -2.22 0.034 1.73
Finished
2 -16.72 3.44 -4.86 0.000 1.79
3 -20.58 4.28 -4.81 0.000 1.79
4 -9.82 4.73 -2.08 0.047 1.49
Regression Equation
Delay = 83.98 + 0.0 Industry_0 + 9.94 Industry_1 + 0.0 Public_0 + 1.85 Public_1
+ 0.0 Quality_1 - 3.70 Quality_2 - 16.70 Quality_3 - 12.96 Quality_4 - 8.63 Quality_5
+ 0.0 Finished_1 - 16.72 Finished_2 - 20.58 Finished_3 - 9.82 Finished_4
R Large residual
b. Yes, though R-sq(pred) suggests the model might not be as good at predicting new
observations as it is at fitting the existing data (over-fitting may be an issue).
c.
[Scatter diagram (omitted): Delay (vertical axis, roughly 40 to 90) against Finished (horizontal axis, 1.0 to 4.0).]
d. The relationship between Delay and Finished looks more quadratic than linear, so,
adjusting our model in (a) to allow for an additional term Finished_squared, we obtain the
new regression analysis based on the stepwise method:
Method
Analysis of Variance
Model Summary
Coefficients
Regression Equation
Delay = 84.13 + 0.0 Industry_0 + 9.84 Industry_1 + 0.0 Quality_1 - 3.88 Quality_2
- 16.32 Quality_3 - 12.60 Quality_4 - 8.54 Quality_5 + 0.0 Finished_1
- 16.45 Finished_2 - 19.91 Finished_3 - 10.05 Finished_4
Fits and Diagnostics for Unusual Observations
R Large residual
This looks a marginally better model than that in a. but ironically does not include a term
in Finished_squared. The non-significant variable Public has also been dropped from the
earlier model.
16.
a. From the best subsets regression summary, the 5 predictor model with an R Square of
86.2% is almost as good on all measures as the full 6 predictor model represented by the
bottom line of the table. The same five predictor model is described in detail after the
correlation matrix and can be seen from the ANOVA F statistic to be significant overall.
Corresponding t statistics are also significant (though technically the p-value (of 0.054)
associated with the regression slope for the pop variable is just slightly above the test size of
5%).
b. Clearly multicollinearity is a problem here. This is informed by significant correlations
between predictor variables e.g. pr and con. Also, the sign of the coefficient of the con
predictor in the detailed regression output is opposite to that of the corresponding correlation
between con and ao.
c. Yes. In these circumstances the two predictor model from Stepwise now looks technically
more appealing.
17 a. Yes, the test for improvement yields a significant result:

F(3, 14) = [(136.751 - 63.409)/(17 - 14)] / [63.409/14] = 5.40 > 3.34 = 5% critical value
for the F(3, 14) distribution.
18 a.
Regression Analysis: Delay versus Industry, Quality
Method
Analysis of Variance
Model Summary
Coefficients
Delay = 73.33 + 0.0 Industry_0 + 11.41 Industry_1 + 0.0 Quality_1 - 9.51 Quality_2
- 18.93 Quality_3 - 15.83 Quality_4 - 11.85 Quality_5
R Large residual
b.
[Residual plots (omitted): residuals versus fitted values, histogram of the residuals, and residuals versus observation order.]
See bottom right hand residual plot in particular. Yes, this does seem to suggest possible
first order serial correlation is present.
c. Following on from b. the three predictor model based on Retired, Unemployment and
Total Staff would be preferred.
d. Durbin-Watson statistic = 1.66441.
For n = 40, k = 2 and a 5% significance level, dL = 1.39 and dU = 1.60, and we have d = 1.66441 >
dU = 1.60. So we deduce from Figure 14.17 that there is no evidence of positive autocorrelation.
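The Durbin-Watson statistic and the one-sided decision rule used above can be sketched as follows (the residual series here is a made-up illustration, not the exercise data):

```python
# d = sum of squared successive residual differences / sum of squared residuals.

def durbin_watson(e):
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(r ** 2 for r in e)

def positive_autocorrelation(d, dl, du):
    if d < dl:
        return "significant positive autocorrelation"
    if d > du:
        return "no significant positive autocorrelation"
    return "inconclusive"

print(durbin_watson([1.0, -1.0, 1.0, -1.0]))          # 3.0 for alternating signs
print(positive_autocorrelation(1.66441, 1.39, 1.60))  # no significant positive autocorrelation
```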
Chapter 16: Regression Analysis: Model Building
Supplementary Exercises:
19. A study investigated the relationship between audit delay (Delay), the length of time from
a company’s fiscal year-end to the date of the auditor's report, and variables that describe
the client and the auditor. Some of the independent variables that were included in this
study follow.
Industry A dummy variable coded 1 if the firm was an industrial company or 0 if the
firm was a bank, savings and loan, or insurance company.
a. Develop the estimated regression equation using all of the independent variables
b. How well does the estimated regression equation developed in part (a) represent the
data?
c. Develop a scatter diagram showing Delay as a function of Finished. What does this
scatter diagram indicate about the relationship between Delay and Finished?
d. On the basis of your observations about the relationship between Delay and Finished,
develop an alternative estimated regression equation to the one developed in (a) to
explain as much of the variability in Delay as possible.
20. Annual data published by Conrad (1989) over a 21 year period features the following
variables:
Analysis of Variance
Source DF SS MS F P
Regression 5 0.011700 0.002340 1.60 0.219
Residual Error 15 0.021893 0.001460
Total 20 0.033593
Source DF Seq SS
HSA 1 0.004986
HSB 1 0.000710
HSC 1 0.005775
lnX1 1 0.000229
lnX2 1 0.000000
Unusual Observations
Step 1 2
Constant 5.556 5.551
HSA 0.046
T-Value 1.75
P-Value 0.096
S 0.0372 0.0353
R-Sq 21.81 33.23
R-Sq(adj) 17.70 25.82
Mallows C-p 1.0 0.4
Best Subsets Regression: lnY versus HSA, HSB, HSC, lnX1, lnX2
Response is lnY
Vars  R-Sq  R-Sq(adj)  Mallows C-p  S        (X marks the predictors included: HSA, HSB, HSC, lnX1, lnX2)
1 21.8 17.7 1.0 0.037181 X
1 14.8 10.4 2.6 0.038802 X
2 33.2 25.8 0.4 0.035299 X X
2 22.8 14.2 2.8 0.037965 X X
3 34.1 22.5 2.2 0.036073 X X X
3 33.7 22.0 2.3 0.036200 X X X
4 34.8 18.5 4.0 0.036991 X X X X
4 34.2 17.7 4.1 0.037173 X X X X
5 34.8 13.1 6.0 0.038204 X X X X X
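One of the Mallows C-p entries above can be verified by hand; the sketch below assumes the standard formula C-p = SSE_p/MSE_full − (n − 2(p + 1)), with SST = 0.033593 and the full-model MSE = 0.001460 taken from the ANOVA table earlier, and n = 21.

```python
# Mallows C-p for a p-predictor subset model.

def mallows_cp(sse_p: float, mse_full: float, n: int, p: int) -> float:
    return sse_p / mse_full - (n - 2 * (p + 1))

sst = 0.033593
sse_hsa = sst * (1 - 0.218)   # one-predictor HSA model: R-sq = 21.8%
print(round(mallows_cp(sse_hsa, 0.001460, n=21, p=1), 1))  # 1.0, matching the table
```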
a. Comment on the effectiveness of the various models here carrying out any statistical
tests or additional analysis you think appropriate.
b. How would you advise the Tobacco Research Council who sourced the data?
21. Refer to the data in exercise 19. Consider a model in which only Industry is used to
predict Delay. At a 0.01 level of significance, test for any positive autocorrelation in the
data.
b. Plot the residuals obtained from the estimated regression equation developed in part (a)
as a function of the order in which the data are presented. Does any autocorrelation
appear to be present in the data? Explain.
c. At the 0.05 level of significance, test for any positive autocorrelation in the data.
23. A regression analysis of heart disease by country (Cooper & Weekes, 1983) is based on the
following variables:
sug sugar consumption
tdp total dairy products consumption
agemp percentage employment in agriculture, fishing and forestry
ihdmr ischaemic heart disease mortality rate (RESPONSE variable)
MODEL 1
Regression Analysis
Analysis of Variance
Source DF SS MS F P
Regression 3 123022 41007 12.03 0.000
Residual Error 17 57950 3409
Total 20 180972
Source DF Seq SS
sug 1 114848
tdp 1 4806
agemp 1 3369
Unusual Observations
Obs sug ihdmr Fit SE Fit Residual St Resid
6 115 240.9 288.2 46.3 -47.3 -1.33 X
19 117 105.3 227.2 15.5 -121.9 -2.17R
MODEL 2
Regression Analysis
Analysis of Variance
Source DF SS MS F P
Regression 1 114848 114848 33.00 0.000
Residual Error 19 66124 3480
Total 20 180972
Unusual Observations
Obs sug ihdmr Fit SE Fit Residual St Resid
19 117 105.3 234.3 14.7 -129.0 -2.26R
Explain this computer output, carrying out any additional tests you think necessary or
appropriate. Is the first model significantly better than the second? Which model do you
prefer and why?
24. A regression analysis of UK imports (Barrow, 2001) is based on the following variables:
Analysis of Variance
Source DF SS MS F P
Regression 2 1.35348 0.67674 403.88 0.000
Residual Error 18 0.03016 0.00168
Total 20 1.38364
Source DF Seq SS
lngdp 1 1.34478
lnprice 1 0.00870
Unusual Observations
Obs lngdp lnimport Fit SE Fit Residual St Resid
2 5.55 4.32281 4.22598 0.01726 0.09683 2.61R
laglnimp 0.408
T-Value 4.55
P-Value 0.000
S 0.0430 0.0297
R-Sq 97.50 98.87
R-Sq(adj) 97.36 98.74
C-p 19.6 2.1
a. Explain this computer output, carrying out any additional tests you think necessary or
appropriate.
b. Which of the various models shown do you prefer and why?
25. Tony runs a used car business. He would like to predict monthly sales. Tony believes that
sales, y (in €’0,000s) is directly related to the number of sales-people employed (x 1) and the
average number of cars on the lot for sale (x2). The following data were collected over a
period of 10 months:
y x1 x2
5.8 4 20
8.1 4 25
7.5 5 15
13.3 8 30
11.4 7 25
15.0 9 35
7.0 3 17
8.3 5 20
5.1 2 18
6.8 4 23
d. Plot the residuals against ŷ, x1 and x2 and comment on the validity of the theoretical model.
26. For the data in Exercise 25, use MINITAB to carry out a
Stepwise and
best subsets analysis.
27. The CEO of a computer firm is interested in funding research proposals by graduate students
who wish to perform experiments in the firm's advanced technology laboratory during the
summer months. The CEO receives 18 proposals and sends these proposals to the director
of the laboratory for evaluation. The director rates the proposals on two different criteria and
gives a score between zero and ten for each criterion, with 10 representing the best score
possible. (The variables x1 and x2 represent these two scores. The variable y (in €000s) is
the level of funding that the CEO grants for the proposal.) The collected data are given
below:
y x1 x2
9.5 8.7 9.2
7.3 8.1 8.0
6.5 7.4 7.7
8.4 8.4 8.6
8.0 8.3 8.0
6.1 7.0 7.3
8.5 8.6 8.8
7.2 8.3 7.8
5.8 6.7 7.0
6.3 7.3 7.5
9.0 8.6 9.0
6.4 7.7 7.5
7.0 7.9 7.9
7.4 8.2 8.0
8.3 8.5 8.4
8.2 8.6 7.9
5.3 6.6 6.9
6.7 7.8 7.5
The director tries to work out what the CEO will grant, given how he scores a proposal.
a. Find a 90% confidence interval for the mean value of y at x1 = 8.0 and
x2 = 7.8.
b. Find a 90% confidence interval for the value of Y at x1 = 8.0 and x2 = 7.8.
c. Interpret each of these confidence intervals: what is the difference between them?
28. For the data in Exercise 27, use MINITAB to carry out a
Stepwise and
best subsets analysis.
29. Consider the following dataset for 12 growth-orientated companies. Y represents the growth
rate of a company for the current year. X1 represents the growth rate of the company for the
previous year. X2 represents the percentage of the market that does not use the company's
product or a similar product, and X3 represents the current growth rate for the industry sector
to which the company belongs. (All values are percentages.)
Y X1 X2 X3
20 10 30 2.8
30 15 60 3.4
24 12 35 5.6
36 42 38 2.8
18 15 25 10.1
47 45 40 6.2
33 30 40 2.8
35 32 32 7.9
27 19 32 3.4
28 24 31 10.1
20 24 20 7.9
32 20 50 2.8
a. Using MINITAB, derive the sample correlations for the variables and estimate the
regression equation of Y on X1, X2 and X3. Test the significance of X1, X2 and X3 in the
model. What do you deduce?
b. Perform a stepwise analysis of the data using the backward elimination procedure.
Comment on the results obtained and compare these with the outputs from (a). Are they
consistent?
30. Chatterjee and Price (1977) present attitude data for clerical staff towards their supervisors
within a large commercial organization. Details of the variables involved in the study and
of the predictive model obtained using the MINITAB package, are as follows:
Note that the figures in brackets here are the standard errors of the associated estimated
regression coefficients. Note also that the total (corrected) sum of squares for y is 4296.97
and the sample size is 30.
The sample correlations for the data are as follows:
y cmplain prvileg learn rises critcal
cmplain 0.825
prvileg 0.426 0.558
learn 0.624 0.597 0.493
rises 0.590 0.669 0.445 0.640
critcal 0.156 0.188 0.147 0.116 0.377
advance 0.155 0.225 0.343 0.532 0.574 0.283
Results from running the MINITAB’s best subsets procedure for the data are given below:
a. Given the evidence provided here and making any additional calculations and / or
statistical tests you think necessary, how would you interpret this information?
b. What is your overall view of the model's effectiveness?
31. Pre-employment tests are widely used in many large corporations as an approach for
estimating likely job performance. In a published study, separate regression analyses (see
MODEL 2 below) were conducted for white and minority sections of a recruitment
sample. The results given contrast with those from a pooled analysis of the entire sample
(MODEL 1):
jperf : Job Performance
test : Pre-employment test
race : 1 if a minority applicant, 0 if a white applicant
racetest : race X test
MODEL 1
jperf = 1.03 + 2.36 test
(0.868) (0.538)
ANOVA
SOURCE df SS MS F
Regression 1 48.723 48.723 19.25
Error 18 45.568 2.532
Total 19 94.291
Note that figures in brackets above denote the standard errors of corresponding regression slope
estimates. Corresponding to the 'test' variable taking the value 2.5, we have the predicted jperf value,
confidence and prediction intervals as follows:-
Error SS = 31.655
For this alternative formulation the predicted value of jperf, corresponding to the value of 2.5 for
'test', confidence and prediction intervals shown separately for white and minority employees are
as follows:-
a. Interpret these results, carrying out any additional calculations, tests etc. you think necessary.
b. Is MODEL 2 significantly better than MODEL 1? Depending on your answer here, what
would you say this signifies in terms of the two groups?
32. Data relating to import activity in the French economy have been analysed by Malinvaud
(1966). Details of a multiple regression model developed from these data appear below:
import : Imports
doprod : Domestic Production
stock : Stock Formation
consum : Domestic Consumption
The sample correlations for these data are as follows:
import doprod stock
import
doprod 0.984
stock 0.266 0.215
consum 0.985 0.999 0.214
(Note the figures in brackets here are the standard errors of the corresponding regression slope
estimates.)
a. Interpret these results, carrying out any additional calculations, tests etc. you think
necessary.
The VIF values here reveal major problems with multicollinearity. Thus, estimated
coefficients in the regression model as well as corresponding t tests are likely to be very
dubious.
b. What is your overall view of the model as a technology for predicting French Imports?
What improvements (if any) are necessary, in your opinion, before implementation of
the model is finally considered?
33. A regression analysis of data from a cloud-seeding experiment (Woodley et al; 1977)
yields the following results:-
MODEL 1
MODEL 2
Analysis of Variance
SOURCE DF SS MS F p
Regression 4 2587.7 646.9 5.42 0.002
Residual Error 35 4176.3 119.3
Total 39 6764.0
b. The low value of the adjusted coefficient of determination (31.2%) does not
indicate a good fit.
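As a check (a sketch, not part of the original solution), the adjusted coefficient of determination quoted in part (b) follows directly from the ANOVA table above:

```python
# Adjusted R-square: 1 - (SSE/SST) * ((n - 1)/(n - k - 1)).

def adj_r_squared(sse: float, sst: float, n: int, k: int) -> float:
    return 1 - (sse / sst) * ((n - 1) / (n - k - 1))

# Four predictors; total df = 39, so n = 40
print(round(100 * adj_r_squared(4176.3, 6764.0, n=40, k=4), 1))  # 31.2
```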
96+
- * *
AUDELAY - 3
- * *
- *
80+ * 2 *
- * *
- 3 *
- 3 * 2
- 2 *
64+ *
- * * * *
- *
- 3
- *
48+ *
- 2
--+---------+---------+---------+---------+---------+----INTFIN
0.0 1.0 2.0 3.0 4.0 5.0
d. The output from the stepwise procedure is shown below, where INTFINSQ is
the square of INTFIN.
Step 1 2
Constant 112.4 112.8
PUBLIC -1.0
T-Value -0.29
P-Value 0.775
S 9.01 8.90
R-Sq 59.15 59.05
R-Sq(adj) 53.14 54.37
C-p 6.0 4.1
20. a. The results from the Stepwise procedure indicate that lnY can be significantly
explained in terms of the dummy variables HSC and HSA. At the same time, the R 2
and adj R2 values for this model (33.23%, 25.82% respectively) are not particularly
high. The same model features in the Best Subsets output (it corresponds with the
first of the two-predictor alternatives in the list) and technically appears to
have the edge on its eight competitors. However, one practical problem with the HSA
variable is that the sign of the estimated regression coefficient is positive, suggesting
that the health scare in year 8 actually resulted in a growth rather than a decline in
tobacco consumption.
b. From the various comments in (a), the linear formulation adopted for analysing the data
does not seem to have been helpful or productive. The absence of lnX1 or lnX2 as
predictors in any of the models is a particular indictment, so much so that one wonders
why this approach was ever investigated in the first place.
Analysis of Variance
SOURCE DF SS MS F p
Regression 1 1076.1 1076.1 7.19 0.011
Residual Error 38 5687.9 149.7
Total 39 6764.0
Unusual Observations
Obs. INDUS AUDELAY Fit Stdev.Fit Residual St.Resid
5 0.00 91.00 63.00 3.39 28.00 2.38R
38 1.00 46.00 74.07 2.35 -28.07 -2.34R
At the 0.05 level of significance, dL = 1.44 and dU = 1.54. Since d = 1.55 > dU,
there is no significant positive autocorrelation.
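The Durbin-Watson decision rule being applied here can be sketched as a small helper; the function name and its return strings are illustrative only:

```python
def dw_positive_autocorr(d, d_lower, d_upper):
    """One-sided Durbin-Watson test for positive autocorrelation:
    d < dL -> significant, d > dU -> not significant, otherwise inconclusive."""
    if d < d_lower:
        return "significant positive autocorrelation"
    if d > d_upper:
        return "no significant positive autocorrelation"
    return "inconclusive"

# The case above: d = 1.55 with dL = 1.44, dU = 1.54 (n = 40, one predictor).
print(dw_positive_autocorr(1.55, 1.44, 1.54))  # no significant positive autocorrelation
```

The same rule covers the later Durbin-Watson results in this chapter; only d and the tabled dL, dU change.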
Analysis of Variance
SOURCE DF SS MS F p
Regression 2 1818.6 909.3 6.80 0.003
Residual Error 37 4945.4 133.7
Total 39 6764.0
SOURCE DF SEQ SS
INDUS 1 1076.1
ICQUAL 1 742.4
Unusual Observations
Obs. INDUS AUDELAY Fit Stdev.Fit Residual St.Resid
5 0.00 91.00 67.71 3.78 23.29 2.13R
38 1.00 46.00 71.70 2.44 -25.70 -2.27R
b. The residual plot as a function of the order in which the data are presented is
shown below:
[Character plot (MINITAB): residuals RESID (vertical axis, roughly -17.5 to +17.5) against the order of the data (horizontal axis, 0 to 40); digits identify overlapping observations.]
c. At the .05 level of significance with two independent variables, dL = 1.39 and dU = 1.60.
Since dL < d = 1.43 < dU, the test for positive autocorrelation is inconclusive.
23.
MODEL 1
This is a particularly flawed model. None of the predictors here is significant
according to its individual p-value, yet the F statistic has a p-value of 0.000 < α
= 0.05, indicating that the model, overall, is significant. Because of this there is a
strong suspicion that multicollinearity is present. The R2 of 68% for the model is
relatively good, and there is a single outlying residual as well as an influential
observation. The outlier does not look serious because of its standardized residual
value, but observation 6’s influence needs to be carefully checked out.
MODEL 2
This is a much simpler model which not surprisingly overcomes many of the
problems of MODEL 1. The single predictor used in the model is significant.
Observation 6 is no longer influential. Observation 19 is still associated with an
outlying residual but this is hardly any worse than before.
Model DF Error SS
1 17 57950
2 19 66124
24. a. The model is significant overall, with all its predictors significant also. This is borne
out by the p-values for the F and t statistics, which are < α = 0.05 without exception. A
two-sided Durbin-Watson test (α = 0.05) yields an inconclusive result since dL = 1.01
< d = 1.09 < dU = 1.41. There is a single outlier, but this does not appear to be too
extreme according to its standardized residual value. The main problem with the
model is multicollinearity, as evidenced by the high correlations between all variables,
something that was somehow played down by the earlier VIF values. The t test
results are therefore likely to be very dubious.
The Stepwise output features a new predictor laglnimp which happens to be selected
for the final step 2 model. The problem is that this model too is likely to suffer from
multicollinearity.
b. Hence the preferred model of all those considered is the Stepwise (step 1) simple
regression model with the lngdp predictor. This model has a very high R2
(97.5%) and is highly significant according to the p-value (0.000) provided by
Stepwise.
25. a.
        y       x1
x1   0.964
     0.000
x2   0.873   0.815
     0.001   0.004
(Cell contents: Pearson correlation / p-value)
b.
Analysis of Variance
Source DF SS MS F P
Regression 2 93.072 46.536 68.88 0.000
Residual Error 7 4.729 0.676
Total 9 97.801
Source DF Seq SS
x1 1 90.855
x2 1 2.217
c. The VIF values in (b) do not suggest that multicollinearity is a problem, despite
the significant correlation between x1 and x2 found in (a). Clearly, however, there are
problems, since x2 (with a p-value = 0.113 > 0.05 = α) is not a significant predictor whereas
x1 (with a p-value = 0.001 < 0.05 = α) is. (According to the correlations, both should be
significant predictors.)
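As a cross-check, with exactly two predictors both VIF values can be recovered from the x1-x2 correlation alone; a minimal sketch:

```python
# For a two-predictor model, VIF1 = VIF2 = 1 / (1 - r^2), where r is the
# sample correlation between x1 and x2 (0.815 from the matrix in part (a)).
r12 = 0.815
vif = 1 / (1 - r12 ** 2)
print(round(vif, 2))  # 2.98
```

A VIF of about 3 sits well below the usual rule-of-thumb thresholds of 5 or 10, consistent with the comment that the VIF values do not flag multicollinearity.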
d. The relevant plots are as follows:
None of these plots seem to be out of line with theoretical assumptions but the sample
size is relatively small, so this is not altogether unexpected.
26. a.
Step 1 2
Constant 1.575278 0.008928
x1 1.42 1.11
T-Value 10.23 5.25
P-Value 0.000 0.001
x2 0.139
T-Value 1.81
P-Value 0.113
S 0.932 0.822
R-Sq 92.90 95.16
R-Sq(adj) 92.01 93.78
Mallows C-p 4.3 3.0
Response is y
Mallows xx
Vars R-Sq R-Sq(adj) C-p S 12
1 92.9 92.0 4.3 0.93182 X
1 76.1 73.2 28.5 1.7078 X
2 95.2 93.8 3.0 0.82193 X X
Stepwise seems to favour the full two-predictor model, which also corresponds to the model
described on the bottom line of the Best Subsets output. Yet the adj R2 value (92.0%) for
the single-predictor x1 model (step 1) is almost the same as that for the full model (93.8%).
Note that the corresponding difference between root mean square error values is slightly more
pronounced.
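The Mallows C-p figure of 4.3 for the step 1 model can be reconstructed from the output; a minimal sketch, assuming n = 10 (as in the ANOVA table of exercise 25) and MSE = 0.676 for the full model:

```python
# Mallows C-p: C_p = SSE_p / MSE_full - (n - 2 * (p + 1)),
# where SSE_p = s^2 * (n - p - 1) for the p-predictor model
# (s = 0.932 for the single-x1 model in the Stepwise output).
n = 10
s_reduced, p_reduced = 0.932, 1
mse_full = 0.676

sse_reduced = s_reduced ** 2 * (n - p_reduced - 1)
cp = sse_reduced / mse_full - (n - 2 * (p_reduced + 1))
print(round(cp, 1))  # 4.3
```

For the full model SSE/MSE equals the error degrees of freedom, so its C-p reduces to p + 1 = 3.0, exactly as printed.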
b. Despite this, the single x1 alternative might be favoured given earlier concerns about hidden
multicollinearity.
Analysis of Variance
Source DF SS MS F P
Regression 2 22.344 11.172 179.77 0.000
Residual Error 15 0.932 0.062
Total 17 23.276
Source DF Seq SS
x1 1 20.294
x2 1 2.050
Unusual Observations
a.
New
Obs Fit SE Fit 90% CI
1 7.2196 0.0709 (7.0953, 7.3439)
b.
New
Obs Fit SE Fit 90% PI
1 7.2196 0.0709 (6.7653, 7.6740)
c. The (confidence) interval in a. corresponds to the mean for all proposals with the scores x1
= 8.0 and x2 = 7.8, whereas the (prediction) interval in b. corresponds to a specific
individual proposal with the scores x1 = 8.0 and x2 = 7.8.
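The prediction interval in b. can be reconstructed from the printed Fit and SE Fit together with the model's error mean square (0.062 from the ANOVA table, on 15 degrees of freedom, so t.05(15) = 1.753); a minimal sketch:

```python
import math

# 90% prediction interval: fit +/- t * sqrt(s^2 + SE_fit^2),
# where s^2 is the error mean square of the fitted model.
fit, se_fit = 7.2196, 0.0709
s_sq, t = 0.062, 1.753  # MSE from the ANOVA table; t_{.05} on 15 df

margin = t * math.sqrt(s_sq + se_fit ** 2)
print(round(fit - margin, 4), round(fit + margin, 4))
```

Up to rounding in the error mean square, this reproduces the printed interval (6.7653, 7.6740); the extra s^2 term is why the prediction interval is so much wider than the confidence interval.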
28. a.
Stepwise Regression: y versus x1, x2
Step 1 2
Constant -6.436 -6.926
x2 1.73 1.10
T-Value 13.69 5.74
P-Value 0.000 0.000
x1 0.70
T-Value 3.80
P-Value 0.002
S 0.338 0.249
R-Sq 92.13 96.00
R-Sq(adj) 91.64 95.46
Mallows C-p 15.5 3.0
Response is y
Mallows xx
Vars R-Sq R-Sq(adj) C-p S 12
1 92.1 91.6 15.5 0.33834 X
1 87.2 86.4 34.0 0.43170 X
2 96.0 95.5 3.0 0.24929 X X
From the Stepwise and Best Subsets output it is clear the full two-predictor model is
the most favoured. Both predictors X1 and X2 contribute very significantly to the model
according to the relevant t ratios and p-values. The root mean square error value is
also markedly better for this model than for the alternatives.
b. The two-predictor model is conspicuously better than either single-predictor
alternative for representing the data.
Y X1 X2
X1 0.828
0.001
X2 0.492 0.041
0.104 0.899
From the correlations here it can be seen that X1 is by far the most strongly correlated with
Y (0.828), and that X1 and X2 are almost uncorrelated with each other (0.041).
Analysis of Variance
Source DF SS MS F P
Regression 3 670.54 223.51 23.18 0.000
Residual Error 8 77.13 9.64
Total 11 747.67
Source DF Seq SS
X1 1 513.14
X2 1 156.94
X3 1 0.46
The output above shows a significant linear model has been fitted (the p-value for the F ratio is
0.000 < 0.05 = α). X1 and X2 are significant predictors of Y (for each of their t ratios the
p-value is < 0.05 = α).
b.
Step 1 2
Constant 1.409 2.356
X1 0.589 0.590
T-Value 7.11 7.53
P-Value 0.000 0.000
X2 0.364 0.351
T-Value 3.42 4.27
P-Value 0.009 0.002
X3 0.09
T-Value 0.22
P-Value 0.832
S 3.10 2.94
R-Sq 89.68 89.62
R-Sq(adj) 85.82 87.32
Mallows C-p 4.0 2.0
The best model according to this procedure is the one featuring the two predictors X1
and X2. This is essentially what we would have expected following the regression
analysis in (a).
30. The initial model does not seem to be affected by multicollinearity judging from the VIF
values, yet the sample correlations between predictors do look potentially problematic in
places, e.g. the correlation between cmplain and rises of 0.666. T ratios for the model are
given below:
Only the t ratio for cmplain is significant: under H0: βi = 0, i = 1, 2, 3, …, 6, t is
distributed on n - p - 1 = 30 - 6 - 1 = 23 degrees of freedom, and apart from cmplain
none of the ratios above is > t.025(23) = 2.069 or < -t.025(23) = -2.069.
Using the Total sum of squares result of 4296.97 and the root mean square error value =
7.0680 from the bottom line of the Best Subsets output, the ANOVA table for the
model can be constructed as follows:
df SS MS F
Regression 6 3147.968 524.661 10.502
Error 23 1149.002 49.957
Total 29 4296.97
The F statistic here is significant (10.502 > 2.53 = F.05 for an F distribution on 6 and
23 degrees of freedom). Thus, we would reject H0: β1 = β2 = … = β6 = 0 and deduce the
model is significant.
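The reconstruction above can be sketched end to end; a minimal check of the figures, using SST = 4296.97 and s = 7.0680 from the Best Subsets output:

```python
# Rebuild the ANOVA table from the total sum of squares and the root mean
# square error: MSE = s^2, SSE = MSE * df_error, SSR = SST - SSE.
n, p = 30, 6
ss_tot, s = 4296.97, 7.0680

df_err = n - p - 1            # 23
mse = s ** 2                  # approx. 49.957
ss_err = mse * df_err         # approx. 1149.0
ss_reg = ss_tot - ss_err      # approx. 3147.97
f = (ss_reg / p) / mse

print(round(f, 3))  # 10.502
```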
From the Best subsets output two models stand out - namely the first of the two
predictor models and the first of the three predictor models listed. The three predictor
model features the advance predictor which is not strongly correlated with y. In this
sense the two predictor one might therefore be preferred. The full six predictor model
falls well short of either of these alternatives.
31. a. MODEL 1
This is a significant regression model based on the F ratio result of 19.25. (Under H0:
β1 = 0, F has an F distribution on 1 and 18 degrees of freedom. The 5% critical value
for this distribution is F.05 = 4.41. Since F = 19.25 > F.05 = 4.41 we would therefore
reject H0.)
MODEL 2
From the Error SS information provided we can recreate the corresponding ANOVA
table as follows:
df SS MS F
Regression 3 62.636 20.879 10.553
Error 16 31.655 1.978
Total 19 94.291
(since the TOTAL SS remains the same however many predictors we choose for the
modelling). The F statistic here is significant (10.553 > 3.24 = F.05 for an F
distribution on 3 and 16 degrees of freedom). Thus, we would reject H0: β1 = β2 = β3 =
0 and deduce the model is significant.
Model Error SS
1 45.568
2 31.655
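The Error SS figures tabled above are what a partial F test comparing the two models would use; a minimal sketch (n = 20, with 1 and 3 predictors respectively, as in the ANOVA tables above):

```python
# Partial F test: F = ((SSE_1 - SSE_2) / (p2 - p1)) / (SSE_2 / (n - p2 - 1)).
n, p1, p2 = 20, 1, 3
sse1, sse2 = 45.568, 31.655

f = ((sse1 - sse2) / (p2 - p1)) / (sse2 / (n - p2 - 1))
print(round(f, 2))  # 3.52
```

If this comparison is intended, F = 3.52 falls just short of F.05(2, 16) = 3.63, so the two extra predictors of MODEL 2 would not be judged jointly significant at the 5% level.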
32. a. The VIF values here reveal a major problem of multicollinearity. Thus, estimated
coefficients for the regression model as well as corresponding t tests are likely to be very
dubious. From the correlation matrix the source of the multicollinearity seems to be
between the doprod and consum predictors. With a correlation of 0.999 they would be
regarded mathematically by MINITAB as being, in essence, identical variables. One of
them needs to be dropped from the model – it is up to the analyst to decide which.
Whether the stock predictor is worth retaining is another issue and could be investigated
using stepwise procedures.
The R2 = 97.3% result is impressive and corresponds with an F value for the ANOVA
table of F = (R2/p) / ((1 - R2)/(n - p - 1)) = (0.973/3) / (0.027/14) ≈ 168.2, which is
clearly significant.
From the Durbin Watson tables provided, it can be shown for n = 18 p = 3 and α =
0.025 that dL = 0.82 and dU = 1.56. Based on a two sided test approach we deduce the
test result of d = 0.24 indicates significant positive autocorrelation of errors exists.
b. The model is very problematic as it stands. Both the multicollinearity and the first-order
serial correlation of errors need to be resolved before it can be seriously
considered as a statistical prediction tool.
33. MODEL 1
T statistics can be calculated for the estimated model as the regression
coefficient / standard error ratios, as follows:
t
Constant 1.39
x1 0.84
x2 -1.10
x3 -1.21
x4 0.05
x5 0.69
x6 1.37
Given:
H0: βi = 0
H1: βi ≠ 0, i = 0, 1, 2, …, 6
and α = .05, the above ratios under H0 have a t distribution on 17 degrees of freedom,
where 17 = n - p - 1 with n = 24 and p = the number of independent variables = 6.
As none of the ratios above is > t.025(17) = 2.11 or < -t.025(17) = -2.11, we cannot
reject H0 for i = 0, 1, 2, …, 6.
From the R2 value of 0.385, the F statistic for the ANOVA table can be calculated
as F = (R2/p) / ((1 - R2)/(n - p - 1)) = (0.385/6) / (0.615/17) ≈ 1.77. Since 1.77 <
2.70 = F.05 for an F distribution on 6 and 17 degrees of freedom, the model is not
significant overall.
The Durbin Watson statistic cannot strictly be tested using the critical values
provided in the book (only a maximum of 5 predictors is catered for), but given that
for n = 24, p = 5 and α = 0.025, dL = 0.83 and dU = 1.79, a two-sided test is likely
to be inconclusive.
MODEL 2
The second model features an additional four interaction terms. Their presence
seems to improve the R2 result considerably, and it can be shown from the t
ratios below that x1 and x3x1 are significant predictors:
t
Constant -0.86
x1 2.94
x2 1.80
x3 0.50
x4 1.78
x5 1.14
x6 1.63
x3x1 -2.53
x1x4 -2.01
x1x5 -0.57
x1x6 -0.21
(As before,
H0: βi = 0
H1: βi ≠ 0, i = 0, 1, 2, …, 10
and α = .05. The above ratios under H0 have a t distribution on 13 degrees of
freedom, where 13 = n - p - 1, n = 24 and p = the number of independent variables
= 10. The relevant critical values in this case are t.025(13) = 2.16 and -t.025(13) =
-2.16.)