Unit 4
Introduction
In most practical problems, especially those involving historical data, the analyst has a
rather large pool of candidate regressors, of which only a few are likely to be important.
Finding an appropriate subset of regressors for the model is often called the variable
selection problem.
Good variable selection methods are very important in the presence of multicollinearity.
In fact, the most common corrective technique for multicollinearity is variable selection.
Variable selection does not guarantee elimination of multicollinearity. There are cases where
two or more regressors are highly related; yet, some subset of them really does belong in the
model. Our variable selection methods help to justify the presence of these highly related
regressors in the final model.
Multicollinearity is not the only reason to pursue variable selection techniques. Even
mild relationships that our multicollinearity diagnostics do not flag as problematic can have an
impact on model selection. The use of good model selection techniques increases our
confidence in the final model or models recommended.
Building a regression model that includes only a subset of the available regressors
involves two conflicting objectives.
(1) We would like the model to include as many regressors as possible so that the information
content in these factors can influence the predicted value of y.
(2) We want the model to include as few regressors as possible because the variance of the
prediction $\hat{y}$ increases as the number of regressors increases.
Also, the more regressors there are in a model, the greater the costs of data collection and model
maintenance. The process of finding a model that is a compromise between these two
objectives is called selecting the “best” regression equation.
A hypothetical plot of the maximum value of $R_p^2$ for each subset of size p against p shows that $R_p^2$ increases as the number of regressors increases, typically rising steeply at first and then leveling off as additional regressors contribute little new information; it reaches its maximum when all K candidate regressors are in the model.
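To make this concrete, here is a minimal Python sketch; the synthetic data set (n = 50 observations, K = 4 candidate regressors, of which only the first two actually influence y) is purely illustrative. For each subset size it reports the largest attainable $R^2$, which can only increase as the subset grows:

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
n, K = 50, 4
X = rng.normal(size=(n, K))          # candidate regressors; only two are "real"
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=1.0, size=n)

def r_squared(cols):
    """R^2 for the model with an intercept plus the regressors in `cols`."""
    Xp = np.column_stack((np.ones(n), X[:, list(cols)]))
    beta, *_ = np.linalg.lstsq(Xp, y, rcond=None)
    ss_res = np.sum((y - Xp @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# The best R^2 over all subsets of a given size never decreases as size grows.
for size in range(1, K + 1):
    best = max(r_squared(c) for c in combinations(range(K), size))
    print(f"best R^2 with {size} regressor(s): {best:.4f}")
```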
Limitations of R2:
1) $R^2$ never decreases when a regressor is added to the model, even when the added regressor is not significant and does not actually influence the response. In such situations $R^2$ is not a reliable criterion for choosing among models.
2) Computing $R^2$ requires the presence of an intercept; the usual definition of $R^2$ does not apply to a regression model fitted without an intercept.
3) $R^2$ is sensitive to extreme values.
Adjusted R2
To avoid the difficulties of interpreting $R^2$, some analysts prefer to use the adjusted $R^2$
statistic, defined for a p-term equation as

$$R^2_{Adj,p} = 1 - \left(\frac{n-1}{n-p}\right)\left(1 - R^2_p\right)$$
The $R^2_{Adj,p}$ statistic does not necessarily increase as additional regressors are
introduced into the model. If s regressors are added to the model, $R^2_{Adj,p+s}$ will exceed
$R^2_{Adj,p}$ if and only if the partial F statistic for testing the significance of the s additional
regressors exceeds 1. One criterion for selecting an optimal subset model is therefore to choose the
model that has a maximum $R^2_{Adj,p}$.
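Continuing the sketch above (reusing the illustrative X, y, n, and the r_squared helper), the adjusted $R^2$ follows directly from the definition, and comparing a good subset with the same subset plus a pure-noise regressor shows that the statistic need not increase:

```python
def adjusted_r_squared(cols):
    """Adjusted R^2 for a p-term model: 1 - ((n - 1)/(n - p)) * (1 - R^2_p)."""
    p = len(cols) + 1                   # terms in the model, incl. intercept
    return 1.0 - (n - 1) / (n - p) * (1.0 - r_squared(cols))

print(adjusted_r_squared([0, 1]))       # the two truly active regressors
print(adjusted_r_squared([0, 1, 3]))    # adding a noise regressor can lower it
```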
Residual Mean Square
Another criterion for model evaluation is the residual mean square for a subset regression model,
which is given by

$$MS_{Res}(p) = \frac{SS_{Res}(p)}{n - p}$$
As p increases, $MS_{Res}(p)$ initially decreases, then stabilizes, and eventually may increase.
The eventual increase in $MS_{Res}(p)$ occurs when the reduction in $SS_{Res}(p)$ from adding a
regressor to the model is not sufficient to compensate for the loss of one degree of freedom in
the denominator.
Advocates of the $MS_{Res}(p)$ criterion will plot $MS_{Res}(p)$ versus p and base the choice of p on
the following:
1. The minimum $MS_{Res}(p)$
2. The value of p such that $MS_{Res}(p)$ is approximately equal to $MS_{Res}$ for the full model
3. A value of p near the point where the smallest $MS_{Res}(p)$ turns upward.
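A short sketch of this criterion, again reusing the illustrative data and imports from the first example:

```python
def ss_res(cols):
    """Residual sum of squares for the subset model with regressors in `cols`."""
    Xp = np.column_stack((np.ones(n), X[:, list(cols)]))
    beta, *_ = np.linalg.lstsq(Xp, y, rcond=None)
    return np.sum((y - Xp @ beta) ** 2)

def ms_res(cols):
    """Residual mean square MS_Res(p) = SS_Res(p) / (n - p)."""
    p = len(cols) + 1                   # terms in the model, incl. intercept
    return ss_res(cols) / (n - p)

# One might tabulate (or plot) the best MS_Res(p) for each p and look for
# the minimum or the point where the curve flattens out.
for size in range(1, K + 1):
    best = min(ms_res(c) for c in combinations(range(K), size))
    print(f"best MS_Res with {size} regressor(s): {best:.4f}")
```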
Mallows Cp Statistic
This statistic measures the total mean square error of a fitted regression model, reflecting
both the variance and the bias of the fitted values. The Mallows $C_p$ statistic is given by
$$C_p = \frac{SS_{Res}(p)}{\hat{\sigma}^2} - n + 2p$$

where $\hat{\sigma}^2$ is usually estimated by $MS_{Res}$ for the full model containing all K candidate regressors, so that equivalently

$$C_p = \frac{SS_{Res}(p)}{MS_{Res}(\text{full model})} - n + 2p$$
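A sketch of the computation, reusing the ss_res and ms_res helpers above and estimating $\hat{\sigma}^2$ by the full-model $MS_{Res}$:

```python
def mallows_cp(cols):
    """C_p = SS_Res(p) / sigma2_hat - n + 2p, where sigma^2 is estimated by
    the residual mean square of the full model (all K candidate regressors)."""
    p = len(cols) + 1                   # terms in the subset model
    sigma2_hat = ms_res(list(range(K)))
    return ss_res(cols) / sigma2_hat - n + 2 * p

# Subset models with little bias have C_p close to p.
for size in range(1, K + 1):
    best = min(mallows_cp(c) for c in combinations(range(K), size))
    print(f"best C_p with {size} regressor(s): {best:.2f}")
```

Note that for the full model $C_p = p$ by construction, so the criterion is informative only for proper subsets.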
Forward Selection
The steps involved in the forward selection procedure are:
1) The procedure begins with the assumption that there are no regressors in the model other
than the intercept.
2) An effort is made to find an optimal subset by inserting regressors into the model one at a
time.
3) The first regressor selected for entry into the equation is the one that has the largest simple
correlation with the response variable y; suppose that this regressor is $x_1$. This is also the
regressor that will produce the largest value of the F statistic for testing significance of
regression.
4) At each subsequent step, the candidate regressor with the largest partial F statistic is
entered, where the partial F statistic for entering a regressor $x_j$ is calculated by using the formula

$$F_j = \frac{SS_{Res}(p) - SS_{Res}(p+1)}{MS_{Res}(p+1)}$$

in which $SS_{Res}(p)$ is the residual sum of squares before $x_j$ is added and $SS_{Res}(p+1)$ and
$MS_{Res}(p+1)$ refer to the model after $x_j$ is added.
5) The procedure terminates when the largest partial F statistic does not exceed a preselected
entry value $F_{IN}$ (or $t_{IN}$), or when all candidate regressors have been added.
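The following is a minimal sketch of the whole procedure, reusing the illustrative data and the ss_res helper from above; the entry cutoff F_IN = 4.0 is an arbitrary illustrative choice, not a value prescribed by the text:

```python
def forward_selection(f_in=4.0):
    """Enter regressors one at a time by largest partial F until none exceeds F_IN."""
    selected = []
    while len(selected) < K:
        best_f, best_j = -np.inf, None
        for j in (j for j in range(K) if j not in selected):
            trial = selected + [j]
            p = len(trial) + 1          # terms in the trial model
            # Partial F for entering x_j: drop in SS_Res over trial-model MS_Res.
            f = (ss_res(selected) - ss_res(trial)) / (ss_res(trial) / (n - p))
            if f > best_f:
                best_f, best_j = f, j
        if best_f < f_in:               # no remaining candidate is worth entering
            break
        selected.append(best_j)
    return selected

print(forward_selection())              # should pick up the two active regressors
```

At the first step, selected is empty, so the partial F reduces to the F statistic for significance of regression of each one-regressor model, matching step 3 above.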
Backward Elimination
Forward selection begins with no regressors in the model and attempts to insert variables
until a suitable model is obtained. Backward elimination works in the opposite direction.
The steps involved in backward elimination are:
1) We begin with a model that includes all K candidate regressors.
2) The partial F statistic (or, equivalently, a t statistic) is computed for each regressor as
if it were the last variable to enter the model.
3) The smallest of these partial F (or t) statistics is compared with a preselected cutoff value $F_{OUT}$
(or $t_{OUT}$).
4) If the smallest partial F (or t) value is less than $F_{OUT}$ (or $t_{OUT}$), that regressor is removed
from the model.
5) A regression model with K − 1 regressors is then fit, the partial F (or t) statistics for this
new model are calculated, and the procedure is repeated.
6) The backward elimination algorithm terminates when the smallest partial F (or t) value is
not less than the preselected cutoff value $F_{OUT}$ (or $t_{OUT}$).
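A sketch of backward elimination in the same style (the cutoff F_OUT = 4.0 is again an arbitrary illustrative choice):

```python
def backward_elimination(f_out=4.0):
    """Start with all K regressors; repeatedly drop the one with the smallest
    partial F while that smallest value stays below F_OUT."""
    selected = list(range(K))
    while selected:
        p = len(selected) + 1
        current = ss_res(selected)
        # Partial F for each regressor as if it were the last to enter.
        partial_f = {j: (ss_res([k for k in selected if k != j]) - current)
                        / (current / (n - p))
                     for j in selected}
        j_min = min(partial_f, key=partial_f.get)
        if partial_f[j_min] >= f_out:   # smallest partial F not below F_OUT: stop
            break
        selected.remove(j_min)
    return selected

print(backward_elimination())
```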
Stepwise Regression
Stepwise regression is a modification of forward selection in which at each step all regressors
entered into the model previously are reassessed via their partial F (or t) statistics.
A regressor added at an earlier step may now be redundant because of the relationships
between it and the regressors now in the equation. If the partial F (or t) statistic for a variable is
less than $F_{OUT}$ (or $t_{OUT}$), that variable is dropped from the model.
Stepwise regression requires two cutoff values, one for entering variables and one for
removing them.
Some analysts prefer to choose $F_{IN}$ (or $t_{IN}$) = $F_{OUT}$ (or $t_{OUT}$), although this is not necessary.
Frequently we choose $F_{IN}$ (or $t_{IN}$) > $F_{OUT}$ (or $t_{OUT}$), making it relatively more difficult to add a
regressor than to delete one.
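Combining the forward and backward steps gives a compact stepwise sketch; the cutoffs F_IN = 4.0 > F_OUT = 3.9 follow the advice above but are otherwise arbitrary, and the data and ss_res helper come from the earlier sketches:

```python
def stepwise(f_in=4.0, f_out=3.9):
    """Forward step by largest partial F, then re-check every regressor already
    in the model and drop any whose partial F has fallen below F_OUT."""
    selected = []
    while True:
        # Forward step: enter the candidate with the largest partial F, if > F_IN.
        best_f, best_j = -np.inf, None
        for j in (j for j in range(K) if j not in selected):
            trial = selected + [j]
            p = len(trial) + 1
            f = (ss_res(selected) - ss_res(trial)) / (ss_res(trial) / (n - p))
            if f > best_f:
                best_f, best_j = f, j
        if best_j is None or best_f < f_in:
            return selected             # nothing left that is worth entering
        selected.append(best_j)
        # Backward re-check: drop regressors whose partial F fell below F_OUT.
        while len(selected) > 1:
            p = len(selected) + 1
            current = ss_res(selected)
            partial_f = {j: (ss_res([k for k in selected if k != j]) - current)
                            / (current / (n - p))
                         for j in selected}
            j_min = min(partial_f, key=partial_f.get)
            if partial_f[j_min] >= f_out:
                break
            selected.remove(j_min)

print(stepwise())
```

Because F_IN > F_OUT here, a regressor that has just entered cannot be removed in the immediately following backward check (its partial F at that point is the same value that exceeded F_IN), which helps the procedure settle.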