Unit 4: VARIABLE SELECTION AND MODEL BUILDING

Introduction
In most practical problems, especially those involving historical data, the analyst has a
rather large pool of possible candidate regressors, of which only a few are likely to be
important. Finding an appropriate subset of regressors for the model is often called the variable
selection problem.
Good variable selection methods are very important in the presence of multicollinearity.
Frankly, the most common corrective technique for multicollinearity is variable selection.
Variable selection does not guarantee elimination of multicollinearity. There are cases where
two or more regressors are highly related; yet, some subset of them really does belong in the
model. Our variable selection methods help to justify the presence of these highly related
regressors in the final model.
Multicollinearity is not the only reason to pursue variable selection techniques. Even
mild relationships that our multicollinearity diagnostics do not flag as problematic can have an
impact on model selection. The use of good model selection techniques increases our
confidence in the final model or models recommended.

Building a regression model that includes only a subset of the available regressors
involves two conflicting objectives.

(1) We would like the model to include as many regressors as possible so that the information
content in these factors can influence the predicted value of y.

(2) We want the model to include as few regressors as possible because the variance of the
prediction ŷ increases as the number of regressors increases.

Also, the more regressors there are in a model, the greater the costs of data collection and model
maintenance. The process of finding a model that is a compromise between these two
objectives is called selecting the “best” regression equation.

Criteria for Evaluating Subset Regression Models


Two key aspects of the variable selection problem are generating the subset models and
deciding if one subset is better than another.

Coefficient of Multiple Determination


A measure of the adequacy of a regression model that has been widely used is the coefficient
of multiple determination, R2. Let Rp2 denote the coefficient of multiple determination for a
subset regression model with p terms, that is, p − 1 regressors and an intercept term β0.
Computationally,
Rp2 = SSR(p) / SST = 1 − SSRes(p) / SST
where SSR(p) and SSRes(p) denote the regression sum of squares and the residual sum of
squares, respectively, for a p-term subset model, and SST is the total sum of squares. Rp2
increases as p increases and is a maximum
when p = K + 1. Since we cannot find an “optimum” value of R2 for a subset regression model, we
must look for a “satisfactory” value.

A hypothetical plot of the maximum value of Rp2 for each subset of size p against p shows that,
as the number of regressors increases, Rp2 also increases.

Limitations of R2:
1) Adding a regressor always increases R2, even when the added regressor has no significant
influence on the response. A larger R2 therefore does not by itself mean a better model, so
R2 alone is not a reliable criterion.
2) The usual definition of R2 requires an intercept term; R2 is not meaningful for a regression
model fitted without an intercept.
3) R2 is sensitive to extreme values.

Adjusted R2
To avoid the difficulties of interpreting R2, some analysts prefer to use the adjusted R2
statistic, defined for a p-term equation as
Adjusted Rp2 = 1 − ((n − 1) / (n − p)) (1 − Rp2)

The Adjusted Rp2 statistic does not necessarily increase as additional regressors are
introduced into the model. If s regressors are added to the model, Adjusted Rp+s2 will exceed
Adjusted Rp2 if and only if the partial F statistic for testing the significance of the s additional
regressors exceeds 1. One criterion for selection of an optimum subset model is to choose the
model that has a maximum Adjusted Rp2.
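As an illustration of both the R2 and adjusted R2 criteria, the following sketch (not part of the original text; it assumes statsmodels is available and uses simulated data) fits a sequence of nested models and prints Rp2 and Adjusted Rp2 side by side. R2 never decreases as regressors are added, while adjusted R2 can drop when a pure-noise regressor enters.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
x1, x2, x3 = rng.normal(size=(3, n))          # x3 is a pure-noise regressor
y = 2.0 + 1.5 * x1 + 0.8 * x2 + rng.normal(size=n)

X_full = np.column_stack([x1, x2, x3])
for k in range(1, 4):
    X = sm.add_constant(X_full[:, :k])        # intercept plus first k regressors
    fit = sm.OLS(y, X).fit()
    print(k, round(fit.rsquared, 4), round(fit.rsquared_adj, 4))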
Residual Mean Square
Another criterion for model evaluation is the residual mean square for a subset regression
model, given by
MSRes(p) = SSRes(p) / (n − p)

As p increases, MSRes (p) initially decreases, then stabilizes, and eventually may increase.
The eventual increase in MSRes (p) occurs when the reduction in SSRes (p) from adding a
regressor to the model is not sufficient to compensate for the loss of one degree of freedom in
the denominator.

Advocates of the MSRes(p) criterion will plot MSRes(p) versus p and base the choice of p on
the following:
1. The minimum MSRes(p)
2. The value of p such that MSRes(p) is approximately equal to MSRes for the full model
3. A value of p near the point where the smallest MSRes(p) turns upward.
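As a hypothetical numerical illustration (the numbers are invented, not from the text): suppose n = 25 and a model with p = 4 terms has SSRes(p) = 250, so MSRes(p) = 250 / (25 − 4) ≈ 11.9. If adding a fifth term only reduces the residual sum of squares to 245, then MSRes(p) = 245 / (25 − 5) = 12.25, which is larger; the small reduction in SSRes(p) does not compensate for the lost degree of freedom in the denominator.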

Mallows Cp Statistic
This criterion is related to the total mean square error of the fitted values, which reflects both
the bias and the variance of a fitted regression model. The Mallows Cp statistic is given by
Cp = SSRes(p) / σ̂2 − n + 2p
where σ̂2 is usually estimated by MSRes of the full model containing all K candidate regressors.
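A minimal sketch (not from the original text) of computing MSRes(p) and Cp for one candidate subset with statsmodels, estimating σ̂2 by the residual mean square of the full model as in the formula above; the helper name and arguments are illustrative assumptions.

import statsmodels.api as sm

def ms_res_and_cp(y, X_subset, X_full):
    # X_subset: regressors of the candidate model; X_full: all K candidate regressors
    n = len(y)
    fit_sub = sm.OLS(y, sm.add_constant(X_subset)).fit()
    fit_full = sm.OLS(y, sm.add_constant(X_full)).fit()
    p = X_subset.shape[1] + 1                  # p - 1 regressors plus the intercept
    ss_res_p = fit_sub.ssr                     # SSRes(p)
    ms_res_p = ss_res_p / (n - p)              # MSRes(p)
    sigma2_hat = fit_full.mse_resid            # MSRes of the full model
    cp = ss_res_p / sigma2_hat - n + 2 * p     # Mallows Cp
    return ms_res_p, cp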

Akaike Information Criterion (AIC)


Akaike proposed an information criterion, AIC, based on maximizing the expected entropy of
the model. AIC is given by
AIC = −2 ln(L) + 2p
where L is the likelihood function and p is the number of parameters (regression coefficients)
in the model.
As we add regressors to the model, SSRes cannot increase; the issue becomes whether the
decrease in SSRes justifies the inclusion of the extra terms.

Bayesian Information Criterion (BIC)


BIC is given by
BIC = −2 ln(L) + p ln(n)
This criterion places a greater penalty on adding regressors as the sample size n increases. AIC
and BIC are much more commonly used in model selection procedures involving more
complicated modeling situations than ordinary least squares. The lower the AIC (or BIC), the
better the model.
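A minimal sketch (not from the original text) of ranking competing subset models by AIC or BIC with statsmodels, which reports both criteria for a fitted OLS model; the helper name rank_by_aic_bic and the dictionary input are illustrative assumptions.

import statsmodels.api as sm

def rank_by_aic_bic(y, candidate_designs):
    # candidate_designs: dict mapping a label to a 2-D array of regressors
    rows = []
    for label, X in candidate_designs.items():
        fit = sm.OLS(y, sm.add_constant(X)).fit()
        rows.append((label, fit.aic, fit.bic))
    return sorted(rows, key=lambda r: r[1])    # smallest AIC first; use r[2] for BIC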

Computational Techniques for Variable Selection


To find the subset of variables to use in the final equation, it is natural to consider fitting
models with various combinations of the candidate regressors.
All Possible Regressions
Steps involved in this procedure are
1) First, the analyst fits all regression equations involving one candidate regressor, two
candidate regressors, and so on.
2) These equations are evaluated according to some suitable criterion and the “best”
regression model is selected.
3) If we assume that the intercept term β0 is included in all equations, then if there are K
candidate regressors, there are 2^K total equations to be estimated and examined.
4) For example, if K = 4, then there are 2^4 = 16 possible equations, while if K = 10, there are
2^10 = 1024 possible regression equations.
5) Clearly the number of equations to be examined increases rapidly as the number of
candidate regressors increases.
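A minimal sketch (not from the original text) of all possible regressions: every non-empty subset of the K candidate regressors is fitted with statsmodels and ranked here by adjusted R2; AIC, BIC, or Cp could be substituted in the same loop. The function name and arguments are illustrative assumptions.

from itertools import combinations
import statsmodels.api as sm

def all_possible_regressions(y, X, names):
    # X: 2-D array of the K candidate regressors; names: list of K column names
    results = []
    k = X.shape[1]
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            fit = sm.OLS(y, sm.add_constant(X[:, list(cols)])).fit()
            results.append(([names[c] for c in cols], round(fit.rsquared_adj, 4)))
    return sorted(results, key=lambda r: r[1], reverse=True)   # best subsets first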
Stepwise Regression Methods
Because evaluating all possible regressions can be burdensome computationally, various
methods have been developed for evaluating only a small number of subset regression
models by either adding or deleting regressors one at a time. These methods are referred to as
stepwise-type procedures. They can be classified into three broad categories:
(1) forward selection,
(2) backward elimination, and
(3) stepwise regression

Forward Selection
The steps involved in the forward selection procedures are
1) This procedure begins with the assumption that there are no regressors in the model other
than the intercept.
2) An effort is made to find an optimal subset by inserting regressors into the model one at a
time.
3) The first regressor selected for entry into the equation is the one that has the largest simple
correlation with the response variable y.
4) Suppose that this regressor is x1.
5) This is also the regressor that will produce the largest value of the F statistic for testing
significance of regression.
The F statistic for entering a regressor is the partial (extra-sum-of-squares) F statistic,

F = [SSRes(model without xi) − SSRes(model with xi)] / MSRes(model with xi)

6) This regressor is entered if the F statistic exceeds a preselected F value, say FIN (or F-to-enter).
Equivalently, the F statistic for testing significance of regression can be expressed in terms of
R2 as

F = [(n − p) R2] / [(p − 1)(1 − R2)]
7) The second regressor chosen for entry is the one that now has the largest partial correlation
with y after adjusting for the effect of the first regressor entered (x1) on y. Suppose that this
regressor is x2.
8) If the partial F value for x2 exceeds FIN, then x2 is added to the model.
9) In general, at each step the regressor having the highest partial correlation with y (or
equivalently the largest partial F statistic given the other regressors already in the model) is
added to the model if its partial F statistic exceeds the preselected entry level FIN.
10) The procedure terminates either when the partial F statistic at a particular step does not
exceed FIN or when the last candidate regressor is added to the model.
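A minimal sketch (not from the original text) of the forward selection procedure, assuming the regressors sit in a pandas DataFrame and using the extra-sum-of-squares partial F statistic described above; f_in plays the role of FIN, and the function name is an illustrative assumption.

import statsmodels.api as sm

def forward_selection(y, X, f_in=4.0):
    # X: DataFrame of candidate regressors; returns the list of selected columns
    selected, remaining = [], list(X.columns)
    while remaining:
        base = sm.OLS(y, sm.add_constant(X[selected])).fit() if selected else None
        best_f, best_var = None, None
        for var in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [var]])).fit()
            ss_base = base.ssr if base is not None else fit.centered_tss
            partial_f = (ss_base - fit.ssr) / fit.mse_resid   # extra SS over MSRes of larger model
            if best_f is None or partial_f > best_f:
                best_f, best_var = partial_f, var
        if best_f < f_in:                                     # no candidate clears F-to-enter
            break
        selected.append(best_var)
        remaining.remove(best_var)
    return selected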

Backward Elimination
Forward selection begins with no regressors in the model and attempts to insert variables
until a suitable model is obtained; backward elimination works in the opposite direction.
The steps involved in backward elimination are
1) We begin with a model that includes all K candidate regressors.
2) Then the partial F statistic (or equivalently, a t statistic) is computed for each regressor as
if it were the last variable to enter the model.
3) The smallest of these partial F (or t ) statistics is compared with a preselected value, FOUT
(or tOUT)
4) If the smallest partial F (or t), value is less than FOUT (or tOUT), that regressor is removed
from the model.
5) Now a regression model with K − 1 regressors is fit, the partial F (or t) statistics for this
new model are calculated, and the procedure repeated.
6) The backward elimination algorithm terminates when the smallest partial F (or t) value is
not less than the preselected cutoff value FOUT (or tOUT).
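A minimal sketch (not from the original text) of backward elimination, again assuming a pandas DataFrame of regressors; each regressor's partial F is taken as the square of its t statistic, and f_out plays the role of FOUT.

import statsmodels.api as sm

def backward_elimination(y, X, f_out=4.0):
    # X: DataFrame holding all K candidate regressors
    selected = list(X.columns)
    while selected:
        fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
        partial_f = fit.tvalues.drop("const") ** 2    # partial F = t^2 for each regressor
        weakest = partial_f.idxmin()
        if partial_f[weakest] >= f_out:               # smallest partial F clears the cutoff
            break
        selected.remove(weakest)
    return selected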

Stepwise Regression
Stepwise regression is a modification of forward selection in which at each step all regressors
entered into the model previously are reassessed via their partial F (or t) statistics.
A regressor added at an earlier step may now be redundant because of the relationships
between it and regressors now in the equation. If the partial F (or t) statistic for a variable is
less than FOUT (or tOUT), that variable is dropped from the model.
Stepwise regression requires two cutoff values, one for entering variables and one for
removing them.
Some analysts prefer to choose FIN (or tIN) = FOUT (or tOUT), although this is not necessary.
Frequently we choose FIN (or tIN) > FOUT (or tOUT), making it relatively more difficult to add a
regressor than to delete one.
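A minimal sketch (not from the original text) combining the two procedures into stepwise regression: a forward step enters the best candidate whose squared t statistic exceeds f_in, and a backward check then drops any previously entered regressor whose squared t statistic has fallen below f_out. The cutoffs satisfy f_in > f_out as recommended above; all names and default values are illustrative assumptions.

import statsmodels.api as sm

def stepwise_regression(y, X, f_in=4.0, f_out=3.9):
    selected = []
    while True:
        # forward step: find the best remaining regressor
        entered, best_f = None, f_in
        for var in (c for c in X.columns if c not in selected):
            fit = sm.OLS(y, sm.add_constant(X[selected + [var]])).fit()
            partial_f = fit.tvalues[var] ** 2
            if partial_f > best_f:
                best_f, entered = partial_f, var
        if entered is None:                           # no candidate clears F-to-enter
            break
        selected.append(entered)
        # backward check: reassess everything already in the model
        while True:
            fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
            partial_f = fit.tvalues.drop("const") ** 2
            weakest = partial_f.idxmin()
            if weakest != entered and partial_f[weakest] < f_out:
                selected.remove(weakest)              # never drop the variable that just entered
            else:
                break
    return selected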

Strategy For Variable Selection and Model Building


The steps involved in variable selection and model building are
1. Fit the largest model possible to the data.
2. Perform a thorough analysis of this model.
3. Determine if a transformation of the response or of some of the regressors is necessary.
4. Determine if all possible regressions are feasible.
a) If all possible regressions are feasible, perform all possible regressions using such
criteria as Mallows Cp, adjusted R2, and the PRESS statistic to rank the best subset
models.
b) If all possible regressions are not feasible, use stepwise selection techniques to
generate the largest model such that all possible regressions are feasible. Perform all
possible regressions as outlined above.
5. Compare and contrast the best models recommended by each criterion.
6. Perform a thorough analysis of the “best” models (usually three to five models).
7. Explore the need for further transformations.
8. Discuss with the subject - matter experts the relative advantages and disadvantages of the
final set of models.
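Since step 4a mentions the PRESS statistic, here is a minimal sketch (not from the original text) of computing it for a fitted statsmodels OLS model from the ordinary residuals and the hat-matrix diagonals; smaller PRESS values favor a candidate model.

import numpy as np

def press_statistic(fit):
    # fit: a fitted statsmodels OLS results object
    h = fit.get_influence().hat_matrix_diag       # leverage values h_ii
    e = fit.resid                                 # ordinary residuals e_i
    return np.sum((e / (1.0 - h)) ** 2)           # PRESS = sum of squared PRESS residuals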
