Unit 4: VARIABLE SELECTION AND MODEL BUILDING

Introduction
In most practical problems, especially those involving historical data, the analyst has a
rather large pool of possible candidate regressors, of which only a few are likely to be
important. Finding an appropriate subset of regressors for the model is often called the variable
selection problem.
Good variable selection methods are very important in the presence of multicollinearity.
Frankly, the most common corrective technique for multicollinearity is variable selection.
Variable selection does not guarantee elimination of multicollinearity. There are cases where
two or more regressors are highly related; yet, some subset of them really does belong in the
model. Our variable selection methods help to justify the presence of these highly related
regressors in the final model.
Multicollinearity is not the only reason to pursue variable selection techniques. Even
mild relationships that our multicollinearity diagnostics do not flag as problematic can have an
impact on model selection. The use of good model selection techniques increases our
confidence in the final model or models recommended.

Building a regression model that includes only a subset of the available regressors
involves two conflicting objectives.

(1) We would like the model to include as many regressors as possible so that the information
content in these factors can influence the predicted value of y.

(2) We want the model to include as few regressors as possible because the variance of the
prediction ŷ increases as the number of regressors increases.

Also, the more regressors there are in a model, the greater the costs of data collection and model
maintenance. The process of finding a model that is a compromise between these two
objectives is called selecting the “best” regression equation.

Criteria for Evaluating Subset Regression Models


Two key aspects of the variable selection problem are generating the subset models and
deciding if one subset is better than another.

Coefficient of Multiple Determination


A measure of the adequacy of a regression model that has been widely used is the coefficient
of multiple determination, R2. Let Rp2 denote the coefficient of multiple determination for a
subset regression model with p terms, that is, p − 1 regressors and an intercept term β0.
Computationally,
Rp2 = SSR(p) / SST = 1 − SSRes(p) / SST
where SSR(p) and SSRes(p) denote the regression sum of squares and the residual sum of
squares, respectively, for a p-term subset model, and SST is the total sum of squares. Rp2
increases as p increases and is a maximum
when p = K + 1. Since we cannot find an “optimum” value of R2 for a subset regression model, we
must look for a “satisfactory” value.

A hypothetical plot of the maximum value of Rp2 for each subset of size p against p shows that,
as the number of regressors increases, Rp2 also increases.

Limitations of R2:
1) Adding a regressor always increases R2, even when the added regressor has no significant
influence on the response. A larger R2 therefore does not by itself mean a better model, so
R2 alone is not a reliable criterion.
2) The usual definition of R2 requires an intercept term; R2 is not meaningful for a regression
model fitted without an intercept.
3) R2 is sensitive to extreme values.

Adjusted R2
To avoid the difficulties of interpreting R2, some analysts prefer to use the adjusted R2
statistic, defined for a p-term equation as
Adjusted Rp2 = 1 − ((n − 1) / (n − p)) (1 − Rp2)

The Adjusted Rp2 statistic does not necessarily increase as additional regressors are
introduced into the model. If s regressors are added to the model, Adjusted Rp+s2 will exceed
Adjusted Rp2 if and only if the partial F statistic for testing the significance of the s additional
regressors exceeds 1. One criterion for selection of an optimum subset model is to choose the
model that has a maximum Adjusted Rp2.
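As an illustration of both the R2 and adjusted R2 criteria, the following sketch (not part of the original text; it assumes statsmodels is available and uses simulated data) fits a sequence of nested models and prints Rp2 and Adjusted Rp2 side by side. R2 never decreases as regressors are added, while adjusted R2 can drop when a pure-noise regressor enters.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
x1, x2, x3 = rng.normal(size=(3, n))          # x3 is a pure-noise regressor
y = 2.0 + 1.5 * x1 + 0.8 * x2 + rng.normal(size=n)

X_full = np.column_stack([x1, x2, x3])
for k in range(1, 4):
    X = sm.add_constant(X_full[:, :k])        # intercept plus first k regressors
    fit = sm.OLS(y, X).fit()
    print(k, round(fit.rsquared, 4), round(fit.rsquared_adj, 4))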
Residual Mean Square
Another criterion for model evaluation is the residual mean square for a subset regression
model, given by
MSRes(p) = SSRes(p) / (n − p)

As p increases, MSRes (p) initially decreases, then stabilizes, and eventually may increase.
The eventual increase in MSRes (p) occurs when the reduction in SSRes (p) from adding a
regressor to the model is not sufficient to compensate for the loss of one degree of freedom in
the denominator.

Advocates of the MSRes(p) criterion will plot MSRes(p) versus p and base the choice of p on
the following:
1. The minimum MSRes(p)
2. The value of p such that MSRes(p) is approximately equal to MSRes for the full model
3. A value of p near the point where the smallest MSRes(p) turns upward.
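As a hypothetical numerical illustration (the numbers are invented, not from the text): suppose n = 25 and a model with p = 4 terms has SSRes(p) = 250, so MSRes(p) = 250 / (25 − 4) ≈ 11.9. If adding a fifth term only reduces the residual sum of squares to 245, then MSRes(p) = 245 / (25 − 5) = 12.25, which is larger; the small reduction in SSRes(p) does not compensate for the lost degree of freedom in the denominator.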

Mallows Cp Statistic
This criterion is related to the total mean square error of the fitted values, which reflects both
the bias and the variance of a fitted regression model. The Mallows Cp statistic is given by
Cp = SSRes(p) / σ̂2 − n + 2p
where σ̂2 is usually estimated by MSRes of the full model containing all K candidate regressors.
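A minimal sketch (not from the original text) of computing MSRes(p) and Cp for one candidate subset with statsmodels, estimating σ̂2 by the residual mean square of the full model as in the formula above; the helper name and arguments are illustrative assumptions.

import statsmodels.api as sm

def ms_res_and_cp(y, X_subset, X_full):
    # X_subset: regressors of the candidate model; X_full: all K candidate regressors
    n = len(y)
    fit_sub = sm.OLS(y, sm.add_constant(X_subset)).fit()
    fit_full = sm.OLS(y, sm.add_constant(X_full)).fit()
    p = X_subset.shape[1] + 1                  # p - 1 regressors plus the intercept
    ss_res_p = fit_sub.ssr                     # SSRes(p)
    ms_res_p = ss_res_p / (n - p)              # MSRes(p)
    sigma2_hat = fit_full.mse_resid            # MSRes of the full model
    cp = ss_res_p / sigma2_hat - n + 2 * p     # Mallows Cp
    return ms_res_p, cp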

Akaike Information Criterion (AIC)


Akaike proposed an information criterion, AIC, based on maximizing the expected entropy of
the model. AIC is given by
AIC = −2 ln(L) + 2p
where L is the likelihood function and p is the number of parameters (regression coefficients)
in the model.
As we add regressors to the model, SSRes cannot increase; the issue becomes whether the
decrease in SSRes justifies the inclusion of the extra terms.

Bayesian Information Criterion (BIC)


BIC is given by
BIC = −2 ln(L) + p ln(n)
This criterion places a greater penalty on adding regressors as the sample size n increases. AIC
and BIC are much more commonly used in model selection procedures involving more
complicated modeling situations than ordinary least squares. The lower the AIC (or BIC), the
better the model.
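A minimal sketch (not from the original text) of ranking competing subset models by AIC or BIC with statsmodels, which reports both criteria for a fitted OLS model; the helper name rank_by_aic_bic and the dictionary input are illustrative assumptions.

import statsmodels.api as sm

def rank_by_aic_bic(y, candidate_designs):
    # candidate_designs: dict mapping a label to a 2-D array of regressors
    rows = []
    for label, X in candidate_designs.items():
        fit = sm.OLS(y, sm.add_constant(X)).fit()
        rows.append((label, fit.aic, fit.bic))
    return sorted(rows, key=lambda r: r[1])    # smallest AIC first; use r[2] for BIC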

Computational Techniques for Variable Selection


To find the subset of variables to use in the final equation, it is natural to consider fitting
models with various combinations of the candidate regressors.
All Possible Regressions
Steps involved in this procedure are
1) First, the analyst fits all regression equations involving one candidate regressor, two
candidate regressors, and so on.
2) These equations are evaluated according to some suitable criterion and the “best”
regression model is selected.
3) If we assume that the intercept term β0 is included in all equations, then if there are K
candidate regressors, there are 2^K total equations to be estimated and examined.
4) For example, if K = 4, then there are 2^4 = 16 possible equations, while if K = 10, there are
2^10 = 1024 possible regression equations.
5) Clearly the number of equations to be examined increases rapidly as the number of
candidate regressors increases.
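A minimal sketch (not from the original text) of all possible regressions: every non-empty subset of the K candidate regressors is fitted with statsmodels and ranked here by adjusted R2; AIC, BIC, or Cp could be substituted in the same loop. The function name and arguments are illustrative assumptions.

from itertools import combinations
import statsmodels.api as sm

def all_possible_regressions(y, X, names):
    # X: 2-D array of the K candidate regressors; names: list of K column names
    results = []
    k = X.shape[1]
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            fit = sm.OLS(y, sm.add_constant(X[:, list(cols)])).fit()
            results.append(([names[c] for c in cols], round(fit.rsquared_adj, 4)))
    return sorted(results, key=lambda r: r[1], reverse=True)   # best subsets first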
Stepwise Regression Methods
Because evaluating all possible regressions can be burdensome computationally, various
methods have been developed for evaluating only a small number of subset regression
models by either adding or deleting regressors one at a time. These methods are referred to as
stepwise-type procedures. They can be classified into three broad categories:
(1) forward selection,
(2) backward elimination, and
(3) stepwise regression

Forward Selection
The steps involved in the forward selection procedures are
1) This procedure begins with the assumption that there are no regressors in the model other
than the intercept.
2) An effort is made to find an optimal subset by inserting regressors into the model one at a
time.
3) The first regressor selected for entry into the equation is the one that has the largest simple
correlation with the response variable y.
4) Suppose that this regressor is x1.
5) This is also the regressor that will produce the largest value of the F statistic for testing
significance of regression.
The F statistic for entering a regressor is the partial (extra-sum-of-squares) F statistic,

F = [SSRes(model without xi) − SSRes(model with xi)] / MSRes(model with xi)

6) This regressor is entered if the F statistic exceeds a preselected F value, say FIN (or F-to-enter).
Equivalently, the F statistic for testing significance of regression can be expressed in terms of
R2 as

F = [(n − p) R2] / [(p − 1)(1 − R2)]
7) The second regressor chosen for entry is the one that now has the largest partial correlation
with y after adjusting for the effect of the first regressor entered (x1) on y. Suppose that this
regressor is x2.
8) If the partial F value for x2 exceeds FIN, then x2 is added to the model.
9) In general, at each step the regressor having the highest partial correlation with y (or
equivalently the largest partial F statistic given the other regressors already in the model) is
added to the model if its partial F statistic exceeds the preselected entry level FIN.
10) The procedure terminates either when the partial F statistic at a particular step does not
exceed FIN or when the last candidate regressor is added to the model.
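A minimal sketch (not from the original text) of the forward selection procedure, assuming the regressors sit in a pandas DataFrame and using the extra-sum-of-squares partial F statistic described above; f_in plays the role of FIN, and the function name is an illustrative assumption.

import statsmodels.api as sm

def forward_selection(y, X, f_in=4.0):
    # X: DataFrame of candidate regressors; returns the list of selected columns
    selected, remaining = [], list(X.columns)
    while remaining:
        base = sm.OLS(y, sm.add_constant(X[selected])).fit() if selected else None
        best_f, best_var = None, None
        for var in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [var]])).fit()
            ss_base = base.ssr if base is not None else fit.centered_tss
            partial_f = (ss_base - fit.ssr) / fit.mse_resid   # extra SS over MSRes of larger model
            if best_f is None or partial_f > best_f:
                best_f, best_var = partial_f, var
        if best_f < f_in:                                     # no candidate clears F-to-enter
            break
        selected.append(best_var)
        remaining.remove(best_var)
    return selected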

Backward Elimination
Forward selection begins with no regressors in the model and attempts to insert variables
until a suitable model is obtained; backward elimination works in the opposite direction.
The steps involved in backward elimination are
1) We begin with a model that includes all K candidate regressors.
2) Then the partial F statistic (or equivalently, a t statistic) is computed for each regressor as
if it were the last variable to enter the model.
3) The smallest of these partial F (or t ) statistics is compared with a preselected value, FOUT
(or tOUT)
4) If the smallest partial F (or t), value is less than FOUT (or tOUT), that regressor is removed
from the model.
5) Now a regression model with K − 1 regressors is fit, the partial F (or t) statistics for this
new model are calculated, and the procedure repeated.
6) The backward elimination algorithm terminates when the smallest partial F (or t) value is
not less than the preselected cutoff value FOUT (or tOUT).
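A minimal sketch (not from the original text) of backward elimination, again assuming a pandas DataFrame of regressors; each regressor's partial F is taken as the square of its t statistic, and f_out plays the role of FOUT.

import statsmodels.api as sm

def backward_elimination(y, X, f_out=4.0):
    # X: DataFrame holding all K candidate regressors
    selected = list(X.columns)
    while selected:
        fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
        partial_f = fit.tvalues.drop("const") ** 2    # partial F = t^2 for each regressor
        weakest = partial_f.idxmin()
        if partial_f[weakest] >= f_out:               # smallest partial F clears the cutoff
            break
        selected.remove(weakest)
    return selected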

Stepwise Regression
Stepwise regression is a modification of forward selection in which at each step all regressors
entered into the model previously are reassessed via their partial F (or t) statistics.
A regressor added at an earlier step may now be redundant because of the relationships
between it and regressors now in the equation. If the partial F (or t) statistic for a variable is
less than FOUT (or tOUT), that variable is dropped from the model.
Stepwise regression requires two cutoff values, one for entering variables and one for
removing them.
Some analysts prefer to choose FIN (or tIN) = FOUT (or tOUT), although this is not necessary.
Frequently we choose FIN (or tIN) > FOUT (or tOUT), making it relatively more difficult to add a
regressor than to delete one.
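A minimal sketch (not from the original text) combining the two procedures into stepwise regression: a forward step enters the best candidate whose squared t statistic exceeds f_in, and a backward check then drops any previously entered regressor whose squared t statistic has fallen below f_out. The cutoffs satisfy f_in > f_out as recommended above; all names and default values are illustrative assumptions.

import statsmodels.api as sm

def stepwise_regression(y, X, f_in=4.0, f_out=3.9):
    selected = []
    while True:
        # forward step: find the best remaining regressor
        entered, best_f = None, f_in
        for var in (c for c in X.columns if c not in selected):
            fit = sm.OLS(y, sm.add_constant(X[selected + [var]])).fit()
            partial_f = fit.tvalues[var] ** 2
            if partial_f > best_f:
                best_f, entered = partial_f, var
        if entered is None:                           # no candidate clears F-to-enter
            break
        selected.append(entered)
        # backward check: reassess everything already in the model
        while True:
            fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
            partial_f = fit.tvalues.drop("const") ** 2
            weakest = partial_f.idxmin()
            if weakest != entered and partial_f[weakest] < f_out:
                selected.remove(weakest)              # never drop the variable that just entered
            else:
                break
    return selected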

Strategy For Variable Selection and Model Building


The steps involved in variable selection and model building are
1. Fit the largest model possible to the data.
2. Perform a thorough analysis of this model.
3. Determine if a transformation of the response or of some of the regressors is necessary.
4. Determine if all possible regressions are feasible.
a) If all possible regressions are feasible, perform all possible regressions using such
criteria as Mallows Cp, adjusted R2, and the PRESS statistic to rank the best subset
models.
b) If all possible regressions are not feasible, use stepwise selection techniques to
generate the largest model such that all possible regressions are feasible. Perform all
possible regressions as outlined above.
5. Compare and contrast the best models recommended by each criterion.
6. Perform a thorough analysis of the “best” models (usually three to five models).
7. Explore the need for further transformations.
8. Discuss with the subject - matter experts the relative advantages and disadvantages of the
final set of models.
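Since step 4a mentions the PRESS statistic, here is a minimal sketch (not from the original text) of computing it for a fitted statsmodels OLS model from the ordinary residuals and the hat-matrix diagonals; smaller PRESS values favor a candidate model.

import numpy as np

def press_statistic(fit):
    # fit: a fitted statsmodels OLS results object
    h = fit.get_influence().hat_matrix_diag       # leverage values h_ii
    e = fit.resid                                 # ordinary residuals e_i
    return np.sum((e / (1.0 - h)) ** 2)           # PRESS = sum of squared PRESS residuals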
