AB1202 Statistics and Analysis: Model Building
Model Building
• Polynomial Models with One Random Variable
• Models with Qualitative Random Variables
• Multicollinearity
• Correlation Matrix
• Selecting “Better” Models
• Stepwise Modeling
• Stepwise Forward
• Stepwise Backward
Models with Qualitative Random Variables

Liking_Y  Age_A  Distance_D (km)  X_E  X_B
8.5       42     2.3              1    0
6.5       37     1                1    0
7.1       23     0.85             0    1
9.5       18     0.9              1    0
7.6       35     1.5              0    0
3.4       56     0.6              0    1
6.6       31     1.8              1    0

• Effect of turning “on” and “off” the encoded variables is to change the y-intercept value.
• (Slide table: each seating Class with its encoded values 𝑋𝐸, 𝑋𝐵 and the corresponding Model.)
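As a rough illustration of the intercept-shift idea (the slides do not show this in code; the data and variable names are taken from the table above), the same data can be loaded and the dummy-variable model fitted in R:

# Load the slide's data; X_E and X_B are the encoded (dummy) class variables.
datatext <- "Liking_Y Age_A Distance_D X_E X_B
8.5 42 2.3 1 0
6.5 37 1 1 0
7.1 23 0.85 0 1
9.5 18 0.9 1 0
7.6 35 1.5 0 0
3.4 56 0.6 0 1
6.6 31 1.8 1 0
"
d <- read.delim(textConnection(datatext), header=TRUE, sep="", strip.white=TRUE)

# When X_E = X_B = 0 the intercept is b0; setting X_E = 1 (or X_B = 1)
# adds that dummy's coefficient to the intercept, while the slopes on
# Age and Distance stay the same.
m <- lm(Liking_Y ~ Age_A + Distance_D + X_E + X_B, data = d)
coef(m)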
Multicollinearity
• Multicollinearity is said to occur whenever explanatory variables are highly correlated with (linearly dependent on) one another.
• It is a bad thing to happen in a model. In serious cases, it can make the resulting model completely useless.
• Eg: Consider 𝑦 = 𝑏0 + 𝑏1𝑥1 + 𝑏2𝑥2.
• Suppose 𝑋2 is so correlated with 𝑋1 that in fact they are the same, 𝑋2 = 𝑋1 (but you didn’t realize it).
• Our model would then degenerate into
  𝑦 = 𝑏0 + 𝑏1′𝑥1
  You can easily see that the gradient which we think belongs to 𝑥1 gets distorted from its true value: it should have been 𝑏1, but we only observe values closer to 𝑏1′ = 𝑏1 + 𝑏2. The variance of this gradient estimate also gets inflated.
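The distortion and variance inflation can be seen in a small simulation. This is an illustrative sketch only; the variable names and coefficient values are made up for the demonstration and are not part of the course data.

# Illustrative simulation: x2 is a near-duplicate of x1.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)          # almost identical to x1
y  <- 2 + 3*x1 + 4*x2 + rnorm(n)        # true b1 = 3, b2 = 4

# With both predictors included, the two slope estimates are unstable and
# their standard errors are hugely inflated (severe multicollinearity).
summary(lm(y ~ x1 + x2))

# Dropping x2 gives a stable fit, but the slope on x1 is now close to
# b1 + b2 = 7 rather than the true b1 = 3, as described above.
summary(lm(y ~ x1))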
Correlation Matrix
• It is a table of the correlations of every variable with every other variable.
• It flags suspected multicollinearity: a pair of explanatory variables is suspicious when its cell has a correlation magnitude close to 1.
• The dependent variable is expected to have some correlation with the explanatory variables, so its row/column is not important here.
• Use Excel’s Data Analysis (Correlation) to get the correlation matrix. Use Conditional Formatting to color-highlight strong values.
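The same matrix can be produced in R. This is an assumed alternative to the Excel route above, reusing the data frame d loaded in the earlier sketch:

# Correlation matrix of all variables, rounded for readability.
round(cor(d), 2)

# Look for magnitudes close to 1 among the explanatory variables;
# the Liking_Y row/column is expected to show correlation and can be ignored here.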
Stepwise Modeling
• When we have gathered many variables and lots of samples, we often might not know where to begin.
• We can (let the computer) “search” for the best model by incrementally trying one variable at a time.
• Use a selected objective function to adaptively zero in on the best model (see the note below).
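• For reference, the objective value used by R’s step() in the examples that follow is the Akaike Information Criterion, AIC = 2𝑘 − 2 ln 𝐿, where 𝑘 is the number of estimated parameters and 𝐿 is the maximised likelihood of the fitted model; a smaller AIC indicates a better trade-off between goodness of fit and model size.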
Forward Stepwise
1. Start with the null model: 𝑦 = 𝑏0 (i.e. no explanatory variables). Calculate its objective value (e.g. AIC).
2. For each of the remaining variables, try adding that one variable to the existing model and calculate the objective value. The variable that gives the most improved objective value is the one actually added.
3. Repeat step 2 until no remaining variable can improve the objective value.
Forward Stepwise in R
• Consider again the airline seating class data set.
We wonder which variable(s) we should be using to
best explain fluctuations in liking of the airline.
datatext = "Liking_Y Age_A Distance_D X_E X_B
8.5 42 2.3 1 0
6.5 37 1 1 0
7.1 23 0.85 0 1
9.5 18 0.9 1 0
7.6 35 1.5 0 0
3.4 56 0.6 0 1
6.6 31 1.8 1 0
"
d <- read.delim(textConnection(datatext),
                header=TRUE,
                sep="",
                strip.white=TRUE)

model_null = lm(d$Liking_Y ~ 1)
model_full = lm(d$Liking_Y ~ d$Age_A + d$Distance_D + d$X_E + d$X_B)

# R's step() function does this in one line.
model <- step(model_null, scope=formula(model_full), direction="forward")
Step:  AIC=3.71
d$Liking_Y ~ d$Age_A + d$Distance_D

          Df Sum of Sq    RSS    AIC
<none>                  5.0471 3.7103
+ d$X_B    1   0.79654 4.2506 4.5080
+ d$X_E    1   0.18538 4.8617 5.4484

Best model is to use Age and Distance only. The fitted model is:
Liking = 9.2235 − 0.1177 Age + 1.4647 Distance
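The quoted coefficients can be read off the selected model (an assumed usage note, not shown on the slide):

coef(model)      # intercept and slopes of the selected model
summary(model)   # adds standard errors, t-values and p-values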
Backward Stepwise
1. Start with the full model: 𝑦 = 𝑏0 + 𝑏1𝑥1 + ⋯ + 𝑏𝑘𝑥𝑘 (i.e. all explanatory variables). Calculate its objective value (e.g. AIC).
2. For each variable in the current model, test removing that one variable from the existing model and calculate the objective value. The removal that gives the most improved objective value is the one actually carried out.
3. Repeat step 2 until removing any variable can no longer improve the objective value.
• This is very simply done in R: calling the step() function with direction="backward" will do, as sketched below.
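A minimal sketch of the backward run, assuming the model_full object fitted in the forward example above; the final step of its printed output follows.

# Backward stepwise: start from the full model and drop variables one at a time.
model <- step(model_full, direction="backward")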
Step:  AIC=3.71
d$Liking_Y ~ d$Age_A + d$Distance_D

                 Df Sum of Sq     RSS     AIC
<none>                         5.0471  3.7103
- d$Distance_D    1     4.775  9.8221  6.3711
- d$Age_A         1    13.016 18.0634 10.6358

Best model is to use Age and Distance only. The fitted model is:
Liking = 9.2235 − 0.1177 Age + 1.4647 Distance