
AB1202

Statistics and Analysis


Lecture 10
Model Building

Chin Chee Kai


[email protected]
Nanyang Business School
Nanyang Technological University

Model Building
• Polynomial Models with One Random Variable
• Models with Qualitative Random Variables
• Multicollinearity
• Correlation Matrix
• Selecting “Better” Models
• Stepwise Modeling
• Stepwise Forward
• Stepwise Backward

Polynomial Models - One Random Variable

• We have seen the one-variable polynomial model of order 1 in Simple Linear Regression: 𝑦 = 𝑏0 + 𝑏1𝑥
• Order 2: 𝑦 = 𝑏0 + 𝑏1𝑥 + 𝑏2𝑥² (more interesting)
• Order 3: 𝑦 = 𝑏0 + 𝑏1𝑥 + 𝑏2𝑥² + 𝑏3𝑥³ (very interesting)
• In practice, this is about enough, since it gets increasingly difficult to explain higher order models.
• We will focus on order 2 models only in this course. Order 3 or higher order models are constructed in a similar manner.

Constructing 1-Var Polynomial Model

• Data captured is still (𝑥𝑖, 𝑦𝑖) for 𝑖 = 1, 2, …, 𝑛 samples.
• For order 2, the data for the 𝑥² term is pseudo-data formed by squaring each individual 𝑥𝑖 to give a set of 𝑥𝑖² values.
• Then we treat 𝑥𝑖² as if it were another variable and perform a multiple regression to explain variations of 𝑦 with 𝑥 and 𝑥².
• Eg:

  𝑦    8    7    9    12
  𝑥    2    3    5    8
  𝑥²   4    9    25   64

  Order 1: 𝑦 = 5.5714 + 0.7619𝑥, with 𝑅² = 0.8707, Adj-𝑅² = 0.8061
  Order 2: 𝑦 = 8.3182 − 0.6212𝑥 + 0.1364𝑥², with 𝑅² = 0.9459, Adj-𝑅² = 0.8377

• Note: Dropping the 𝑥² term from the Order 2 model does not make it the Order 1 regression model!
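A minimal R sketch that reproduces both fits above (R is also used later in this deck for stepwise modeling); lm() and I() are standard R:

y <- c(8, 7, 9, 12)
x <- c(2, 3, 5, 8)
fit1 <- lm(y ~ x)            # Order 1: y = 5.5714 + 0.7619 x
fit2 <- lm(y ~ x + I(x^2))   # Order 2: I(x^2) adds the squared pseudo-variable
summary(fit1)                # R-squared 0.8707, Adj R-squared 0.8061
summary(fit2)                # R-squared 0.9459, Adj R-squared 0.8377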

Models with Qualitative Random Variables

• Real-life data typically has plenty of qualitative variables whose values cannot be compared.
  ▫ Eg favorite sports, economy/business/first-class seat, type of housing, etc.
  ▫ Some values are sort-of comparable, like hotel rating (3, 4, 5, 6 stars) – we can say hotels with 3 stars provide facilities “less than” 6-star hotels. But we also do not say a 3-star hotel is less effective in serving customers – they cater to different customers.
• So it depends on how we use the (especially numerical) values. If we interpret/treat the values as qualitative, then we should use the following technique to build the model.

Coding Qualitative Random Variables

• The key is to encode qualitative values with 0’s and 1’s.
  ▫ Eg Has investment (1) vs no investment (0).
  ▫ Eg Male (1) vs female (0).
  ▫ Eg Lives in HDB (1) vs not in HDB (0).
• Binary qualitative values are common, and are easily encoded into 0’s and 1’s. The choice of which value is 0 or 1 is completely arbitrary, though we tend to use 1 for the value we are more interested in.

Coding Qualitative Random Variables

• What if there are 3 or more alternative values?
  ▫ Eg Economy/ Business/ First-class seat?
• The answer is not to use 0, 1 and 2. We still use ONLY 0’s and 1’s, but introduce more variables. One data variable becomes 2 model variables:

  𝑋𝐸 = 1 if Economy, 0 if not Economy
  𝑋𝐵 = 1 if Business, 0 if not Business

  𝑿             𝑿𝑬   𝑿𝑩
  Economy        1    0
  Economy        1    0
  Business       0    1
  Economy        1    0
  First-Class    0    0
  Business       0    1
  Economy        1    0

• There is no 𝑋𝐹 for First-Class: 𝑋𝐸 = 0 and 𝑋𝐵 = 0 implies First-Class!
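In R, this 0/1 coding can be generated automatically from a factor. A minimal sketch (the relevel() call makes First-Class the baseline level so the dummy columns match 𝑋𝐸 and 𝑋𝐵 above):

seat <- c("Economy", "Economy", "Business", "Economy",
          "First-Class", "Business", "Economy")
seat <- relevel(factor(seat), ref = "First-Class")  # baseline: First-Class
model.matrix(~ seat)  # 0/1 dummy columns for Economy and Business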

Coding Qualitative Random Variables

Raw data:

  Liking Y   Age A   Distance Flown D   Seat Class X
  8.5        42      230000             Economy
  6.5        37      100000             Economy
  7.1        23      85000              Business
  9.5        18      90000              Economy
  7.6        35      150000             First-Class
  3.4        56      60000              Business
  6.6        31      180000             Economy

Coded data (Distance D in hundred-thousand km):

  Liking Y   Age A   Distance D   X_E   X_B
  8.5        42      2.3          1     0
  6.5        37      1            1     0
  7.1        23      0.85         0     1
  9.5        18      0.9          1     0
  7.6        35      1.5          0     0
  3.4        56      0.6          0     1
  6.6        31      1.8          1     0

• We get a raw regression model encompassing all variables.
• Then we derive a set of models by turning “on” and “off” each encoded variable to assess the effect of each qualitative value.
• The effect of turning “on” and “off” the encoded variables is to change the y-intercept value.

Interpreting Coded Model

Raw Model: 𝑦 = 9.8994 − 0.1063 𝐴 + 0.948 𝐷 − 0.144 𝑋𝐸 − 1.1368 𝑋𝐵

  Class X       𝑿𝑬, 𝑿𝑩            Model
  Economy       𝑋𝐸 = 1, 𝑋𝐵 = 0    𝑦 = 9.7554 − 0.1063 𝐴 + 0.948 𝐷
  Business      𝑋𝐸 = 0, 𝑋𝐵 = 1    𝑦 = 8.7626 − 0.1063 𝐴 + 0.948 𝐷
  First-Class   𝑋𝐸 = 0, 𝑋𝐵 = 0    𝑦 = 9.8994 − 0.1063 𝐴 + 0.948 𝐷

• Being in Economy seat class decreases the liking, on average, by 0.144 points.
• Being in Business seat class decreases the liking, on average, by 1.1368 points.
• Being in First-Class gives the highest average liking, all other factors being the same.
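A minimal R sketch that reproduces the raw model from the coded data (column names follow the R code later in this deck):

d <- data.frame(
  Liking_Y   = c(8.5, 6.5, 7.1, 9.5, 7.6, 3.4, 6.6),
  Age_A      = c(42, 37, 23, 18, 35, 56, 31),
  Distance_D = c(2.3, 1, 0.85, 0.9, 1.5, 0.6, 1.8),
  X_E        = c(1, 1, 0, 1, 0, 0, 1),
  X_B        = c(0, 0, 1, 0, 0, 1, 0)
)
fit <- lm(Liking_Y ~ Age_A + Distance_D + X_E + X_B, data = d)
coef(fit)  # should recover 9.8994, -0.1063, 0.948, -0.144, -1.1368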

Multicollinearity

• Multicollinearity is said to occur whenever explanatory variables are dependent on one another.
• It is a bad thing to happen in a model. In serious cases, it can make the resulting model completely useless.
• Eg: Consider 𝑦 = 𝑏0 + 𝑏1𝑥1 + 𝑏2𝑥2.
• Suppose 𝑋2 is so correlated with 𝑋1 that in fact they are the same, 𝑋2 = 𝑋1 (but you didn’t realize it).
• Our model would degenerate into 𝑦 = 𝑏0 + 𝑏1′𝑥1. You can easily see that the gradient which we think belongs to 𝑥1 actually gets distorted from its true value (it should have been 𝑏1, but we only observe values which are closer to 𝑏1 + 𝑏2). The variance of this gradient also gets inflated.
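A minimal R sketch of this degenerate case, using simulated data of our own (not from the slides): when 𝑋2 is an exact copy of 𝑋1, R drops the aliased variable and the surviving gradient absorbs 𝑏1 + 𝑏2:

set.seed(1)
x1 <- rnorm(50)
x2 <- x1                                      # perfectly collinear duplicate
y  <- 2 + 3*x1 + 5*x2 + rnorm(50, sd = 0.5)   # true b1 = 3, b2 = 5
coef(lm(y ~ x1 + x2))  # x2 comes back NA; x1's slope is close to 8 = b1 + b2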

Multicollinearity – Can’t Avoid

• Yet we cannot avoid some level of multicollinearity, since practical data will always have a little bit of correlation with various other data.
• Thus, the key is not to eliminate multicollinearity, but to reduce serious cases of it.
• Two ways to detect multicollinearity:
  ▫ Correlation Matrix (or correlation diagrams)
  ▫ Variance Inflation Factor
• We will only look at the Correlation Matrix.

Correlation Matrix

• It is a table of correlations of all variables with all variables.
• It flags suspicious multicollinear variables when a cell has correlation magnitude close to 1.
• Use Excel’s Data Analysis → Correlation to get the correlation matrix. Use Conditional Formatting to color-highlight strong values (very red and very green).
• Note: the dependent variable is expected to have some correlation with the explanatory variables, so the Liking Y column is not important.

For the coded airline data (parentheses denote negative values):

              Liking Y   Age A      Distance D   X_E        X_B
  Liking Y    1.0000
  Age A       (0.7472)   1.0000
  Distance D  0.4331     0.0401     1.0000
  X_E         0.4836     (0.2560)   0.4531       1.0000
  X_B         (0.6312)   0.2687     (0.6204)     (0.7303)   1.0000
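The same matrix can be computed in R with cor(). A minimal sketch, reusing the data frame d from the earlier sketch:

round(cor(d), 4)  # correlations of all variables with all variables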

Multicollinearity – Can’t Avoid

• When two explanatory variables have strong correlation, we may want to remove one of them.
• Removing a variable is an exercise in both art and statistics:
  ▫ Contextual understanding: eg, remove the variable that is a derived variable rather than a source.
  ▫ Remove the variable giving the lower 𝑅² or Adjusted-𝑅².
  ▫ Remove the variable which correlates with several other variables.

Selecting “Better” Models

• We need an objective way to see if a model is good or not.
  ▫ Notice we say “a way”, not “the way”.
• An objective function is a formula that combines one or more outputs of a model into a single number so that the goodness of models can be compared (larger value better, or the other way around).
• We must agree on what the objective function is.
• Examples of objective functions:
  ▫ F-test statistic (larger better)
  ▫ p-value (smaller better)
  ▫ 𝑅² or Adjusted-𝑅² (larger better)
  ▫ AIC (Akaike Information Criterion, used by R; smaller better)
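A minimal R sketch showing where these objective values live for a fitted lm() model (here fit is the two-variable model from the earlier sketches):

fit <- lm(Liking_Y ~ Age_A + Distance_D, data = d)
summary(fit)$fstatistic     # F-test statistic, with its degrees of freedom
summary(fit)$r.squared      # R-squared (larger better)
summary(fit)$adj.r.squared  # Adjusted R-squared (larger better)
AIC(fit)                    # Akaike Information Criterion (smaller better)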

Stepwise Modeling

• When we have gathered many variables and lots of samples, often we might not know where to begin.
• We can (let the computer) “search” for the best model by incrementally trying one variable at a time.
• We use the selected objective function to adaptively zero in on the best model.
• This is cool! We get the best model automatically!
• This is dangerous! The “best” model we get may not make much practical sense (it may be best numerically, but the variables selected may be quite meaningless).

Forward Stepwise

1. Start with the null model: 𝑦 = 𝑏0 (ie no explanatory variables). Calculate its objective value (eg AIC).
2. For each of the remaining variables, try adding that variable to the existing model and calculate the objective value. The variable that results in the most improved objective value is actually added.
3. Repeat step 2 until no remaining variable can improve the objective value.

Forward Stepwise in R

• Consider again the airline seating class data set. We wonder which variable(s) we should use to best explain fluctuations in liking of the airline.
• R’s step() function does this in one line.

datatext = "Liking_Y Age_A Distance_D X_E X_B
8.5 42 2.3 1 0
6.5 37 1 1 0
7.1 23 0.85 0 1
9.5 18 0.9 1 0
7.6 35 1.5 0 0
3.4 56 0.6 0 1
6.6 31 1.8 1 0
"
d <- read.delim(textConnection(datatext),
                header=TRUE,
                sep="",
                strip.white=TRUE)
model_null = lm(d$Liking_Y ~ 1)
model_full = lm(d$Liking_Y ~ d$Age_A + d$Distance_D + d$X_E + d$X_B)
# scope must be a formula; formula() extracts it from the fitted full model
# (an lm object has no $formula component, so model_full$formula would be NULL)
model <- step(model_null, formula(model_full), direction="forward")

### model_back <- step(model_full, direction="backward")

Forward Stepwise Results in R

Start: AIC=10.09
d$Liking_Y ~ 1

                 Df Sum of Sq     RSS     AIC
+ d$Age_A         1   12.4121  9.8221  6.3711
+ d$X_B           1    8.8573 13.3770  8.5334
<none>                        22.2343 10.0901
+ d$X_E           1    5.2001 17.0342 10.2252
+ d$Distance_D    1    4.1709 18.0634 10.6358

Starting with the null model, R tries to add one of A, X_B, X_E and D and ranks their performance by AIC values. It seems adding A is best, reducing AIC from the current 10.09 to 6.3711.

Step: AIC=6.37
d$Liking_Y ~ d$Age_A

                 Df Sum of Sq    RSS    AIC
+ d$Distance_D    1    4.7750 5.0471 3.7103
+ d$X_B           1    4.4387 5.3835 4.1620
<none>                        9.8221 6.3711
+ d$X_E           1    2.0335 7.7887 6.7473

R tries to add one of the remaining variables X_B, X_E and D and ranks their performance by AIC values. It seems adding D is best, reducing AIC from the current 6.3711 to 3.7103.

Step: AIC=3.71
d$Liking_Y ~ d$Age_A + d$Distance_D

                 Df Sum of Sq    RSS    AIC
<none>                        5.0471 3.7103
+ d$X_B           1   0.79654 4.2506 4.5080
+ d$X_E           1   0.18538 4.8617 5.4484

The best model uses Age and Distance only:
Liking = 9.2235 − 0.1177 Age + 1.4647 Distance

Backward Stepwise

1. Start with the full model: 𝑦 = 𝑏0 + 𝑏1𝑥1 + ⋯ + 𝑏𝑘𝑥𝑘 (ie all explanatory variables). Calculate its objective value (eg AIC).
2. For each variable in the current model, test removing that variable from the existing model and calculate the objective value. The removal that results in the most improved objective value is actually carried out.
3. Repeat step 2 until removing a variable can no longer improve the objective value.
• This is very simply done in R: calling step() with direction=“backward” will do, as sketched below.
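A minimal sketch, reusing model_full from the forward stepwise code (with no scope given, step() may remove any term, down to the intercept-only model):

model_back <- step(model_full, direction="backward")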

Backward Stepwise Results in R

Start: AIC=6.48
d$Liking_Y ~ d$Age_A + d$Distance_D + d$X_E + d$X_B

                 Df Sum of Sq     RSS     AIC
- d$X_E           1    0.0164  4.2506  4.5080
- d$X_B           1    0.6276  4.8617  5.4484
- d$Distance_D    1    1.1392  5.3734  6.1488
<none>                         4.2341  6.4809
- d$Age_A         1    9.0560 13.2901 12.4878

Starting with the full model, R tries to REMOVE one of A, D, X_E and X_B and ranks their performance by AIC values. It seems removing X_E is best, reducing AIC from the current 6.48 to 4.5080.

Step: AIC=4.51
d$Liking_Y ~ d$Age_A + d$Distance_D + d$X_B

                 Df Sum of Sq     RSS     AIC
- d$X_B           1    0.7965  5.0471  3.7103
- d$Distance_D    1    1.1329  5.3835  4.1620
<none>                         4.2506  4.5080
- d$Age_A         1    9.0640 13.3146 10.5007

R tries to REMOVE one of the remaining variables A, D and X_B and ranks their performance by AIC values. It seems removing X_B is best, reducing AIC from the current 4.5080 to 3.7103.

Step: AIC=3.71
d$Liking_Y ~ d$Age_A + d$Distance_D

                 Df Sum of Sq     RSS     AIC
<none>                         5.0471  3.7103
- d$Distance_D    1     4.775  9.8221  6.3711
- d$Age_A         1    13.016 18.0634 10.6358

The best model uses Age and Distance only:
Liking = 9.2235 − 0.1177 Age + 1.4647 Distance
