2 Multiple LR
Industrial Engineers
Yi = β0 + β1X1i + β2X2i + … + βkXki + εi
MULTIPLE REGRESSION
EQUATION
The coefficients of the multiple regression model are
estimated using sample data
Ŷi = b0 + b1X1i + b2X2i + … + bkXki
In this chapter we will use Excel or Minitab to obtain the
regression slope coefficients and other regression
summary measures.
MULTIPLE REGRESSION
EQUATION
(continued)

Two variable model:

Ŷ = b0 + b1X1 + b2X2

[Figure: the fitted regression plane Ŷ above the (X1, X2) plane, showing the slope for variable X1 and the slope for variable X2]
EXAMPLE:
2 INDEPENDENT VARIABLES
A distributor of frozen dessert pies wants to evaluate factors thought to influence demand.

Regression Statistics
R Square           0.52148
Adjusted R Square  0.44172
Standard Error     47.46341
Observations       15

Sales = 306.526 - 24.975(Price) + 74.131(Advertising)

ANOVA
            df   SS         MS         F        Significance F
Regression   2   29460.027  14730.013  6.53861  0.01201
Residual    12   27033.306  2252.776
Total       14   56493.333
Analysis of Variance
Source DF SS MS F P
Regression 2 29460 14730 6.54 0.012
Residual Error 12 27033 2253
Total 14 56493
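Output like the Excel and Minitab tables above can be reproduced programmatically. A minimal sketch using NumPy's least-squares solver; the (Price, Advertising, Sales) values below are made-up illustration data, not the textbook's full pie dataset:

```python
import numpy as np

# Hypothetical (Price, Advertising, Sales) observations for illustration
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4])
advert = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7])
sales = np.array([350.0, 460.0, 350.0, 430.0, 350.0, 380.0, 430.0, 470.0])

# Design matrix with an intercept column: Y = b0 + b1*Price + b2*Advertising
X = np.column_stack([np.ones(len(sales)), price, advert])
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
b0, b1, b2 = b
print(f"Sales = {b0:.3f} + {b1:.3f}(Price) + {b2:.3f}(Advertising)")
```

Dedicated packages (Excel's Analysis ToolPak, Minitab, or statsmodels in Python) additionally report the summary measures shown above.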
THE MULTIPLE REGRESSION
EQUATION
Input values:

New Obs  Price  Advertising
1        5.50   3.50

Predicted Ŷ value:

New Obs  Fit    SE Fit  95% CI           95% PI
1        428.6  17.2    (391.1, 466.1)   (318.6, 538.6)

The 95% PI is the prediction interval for an individual Y value, given these X input values.
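The Fit value follows directly from plugging the input values into the estimated regression equation, which can be checked in a line of Python:

```python
# Plug the input values into the estimated pie-sales equation
price, advertising = 5.50, 3.50
fit = 306.526 - 24.975 * price + 74.131 * advertising
print(round(fit, 1))  # → 428.6, matching the Fit column
```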
COEFFICIENT OF
MULTIPLE DETERMINATION
Analysis of Variance
Source          DF  SS     MS     F     P
Regression       2  29460  14730  6.54  0.012
Residual Error  12  27033  2253
Total           14  56493

r² = SSR/SST = 29460/56493 = 0.52148: 52.1% of the variation in pie sales is explained by the variation in price and advertising.
ADJUSTED R2
r2 never decreases when a new X variable is added to the model
This can be a disadvantage when comparing models
What is the net effect of adding a new variable?
We lose a degree of freedom when a new X variable
is added
Did the new X variable add enough explanatory
power to offset the loss of one degree of freedom?
ADJUSTED R2 (continued)
Shows the proportion of variation in Y explained by all X variables adjusted
for the number of X variables used
r²adj = 1 − (1 − r²) · (n − 1) / (n − k − 1)
(where n = sample size, k = number of independent variables)
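Applying the formula to the pie-sales example (n = 15, k = 2, r² = 0.52148) recovers the Adjusted R Square shown in the Excel output:

```python
# Adjusted r^2 for the pie-sales model: n = 15 observations, k = 2 predictors
n, k, r2 = 15, 2, 0.52148
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2_adj, 4))  # → 0.4417, matching Adjusted R Square up to rounding
```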
F TEST FOR OVERALL
SIGNIFICANCE

FSTAT = MSR / MSE = (SSR / k) / (SSE / (n − k − 1))

where FSTAT has numerator d.f. = k and
denominator d.f. = (n – k – 1)
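With the sums of squares from the pie-sales ANOVA table, the overall F statistic works out as follows:

```python
# Overall F statistic from the ANOVA sums of squares (pie-sales example)
SSR, SSE = 29460.027, 27033.306
n, k = 15, 2
MSR = SSR / k                  # 14730.013
MSE = SSE / (n - k - 1)        # 2252.776
F = MSR / MSE
print(round(F, 5))  # → 6.53861, matching the Excel output
```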
F TEST FOR OVERALL
SIGNIFICANCE IN EXCEL
(continued)

Regression Statistics
Multiple R  0.72213

Analysis of Variance
Source          DF  SS     MS     F     P
Regression       2  29460  14730  6.54  0.012
Residual Error  12  27033  2253
Total           14  56493

FSTAT = 6.54 with p-value 0.012 < 0.05, so the overall model is significant.
ei = (Yi − Ŷi)

[Figure: a sample observation (X1i, X2i, Yi) and its residual ei measured from the fitted plane]

The best fit equation is found by minimizing the sum of squared errors, Σe²
MULTIPLE REGRESSION
ASSUMPTIONS
Errors (residuals) from the regression model:
ei = (Yi − Ŷi)
Assumptions:
The errors are normally distributed
Errors have a constant variance
The model errors are independent
RESIDUAL PLOTS USED
IN MULTIPLE REGRESSION
Residuals vs. Ŷi
ARE INDIVIDUAL VARIABLES
SIGNIFICANT?

Test Statistic:

tSTAT = (bj − 0) / Sbj   (df = n – k – 1)
ARE INDIVIDUAL VARIABLES
SIGNIFICANT? EXCEL OUTPUT
(continued)
Regression Statistics
Multiple R         0.72213
R Square           0.52148
Adjusted R Square  0.44172
Standard Error     47.46341
Observations       15

ANOVA
            df   SS         MS         F        Significance F
Regression   2   29460.027  14730.013  6.53861  0.01201
Residual    12   27033.306  2252.776
Total       14   56493.333

t Stat for Price is tSTAT = −2.306, with p-value .0398
t Stat for Advertising is tSTAT = 2.855, with p-value .0145
Analysis of Variance
Source DF SS MS F P
Regression 2 29460 14730 6.54 0.012
Residual Error 12 27033 2253
Total 14 56493
INFERENCES ABOUT THE SLOPE:
CONFIDENCE INTERVAL ESTIMATE
bj ± tα/2 · Sbj   where t has (n – k – 1) d.f.
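A sketch of the interval for the Price slope in the pie-sales example. The standard error is not shown in the excerpted output, so it is backed out from the reported b and tSTAT values, an assumption of this illustration:

```python
from scipy import stats

# 95% CI for the Price slope: df = n - k - 1 = 12
# Standard error backed out from the reported b and tSTAT values
b_price, t_price, df = -24.975, -2.306, 12
se = b_price / t_price                  # ≈ 10.83
t_crit = stats.t.ppf(0.975, df)         # ≈ 2.1788
lo, hi = b_price - t_crit * se, b_price + t_crit * se
print(round(lo, 2), round(hi, 2))
```

The interval lies entirely below zero, consistent with Price having a significant negative effect on sales.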
TESTING PORTIONS OF
THE MODEL

Contribution of X1, given that X2 has been included:

SSR(X1 | X2) = SSR(all variables) − SSR(X2)

H0: variable Xj does not significantly improve the model after all other variables are included
H1: variable Xj significantly improves the model after all other variables are included

α = .05, df = 1 and 12
F0.05 = 4.75
(For X1 and X2)
ANOVA
            df   SS           MS
Regression   2   29460.02687  14730.01343
Residual    12   27033.30647  2252.775539
Total       14   56493.33333

(For X2 only)
ANOVA
            df   SS
Regression   1   17484.22249
Residual    13   39009.11085
Total       14   56493.33333
TESTING PORTIONS OF
MODEL: EXAMPLE
(continued)
(For X1 and X2)
ANOVA
            df   SS           MS
Regression   2   29460.02687  14730.01343
Residual    12   27033.30647  2252.775539
Total       14   56493.33333

(For X2 only)
ANOVA
            df   SS
Regression   1   17484.22249
Residual    13   39009.11085
Total       14   56493.33333

SSR(X1 | X2) = 29460.03 − 17484.22 = 11975.80
FSTAT = 11975.80 / 2252.78 = 5.32 > F0.05 = 4.75, so X1 significantly improves the model.
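The partial F test for X1 (Price) given X2 (Advertising) follows directly from the two ANOVA tables above:

```python
# Partial F test: contribution of X1 (Price), given X2 (Advertising)
SSR_full, SSR_x2 = 29460.02687, 17484.22249
MSE_full = 2252.775539            # from the full-model ANOVA, df = 12
SSR_x1_given_x2 = SSR_full - SSR_x2   # ≈ 11975.80
F = SSR_x1_given_x2 / MSE_full        # 1 numerator d.f.; ≈ 5.32
print(round(SSR_x1_given_x2, 2), round(F, 2))
```

Since 5.32 exceeds F0.05 = 4.75 (df = 1 and 12), X1 significantly improves the model.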
t²a = F1,a

where a = degrees of freedom
COEFFICIENT OF PARTIAL DETERMINATION
FOR K VARIABLE MODEL
r²Yj.(all variables except j)
DUMMY-VARIABLE EXAMPLE
(WITH 2 LEVELS)

Ŷ = b0 + b1X1 + b2X2
Let:
Y = pie sales
X1 = price
X2 = holiday (X2 = 1 if a holiday occurred during the week)
(X2 = 0 if there was no holiday that week)
DUMMY-VARIABLE EXAMPLE
(WITH 2 LEVELS)
(continued)
Holiday (X2 = 1):    Ŷ = b0 + b1X1 + b2(1) = (b0 + b2) + b1X1
No Holiday (X2 = 0): Ŷ = b0 + b1X1 + b2(0) = b0 + b1X1

Different intercept, same slope.
[Figure: Y (sales) vs. X1 (price), two parallel lines: the Holiday line (X2 = 1) with intercept b0 + b2 and the No Holiday line (X2 = 0) with intercept b0]

If H0: β2 = 0 is rejected, then "Holiday" has a significant effect on pie sales.
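The intercept-shift interpretation can be checked numerically. The coefficient values below are hypothetical, chosen only to illustrate the dummy-variable mechanics:

```python
# Hypothetical fitted coefficients for the holiday dummy model (illustration only)
b0, b1, b2 = 300.0, -30.0, 15.0   # intercept, price slope, holiday shift

def predicted_sales(price, holiday):
    """Y-hat = b0 + b1*X1 + b2*X2, where X2 is 1 for a holiday week, else 0."""
    return b0 + b1 * price + b2 * holiday

# Same price, holiday vs. no holiday: predictions differ only by b2
p = 5.0
print(predicted_sales(p, 1) - predicted_sales(p, 0))  # → 15.0 (= b2)
```

The price slope is identical on both lines; only the intercept shifts by b2.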
INTERPRETING THE DUMMY VARIABLE
COEFFICIENTS (WITH 3 LEVELS)
Y = house price
X1 = square feet
X2 = 1 if ranch, 0 otherwise
X3 = 1 if split level, 0 otherwise
Ŷ = b0 + b1X1 + b2X2 + b3X3
Ŷ = b0 + b1X1 + b2X2 + b3(X1X2)
EFFECT OF INTERACTION
Given:
Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
Suppose Y = 1 + 2X1 + 3X2 + 4X1X2:

X2 = 1: Y = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
X2 = 0: Y = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1

[Figure: the two lines plotted for X1 from 0 to 1.5]

Slopes are different if the effect of X1 on Y depends on the X2 value.
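The slope change can be verified numerically for the example model Y = 1 + 2X1 + 3X2 + 4X1X2, where the slope with respect to X1 is 2 + 4X2:

```python
# Interaction example from the slide: Y = 1 + 2*X1 + 3*X2 + 4*X1*X2
def y(x1, x2):
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

# Slope with respect to X1 is 2 + 4*X2, so it depends on the level of X2
slope_x2_0 = y(1, 0) - y(0, 0)   # → 2
slope_x2_1 = y(1, 1) - y(0, 1)   # → 6
print(slope_x2_0, slope_x2_1)
```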
SIGNIFICANCE OF
INTERACTION TERM
LOGISTIC REGRESSION

Used when the dependent variable Y is binary (i.e., Y takes on only two values).
Examples
Customer prefers Brand A or Brand B
Employee chooses to work full-time or part-time
Loan is delinquent or is not delinquent
Person voted in last election or did not
Odds ratio = probability of success / (1 − probability of success)
The logistic regression model is based on the natural log of this odds
ratio
LOGISTIC REGRESSION
(continued)
ln(odds ratio) = β0 + β1X1i + β2X2i + … + βkXki + εi
Once you have the logistic regression equation, compute the estimated odds ratio by exponentiating:

Estimated odds ratio = e^(b0 + b1X1 + b2X2 + … + bkXk)
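Converting a fitted log-odds value back to odds and to a probability is a one-liner. The coefficient values here are hypothetical, purely to illustrate the arithmetic:

```python
import math

# Convert a log-odds value from a logistic model into odds and a probability.
# Coefficient values are hypothetical.
b0, b1 = -1.0, 0.8
x1 = 1.25
log_odds = b0 + b1 * x1          # 0.0 for this x1
odds = math.exp(log_odds)        # estimated odds ratio
p = odds / (1 + odds)            # probability of success
print(odds, p)
```

A log-odds of 0 corresponds to odds of 1 and a success probability of 0.5.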
[Diagram: backward elimination, starting from the full candidate set X1, X2, X3, …, Xk, weak predictors are eliminated one at a time]
HOW TO IDENTIFY A
WEAK X
The weakest X is the variable in an MLR model with the largest p-value that is also not significant (i.e., p-value > α).
WHAT TO DO WITH A
WEAK X
Remove that X from the set of possible predictors. Recompute the multiple linear regression model with the remaining k − 1 predictors.
HOW DO YOU RECOGNIZE
THE BEST FIT
Choose the best fit by looking for the highest adjusted r² and lowest standard error.

All variables in the model (β1, β2, etc.) have p-values < α, meaning each variable's probability of being related to the predicted Y merely by chance is very low (i.e., the relationship is non-random: some significant effect caused the improbable to happen).
EXAMPLE
PROBLEM
Find the best fit regression model for Y for candidate
variables X1, X2, X3, X4, X5 and X6
Initial model: X1, X2, X3, X4, X5, X6
X1 out
Next model: X2, X3, X4, X5, X6
X5 out
Next model: X2, X3, X4, X6
X3 out
Next model: X2, X4, X6
X2 out
Final model: X4, X6
We have a winner!
Stopping condition: All variables in model have
significant p-values.
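The elimination loop above can be sketched in code. This is a minimal backward-elimination implementation under stated assumptions: ordinary least squares fit with NumPy, two-sided t-test p-values via SciPy, and made-up data in which Y depends on X1 and X3 only:

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Fit OLS with an intercept; return coefficients and two-sided p-values."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    k = Xd.shape[1] - 1
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    mse = resid @ resid / (n - k - 1)
    cov = mse * np.linalg.inv(Xd.T @ Xd)
    se = np.sqrt(np.diag(cov))
    t = beta / se
    p = 2 * stats.t.sf(np.abs(t), n - k - 1)
    return beta, p

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the weakest (largest, non-significant p-value) predictor until
    every remaining predictor is significant."""
    cols = list(range(X.shape[1]))
    while cols:
        _, p = ols_pvalues(X[:, cols], y)
        worst = np.argmax(p[1:])          # skip the intercept's p-value
        if p[1:][worst] <= alpha:
            break                          # stopping condition reached
        cols.pop(worst)
    return [names[c] for c in cols]

# Hypothetical data: y depends on x1 and x3 only; x2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)
kept = backward_eliminate(X, y, ["X1", "X2", "X3"])
print(kept)
```

With strong signals on X1 and X3, those two predictors survive elimination.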
ALTERNATIVE BEST SUBSETS
CRITERION
Cp = (1 − Rk²)(n − T) / (1 − RT²) − (n − 2(k + 1))

where k = number of independent variables in the candidate model, T = total number of parameters (including the intercept) in the full model, Rk² = r² for the candidate model, and RT² = r² for the full model.
Variance Inflation Factor (VIF)
EXAMPLE OF (PROBLEMATIC)
MULTICOLLINEARITY
High VIF
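The VIF for predictor j is 1 / (1 − Rj²), where Rj² comes from regressing Xj on the other predictors. A minimal NumPy sketch with hypothetical data in which x2 nearly duplicates x1, producing the high VIF that signals problematic multicollinearity:

```python
import numpy as np

def vif(X, j):
    """Variance Inflation Factor for column j: 1 / (1 - R_j^2), where R_j^2
    comes from regressing X_j on the remaining predictors (plus an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2)

# Hypothetical predictors: x2 is nearly a copy of x1, so both get a high VIF
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])
print([round(vif(X, j), 1) for j in range(3)])
```

A common rule of thumb flags VIF values above 5 (or 10) as evidence of multicollinearity; the independent x3 stays near 1.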