Module 3 - MultipleLinearRegression - Afterclass1b
The wine data (1952 - 1978)
Multiple Regression Model
The population model of y with k independent variables (IV) is:
y = β0 + β1x1 + β2x2 + ⋯ + βkxk + ε
§ Regression Function
§ E[Y|x] = β0 + β1x1 + β2x2 + ⋯ + βkxk is the mean of Y given x1, x2, …, xk
§ β0 is the y-intercept
§ βj is the slope for xj for j = 1, 2, …, k
§ Random errors ε (not required)
– Each observation has a random error term
– The random errors are a random sample from N(0, σ²), i.e., independent and identically distributed random variables
– The random errors are uncorrelated with the IVs
– The output does not show these, but it does estimate sₑ
– These assumptions are critical for effective business analytics
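To make the model concrete, here is a minimal R sketch (R being the tool used later in these slides) that simulates data from a known population model and estimates it with lm(); all names and coefficient values are illustrative, not from the wine data.

```r
set.seed(1)                          # reproducible simulation
n   <- 100
x1  <- runif(n)                      # two independent variables
x2  <- runif(n)
eps <- rnorm(n, mean = 0, sd = 0.5)  # i.i.d. N(0, sigma^2) random errors
y   <- 1 + 2 * x1 - 3 * x2 + eps     # population model: beta0 = 1, beta1 = 2, beta2 = -3

fit <- lm(y ~ x1 + x2)               # estimate the regression function
summary(fit)                         # coefficient estimates, std. errors, and the estimate of sigma
```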
Multiple Regression: Visually with 2 predictors
[Figure: with two predictors, the regression function is a plane in 3 dimensions; a sample observation yi lies off the plane, and the residual ei = yi − ŷi is the vertical distance from the observation to the plane.]
Estimate a linear model: Two Variables
H0: Population Coefficient = 0 versus HA: Population Coefficient ≠ 0
Adjusted R-Squared: considers the number of variables.
Estimate a linear model: Two Variables
Estimated intercept and slopes:
• The predicted LogPrice increases by 0.6 for each degree increase of average growing season temperature if Harvest rain is held constant.
• The predicted LogPrice decreases by 0.00457 for each additional millimetre of Harvest rain if average growing season temperature is held constant.
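A minimal sketch of estimating this two-variable model in R; the data frame name `wine` and the column names `LogPrice`, `AGST`, and `HarvestRain` are assumptions based on the variable names in these slides.

```r
# Assumed data frame `wine` with columns LogPrice, AGST, HarvestRain
fit2 <- lm(LogPrice ~ AGST + HarvestRain, data = wine)
coef(fit2)  # per the slides: AGST slope ~ 0.6, HarvestRain slope ~ -0.00457
```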
Estimate a linear model: Two Variables
AGST: tstat = ?
• Two-tail p-value = ?
• Conclusion: ?
• df = ?
Harvest rain: tstat = ?
• Two-tail p-value = ?
• Conclusion: ?
Estimate a linear model: Two Variables
AGST: tstat = Estimated slope / Std. error = 0.60262 / 0.11128 = 5.415
• df = n − 1 − #IV = 25 − 1 − 2 = 22
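The same t-statistic, degrees of freedom, and two-tail p-value can be recomputed in R from the assumed `fit2` model above.

```r
est   <- coef(summary(fit2))["AGST", "Estimate"]    # estimated slope
se    <- coef(summary(fit2))["AGST", "Std. Error"]  # its standard error
tstat <- est / se                                   # ~ 5.415
df    <- nrow(wine) - 1 - 2                         # n - 1 - #IV = 22
2 * pt(-abs(tstat), df = df)                        # two-tail p-value
```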
Factoids
1. The null hypothesis means the model is no better than using the mean ȳ to predict price for wines, while the alternative hypothesis is that some of the IVs help to predict Y better than ȳ, but it does not specify which ones.
2. If H0 is true, the F-statistic has an F distribution.
3. Reject H0 if p-value < 0.05
4. If you cannot reject the H0, then your model is really bad
5. Some say, “the model is significant if you reject H0,” but it may not be a “good”
model.
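In R, summary() reports this F-statistic; a sketch of extracting it and its p-value from the assumed `fit2` model:

```r
fs <- summary(fit2)$fstatistic  # named vector: F value, numerator df, denominator df
pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)  # p-value of the overall F-test
```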
Estimate a linear model (All Variables)
SSE = 1.73; Adjusted R-squared (considers the number of variables)
R-squared
Variables R²
AGST 0.44
AGST, Harvest rain 0.71
AGST, Harvest rain, Age 0.79
AGST, Harvest rain, Age, Winter rain 0.83
AGST, Harvest rain, Age, Winter rain, Population 0.83
Problem with in-sample R-squared
R² = 1 − SSE/SST
Adding variables drives SSE → 0 and R² → 1: the model is overfit.
Solve the problem:
§ Adjusted R-squared (considers the number of variables)
§ Out-of-sample testing (see the sketch below)
• Training set: Build the model based on a subsample (e.g., 70%) of the data
• Test set: Use the remainder (e.g., 30%) of the dataset to assess the model
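A sketch of such a split in R, again assuming the `wine` data frame; the 70/30 proportions follow the slide.

```r
set.seed(42)                                                     # reproducible split
train_idx <- sample(nrow(wine), size = floor(0.7 * nrow(wine)))
train <- wine[train_idx, ]                                       # ~70%: build the model here
test  <- wine[-train_idx, ]                                      # ~30%: assess the model here
fit_train <- lm(LogPrice ~ AGST + HarvestRain, data = train)
```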
Select variables
• Overfitting: high R² on the data used to build the model, but bad performance on unseen data
Refine the model
We need to remove insignificant independent variables one at a time.
§ Multicollinearity: correlation between independent variables
§ If there is high correlation, some significant variables can seem insignificant because they essentially represent the same thing.
§ If one variable is removed, the other could be significant.
Correlation (not required)
Corr(X, Y) = E[(X − E[X])(Y − E[Y])] / √(Var(X) · Var(Y))
A measure of the linear relationship (co-movement) between variables
§ +1 = perfect positive linear relationship
§ 0 = no linear relationship
§ −1 = perfect negative linear relationship
Example of Correlation
Correlation between Harvest rain and Average growing season temperature = −0.0645
Example of Correlation
Correlation between Age of wine (years) and Population of France (in
thousands) = -0.9945
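Both correlations can be reproduced with R's cor(), assuming the column names `HarvestRain`, `AGST`, `Age`, and `FrancePop` used elsewhere in these slides.

```r
cor(wine$HarvestRain, wine$AGST)  # ~ -0.0645 per the slide: essentially uncorrelated
cor(wine$Age, wine$FrancePop)     # ~ -0.9945: nearly perfectly (negatively) collinear
```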
Multicollinearity: Correlations among IV
§ In a perfect world, we would like the IVs to be uncorrelated
– Results in the most efficient use of information
– Adding or subtracting a variable to the model would not change the estimated
coefficients of the other variables
§ In this world, IVs are often correlated or “collinear”
§ If the correlations among IVs are too large, then estimated coefficients cannot be trusted due to multicollinearity
§ Variance inflation factors (VIF) check for multicollinearity
– VIF < 10: no problem
– VIF > 10: problem
– Drop the variable with a large VIF or combine variables together
VIF values for the all-variables model:
WinterRain AGST HarvestRain Age FrancePop
1.298801 1.274536 1.116584 97.219725 98.252693
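VIF values like those above can be computed with the vif() function from the car package (one common tool; the slides do not say which produced this output).

```r
# install.packages("car")  # if not already installed
library(car)
fit_all <- lm(LogPrice ~ WinterRain + AGST + HarvestRain + Age + FrancePop, data = wine)
vif(fit_all)  # Age and FrancePop should show VIF >> 10, flagging multicollinearity
```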
Estimate a linear model (re-run the model leaving out FrancePop)
Since removing highly correlated variables may improve the predictive ability of the model, we need a criterion to select among models.
§ Our wine model with all variables has in-sample R² = 0.83.
§ This tells us the accuracy on the training data that we used to build the model
§ But how well does the model perform on the new (test) data?
– Bordeaux wine buyers profit from being able to predict the quality of wine
before it matures.
Make predictions
Out-of-sample R²
Comparison between in-sample and out-of-sample R²
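A sketch of computing out-of-sample R² in R with the assumed `train`/`test` split and `fit_train` from earlier; note that SST on the test set uses the training mean as the baseline predictor.

```r
pred <- predict(fit_train, newdata = test)             # predictions on unseen data
sse  <- sum((test$LogPrice - pred)^2)                  # out-of-sample SSE
sst  <- sum((test$LogPrice - mean(train$LogPrice))^2)  # baseline: predict the training mean
1 - sse / sst                                          # out-of-sample R^2 (can be negative)
```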
Steps for Building Good Regression Models
1. Check F-test. H0: all slopes = 0 vs HA: some slopes ≠ 0
– Stop if p-value > 0.05: you cannot reject H0
– H0 means the IVs are not useful in predicting the DV
– Continue if p-value < 0.05
2. Check T-Tests for each IV. H0: slope = 0 vs HA: slope ≠ 0
– If some IVs have p-value > 0.05, remove the IV with the maximum p-value and refit the model
– Continue if p-value < 0.05 for all IVs in the model
3. Use VIF to check for the multicollinearity problem
4. Pick the model with largest Adjusted R-Squared if multiple models
satisfy steps 1-3
5. Can you explain your model?
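A hedged sketch of this workflow in R, using the assumed wine variables from earlier; each line maps to a step above.

```r
fit <- lm(LogPrice ~ AGST + HarvestRain + Age + WinterRain + FrancePop, data = wine)
summary(fit)                           # Steps 1-2: overall F-test and per-IV t-tests
fit <- update(fit, . ~ . - FrancePop)  # Step 2: drop the IV with the largest p-value, refit
car::vif(fit)                          # Step 3: check for multicollinearity
summary(fit)$adj.r.squared             # Step 4: compare adjusted R-squared across models
```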
R output
§ Regress Price (DV)
– Model: Linear Regression
– DV: LogPrice, IV: AGST, Harvest Rain, Age; Number of observations: 25
Exercises
[R output for the model with AGST, Harvest rain, and Age]
Test H0: slope Age = 0 versus HA: slope Age ≠ 0 at a significance level of 0.001
tstat = 2.875
Degrees of freedom: 25 − 1 − 3 = 21
Since |tstat| = 2.875 < 3.819 (the critical value for a two-tail test at α = 0.001 with 21 df), P-value = 2P(T ≤ −2.875) > 2P(T ≥ 3.819) = 0.001
So the P-value for the two-tail test > 0.001
Do not reject H0
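Both the critical value and the p-value in this exercise can be checked in R:

```r
qt(1 - 0.001 / 2, df = 21)    # two-tail critical value at alpha = 0.001: ~ 3.819
2 * pt(-abs(2.875), df = 21)  # two-tail p-value for tstat = 2.875: ~ 0.0091 > 0.001
```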
Exercises
Introduction to Statistics
The service time of patients in an urgent care clinic is known to be
exponentially distributed with a mean service time of 17 minutes per patient.
The standard deviation of the service time is also 17 minutes. Healthcare
regulations require an average service time of no more than 20 minutes per
patient during the auditor’s visit, or else the clinic will be fined. If, during the
auditor’s visit to the clinic, a random sample of 100 patient service times is
observed, what is the probability the clinic will NOT get fined?
Following the central limit theorem, X̄ ∼ N(17, (17/√100)²) = N(17, 1.7²).
P(X̄ ≤ 20) = NORM.DIST(20, 17, 1.7, TRUE) = P(Z ≤ 1.76) ≈ 0.96.
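The same probability in R, where pnorm() plays the role of NORM.DIST:

```r
n  <- 100; mu <- 17; sigma <- 17
se <- sigma / sqrt(n)          # standard error of the sample mean: 1.7
pnorm(20, mean = mu, sd = se)  # P(sample mean <= 20) ~ 0.9612
```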