
IIMT 2641 Introduction to Business Analytics

Module 3: Linear Regression


Topic 2: Multiple Linear Regression

1
The wine data (1952 - 1978)

2
Multiple Regression Model
The population model of y with k independent variables (IV) is:

! = #! + #" %" + ## %# + ⋯ + #$ %$ + ε

§ y is the dependent variable (DV)


§ x1, x2, …, xk are independent variables (IV)

§ Regression Function
§ E[Y|x] = β0 + β1x1 + β2x2 + ⋯ + βkxk is the mean of Y given x1, x2, …, xk
§ β0 is the y-intercept
§ βj is the slope for xj, for j = 1, 2, …, k
§ Random errors ε (not required)
– Random errors are a random sample from N(0, σ²), i.e., independent and identically distributed random variables
– Each observation has a random error term
– The output does not show the errors, but it does estimate their standard deviation σ (the residual standard error, se)
– The random errors are uncorrelated with the IVs
– These assumptions are critical for effective business analytics
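In practice the coefficients are estimated by least squares. A minimal sketch in R, assuming a data frame named wine with the column names used later in this module (LogPrice, AGST, HarvestRain):

# Fit a multiple regression by least squares.
# `wine`, LogPrice, AGST and HarvestRain are assumptions taken from later slides.
model <- lm(LogPrice ~ AGST + HarvestRain, data = wine)
summary(model)   # coefficient estimates, standard errors, t-tests, R-squared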

3
Multiple Regression: Visually with 2 predictors

[3D scatter figure: the estimated regression function ŷi = b0 + b1x1,i + b2x2,i is a plane in three dimensions; each sample observation (x1i, x2i, yi) has residual ei = yi − ŷi.]

4
Estimate a linear model: Two Variables

[Annotated R output:]
• Estimated intercept and slope coefficients, with standard errors for each estimate
• For each coefficient: H0: Population Coefficient = 0 versus HA: Population Coefficient ≠ 0
• tstat = (Estimated Coefficient − 0)/(Standard Error)
• Two-tail test: p-value = 2*P(T < −|tstat|)
• Coefficient of determination: R-squared
• Adjusted R-squared: considers the number of variables
• F-test
• SSE = 2.97 < SSE1 = 5.73 (the SSE of the one-variable model)
5
Estimate a linear model:
Two Variables
[R output: estimated intercept and slope coefficients]

6
Estimate a linear model:
Two Variables
[R output: estimated intercept and slope coefficients]
• The predicted LogPrice increases by 0.6 for each one-degree increase in average growing season temperature, if harvest rain is held constant.
• The predicted LogPrice decreases by 0.00457 for each additional millimetre of harvest rain, if average growing season temperature is held constant.

7
Estimate a linear model:
Two Variables
AGST: tstat = ?

• Two-tail p-value = ?
• Conclusion: ?
• df = ?

HarvestRain: tstat = ?

• Two-tail p-value = ?
• Conclusion: ?
8
Estimate a linear model:
Two Variables
AGST: tstat = (Estimated Coefficient − 0)/(Standard Error) = (0.60262 − 0)/0.11128 = 5.415

• Two-tail p-value = 2*P(T < −|tstat|) = 2*t.dist(−5.415, 22, 1) = 1.94e−5
• Conclusion: Reject H0 at the 0.001 level of significance.
• df = n − 1 − #IV = 25 − 1 − 2 = 22

HarvestRain: tstat = (Estimated Coefficient − 0)/(Standard Error) = (−0.00457 − 0)/0.00101 = −4.525

• Two-tail p-value = 2*P(T < −|tstat|) = 2*t.dist(−4.525, 22, 1) = 0.000167
• Conclusion: Reject H0 at the 0.001 level of significance.
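The same numbers can be reproduced in R with pt(), the t-distribution CDF (the equivalent of the slide's t.dist call); the coefficient and standard error are taken from the slide above.

tstat <- (0.60262 - 0) / 0.11128   # estimated coefficient over its standard error
tstat                              # about 5.415
2 * pt(-abs(tstat), df = 22)       # two-tail p-value, about 1.9e-05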
9
Estimate a linear model:
Two Variables
H0: all of the slopes are zero versus HA: one or more of the slopes are non-zero
H0: b1 = … = bk = 0 versus HA: bj ≠ 0 for one or more j

Factoids
1. The null hypothesis means the model is no better than using the mean ȳ to predict price for wines; the alternative hypothesis is that some of the IVs help to predict Y better than ȳ, but it does not specify which ones.
2. If H0 is true, the F-statistic has an F distribution.
3. Reject H0 if the p-value < 0.05 (see the R sketch below).
4. If you cannot reject H0, then your model is really bad.
5. Some say "the model is significant" if you reject H0, but it may not be a "good" model.
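As a sketch, the overall F-test can be read off summary() in R, or its p-value recomputed from the stored F-statistic (model as fitted in the earlier sketch):

fs <- summary(model)$fstatistic              # F value, numerator df, denominator df
pf(fs[1], fs[2], fs[3], lower.tail = FALSE)  # p-value of the overall F-test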

10
Estimate a linear model (All Variables)

[R regression output for the model with all five variables]
SSE = 1.73; R-squared; Adjusted R-squared (considers the number of variables)

11

Variables                                               R²
AGST 0.44
AGST, Harvest rain 0.71
AGST, Harvest rain, Age 0.79
AGST, Harvest rain, Age, Winter rain 0.83
AGST, Harvest rain, Age, Winter rain, Population 0.83

• Adding more variables can improve the model.
• The in-sample R² never declines.
• Diminishing returns as more variables are added.

12
Problem with in-sample R-squared

• A major problem with R² is that adding another independent variable to an equation can never decrease R².

R² = 1 − SSE/SST

• Adding a variable will not change SST.
• Adding a variable will, in most cases, decrease SSE and increase R².
• Even if the added variable is nonsensical, R² will increase unless the new coefficient is exactly zero (see the sketch below).
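A quick way to see this in R: add a column of pure noise and compare the two R² values (a sketch, reusing the hypothetical wine data frame from earlier).

set.seed(1)
wine$Noise <- rnorm(nrow(wine))   # a nonsensical predictor: pure random noise
summary(lm(LogPrice ~ AGST + HarvestRain, data = wine))$r.squared
summary(lm(LogPrice ~ AGST + HarvestRain + Noise, data = wine))$r.squared
# the second R-squared is never smaller than the first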

13
Overfit the model
(As SSE → 0, R² → 1.)

A model can be "overly complicated" if we excessively add independent variables to fit the sample data.

• Example: If a univariate regression model has only 2 data points, you will always get an R² of 1.
• An overfit model fails to fit additional data or to accurately predict future observations.


Solve the problem:
• Adjusted R-squared (considers the number of variables)
• Out-of-sample testing (see the sketch below):
  – Training set: build the model on a subsample (e.g., 70%) of the data
  – Test set: use the remainder (e.g., 30%) of the dataset to assess the model
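A minimal train/test sketch in R; the 70/30 split, the column names, and the wine data frame are assumptions, not part of the original slides.

set.seed(1)
n         <- nrow(wine)
train_idx <- sample(n, round(0.7 * n))        # 70% of rows for training
train     <- wine[train_idx, ]
test      <- wine[-train_idx, ]

fit  <- lm(LogPrice ~ AGST + HarvestRain, data = train)   # build on training set
pred <- predict(fit, newdata = test)                      # assess on test set

SSE <- sum((test$LogPrice - pred)^2)
SST <- sum((test$LogPrice - mean(train$LogPrice))^2)      # training-set mean
1 - SSE / SST                                             # out-of-sample R-squared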

14
Select variables

Not all available variables should always be used.

• Adding more variables requires more data to diminish the influence of noise.
• Overfitting: high R² on the data used to build the model, but bad performance on unseen data.

How do we choose which variables to use?

15
Refine the model
We need to remove insignificant independent variables one at a time.

§ Multicollinearity: correlation between independent variables
§ If there is high correlation, some significant variables may seem insignificant because they essentially represent the same thing.
§ If one variable is removed, the other could become significant.

16
Correlation (not required)

Corr(X, Y) = E[(X − E[X])(Y − E[Y])] / √(Var(X)·Var(Y))

A measure of the linear relationship (co-movement) between variables:
§ +1 = perfect positive linear relationship
§ 0 = no linear relationship
§ −1 = perfect negative linear relationship
17
Example of Correlation
Correlation between Harvest rain and Average growing season temperature = −0.0645

18
Example of Correlation
Correlation between Age of wine (years) and Population of France (in
thousands) = -0.9945
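Both correlations on these two slides come straight from cor() in R (column names as assumed in the earlier sketches):

cor(wine$HarvestRain, wine$AGST)   # about -0.06: essentially uncorrelated
cor(wine$Age, wine$FrancePop)      # about -0.99: almost perfectly collinear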

19
Multicollinearity: Correlations among IVs
§ In a perfect world, we would like the IVs to be uncorrelated
  – Results in the most efficient use of information
  – Adding or subtracting a variable would not change the estimated coefficients of the other variables
§ In this world, IVs are often correlated, or "collinear"
§ If the correlations among IVs are too large, then the estimated coefficients cannot be trusted due to multicollinearity
§ The variance inflation factor (VIF) checks for multicollinearity (see the sketch below)
  – VIF < 10: no problem
  – VIF > 10: problem
  – Drop the variable with a large VIF, or combine variables together
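In R, VIFs are commonly computed with vif() from the car package; a sketch for the full wine model (data frame and column names assumed as before):

# install.packages("car")   # once, if the package is not installed
library(car)
full <- lm(LogPrice ~ WinterRain + AGST + HarvestRain + Age + FrancePop,
           data = wine)
vif(full)   # values above 10 flag a multicollinearity problem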

20
VIF (model with all variables):

WinterRain     AGST           HarvestRain    Age            FrancePop
1.298801       1.274536       1.116584       97.219725      98.252693

21

Estimate a linear model (re-run the model by leaving out FrancePop)

VIF:

WinterRain     AGST           HarvestRain    Age
1.241993       1.225811       1.113615       1.069950
22

What has changed?


§ All of our independent variables are significant!
§ By removing an independent variable, all of our coefficient estimates
adjusted slightly.
§ R² decreases slightly from 0.8294 to 0.8286, while adjusted R² increases from 0.7845 to 0.7943.
  – If we had removed Age and FrancePop at the same time (they were both insignificant in the original model), R² would decrease to 0.7537.
Predictive ability

Since removing highly correlated variables may improve the predictive ability of the model, we need a criterion to select among models.
§ Our wine model with all variables has in-sample R² = 0.83.
§ This tells us the accuracy on the training data that we used to build the model.
§ But how well does the model perform on the new (test) data?
– Bordeaux wine buyers profit from being able to predict the quality of wine
before it matures.

23
Make predictions

Predict the unknown value of the dependent variable:

§ Plug the new values of the independent variables into the estimated regression equation.

Prediction model: informed guesses about some unobserved property based on observed properties.

24
Out-of-sample R²

Variables                                               In-sample R²   Out-of-sample R²
AGST                                                    0.44           0.79
AGST, Harvest rain                                      0.71           −0.08
AGST, Harvest rain, Age                                 0.79           0.53
AGST, Harvest rain, Age, Winter rain                    0.83           0.79
AGST, Harvest rain, Age, Winter rain, Population        0.83           0.76

25
Comparison between in-sample and out-of-sample R²

• Better in-sample R² does not imply better out-of-sample R².
• Need more data to be conclusive.
• Out-of-sample R² can be negative. This happens when we use the training-set mean to calculate SST (see the toy example below).
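A toy illustration with made-up numbers: if the test-set SSE exceeds the SST computed around the training-set mean, the out-of-sample R² is negative.

SSE <- 5.4      # hypothetical test-set sum of squared errors
SST <- 5.0      # hypothetical SST around the training-set mean
1 - SSE / SST   # -0.08: worse than just predicting the training mean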

26
Steps for Building Good Regression Models
1. Check the F-test. H0: all slopes = 0 vs HA: some slopes ≠ 0
   – Stop if the p-value > 0.05: you cannot reject H0
     (H0 means the IVs are not useful in predicting the DV)
   – Continue if the p-value < 0.05
2. Check the t-tests for each IV. H0: slope = 0 vs HA: slope ≠ 0
   – If some IVs have p-value > 0.05, remove the IV with the maximum p-value and refit the model
   – Continue when the p-value < 0.05 for all IVs in the model
3. Check the VIFs for multicollinearity problems
4. Pick the model with the largest adjusted R-squared if multiple models satisfy steps 1–3
5. Can you explain your model? (A rough R walkthrough of steps 1–4 is sketched below.)
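A rough end-to-end sketch of steps 1–4 in R; function names are from base R and the car package, and the wine data frame is the assumption used throughout.

fit <- lm(LogPrice ~ WinterRain + AGST + HarvestRain + Age + FrancePop,
          data = wine)
summary(fit)                 # step 1: overall F-test; step 2: per-IV t-tests
                             # (drop the IV with the largest p-value and refit)
car::vif(fit)                # step 3: check VIFs for multicollinearity
summary(fit)$adj.r.squared   # step 4: compare candidate models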

27
R output
§ Regress Price (DV)
  – Model: Linear Regression
  – DV: LogPrice; IV: AGST, HarvestRain, Age; Number of observations: 25

[R regression output]

28
Exercises

[R regression output]

§ What is the estimated model?


LogPrice = -1.478 + 0.532*AGST – 0.005*HarvestRain + 0.025*Age
§ What proportion of the variance in price is explained by the regression model?
  R-squared = 0.79

29
Exercises

[R regression output]

§ What is the standard error of the coefficient of AGST?
  0.0995
§ What is the test result for testing H0 that the slopes for AGST, HarvestRain, and Age are 0 versus HA that one or more of the slopes are non-zero?
  F-test: reject H0 because the p-value < 0.05.

30
Exercises

[R regression output]

Test H0: slope of Age = 0 versus HA: slope of Age ≠ 0 at a significance level of 0.001
tstat = 2.875
Degrees of freedom: 25 − 1 − 3 = 21
p-value = 2*P(T < −|tstat|) = 2*P(T > 2.875) > 2*P(T > 3.819) = 0.001
The two-tail p-value > 0.001, so: do not reject H0.

31
Exercises

[R regression output]

§ What is the prediction of LogPrice if Age is 35, HarvestRain is 40, and AGST is 15.6?
LogPrice = -1.478 + 0.532*15.6 – 0.005*40 + 0.025*35 = 7.496
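The same prediction via predict() in R, refitting the three-variable model first (a sketch with the assumed wine data frame):

fit3 <- lm(LogPrice ~ AGST + HarvestRain + Age, data = wine)
predict(fit3, newdata = data.frame(AGST = 15.6, HarvestRain = 40, Age = 35))
# close to 7.496 (the hand calculation above uses rounded coefficients)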

32
Introduction to Statistics
The service time of patients in an urgent care clinic is known to be
exponentially distributed with a mean service time of 17 minutes per patient.
The standard deviation of the service time is also 17 minutes. Healthcare
regulations require an average service time of no more than 20 minutes per
patient during the auditor’s visit, or else the clinic will be fined. If during the
auditor’s visit to the clinic, a random sample of 100 patient service times is
observed, what is the probability the clinic will NOT get fined?

33
Introduction to Statistics
The service time of patients in an urgent care clinic is known to be
exponentially distributed with a mean service time of 17 minutes per patient.
The standard deviation of the service time is also 17 minutes. Healthcare
regulations require an average service time of no more than 20 minutes per
patient during the auditor’s visit, or else the clinic will be fined. If during the
auditor’s visit to the clinic, a random sample of 100 patient service times is
observed, what is the probability the clinic will NOT get fined?

Following the central limit theorem, X̄ ∼ N(17, (17/√100)²) = N(17, 1.7²).

P(X̄ < 20) = NORM.DIST(20, 17, 1.7, TRUE) = P(Z ≤ 1.76) = 0.96.
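The same number in R, where pnorm() plays the role of Excel's NORM.DIST:

pnorm(20, mean = 17, sd = 17 / sqrt(100))   # about 0.96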

34
