Module 3 - MultipleLinearRegression - Afterclass1b
The wine data (1952 - 1978)
Multiple Regression Model
The population model of y with k independent variables (IV) is:
y = β0 + β1x1 + β2x2 + ⋯ + βkxk + ε
§ Regression Function
§ E[Y|x] = β0 + β1x1 + β2x2 + ⋯ + βkxk is the mean of Y given x1, x2, …, xk
§ β0 is the y-intercept
§ βj is the slope for xj for j = 1, 2, …, k
§ Random errors ε (not required)
– Each observation has a random error term
– The random errors are a random sample from N(0, σ²), i.e., independent and identically distributed random variables
– The random errors are uncorrelated with the IVs
– The output does not show these, but it does estimate sₑ
– These assumptions are critical for effective business analytics
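To make the model concrete, here is a minimal R sketch (R being the tool used later in these slides) that simulates data from a known population model and estimates it with lm(); all names and coefficient values are illustrative, not from the wine data.

```r
set.seed(1)                          # reproducible simulation
n   <- 100
x1  <- runif(n)                      # two independent variables
x2  <- runif(n)
eps <- rnorm(n, mean = 0, sd = 0.5)  # i.i.d. N(0, sigma^2) random errors
y   <- 1 + 2 * x1 - 3 * x2 + eps     # population model: beta0 = 1, beta1 = 2, beta2 = -3

fit <- lm(y ~ x1 + x2)               # estimate the regression function
summary(fit)                         # coefficient estimates, std. errors, and the estimate of sigma
```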
Multiple Regression: Visually with 2 predictors
[Figure: with two predictors, the regression function is a plane in 3 dimensions; a sample observation yi lies off the plane, and the residual ei = yi − ŷi is the vertical distance from the observation to the plane.]
Estimate a linear model: Two Variables
H0: Population Coefficient = 0 versus HA: Population Coefficient ≠ 0
Adjusted R-Squared: considers the number of variables.
Estimate a linear model: Two Variables
Estimated intercept and slopes:
• The predicted LogPrice increases by 0.6 for each degree increase of average growing season temperature if Harvest rain is held constant.
• The predicted LogPrice decreases by 0.00457 for each additional millimetre of Harvest rain if average growing season temperature is held constant.
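A minimal sketch of estimating this two-variable model in R; the data frame name `wine` and the column names `LogPrice`, `AGST`, and `HarvestRain` are assumptions based on the variable names in these slides.

```r
# Assumed data frame `wine` with columns LogPrice, AGST, HarvestRain
fit2 <- lm(LogPrice ~ AGST + HarvestRain, data = wine)
coef(fit2)  # per the slides: AGST slope ~ 0.6, HarvestRain slope ~ -0.00457
```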
Estimate a linear model: Two Variables
AGST: tstat = ?
• Two-tail p-value = ?
• Conclusion: ?
• df = ?
Harvest rain: tstat = ?
• Two-tail p-value = ?
• Conclusion: ?
Estimate a linear model: Two Variables
AGST: tstat = Estimated slope / Std. error = 0.60262 / 0.11128 = 5.415
• df = n − 1 − #IV = 25 − 1 − 2 = 22
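The same t-statistic, degrees of freedom, and two-tail p-value can be recomputed in R from the assumed `fit2` model above.

```r
est   <- coef(summary(fit2))["AGST", "Estimate"]    # estimated slope
se    <- coef(summary(fit2))["AGST", "Std. Error"]  # its standard error
tstat <- est / se                                   # ~ 5.415
df    <- nrow(wine) - 1 - 2                         # n - 1 - #IV = 22
2 * pt(-abs(tstat), df = df)                        # two-tail p-value
```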
Factoids
1. The null hypothesis means the model is no better than using the mean ȳ to predict price for wines, while the alternative hypothesis is that some of the IVs help to predict Y better than ȳ, but it does not specify which ones.
2. If H0 is true, the F-statistic has an F distribution.
3. Reject H0 if p-value < 0.05
4. If you cannot reject the H0, then your model is really bad
5. Some say, “the model is significant if you reject H0,” but it may not be a “good”
model.
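In R, summary() reports this F-statistic; a sketch of extracting it and its p-value from the assumed `fit2` model:

```r
fs <- summary(fit2)$fstatistic  # named vector: F value, numerator df, denominator df
pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)  # p-value of the overall F-test
```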
Estimate a linear model (All Variables)
SSE = 1.73; Adjusted R-squared (considers the number of variables)
R-squared
Variables R²
AGST 0.44
AGST, Harvest rain 0.71
AGST, Harvest rain, Age 0.79
AGST, Harvest rain, Age, Winter rain 0.83
AGST, Harvest rain, Age, Winter rain, Population 0.83
Problem with in-sample R-squared
R² = 1 − SSE/SST
Adding variables drives SSE → 0 and R² → 1: the model is overfit.
Solve the problem:
§ Adjusted R-squared (considers the number of variables)
§ Out-of-sample testing (see the sketch below)
• Training set: Build the model based on a subsample (e.g., 70%) of the data
• Test set: Use the remainder (e.g., 30%) of the dataset to assess the model
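A sketch of such a split in R, again assuming the `wine` data frame; the 70/30 proportions follow the slide.

```r
set.seed(42)                                                     # reproducible split
train_idx <- sample(nrow(wine), size = floor(0.7 * nrow(wine)))
train <- wine[train_idx, ]                                       # ~70%: build the model here
test  <- wine[-train_idx, ]                                      # ~30%: assess the model here
fit_train <- lm(LogPrice ~ AGST + HarvestRain, data = train)
```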
Select variables
• Overfitting: high R² on the data used to build the model, but bad performance on unseen data
Refine the model
We need to remove insignificant independent variables one at a time.
§ Multicollinearity: correlation between independent variables
§ If there is high correlation, some significant variables can seem insignificant because they essentially represent the same thing.
§ If one variable is removed, the other could be significant.
Correlation (not required)
Corr(X, Y) = E[(X − E[X])(Y − E[Y])] / √(Var(X) · Var(Y))
A measure of the linear relationship (co-movement) between variables
§ +1 = perfect positive linear relationship
§ 0 = no linear relationship
§ −1 = perfect negative linear relationship
Example of Correlation
Correlation between Harvest rain and Average growing season temperature = −0.0645
Example of Correlation
Correlation between Age of wine (years) and Population of France (in
thousands) = -0.9945
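Both correlations can be reproduced with R's cor(), assuming the column names `HarvestRain`, `AGST`, `Age`, and `FrancePop` used elsewhere in these slides.

```r
cor(wine$HarvestRain, wine$AGST)  # ~ -0.0645 per the slide: essentially uncorrelated
cor(wine$Age, wine$FrancePop)     # ~ -0.9945: nearly perfectly (negatively) collinear
```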
Multicollinearity: Correlations among IV
§ In a perfect world, we would like the IVs to be uncorrelated
– Results in the most efficient use of information
– Adding or subtracting a variable to the model would not change the estimated
coefficients of the other variables
§ In this world, IVs are often correlated or “collinear”
§ If the correlations among IVs are too large, then estimated coefficients cannot be trusted due to multicollinearity
§ Variance inflation factors (VIF) check for multicollinearity
– VIF < 10: no problem
– VIF > 10: problem
– Drop the variable with a large VIF or combine variables together
VIF values for the all-variables model:
WinterRain AGST HarvestRain Age FrancePop
1.298801 1.274536 1.116584 97.219725 98.252693
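VIF values like those above can be computed with the vif() function from the car package (one common tool; the slides do not say which produced this output).

```r
# install.packages("car")  # if not already installed
library(car)
fit_all <- lm(LogPrice ~ WinterRain + AGST + HarvestRain + Age + FrancePop, data = wine)
vif(fit_all)  # Age and FrancePop should show VIF >> 10, flagging multicollinearity
```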
Estimate a linear model (re-run the model leaving out FrancePop)
Since removing highly correlated variables may improve the predictive ability of the model, we need a criterion to select among models.
§ Our wine model with all variables has in-sample R² = 0.83.
§ This tells us the accuracy on the training data that we used to build the model
§ But how well does the model perform on the new (test) data?
– Bordeaux wine buyers profit from being able to predict the quality of wine
before it matures.
Make predictions
Out-of-sample R²
Comparison between in-sample and out-of-sample R²
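A sketch of computing out-of-sample R² in R with the assumed `train`/`test` split and `fit_train` from earlier; note that SST on the test set uses the training mean as the baseline predictor.

```r
pred <- predict(fit_train, newdata = test)             # predictions on unseen data
sse  <- sum((test$LogPrice - pred)^2)                  # out-of-sample SSE
sst  <- sum((test$LogPrice - mean(train$LogPrice))^2)  # baseline: predict the training mean
1 - sse / sst                                          # out-of-sample R^2 (can be negative)
```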
Steps for Building Good Regression Models
1. Check F-test. H0: all slopes = 0 vs HA: some slopes ≠ 0
– Stop if p-value > 0.05: you cannot reject H0
– H0 means the IVs are not useful in predicting the DV
– Continue if p-value < 0.05
2. Check T-Tests for each IV. H0: slope = 0 vs HA: slope ≠ 0
– If some IVs have p-value > 0.05, remove the IV with the maximum p-value and refit the model
– Continue if p-value < 0.05 for all IVs in the model
3. Use VIF to check for the multicollinearity problem
4. Pick the model with largest Adjusted R-Squared if multiple models
satisfy steps 1-3
5. Can you explain your model?
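A hedged sketch of this workflow in R, using the assumed wine variables from earlier; each line maps to a step above.

```r
fit <- lm(LogPrice ~ AGST + HarvestRain + Age + WinterRain + FrancePop, data = wine)
summary(fit)                           # Steps 1-2: overall F-test and per-IV t-tests
fit <- update(fit, . ~ . - FrancePop)  # Step 2: drop the IV with the largest p-value, refit
car::vif(fit)                          # Step 3: check for multicollinearity
summary(fit)$adj.r.squared             # Step 4: compare adjusted R-squared across models
```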
R output
§ Regress Price (DV)
– Model: Linear Regression
– DV: LogPrice, IV: AGST, Harvest Rain, Age; Number of observations: 25
Exercises
[R output for the model with AGST, Harvest rain, and Age]
Test H0: slope Age = 0 versus HA: slope Age ≠ 0 at a significance level of 0.001
tstat = 2.875
Degrees of freedom: 25 − 1 − 3 = 21
Since |tstat| = 2.875 < 3.819 (the critical value for a two-tail test at α = 0.001 with 21 df), P-value = 2P(T ≤ −2.875) > 2P(T ≥ 3.819) = 0.001
So the P-value for the two-tail test > 0.001
Do not reject H0
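Both the critical value and the p-value in this exercise can be checked in R:

```r
qt(1 - 0.001 / 2, df = 21)    # two-tail critical value at alpha = 0.001: ~ 3.819
2 * pt(-abs(2.875), df = 21)  # two-tail p-value for tstat = 2.875: ~ 0.0091 > 0.001
```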
Exercises
Introduction to Statistics
The service time of patients in an urgent care clinic is known to be
exponentially distributed with a mean service time of 17 minutes per patient.
The standard deviation of the service time is also 17 minutes. Healthcare
regulations require an average service time of no more than 20 minutes per
patient during the auditor’s visit, or else the clinic will be fined. If, during the
auditor’s visit to the clinic, a random sample of 100 patient service times is
observed, what is the probability the clinic will NOT get fined?
Following the central limit theorem, X̄ ∼ N(17, (17/√100)²) = N(17, 1.7²).
P(X̄ ≤ 20) = NORM.DIST(20, 17, 1.7, TRUE) = P(Z ≤ 1.76) ≈ 0.96.
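The same probability in R, where pnorm() plays the role of NORM.DIST:

```r
n  <- 100; mu <- 17; sigma <- 17
se <- sigma / sqrt(n)          # standard error of the sample mean: 1.7
pnorm(20, mean = mu, sd = se)  # P(sample mean <= 20) ~ 0.9612
```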