AS lecture 07 ( Multiple Linear Regression)
REGRESSION
Lecture # 08
Two Main Objectives
• Establish whether there is a relationship between two variables.
  • More specifically, establish whether there is a statistically significant relationship between two variables.
  • Examples: income and spending, wage and gender, student height and exam scores.
• Forecast new observations.
  • Can we use what we know about the relationship to forecast unobserved values?
  • Examples: What will our sales be over the next quarter? What will be the effect of advertising on sales?
How to Perform a Multiple Linear Regression
Multiple Regression Formula
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n + e$$
• $y$ is the value of the dependent variable (the quantity being predicted).
• $\beta_0$ is the y-intercept (the value of $y$ when all independent variables are set to 0).
• $\beta_1 x_1$ is the regression coefficient (slope) $\beta_1$ of the first independent variable, multiplied by that variable $x_1$.
• $\cdots$ means the same is done for however many independent variables you are testing.
• $\beta_n x_n$ is the regression coefficient of the last independent variable, multiplied by that variable.
• $e$ is the model error: how much variation in $y$ is left unexplained by the independent variables.
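The formula can be read as a small function: the prediction is the intercept plus each coefficient times its variable, with the error term left out. The sketch below is only an illustration; the function name `predict` and the coefficient values are made up for the example and do not come from the lecture data.

```python
def predict(beta0, betas, xs):
    """Return beta0 + beta_1*x_1 + ... + beta_n*x_n (the model without the error term)."""
    if len(betas) != len(xs):
        raise ValueError("need exactly one coefficient per independent variable")
    return beta0 + sum(b * x for b, x in zip(betas, xs))


# Illustrative (made-up) coefficients for a model with two independent variables.
y_hat = predict(2.0, [0.5, 1.5], [4.0, 2.0])
print(y_hat)  # 2.0 + 0.5*4.0 + 1.5*2.0 = 7.0
```

The error for one observation would then be the observed value minus this prediction.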
Regression Analysis: Example
• Suppose that a manager wants to determine how the firm’s advertising expenditures and quality control relate to its sales revenue. The manager wants to test the hypothesis that higher advertising expenditures and quality control lead to higher sales for the firm and, furthermore, she wants to estimate the strength of the relationship (i.e., how much sales increase for each dollar increase in advertising expenditures).
• The manager collects data on advertising expenditures, quality control, and sales revenue for the firm over the past 10 years.
Advertising and Quality Expenditures and Sales Revenues of the Firm in Each of 10 Years

Year   Sales (y)   Advertising Expense (x1)   Quality Control (x2)
1      44          10                         3
2      40          9                          4
3      42          11                         3
4      46          12                         3
5      48          11                         4
6      52          12                         5
7      54          13                         6
8      58          13                         7
9      56          14                         7
10     60          15                         8

[Figure: Scatter Diagram]
Regression Analysis
Regression Line (Line of Best Fit): Draw, by visual inspection, the positively sloped straight line that “best” fits between the data points (so that the data points are about equally distant on either side of the line).
Multiple Regression Formula
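$$\hat{y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$
• $\beta_0 = \bar{y} - \beta_1 \bar{X}_1 - \beta_2 \bar{X}_2$
• $\beta_1 = \dfrac{\sum x_2^2 \sum x_1 y - (\sum x_1 x_2)(\sum x_2 y)}{\sum x_1^2 \sum x_2^2 - (\sum x_1 x_2)^2}$
• $\beta_2 = \dfrac{\sum x_1^2 \sum x_2 y - (\sum x_1 x_2)(\sum x_1 y)}{\sum x_1^2 \sum x_2^2 - (\sum x_1 x_2)^2}$
• $\bar{y} = \dfrac{\sum y}{n}$
• $\bar{X}_1 = \dfrac{\sum X_1}{n}$
• $\bar{X}_2 = \dfrac{\sum X_2}{n}$
Regression Sum Calculation (lowercase $x_1$, $x_2$ denote deviations of $X_1$, $X_2$ from their means)
• $\sum x_1^2 = \sum X_1^2 - \dfrac{(\sum X_1)^2}{n}$
• $\sum x_2^2 = \sum X_2^2 - \dfrac{(\sum X_2)^2}{n}$
• $\sum x_1 y = \sum X_1 y - \dfrac{\sum X_1 \sum y}{n}$
• $\sum x_2 y = \sum X_2 y - \dfrac{\sum X_2 \sum y}{n}$
• $\sum x_1 x_2 = \sum X_1 X_2 - \dfrac{\sum X_1 \sum X_2}{n}$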
Advertising and Quality Expenditures and Sales Revenues of the Firm in Each of 10 Years

y = Sales, X1 = Advertising Expense, X2 = Quality Control

Year    y     X1    X2    X1²    X2²   X1·y   X2·y   X1·X2   y²
1       44    10    3     100    9     440    132    30      1936
2       40    9     4     81     16    360    160    36      1600
3       42    11    3     121    9     462    126    33      1764
4       46    12    3     144    9     552    138    36      2116
5       48    11    4     121    16    528    192    44      2304
6       52    12    5     144    25    624    260    60      2704
7       54    13    6     169    36    702    324    78      2916
8       58    13    7     169    49    754    406    91      3364
9       56    14    7     196    49    784    392    98      3136
10      60    15    8     225    64    900    480    120     3600
Total   500   120   50    1470   282   6106   2610   626     25440

• $\bar{y} = \dfrac{\sum y}{n} = \dfrac{500}{10} = 50$
• $\bar{X}_1 = \dfrac{\sum X_1}{n} = \dfrac{120}{10} = 12$
• $\bar{X}_2 = \dfrac{\sum X_2}{n} = \dfrac{50}{10} = 5$
Regression Sum Calculation
• $\sum x_1^2 = \sum X_1^2 - \dfrac{(\sum X_1)^2}{n} = 1470 - \dfrac{(120)^2}{10} = 30$
• $\sum x_2^2 = \sum X_2^2 - \dfrac{(\sum X_2)^2}{n} = 282 - \dfrac{(50)^2}{10} = 32$
• $\sum x_1 y = \sum X_1 y - \dfrac{\sum X_1 \sum y}{n} = 6106 - \dfrac{(120)(500)}{10} = 106$
• $\sum x_2 y = \sum X_2 y - \dfrac{\sum X_2 \sum y}{n} = 2610 - \dfrac{(50)(500)}{10} = 110$
• $\sum x_1 x_2 = \sum X_1 X_2 - \dfrac{\sum X_1 \sum X_2}{n} = 626 - \dfrac{(120)(50)}{10} = 26$
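The five regression sums share one pattern: a raw sum of products minus the product of the two column totals divided by n. A minimal Python sketch of that pattern, using the 10-year data, is given below; the helper name `centered_cross_sum` is an assumption made for this example.

```python
# Data from the 10-year table: sales (y), advertising (X1), quality control (X2).
y = [44, 40, 42, 46, 48, 52, 54, 58, 56, 60]
x1 = [10, 9, 11, 12, 11, 12, 13, 13, 14, 15]
x2 = [3, 4, 3, 3, 4, 5, 6, 7, 7, 8]


def centered_cross_sum(a, b):
    """Sum of a*b minus (sum a)(sum b)/n: the sum of products of deviations."""
    return sum(p * q for p, q in zip(a, b)) - sum(a) * sum(b) / len(a)


s11 = centered_cross_sum(x1, x1)  # sum of squared deviations of X1
s22 = centered_cross_sum(x2, x2)  # sum of squared deviations of X2
s1y = centered_cross_sum(x1, y)   # cross-deviations of X1 and y
s2y = centered_cross_sum(x2, y)   # cross-deviations of X2 and y
s12 = centered_cross_sum(x1, x2)  # cross-deviations of X1 and X2
print(s11, s22, s1y, s2y, s12)    # 30.0 32.0 106.0 110.0 26.0
```

Passing the same column twice gives the squared-deviation sums, so one helper covers all five quantities.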
Regression Sum Calculation
• $\sum x_1^2 = 30$
• $\sum x_2^2 = 32$
• $\sum x_1 y = 106$
• $\sum x_2 y = 110$
• $\sum x_1 x_2 = 26$

• $\beta_1 = \dfrac{\sum x_2^2 \sum x_1 y - (\sum x_1 x_2)(\sum x_2 y)}{\sum x_1^2 \sum x_2^2 - (\sum x_1 x_2)^2} = \dfrac{(32)(106) - (26)(110)}{(30)(32) - (26)^2} = \dfrac{532}{284} = 1.873$
• $\beta_2 = \dfrac{\sum x_1^2 \sum x_2 y - (\sum x_1 x_2)(\sum x_1 y)}{\sum x_1^2 \sum x_2^2 - (\sum x_1 x_2)^2} = \dfrac{(30)(110) - (26)(106)}{(30)(32) - (26)^2} = \dfrac{544}{284} = 1.915$
• $\beta_0 = \bar{y} - \beta_1 \bar{X}_1 - \beta_2 \bar{X}_2 = 50 - 1.873(12) - 1.915(5) = 17.949$
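The closed-form expressions for the two slopes and the intercept can be checked with a short script. This is a sketch under the assumption that exactly two independent variables are used; the function name `fit_two_predictors` is hypothetical. It carries full precision, so the intercept comes out near 17.944, whereas using the rounded slopes 1.873 and 1.915 gives 17.949.

```python
def fit_two_predictors(y, x1, x2):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the deviation-sum formulas."""
    n = len(y)

    def csum(a, b):
        # Sum of products of deviations, computed from raw sums.
        return sum(p * q for p, q in zip(a, b)) - sum(a) * sum(b) / n

    s11, s22, s12 = csum(x1, x1), csum(x2, x2), csum(x1, x2)
    s1y, s2y = csum(x1, y), csum(x2, y)
    det = s11 * s22 - s12 ** 2  # zero when x1 and x2 are perfectly collinear
    if det == 0:
        raise ValueError("x1 and x2 are perfectly collinear; no unique fit")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
    return b0, b1, b2


y = [44, 40, 42, 46, 48, 52, 54, 58, 56, 60]
x1 = [10, 9, 11, 12, 11, 12, 13, 13, 14, 15]
x2 = [3, 4, 3, 3, 4, 5, 6, 7, 7, 8]

b0, b1, b2 = fit_two_predictors(y, x1, x2)
print(round(b1, 3), round(b2, 3), round(b0, 3))
```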
𝒕-statistics or 𝒕 ratio
$$t = \frac{b}{S_b}$$
• Here $b$ is the estimated regression coefficient and $S_b$ is its standard error.
• The higher this calculated $t$ ratio, the more confident we can be that there is a significant relationship between $X$ (advertising) and $Y$ (sales).
Tests of Significance
• To test the hypothesis that 𝑏 is statistically significant (i.e. that
advertising positively affects sales), we need first of all to calculate
the standard error (deviation) of 𝑏
$$S_b = \sqrt{\frac{\sum (Y_t - \hat{Y}_t)^2}{(n-k)\sum (X_t - \bar{X})^2}} = \sqrt{\frac{\sum e_t^2}{(n-k)\sum (X_t - \bar{X})^2}}$$
Advertising and Sales Revenues of the Firm in Each of 10 Years

Fitted values: ŷ = 17.949 + 1.873 X1 + 1.915 X2. Residuals: e_t = Y − ŷ. Mean: X̄1 = 12.

Year    Sales (Y)   Advertising Expense (X1)   X1 − X̄1   (X1 − X̄1)²   ŷ        e_t      e_t²
1       44          10                         -2         4             42.424   1.576    2.483776
2       40          9                          -3         9             42.466   -2.466   6.081156
3       42          11                         -1         1             44.297   -2.297   5.276209
4       46          12                         0          0             46.17    -0.17    0.0289
5       48          11                         -1         1             46.212   1.788    3.196944
6       52          12                         0          0             50       2        4
7       54          13                         1          1             53.788   0.212    0.044944
8       58          13                         1          1             55.703   2.297    5.276209
9       56          14                         2          4             57.576   -1.576   2.483776
10      60          15                         3          9             61.364   -1.364   1.860496
Total   500         120                                   30                              30.73241
Quality Control and Sales Revenues of the Firm in Each of 10 Years

Fitted values: ŷ = 17.949 + 1.873 X1 + 1.915 X2. Residuals: e_t = Y − ŷ. Mean: X̄2 = 5.

Year    Sales (Y)   Quality Control (X2)   X2 − X̄2   (X2 − X̄2)²   ŷ        e_t      e_t²
1       44          3                      -2         4             42.424   1.576    2.483776
2       40          4                      -1         1             42.466   -2.466   6.081156
3       42          3                      -2         4             44.297   -2.297   5.276209
4       46          3                      -2         4             46.17    -0.17    0.0289
5       48          4                      -1         1             46.212   1.788    3.196944
6       52          5                      0          0             50       2        4
7       54          6                      1          1             53.788   0.212    0.044944
8       58          7                      2          4             55.703   2.297    5.276209
9       56          7                      2          4             57.576   -1.576   2.483776
10      60          8                      3          9             61.364   -1.364   1.860496
Total   500                                           32                              30.73241
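The fitted values, residuals, and sum of squared residuals in these tables can be reproduced in a few lines. The sketch below uses the rounded coefficients 17.949, 1.873, and 1.915 exactly as they appear in the table header; the variable names are chosen for this example only.

```python
y = [44, 40, 42, 46, 48, 52, 54, 58, 56, 60]
x1 = [10, 9, 11, 12, 11, 12, 13, 13, 14, 15]
x2 = [3, 4, 3, 3, 4, 5, 6, 7, 7, 8]

# Rounded coefficients, as used in the table header.
b0, b1, b2 = 17.949, 1.873, 1.915

fitted = [b0 + b1 * a + b2 * q for a, q in zip(x1, x2)]
residuals = [obs - fit for obs, fit in zip(y, fitted)]
sse = sum(e * e for e in residuals)  # sum of squared residuals

for year, (fit, e) in enumerate(zip(fitted, residuals), start=1):
    print(year, round(fit, 3), round(e, 3), round(e * e, 6))
print("sum of squared residuals:", round(sse, 3))  # 30.732
```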
Tests of Significance
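• To test the hypothesis that the advertising coefficient $b = \beta_1 = 1.873$ is statistically significant (i.e. that advertising positively affects sales), we first calculate the standard error (deviation) of $b$, using $\sum e_t^2 = 30.732$ and $\sum (X_1 - \bar{X}_1)^2 = 30$:
$$S_b = \sqrt{\frac{\sum (Y_t - \hat{Y}_t)^2}{(n-k)\sum (X_t - \bar{X})^2}}$$
$$S_b = \sqrt{\frac{30.732}{(10-2)(30)}} \approx 0.357$$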
𝒕-statistics or 𝒕 ratio
$$t = \frac{1.873}{0.357} = 5.246$$
𝒕-statistics or 𝒕 ratio
• The critical value is $t = 2.306$ for a two-tailed $t$ test at the 5% level of significance with 8 df.
• Our calculated value of $t = 5.246$ exceeds this tabular value:
$$t_c > t: \quad 5.246 > 2.306$$
• We therefore reject the null hypothesis that there is no relationship between $X_1$ (advertising) and $Y$ (sales), and
• we accept the alternative hypothesis that there is a significant relationship between $X_1$ and $Y$.
• It means that we are 95% confident that such a relationship exists.
Tests of Significance
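• To test the hypothesis that the quality control coefficient $b = \beta_2 = 1.915$ is statistically significant (i.e. that quality control positively affects sales), we first calculate the standard error (deviation) of $b$, using $\sum e_t^2 = 30.732$ and $\sum (X_2 - \bar{X}_2)^2 = 32$:
$$S_b = \sqrt{\frac{\sum (Y_t - \hat{Y}_t)^2}{(n-k)\sum (X_t - \bar{X})^2}}$$
$$S_b = \sqrt{\frac{30.732}{(10-2)(32)}} \approx 0.3464$$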
𝒕-statistics or 𝒕 ratio
$$t = \frac{1.915}{0.3464} = 5.528$$
𝒕-statistics or 𝒕 ratio
• The critical value is $t = 2.306$ for a two-tailed $t$ test at the 5% level of significance with 8 df.
• Our calculated value of $t = 5.528$ exceeds this tabular value:
$$t_c > t: \quad 5.528 > 2.306$$
• We therefore reject the null hypothesis that there is no relationship between $X_2$ (quality control) and $Y$ (sales), and
• we accept the alternative hypothesis that there is a significant relationship between $X_2$ and $Y$.
• It means that we are 95% confident that such a relationship exists.
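Both significance tests follow the same three steps: compute $S_b$, divide the coefficient by it, and compare with the tabulated value. The sketch below applies the slides' formula as written, with $n - k = 10 - 2$; the function name `slide_standard_error` is an assumption for this example. Because it does not round $S_b$ before dividing, the $t$ ratios differ slightly in the second decimal from the hand calculation. Regression software typically uses a different standard-error formula for multiple regression (one that accounts for the correlation between $X_1$ and $X_2$ and uses $n - 3$ degrees of freedom), so its output would not match these values.

```python
import math

SSE = 30.732        # sum of squared residuals, from the residual table
N, K = 10, 2        # the slides divide by n - k = 10 - 2 = 8
T_CRITICAL = 2.306  # tabulated two-tailed 5% value for 8 df, taken from the slides


def slide_standard_error(sse, n, k, sum_sq_dev):
    """S_b = sqrt(sse / ((n - k) * sum of squared deviations of X))."""
    return math.sqrt(sse / ((n - k) * sum_sq_dev))


# name -> (coefficient, sum of squared deviations of that variable)
inputs = {"advertising": (1.873, 30), "quality control": (1.915, 32)}

results = {}
for name, (coef, ssd) in inputs.items():
    s_b = slide_standard_error(SSE, N, K, ssd)
    t_ratio = coef / s_b
    results[name] = (s_b, t_ratio, t_ratio > T_CRITICAL)
    print(f"{name}: S_b = {s_b:.4f}, t = {t_ratio:.3f}, "
          f"significant = {t_ratio > T_CRITICAL}")
```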
Coefficient of determination ($R^2$)
$R^2$ measures how much of the variation in the firm’s sales is explained by the variation in its advertising expenditures and quality control.
$$R^2 = \frac{\beta_0 \sum y + \beta_1 \sum X_1 y + \beta_2 \sum X_2 y - \frac{(\sum y)^2}{n}}{\sum y^2 - \frac{(\sum y)^2}{n}}$$
$$R^2 = \frac{17.949(500) + 1.873(6106) + 1.915(2610) - \frac{(500)^2}{10}}{25440 - \frac{(500)^2}{10}} = \frac{409.188}{440}$$
$$R^2 = 0.930$$
A value of $R^2 = 0.930$ indicates that 93.0% of the variability in $y$ is explained by its linear relationship with the independent variables $X_1$, $X_2$, and only about 7% of the variation is due to other factors that are not part of this model.
Coefficient of correlation ($r$)
$$r = \sqrt{R^2} = \sqrt{0.930} \approx 0.964$$
This means that sales ($y$) and the fitted combination of $X_1$ and $X_2$ have a strong positive linear association ($r \approx 0.964$).
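The $R^2$ formula needs only the column totals from the sums table. As a sketch, the script below evaluates it with $\sum y = 500$, $\sum X_1 y = 6106$, $\sum X_2 y = 2610$, $\sum y^2 = 25440$, and the rounded coefficients 17.949, 1.873, and 1.915; with these inputs it gives about 0.930, and the square root gives $r$ of about 0.964. The variable names are chosen for this example.

```python
import math

n = 10
sum_y, sum_y2 = 500, 25440     # totals of y and y^2 from the sums table
sum_x1y, sum_x2y = 6106, 2610  # totals of X1*y and X2*y from the sums table
b0, b1, b2 = 17.949, 1.873, 1.915

correction = sum_y ** 2 / n    # (sum of y)^2 / n
explained = b0 * sum_y + b1 * sum_x1y + b2 * sum_x2y - correction
total = sum_y2 - correction    # total variation of y about its mean

r_squared = explained / total
r = math.sqrt(r_squared)
print(round(r_squared, 3), round(r, 3))
```

As a consistency check, one minus the sum of squared residuals over the total variation, 1 − 30.732/440, gives the same value to three decimals.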
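Problem: Compute $R^2$
$n = 5$, $\beta_0 = -1.33$
$\sum y = 89$, $\sum y^2 = 1885$
$\beta_1 = 0.38$, $\beta_2 = 1.62$
$\sum X_1 y = 619$, $\sum X_2 y = 1007$
$$R^2 = \frac{\beta_0 \sum y + \beta_1 \sum X_1 y + \beta_2 \sum X_2 y - \frac{(\sum y)^2}{n}}{\sum y^2 - \frac{(\sum y)^2}{n}}$$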
Standard Deviation of Regression or
Standard Error of Estimate
• Not all of the observed values of $(y, X_1, X_2)$ fall on the fitted regression plane; they scatter around it.
• The standard error of estimate is the standard deviation of the multiple regression.
• It measures the dispersion of the $y$ values about the population multiple regression equation.
• For a multiple regression with two independent variables $X_1$ and $X_2$ it is denoted by $\sigma_{y12}$, where the subscripts 1 and 2 indicate $X_1$ and $X_2$.
• The sample standard error of estimate is denoted by $S_{y12}$.
$$\sigma_{y12} = \sqrt{\frac{\sum (y - \hat{y})^2}{n-3}}$$
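As an illustration of this formula, the sketch below applies it to the 10-year advertising and quality control example, using the rounded coefficients 17.949, 1.873, and 1.915. The divisor is $n - 3$ because three coefficients are estimated. The function name `standard_error_of_estimate` and its `n_params` argument are assumptions made for this example.

```python
import math


def standard_error_of_estimate(y, fitted, n_params=3):
    """sqrt( sum (y - y_hat)^2 / (n - n_params) ); n_params = 3 for two predictors."""
    sse = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
    return math.sqrt(sse / (len(y) - n_params))


y = [44, 40, 42, 46, 48, 52, 54, 58, 56, 60]
x1 = [10, 9, 11, 12, 11, 12, 13, 13, 14, 15]
x2 = [3, 4, 3, 3, 4, 5, 6, 7, 7, 8]

# Fitted sales from the rounded regression equation of the worked example.
fitted = [17.949 + 1.873 * a + 1.915 * q for a, q in zip(x1, x2)]

s = standard_error_of_estimate(y, fitted)
print(round(s, 3))  # sqrt(30.732 / 7), about 2.095
```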
Practice Problem:
Compute the standard error of estimate.
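$n = 5$, $\beta_0 = -1.33$
$\sum y = 89$, $\sum y^2 = 1885$
$\beta_1 = 0.38$, $\beta_2 = 1.62$
$\sum X_1 y = 619$, $\sum X_2 y = 1007$
$$\sigma_{y12} = \sqrt{\frac{\sum y^2 - \beta_0 \sum y - \beta_1 \sum X_1 y - \beta_2 \sum X_2 y}{n-3}}$$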
Practice Problem

Year   Sales (y)   Advertising Expense (x1)   Quality Control (x2)
1      30          10                         15
2      22          5                          8
3      16          10                         12
4      7           3                          7
5      14          2                          10
Practice Problem:
Solve the multiple regression problem using the following data:
$n = 5$
$\sum y = 89$, $\sum X_1 X_2 = 351$
$\sum X_1 = 30$, $\sum X_2 = 52$, $\sum X_1 y = 619$
Acknowledgment
• [Peter Bruce, Andrew Bruce] Practical Statistics for Data Scientists
• [David Forsyth] Probability and Statistics for Computer Science
• [Michael Baron] Probability and Statistics for Computer Scientists