Lecture 4
The Multiple Linear Regression Model
Using Linear Algebra:

y = Xβ + ϵ

where y is n×1, X is n×(p + 1), β is (p + 1)×1, and ϵ is n×1.
OLS
We want to minimize the Residual Sum of Squares (RSS):

RSS(β) = (y − Xβ)ᵀ(y − Xβ)

First-order condition (set the gradient with respect to β to zero):

∂RSS/∂β = −2Xᵀy + 2XᵀXβ = 0

Solving the first-order condition gives the estimates:

β̂ = (XᵀX)⁻¹Xᵀy

Predicted outcomes: ŷ = Xβ̂ (once β̂ is in hand, you can predict the outcome for any X you want)
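A minimal numerical sketch of these formulas in SAS PROC IML; the toy design matrix and outcomes below are made up purely for illustration:

proc iml;
   /* toy data: n = 5 observations, one predictor plus an intercept column */
   X = {1 1, 1 2, 1 3, 1 4, 1 5};     /* n x (p+1) design matrix */
   y = {2, 4, 5, 8, 9};               /* n x 1 outcome vector */
   beta_hat = solve(X`*X, X`*y);      /* solves (X`X)b = X`y, i.e. b = (X`X)^(-1) X`y */
   y_hat = X * beta_hat;              /* predicted outcomes */
   print beta_hat, y_hat;
quit;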
Variation of Estimation
y = Xβ + ϵ
Covariance Matrix of Error:

Var(ϵ) = E[ϵϵᵀ] = σ²I  (an n×n diagonal matrix with σ² on the diagonal and 0 elsewhere)

We assume the errors are uncorrelated and share the same variance across observations!
Covariance Matrix of the Estimate: the variation of the coefficients from one sample to another.

What makes Var(β̂) smaller?

Var(β̂) = σ²(XᵀX)⁻¹

Smaller error variance σ², more observations, and more variation in the predictors all shrink (XᵀX)⁻¹ and hence Var(β̂).
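A small PROC IML sketch (toy numbers, not from the slides) comparing a tightly clustered predictor with a spread-out one:

proc iml;
   sigma2 = 1;                                    /* assume unit error variance */
   Xnarrow = {1 4.9, 1 5.0, 1 5.0, 1 5.1};        /* predictor barely varies */
   Xwide   = {1 1, 1 3, 1 7, 1 9};                /* same n, much more spread */
   v_narrow = sigma2 # inv(Xnarrow`*Xnarrow);     /* Var(beta_hat) with narrow x */
   v_wide   = sigma2 # inv(Xwide`*Xwide);         /* Var(beta_hat) with wide x */
   print (vecdiag(v_narrow)) (vecdiag(v_wide));   /* slope variance is far smaller with spread */
quit;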
The Statistics of Linear Regression
σ̂² = RSS/(n − p − 1) = Σᵢ₌₁ⁿ êi² / (n − p − 1) = Σᵢ₌₁ⁿ (yi − ŷi)² / (n − p − 1)
Again, we estimate the error variance from the residuals.

SE(β̂) = √diag(Var(β̂))
For the standard error of an estimate, we take the square root of its corresponding diagonal element in the covariance matrix.

Test statistic: t = (β̂j − 0) / SE(β̂j)
This gives us the p-values and confidence intervals for each coefficient.
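Continuing the earlier toy IML example (illustrative numbers only), each of these quantities is a one-liner:

proc iml;
   X = {1 1, 1 2, 1 3, 1 4, 1 5};
   y = {2, 4, 5, 8, 9};
   n = nrow(X);  p = ncol(X) - 1;
   beta_hat  = solve(X`*X, X`*y);
   e_hat     = y - X*beta_hat;                  /* residuals */
   sigma2hat = ssq(e_hat) / (n - p - 1);        /* RSS / (n - p - 1) */
   var_beta  = sigma2hat # inv(X`*X);           /* estimated Var(beta_hat) */
   se        = sqrt(vecdiag(var_beta));         /* square roots of the diagonal */
   t         = beta_hat / se;                   /* t-statistics (null value 0) */
   pval      = 2 # (1 - probt(abs(t), n-p-1));  /* two-sided p-values */
   print beta_hat se t pval;
quit;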
Goodness-of-Fit
ŷi = β̂1·xi1 + ⋯ + β̂p·xip  ⟹  TSS = Σᵢ₌₁ⁿ (yi − ȳ)²,  RSS = Σᵢ₌₁ⁿ (yi − ŷi)²

R² = (TSS − RSS)/TSS = 1 − RSS/TSS

R² for simple regressions (each model has its own coefficients):
Only TV:        Sales = β0 + β1·TVAd + ϵ          R² = 0.6119
Only Radio:     Sales = β0 + β1·RadioAd + ϵ        R² = 0.332
Only Newspaper: Sales = β0 + β1·NewspaperAd + ϵ    R² = 0.05212

• Adding another variable will always let us fit the training data better, but if it leads to overfitting, it won't help us predict (and might hurt!)
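A sketch of the same arithmetic in PROC IML (same toy data as before; illustrative only):

proc iml;
   X = {1 1, 1 2, 1 3, 1 4, 1 5};
   y = {2, 4, 5, 8, 9};
   beta_hat = solve(X`*X, X`*y);
   ybar = y[:];                   /* subscript-reduction mean of y */
   TSS  = ssq(y - ybar);          /* total variation around the mean */
   RSS  = ssq(y - X*beta_hat);    /* variation left unexplained by the model */
   R2   = 1 - RSS/TSS;            /* equivalently (TSS - RSS)/TSS */
   print TSS RSS R2;
quit;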
Important Questions
1) Is there any systematic relationship between Y and X1, ⋯, Xp?

Null: H0 : β1 = β2 = … = βp = 0
Alternative: Ha : at least one βj ≠ 0
F-test
Is at least one predictor useful? (Question 1)
Null: H0 : β1 = β2 = … = βp = 0
Compute the F-statistic:

F = [(TSS − RSS)/p] / [RSS/(n − p − 1)]  ∼  F(p, n − p − 1)
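A sketch of the computation in IML (the sample size and sums of squares here are made-up numbers for illustration):

proc iml;
   n = 200;  p = 3;                            /* hypothetical sample size and predictor count */
   TSS = 5400;  RSS = 560;                     /* made-up sums of squares */
   F = ((TSS - RSS)/p) / (RSS/(n - p - 1));    /* the F-statistic from the slide */
   pval = 1 - probf(F, p, n - p - 1);          /* upper-tail F(p, n-p-1) probability */
   print F pval;
quit;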
Sales and Advertising
Sales = β0 + β1·TVAd + β2·RadioAd + β3·NewspaperAd + ϵ

Question:
If we run a bunch of t-tests (say, at a 5% level), do we need to run the F-test?
Multiple Comparisons
If the individual p-value for one of the predictors is small, does that
mean that at least one of the predictors is related to the response?
Deciding on Important Variables
Suppose the F-test rejects the null. Which variables matter?
But some of those will be false discoveries: each 5%-level t-test has a 5% chance of a false positive.
Deciding on Important Variables
Suppose the F-test rejects the null. Which variables matter?

Variable Selection
1) Run all possible models, and then determine which model fits best
(Akaike information criterion, Bayesian information criterion, etc.)
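In SAS, one way to sketch this all-subsets search is PROC REG's selection options (the dataset and variable names here are hypothetical):

proc reg data=mydata;
   /* fit every subset of x1-x5; report AIC and BIC; keep the best 3 models per size */
   model y = x1 x2 x3 x4 x5 / selection=rsquare aic bic best=3;
run;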
Categorical Variable as a Predictor
Categorical Variables
What if X is a qualitative variable?

Y = β0 + β1X + ϵ
Categorical Variables
1) Code them as levels: this is called a dummy variable.

I_gender = 1 if Female, 0 if Male

I_season^S = 1 if Summer, 0 otherwise
I_season^F = 1 if Fall, 0 otherwise
I_season^W = 1 if Winter, 0 otherwise

(Spring is the baseline: it is coded 0 on all three indicators.)
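A sketch of both routes in SAS (dataset and variable names are hypothetical): build the dummies yourself in a DATA step, or let PROC GLM's CLASS statement build them for you:

data coded;
   set survey;                        /* hypothetical input data */
   I_gender = (gender = "Female");    /* 1 if Female, 0 if Male */
   I_summer = (season = "Summer");    /* Spring is the omitted baseline */
   I_fall   = (season = "Fall");
   I_winter = (season = "Winter");
run;

proc glm data=coded;
   class season;                          /* GLM creates the dummy variables itself */
   model spending = season / solution;    /* SOLUTION prints the coefficient estimates */
run;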
How do we interpret categorical variables?
I_gender = 1 if Female, 0 if Male

Income: Y = β0 + β1·I_gender + ϵ  ⟹  use OLS as before

I_season^S = 1 if Summer, I_season^F = 1 if Fall, I_season^W = 1 if Winter (each 0 otherwise; Spring is the baseline)

Spending: Y = β0 + β1·I_season^S + β2·I_season^F + β3·I_season^W + ϵ  ⟹  use OLS as before

β̂0 = ȳ(spending in Spring)   An estimate of the sample mean of spending in Spring.
β̂j = ȳ(spending in season j) − β̂0 = ȳ(spending in season j) − ȳ(spending in Spring)   Each remaining coefficient is the gap between season j's mean and Spring's mean.
Interaction Effect

Ŝales = 6.7502 + 0.0191·TVAd + 0.0289·RadioAd + 0.0011·TVAd·RadioAd

Let's assume we spent $1000 on TVAd and $1000 on RadioAd.
What will happen if we advertise $1000 more on either TV or Radio?
Interaction Effect
Sales = β0 + β1·TVAd + β2·RadioAd + β3·TVAd·RadioAd + ϵ
(β1, β2: main effects; β3: interaction effect)

Ŝales = 6.7502 + 0.0191·TVAd + 0.0289·RadioAd + 0.0011·TVAd·RadioAd

Ŝales = β̂0 + (β̂1 + β̂3·RadioAd)·TVAd + β̂2·RadioAd
so the effect of one more unit of TV advertising, β̂1 + β̂3·RadioAd, depends on the radio budget.
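A worked reading of the fitted equation (treating the budgets' units as given on the slide, which is an assumption):

∂Ŝales/∂TVAd = β̂1 + β̂3·RadioAd = 0.0191 + 0.0011·RadioAd

With RadioAd = 1000, one more unit of TVAd raises predicted sales by 0.0191 + 0.0011 × 1000 = 1.1191, whereas with RadioAd = 0 it would raise them by only 0.0191. By symmetry, the effect of RadioAd is β̂2 + β̂3·TVAd: the two media reinforce each other.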
Non-linearities
MPG and Horsepower for a sample of cars
Polynomial Regression
Add non-linear terms to the regression:

Y = β0 + β1X1 + β2X2 + ϵ,  where X1 = horsepower and X2 = horsepower²
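In SAS this is just a new column plus an ordinary regression (dataset name hypothetical):

data cars2;
   set cars;                        /* hypothetical data with mpg and horsepower */
   horsepower2 = horsepower**2;     /* the squared term is just another predictor */
run;

proc reg data=cars2;
   model mpg = horsepower horsepower2;   /* still linear in the betas, so OLS applies */
run;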
Non-linearities
MPG and Horsepower for a sample of cars
[Figure: Miles per gallon vs. Horsepower, with linear, degree-2, and degree-5 polynomial fits]
Non-linearities
Then we should check the data pattern:
1) Plot the data
2) Plot the residuals

[Figure: MPG vs. Horsepower with linear and degree-2 fits; residual plots for the linear and quadratic models, with observations 323, 330, and 334 flagged]

Possible fixes: 1) add non-linear terms; 2) more advanced methods in later lectures.
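One way to get the residual plot in SAS (hypothetical dataset; with ODS Graphics enabled, PROC REG also produces a residual panel automatically):

proc reg data=cars2;
   model mpg = horsepower;
   plot r.*p.;      /* residuals against predicted values */
run;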
Violations of Assumptions of Linear Regression
&
Handling These Issues
Well-known Violations
Outliers
Definition: An observation yi that lies far from the predicted model, i.e., êi = yi − ŷi is significantly larger than the others.

Effects: Typically should not affect the model estimates much, but will affect measures of fit: R² will decrease and SE(β̂) will increase.

Detection: By plotting the data and the residuals.
Studentized residuals: divide each residual by its standard error.
[Figure: data with an outlier (observation 20), residuals vs. fitted values, and studentized residuals vs. fitted values]
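A sketch of how to pull studentized residuals out of PROC REG (dataset and variable names hypothetical):

proc reg data=mydata;
   model y = x;
   output out=diag rstudent=rstud;   /* externally studentized residuals */
run;

data outliers;
   set diag;
   if abs(rstud) > 3;                /* a common rule of thumb for flagging outliers */
run;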
Leverage: the average leverage is (1/n) Σᵢ₌₁ⁿ hi = (p + 1)/n. So, if hi is far from the average leverage, you should pay attention to it!
[Figure: a high-leverage observation (41) and an outlier (20): Y vs. X, X2 vs. X1, and studentized residuals vs. leverage]
Cook's distance: a statistic for finding data points with large residuals (outliers) and/or high leverage; it is the link between the two.
Rule of thumb: observations with D > 4/n should be inspected, where n is the number of observations.
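A sketch of these diagnostics in PROC REG (names hypothetical), using the OUTPUT statement's leverage and Cook's distance keywords:

proc reg data=mydata;
   model y = x;
   output out=infl h=leverage cookd=cd;   /* leverage h_i and Cook's distance D_i */
run;

data inspect;
   set infl nobs=ntotal;   /* ntotal = number of observations n */
   if cd > 4/ntotal;       /* the D > 4/n rule of thumb from above */
run;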
Influential observations
Solution:
2. Use robust regression: without dropping points, use alternative assumptions to estimate the model.
In other words, minimize the size of the residuals in a different way, so that influential observations carry less weight.
Influential observations
OLS:     min_{β0,β1} RSS = min_{β0,β1} Σᵢ₌₁ⁿ (yi − β0 − β1xi)²

proc reg data=Name;
   model y = x;
run;

Robust:  min_{β0,β1} Σᵢ₌₁ⁿ ρ(yi − β0 − β1xi)    (a loss ρ that grows more slowly than the square)

MM:      min_{β0,β1} Σᵢ₌₁ⁿ χ(yi − β0 − β1xi)

proc robustreg data=Name method=mm;
   model y = x;
run;
Computer Sales and Economic Status
What is the relationship between computer sales in a country and its economic status?
Regression Omitting Outlier
Parameter Estimates
Variable                     Label                      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept                    Intercept                   1   -54.11423            45.35109         -1.19     0.2513
gnp_per_capita               GNP per head                1     0.00166             0.00049088       3.37     0.0042
Unemployment_rate            Unemployment rate           1    -0.56378             2.54459         -0.22     0.8276
percentage_education_spend   %age spend on education     1    20.15309             7.73693          2.60     0.0199
Robust Regression (MM)
Collinearity
Predictors that are highly correlated
[Figure: scatterplots of Age vs. Limit and Rating vs. Limit, both strongly correlated with Limit; RSS contour plots over the coefficient pairs (β_Age, β_Limit) and (β_Rating, β_Limit)]
Collinearity

Example: Balance ≈ f(age, limit, rating), where limit and rating are highly correlated.

One remedy: combine the correlated predictors into a smaller set of components {T1, ⋯, Tl}:
1) Xi = Σₖ₌₁ˡ cik·Tk
2) {T1, ⋯, Tl} inherit the maximum possible variance from {X1, ⋯, Xp}

Caution: simply omitting a variable that is correlated with both the response and the other predictors biases the remaining estimates.
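A standard way to detect collinearity is the variance inflation factor (VIF), where values well above 5-10 are a warning sign; a sketch in PROC REG (hypothetical Credit-style data):

proc reg data=credit;
   model balance = age rating limit / vif;   /* prints a VIF for each predictor */
run;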
Heteroscedasticity
We assumed that each error term has the same variance ("homoskedasticity"):

Var(ϵ) = E[ϵϵᵀ] = σ²I  (an n×n diagonal matrix with σ² on the diagonal and 0 elsewhere)
Heteroscedasticity
If instead

Var(ϵ) = E[ϵϵᵀ] = diag(σ1², …, σi², …, σn²)

then different observations can have "more/less" error than others,
e.g., higher values of X might have more error than lower values of X (surveys, income measures, etc.).
Heteroscedasticity
[Figure: residuals vs. fitted values with a funnel shape (the residual spread grows with the fitted values); observations 437, 671, 845, 975, and 998 are flagged]
Heteroscedasticity
Solution: 1) Transform the dependent variable
[Figure: after transforming the response, the residuals vs. fitted values show a much more stable spread]
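A sketch of the transformation route in SAS (names hypothetical); taking logs compresses the large responses, which often stabilizes the error variance:

data logged;
   set mydata;         /* hypothetical data with a positive response y */
   log_y = log(y);     /* natural log of the response */
run;

proc reg data=logged;
   model log_y = x;
run;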
Heteroscedasticity
Solution: 2) Use “robust” methods for calculating standard errors
Pros: This "fixes" the incorrect standard errors, and as a result the p-values are correct.
Cons: Often creates large standard errors and therefore wide confidence intervals.
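A sketch in PROC REG (names hypothetical; the HCC options are available in recent SAS releases):

proc reg data=mydata;
   model y = x / hcc hccmethod=3;   /* heteroscedasticity-consistent (robust) standard errors */
run;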