
SHERVIN SHAHROKHI TEHRANI

Predictive Analytics With SAS


Lecture Four: Linear Regression II

Prof. Ryan Webb

1
The Multiple Linear Regression Model
Using linear algebra:

y = Xβ + ϵ

where
y = (y_1, …, y_n)^⊤ is the n × 1 vector of data (observed outcomes),
X is the n × (p + 1) matrix of predictors, whose i-th row is (1, x_i1, x_i2, …, x_ip),
β = (β_0, β_1, …, β_p)^⊤ is the (p + 1) × 1 vector of coefficients,
ϵ = (ϵ_1, …, ϵ_n)^⊤ is the n × 1 vector of errors.

2
OLS
We want to minimize the Residual Sum of Squares (RSS):

RSS(β) = (y − Xβ)^⊤ (y − Xβ)

First order condition (setting the derivative with respect to β to zero):

∂RSS/∂β = −2X^⊤y + 2X^⊤Xβ = 0

Estimates (solving the first order condition):  β̂ = (X^⊤X)^(−1) X^⊤ y

Predicted outcomes:  ŷ = Xβ̂, which we can use to predict the outcome for any X we like.

3
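A minimal PROC IML sketch of these formulas, with a tiny made-up data set (the numbers are purely illustrative, not from the lecture):

proc iml;
   /* toy data: n = 5 observations, one predictor plus an intercept column */
   y = {3, 5, 4, 7, 8};
   X = {1 1,
        1 2,
        1 3,
        1 4,
        1 5};
   beta_hat = inv(X`*X) * X` * y;   /* beta_hat = (X'X)^(-1) X'y           */
   y_hat    = X * beta_hat;         /* predicted outcomes y_hat = X*beta_hat */
   print beta_hat y_hat;
quit;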
Variation of Estimation
y = Xβ + ϵ

Covariance matrix of the error:

Var(ϵ) = E[ϵϵ^⊤] = σ²I, an n × n diagonal matrix with σ² on the diagonal and 0 elsewhere.

We assume the errors are uncorrelated and have the same variance across observations.
Homoscedasticity: the same variance at each observation.

Covariance matrix of the estimate, i.e., the variation of the coefficients from one sample to another:

Var(β̂) = σ²(X^⊤X)^(−1), a (p + 1) × (p + 1) matrix whose diagonal entries are the variances σ²_β̂j and whose off-diagonal entries are the covariances cov(β̂_j, β̂_k).

6
What makes Var(β̂) smaller?

Var(β̂) = σ²(X^⊤X)^(−1)

1) σ is smaller: the variance of the error term decreases.

2) n increases: more observations provide a better fitted linear model.

3) The "orthogonality" of the matrix X: the more independent variation each variable adds, the more precise each estimate (more distinct information).

7
The Statistics of Linear Regression

σ̂² = RSS/(n − p − 1) = Σ ê_i² / (n − p − 1) = Σ (y_i − ŷ_i)² / (n − p − 1)
Again, estimate the error variance from the residuals.

Var(β̂) = σ̂²(X^⊤X)^(−1)        This is a (p + 1) × (p + 1) matrix.

SE(β̂_j): for the standard error of an estimate, take the square root of its corresponding diagonal element of Var(β̂).

Test statistic:  t = (β̂_j − 0) / SE(β̂_j)
This gives us the p-value and confidence interval for each coefficient.

8
Goodness-of-Fit

TSS = Σ_{i=1}^n (y_i − ȳ)²        RSS = Σ_{i=1}^n (y_i − ŷ_i)²

R² = (TSS − RSS)/TSS = 1 − RSS/TSS

Model (Advertising data)                                                            R²
Only TV:                 Sales = β_0 + β_1 TVAd + ϵ                                 0.6119
Only Radio:              Sales = β_0 + β_1 RadioAd + ϵ                              0.332
Only Newspaper:          Sales = β_0 + β_1 NewspaperAd + ϵ                          0.05212
TV + Radio:              Sales = β_0 + β_1 TVAd + β_2 RadioAd + ϵ                   0.89719
TV + Radio + Newspaper:  Sales = β_0 + β_1 TVAd + β_2 RadioAd + β_3 NewspaperAd + ϵ 0.8972

When you add variables, R² must go up!

• Adding another variable will always allow us to fit the training data better, but including it can lead to overfitting, so it won't necessarily help us predict (and might hurt!).

9
Important Questions
1) Is there any systematic relationship between Y and X1, ⋯Xp ?

Null: H0 : β1 = β2 = … = βp = 0
Alternative: Ha : at least one βj ≠ 0

2) Do all the predictors help explain the response, or just some?

Test statistic:  t = (β̂_j − 0) / SE(β̂_j)
10
F-test
Is at least one predictor useful? (Question 1)

Null: H0 : β1 = β2 = … = βp = 0

Alternative: Ha : at least one βj ≠ 0

Compute the F-statistic:  F = [(TSS − RSS)/p] / [RSS/(n − p − 1)] ∼ F(p, n − p − 1)

11
Sales and Advertising
Sales = β_0 + β_1 TVAd + β_2 RadioAd + β_3 NewspaperAd + ϵ

F = 570.3 ⟺ p < 0.000000001

There is at least one useful explanatory variable.

            Coefficient   Std. error   t-statistic   p-value
Intercept      2.939        0.3119        9.42       < 0.0001
TV             0.046        0.0014       32.81       < 0.0001
radio          0.189        0.0086       21.89       < 0.0001
newspaper     −0.001        0.0059       −0.18         0.8599

TABLE 3.4. For the Advertising data, least squares coefficient estimates of the multiple linear regression of number of units sold on radio, TV, and newspaper advertising budgets.

12
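In SAS, a table like this (together with the overall F-test from the previous slide) comes out of a single PROC REG call. A sketch, assuming a data set named advertising with variables sales, tv, radio, and newspaper (these names are assumptions, not from the slides):

proc reg data=advertising;
   /* coefficient estimates, standard errors, t-statistics, and p-values,
      plus the overall F-test of H0: all slopes equal zero */
   model sales = tv radio newspaper;
run;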
Multiple Comparisons
If the individual p-value for one of the predictors is small, does that
mean that at least one of the predictors is related to the response?

Question:
If we run a bunch of t-tests (say, at a 5% level), do we need to run the F-test?

13
Multiple Comparisons
If the individual p-value for one of the predictors is small, does that
mean that at least one of the predictors is related to the response?

YES! We still need to run the F-test.

The t-test tells us whether each variable, if added to the regression, adds information. But this test will be wrong 5% of the time!

So if we do many tests, say for many individual predictors, we should expect 5% of them to reject simply by chance…
14
Deciding on Important Variables
Suppose the F-test rejects. Which variables matter?

We could look at individual t-tests and keep whichever predictors are significant.

But some of those will be false discoveries… each test has a 5% chance of a mistake.

16
Deciding on Important Variables
Suppose the F-test is rejected. Which variables matter?

Variable Selection

1) Run all possible models, and then determine which model fits best
(Akaike information criterion, Bayesian information criterion, etc….)

This would require running 2^p different models.

2) Add (or subtract) variables to the model with the biggest (smallest) change in model fit.

We will explore both paths in later lectures.

17
Categorical Variable as a Predictor

18
Categorical Variables
What if X is a qualitative variable?

Y = β_0 + β_1 X + ϵ

Examples:

1. Sales = β_0 + β_1 Season + ϵ               Season ∈ {Spring, Summer, Fall, Winter}
2. Sales = β_0 + β_1 LocationOfStore + ϵ      Location ∈ {Europe, NorthAmerica}
3. Spending = β_0 + β_1 Ethnicity + ϵ         Ethnicity ∈ {White, AfricanAmerican, Asian, Hispanic}
4. Spending = β_0 + β_1 Gender + ϵ            Gender ∈ {Male, Female}

20
Categorical Variables
1. Coding the levels as indicators: these are called dummy variables.

I_gender = 1 if Female, 0 if Male

Note that if X has k levels, we need k − 1 dummy variables.

I^S_season = 1 if Summer, 0 otherwise
I^F_season = 1 if Fall, 0 otherwise
I^W_season = 1 if Winter, 0 otherwise

Spring is the base level.

21
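A sketch of both routes in SAS, assuming a data set named store with character variables gender and season and a numeric response spending (all hypothetical names): build the k − 1 dummies yourself in a DATA step, or let PROC GLM construct a coding from a CLASS statement (its choice of reference level may differ from Spring).

data store_dummies;
   set store;
   i_female = (gender = "Female");   /* 1 if Female, 0 if Male          */
   i_summer = (season = "Summer");   /* Spring is the base level, so    */
   i_fall   = (season = "Fall");     /* it gets no dummy of its own     */
   i_winter = (season = "Winter");
run;

proc reg data=store_dummies;
   model spending = i_summer i_fall i_winter;
run;

/* Alternatively, PROC GLM builds the coding for you from a CLASS statement */
proc glm data=store;
   class season;
   model spending = season / solution;
run;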
How do we interpret categorical variables?

I_gender = 1 if Female, 0 if Male

Income =: Y = β_0 + β_1 I_gender + ϵ

Use OLS as before:

β̂_0 = ȳ_Male        An estimate of the sample mean of males' income.

β̂_1 = ȳ_Female − β̂_0 = ȳ_Female − ȳ_Male        An estimate of the difference in sample mean income between females and males.

22
How do we interpret categorical variables?

I^S_season = 1 if Summer, 0 otherwise;  I^F_season = 1 if Fall, 0 otherwise;  I^W_season = 1 if Winter, 0 otherwise.

Spending =: Y = β_0 + β_1 I^S_season + β_2 I^F_season + β_3 I^W_season + ϵ

Use OLS as before:

β̂_0 = ȳ_spending in Spring        An estimate of the sample mean of spending in spring.

β̂_j = ȳ_spending in season j − β̂_0 = ȳ_spending in season j − ȳ_spending in Spring        An estimate of the difference in mean spending between season j and Spring.

23
Interaction Effect

Sales = β_0 + β_1 TVAd + β_2 RadioAd + β_3 NewspaperAd + ϵ

From Table 3.4 above, it is clear that TV and Radio play the key roles (newspaper is not significant).

1. Is there any synergy between TVAd and RadioAd?

2. Should we simultaneously advertise on both media?

24
Interaction Effect

Without an interaction:
Sales = β_0 + β_1 TVAd + β_2 RadioAd + β_3 NewspaperAd + ϵ            R² = 0.897

With an interaction term:
Sales = β_0 + β_1 TVAd + β_2 RadioAd + β_3 TVAd × RadioAd + ϵ         R² = 0.968

β_1 and β_2 are the main effects; β_3 is the interaction effect.

26
Interaction Effect

Sales = β_0 + β_1 TVAd + β_2 RadioAd + β_3 TVAd × RadioAd + ϵ
(main effects: β_1, β_2;  interaction effect: β_3)

Ŝales = 6.7502 + 0.0191 × TVAd + 0.0289 × RadioAd + 0.0011 × TVAd × RadioAd

Let's assume we spent $1000 on TVAd and $1000 on RadioAd.
What will happen if we advertise $1000 more on either TV or Radio?

27
Interaction Effect

Sales = β_0 + β_1 TVAd + β_2 RadioAd + β_3 TVAd × RadioAd + ϵ
(main effects: β_1, β_2;  interaction effect: β_3)

Ŝales = 6.7502 + 0.0191 × TVAd + 0.0289 × RadioAd + 0.0011 × TVAd × RadioAd

Ŝales = β̂_0 + (β̂_1 + β̂_3 × RadioAd) TVAd + β̂_2 RadioAd

Let's assume we spent $1000 on TVAd and $1000 on RadioAd.
What will happen if we advertise $1000 more on either TV or Radio?

$1000 more on TV: there are 0.0191 × 1000 = 19.1 more sales from the TV main effect,
but there are also 0.0011 × 1000 × 1000 = 1.1 × 1000 = 1100 more sales through the synergistic effect of TV and Radio.

28
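A sketch of fitting the interaction model in SAS, again assuming an advertising data set with variables sales, tv, and radio (assumed names): the product term is built in a DATA step and then enters the model as an ordinary predictor.

data advertising2;
   set advertising;
   tv_radio = tv * radio;   /* interaction term TVAd x RadioAd */
run;

proc reg data=advertising2;
   model sales = tv radio tv_radio;
run;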
Polynomial regression

29
Non-linearities
MPG and Horsepower for a sample of cars

30
Polynomial Regression
Add non-linear terms to the regression:

mpg = β_0 + β_1 horsepower + β_2 horsepower² + ϵ

The model is still a linear combination of predictors, so it is still a linear model:

Y = β_0 + β_1 X_1 + β_2 X_2 + ϵ        X_1 = horsepower,  X_2 = horsepower²

You can add higher-order polynomials. The model then becomes very flexible; however, it might start to overfit…

31
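A sketch in SAS, assuming a data set named cars with variables mpg and horsepower (assumed names): the squared term is created in a DATA step, and the fit remains an ordinary linear regression in the two constructed predictors.

data cars2;
   set cars;
   horsepower2 = horsepower**2;   /* X2 = horsepower squared */
run;

proc reg data=cars2;
   model mpg = horsepower horsepower2;
run;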
Non-linearities
MPG and Horsepower for a sample of cars

[Figure: Miles per gallon vs. Horsepower with linear, degree-2, and degree-5 polynomial fits.]

32
Non-linearities

We assumed the model is linear, so we should check the pattern in the data:

1) Plot the data
2) Plot the residuals

Fixes:
1) Introduce non-linear terms
2) More advanced methods in later lectures

[Figures: residuals vs. fitted values. The residual plot for the linear fit shows a clear "pattern"; the residual plot for the quadratic fit shows "no pattern".]

33
Violations of Assumptions of Linear Regression
&
Handling These Issues

34
Well-known Violations

1. Outliers, high-leverage, and influential observations

   Observations whose values lie far from the average pattern.

2. Multicollinearity amongst the independent variables

   Some predictors are highly correlated, i.e., |corr(X_j, X_j′)| ≈ 1.

3. Heteroscedasticity in the error term of Y_i = βX_i + ϵ_i

   Observations have errors with different variances, i.e., Var(ϵ_i) = σ_i² for observation i.

4. Non-normality of the error term

   The error term may not be normally distributed, i.e., ϵ_i ≁ N(0, σ²).

35
Outliers

Definition: an observation y_i that lies far from the predicted model, i.e., ê_i = (y_i − ŷ_i) is significantly larger than for other observations.

Effects: typically do not affect the model estimates much, but do affect measures of fit: R² will decrease and SE(β̂) will increase.

Detection: by plotting the data and the residuals. Studentized residuals: each residual divided by its standard error.

[Figures: Y vs. X with the fitted line, residuals vs. fitted values, and studentized residuals vs. fitted values; observation 20 stands out in all three panels.]

36
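One way to obtain these diagnostics in SAS is the OUTPUT statement of PROC REG. A sketch, assuming a data set mydata with response y and predictor x (assumed names):

proc reg data=mydata;
   model y = x;
   /* raw residuals and externally studentized residuals per observation */
   output out=diag r=resid rstudent=rstud;
run;

proc print data=diag;
   where abs(rstud) > 3;   /* flag observations with unusually large studentized residuals */
run;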
Outliers

Solution: drop observations (but do so with caution!), and only if you believe there is an error in the data.

[Figures repeated from the previous slide, with and without observation 20.]

Red line: regression with observation 20, R² = 0.805

Blue line: regression without observation 20, R² = 0.892

37
High Leverage Points

Definition:
1. An observation x_i = (x_i1, ⋯, x_ip) that lies far from the other predictors.
2. If p = 1, its leverage is  h_i = 1/n + (x_i − x̄)² / Σ_{i′=1}^n (x_{i′} − x̄)².
3. The formula is more complicated if p > 1.
4. You only need to know that 1/n ≤ h_i ≤ 1 and that the average leverage is (1/n) Σ_{i=1}^n h_i = (p + 1)/n.
5. So if h_i is far above the average leverage, you should pay attention to it!

Effects: typically affect the model estimates substantially.

Detection: by plotting data and residuals.

38
High Leverage Points

Effects: typically affect the model estimates substantially.

Detection: by plotting data and residuals.

[Figures: observation 41 is both an outlier and a high-leverage point, observation 20 is only an outlier; panels show Y vs. X, X2 vs. X1, and studentized residuals vs. leverage.]

Red line: regression with all observations

Blue line: regression without observation 41

39
Influential Observations

Definition:
1. An observation whose deletion would noticeably change the results.
2. Intuitively, it is an outlier with high leverage.

Effects: typically affect the model estimates very substantially.

Detection: by plotting data and residuals.

40
Influential Observations

Definition:
1. An observation whose deletion would noticeably change the results.
2. Intuitively, it is an outlier with high leverage.

Effects: typically affect the model estimates very substantially.

Detection: using Cook's D measure, a statistic that flags data points with large residuals (outliers) and/or high leverage.

Rule of thumb: observations with D > 4/n should be inspected, where n is the number of observations.

41
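A sketch of computing leverage and Cook's D in SAS through PROC REG's OUTPUT statement, assuming a data set mydata with response y and predictors x1, x2, x3 (assumed names); the cutoff in the WHERE clause is the 4/n rule of thumb from the slide.

proc reg data=mydata;
   model y = x1 x2 x3;
   output out=infl h=leverage cookd=cooks_d;   /* leverage h_i and Cook's D */
run;

proc print data=infl;
   where cooks_d > 4/21;   /* 4/n, here written out for n = 21 observations */
run;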
Influential Observations

Solution:

1. Drop observations (but do so with caution!)

2. Use robust regression: without dropping points, use alternative assumptions to estimate the model.

   In robust regression, observations receive different weights (degrees of importance) based on how far they lie from the bulk of the distribution of the dependent and independent variables.

   In other words, you minimize the residual criterion in a different way, in order to adjust for the influential observations.

42
Influential Observations

OLS:
proc reg data=Name;
   model y = x;
run;
   minimizes  RSS = min_{β_0,β_1} Σ_{i=1}^n (y_i − β_0 − β_1 x_i)²

Robust regression (here with MM-estimation):
proc robustreg data=Name method=mm;
   model y = x;
run;
   Robust:  min_{β_0,β_1} Σ_{i=1}^n ρ(y_i − β_0 − β_1 x_i)
   MM:      min_{β_0,β_1} Σ_{i=1}^n χ(y_i − β_0 − β_1 x_i)

The ρ-function depends on our assumption about the error term distribution ϵ.

43
Computer Sales and Economic Status
What is the relationship between computer sales in a country and

1. Gross national income, e.g., GNP per head

2. Unemployment rate

3. Percentage spent on education

We observe data for 21 countries.

44
Example

45
Example

46
Regression Omitting Outlier

Parameter Estimates
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|
Intercept Intercept 1 -54.11423 45.35109 -1.19 0.2513
gnp_per_capita GNP per head 1 0.00166 0.00049088 3.37 0.0042
Unemployment_rate Unemployment rate 1 -0.56378 2.54459 -0.22 0.8276
percentage_education_spend %age spend on education 1 20.15309 7.73693 2.60 0.0199

47
Robust Regression (MM)

Parameter Estimates for Final Weighted Least Squares Fit


Parameter                    DF   Estimate   Standard Error   95% Confidence Limits   Chi-Square   Pr > ChiSq
Intercept 1 -32.2088 41.8908 -114.313 49.8957 0.59 0.4420
gnp_per_capita 1 0.0017 0.0005 0.0007 0.0027 11.43 0.0007
Unemployment_rate 1 -0.5279 2.5756 -5.5759 4.5202 0.04 0.8376
percentage_education 1 15.2276 6.5962 2.2993 28.1560 5.33 0.0210
Scale 1 29.9836

48
Collinearity
Predictors that are highly correlated.

[Figures: Age vs. Limit and Rating vs. Limit for the Credit data.]

They tend to increase and decrease together.

It is difficult to determine which variable matters (they both contain similar information).

49
Collinearity
If predictors are correlated, we cannot pin down a best fitted line.

[Figures: contours of the RSS as a function of the coefficients. Not correlated (β_Limit vs. β_Age): a well-defined minimum. Correlated (β_Limit vs. β_Rating): a long, narrow "valley".]

Implies:

1. Standard errors will be large.

2. Analysis (tests) might ignore important predictors.

3. You will fail to reject some false null hypotheses.

50
Collinearity

Balance ≈ f(age, limit, rating)

                        Coefficient   Std. error   t-statistic   p-value
Model 1   Intercept       −173.411       43.828       −3.957     < 0.0001
          age               −2.292        0.672       −3.407       0.0007
          limit              0.173        0.005       34.496     < 0.0001
Model 2   Intercept       −377.537       45.254       −8.343     < 0.0001
          rating             2.202        0.952        2.312       0.0213
          limit              0.025        0.064        0.384       0.7012

TABLE 3.11. The results for two multiple regression models involving the Credit data set are shown. Model 1 is a regression of balance on age and limit, and Model 2 a regression of balance on rating and limit.

51
Collinearity
Detection:

1. Inspect the correlation matrix of the predictors to find highly correlated variables.

2. Calculate the variance inflation factor (VIF):

   VIF(β̂_j) = 1 / (1 − R²_{X_j|X_−j})

   where R²_{X_j|X_−j} is the R² from a regression of X_j on the other predictors.

The smallest possible value of VIF is 1.

Rule of thumb: if VIF exceeds 10, there is a collinearity problem.

52
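PROC REG reports VIFs directly through the VIF option on the MODEL statement. A sketch, assuming a Credit-style data set with variables balance, age, rating, and limit (assumed names):

proc reg data=credit;
   /* VIF for each predictor; values above 10 suggest a collinearity problem */
   model balance = age rating limit / vif;
run;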
Collinearity
Solution:

1. Drop one of the collinear variables

2. Combine the collinear variables into an aggregate measure

Use principal component analysis to create non-collinear variables through linear combinations of the original variables.

Advantage: no data are dropped. Disadvantage: interpretation can be difficult.

3. Collect more data, to obtain new independent variation amongst the independent variables.


53
Principal Component Analysis
Let's assume we have p variables X_1, ⋯, X_p.
Principal component analysis finds a new set of variables T_1, ⋯, T_l such that:

1. X_i = Σ_{k=1}^{l} c_ik T_k

2. {T_1, ⋯, T_l} inherit the maximum possible variance from {X_1, ⋯, X_p}.

Intuition: we want linear combinations of the original variables that
1. summarize the information in our original variables, and
2. are as uncorrelated with each other as possible.

54
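A sketch of this in SAS with PROC PRINCOMP, assuming collinear predictors x1, x2, x3 and response y in mydata (assumed names); the component scores Prin1, Prin2, … written to the output data set can then replace the original predictors in the regression.

proc princomp data=mydata out=pca_scores;
   var x1 x2 x3;              /* original, possibly collinear predictors */
run;

proc reg data=pca_scores;
   model y = prin1 prin2;     /* regress on the first few (uncorrelated) components */
run;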
Collinearity vs. Omitted Variable Bias
Omitted Variable Bias: this creates an even bigger problem!

Omitting a variable that is correlated with both the response and other predictors

Leads to biased estimates


• you might incorrectly conclude that important predictors are unimportant!
• you might incorrectly conclude that unimportant predictors are important!

Collinearity:

Leads to noisy estimates


• you might incorrectly conclude that important predictors are unimportant!

55
Heteroscedasticity

56
Heteroscedasticity

57
Heteroscedasticity

We assumed that Var(ϵ) = E[ϵϵ^⊤] = σ²I, i.e., each error term has the same variance ("homoskedasticity").

[Figure: Y vs. X, with the spread of Y growing as X grows; it seems this is not the case here.]

58
Heteroscedasticity

If Var(ϵ) = E[ϵϵ^⊤] = diag(σ_1², …, σ_n²), then different observations can have "more/less" error than others,
e.g. higher values of X might have more error than lower values of X (surveys, income measures, etc.).

Implies: the calculated standard errors are too small!

Hypothesis tests on coefficients will over-reject (false positives).

59
Heteroscedasticity

Different observations can have "more/less" error than others.

[Figures: residuals vs. fitted values for response Y (funnel-shaped, growing spread) and for response log(Y) (roughly constant spread).]

60
Heteroscedasticity
Solution 1: transform the dependent variable.

Use log(Y) or √Y (or transform a predictor, e.g. log(X_j) or √X_j).

[Figures: residuals vs. fitted values for response Y and for response log(Y), as on the previous slide.]

Disadvantage: the right transformation must be found by trial and error.

61
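A sketch of this fix in SAS, assuming mydata with response y and predictor x (assumed names): fit the model on log(y) and inspect the residual plot again.

data mydata_log;
   set mydata;
   log_y = log(y);   /* natural log of the response */
run;

proc reg data=mydata_log;
   model log_y = x;
   output out=resid_check r=resid p=fitted;   /* residuals and fitted values for plotting */
run;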
Heteroscedasticity
Solution 2: use "robust" methods for calculating standard errors.

proc reg data=mydata;
   model y = x / acov;
run;

We no longer assume homoscedasticity or uncorrelated error terms. The error covariance is allowed to be a general matrix

Var(ϵ) = E[ϵϵ^⊤] = [ σ_1²  ρ_12  …  ρ_1n ;  ρ_12  σ_2²  ⋯  ρ_2n ;  ⋮ ;  ρ_1n  ρ_2n  ⋯  σ_n² ]

where the off-diagonal terms allow ϵ_i to provide information about ϵ_{i+1}.

Pros: this method "fixes" the incorrect standard errors, and as a result the p-values are correct.
Cons: it often produces large standard errors and therefore wide confidence intervals.

62
