Theory 3. Linear Regression With One Regressor (Textbook Chapter 4)
(SW Chapter 4)
The problems of statistical inference for linear regression
are, at a general level, the same as for estimation of the mean
or of the differences between two means. Statistical, or
econometric, inference about the slope entails:
• Estimation:
o How should we draw a line through the data to estimate
the (population) slope (answer: ordinary least squares).
o What are the advantages and disadvantages of OLS?
• Hypothesis testing:
o How to test if the slope is zero?
• Confidence intervals:
o How to construct a confidence interval for the slope?
Linear Regression: Some Notation and Terminology
(SW Section 4.1)
The Ordinary Least Squares Estimator
(SW Section 4.2)
Mechanics of OLS
The population regression line: Test Score = β0 + β1STR
β1 = ΔTest score / ΔSTR = ??
The OLS estimator solves:  min over (b0, b1) of ∑ᵢ₌₁ⁿ [Yᵢ − (b0 + b1Xᵢ)]²
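To make the mechanics concrete, here is a minimal Python (numpy) sketch that solves this minimization via the closed-form OLS formulas; the data arrays are made up for illustration, not the California data:

import numpy as np

# Hypothetical illustration data (not the California dataset).
X = np.array([15.0, 17.0, 19.0, 21.0, 23.0, 25.0])        # class sizes
Y = np.array([680.0, 670.0, 662.0, 655.0, 650.0, 640.0])  # test scores

# Closed-form solution to min over (b0,b1) of sum_i [Y_i - (b0 + b1*X_i)]^2:
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
print(b0, b1)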
Application to the California Test Score – Class Size data
Estimated regression line:  TestScoreˆ = 698.9 – 2.28×STR
Interpretation of the estimated slope and intercept
TestScoreˆ = 698.9 – 2.28×STR
• Districts with one more student per teacher have, on average, test scores that are 2.28 points lower.
• That is, ΔTest score / ΔSTR = –2.28
• The intercept (taken literally) means that, according to this
estimated line, districts with zero students per teacher
would have a (predicted) test score of 698.9.
• This interpretation of the intercept makes no sense – it
extrapolates the line outside the range of the data – here,
the intercept is not economically meaningful.
Predicted values & residuals:
One of the districts in the data set is Antelope, CA, for which
STR = 19.33 and Test Score = 657.8
predicted value:  ŶAntelope = 698.9 – 2.28×19.33 = 654.8
residual:  ûAntelope = 657.8 – 654.8 = 3.0
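The same arithmetic in a few lines of Python (the coefficients are those of the estimated line above):

# Fitted line: TestScore^ = 698.9 - 2.28*STR
b0, b1 = 698.9, -2.28
str_antelope, score_antelope = 19.33, 657.8

predicted = b0 + b1 * str_antelope        # 698.9 - 2.28*19.33 = 654.8
residual = score_antelope - predicted     # 657.8 - 654.8 = 3.0
print(round(predicted, 1), round(residual, 1))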
OLS regression: STATA output
-------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
str | -2.279808 .5194892 -4.39 0.000 -3.300945 -1.258671
_cons | 698.933 10.36436 67.44 0.000 678.5602 719.3057
-------------------------------------------------------------------------
TestScoreˆ = 698.9 – 2.28×STR
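A hedged sketch of how this regression could be replicated in Python with statsmodels; the file name caschool.csv is an assumption – point pd.read_csv at wherever the California data is stored (cov_type="HC1" reproduces STATA's robust standard errors):

import pandas as pd
import statsmodels.api as sm

# "caschool.csv" is a hypothetical file name; adjust to your copy of the
# California school data. Column names match the STATA output above.
df = pd.read_csv("caschool.csv")
X = sm.add_constant(df["str"])                          # adds the intercept ("_cons")
results = sm.OLS(df["testscr"], X).fit(cov_type="HC1")  # HC1 = STATA robust SEs
print(results.summary())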
Measures of Fit (SW Section 4.3)

Definition of R²:

R² = ESS/TSS = ∑ᵢ₌₁ⁿ (Ŷᵢ – Ȳ)² / ∑ᵢ₌₁ⁿ (Yᵢ – Ȳ)²

where ESS = ∑ᵢ₌₁ⁿ (Ŷᵢ – Ȳ)² is the explained sum of squares and TSS = ∑ᵢ₌₁ⁿ (Yᵢ – Ȳ)² is the total sum of squares.
• R² = 0 means ESS = 0
• R² = 1 means ESS = TSS
• 0 ≤ R² ≤ 1
• For regression with a single X, R² = the square of the correlation coefficient between X and Y
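A small numpy simulation illustrating the last bullet – in a single-regressor regression, ESS/TSS equals the squared sample correlation (the data here are simulated, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(20.0, 2.0, size=200)                     # simulated regressor
Y = 700.0 - 2.3 * X + rng.normal(0.0, 15.0, size=200)   # simulated outcome

# OLS fit and fitted values.
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

R2 = np.sum((Y_hat - Y.mean()) ** 2) / np.sum((Y - Y.mean()) ** 2)  # ESS/TSS
print(R2, np.corrcoef(X, Y)[0, 1] ** 2)   # identical up to rounding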
The Standard Error of the Regression (SER)

SER = √[ (1/(n–2)) ∑ᵢ₌₁ⁿ (ûᵢ – û̄)² ] = √[ (1/(n–2)) ∑ᵢ₌₁ⁿ ûᵢ² ]

(the second equality holds because the mean residual û̄ = (1/n) ∑ᵢ₌₁ⁿ ûᵢ = 0).
SER = √[ (1/(n–2)) ∑ᵢ₌₁ⁿ ûᵢ² ]

The SER:
• has the units of u, which are the units of Y
• measures the average “size” of the OLS residual (the average “mistake” made by the OLS regression line)
• The root mean squared error (RMSE) is closely related to the SER:

RMSE = √[ (1/n) ∑ᵢ₌₁ⁿ ûᵢ² ]
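A minimal Python helper computing both quantities from a vector of residuals (the residual values below are made up):

import numpy as np

def ser_and_rmse(u_hat):
    """SER divides by n-2 (degrees-of-freedom correction); RMSE divides by n."""
    n = u_hat.size
    ser = np.sqrt(np.sum(u_hat ** 2) / (n - 2))
    rmse = np.sqrt(np.sum(u_hat ** 2) / n)
    return ser, rmse

u_hat = np.array([3.0, -1.5, 2.2, -4.1, 0.4])   # made-up residuals
print(ser_and_rmse(u_hat))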
Technical note: why divide by n–2 instead of n–1?

SER = √[ (1/(n–2)) ∑ᵢ₌₁ⁿ ûᵢ² ]

Division by n–2 is a “degrees of freedom” correction, like the division by n–1 in the sample variance sY², except that here two coefficients (β0 and β1) were estimated to construct the residuals, rather than one (the sample mean). When n is large, it makes negligible difference whether you divide by n, n–1, or n–2.
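A Monte Carlo sketch of why n–2 is the right divisor, under an assumed simulation design (not from the textbook): dividing the sum of squared residuals by n–2 estimates var(u) without bias, while dividing by n is biased downward in small samples:

import numpy as np

rng = np.random.default_rng(1)
n, sigma_u, reps = 10, 5.0, 20000
est_n2, est_n = [], []
for _ in range(reps):
    X = rng.normal(20.0, 2.0, size=n)
    Y = 700.0 - 2.3 * X + rng.normal(0.0, sigma_u, size=n)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b0 = Y.mean() - b1 * X.mean()
    u_hat = Y - b0 - b1 * X
    est_n2.append(np.sum(u_hat ** 2) / (n - 2))
    est_n.append(np.sum(u_hat ** 2) / n)
print(np.mean(est_n2), np.mean(est_n), sigma_u ** 2)  # ~25, ~20, 25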
Example of the R² and the SER

TestScoreˆ = 698.9 – 2.28×STR, R² = .05, SER = 18.6

STR explains only a small fraction of the variation in test scores, and the SER says the typical prediction mistake is large (18.6 points) – but a low R² does not, by itself, mean that STR is unimportant in a policy sense.
Least squares assumption #1: E(u|X = x) = 0.

For any given value of X, the mean of u is zero.

Least squares assumption #2: (Xᵢ, Yᵢ), i = 1,…,n, are i.i.d.

This is the case if (X, Y) are collected by simple random sampling.
Least squares assumption #3: Large outliers are rare

Technical statement: E(X⁴) < ∞ and E(Y⁴) < ∞
OLS can be sensitive to an outlier: a single extreme observation can pull the estimated regression line far away from the rest of the data.
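A short simulation illustrating the point (assumed design, for illustration only): adding a single high-leverage observation sharply changes the OLS slope:

import numpy as np

def ols_slope(X, Y):
    return np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)

rng = np.random.default_rng(2)
X = rng.normal(20.0, 2.0, size=50)
Y = 700.0 - 2.3 * X + rng.normal(0.0, 5.0, size=50)
print(ols_slope(X, Y))   # close to the true slope, -2.3

# Add a single high-leverage outlier: extreme X paired with a wild Y value.
X_out = np.append(X, 40.0)
Y_out = np.append(Y, 900.0)
print(ols_slope(X_out, Y_out))   # the slope moves sharply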
The mean and variance of the sampling distribution of βˆ1

Some preliminary algebra:

Yᵢ = β0 + β1Xᵢ + uᵢ
Ȳ = β0 + β1X̄ + ū
so Yᵢ – Ȳ = β1(Xᵢ – X̄) + (uᵢ – ū)

Thus,

βˆ1 = ∑ᵢ₌₁ⁿ (Xᵢ – X̄)(Yᵢ – Ȳ) / ∑ᵢ₌₁ⁿ (Xᵢ – X̄)²

   = ∑ᵢ₌₁ⁿ (Xᵢ – X̄)[β1(Xᵢ – X̄) + (uᵢ – ū)] / ∑ᵢ₌₁ⁿ (Xᵢ – X̄)²
βˆ1 = β1 × [∑ᵢ₌₁ⁿ (Xᵢ – X̄)(Xᵢ – X̄) / ∑ᵢ₌₁ⁿ (Xᵢ – X̄)²] + [∑ᵢ₌₁ⁿ (Xᵢ – X̄)(uᵢ – ū) / ∑ᵢ₌₁ⁿ (Xᵢ – X̄)²]

so βˆ1 – β1 = ∑ᵢ₌₁ⁿ (Xᵢ – X̄)(uᵢ – ū) / ∑ᵢ₌₁ⁿ (Xᵢ – X̄)².

Now ∑ᵢ₌₁ⁿ (Xᵢ – X̄)(uᵢ – ū) = ∑ᵢ₌₁ⁿ (Xᵢ – X̄)uᵢ – [∑ᵢ₌₁ⁿ (Xᵢ – X̄)]ū

  = ∑ᵢ₌₁ⁿ (Xᵢ – X̄)uᵢ – [(∑ᵢ₌₁ⁿ Xᵢ) – nX̄]ū

  = ∑ᵢ₌₁ⁿ (Xᵢ – X̄)uᵢ

(the second term drops out because ∑ᵢ₌₁ⁿ Xᵢ = nX̄).
Substitute ∑ᵢ₌₁ⁿ (Xᵢ – X̄)(uᵢ – ū) = ∑ᵢ₌₁ⁿ (Xᵢ – X̄)uᵢ into the expression for βˆ1 – β1:

βˆ1 – β1 = ∑ᵢ₌₁ⁿ (Xᵢ – X̄)(uᵢ – ū) / ∑ᵢ₌₁ⁿ (Xᵢ – X̄)²

so

βˆ1 – β1 = ∑ᵢ₌₁ⁿ (Xᵢ – X̄)uᵢ / ∑ᵢ₌₁ⁿ (Xᵢ – X̄)²
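This identity is exact, which is easy to verify numerically when the errors u are known by construction, as in this simulated sketch:

import numpy as np

rng = np.random.default_rng(3)
beta0, beta1, n = 5.0, 2.0, 100
X = rng.normal(0.0, 1.0, size=n)
u = rng.normal(0.0, 1.0, size=n)
Y = beta0 + beta1 * X + u

b1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
rhs = np.sum((X - X.mean()) * u) / np.sum((X - X.mean()) ** 2)
print(b1_hat - beta1, rhs)   # equal up to floating-point error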
Now we can calculate E(βˆ1) and var(βˆ1):

E(βˆ1) – β1 = E[ ∑ᵢ₌₁ⁿ (Xᵢ – X̄)uᵢ / ∑ᵢ₌₁ⁿ (Xᵢ – X̄)² ]

  = E{ E[ ∑ᵢ₌₁ⁿ (Xᵢ – X̄)uᵢ / ∑ᵢ₌₁ⁿ (Xᵢ – X̄)² | X1,…,Xn ] }  (law of iterated expectations)

  = 0 because E(uᵢ|Xᵢ = x) = 0 by LSA #1

• Thus LSA #1 implies that E(βˆ1) = β1
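A Monte Carlo sketch of unbiasedness under an assumed design in which u is drawn independently of X (so E(u|X) = 0 holds by construction):

import numpy as np

rng = np.random.default_rng(4)
beta1, n, reps = 2.0, 50, 10000
draws = []
for _ in range(reps):
    X = rng.normal(0.0, 1.0, size=n)
    u = rng.normal(0.0, 1.0, size=n)     # u independent of X, so E(u|X) = 0
    Y = 1.0 + beta1 * X + u
    draws.append(np.sum((X - X.mean()) * (Y - Y.mean()))
                 / np.sum((X - X.mean()) ** 2))
print(np.mean(draws))   # close to 2.0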
Next we calculate var(βˆ1). First write

βˆ1 – β1 = [ (1/n) ∑ᵢ₌₁ⁿ vᵢ ] / [ ((n–1)/n) sX² ]

where vᵢ = (Xᵢ – X̄)uᵢ. If n is large, sX² ≈ σX² and (n–1)/n ≈ 1, so

βˆ1 – β1 ≈ [ (1/n) ∑ᵢ₌₁ⁿ vᵢ ] / σX²,

and therefore var(βˆ1) ≈ var(vᵢ) / (n·σX⁴).
Summary so far
• βˆ1 is unbiased: E(βˆ1) = β1 – just like Ȳ!
• var(βˆ1) is proportional to 1/n – just like var(Ȳ).
What is the sampling distribution of βˆ1 ?
The exact sampling distribution is complicated – it
depends on the population distribution of (Y, X) – but when n
is large we get some simple (and good) approximations:
(1) Because var(βˆ1) ∝ 1/n and E(βˆ1) = β1, βˆ1 →ᵖ β1 (βˆ1 is consistent)
(2) Large-n approximation to the distribution of βˆ1:

βˆ1 – β1 = [ (1/n) ∑ᵢ₌₁ⁿ vᵢ ] / [ ((n–1)/n) sX² ] ≈ [ (1/n) ∑ᵢ₌₁ⁿ vᵢ ] / σX², where vᵢ = (Xᵢ – X̄)uᵢ

• When n is large, vᵢ = (Xᵢ – X̄)uᵢ ≈ (Xᵢ – μX)uᵢ, which is i.i.d. (why?) and has var(vᵢ) < ∞ (why?). So, by the CLT, (1/n) ∑ᵢ₌₁ⁿ vᵢ is approximately distributed N(0, σv²/n).
• Thus, for large n, βˆ1 is approximately distributed

βˆ1 ~ N( β1, σv²/(n·σX⁴) ), where vᵢ = (Xᵢ – μX)uᵢ
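A simulation sketch checking this variance formula under an assumed design (with u independent of X, var(vᵢ) = σX²·σu²):

import numpy as np

rng = np.random.default_rng(5)
sigma_X, sigma_u, n, reps = 2.0, 5.0, 200, 5000
draws = []
for _ in range(reps):
    X = rng.normal(20.0, sigma_X, size=n)
    u = rng.normal(0.0, sigma_u, size=n)
    Y = 700.0 - 2.3 * X + u
    draws.append(np.sum((X - X.mean()) * (Y - Y.mean()))
                 / np.sum((X - X.mean()) ** 2))
# With u independent of X, var[(X - mu_X)u] = sigma_X^2 * sigma_u^2.
theory = (sigma_X ** 2 * sigma_u ** 2) / (n * sigma_X ** 4)
print(np.var(draws), theory)   # the two numbers should be close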
The larger the variance of X, the smaller the variance of βˆ1

The math:

var(βˆ1 – β1) = (1/n) × var[(Xᵢ – μX)uᵢ] / σX⁴

where σX² = var(Xᵢ). The variance of X enters squared in the denominator, so increasing the spread of X decreases the variance of βˆ1.

The intuition:

If there is more variation in X, then there is more information in the data that you can use to fit the regression line. This is most easily seen in a figure…
[Figure: two overlaid scatterplots with the same number of points – one group tightly clustered in X, the other widely spread in X]

There are the same number of black and blue dots; with which would you get a more accurate estimate of the regression line?
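The same point numerically, under an assumed design: two samples with identical n and error variance but different spread in X:

import numpy as np

rng = np.random.default_rng(6)

def slope_sd(sigma_X, reps=3000, n=100):
    # Standard deviation of the OLS slope across simulated samples.
    draws = []
    for _ in range(reps):
        X = rng.normal(0.0, sigma_X, size=n)
        Y = 2.0 * X + rng.normal(0.0, 1.0, size=n)
        draws.append(np.sum((X - X.mean()) * (Y - Y.mean()))
                     / np.sum((X - X.mean()) ** 2))
    return np.std(draws)

print(slope_sd(sigma_X=1.0), slope_sd(sigma_X=3.0))   # second is ~3x smaller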
Summary of the sampling distribution of βˆ1:

If the three Least Squares Assumptions hold, then
• The exact (finite-sample) sampling distribution of βˆ1 has:
o E(βˆ1) = β1 (that is, βˆ1 is unbiased)
o var(βˆ1) = (1/n) × var[(Xᵢ – μX)uᵢ] / σX⁴ ∝ 1/n.
• Other than its mean and variance, the exact distribution of βˆ1 is complicated and depends on the distribution of (X, u)
• βˆ1 →ᵖ β1 (that is, βˆ1 is consistent)
• When n is large, (βˆ1 – E(βˆ1)) / √var(βˆ1) ~ N(0,1) (CLT)
• This parallels the sampling distribution of Ȳ.
We are now ready to turn to hypothesis tests & confidence
intervals…