MULTIPLE REGRESSION ANALYSIS: STATISTICAL
PROPERTIES
Introductory Econometrics: A Modern Approach, 5e
Haoming Liu
National University of Singapore
August 21, 2022
The Expected Value of the OLS Estimators
The Variance of the OLS Estimators
Efficiency of OLS: The Gauss-Markov Theorem
Recap
We motivated OLS estimation using the population regression function
E(y|x1, x2, ..., xk) = β0 + β1x1 + β2x2 + ... + βkxk
Given n observations, we obtained the sample regression function,
ŷ = β̂0 + β̂1x1 + β̂2x2 + ... + β̂kxk,
by choosing the β̂j to minimize the sum of squared residuals.
The β̂j have a ceteris paribus interpretation. For example,
ŷ = β̂1∆x1, if ∆x2 = ... = ∆xk = 0
Recap
We discussed algebraic properties of fitted values and residuals. These
hold for any sample. Also, features of R2, the goodness-of-fit
measure.
The only assumption we needed to discuss the algebraic properties for
a given sample is that we can actually compute the estimates.
Now we turn to statistical properties and study the sampling
distributions of the estimators. We go further than in the
simple regression case, eventually covering statistical inference.
MLR
As with simple regression, there is a set of assumptions under which
OLS is unbiased. We also explicitly consider the bias caused by
omitting a variable that appears in the population model.
Now we label the assumptions with “MLR” (multiple linear
regression).
Assumption MLR.1 (Linear in Parameters)
The model in the population can be written as
y = β0 + β1x1 + β2x2 + ... + βkxk + u
where the βj are the population parameters and u is the unobserved error.
We have seen examples already where y and the xj can be nonlinear
functions of underlying variables, and so the model is flexible.
Assumption MLR.2 (Random Sampling)
We have a random sample of size n from the population,
{(xi1, xi2, ..., xik, yi ) : i = 1, ..., n}
As with SLR, this assumption introduces the data and implies the
data are a representative sample from the population.
Sometimes we will plug a random draw into the population equation:
yi = β0 + β1xi1 + β2xi2 + ... + βkxik + ui ,
which emphasizes that, along with the observed variables, we
effectively draw an unobserved error, ui , for each unit i.
As an example,
log(wagei) = β0 + β1educi + β2IQi + β3experi + β4experi² + ui
lwagei = β0 + β1educi + β2IQi + β3experi + β4expersqi + ui
Assumption MLR.3 (No Perfect Collinearity)
In the sample (and, therefore, in the population), none of the explanatory
variables is constant, and there are no exact linear relationships among
them.
The need to rule out cases where {xij : i = 1, ..., n} has no variation
for each j is clear from simple regression.
There is a new part to the assumption because we have more than
one explanatory variable. We must rule out the (extreme) case that
one (or more) of the explanatory variables is an exact linear function
of the others.
If, say, xi1 is an exact linear function of xi2, ..., xik in the sample, we
say the model suffers from perfect collinearity.
Under perfect collinearity, there are no unique OLS estimators. Stata
and other regression packages will indicate a problem.
Examples of Perfect Collinearity
x1 = 2 ∗ x2
x1 = x3 + 3 ∗ x4
yi = β0 + β1x1i + β2x2i + ui = β0 + 0.5 ∗ β1x1i + 2 ∗ β2x2i + ui
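To see the first example in action, here is a minimal Stata sketch with simulated data (all variable names and numbers here are made up for illustration). Generating x1 as an exact multiple of x2 and regressing y on both produces a collinearity note and one of the two regressors is dropped, just as in the VOTE1 output later in these slides:
. clear
. set obs 100
. set seed 123
. gen x2 = rnormal()
. gen x1 = 2*x2                  // x1 is an exact linear function of x2
. gen y = 1 + 0.5*x1 + rnormal()
. reg y x1 x2                    // Stata drops one regressor and notes the collinearity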
Assumption MLR.3 (No Perfect Collinearity)
Usually perfect collinearity arises from a bad specification of the
population model. A small sample size or bad luck in drawing the
sample can also be the culprit.
Assumption MLR.3 can only hold if n ≥ k + 1, that is, we must have
at least as many observations as we have parameters to estimate.
Suppose that k = 2 and x1 = educ, x2 = exper. If we draw our
sample so that
educi = 2experi
for every i, then MLR.3 is violated. This is very unlikely unless the
sample is small. (In any realistic population there are plenty of people
whose education level is not twice their years of workforce experience.)
Assumption MLR.3 (No Perfect Collinearity)
With the samples we have looked at (n = 680, n = 759, even
n = 173), the presence of perfect collinearity is usually a result of
poor model specification, or defining variables inappropriately.
Such problems can almost always be detected by remembering the
ceteris paribus nature of multiple regression.
EXAMPLE: Do not include the same variable in an equation that is
measured in different units. For example, in a CEO salary equation, it
would make no sense to include firm sales measured in dollars along with
sales measured in millions of dollars. There is no new information once we
include one of these.
The return on equity should be included as a percent or proportion, but
not both.
Assumption MLR.3 (No Perfect Collinearity)
EXAMPLE: Be careful with functional forms! Suppose we start with a
constant elasticity model of family consumption:
log(cons) = β0 + β1 log(inc) + u
How might we allow the elasticity to be nonconstant, but include the
above as a special case? The following does not work:
log(cons) = β0 + β1 log(inc) + β2 log(inc²) + u
because log(inc²) = 2 log(inc), that is, x2 = 2x1, where x1 = log(inc).
Assumption MLR.3 (No Perfect Collinearity)
Instead, we probably mean something like
log(cons) = β0 + β1 log(inc) + β2[log(inc)]² + u
which means x2 = x1². With this choice, x2 is an exact nonlinear
function of x1, but this (fortunately) is allowed in MLR.3.
Tracking down perfect collinearity can be harder when it involves
more than two variables.
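In Stata terms, the two functional forms behave very differently (a hedged sketch; inc and lcons stand in for family income and log consumption in some hypothetical data set):
. gen linc   = log(inc)
. gen lincsq = linc^2          // [log(inc)]^2: a nonlinear function of linc, allowed by MLR.3
. gen linc2  = log(inc^2)      // = 2*log(inc): perfectly collinear with linc
. reg lcons linc lincsq        // runs fine
. reg lcons linc linc2         // one regressor is dropped because of collinearity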
Assumption MLR.3 (No Perfect Collinearity)
EXAMPLE: In VOTE1.DTA, consider regressing voteA on expendA, expendB,
and totexpend = expendA + expendB. Because totexpend is an exact linear
function of the two spending variables, MLR.3 fails.
Assumption MLR.3 (No Perfect Collinearity)
One of the three variables has to be dropped. (Stata does this
automatically, but we should rely on ourselves to properly construct a
model and interpret it.)
The model makes no sense from a ceteris paribus perspective. For
example, β1 is supposed to measure the effect of changing expendA on
voteA, holding fixed expendB and totexpend. But if expendB and
totexpend are held fixed, expendA cannot change!
We would probably drop totexpend and just use the two separate
spending variables.
Assumption MLR.3 (No Perfect Collinearity)
. gen totexpend = expendA + expendB
. reg voteA expendA expendB totexpend
note: expendA omitted because of collinearity
Source | SS df MS Number of obs = 173
-------------+------------------------------ F( 2, 170) = 95.83
Model | 25679.8879 2 12839.944 Prob > F = 0.0000
Residual | 22777.3606 170 133.984474 R-squared = 0.5299
-------------+------------------------------ Adj R-squared = 0.5244
Total | 48457.2486 172 281.728189 Root MSE = 11.575
------------------------------------------------------------------------------
voteA | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
expendA | (omitted)
expendB | -.0744583 .0053848 -13.83 0.000 -.0850879 -.0638287
totexpend | .0383308 .0033868 11.32 0.000 .0316452 .0450165
_cons | 49.619 1.426147 34.79 0.000 46.80376 52.43423
------------------------------------------------------------------------------
Assumption MLR.3 (No Perfect Collinearity)
. reg voteA expendA expendB
Source | SS df MS Number of obs = 173
-------------+------------------------------ F( 2, 170) = 95.83
Model | 25679.8879 2 12839.9439 Prob > F = 0.0000
Residual | 22777.3607 170 133.984474 R-squared = 0.5299
-------------+------------------------------ Adj R-squared = 0.5244
Total | 48457.2486 172 281.728189 Root MSE = 11.575
------------------------------------------------------------------------------
voteA | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
expendA | .0383308 .0033868 11.32 0.000 .0316452 .0450165
expendB | -.0361275 .0031071 -11.63 0.000 -.042261 -.0299939
_cons | 49.619 1.426147 34.79 0.000 46.80376 52.43423
------------------------------------------------------------------------------
Assumption MLR.3 (No Perfect Collinearity)
The results of the previous regression seem sensible: spending by
candidate A has a positive effect on the share of the vote received by
A, and spending by B has essentially the opposite effect. (If expendA
increases by 10, that is, by $10,000, and expendB is held fixed, voteA is
estimated to increase by about .38 percentage points.)
Note that shareA, which is a nonlinear function of expendA and
expendB,
shareA = 100 · expendA/(expendA + expendB),
can be included along with the two expenditure variables. It allows for
the relative size of spending to matter.
Assumption MLR.3 (No Perfect Collinearity)
. reg voteA expendA expendB shareA
Source | SS df MS Number of obs = 173
-------------+------------------------------ F( 3, 169) = 346.87
Model | 41687.0627 3 13895.6876 Prob > F = 0.0000
Residual | 6770.18584 169 40.0602712 R-squared = 0.8603
-------------+------------------------------ Adj R-squared = 0.8578
Total | 48457.2486 172 281.728189 Root MSE = 6.3293
------------------------------------------------------------------------------
voteA | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
expendA | -.0064488 .0029065 -2.22 0.028 -.0121866 -.000711
expendB | .0049463 .0026662 1.86 0.065 -.000317 .0102097
shareA | .5096844 .0254977 19.99 0.000 .4593494 .5600194
_cons | 24.96397 1.459247 17.11 0.000 22.08327 27.84467
------------------------------------------------------------------------------
Assumption MLR.3 (No Perfect Collinearity)
A Key Point
Assumption MLR.3 does not say the explanatory variables have to be
uncorrelated – in the population or sample. Nor does it say they cannot be
“highly” correlated. MLR.3 rules out perfect correlation in the sample,
that is, correlations of ±1.
Again, in practice violations of MLR.3 are rare unless a mistake has
been made in specifying the model.
Multiple regression would be useless if we had to insist x1, ..., xk were
uncorrelated in the sample (or population)!
If the xj were all pairwise uncorrelated, we could just use a bunch of
simple regressions.
Assumption MLR.3 (No Perfect Collinearity)
In an equation like
lwage = β0 + β1educ + β2IQ + β3exper + u,
we fully expect correlation among educ, IQ, and exper. (Already saw
educ and IQ are positively correlated; educ and exper tend to be
negatively correlated (why?).)
If educ were uncorrelated with all other variables that affect lwage, we
could stick with simple regression of lwage on educ to estimate β1.
Multiple regression allows us to estimate ceteris paribus effects
precisely when there is correlation among the xj
MLR.1 to MLR.3
1 y = β0 + β1x1 + β2x2 + ... + βkxk + u
2 random sampling from the population
3 no perfect collinearity in the sample
The last assumption ensures that the OLS estimators are unique and can be
obtained from the first-order conditions (minimizing the sum of squared
residuals).
We need a final assumption for unbiasedness.
Assumption MLR.4 (Zero Conditional Mean)
E(u|x1, x2, ..., xk) = 0 for all (x1, ..., xk)
Remember, the real assumption is E(u|x1, x2, ..., xk) = E(u): the
average value of the error does not change across different slices of
the population defined by x1, ..., xk. Setting E(u) = 0 essentially
defines β0.
If u is correlated with any of the xj , MLR.4 is violated. This is usually
a good way to think about the problem.
Assumption MLR.4 (Zero Conditional Mean)
When Assumption MLR.4 holds, we say x1, ..., xk are exogenous
explanatory variables. If xj is correlated with u, we often say xj is an
endogenous explanatory variable.
Assumption MLR.4 (Zero Conditional Mean)
EXAMPLE: Effects of Class Size on Student Performance
Suppose, for a standardized test score,
score = β0 + β1classize + β2income + u
Even at the same income level, families differ in their interest and
concern about their children’s education. Family support and student
motivation are in u. Are these correlated with class size even though
we have included income? Probably.
Unbiasedness of OLS
Theorem
Under Assumptions MLR.1 through MLR.4, and conditional on
{(xi1, ..., xik) : i = 1, ..., n}, the OLS estimators are unbiased:
E(β̂j ) = βj , j = 0, 1, 2, ..., k
for any values of the βj .
The easiest proof requires matrix algebra. See Appendix 3A for a
proof based on summations.
Often the hope is that if our focus is on, say, x1, we can include
enough other variables in x2, ..., xk to make MLR.4 true, or close to
true.
Inclusion of Irrelevant Variables
It is important to see that the unbiasedness result allows for the βj to
be any value, including zero.
Suppose, then, that we specify the model
lwage = β0 + β1educ + β2exper + β3motheduc + u,
where MLR.1 through MLR.4 hold. Suppose that β3 = 0, but we do
not know that. We estimate the full model by OLS, obtaining the fitted equation
lwage = β̂0 + β̂1educ + β̂2exper + β̂3motheduc
Inclusion of Irrelevant Variables
We automatically know from the unbiasedness result that
E(β̂j ) = βj , j = 0, 1, 2
E(β̂3) = 0
The result that including an irrelevant variable, or overspecifying
the model, does not cause bias in any coefficients follows directly from
the general unbiasedness result. Including an irrelevant variable does,
however, typically increase the standard errors of the other estimates.
Omitted Variable Bias
Leaving a variable out when it should be included in a multiple
regression is a serious problem. This is called excluding a relevant
variable or underspecifying the model.
We can perform a misspecification analysis in this case. The
general case is more complicated.
Consider the case where the correct model has two explanatory
variables:
y = β0 + β1x1 + β2x2 + u
satisfies MLR.1 through MLR.4.
Omitted Variable Bias
If we regress y on x1 and x2, we know the resulting estimators will be
unbiased. But suppose we leave out x2 and use simple regression of y
on x1:
ỹ = β̃0 + β̃1x1
In most cases, we omit x2 because we cannot collect data on it.
We can easily derive the bias in β̃1 (conditional on the sample
outcomes {(xi1, xi2) : i = 1, ..., n}).
Omitted Variable Bias
We already have a relationship between β̃1 and the multiple regression
estimator, β̂1:
β̃1 = β̂1 + β̂2δ̃1
where β̂2 is the multiple regression estimator of β2 and δ̃1 is the slope
coefficient in the auxiliary regression
xi2 on xi1, i = 1, ..., n
Omitted Variable Bias
Now just use the fact that β̂1 and β̂2 are unbiased (or would be if we could
compute them): Conditional on the sample values of x1 and x2,
E(β̃1) = E(β̂1) + E(β̂2)δ̃1
= β1 + β2δ̃1
Therefore,
Bias(β̃1) = β2δ̃1
Recall that δ̃1 has the same sign as the sample correlation Corr(xi1, xi2).
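The algebraic relationship β̃1 = β̂1 + β̂2δ̃1 can be verified in any data set. For example, a quick check in WAGE2.DTA (the data set used later in these slides) might look like the sketch below, provided all three regressions use the same observations:
. use WAGE2, clear
. reg lwage educ IQ            // "long" regression: bhat1 (on educ) and bhat2 (on IQ)
. reg IQ educ                  // auxiliary regression: delta1 is the slope on educ
. reg lwage educ               // "short" regression: btilde1
In the sample, the coefficient on educ from the short regression equals bhat1 + bhat2 × delta1 exactly, which is the identity used in the bias derivation.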
Omitted Variable Bias
The simple regression estimator is unbiased (for the given outcomes
{(xi1, xi2)}) in two cases.
1. β2 = 0. But this means that x2 does not appear in the population
model, so simple regression is the right thing to do.
2. Corr(xi1, xi2) = 0 (in the sample). Then the simple and multiple
regression estimators are identical because δ̃1 = 0.
If β2 ≠ 0 and Corr(xi1, xi2) ≠ 0, then β̃1 is generally biased. We do not
know β2 and might only have a vague idea about the size of δ̃1. But we
often can guess at the signs.
Omitted Variable Bias
Technically, the bias computed holds for a particular “sample” on (x1, x2).
But acting as if what matters is correlation between x1 and x2 in the
population gives us the correct answer when we turn to asymptotic
analysis.
In what follows, we do not make the distinction between the sample and
population correlation between x1 and x2.
Bias in the Simple Regression Estimator of β1
           Corr(x1, x2) > 0    Corr(x1, x2) < 0
β2 > 0     Positive Bias       Negative Bias
β2 < 0     Negative Bias       Positive Bias
EXAMPLE: Omitted Ability Bias
lwage = β0 + β1educ + β2abil + u
where abil is “ability.” Essentially by definition, β2 > 0. We also think
Corr(educ, abil) > 0
so that higher ability people get more education, on average.
EXAMPLE: Omitted Ability Bias
In this scenario,
E(β̃1) > β1
so there is an upward bias in simple regression. Our failure to control for
ability leads to (on average) overestimating the return to education. We
attribute some of the effect of ability to education because ability and
education are positively correlated.
Remember, for a particular sample, we can never know whether β̃1 > β1.
But we should be very hesitant to trust a procedure that produces too
large an estimate on average.
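A simulation can make the direction of the bias visible (a minimal Stata sketch with made-up data; the coefficients .08 on educ and .10 on abil and the rest of the design are purely illustrative):
. clear
. set obs 1000
. set seed 42
. gen abil  = rnormal()
. gen educ  = 12 + 2*abil + rnormal()           // educ positively correlated with abil
. gen lwage = 1 + .08*educ + .10*abil + rnormal(0, .3)
. reg lwage educ abil          // "long" regression: coefficient on educ near .08
. reg lwage educ               // "short" regression: coefficient on educ biased upward
In this design the auxiliary slope of abil on educ is about .4, so the short-regression coefficient on educ tends to come out near .08 + .10 × .4 = .12 rather than .08.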
EXAMPLE: Effects of a Tutoring Program on Student
Performance
GPA = β0 + β1tutor + β2abil + u
where tutor is hours spent in tutoring. Again, β2 > 0. Suppose that
students with lower ability tend to use more tutoring:
Corr(tutor, abil) < 0
Then
E(β̃1) = β1 + β2δ̃1 = β1 + (+)(−) < β1
so that our failure to account for ability leads us to underestimate the
effect of tutoring. In fact, it could happen that β1 > 0 but E(β̃1) ≤ 0, so
we tend to find no effect or even a negative effect.
Omitted Variable Bias
Suppose now that the true model is
lwage = β0 + β1educ + β2exper + β3abil + u
and we omit abil. If, as an approximation, we assume educ and exper are
uncorrelated, and exper and abil are uncorrelated, then the bias analysis
of β̃1 from the two-regressor case carries through, but it is now β3 that
matters. So, as a rough guide, β̃1 will have an upward bias because β3 > 0
and Corr(educ, abil) > 0.
In the general case, it should be remembered that correlation of any
xj with an omitted variable generally causes bias in all of the OLS
estimators, not just in β̃j.
See Appendix 3A in Wooldridge for a more detailed treatment.
The Variance of the OLS Estimators
So far, we have assumed
1 y = β0 + β1x1 + β2x2 + . . . + βkxk + u
2 random sampling from the population
3 no perfect collinearity in the sample
4 E(u|x1, x2, ..., xk) = 0
Under MLR.3 we can compute the OLS estimates in our sample.
The other assumptions then ensure that OLS is unbiased (conditional
on the outcomes of the explanatory variables).
Assumption MLR.5 (Homoskedasticity)
We have seen that when an important variable is omitted, OLS is
biased, and we have shown how to obtain the sign of the bias in
simple cases.
As in the simple regression case, to obtain Var(β̂j ) we add a
simplifying assumption: homoskedasticity (constant variance).
The variance of the error, u, does not change with any of x1, x2, ..., xk:
Var(u|x1, x2, ..., xk) = Var(u) = σ2
This assumption can never be guaranteed. We make it for now to get
simple formulas, and to be able to discuss efficiency of OLS.
Assumption MLR.5 (Homoskedasticity)
Assumptions MLR.1 through MLR.4 imply that E(β̂j) = βj for each j.
Adding MLR.5 gives
Var(y|x1, x2, ..., xk) = Var(u|x1, x2, ..., xk) = σ²
Assumptions MLR.1 through MLR.5 are called the Gauss Markov
assumptions.
Assumption MLR.5 (Homoskedasticity)
If we have a savings equation,
sav = β0 + β1inc + β2famsize + β3pareduc + u
where famsize is size of the family and pareduc is total parents'
education, MLR.5 means that the variance in sav cannot depend on
income, family size, or parents' education.
Later we will show how to relax MLR.5, and how to test whether it is
true.
Assumption MLR.5 (Homoskedasticity)
To set up the following theorem, we focus only on the slope
coefficients. (A different formula is needed for Var(β̂0).)
As before, we are computing the variance conditional on the values of
the explanatory variables in the sample.
We need to define two quantities associated with each xj . The first is
the total variation in xj in the sample:
SSTj = ∑ᵢ₌₁ⁿ (xij − x̄j)²
(SSTj/n is the sample variance of xj.)
Assumption MLR.5 (Homoskedasticity)
The second is a measure of correlation between xj and the other
explanatory variables, in the sample. This is the R-squared from the
regression
xij on xi1, xi2, ..., xi,j−1, xi,j+1, ..., xik
That is, we regress xj on all of the other explanatory variables. (y plays no
role here.) Call this R-squared R²j, j = 1, ..., k.
Assumption MLR.5 (Homoskedasticity)
Important: R²j = 1 is ruled out by Assumption MLR.3 because
R²j = 1 means that, in the sample, xj is an exact linear function of
the other explanatory variables.
Any value 0 ≤ R²j < 1 is permitted. As R²j gets closer to one, xj is
more linearly related to the other independent variables.
Theorem (Sampling Variances of OLS Slope Estimators)
Under Assumptions MLR.1 to MLR.5, and conditional on the values of the
explanatory variables in the sample,
Var(β̂j) = σ² / [SSTj(1 − R²j)], j = 1, 2, ..., k.
All five Gauss-Markov assumptions are needed to ensure this formula is
correct.
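For instance (a made-up numerical illustration of the formula, not from the original slides), suppose σ² = 1 and SSTj = 100. Then Var(β̂j) = .01 when R²j = 0, Var(β̂j) = .10 when R²j = .9, and Var(β̂j) = 1.0 when R²j = .99: moving R²j from 0 to .99 inflates the variance a hundredfold, even though none of the assumptions is violated.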
Assumption MLR.5 (Homoskedasticity)
Suppose k = 3,
y = β0 + β1x1 + β2x2 + β3x3 + u
E(u|x1, x2, x3) = 0
Var(u|x1, x2, x3) = γ0 + γ1x1
where x1 ≥ 0 (as are γ0 and γ1). This violates MLR.5, and the standard
variance formula is generally incorrect for all OLS estimators, not just
Var(β̂1).
Assumption MLR.5 (Homoskedasticity)
The variance
Var(β̂j) = σ² / [SSTj(1 − R²j)]
has three components. σ² and SSTj are familiar from simple regression.
The third, 1 − R²j, is new to multiple regression.
Factors Affecting Var(β̂j)
1 As the error variance (in the population), σ2, decreases, Var(β̂j )
decreases. One way to reduce the error variance is to take more stuff
out of the error. That is, add more explanatory variables.
2 As the total sample variation in xj , SSTj , increases, Var(β̂j )
decreases. As in the simple regression case, it is easier to estimate
how xj affects y if we see more variation in xj
Factors Affecting Var(β̂j)
As we mentioned earlier, SSTj/n [or SSTj/(n − 1) – the difference is
unimportant here] is the sample variance of {xij : i = 1, 2, ..., n}. So
we can write, approximately,
SSTj ≈ nσ²j
where σ²j > 0 is the population variance of xj.
We can increase SSTj by increasing the sample size; SSTj is roughly
a linear function of n. [Of the three components in Var(β̂j), this is
the only one that depends systematically on n.]
Factors Affecting Var(β̂j)
Var(β̂j) = σ² / [SSTj(1 − R²j)]
As R²j → 1, Var(β̂j) → ∞. R²j measures how linearly related xj is to the
other explanatory variables.
We get the smallest variance for β̂j when R²j = 0:
Var(β̂j) = σ² / SSTj,
which looks just like the simple regression formula.
Factors Affecting Var(β̂j)
If xj is unrelated to all other independent variables, it is easier to
estimate its ceteris paribus effect on y.
R²j = 0 is very rare. Even small values are not especially common.
In fact, R²j ≈ 1 is somewhat common, and this can cause problems
for getting a sufficiently precise estimate of βj.
Assumption MLR.5 (Homoskedasticity)
Below is a graph of Var(β̂1) as a function of R²1: [figure omitted]
Loosely, R²j “close” to one is called the “problem” of
multicollinearity. Unfortunately, we cannot define what we mean by
“close” in a way that is relevant for all situations. We have ruled out the
case of perfect collinearity, R²j = 1.
Here is an important point: One often hears discussions of
multicollinearity as if high correlation among two or more of the xj is
a violation of an assumption we have made. But it does not violate
any of the Gauss-Markov assumptions, including MLR.3.
We know that if the zero conditional mean assumption is violated,
OLS is not unbiased. If MLR.1 through MLR.4 hold, but
homoskedasticity (constant variance) does not, then
Var(β̂j) = σ² / [SSTj(1 − R²j)]
is not the correct formula.
But multicollinearity does not cause the OLS estimators to be biased.
We still have E(β̂j ) = βj .
Further, any claim that the OLS variance formula is “biased” in the
presence of multicollinearity is also wrong. The formula is correct
under MLR.1 through MLR.5.
In fact, the formula is doing its job: It shows that if R²j is “close” to
one, Var(β̂j) might be very large. If R²j is “close” to one, xj does not
have much sample variation separate from the other explanatory
variables. We are trying to estimate the effect of xj on y, holding
x1, ..., xj−1, xj+1, ..., xk fixed, but the data might not allow us
to do that very precisely.
Because multicollinearity violates none of our assumptions, it is
essentially impossible to state hard rules about when it is a
“problem.” This has not stopped some from trying.
Other than just looking at the R²j, a common “measure” of
multicollinearity is called the variance inflation factor (VIF):
VIFj = 1 / (1 − R²j).
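As an aside (a minimal Stata sketch; the variable names y, x1, x2, x3 are hypothetical), the VIFs can be obtained after regress with estat vif, or computed by hand from the auxiliary regression:
. reg y x1 x2 x3
. estat vif             // one VIFj = 1/(1 - R2_j) per regressor
. reg x1 x2 x3          // auxiliary regression for x1; e(r2) is R2_1
. di 1/(1 - e(r2))      // VIF for x1, computed by hand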
Because
Var(β̂j) = (σ² / SSTj) · VIFj,
the VIFj tells us how many times larger the variance is than in the
“ideal” case of no correlation of xij with xi1, ..., xi,j−1, xi,j+1, ..., xik.
This sometimes leads to silly rules of thumb. For example, one should
be “concerned” if VIFj > 10 (equivalently, R²j > .9).
Is R²j > .9 “large”? Yes, in the sense that it would be better to have
R²j smaller.
But, if we want to control for, say, x2, ..., xk to get a good ceteris
paribus effect of x1 on y, we often have no choice.
A large VIFj can be offset by a large SSTj:
Var(β̂j) = (σ² / SSTj) · VIFj
Remember, SSTj grows roughly linearly with the sample size, n. A
large VIFj can be offset by a large sample size. The value of VIFj per
se is irrelevant. Ultimately, it is Var(β̂j ) that is important.
Even so, at this point, we have no way of knowing whether Var(β̂j ) is
“too large” for the estimate β̂j to be useful. Only when we discuss
confidence intervals and hypothesis testing will this be apparent.
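To make the offsetting argument concrete (a rough back-of-the-envelope calculation using SSTj ≈ nσ²j): with VIFj = 10, Var(β̂j) ≈ 10σ²/(nσ²j), which is the same variance one would get with VIFj = 1 and a sample only one-tenth as large. So a regressor with VIFj = 10 in a sample of 10,000 observations is estimated about as precisely as an uncorrelated regressor in a sample of 1,000.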
Be wary of work that reports a set of multicollinearity “diagnostics”
and concludes nothing useful can be learned because multicollinearity
is “too severe.” Sometimes a VIF of about 10 is used to make such a
claim.
Other “diagnostics” are even more difficult to interpret. Using them
indiscriminately is often a mistake.
Consider an example:
y = β0 + β1x1 + β2x2 + β3x3 + u,
where β1 is the coefficient of interest. In fact, assume x2 and x3 act as
controls, included so that we hope to get a good ceteris paribus estimate
of the effect of x1.
Such controls are often highly correlated. (For example, x2 and x3
could be different standardized test scores.)
The key is that the correlation between x2 and x3 has nothing to do
with Var(β̂1). It is only correlation of x1 with (x2, x3) that matters.
In an example to determine whether communities with larger minority
populations are discriminated against in lending, we might have
percapproved = β0 + β1percminority + β2avginc + β3avghouseval + u,
where β1 is the key coefficient. We might expect avginc and avghouseval
to be highly correlated across communities. But we do not really care
whether we can precisely estimate β2 or β3.
Variances in Misspecified Models
As with bias calculations, we can study the variances of the OLS
estimators in misspecified models.
Consider the same case with (at most) two explanatory variables:
y = β0 + β1x1 + β2x2 + u
which we assume satisfies the Gauss-Markov assumptions.
We run the “short” regression, y on x1, and also the “long”
regression, y on x1, x2:
ỹ = β̃0 + β̃1x1
ŷ = β̂0 + β̂1x1 + β̂2x2
We know from the previous analysis that
Var(β̂1) = σ² / [SST1(1 − R²1)]
conditional on the values xi1 and xi2 in the sample.
What about the simple regression estimator? One can show
Var(β̃1) = σ² / SST1
which is again conditional on {(xi1, xi2) : i = 1, ..., n}.
Whenever xi1 and xi2 are correlated, R²1 > 0, and
Var(β̃1) = σ² / SST1 < σ² / [SST1(1 − R²1)] = Var(β̂1)
So, by omitting x2, we can in fact get an estimator with a smaller
variance, even though it is biased. When we look at both bias and
variance, there is a tradeoff between simple and multiple regression.
In the case R²1 > 0, we can draw two conclusions.
y = β0 + β1x1 + β2x2 + u
1 If β2 ̸= 0, β̃1 is biased, β̂1 is unbiased, but Var(β̃1) < Var(β̂1).
2 If β2 = 0, β̃1 and β̂1 are both unbiased, but Var(β̃1) < Var(β̂1).
Case 2 is clear cut. If β2 = 0, x2 has no (partial) effect on y. When x2 is
correlated with x1, including it along with x1 in the regression makes it
more difficult to estimate the partial effect of x1.
Simple regression is clearly preferred.
Case 1 is more difficult, but there are reasons to prefer the unbiased
estimator, β̂1.
First, the bias in β̃1 does not systematically change with the sample
size. We should assume the bias is as large when n = 1, 000 as when
n = 10.
By contrast, the variances Var(β̃1) and Var(β̂1) both shrink at the
rate 1/n. With a large sample size, the difference between Var(β̃1)
and Var(β̂1) is less important, especially considering the bias in β̃1 is
not shrinking.
The second reason for preferring β̂1 is more subtle. The formulas
Var(β̃1) = σ² / SST1
Var(β̂1) = σ² / [SST1(1 − R²1)]
because they condition on the same explanatory variables, act as if the
error variance does not change when we add x2. But if β2 ≠ 0, the error
variance is in fact smaller once x2 is included (and so is its estimate σ̂²).
In a more advanced course, we would be making a comparison between
Var(β̃1) = η² / SST1
Var(β̂1) = σ² / [SST1(1 − R²1)]
where η² > σ² reflects the larger error variance in the simple regression
analysis.
Estimating the Error Variance
We still need to estimate σ2. With n observations and k + 1 parameters,
we only have
df = n − (k + 1)
degrees of freedom. Recall we lose the k + 1 df due to k + 1 restrictions
on the OLS residuals:
∑ᵢ₌₁ⁿ ûi = 0
∑ᵢ₌₁ⁿ xij ûi = 0, j = 1, 2, ..., k
Unbiased Estimation of σ2
Under the Gauss-Markov assumptions (MLR.1 through MLR.5)
σ̂² = (n − k − 1)⁻¹ ∑ᵢ₌₁ⁿ ûᵢ² = SSR/df
is an unbiased estimator of σ².
This means that, if we divide by n rather than n − k − 1, the bias is
−σ²(k + 1)/n,
which means the estimated variance would be too small, on average.
Unbiased Estimation of σ2
The bias disappears as n increases.
The square root of σ̂², σ̂, is reported by all regression packages (it is
called the standard error of the regression, or root mean squared error).
Note that SSR falls when a new explanatory variable is added, but df
falls, too. So σ̂ can increase or decrease when a new variable is added
in multiple regression.
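In Stata, σ̂² can be recovered from the stored results after regress (a minimal sketch; the regression is the WAGE2.DTA example shown on the next slide):
. use WAGE2, clear
. reg lwage educ IQ exper
. di e(rmse)^2            // sigma-hat squared (Root MSE squared)
. di e(rss)/e(df_r)       // SSR/(n - k - 1): the same number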
The standard error of each β̂j is computed (for the slopes) as
se(β̂j) = σ̂ / √[SSTj(1 − R²j)]
and it will be critical to report these along with the coefficient
estimates.
We have discussed the three factors that affect se(β̂j) already.
Using WAGE2.DTA:
. reg lwage educ IQ exper
Source | SS df MS Number of obs = 759
-------------+------------------------------ F( 3, 755) = 69.78
Model | 57.0352742 3 19.0117581 Prob > F = 0.0000
Residual | 205.71337 755 .27246804 R-squared = 0.2171
-------------+------------------------------ Adj R-squared = 0.2140
Total | 262.748644 758 .346634095 Root MSE = .52198
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .1069849 .0116513 9.18 0.000 .084112 .1298578
IQ | .0080269 .0015893 5.05 0.000 .0049068 .0111469
exper | .0435405 .0084242 5.17 0.000 .0270028 .0600783
_cons | -.228922 .2299876 -1.00 0.320 -.6804132 .2225692
------------------------------------------------------------------------------
The estimated equation, with standard errors in parentheses:
lwage = −.229 + .107 educ + .0080 IQ + .0435 exper
        (.230)  (.012)      (.0016)    (.0084)
n = 759, R² = .217
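As a check on the variance formula, the reported standard error of educ can be rebuilt by hand from σ̂, SST1, and R²1 (a hedged sketch using the same WAGE2 regression; insamp is a hypothetical marker variable created here):
. use WAGE2, clear
. quietly reg lwage educ IQ exper
. scalar sighat = e(rmse)               // sigma-hat
. gen byte insamp = e(sample)           // flag the estimation sample
. quietly summarize educ if insamp
. scalar SST1 = (r(N) - 1)*r(Var)       // total sample variation in educ
. quietly reg educ IQ exper if insamp   // auxiliary regression for educ
. scalar R2_1 = e(r2)
. di sighat/sqrt(SST1*(1 - R2_1))       // should reproduce se(educ) of about .0117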
Efficiency of OLS: The Gauss-Markov Theorem
How come we use OLS, rather than some other estimation method?
Consider simple regression:
y = β0 + β1x + u
and write, for each i,
yi = β0 + β1xi + ui .
If we average across the n observations we get
ȳ = β0 + β1x̄ + ū
For any i with xi ≠ x̄, subtract and rearrange:
(yi − ȳ)/(xi − x̄) = β1 + (ui − ū)/(xi − x̄)
The last term has a zero expected value under random sampling and
E(u|x) = 0. If xi ≠ x̄ for all i, we could use an estimator
β̆1 = n⁻¹ ∑ᵢ₌₁ⁿ (yi − ȳ)/(xi − x̄)
β̆1 is not the same as the OLS estimator,
β̂1 = ∑ᵢ₌₁ⁿ (xi − x̄)yi / ∑ᵢ₌₁ⁿ (xi − x̄)²
How do we know OLS is better than this new estimator, β̆1?
Generally, we do not. Under MLR.1 to MLR.4, both estimators are
unbiased.
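A small Monte Carlo makes the comparison concrete (a hedged do-file sketch with simulated data; the design – n = 50, β0 = 1, β1 = 2, standard normal x and u, 1,000 replications – is made up for illustration):
capture program drop gmsim
program define gmsim, rclass
    drop _all
    set obs 50
    gen x = rnormal()                    // explanatory variable
    gen y = 1 + 2*x + rnormal()          // true beta1 = 2
    quietly regress y x
    return scalar b_ols = _b[x]          // OLS slope
    egen xbar = mean(x)
    egen ybar = mean(y)
    gen ratio = (y - ybar)/(x - xbar)
    quietly summarize ratio
    return scalar b_alt = r(mean)        // the alternative estimator
end
set seed 1
simulate b_ols = r(b_ols) b_alt = r(b_alt), reps(1000): gmsim
summarize b_ols b_alt                    // both centered near 2; b_alt far more variable
Both estimators should average out close to 2, but the simulated standard deviation of b_alt is much larger, which is what the Gauss-Markov theorem below leads us to expect.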
Under the Gauss-Markov assumptions, however, it can be shown that
Var(β̂1) ≤ Var(β̆1). This means β̂1 has a sampling distribution that is
less spread out around β1 than that of β̆1. When comparing unbiased
estimators, we prefer the one with the smaller variance.
We can make very general statements for the multiple regression case,
provided the 5 Gauss-Markov assumptions hold.
However, we must also limit the class of estimators that we can
compare with OLS.
THEOREM (Gauss-Markov)
Under Assumptions MLR.1 through MLR.5, the OLS estimators β̂0, β̂1, ...,
β̂k are the best linear unbiased estimators (BLUEs)
Start from the end of “BLUE” and work backwards:
E (estimator) We must be able to compute an estimate from the
observable data, using a fixed rule.
L (linear) The estimator is a linear function of {yi : i = 1, 2, ..., n}. It
can be a nonlinear function of the explanatory variables.
These estimators have the general form
β̃j = ∑ᵢ₌₁ⁿ wij yi
where the {wij : i = 1, ..., n} can be any functions of
{(xi1, ..., xik) : i = 1, ..., n}.
The OLS estimators can be written in this way.
U (unbiased)
We must impose enough restrictions on the wij – we omit those here – so
that
E(β̃j) = βj, j = 0, 1, ..., k
(conditional on {(xi1, ..., xik) : i = 1, ..., n}).
We know the OLS estimators are unbiased under MLR.1 through MLR.4.
So are a lot of other linear estimators.
B (best)
This means smallest variance (which makes sense once we impose
unbiasedness). In other words, what can be shown is that, under MLR.1
through MLR.5, and conditional on the explanatory variables in the
sample,
Var(β̂j) ≤ Var(β̃j) for all j
(and usually the inequality is strict).
If we do not impose unbiasedness, then we can use silly rules – such as
β̃j = 1 always – to get estimators with zero variance.
How do we use the GM Theorem? If the Gauss-Markov assumptions
hold, and we insist on unbiased estimators that are also linear
functions of {yi : i = 1, 2, ..., n}, then we need look no further than
OLS: it delivers the smallest possible variances.
It might be possible (but even so, not practical) to find unbiased
estimators that are nonlinear functions of {yi : i = 1, 2, ..., n} that
have smaller variances than OLS. The GM Theorem only allows linear
estimators in the comparison group.
Appendix 3A contains a proof of the GM Theorem.
If MLR.5 fails, that is, Var(u|x1, ..., xk) depends on one or more xj ,
the GM conclusions do not hold. There may be linear, unbiased
estimators of the βj with smaller variance.
Remember: Failure of MLR.5 does not cause bias in the β̂j , but it
does have two consequences:
1. The usual formulas for Var(β̂j), and therefore for se(β̂j), are
wrong.
2. The β̂j are no longer BLUE.
The first of these is more serious, as it will directly affect statistical
inference (next). The second consequence means we may want to
search for estimators other than OLS. This is not so easy. And with a
large sample, it may not be very important.
More Related Content

PDF
Is lmanalysis-131124184049-phpapp02
caselyndelacruz
 
PPTX
Patinkin real balance effect
senthamizh veena
 
PPT
Macroeconomics chapter 18
MDevSNPT
 
PPTX
General equilibrium ppt
DeepinderKaur38
 
PPSX
Welfare economics
Prabha Panth
 
PPSX
Market failure
Prabha Panth
 
PPTX
Partial equilibrium, reference pricing and price distortion
Devegowda S R
 
PPTX
Bergson social welfare function(1).pptx
jaheermuktharkp
 
Is lmanalysis-131124184049-phpapp02
caselyndelacruz
 
Patinkin real balance effect
senthamizh veena
 
Macroeconomics chapter 18
MDevSNPT
 
General equilibrium ppt
DeepinderKaur38
 
Welfare economics
Prabha Panth
 
Market failure
Prabha Panth
 
Partial equilibrium, reference pricing and price distortion
Devegowda S R
 
Bergson social welfare function(1).pptx
jaheermuktharkp
 

What's hot (20)

PPTX
Liquidity Preference Theory
efinancemanagement.com
 
PPTX
Arrows Impossibility Theorem.pptx
jaheermuktharkp
 
PPTX
stackelberg Duopoly model
athira thayyil
 
PPSX
Patinkin's Real Balance Effect
Prabha Panth
 
PDF
The adding up problem product exhaustion theorem yohannes mengesha
Yohannes Mengesha, PhD Fellow
 
PPTX
Ramsey-Cass-Koopmans model.pptx
TesfahunGetachew1
 
PPTX
Leontief Paradox.pptx
SuzOn3
 
PPTX
Offer curve
Dr. Sunil Chandanshive
 
PPT
Boumals theory of sales maximisation
Manish Kumar
 
PPT
BAUMOL THEORY.ppt
SebaMohanty1
 
PPTX
Bertrand competition presentation
Robin McKinnie
 
PDF
Chapter 6 - Romer Model
Ryan Herzog
 
PPTX
arrowsimpossibilitytheorem-220603081823-4733ec7f (1).pptx
PriyadharshanBobby
 
PDF
Devaluation Marshall Learner Approach
Indore Management Institute & Research Centre
 
PPSX
Sylos labini’s model of limit pricing
Prabha Panth
 
PPSX
Tobin's Portfolio demand for money
Prabha Panth
 
PPT
Krugman
Marcus Markus
 
PPTX
Exchange Control & Exchange Rate
Dhina Karan
 
PPTX
Tobin’s q theory
Sana Hassan Afridi
 
PPSX
6. joan robinson's model
Prabha Panth
 
Liquidity Preference Theory
efinancemanagement.com
 
Arrows Impossibility Theorem.pptx
jaheermuktharkp
 
stackelberg Duopoly model
athira thayyil
 
Patinkin's Real Balance Effect
Prabha Panth
 
The adding up problem product exhaustion theorem yohannes mengesha
Yohannes Mengesha, PhD Fellow
 
Ramsey-Cass-Koopmans model.pptx
TesfahunGetachew1
 
Leontief Paradox.pptx
SuzOn3
 
Boumals theory of sales maximisation
Manish Kumar
 
BAUMOL THEORY.ppt
SebaMohanty1
 
Bertrand competition presentation
Robin McKinnie
 
Chapter 6 - Romer Model
Ryan Herzog
 
arrowsimpossibilitytheorem-220603081823-4733ec7f (1).pptx
PriyadharshanBobby
 
Devaluation Marshall Learner Approach
Indore Management Institute & Research Centre
 
Sylos labini’s model of limit pricing
Prabha Panth
 
Tobin's Portfolio demand for money
Prabha Panth
 
Krugman
Marcus Markus
 
Exchange Control & Exchange Rate
Dhina Karan
 
Tobin’s q theory
Sana Hassan Afridi
 
6. joan robinson's model
Prabha Panth
 
Ad

Similar to Chapter5.pdf.pdf (20)

PDF
Chapter6.pdf.pdf
ROBERTOENRIQUEGARCAA1
 
PPTX
Multicolinearity
Pawan Kawan
 
PPTX
Ch_03_Wooldridge_5e_PPT Econometrics.pptx
zhihanyu3
 
PDF
Chapter-3.pdf
Abebaw31
 
PPTX
Multiple Linear Regression.pptx
BHUSHANKPATEL
 
PDF
Chapter4
Vu Vo
 
PPT
Chapter 3 Multiple linear regression.ppt
aschalew shiferaw
 
PPT
Econometrics ch11
Baterdene Batchuluun
 
PPT
Econometrics_ch11.ppt
MewdedDelelegn
 
PDF
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...
Maninda Edirisooriya
 
PDF
Multiple Regression Model.pdf
UsamaIqbal83
 
PDF
Chapter3 econometrics
Vu Vo
 
PDF
Multicollinearity1
Muhammad Ali
 
PPTX
Multicollinearity PPT
GunjanKhandelwal13
 
PDF
Chapter4_Multi_Reg_Estim.pdf.pdf
ROBERTOENRIQUEGARCAA1
 
PDF
Ch5_OLSasymptotic.pdf
ROBERTOENRIQUEGARCAA1
 
PDF
pdf (9).pdf
ROBERTOENRIQUEGARCAA1
 
PPTX
Lecture - 8 MLR.pptx
iris765749
 
PPTX
Multiple regression (1)
Shakeel Nouman
 
Chapter6.pdf.pdf
ROBERTOENRIQUEGARCAA1
 
Multicolinearity
Pawan Kawan
 
Ch_03_Wooldridge_5e_PPT Econometrics.pptx
zhihanyu3
 
Chapter-3.pdf
Abebaw31
 
Multiple Linear Regression.pptx
BHUSHANKPATEL
 
Chapter4
Vu Vo
 
Chapter 3 Multiple linear regression.ppt
aschalew shiferaw
 
Econometrics ch11
Baterdene Batchuluun
 
Econometrics_ch11.ppt
MewdedDelelegn
 
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...
Maninda Edirisooriya
 
Multiple Regression Model.pdf
UsamaIqbal83
 
Chapter3 econometrics
Vu Vo
 
Multicollinearity1
Muhammad Ali
 
Multicollinearity PPT
GunjanKhandelwal13
 
Chapter4_Multi_Reg_Estim.pdf.pdf
ROBERTOENRIQUEGARCAA1
 
Ch5_OLSasymptotic.pdf
ROBERTOENRIQUEGARCAA1
 
Lecture - 8 MLR.pptx
iris765749
 
Multiple regression (1)
Shakeel Nouman
 
Ad

More from ROBERTOENRIQUEGARCAA1 (20)

PDF
Incetidumbre y susenso en el cine curso de cfg
ROBERTOENRIQUEGARCAA1
 
PDF
psicologia y Narrativa en el cine curso de cfg
ROBERTOENRIQUEGARCAA1
 
PDF
incertidumbre, suspenso y psicologia en el cine cfg
ROBERTOENRIQUEGARCAA1
 
PDF
metaforas cognitivas en el cine curso cfg
ROBERTOENRIQUEGARCAA1
 
PDF
Metaforas sonoras en el cine clase de cfg de cine y neurociencias
ROBERTOENRIQUEGARCAA1
 
PPT
Memory Lecture Psychology Introduction part 1
ROBERTOENRIQUEGARCAA1
 
PDF
Sherlock.pdf
ROBERTOENRIQUEGARCAA1
 
PDF
Cognicion Social clase
ROBERTOENRIQUEGARCAA1
 
PPT
surveys non experimental
ROBERTOENRIQUEGARCAA1
 
PPT
experimental research
ROBERTOENRIQUEGARCAA1
 
PPT
non experimental
ROBERTOENRIQUEGARCAA1
 
PPT
quasi experimental research
ROBERTOENRIQUEGARCAA1
 
PPT
variables cont
ROBERTOENRIQUEGARCAA1
 
PPT
sampling experimental
ROBERTOENRIQUEGARCAA1
 
PPT
experimental designs
ROBERTOENRIQUEGARCAA1
 
PPT
experimental control
ROBERTOENRIQUEGARCAA1
 
PPT
validity reliability
ROBERTOENRIQUEGARCAA1
 
PPT
Experiment basics
ROBERTOENRIQUEGARCAA1
 
PPTX
Week 11.pptx
ROBERTOENRIQUEGARCAA1
 
Incetidumbre y susenso en el cine curso de cfg
ROBERTOENRIQUEGARCAA1
 
psicologia y Narrativa en el cine curso de cfg
ROBERTOENRIQUEGARCAA1
 
incertidumbre, suspenso y psicologia en el cine cfg
ROBERTOENRIQUEGARCAA1
 
metaforas cognitivas en el cine curso cfg
ROBERTOENRIQUEGARCAA1
 
Metaforas sonoras en el cine clase de cfg de cine y neurociencias
ROBERTOENRIQUEGARCAA1
 
Memory Lecture Psychology Introduction part 1
ROBERTOENRIQUEGARCAA1
 
Sherlock.pdf
ROBERTOENRIQUEGARCAA1
 
Cognicion Social clase
ROBERTOENRIQUEGARCAA1
 
surveys non experimental
ROBERTOENRIQUEGARCAA1
 
experimental research
ROBERTOENRIQUEGARCAA1
 
non experimental
ROBERTOENRIQUEGARCAA1
 
quasi experimental research
ROBERTOENRIQUEGARCAA1
 
variables cont
ROBERTOENRIQUEGARCAA1
 
sampling experimental
ROBERTOENRIQUEGARCAA1
 
experimental designs
ROBERTOENRIQUEGARCAA1
 
experimental control
ROBERTOENRIQUEGARCAA1
 
validity reliability
ROBERTOENRIQUEGARCAA1
 
Experiment basics
ROBERTOENRIQUEGARCAA1
 
Week 11.pptx
ROBERTOENRIQUEGARCAA1
 

Recently uploaded (20)

PDF
[Cameron] Robust Inference for Regression with Clustered Data - slides (2015)...
soarnagi1
 
PPTX
Accounting for liabilities stockholderss
Adugna37
 
PPTX
Session 1 FTP 2023 25th June 25 TRADE FINANCE
NarinderKumarBhasin
 
PDF
LM Curve Deri IS-LM Framework sess 10.pdf
mrigankjain19
 
PPTX
Principles of Management buisness sti.pptx
CarToonMaNia5
 
PDF
Joseph Patrick Roop - Roth IRAs: Weighing the Pros and Cons
Joseph Roop
 
PPT
Time Value of Money_Fundamentals of Financial Management
nafisa791613
 
PDF
Tran Quoc Bao named in Fortune - Asia Healthcare Leadership Index 2025
Gorman Bain Capital
 
PDF
Torex to Acquire Prime Mining - July 2025
Adnet Communications
 
PPTX
Maintenance_of_Genetic_Purity_of_Seed.pptx
prasadbishnu190
 
PDF
Illuminating the Future: Universal Electrification in South Africa by Matthew...
Matthews Bantsijang
 
PPTX
办理加利福尼亚大学圣芭芭拉分校文凭|购买UCSB毕业证录取通知书学位证书
1cz3lou8
 
PDF
Why Most People Misunderstand Risk in Personal Finance.
Harsh Mishra
 
PPT
financial system chapter 1 overview of FS
kumlachewTegegn1
 
PPTX
Accounting for Managers and businesses .pptx
Nikita Bhardwaj
 
PDF
2025 Mid-year Budget Review_SPEECH_FINAL_23ndJuly2025_v5.pdf
JeorgeWilsonKingson1
 
PPTX
Unit1_Managerial_Economics_SEM 1-PPT.pptx
RISHIRISHI87
 
PPTX
Judaism-group-1.pptx for reporting grade 11
ayselprettysomuch
 
PPT
The reporting entity and financial statements
Adugna37
 
PPTX
d and f block elements chapter 4 in class 12
dynamicplays04
 
[Cameron] Robust Inference for Regression with Clustered Data - slides (2015)...
soarnagi1
 
Accounting for liabilities stockholderss
Adugna37
 
Session 1 FTP 2023 25th June 25 TRADE FINANCE
NarinderKumarBhasin
 
LM Curve Deri IS-LM Framework sess 10.pdf
mrigankjain19
 
Principles of Management buisness sti.pptx
CarToonMaNia5
 
Joseph Patrick Roop - Roth IRAs: Weighing the Pros and Cons
Joseph Roop
 
Time Value of Money_Fundamentals of Financial Management
nafisa791613
 
Tran Quoc Bao named in Fortune - Asia Healthcare Leadership Index 2025
Gorman Bain Capital
 
Torex to Acquire Prime Mining - July 2025
Adnet Communications
 
Maintenance_of_Genetic_Purity_of_Seed.pptx
prasadbishnu190
 
Illuminating the Future: Universal Electrification in South Africa by Matthew...
Matthews Bantsijang
 
办理加利福尼亚大学圣芭芭拉分校文凭|购买UCSB毕业证录取通知书学位证书
1cz3lou8
 
Why Most People Misunderstand Risk in Personal Finance.
Harsh Mishra
 
financial system chapter 1 overview of FS
kumlachewTegegn1
 
Accounting for Managers and businesses .pptx
Nikita Bhardwaj
 
2025 Mid-year Budget Review_SPEECH_FINAL_23ndJuly2025_v5.pdf
JeorgeWilsonKingson1
 
Unit1_Managerial_Economics_SEM 1-PPT.pptx
RISHIRISHI87
 
Judaism-group-1.pptx for reporting grade 11
ayselprettysomuch
 
The reporting entity and financial statements
Adugna37
 
d and f block elements chapter 4 in class 12
dynamicplays04
 

Chapter5.pdf.pdf

  • 1. MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES Introductory Econometrics: A Modern Approach, 5e Haoming Liu National University of Singapore August 21, 2022 The Expected Value of the OLS Estimators The Variance of the OLS Estimators Efficiency of OLS: The Gauss-Markov Theorem Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 1 / 82
  • 2. Recap We motived OLS estimation using the population regression function E(y|x1, x2, ..., xk) = β0 + β1x1 + β2x2 + ... + βkxk Given n observations, we obtained the sample regression function, ŷ = β̂0 + β̂1x1 + β̂2x2 + ... + β̂kxk, by choosing the β̂j to minimize the sum of squared residuals. The β̂j have a ceteris paribus interpretation. For example, ŷ = β̂1∆x1, if ∆x2 = ... = ∆xk = 0 Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 2 / 82
  • 3. Recap We discussed algebraic properties of fitted values and residuals. These hold for any sample. Also, features of R2, the goodness-of-fit measure. The only assumption we needed to discuss the algebraic properties for a given sample is that we can actually compute the estimates. Now we turn to statistical properties and features of the study the sampling distributions of the estimators. We go further than in the simple regression case, eventually covering statistical inference. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 3 / 82
  • 4. MLR As with simple regression, there is a set of assumptions under which OLS is unbiased. We also explicitly consider the bias caused by omitting a variable that appears in the population model. Now we label the assumptions with “MLR” (multiple linear regression). Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 4 / 82
  • 5. Assumption MLR.1 (Linear in Parameters) The model in the population can be written as y = β0 + β1x1 + β2x2 + ... + βkxk + u where the βj are the population parameters and u is the unobserved error. We have seen examples already where y and the xj can be nonlinear functions of underlying variables, and so the model is flexible. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 5 / 82
  • 6. Assumption MLR.2 (Random Sampling) We have a random sample of size n from the population, {(xi1, xi2, ..., xik, yi ) : i = 1, ..., n} As with SLR, this assumption introduces the data and implies the data are a representative sample from the population. Sometimes we will plug a random draw into the population equation: yi = β0 + β1xi1 + β2xi2 + ... + βkxik + ui , which emphasizes that, along with the observed variables, we effectively draw an unobserved error, ui , for each unit i. As an example, log(wagei ) = β0 + β1educi + β2IQi + β3experi + β4exper2 i + ui lwagei = β0 + β1educi + β2IQi + β3experi + β4expersqi + ui Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 6 / 82
  • 7. Assumption MLR.3 (No Perfect Collinearity) In the sample (and, therefore, in the population), none of the explanatory variables is constant, and there are no exact linear relationships among them. The need to rule out cases where {xij : i = 1, ..., n} has no variation for each j is clear from simple regression. There is a new part to the assumption because we have more than one explanatory variable. We must rule out the (extreme) case that one (or more) of the explanatory variables is an exact linear function of the others. If, say, xi1 is an exact linear function of xi2, ..., xik in the sample, we say the model suffers from perfect collinearity. Under perfect collinearity, there are no unique OLS estimators. Stata and other regression packages will indicate a problem. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 7 / 82
  • 8. Examples of Perfect Collinearity x1 = 2 ∗ x2 x1 = x3 + 3 ∗ x4 yi = β0 + β1x1i + β2x2i + ui = β0 + 0.5 ∗ β1x1i + 2 ∗ β2x2i + ui Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 8 / 82
  • 9. Assumption MLR.3 (No Perfect Collinearity) Usually perfect collinearity arises from a bad specification of the population model. A small sample size or bad luck in drawing the sample can also be the culprit. Assumption MLR.3 can only hold if n ≥ k + 1, that is, we must have at least as many observations as we have parameters to estimate. Suppose that k = 2 and x1 = educ, x2 = exper. If we draw our sample so that educi = 2experi for every i, then MLR.3 is violated. This is very unlikely unless the sample is small. (In any realistic population there are plenty of people whose education level is not twice their years of workforce experience.) Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 9 / 82
  • 10. Assumption MLR.3 (No Perfect Collinearity) With the samples we have looked at (n = 680, n = 759, even n = 173), the presence of perfect collinearity is usually a result of poor model specification, or defining variables inappropriately. Such problems can almost always be detected by remembering the ceteris paribus nature of multiple regression. EXAMPLE: Do not include the same variable in an equation that is measured in different units. For example, in a CEO salary equation, it would make no sense to include firm sales measured in dollars along with sales measured in millions of dollars. There is no new information once we include one of these. The return on equity should be included as a percent or proportion, but not both. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 10 / 82
  • 11. Assumption MLR.3 (No Perfect Collinearity) EXAMPLE: Be careful with functional forms! Suppose we start with a constant elasticity model of family consumption: log(cons) = β0 + β1 log(inc) + u How might we allow the elasticity to be nonconstant, but include the above as a special case? The following does not work: log(cons) = β0 + β1 log(inc) + β2 log(inc²) + u because log(inc²) = 2 log(inc), that is, x2 = 2x1, where x1 = log(inc). Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 11 / 82
  • 12. Assumption MLR.3 (No Perfect Collinearity) Instead, we probably mean something like log(cons) = β0 + β1 log(inc) + β2[log(inc)]² + u which means x2 = x1², where x1 = log(inc). With this choice, x2 is an exact nonlinear function of x1, but this (fortunately) is allowed in MLR.3. Tracking down perfect collinearity can be harder when it involves more than two variables. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 12 / 82
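As a quick check of the two specifications, here is a hedged Stata sketch on simulated data (hypothetical names and values): the regressor log(inc²) is perfectly collinear with log(inc) and gets dropped, while [log(inc)]² is a nonlinear function of log(inc) and is estimated without difficulty.

clear
set seed 456
set obs 500
gen inc      = exp(rnormal(3, 0.5))              // positive, roughly log-normal income
gen consump  = exp(1 + 0.8*log(inc) + rnormal(0, 0.2))
gen lconsump = log(consump)
gen linc     = log(inc)
gen linc_bad = log(inc^2)                        // equals 2*log(inc): perfect collinearity
gen linc_sq  = log(inc)^2                        // [log(inc)]^2: allowed under MLR.3
reg lconsump linc linc_bad                       // one regressor is dropped
reg lconsump linc linc_sq                        // both slopes are estimated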
  • 13. Assumption MLR.3 (No Perfect Collinearity) EXAMPLE: In VOTE1.DTA, consider regressing voteA (candidate A's share of the vote) on expendA, expendB, and totexpend = expendA + expendB, the total of the two campaign expenditures. (Go to Poll Everywhere.) Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 13 / 82
  • 14. Assumption MLR.3 (No Perfect Collinearity) One of the three variables has to be dropped. (Stata does this automatically, but we should rely on ourselves to properly construct a model and interpret it.) The model makes no sense from a ceteris paribus perspective. For example, β1 is supposed to measure the effect of changing expendA on voteA, holding fixed expendB and totexpend. But if expendB and totexpend are held fixed, expendA cannot change! We would probably drop totexpend and just use the two separate spending variables. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 14 / 82
  • 15. Assumption MLR.3 (No Perfect Collinearity)
. gen totexpend = expendA + expendB
. reg voteA expendA expendB totexpend
note: expendA omitted because of collinearity

      Source |       SS       df       MS              Number of obs =     173
-------------+------------------------------           F(  2,   170) =   95.83
       Model |  25679.8879     2   12839.944           Prob > F      =  0.0000
    Residual |  22777.3606   170  133.984474           R-squared     =  0.5299
-------------+------------------------------           Adj R-squared =  0.5244
       Total |  48457.2486   172  281.728189           Root MSE      =  11.575

------------------------------------------------------------------------------
       voteA |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expendA |  (omitted)
     expendB |  -.0744583   .0053848   -13.83   0.000    -.0850879   -.0638287
   totexpend |   .0383308   .0033868    11.32   0.000     .0316452    .0450165
       _cons |     49.619   1.426147    34.79   0.000     46.80376    52.43423
------------------------------------------------------------------------------
Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 15 / 82
  • 16. Assumption MLR.3 (No Perfect Collinearity)
. reg voteA expendA expendB

      Source |       SS       df       MS              Number of obs =     173
-------------+------------------------------           F(  2,   170) =   95.83
       Model |  25679.8879     2  12839.9439           Prob > F      =  0.0000
    Residual |  22777.3607   170  133.984474           R-squared     =  0.5299
-------------+------------------------------           Adj R-squared =  0.5244
       Total |  48457.2486   172  281.728189           Root MSE      =  11.575

------------------------------------------------------------------------------
       voteA |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expendA |   .0383308   .0033868    11.32   0.000     .0316452    .0450165
     expendB |  -.0361275   .0031071   -11.63   0.000     -.042261   -.0299939
       _cons |     49.619   1.426147    34.79   0.000     46.80376    52.43423
------------------------------------------------------------------------------
Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 16 / 82
  • 17. Assumption MLR.3 (No Perfect Collinearity) The results of the previous regression seem sensible: spending by candidate A has a positive effect on the share of the vote received by A, and spending by B has essentially the opposite effect. (If expendA increases by 10, so $10,000, and expendB is held fixed, voteA is estimated to increase by about .38 percentage points.) Note that shareA, which is a nonlinear function of expendA and expendB, shareA = 100 · expendA/(expendA + expendB), can be included along with the two expenditure variables. It allows for the relative size of spending to matter. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 17 / 82
  • 18. Assumption MLR.3 (No Perfect Collinearity)
. reg voteA expendA expendB shareA

      Source |       SS       df       MS              Number of obs =     173
-------------+------------------------------           F(  3,   169) =  346.87
       Model |  41687.0627     3  13895.6876           Prob > F      =  0.0000
    Residual |  6770.18584   169  40.0602712           R-squared     =  0.8603
-------------+------------------------------           Adj R-squared =  0.8578
       Total |  48457.2486   172  281.728189           Root MSE      =  6.3293

------------------------------------------------------------------------------
       voteA |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expendA |  -.0064488   .0029065    -2.22   0.028    -.0121866    -.000711
     expendB |   .0049463   .0026662     1.86   0.065     -.000317    .0102097
      shareA |   .5096844   .0254977    19.99   0.000     .4593494    .5600194
       _cons |   24.96397   1.459247    17.11   0.000     22.08327    27.84467
------------------------------------------------------------------------------
Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 18 / 82
  • 19. Assumption MLR.3 (No Perfect Collinearity) A Key Point Assumption MLR.3 does not say the explanatory variables have to be uncorrelated – in the population or sample. Nor does it say they cannot be “highly” correlated. MLR.3 rules out perfect correlation in the sample, that is, correlations of ±1. Again, in practice violations of MLR.3 are rare unless a mistake has been made in specifying the model. Multiple regression would be useless if we had to insist x1, ..., xk were uncorrelated in the sample (or population)! If the xj were all pairwise uncorrelated, we could just use a bunch of simple regressions. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 19 / 82
  • 20. Assumption MLR.3 (No Perfect Collinearity) In an equation like lwage = β0 + β1educ + β2IQ + β3exper + u, we fully expect correlation among educ, IQ, and exper. (Already saw educ and IQ are positively correlated; educ and exper tend to be negatively correlated (why?).) If educ were uncorrelated with all other variables that affect lwage, we could stick with simple regression of lwage on educ to estimate β1. Multiple regression allows us to estimate ceteris paribus effects precisely when there is correlation among the xj Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 20 / 82
  • 21. MLR.1 to MLR.3 1 y = β0 + β1x1 + β2x2 + ... + βkxk + u 2 random sampling from the population 3 no perfect collinearity in the sample The last assumption ensures that the OLS estimators are unique and can be obtained from the first order conditions (minimizing the sum of squared residuals). We need a final assumption for unbiasedness. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 21 / 82
  • 22. Assumption MLR.4 (Zero Conditional Mean) E(u|x1, x2, ..., xk) = 0 for all (x1, ..., xk) Remember, the real assumption is E(u|x1, x2, ..., xk) = E(u): the average value of the error does not change across different slices of the population defined by x1, ..., xk. Setting E(u) = 0 essentially defines β0. If u is correlated with any of the xj , MLR.4 is violated. This is usually a good way to think about the problem. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 22 / 82
  • 23. Assumption MLR.4 (Zero Conditional Mean) When Assumption MLR.4 holds, we say x1, ..., xk are exogenous explanatory variables. If xj is correlated with u, we often say xj is an endogenous explanatory variable. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 23 / 82
  • 24. Assumption MLR.4 (Zero Conditional Mean) EXAMPLE: Effects of Class Size on Student Performance Suppose, for a standardized test score, score = β0 + β1classize + β2income + u Even at the same income level, families differ in their interest and concern about their children’s education. Family support and student motivation are in u. Are these correlated with class size even though we have included income? Probably. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 24 / 82
  • 25. Unbiasedness of OLS Theorem Under Assumptions MLR.1 through MLR.4, and conditional on {(xi1, ..., xik) : i = 1, ..., n}, the OLS estimators are unbiased: E(β̂j ) = βj , j = 0, 1, 2, ..., k for any values of the βj . The easiest proof requires matrix algebra. See Appendix 3A for a proof based on summations. Often the hope is that if our focus is on, say, x1, we can include enough other variables in x2, ..., xk to make MLR.4 true, or close to true. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 25 / 82
  • 26. Inclusion of Irrelevant Variables It is important to see that the unbiasedness result allows for the βj to be any value, including zero. Suppose, then, that we specify the model lwage = β0 + β1educ + β2exper + β3motheduc + u, where MLR.1 through MLR.4 hold. Suppose that β3 = 0, but we do not know that. We estimate the full model by OLS: lwage = β̂0 + β̂1educ + β̂2exper + β̂3motheduc Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 26 / 82
  • 27. Inclusion of Irrelevant Variables We automatically know from the unbiasedness result that E(β̂j ) = βj , j = 0, 1, 2 E(β̂3) = 0 The result that including an irrelevant variable, or overspecifying the model, does not cause bias in any of the coefficients is sometimes presented as needing a separate argument; in fact, it follows directly from the general unbiasedness result. Including an irrelevant variable does, however, generally increase the variances (and so the standard errors) of the other estimates. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 27 / 82
  • 28. Omitted Variable Bias Leaving a variable out of a multiple regression when it should be included is a serious problem. This is called excluding a relevant variable or underspecifying the model. We can perform a misspecification analysis in this case. The general case is more complicated, so consider the case where the correct model has two explanatory variables: y = β0 + β1x1 + β2x2 + u satisfies MLR.1 through MLR.4. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 28 / 82
  • 29. Omitted Variable Bias If we regress y on x1 and x2, we know the resulting estimators will be unbiased. But suppose we leave out x2 and use simple regression of y on x1: ỹ = β̃0 + β̃1x1 In most cases, we omit x2 because we cannot collect data on it. We can easily derive the bias in β̃1 (conditional on the sample outcomes {(xi1, xi2) : i = 1, ..., n}). Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 29 / 82
  • 30. Omitted Variable Bias We already have a relationship between β̃1 and the multiple regression estimator, β̂1: β̃1 = β̂1 + β̂2δ̃1 where β̂2 is the multiple regression estimator of β2 and δ̃1 is the slope coefficient in the auxiliary regression xi2 on xi1, i = 1, ..., n Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 30 / 82
  • 31. Omitted Variable Bias Now just use the fact that β̂1 and β̂2 are unbiased (or would be if we could compute them): Conditional on the sample values of x1 and x2, E(β̃1) = E(β̂1) + E(β̂2)δ̃1 = β1 + β2δ̃1 Therefore, Bias(β̃1) = β2δ̃1 Recall that δ̃1 has the same sign as the sample correlation Corr(xi1, xi2). Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 31 / 82
  • 32. Omitted Variable Bias The simple regression estimator is unbiased (for the given outcomes {(xi1, xi2)}) in two cases. 1. β2 = 0. But this means that x2 does not appear in the population model, so simple regression is the right thing to do. 2. Corr(xi1, xi2) = 0 (in the sample). Then the simple and multiple regression estimators are identical because δ̃1 = 0. If β2 ̸= 0 and Corr(xi1, xi2) ̸= 0 then β̃1 is generally biased. We do not know β2 and might only have a vague idea about the size of δ̃1. But we often can guess at the signs. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 32 / 82
  • 33. Omitted Variable Bias Technically, the bias computed holds for a particular “sample” on (x1, x2). But acting as if what matters is correlation between x1 and x2 in the population gives us the correct answer when we turn to asymptotic analysis. In what follows, we do not make the distinction between the sample and population correlation between x1 and x2.

Bias in the Simple Regression Estimator of β1

                 Corr(x1, x2) > 0    Corr(x1, x2) < 0
    β2 > 0       Positive Bias       Negative Bias
    β2 < 0       Negative Bias       Positive Bias

Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 33 / 82
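The exact algebraic relationship β̃1 = β̂1 + β̂2δ̃1, and the sign pattern in the table above, can be checked with a small Stata simulation. This is a minimal sketch on made-up data with hypothetical parameter values (β1 = 2, β2 = 3, and x1, x2 positively correlated), so the short-regression slope should come out too large and should match the decomposition exactly.

clear
set seed 789
set obs 1000
gen x2 = rnormal()
gen x1 = 0.5*x2 + rnormal()            // x1 and x2 positively correlated
gen u  = rnormal()
gen y  = 1 + 2*x1 + 3*x2 + u           // true beta1 = 2, beta2 = 3 (both positive)
quietly reg y x1 x2                    // long regression
scalar b1_hat = _b[x1]
scalar b2_hat = _b[x2]
quietly reg x2 x1                      // auxiliary regression: delta1-tilde
scalar d1 = _b[x1]
quietly reg y x1                       // short regression: slope is too large on average
display "short-regression slope:         " _b[x1]
display "b1_hat + b2_hat*delta1_tilde:   " b1_hat + b2_hat*d1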
  • 34. EXAMPLE: Omitted Ability Bias lwage = β0 + β1educ + β2abil + u where abil is “ability.” Essentially by definition, β2 > 0. We also think Corr(educ, abil) > 0 so that higher ability people get more education, on average. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 34 / 82
  • 35. EXAMPLE: Omitted Ability Bias In this scenario, E(β̃1) > β1 so there is an upward bias in simple regression. Our failure to control for ability leads to (on average) overestimating the return to education. We attribute some of the effect of ability to education because ability and education are positively correlated. Remember, for a particular sample, we can never know whether β̃1 > β1. But we should be very hesitant to trust a procedure that produces too large an estimate on average. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 35 / 82
  • 36. EXAMPLE: Effects of a Tutoring Program on Student Performance GPA = β0 + β1tutor + β2abil + u where tutor is hours spent in tutoring. Again, β2 > 0. Suppose that students with lower ability tend to use more tutoring: Corr(tutor, abil) < 0 Then E(β̃1) = β1 + β2δ̃1 = β1 + (+)(−) < β1 so that our failure to account for ability leads us to underestimate the effect of tutoring. In fact, it could happen that β1 > 0 but E(β̃1) ≤ 0, so we tend to find no effect or even a negative effect. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 36 / 82
  • 37. Omitted Variable Bias Now suppose the population model is lwage = β0 + β1educ + β2exper + β3abil + u, but abil is omitted and we regress lwage on educ and exper only. If, as an approximation, we assume educ and exper are uncorrelated, and exper and abil are uncorrelated, then the bias analysis of β̃1 for the two-regressor case carries through, but it is now β3 that matters. So, as a rough guide, β̃1 will have an upward bias because β3 > 0 and Corr(educ, abil) > 0. In the general case, it should be remembered that correlation of any xj with an omitted variable generally causes bias in all of the OLS estimators, not just in the coefficient on that xj . See Appendix 3A in Wooldridge for a more detailed treatment. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 37 / 82
  • 38. The Variance of the OLS Estimators So far, we have assumed 1 y = β0 + β1x1 + β2x2 + . . . + βkxk + u 2 random sampling from the population 3 no perfect collinearity in the sample 4 E(u|x1, x2, ..., xk) = 0 Under MLR.3 we can compute the OLS estimates in our sample. The other assumptions then ensure that OLS is unbiased (conditional on the outcomes of the explanatory variables). Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 38 / 82
  • 39. Assumption MLR.5 (Homoskedasticity) We have now seen that OLS is biased when an important variable has been omitted, and we have shown how to obtain the sign of the bias in simple cases. As in the simple regression case, to obtain Var(β̂j ) we add a simplifying assumption: homoskedasticity (constant variance). The variance of the error, u, does not change with any of x1, x2, ..., xk: Var(u|x1, x2, ..., xk) = Var(u) = σ² This assumption can never be guaranteed. We make it for now to get simple formulas, and to be able to discuss efficiency of OLS. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 39 / 82
  • 40. Assumption MLR.5 (Homoskedasticity) Assumptions MLR.1 through MLR.4 imply that the OLS estimators are unbiased, E(β̂j ) = βj . Under MLR.5, Var(y|x1, x2, ..., xk) = Var(u|x1, x2, ..., xk) = σ² Assumptions MLR.1 through MLR.5 are called the Gauss-Markov assumptions. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 40 / 82
  • 41. Assumption MLR.5 (Homoskedasticity) If we have a savings equation, sav = β0 + β1inc + β2famsize + β3pareduc + u where famsize is size of the family and pareduc is total parents' education, MLR.5 means that the variance in sav cannot depend on income, family size, or parents' education. Later we will show how to relax MLR.5, and how to test whether it is true. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 41 / 82
  • 42. Assumption MLR.5 (Homoskedasticity) To set up the following theorem, we focus only on the slope coefficients. (A different formula is needed for Var(β̂0).) As before, we are computing the variance conditional on the values of the explanatory variables in the sample. We need to define two quantities associated with each xj . The first is the total variation in xj in the sample: SSTj = Σ_{i=1}^{n} (xij − x̄j )² (SSTj /n is the sample variance of xj .) Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 42 / 82
  • 43. Assumption MLR.5 (Homoskedasticity) The second is a measure of correlation between xj and the other explanatory variables, in the sample. This is the R-squared from the regression xij on xi1, xi2, ..., xi,j−1, xi,j+1, ..., xik That is, we regress xj on all of the other explanatory variables (y plays no role here). Call this R-squared R²j , j = 1, ..., k. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 43 / 82
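As a sketch (using the simulated y, x1, x2 from the omitted-variable example above, so the names are hypothetical), both quantities are easy to recover in Stata: SSTj from the sample variance of xj, and R²j from the auxiliary regression of xj on the other regressors.

quietly summarize x1
scalar SST1 = r(Var)*(r(N) - 1)        // total sample variation in x1
quietly reg x1 x2                      // auxiliary regression of x1 on the other x's
scalar R2_1 = e(r2)
display "SST1 = " SST1 "    R-squared_1 = " R2_1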
  • 44. Assumption MLR.5 (Homoskedasticity) Important: R²j = 1 is ruled out by Assumption MLR.3 because R²j = 1 means that, in the sample, xj is an exact linear function of the other explanatory variables. Any value 0 ≤ R²j < 1 is permitted. As R²j gets closer to one, xj is more linearly related to the other independent variables. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 44 / 82
  • 45. Theorem (Sampling Variances of OLS Slope Estimators) Under Assumptions MLR.1 to MLR.5, and conditional on the values of the explanatory variables in the sample, Var(β̂j ) = σ²/[SSTj (1 − R²j )], j = 1, 2, ..., k. All five Gauss-Markov assumptions are needed to ensure this formula is correct. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 45 / 82
  • 46. Assumption MLR.5 (Homoskedasticity) Suppose k = 3, y = β0 + β1x1 + β2x2 + β3x3 + u E(u|x1, x2, x3) = 0 Var(u|x1, x2, x3) = γ0 + γ1x1 where x1 ≥ 0 (as are γ0 and γ1). This violates MLR.5, and the standard variance formula is generally incorrect for all OLS estimators, not just Var(β̂1). Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 46 / 82
  • 47. Assumption MLR.5 (Homoskedasticity) The variance Var(β̂j ) = σ²/[SSTj (1 − R²j )] has three components. σ² and SSTj are familiar from simple regression. The third, 1 − R²j , is new to multiple regression. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 47 / 82
  • 48. Factors Affecting Var(β̂j) 1 As the error variance (in the population), σ², decreases, Var(β̂j ) decreases. One way to reduce the error variance is to take more stuff out of the error. That is, add more explanatory variables. 2 As the total sample variation in xj , SSTj , increases, Var(β̂j ) decreases. As in the simple regression case, it is easier to estimate how xj affects y if we see more variation in xj . Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 48 / 82
  • 49. Factors Affecting Var(β̂j) As we mentioned earlier, SSTj /n [or SSTj /(n − 1) – the difference is unimportant here] is the sample variance of {xij : i = 1, 2, ..., n}. So we can assume SSTj ≈ nσ²j where σ²j > 0 is the population variance of xj . We can increase SSTj by increasing the sample size: SSTj is roughly a linear function of n. [Of the three components in Var(β̂j ), this is the only one that depends systematically on n.] Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 49 / 82
  • 50. Factors Affecting Var(β̂j) Var(β̂j ) = σ²/[SSTj (1 − R²j )] As R²j → 1, Var(β̂j ) → ∞. R²j measures how linearly related xj is to the other explanatory variables. We get the smallest variance for β̂j when R²j = 0: Var(β̂j ) = σ²/SSTj , which looks just like the simple regression formula. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 50 / 82
  • 51. Factors Affecting Var(β̂j) If xj is unrelated to all other independent variables, it is easier to estimate its ceteris paribus effect on y. But R²j = 0 is very rare; even small values are not especially common. In fact, R²j ≈ 1 is somewhat common, and this can cause problems for getting a sufficiently precise estimate of βj . Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 51 / 82
  • 52. Assumption MLR.5 (Homoskedasticity) Below is a graph of Var(β̂1) as a function of R²1: [figure not reproduced; Var(β̂1) rises slowly at first and then without bound as R²1 approaches one] Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 52 / 82
  • 53. Loosely, R²j “close” to one is called the “problem” of multicollinearity. Unfortunately, we cannot define what we mean by “close” in a way that is relevant for all situations. We have ruled out the case of perfect collinearity, R²j = 1. Here is an important point: One often hears discussions of multicollinearity as if high correlation among two or more of the xj is a violation of an assumption we have made. But it does not violate any of the Gauss-Markov assumptions, including MLR.3. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 53 / 82
  • 54. We know that if the zero conditional mean assumption is violated, OLS is not unbiased. If MLR.1 through MLR.4 hold, but homoskedasticity (constant variance) does not, then Var(β̂j ) = σ²/[SSTj (1 − R²j )] is not the correct formula. But multicollinearity does not cause the OLS estimators to be biased. We still have E(β̂j ) = βj . Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 54 / 82
  • 55. Further, any claim that the OLS variance formula is “biased” in the presence of multicollinearity is also wrong. The formula is correct under MLR.1 through MLR.5. In fact, the formula is doing its job: It shows that if R²j is “close” to one, Var(β̂j ) might be very large. If R²j is “close” to one, xj does not have much sample variation separate from the other explanatory variables. We are trying to estimate the effect of xj on y, holding x1, ..., xj−1, xj+1, ..., xk fixed, but the data might not be allowing us to do that very precisely. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 55 / 82
  • 56. Because multicollinearity violates none of our assumptions, it is essentially impossible to state hard rules about when it is a “problem.” This has not stopped some from trying. Other than just looking at the R²j , a common “measure” of multicollinearity is called the variance inflation factor (VIF): VIFj = 1/(1 − R²j ). Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 56 / 82
  • 57. Because Var(β̂j ) = (σ²/SSTj ) · VIFj , the VIFj tells us how many times larger the variance is than if we had the “ideal” case of no correlation of xij with xi1, ..., xi,j−1, xi,j+1, ..., xik. This sometimes leads to silly rules-of-thumb. For example, one should be “concerned” if VIFj > 10 (equivalently, R²j > .9). Is R²j > .9 “large”? Yes, in the sense that it would be better to have R²j smaller. But, if we want to control for, say, x2, ..., xk to get a good ceteris paribus effect of x1 on y, we often have no choice. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 57 / 82
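After a regression, Stata reports the VIFs directly with estat vif; the sketch below (continuing the simulated data from the earlier sketches, so the names are hypothetical) also recomputes VIF1 = 1/(1 − R²1) by hand so the two can be compared.

quietly reg y x1 x2
estat vif                              // variance inflation factors for x1 and x2
quietly reg x1 x2                      // auxiliary regression for x1
display "VIF for x1 computed by hand: " 1/(1 - e(r2))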
  • 58. A large VIFj can be offset by a large SSTj : Var(β̂j ) = (σ²/SSTj ) · VIFj Remember, SSTj grows roughly linearly with the sample size, n. A large VIFj can be offset by a large sample size. The value of VIFj per se is irrelevant. Ultimately, it is Var(β̂j ) that is important. Even so, at this point, we have no way of knowing whether Var(β̂j ) is “too large” for the estimate β̂j to be useful. Only when we discuss confidence intervals and hypothesis testing will this be apparent. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 58 / 82
  • 59. Be wary of work that reports a set of multicollinearity “diagnostics” and concludes nothing useful can be learned because multicollinearity is “too severe.” Sometimes a VIF of about 10 is used to make such a claim. Other “diagnostics” are even more difficult to interpret. Using them indiscriminately is often a mistake. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 59 / 82
  • 60. Consider an example: y = β0 + β1x1 + β2x2 + β3x3 + u, where β1 is the coefficient of interest. In fact, assume x2 and x3 act as controls, included so that we can hope to get a good ceteris paribus estimate of the effect of x1. Such controls are often highly correlated. (For example, x2 and x3 could be different standardized test scores.) The key is that the correlation between x2 and x3 has nothing to do with Var(β̂1). It is only the correlation of x1 with (x2, x3) that matters. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 60 / 82
  • 61. In an example to determine whether communities with larger minority populations are discriminated against in lending, we might have percapproved = β0 + β1percminority + β2avginc + β3avghouseval + u, where β1 is the key coefficient. We might expect avginc and avghouseval to be highly correlated across communities. But we do not really care whether we can precisely estimate β2 or β3. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 61 / 82
  • 62. Variances in Misspecified Models As with bias calculations, we can study the variances of the OLS estimators in misspecified models. Consider the same case with (at most) two explanatory variables: y = β0 + β1x1 + β2x2 + u which we assume satisfies the Gauss-Markov assumptions. We run the “short” regression, y on x1, and also the “long” regression, y on x1, x2: ỹ = β̃0 + β̃1x1 ŷ = β̂0 + β̂1x1 + β̂2x2 Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 62 / 82
  • 63. We know from the previous analysis that Var(β̂1) = σ²/[SST1(1 − R²1)] conditional on the values xi1 and xi2 in the sample. What about the simple regression estimator? We can show Var(β̃1) = σ²/SST1 which is again conditional on {(xi1, xi2) : i = 1, ..., n}. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 63 / 82
  • 64. Whenever xi1 and xi2 are correlated, R²1 > 0, and Var(β̃1) = σ²/SST1 < σ²/[SST1(1 − R²1)] = Var(β̂1) So, by omitting x2, we can in fact get an estimator with a smaller variance, even though it is biased. When we look at bias and variance together, we have a tradeoff between simple and multiple regression. In the case R²1 > 0, we can draw two conclusions. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 64 / 82
  • 65. y = β0 + β1x1 + β2x2 + u 1 If β2 ̸= 0, β̃1 is biased, β̂1 is unbiased, but Var(β̃1) < Var(β̂1). 2 If β2 = 0, β̃1 and β̂1 are both unbiased and Var(β̃1) < Var(β̂1). Case 2 is clear cut. If β2 = 0, x2 has no (partial) effect on y. When x2 is correlated with x1, including it along with x1 in the regression makes it more difficult to estimate the partial effect of x1. Simple regression is clearly preferred. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 65 / 82
  • 66. Case 1 is more difficult, but there are reasons to prefer the unbiased estimator, β̂1. First, the bias in β̃1 does not systematically change with the sample size. We should assume the bias is as large when n = 1, 000 as when n = 10. By contrast, the variances Var(β̃1) and Var(β̂1) both shrink at the rate 1/n. With a large sample size, the difference between Var(β̃1) and Var(β̂1) is less important, especially considering the bias in β̃1 is not shrinking. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 66 / 82
  • 67. The second reason for preferring β̂1 is more subtle. The formulas Var(β̃1) = σ²/SST1 and Var(β̂1) = σ²/[SST1(1 − R²1)], because they condition on the same explanatory variables and use the same σ², act as if the error variance does not change when we add x2. But if β2 ̸= 0, the error variance does shrink when x2 is added (and the estimate σ̂² reflects this). Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 67 / 82
  • 68. In a more advanced course, we would be making a comparison between Var(β̃1) = η²/SST1 and Var(β̂1) = σ²/[SST1(1 − R²1)], where η² > σ² reflects the larger error variance in the simple regression analysis. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 68 / 82
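A quick check in the simulated data from the earlier sketch (hypothetical values) illustrates this point: because β2 ≠ 0 there, the residual variance in the short regression reflects η² rather than σ², and the reported standard error on x1 can end up larger in the short regression despite the formula comparison above.

quietly reg y x1                       // short regression: Root MSE estimates eta, not sigma
display "Root MSE, short: " e(rmse) "   se(b1), short: " _se[x1]
quietly reg y x1 x2                    // long regression: Root MSE estimates sigma
display "Root MSE, long:  " e(rmse) "   se(b1), long:  " _se[x1]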
  • 69. Estimating the Error Variance We still need to estimate σ². With n observations and k + 1 parameters, we only have df = n − (k + 1) degrees of freedom. Recall we lose the k + 1 df due to the k + 1 restrictions on the OLS residuals: Σ_{i=1}^{n} ûi = 0 and Σ_{i=1}^{n} xij ûi = 0, j = 1, 2, ..., k Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 69 / 82
  • 70. Unbiased Estimation of σ² Under the Gauss-Markov assumptions (MLR.1 through MLR.5) σ̂² = (n − k − 1)⁻¹ Σ_{i=1}^{n} ûi² = SSR/df is an unbiased estimator of σ². This means that, if we divide by n rather than n − k − 1, the bias is −σ²(k + 1)/n, which means the estimated variance would be too small, on average. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 70 / 82
  • 71. Unbiased Estimation of σ² The bias disappears as n increases. The square root of σ̂², σ̂, is reported by all regression packages (the standard error of the regression, or root mean squared error). Note that SSR falls when a new explanatory variable is added, but df falls, too. So σ̂ can increase or decrease when a new variable is added in multiple regression. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 71 / 82
  • 72. The standard error of each β̂j is computed (for the slopes) as se(β̂j ) = σ̂ / √[SSTj (1 − R²j )] and it will be critical to report these along with the coefficient estimates. We have discussed the three factors that affect se(β̂j ) already. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 72 / 82
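As a sketch (again on the simulated data from earlier, so all names are hypothetical), the standard error Stata reports for a slope can be reproduced from σ̂, SSTj, and R²j using exactly this formula.

quietly reg y x1 x2
scalar sighat = e(rmse)                // sigma-hat, the standard error of the regression
scalar se_rep = _se[x1]                // standard error Stata reports for x1
quietly summarize x1
scalar SST1 = r(Var)*(r(N) - 1)
quietly reg x1 x2
scalar R2_1 = e(r2)
display "se by formula: " sighat/sqrt(SST1*(1 - R2_1)) "    se reported: " se_rep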
  • 73. Using WAGE2.DTA:
. reg lwage educ IQ exper

      Source |       SS       df       MS              Number of obs =     759
-------------+------------------------------           F(  3,   755) =   69.78
       Model |  57.0352742     3  19.0117581           Prob > F      =  0.0000
    Residual |   205.71337   755   .27246804           R-squared     =  0.2171
-------------+------------------------------           Adj R-squared =  0.2140
       Total |  262.748644   758  .346634095           Root MSE      =  .52198

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1069849   .0116513     9.18   0.000      .084112    .1298578
          IQ |   .0080269   .0015893     5.05   0.000     .0049068    .0111469
       exper |   .0435405   .0084242     5.17   0.000     .0270028    .0600783
       _cons |   -.228922   .2299876    -1.00   0.320    -.6804132    .2225692
------------------------------------------------------------------------------

lwage = −.229 (.230) + .107 (.012) educ + .0080 (.0016) IQ + .0435 (.0084) exper, n = 759, R² = .217 (standard errors in parentheses)
Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 73 / 82
  • 74. Efficiency of OLS: The Gauss-Markov Theorem How come we use OLS, rather than some other estimation method? Consider simple regression: y = β0 + β1x + u and write, for each i, yi = β0 + β1xi + ui . If we average across the n observations we get ȳ = β0 + β1x̄ + ū Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 74 / 82
  • 75. For any i with xi ̸= x̄, subtract and rearrange: β1 = (yi − ȳ)/(xi − x̄) − (ui − ū)/(xi − x̄) The last term has a zero expected value under random sampling and E(u|x) = 0. If xi ̸= x̄ for all i, we could use the estimator β̆1 = n⁻¹ Σ_{i=1}^{n} (yi − ȳ)/(xi − x̄) β̆1 is not the same as the OLS estimator, β̂1 = Σ_{i=1}^{n} (xi − x̄)yi / Σ_{i=1}^{n} (xi − x̄)² How do we know OLS is better than this new estimator, β̆1? Generally, we do not. Under MLR.1 to MLR.4, both estimators are unbiased. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 75 / 82
  • 76. It turns out that, under the Gauss-Markov assumptions, Var(β̂1) ≤ Var(β̆1): β̂1 has a sampling distribution that is less spread out around β1 than that of β̆1. When comparing unbiased estimators, we prefer an estimator with smaller variance. We can make very general statements for the multiple regression case, provided the 5 Gauss-Markov assumptions hold. However, we must also limit the class of estimators that we can compare with OLS. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 76 / 82
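A small Monte Carlo sketch (simulated data, hypothetical design with true slope 2) makes the comparison concrete. Both estimators are unbiased under MLR.1 to MLR.4, but across repeated samples the OLS estimates cluster tightly around the true slope while the alternative estimator β̆1 is far more dispersed (occasionally wildly so, when some xi falls close to x̄), which is the sense in which OLS is preferred.

capture program drop compare
program define compare, rclass
    clear
    set obs 100
    gen x = runiform(1, 10)
    gen y = 1 + 2*x + rnormal()        // true slope is 2
    quietly reg y x
    return scalar b_ols = _b[x]        // the OLS slope
    quietly summarize y
    scalar ybar = r(mean)
    quietly summarize x
    scalar xbar = r(mean)
    gen ratio = (y - ybar)/(x - xbar)  // slope of the line through (x-bar, y-bar) and (xi, yi)
    quietly summarize ratio
    return scalar b_alt = r(mean)      // the alternative estimator from the previous slide
end
simulate b_ols = r(b_ols) b_alt = r(b_alt), reps(1000) seed(2022) nodots: compare
summarize b_ols b_alt                  // b_ols clusters tightly around 2; b_alt is far more dispersed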
  • 77. THEOREM (Gauss-Markov) Under Assumptions MLR.1 through MLR.5, the OLS estimators β̂0, β̂1, ..., β̂k are the best linear unbiased estimators (BLUEs). Start from the end of “BLUE” and work backwards: E (estimator) We must be able to compute an estimate from the observable data, using a fixed rule. L (linear) The estimator is a linear function of {yi : i = 1, 2, ..., n}. It can be a nonlinear function of the explanatory variables. These estimators have the general form β̃j = Σ_{i=1}^{n} wij yi where the {wij : i = 1, ..., n} are any functions of {(xi1, ..., xik) : i = 1, ..., n}. The OLS estimators can be written in this way. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 77 / 82
  • 78. U (unbiased) We must impose enough restrictions on the wij – we omit those here – so that E(β̃j ) = βj , j = 0, 1..., k (conditional on {(xi1, ..., xik) : i = 1, ..., n}). We know the OLS estimators are unbiased under MLR.1 through MLR.4. So are a lot of other linear estimators. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 78 / 82
  • 79. B (best) This means smallest variance (which makes sense once we impose unbiasedness). In other words, what can be shown is that, under MLR.1 through MLR.5, and conditional on the explanatory variables in the sample, Var(β̂j ) ≤ Var(β̃j ) all j (and usually the inequality is strict). If we do not impose unbiasedness, then we can use silly rules – such as β̃j = 1 always – to get estimators with zero variance. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 79 / 82
  • 80. How do we use the GM Theorem? If the Gauss-Markov assumptions hold, and we insist on unbiased estimators that are also linear functions of {yi : i = 1, 2, ..., n}, then we need look no further than OLS: it delivers the smallest possible variances. It might be possible (but even so, not practical) to find unbiased estimators that are nonlinear functions of {yi : i = 1, 2, ..., n} that have smaller variances than OLS. The GM Theorem only allows linear estimators in the comparison group. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 80 / 82
  • 81. Appendix 3A contains a proof of the GM Theorem. If MLR.5 fails, that is, Var(u|x1, ..., xk) depends on one or more xj , the GM conclusions do not hold. There may be linear, unbiased estimators of the βj with smaller variance. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 81 / 82
  • 82. Remember: Failure of MLR.5 does not cause bias in the β̂j , but it does have two consequences: 1. The usual formulas for Var(β̂j ), and therefore for se(β̂j ), are wrong. 2. The β̂j are no longer BLUE. The first of these is more serious, as it will directly affect statistical inference (next). The second consequence means we may want to search for estimators other than OLS. This is not so easy. And with a large sample, it may not be very important. Liu, H (NUS) MULTIPLE REGRESSION ANALYSIS: STATISTICAL PROPERTIES August 21, 2022 82 / 82