Chapter 2 The Classical Linear Regression Model (CLRM)
Estimation and hypothesis testing are the twin branches of statistical inference. Based on OLS, we obtained the sample regression, such as the one shown in Equation (1.40). This is of course a sample regression function (SRF) because it is based on a specific sample drawn randomly from the purported population. What can we say about the true population regression function (PRF) from the SRF? In practice, we do not observe the PRF and have to "guess" it from the SRF. To obtain the best possible guess, we need a framework, which is provided by the classical linear regression model (CLRM). The CLRM is based on several assumptions, which are discussed below.
2.1 Assumptions of the CLRM

The CLRM is based on several assumptions, which are listed below; in later chapters we will examine these assumptions more critically. However, keep in mind that in any scientific inquiry we begin with simplifying assumptions and modify them as we learn more.

The regressors are assumed to be nonstochastic, in the sense that their values are fixed in repeated sampling. However, if the regressors are stochastic, we assume that each regressor is independent of the error term or at least uncorrelated with it. We will discuss this point further in later chapters.

Given the values of the regressors, the expected, or mean, value of the error term is zero:

E(ui | X) = 0    (2.1)

In matrix notation,

E(u | X) = 0    (2.1a)
More explicitly,

E(u) = E[u1  u2  u3  ⋯  un]′ = [E(u1)  E(u2)  E(u3)  ⋯  E(un)]′ = [0  0  0  ⋯  0]′
Because of this critical assumption, and given the values of the regressors, we can write Equation (1.5) as

E(y | X) = XB + E(u | X)
         = XB    (2.2)
This is the PRF. In regression analysis, our primary objective is to estimate
this function. The PRF thus gives the mean value of the regressand corresponding to the given values of the regressors, noting that conditional on these values the mean value of the error term is 0.
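To make the distinction between the PRF and the SRF concrete, here is a minimal sketch in Python with NumPy (not part of the original text; the coefficient values, sample size, and error variance are invented for illustration). It draws one sample from a population satisfying E(u | X) = 0 and computes the SRF by OLS.

    import numpy as np

    rng = np.random.default_rng(0)

    # Population regression function (PRF): E(y | X) = XB, with illustrative B
    n = 100
    B = np.array([1.0, 0.5, -2.0])              # "true" parameters (unknown in practice)
    X = np.column_stack([np.ones(n),            # intercept column
                         rng.uniform(0, 10, size=(n, 2))])

    # One random sample: y = XB + u, with E(u | X) = 0
    u = rng.normal(0.0, 2.0, size=n)
    y = X @ B + u

    # Sample regression function (SRF): b = (X'X)^(-1) X'y
    b = np.linalg.solve(X.T @ X, X.T @ y)
    print("true B:", B)
    print("OLS  b:", b)                         # b estimates B and varies from sample to sample

A different random sample (a different u) would give a different b, which is why we later study the sampling distribution of b.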
var(ui | X) = σ²    (2.3)

This is the assumption of homoscedasticity, or constant error variance. In addition, no two error terms are correlated:

cov(ui, uj | X) = 0,  i ≠ j    (2.4)
where cov stands for covariance, and i and j are two different error terms.
Of course, if i = j, we get the variance of ui given in Equation (2.3).
Figure 2.2 shows a likely pattern of autocorrelation.
[Figure: Homoscedasticity and Heteroscedasticity (scatter-plot panels)]
[Figure 2.2 Autocorrelation: plot of an autocorrelation function]
E(uu′) = σ²I   if i = j
       = 0     if i ≠ j

where 0 is the null matrix and I is the identity matrix. We discuss this assumption further in Chapter 5. More compactly, we can express Assumptions 4 and 5 as

E(uu′) = σ²I =
    [ σ²  0   ⋯  0  ]
    [ 0   σ²  ⋯  0  ]
    [ ⋮   ⋮   ⋱  ⋮  ]
    [ 0   0   ⋯  σ² ]
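As a quick illustration of this covariance structure (a sketch, not from the original text; σ² = 4 and n = 5 are arbitrary illustrative values), the following Python snippet draws many error vectors that satisfy Assumptions 4 and 5 and checks that their sample version of E(uu′) is close to σ²I.

    import numpy as np

    rng = np.random.default_rng(1)
    n, sigma2, reps = 5, 4.0, 200_000

    # Many n x 1 error vectors with var(u_i) = sigma2 and cov(u_i, u_j) = 0
    U = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

    # Sample estimate of E(uu') across the replications
    Euu = (U.T @ U) / reps
    print(np.round(Euu, 2))                                   # approximately sigma2 * I
    print(np.allclose(Euu, sigma2 * np.eye(n), atol=0.1))     # should print True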
Assumption 6: There is no perfect linear relationship among the X variables. This is the assumption of no multicollinearity. Strictly speaking, multicollinearity refers to the existence of more than one exact linear relationship, and collinearity refers to the existence of a single exact linear relationship. But this distinction is rarely maintained in practice, and multicollinearity refers to both cases. Imagine what would happen in the wage regression given in Equation (1.5) if we were to include work experience both in years and in months!

In matrix notation, this assumption means that the X matrix is of full column rank. In other words, the columns of the X matrix are linearly independent.
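The experience-in-years-and-months example can be made concrete with the following sketch (Python with NumPy; the data are simulated and purely illustrative). Because one column is an exact multiple of another, X loses full column rank and X′X becomes singular, so the OLS estimator cannot be computed.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 50
    exper_years = rng.integers(0, 30, size=n).astype(float)
    exper_months = 12.0 * exper_years            # exact linear relationship

    X = np.column_stack([np.ones(n), exper_years, exper_months])
    print("columns:", X.shape[1], "rank:", np.linalg.matrix_rank(X))   # rank 2 < 3

    # X'X is singular, so (X'X)^(-1), and hence the OLS estimator b, does not exist
    print("det(X'X) =", np.linalg.det(X.T @ X))  # zero apart from rounding error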
In addition, the error terms are assumed to follow the normal distribution:

ui ~ N(0, σ²)    (2.5)

Or in matrix notation,

u ~ N(0, σ²I)    (2.5a)
The assumption of the normality of the error term is crucial if the sample size is rather small; it is not essential if we have a very large sample. We will revisit this assumption in Chapter 7. With this assumption, the CLRM is known as the classical normal linear regression model (CNLRM). Given the normality of u, it also follows that

y ~ N(XB, σ²I)    (2.6)

That is, the regressand is distributed normally with mean XB and the (constant) variance σ².
Under Assumption 8, we can use the method of maximum likelihood (ML) as an alternative to OLS. We will discuss ML more thoroughly in Chapter 3 because of its general applicability in many areas of statistics.

With one or more of the preceding assumptions, in this chapter we discuss the following topics:
1. The sampling distribution of the OLS estimator b
2. An estimator of the unknown variance, σ²
3. The relationship between the residual e and the error u

2.2 The Sampling Distribution of OLS Estimators

Because the regressors are assumed fixed in repeated sampling, their values are nonstochastic constants. However, this is not true of the estimated b coefficients, for their values depend on the sample data at hand. In other words, the b coefficients are random. As such, we would like to find their sampling or probability distribution.
Recall that

b = B + (X′X)⁻¹X′u    (2.7)

so that b − B = (X′X)⁻¹X′u. Therefore,

var(b) = E[(b − B)(b − B)′]
       = E[(X′X)⁻¹X′uu′X(X′X)⁻¹]
       = (X′X)⁻¹X′E(uu′)X(X′X)⁻¹
       = (X′X)⁻¹X′(σ²I)X(X′X)⁻¹
       = σ²(X′X)⁻¹    (2.8)
In deriving this expression, we have used the properties of the transpose of an inverse matrix, the assumption that X is fixed, and the assumptions that the variance of ui is constant and that the us are uncorrelated. Notice that we can move the expectations operator through X because it is assumed fixed. The variances of the individual elements of b are on the main diagonal (running from upper left to lower right), and the off-diagonal elements give the covariances between pairs of coefficients in b.
Since

b = (X′X)⁻¹X′y    (1.16)

is a linear function of y, and y is normally distributed by Equation (2.6), it follows that

b ~ N(B, σ²(X′X)⁻¹)    (2.9)

That is, b is normally distributed with B as its mean (see Equation 2.20) and the variance established in Equation (2.8). In other words, under the normality assumption b varies from sample to sample, and its sampling distribution describes that variation in b over all possible samples.1
1
Suppose we draw several independent samples and for each sample we compute a
(test) statistic, such as the mean, and draw a frequency distribution of all these sta-
tistics. Roughly speaking, this frequency distribution is the sampling distribution of
that statistic. In our case, under the assumed conditions, the probability or sampling
distribution of any component of b is normal as shown in Equation (2.9).
In particular, for the kth element of b,

bk ~ N(Bk, σ²x^kk)    (2.9a)

where x^kk denotes the kth diagonal element of (X′X)⁻¹.
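The sampling distribution in Equation (2.9) can be visualized with a small Monte Carlo sketch (Python with NumPy; the design matrix, B, and σ are invented for illustration): holding X fixed and redrawing u many times, the OLS estimates scatter around B with a covariance matrix close to σ²(X′X)⁻¹.

    import numpy as np

    rng = np.random.default_rng(3)
    n, sigma = 60, 1.5
    B = np.array([2.0, -1.0, 0.5])
    X = np.column_stack([np.ones(n), rng.uniform(size=(n, 2))])  # fixed in repeated sampling

    XtX_inv = np.linalg.inv(X.T @ X)
    reps = 50_000
    bs = np.empty((reps, 3))
    for r in range(reps):
        u = rng.normal(0.0, sigma, size=n)       # fresh errors each replication
        y = X @ B + u
        bs[r] = XtX_inv @ X.T @ y                # OLS estimate for this sample

    print("mean of b over samples:", bs.mean(axis=0))            # approximately B
    print("cov of b over samples:\n", np.cov(bs, rowvar=False))  # approximately equal to ...
    print("sigma^2 (X'X)^(-1):\n", sigma**2 * XtX_inv)           # ... this matrix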
However, before we can engage in hypothesis testing, we need to estimate the unknown σ². Remember that σ² refers to the variance of the error term u. Since we do not observe u directly, we have to rely on the estimated residuals, e, to learn about the true variance. Toward that end, we need to establish the relationship between u and e. Recall that
e = y − ŷ    (2.10)

Substituting for ŷ from (1.23), we obtain

e = y − Xb
  = y − X(X′X)⁻¹X′y,  substituting for b from Eq. (1.16)
  = My
  = M(XB + u)

where

M = [I − X(X′X)⁻¹X′]
Because MXB = 0,2 it follows that

e = Mu    (2.11)

Therefore,

E(e) = E(Mu)
     = ME(u)
     = 0    (2.12)

2
MXB = [XB − X(X′X)⁻¹X′XB] = XB − IXB = 0, where I is the identity matrix.
because E(u) = 0, by Assumption 1. We have thus shown that the expectation of each element of e is 0.
Now,

cov(e) = E(ee′) = E(Muu′M′)
       = ME(uu′)M′
       = M(σ²I)M′
       = σ²MM′ = σ²M    (2.13)

because M is symmetric and idempotent. Since e = Mu is a linear function of the normally distributed error vector u, it follows that

e ~ N(0, σ²M)    (2.14)
Therefore, like the mean of u, the mean of e is 0. But unlike u, which by assumption has zero expectations, constant variance, and no autocorrelation, the residuals e do not have constant variance (the diagonal elements of σ²M generally differ) and are correlated with one another (the off-diagonal elements of σ²M are generally nonzero). In other words, the properties that hold for u generally do not hold for e, except for zero expectations.3
3
Actually, the distribution of e is degenerate as its variance–covariance matrix is
singular. On this, see Vogelvang, B. (2005). Econometrics: Theory and applications
with Eviews (chapter 4). Harlow, England: Pearson-Addison Wesley.
Recall that

e = Mu    (2.11)

Therefore,

E(e′e) = E(u′M′Mu)
       = E(u′Mu)    (2.15)

because of the properties of M (M′ = M and MM = M). Now,
E(u′Mu) = E[tr(u′Mu)], since u′Mu is a scalar and therefore equals its own trace
        = E[tr(Muu′)], using the cyclic property of the trace
        = tr[M E(uu′)], since the trace and expectations operators are both linear
        = tr[M(σ²I)]
        = σ²tr(M)
        = σ²[tr(Iₙ) − tr((X′X)⁻¹X′X)]
        = σ²(n − k),  since tr(Iₙ) = n and tr(Iₖ) = k    (2.16)
The notation tr(M) means the trace of the matrix M, which is simply the sum of the elements on its main diagonal. From Equation (2.16) it follows that

E[e′e / (n − k)] = σ²    (2.17)
If we now define

S² = e′e / (n − k) = Σeᵢ² / (n − k)    (2.18)

then

E(S²) = σ²    (2.19)

In words, S² is an unbiased estimator of the true error variance σ². S, the square root of S², is called the standard error (se) of the estimate or the standard error of the regression. In practice, therefore, we use S² in place of σ².
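The quantities just derived are easy to compute directly. The sketch below (Python with NumPy; all data are simulated and the parameter values are invented) forms the residual-maker matrix M, the residuals e = My, the unbiased estimator S² of Equation (2.18), and the estimated standard errors of b from S²(X′X)⁻¹.

    import numpy as np

    rng = np.random.default_rng(4)
    n, k = 80, 3
    B = np.array([1.0, 2.0, -0.5])
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ B + rng.normal(0.0, 2.0, size=n)       # true sigma = 2 (illustrative)

    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    M = np.eye(n) - X @ XtX_inv @ X.T              # residual-maker matrix
    e = M @ y                                      # same as y - X @ b

    S2 = (e @ e) / (n - k)                         # unbiased estimator of sigma^2
    se_b = np.sqrt(np.diag(S2 * XtX_inv))          # standard errors of the elements of b
    print("S^2 =", S2, " standard error of regression =", np.sqrt(S2))
    print("b =", b, " se(b) =", se_b)

    # Properties of M used above: symmetric, idempotent, and MX = 0
    print(np.allclose(M, M.T), np.allclose(M @ M, M), np.allclose(M @ X, 0))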
2.3 Properties of OLS Estimators: The Gauss–Markov Theorem4

According to the Gauss–Markov theorem, the OLS estimator b is the best linear unbiased estimator (BLUE) of B. To see first that b is unbiased, write

b = (X′X)⁻¹X′y    (1.16)
  = (X′X)⁻¹X′[XB + u],  substituting for y
  = (X′X)⁻¹X′XB + (X′X)⁻¹X′u
  = B + (X′X)⁻¹X′u
4
In Appendix C, we discuss both small-sample and large-sample properties of OLS
and ML estimators.
5
Although known as the Gauss–Markov theorem, the least-squares approach of
Gauss antedates (1821) the minimum-variance approach of Markov (1900).
6
See the discussion in Darnell, A. C. (1994). A dictionary of econometrics (p. 155).
Cheltenham, England: Edward Elgar.
Now, taking expectations of both sides and noting that E(u | X) = 0 by assumption,

E(b) = B    (2.20)

so b is an unbiased estimator of B. (Recall the definition of an unbiased estimator.)

To prove that in the class of unbiased linear estimators the least-squares estimators have the least variance (i.e., they are efficient), we proceed as follows:
Let b* be another linear estimator of B such that

b* = [A + (X′X)⁻¹X′]y    (2.21)

where A is some nonstochastic k × n matrix (the same dimensions as X′). Simplifying, we obtain
b* = Ay + (X′X)⁻¹X′y
   = Ay + b    (2.22)
where b is the least-squares estimator given in Equation (1.16).
Now
E(b*) = [A + (X′X)⁻¹X′]E(y)
      = [A + (X′X)⁻¹X′](XB)
      = (AX + I)B    (2.23)
Now E(b*) = B if and only if AX = 0. In other words, for the linear estimator
b* to be unbiased, AX must be 0.
Thus, using AX = 0 and substituting y = XB + u into Equation (2.21),

b* = B + [A + (X′X)⁻¹X′]u

Given that u has zero mean and constant variance (= σ²I), we can now find the variance of b* as follows:

var(b*) = E[(b* − B)(b* − B)′]
        = [A + (X′X)⁻¹X′] E(uu′) [A + (X′X)⁻¹X′]′
        = σ²[A + (X′X)⁻¹X′][A′ + X(X′X)⁻¹]
        = σ²[AA′ + (X′X)⁻¹],  since AX = 0 and X′A′ = 0
        = σ²(X′X)⁻¹ + AA′σ²
        = var(b) + AA′σ²    (2.24)
Since AA′ is a positive semidefinite matrix, Equation (2.24) shows that the
covariance matrix of b* is equal to the covariance matrix of b plus a positive
semidefinite matrix. That is, cov(b*) ≥ cov(b), with equality only if A = 0. This shows that in the class of unbiased linear estimators, the least-squares estimator b has the least variance; that is, it is efficient compared with any other linear unbiased estimator of B.
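This efficiency result can be checked numerically. The following Monte Carlo sketch (Python with NumPy; the data, the matrix C, and the scaling 0.05 are arbitrary illustrative choices, not from the text) builds one particular competitor b* = [A + (X′X)⁻¹X′]y with A = 0.05·CM, which guarantees AX = 0 because MX = 0, so b* is unbiased; its sampling variances come out larger than those of the OLS estimator b, as Equation (2.24) predicts.

    import numpy as np

    rng = np.random.default_rng(5)
    n, k, sigma = 40, 3, 1.0
    B = np.array([1.0, -2.0, 0.5])
    X = np.column_stack([np.ones(n), rng.uniform(size=(n, 2))])

    XtX_inv = np.linalg.inv(X.T @ X)
    M = np.eye(n) - X @ XtX_inv @ X.T
    C = rng.normal(size=(k, n))
    A = 0.05 * C @ M                       # AX = CMX = 0, so b* is unbiased

    reps = 40_000
    b_ols = np.empty((reps, k))
    b_star = np.empty((reps, k))
    for r in range(reps):
        y = X @ B + rng.normal(0.0, sigma, size=n)
        b_ols[r] = XtX_inv @ X.T @ y
        b_star[r] = (A + XtX_inv @ X.T) @ y

    print("both unbiased:", b_ols.mean(axis=0).round(2), b_star.mean(axis=0).round(2))
    print("var of OLS b :", b_ols.var(axis=0))
    print("var of b*    :", b_star.var(axis=0))   # elementwise at least as large as OLS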
It is important to note that in establishing the Gauss–Markov theorem we do not have to assume that the error term u follows a particular probability distribution, such as the normal. To establish the theorem, we only need Assumptions 1 to 5.

It is also important to note that if one or more assumptions underlying the Gauss–Markov theorem are not satisfied, the OLS estimators will not be BLUE. Also, bear in mind that the Gauss–Markov theorem holds only for linear estimators, that is, linear functions of the observation vector y. There are situations where nonlinear (in-the-parameter) estimators are more efficient than the linear estimators. In this book, we do not deal with nonlinear estimators, for that requires a separate book.7
To sum up, we have shown that under the Gauss–Markov assumptions, b, the least-squares estimator of B, is BLUE; that is, in the class of unbiased linear estimators, b has the least variance. We also showed how to estimate B and the variance of its estimator b.
2.4 Estimating Linear Functions of the Parameters

We have shown how to estimate B, that is, each of its elements. Suppose we want to estimate some linear function of the elements of B, that is, of B1, B2, B3, . . . , Bk. More specifically, suppose we want to estimate t′B, where t is a known k × 1 vector of constants. What this means is that whether we estimate all the elements of B, or one of its elements, or a linear combination (t′B), we can use the OLS regression.

Let λ = t′B. By choosing t appropriately, we can make λ equal to any single element of B or to any linear combination of its elements. The OLS estimator of λ is λ̂ = t′b, with variance var(λ̂) = t′[σ²(X′X)⁻¹]t.
7
For examples of nonlinear estimators and their applications, see Gujarati, D.
(2015). Econometrics by example (2nd ed.). London, England: Palgrave Macmillan.
But keep in mind that in general the variance of λ̂ depends on every element
of the covariance matrix of b, the estimator of B. However, if some ele-
ments of the vector t are equal to zero, var(λ̂) does not depend on the cor-
responding rows and columns of the covariance matrix σ2(X′X)−1.
As an example, consider λ = 4B1 − B5, so that t1 = 4, t5 = −1, and all other elements of t are zero. In this case,

var(λ̂) = t1²var(b1) + t5²var(b5) + 2t1t5cov(b1, b5)
       = 16var(b1) + var(b5) − 8cov(b1, b5)    (2.26)

Notice that in this example only the variances of b1 and b5 and their covariance are involved, because the other elements of t in the k-variable regression (1.2) are zero. But if more elements of t are nonzero, the variances of the corresponding coefficients and their pairwise covariances will also be involved in computing the variance of the linear combination t′B.
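A short sketch of this calculation (Python with NumPy; the design matrix and σ² = 2 are invented for illustration) shows that the quadratic form t′[σ²(X′X)⁻¹]t reproduces the expansion in Equation (2.26); note that B1 and B5 correspond to indices 0 and 4 in zero-based Python indexing.

    import numpy as np

    rng = np.random.default_rng(6)
    n, k, sigma2 = 100, 5, 2.0
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

    cov_b = sigma2 * np.linalg.inv(X.T @ X)       # sigma^2 (X'X)^(-1)

    t = np.array([4.0, 0.0, 0.0, 0.0, -1.0])      # weights for lambda = 4*B1 - B5
    var_lam = t @ cov_b @ t                       # t' cov(b) t

    # The same quantity written out as in Equation (2.26)
    var_lam_226 = 16 * cov_b[0, 0] + cov_b[4, 4] - 8 * cov_b[0, 4]
    print(var_lam, var_lam_226, np.isclose(var_lam, var_lam_226))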
2.5 Large-Sample Properties of OLS Estimators

We have shown that the OLS estimator b is unbiased, which is a small, or finite sample, property. We can also show that the OLS estimators are consistent, that is, they converge to their true values as the sample size increases indefinitely. Convergence is a large-sample property.

2.5.1 Consistency of the OLS Estimator b

A sufficient condition for an unbiased estimator to be consistent is for its variance to converge to zero as the sample size n increases indefinitely. For the OLS estimator b, we have already shown that its variance is

cov(b) = σ²(X′X)⁻¹    (2.8)
which can also be written as

cov(b) = (σ²/n)(n⁻¹X′X)⁻¹    (2.27)

Therefore,

plim cov(b) = plim (σ²/n) · plim (n⁻¹X′X)⁻¹
            = lim (σ²/n) · plim (n⁻¹X′X)⁻¹    (2.28)

where all limits are taken as n → ∞.
We have assumed that the elements of the matrix X are bounded, which means the second term in the preceding equation is bounded for all n; therefore, it can be replaced by a matrix of finite constants. The first term in Equation (2.28) tends to zero as n increases indefinitely. As a result,

plim cov(b) = 0    (2.29)

which, together with unbiasedness, shows that b is a consistent estimator of B. In deriving this result, we have used some of the properties of the plim operator.
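A quick numerical sketch of this shrinkage (Python with NumPy; the single regressor and σ² = 1 are illustrative assumptions) shows the diagonal elements of cov(b) = σ²(X′X)⁻¹ heading toward zero as n grows.

    import numpy as np

    rng = np.random.default_rng(7)
    sigma2 = 1.0
    for n in (50, 500, 5_000, 50_000):
        X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])
        cov_b = sigma2 * np.linalg.inv(X.T @ X)
        print(f"n = {n:6d}  var(b1) = {cov_b[0, 0]:.2e}  var(b2) = {cov_b[1, 1]:.2e}")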
2.5.2 Consistency of the OLS Estimator of the Error Variance
We have proved that S 2 is an unbiased estimator of σ 2. Assuming values
of ui are independent and identically distributed (iid), we can prove that S 2
is also a consistent estimator of σ 2. The proof is as follows:
S² = (y − Xb)′(y − Xb) / (n − k)

   = u′[I − X(X′X)⁻¹X′]u / (n − k)    (2.30)

   = [n / (n − k)] [u′u/n − (u′X/n)(X′X/n)⁻¹(X′u/n)]

Note: e = My = Mu, where M = [I − X(X′X)⁻¹X′]. Also note how each term in the last line has been divided by n. Taking probability limits,

plim S² = plim [n/(n − k)] { plim (u′u/n) − plim (u′X/n) · [plim (X′X/n)]⁻¹ · plim (X′u/n) }

Since n/(n − k) → 1, plim (u′u/n) = σ² (the ui being iid with variance σ²), and plim (X′u/n) = 0, we obtain

plim S² = σ²
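A brief simulation sketch (Python with NumPy; σ² = 4 and the regressors are invented for illustration) shows S² settling near σ² as the sample size grows, in line with this result.

    import numpy as np

    rng = np.random.default_rng(8)
    sigma2, k = 4.0, 3
    B = np.array([1.0, 0.5, -1.0])
    for n in (30, 300, 3_000, 300_000):
        X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
        y = X @ B + rng.normal(0.0, np.sqrt(sigma2), size=n)
        e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)     # OLS residuals
        S2 = (e @ e) / (n - k)                            # unbiased estimator of sigma^2
        print(f"n = {n:7d}  S^2 = {S2:.3f}")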
Another useful result is that

cov(b, e) = 0    (2.31)

What this says is that each element of b is uncorrelated with each element of the least-squares residual vector e. The proof is as follows:

Recall that

b = B + (X′X)⁻¹X′u    (2.7)
e = Mu    (2.11)

are both linear functions of the error term u. Now the covariance of b and e is

cov(b, e) = (X′X)⁻¹X′ var(u) M′
          = σ²(X′X)⁻¹X′M′
          = 0,  since MX = 0 and hence X′M′ = (MX)′ = 0    (2.32)

Note that this result does not require that the error term u is normally distributed; we will see the consequence of the normality assumption later.
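The zero covariance in Equation (2.32) rests on the algebraic fact X′M′ = 0, which is easy to verify numerically; the sketch below (Python with NumPy, simulated data, all values illustrative) also estimates the covariances between the elements of b and the residuals over repeated samples and finds them essentially zero.

    import numpy as np

    rng = np.random.default_rng(9)
    n, k, sigma = 30, 3, 1.0
    B = np.array([0.5, 1.0, -1.5])
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    XtX_inv = np.linalg.inv(X.T @ X)
    M = np.eye(n) - X @ XtX_inv @ X.T
    print("X'M' = 0:", np.allclose(X.T @ M.T, 0))     # the algebraic fact behind (2.32)

    # Monte Carlo check: covariance of each b_j with each residual e_i is ~ 0
    reps = 20_000
    bs = np.empty((reps, k))
    es = np.empty((reps, n))
    for r in range(reps):
        y = X @ B + rng.normal(0.0, sigma, size=n)
        bs[r] = XtX_inv @ X.T @ y
        es[r] = M @ y
    cross_cov = (bs - bs.mean(0)).T @ (es - es.mean(0)) / reps    # k x n matrix
    print("max |cov(b_j, e_i)| ≈", np.abs(cross_cov).max())       # near zero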
2.5.5 Asymptotic Normality of S²

If, in addition to the classical assumptions, it is assumed that the values of ui are iid and have bounded fourth-order moments about the origin, S² is asymptotically normally distributed. These results also hold true even if the ui values are not iid.10
2.6 Summary

The CLRM, y = XB + u, is the foundation of regression analysis. It is based on several assumptions. The basic assumptions are that (1) the data matrix X is nonstochastic, (2) it is of full column rank, (3) the expected value of the error term is zero, and (4) the covariance matrix of the error term is E(uu′) = σ²I. This means the error variance is constant and equal to σ² and the error terms are mutually uncorrelated.

Under these assumptions, the OLS estimators have desirable properties such as (1) they are unbiased and (2) among all linear unbiased estimators of B, they have minimum variances. This is called the Gauss–Markov theorem. In large samples, (1) the OLS estimators are consistent estimators and (2) the OLS estimators asymptotically follow the normal distribution.
Exercises

2.1 For the two-variable regression model Yi = B1 + B2Xi + ui estimated by OLS, show that

a. cov(b1, b2) = −X̄ [σ² / Σ(Xi − X̄)²]

b. cov(Ȳ, b2) = 0
2.2 For the two-variable regression model, show that

RSS = [Σxi²Σyi² − (Σxiyi)²] / Σxi²

where RSS is the residual sum of squares and

xi = (Xi − X̄);  yi = (Yi − Ȳ);  xiyi = (Xi − X̄)(Yi − Ȳ)
2.3 Verify the following properties of OLS estimators:

a. The OLS regression line (plane) passes through the sample means of the regressand and the regressors.

b. The mean values of the actual Y and the estimated Y (= Ŷ) are the same.

c. In deviation form, the fitted regression can be written as

yi = b2x2i + b3x3i + ⋯ + bkxki + ei

2.4 Consider the following regression in standardized variables:

Yi* = B1* + B2*Xi* + ui
where

Yi* = (Yi − Ȳ)/sY;   Xi* = (Xi − X̄)/sX
a. Show that a standardized variable has a zero mean and unit variance.
c. What is the relationship between B1* and B1 and between B2* and B2?
2.5 The sample correlation coefficient between variables Y and X, rXY, is defined as

rXY = Σxiyi / √(Σxi² Σyi²)

where

xi = (Xi − X̄);  yi = (Yi − Ȳ)
If we standardize variables as in Exercise 2.4, does it affect the correlation
coefficient between X and Y? Show the necessary calculations.
2.6 Consider variables X1, X2, and X3 and their simple correlation coefficients r12, r13, and r23. The partial correlation coefficient between X1 and X2, holding X3 constant, is

r12.3 = (r12 − r13r23) / √[(1 − r13²)(1 − r23²)]

It measures the correlation between X1 and X2 after removing the influence of the variable X3. The concept of partial correlation is akin to the concept of a partial regression coefficient.
a. What other partial correlation coefficients can you compute?
c. B2 + B3 = 2 B4
2.8 Remember that the hat matrix, H, is expressed as

H = X(X′X)⁻¹X′

Show that the residual vector e can also be expressed as

e = (I − H)y
2.9 Prove that the matrices H and (I − H) are idempotent.
2.10* For the following matrix, compute its eigenvalues:

[ 1  0  0 ]
[ 0  1  0 ]
[ 0  0  1 ]

(*Optional)
2.11 Consider the following model:

Yi = B1 + B2Xi + B3Xi² + ui

Models like this are called polynomial regression models, here a second-degree polynomial.

a. Is this an LRM?

b. Can OLS be used to estimate the parameters of this model?

2.12 Consider the following two models:

Yi = B1 + B2X2i + B3X3i + B4X4i + ui
(Yi − X2i) = B1 + B3X3i + B4X4i + ui
for further details).

b. How would you estimate the restricted regression, taking into account the restriction that B2 = 1?