
CHAPTER 2. THE CLASSICAL LINEAR REGRESSION MODEL (CLRM)

In Chapter 1, we showed how we estimate an LRM by the method of least squares. As noted in Chapter 1, estimation and hypothesis testing are the twin branches of statistical inference. Based on the OLS, we obtained the sample regression, such as the one shown in Equation (1.40). This is of course a sample regression function (SRF) because it is based on a specific sample drawn randomly from the purported population. What can we say about the true population regression function (PRF) from the SRF? In practice, we do not observe the PRF and have to "guess" it from the SRF. To obtain the best possible guess, we need a framework, which is provided by the classical linear regression model (CLRM). The CLRM is based on several assumptions, which are discussed below.
2.1 Assumptions of the CLRM

We now discuss these assumptions. In Chapters 5 and 6, we will examine these assumptions more critically. However, keep in mind that in any scientific inquiry we start with a set of simplified assumptions and gradually proceed to more complex situations.

Assumption 1: The regression model is linear in the parameters as in Equation (1.1); it may or may not be linear in the variables, the Ys and Xs.

Assumption 2: The regressors are assumed fixed, or nonstochastic, in the sense that their values are fixed in repeated sampling. However, if the regressors are stochastic, we assume that each regressor is independent of the error term or at least uncorrelated with it. We will discuss this assumption in more detail in Chapter 6.

Assumption 3: Given the values of the X variables, the expected, or mean, value of the error term ui is 0.

E(ui | X) = 0   (2.1)

In matrix notation, we have

E(u | X) = 0   (2.1a)

where 0 is the null vector.


More explicitly,

$$E\begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_n \end{bmatrix} = \begin{bmatrix} E(u_1) \\ E(u_2) \\ E(u_3) \\ \vdots \\ E(u_n) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

Because of this critical assumption, and given the values of the regressors, we can write Equation (1.5) as

E(y | X) = XB + E(u | X)
         = XB   (2.2)

This is the PRF. In regression analysis, our primary objective is to estimate this function. The PRF thus gives the mean value of the regressand corresponding to the given values of the regressors, noting that conditional on these values the mean value of the error term is 0.

Assumption 4: The variance of each ui, given the values of X, is constant or homoscedastic (i.e., of equal variance). That is,

var(ui | X) = σ²   (2.3)

In matrix notation,

var(u | X) = E(uu′) = σ²I   (2.3a)

where I is an n × n identity matrix (see also Assumption 5).

If var(ui | X) = σi², the error variance is said to be heteroscedastic, or of unequal variance. We will discuss this case in Chapter 5. Figure 2.1 is a picture of both homoscedasticity and heteroscedasticity.

Assumption 5: There is no correlation between error terms belonging to two different observations. That is,

cov(ui, uj | X) = 0,  i ≠ j   (2.4)

where cov stands for covariance, and i and j refer to two different error terms. Of course, if i = j, we get the variance of ui given in Equation (2.3). Figure 2.2 shows a likely pattern of autocorrelation.


Figure 2.1 Homoscedasticity and Heteroscedasticity [two scatter panels: Homoscedasticity and Heteroscedasticity]
Figure 2.2 Autocorrelation [sample autocorrelation function plotted against time]

Assumptions 4 and 5 can be expressed as

E(uu′) = σ²I   if i = j
       = 0     if i ≠ j

where 0 is the null matrix and I is the identity matrix. We discuss this assumption further in Chapter 5. More compactly, we can express Assumptions 4 and 5 as

$$E(\mathbf{u}\mathbf{u}') = \sigma^2 \mathbf{I} = \begin{bmatrix} \sigma^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma^2 \end{bmatrix}$$
Assumption 6: There is no perfect linear relationship among the X variables. This is the assumption of no multicollinearity. Strictly speaking, multicollinearity refers to the existence of more than one exact linear relationship, and collinearity refers to the existence of a single exact linear relationship. But this distinction is rarely maintained in practice, and multicollinearity refers to both cases. Imagine what would happen in the wage regression given in Equation (1.5), if we were to include work experience both in years and in months!

In matrix notation, this assumption means that the X matrix is of full column rank. In other words, the columns of the X matrix are linearly independent. This requires that the number of observations, n, is greater than the number of parameters estimated (i.e., the k regression coefficients). We discuss this assumption further in Chapter 7.
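The full-column-rank condition is easy to check numerically. The sketch below is an illustration added here, not part of the original text; it assumes NumPy and made-up experience data, and it mimics the years-and-months example above:

```python
import numpy as np

# Hypothetical data: intercept, work experience in years, and the same
# experience expressed in months (an exact linear transformation of years).
years = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
X = np.column_stack([np.ones_like(years),   # intercept column
                     years,                  # experience in years
                     12.0 * years])          # experience in months = 12 * years

# X has 3 columns but rank 2, so X'X is singular and (X'X)^(-1) does not exist.
print(np.linalg.matrix_rank(X))              # prints 2, not 3
print(np.linalg.cond(X.T @ X))               # a huge (or infinite) condition number signals the problem
```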

Assumption 7: The regression model used in the analysis is correctly specified, that is, there is no (model) specification error or bias. In practice, this is a tall assumption, but in Chapter 7, we discuss fully the import of this assumption.

Assumption 8: Although not a part of the original CLRM, for statistical inference (hypothesis testing), we assume that the error term ui follows the normal distribution with 0 mean and (constant) variance σ². Symbolically,

ui ~ N(0, σ²)   (2.5)

Or in matrix notation,

u ~ N(0, σ²I)   (2.5a)

The assumption of the normality of the error term is crucial if the sample size is rather small; it is not essential if we have a very large sample. However, we will revisit this assumption in Chapter 7. With this assumption, the CLRM is known as the classical normal linear regression model (CNLRM).


Since we are assuming that the X matrix is nonstochastic but u is stochastic, the regressand y is also stochastic. In addition, since u is normally distributed with 0 mean and constant variance, y inherits the properties of u. More specifically,

y ~ N(XB, σ²I)   (2.6)

That is, the regressand is distributed normally with mean XB and the (constant) variance σ².

Under Assumption 8, we can use the method of maximum likelihood (ML) as an alternative to OLS. We will discuss ML more thoroughly in Chapter 3 because of its general applicability in many areas of statistics.

With one or more of the preceding assumptions, in this chapter, we discuss the following topics:

1. The sampling distribution of the OLS estimators, b
2. An estimator of the unknown variance, σ²
3. The relationship between the residual e and the error u
4. Small-sample properties of the OLS estimators
5. Large-sample properties of the OLS estimators

2.2 The Sampling or Probability Distributions of the OLS Estimators


p

Remember that the population parameters in B, although unknown, are


co

constants. However, this is not true of the estimated b coefficients, for their
values depend on the sample data at hand. In other words, the b coefficients
are random. As such, we would like to find their sampling or probability
t

distributions to establish properties of the (OLS) estimators.


no

Recall that

b = ( X′ X ) −1 X′ y, using Equation (1.16)


o
D

Therefore,

b = ( X′ X ) −1 X′ [(XB + u )] , using Equation (1.5)


= ( X′ X )−1 X′ XB + (X′ X )−1 X′ u
= B + (X′ X ) −1 X′ u (2.7)


By the definition of covariance, we obtain

cov(b) = E(b − B)(b − B)′
       = E{[(X′X)⁻¹X′u][(X′X)⁻¹X′u]′}, using Equation (2.7)
       = E[(X′X)⁻¹X′uu′X(X′X)⁻¹]
       = (X′X)⁻¹X′E(uu′)X(X′X)⁻¹
       = (X′X)⁻¹X′σ²IX(X′X)⁻¹
       = σ²(X′X)⁻¹   (2.8)

In deriving this expression, we have used properties of the transpose of an inverse matrix, and the assumption that X is fixed and that the variance of ui is constant and the us are uncorrelated. Notice that we can move the expectations operator through X because it is assumed fixed. The variances of the individual elements of b are on the main diagonal (running from upper left to lower right), and the off-diagonal elements give the covariances between pairs of coefficients in b.
Since

b = (X′X)⁻¹X′y   (1.16)

and the X matrix is fixed, b is a linear function of y. Using Assumption 8, we know that y is normally distributed. It is a property of the normal distribution that any linear function of a normally distributed variable is also normally distributed. Therefore, b is ipso facto normally distributed as follows:

b ~ N(B, σ²(X′X)⁻¹)   (2.9)

That is, b is normally distributed with B as its mean (see Equation 2.20) and the variance established in Equation (2.8). In other words, under the normality assumption, the sampling distribution of the OLS estimator is normal, as shown in Equation (2.9). This finding will aid us in testing hypotheses about any element of B or any linear combination thereof. It may be noted that a sampling distribution is a probability distribution of an estimator or of any test statistic. In other words, it describes the variation in the values of a statistic over all possible samples, here the variation in b over all possible samples.¹

¹ Suppose we draw several independent samples and for each sample we compute a (test) statistic, such as the mean, and draw a frequency distribution of all these statistics. Roughly speaking, this frequency distribution is the sampling distribution of that statistic. In our case, under the assumed conditions, the probability or sampling distribution of any component of b is normal as shown in Equation (2.9).


For any single element of b, bk, we can express Equation (2.9) as

bk ~ N(Bk, σ²x^kk)   (2.9a)

where x^kk is the kth diagonal element of (X′X)⁻¹. The square root of σ²x^kk will give the standard error of bk (see Figure 2.3).
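A small Monte Carlo sketch can make Equation (2.9) concrete. The code below is illustrative only and not part of the original text: the design matrix, parameter vector B, and σ² are made up, and NumPy is assumed. It holds X fixed, redraws the errors many times, and compares the empirical covariance of the OLS estimates with σ²(X′X)⁻¹:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
sigma2 = 4.0
B = np.array([1.0, 0.5, -2.0])                      # "true" parameters (assumed)

# Fixed (nonstochastic) regressors: intercept plus two covariates.
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.uniform(0, 5, n)])

draws = np.empty((5000, k))
for r in range(5000):
    u = rng.normal(0.0, np.sqrt(sigma2), n)         # fresh errors, same X each time
    y = X @ B + u
    draws[r] = np.linalg.solve(X.T @ X, X.T @ y)    # OLS: b = (X'X)^(-1) X'y

print(draws.mean(axis=0))                           # close to B: unbiasedness
print(np.cov(draws, rowvar=False))                  # close to the theoretical covariance
print(sigma2 * np.linalg.inv(X.T @ X))              # sigma^2 (X'X)^(-1)
```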

However, before we can engage in hypothesis testing, we need to estimate the unknown σ². Remember that σ² refers to the variance of the error term u. Since we do not observe u directly, we have to rely on the estimated residuals, e, to learn about the true variance. Toward that end, we need to establish the relationship between u and e. Recall that

e = y − ŷ   (2.10)

Substituting for ŷ from (1.23), we obtain

e = y − Xb
  = y − X(X′X)⁻¹X′y, substituting for b from Eq. (1.16)
  = My, where M = [I − X(X′X)⁻¹X′]
  = M(XB + u)
  = Mu, because MXB = 0²   (2.11)

Figure 2.3 The Distribution of bk, a Component of the Vector b

² MXB = XB − X(X′X)⁻¹X′XB = XB − XB = 0, since (X′X)⁻¹X′X = I, the identity matrix.


As noted in Chapter 1, M is a very important matrix in the analysis of LRMs. It is an idempotent matrix, a square matrix with the property that M = M². For further properties of idempotent matrices, see Appendix A on linear algebra.

Since M is constant because it is a function of (fixed) X, we can write

E(e) = E(Mu)
     = ME(u)
     = 0   (2.12)

because E(u) = 0, by Assumption 3. We have thus shown that the expectation of each element of e is 0.
Now,

cov(e) = E(ee′) = E(Muu′M′)
       = ME(uu′)M′
       = σ²MM′ = σ²M   (2.13)

recalling the properties of M. This equation gives the covariance matrix of e.

Since e is a linear function of u and since u is normally distributed by Assumption 8, we have

e ~ N(0, σ²M)   (2.14)

Therefore, like the mean of u, the mean of e is 0, but unlike u, the residuals are heteroscedastic as well as autocorrelated.³

What Equations (2.13) and (2.14) show is that the residuals e1, e2, . . . , en have zero mean values, generally have different variances, and have nonzero covariances. Remember that in the (population) CLRM the errors u1, u2, . . . , un have zero expectations, have constant variance, and are not autocorrelated (by assumption). In other words, the properties that hold for u generally do not hold for e, except for zero expectations.

³ Actually, the distribution of e is degenerate as its variance–covariance matrix is singular. On this, see Vogelvang, B. (2005). Econometrics: Theory and applications with EViews (chapter 4). Harlow, England: Pearson-Addison Wesley.
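As a quick numerical check of these properties, the sketch below (not from the original text; it assumes NumPy and an arbitrary small design matrix) builds M, verifies that it is symmetric and idempotent, and confirms that the OLS residuals equal My = Mu:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # small illustrative X

M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T         # M = I - X(X'X)^(-1)X'
print(np.allclose(M, M.T), np.allclose(M @ M, M))         # symmetric and idempotent

B = np.array([2.0, -1.0])
u = rng.normal(size=n)
y = X @ B + u
b = np.linalg.solve(X.T @ X, X.T @ y)

e = y - X @ b                                             # OLS residuals
print(np.allclose(e, M @ y), np.allclose(e, M @ u))       # e = My = Mu
print(np.isclose(X.T @ e, 0).all())                       # X'e = 0, a consequence of MX = 0
```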


Although we have assumed that the variance of u (not of e) is constant, equal to σ², we are yet to estimate it from the sample data. Toward that end, we proceed as follows.

Even though we do not observe u, we observe e (after the regression is estimated). Naturally, we will have to estimate the unknown variance from the estimated e. From Equation (2.11), we know that

e = Mu   (2.11)

Therefore,

E(e′e) = E(u′M′Mu)
       = E(u′Mu)   (2.15)

because of the properties of M. Now,

E(e′e) = E(u′Mu)
       = E[tr(u′Mu)], since u′Mu is a scalar
       = E[tr(Muu′)], changing the order of multiplication inside the trace
       = tr[ME(uu′)], since the trace and expectations operators are both linear
       = tr[M(σ²I)]
       = σ²tr(M)
       = σ²[tr(I_n) − tr(X(X′X)⁻¹X′)], using the definition of M
       = σ²[tr(I_n) − tr((X′X)⁻¹X′X)]
       = σ²[tr(I_n) − tr(I_k)], since (X′X)⁻¹X′X = I_k
       = σ²(n − k), since tr(I_n) = n and tr(I_k) = k   (2.16)

The notation tr(M) means the trace of the matrix M, which is simply the sum of the entries of the main diagonal of M. In deriving the steps in Equation (2.16), we have made use of several properties of the trace of a matrix, such as the fact that trace is a linear operator and if AB and BA are both square matrices, then tr(AB) = tr(BA).


As a result, we can now write

E[e′e/(n − k)] = σ²   (2.17)

If we now define

S² = e′e/(n − k) = Σei²/(n − k)   (2.18)

then

E(S²) = σ²   (2.19)

In words, S² is an unbiased estimator of the true error variance σ². S, the square root of S², is called the standard error (se) of the estimate or the standard error of the regression. In practice, therefore, we use S² in place of σ².
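To tie Equations (2.8), (2.18), and (2.9a) together, here is an illustrative OLS routine, a sketch with simulated data rather than the text's own code; it assumes NumPy, and the function name ols and the example data are made up:

```python
import numpy as np

def ols(y, X):
    """Return OLS estimates, residuals, S^2, and coefficient standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                    # b = (X'X)^(-1) X'y
    e = y - X @ b                            # residuals
    s2 = (e @ e) / (n - k)                   # S^2 = e'e / (n - k), unbiased for sigma^2
    se = np.sqrt(s2 * np.diag(XtX_inv))      # se(b_k) = sqrt(S^2 * x^kk)
    return b, e, s2, se

# Simulated example data (assumed, for illustration only).
rng = np.random.default_rng(42)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([1.0, 0.8]) + rng.normal(0, 2.0, n)

b, e, s2, se = ols(y, X)
print(b, s2, se)
```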
,o
2.3 Properties of OLS Estimators: The Gauss–Markov Theorem⁴
The OLS estimators possess some ideal or optimum properties, which are contained in the well-known Gauss–Markov theorem:⁵ Given the assumptions of the classical regression model, in the class of unbiased linear estimators, the least-squares estimators have minimum variance; that is, they are best linear unbiased estimators, BLUE for short. In other words, no other linear, unbiased estimator of B can have a smaller variance than the OLS estimator given in Equation (2.8).

To establish this theorem, first note that b, the OLS estimator of B, is a linear function of the regressand y, as we have established in Chapter 1 (see Equation 1.16).⁶ To prove that b is unbiased, we proceed as follows:

b = (X′X)⁻¹X′y   (1.16)
  = (X′X)⁻¹X′[XB + u], substituting for y
  = (X′X)⁻¹X′XB + (X′X)⁻¹X′u
  = B + (X′X)⁻¹X′u

⁴ In Appendix C, we discuss both small-sample and large-sample properties of OLS and ML estimators.
⁵ Although known as the Gauss–Markov theorem, the least-squares approach of Gauss antedates (1821) the minimum-variance approach of Markov (1900).
⁶ See the discussion in Darnell, A. C. (1994). A dictionary of econometrics (p. 155). Cheltenham, England: Edward Elgar.


Now

E(b) = B + (X′X)⁻¹X′E(u)
     = B   (2.20)

In words, the expected value of b is equal to B, thus proving that b is unbiased. (Recall the definition of an unbiased estimator.) Note that E(u|X) = 0 by assumption.

To prove that in the class of unbiased linear estimators the least-squares estimators have the least variance (i.e., they are efficient), we proceed as follows:

Let b* be another linear estimator of B such that

b* = [A + (X′X)⁻¹X′]y   (2.21)

where A is some nonstochastic k × n matrix, similar to X. Simplifying, we obtain

b* = Ay + (X′X)⁻¹X′y
   = Ay + b   (2.22)

where b is the least-squares estimator given in Equation (1.16).

Now

E(b*) = [A + (X′X)⁻¹X′]E(y)
      = [A + (X′X)⁻¹X′](XB)
      = (AX + I)B   (2.23)

Now E(b*) = B if and only if AX = 0. In other words, for the linear estimator b* to be unbiased, AX must be 0.

Thus,

b* = [A + (X′X)⁻¹X′][XB + u], substituting for y
   = B + [A + (X′X)⁻¹X′]u, because AX = 0

Given that u has zero mean and constant variance (= σ²I), we can now find the variance of b* as follows:

cov(b*) = E{[A + (X′X)⁻¹X′]uu′[A + (X′X)⁻¹X′]′}
        = [A + (X′X)⁻¹X′]E(uu′)[A + (X′X)⁻¹X′]′
        = σ²[AA′ + (X′X)⁻¹]
        = σ²(X′X)⁻¹ + AA′σ²
        = var(b) + AA′σ²   (2.24)


Since AA′ is a positive semidefinite matrix, Equation (2.24) shows that the covariance matrix of b* is equal to the covariance matrix of b plus a positive semidefinite matrix. That is, cov(b*) > cov(b), unless A = 0. This shows that in the class of unbiased linear estimators, the least-squares estimator b has the least variance, that is, it is efficient compared with any other linear unbiased estimator of B.

It is important to note that in establishing the Gauss–Markov theorem we do not have to assume that the error term u follows a particular probability distribution, such as the normal. To establish the theorem, we only need Assumptions 1 to 5.

It is also important to note that if one or more assumptions underlying the Gauss–Markov theorem are not satisfied, the OLS estimators will not be BLUE. Also, bear in mind that the Gauss–Markov theorem holds only for linear estimators, that is, linear functions of the observation vector y. There are situations where nonlinear (in-the-parameter) estimators are more efficient than the linear estimators. In this book, we do not deal with nonlinear estimators, for that requires a separate book.⁷

To sum up, we have shown that under the Gauss–Markov assumptions, b, the least-squares estimator of B, is BLUE, that is, in the class of unbiased linear estimators, b has the least variance. We also showed how to estimate B and the variance of the estimated B.

⁷ For examples of nonlinear estimators and their applications, see Gujarati, D. (2015). Econometrics by example (2nd ed.). London, England: Palgrave Macmillan.
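The Gauss–Markov result can also be illustrated by simulation. The sketch below is illustrative only and not part of the original text; it assumes NumPy, and the matrix C used to construct A is arbitrary. An alternative estimator b* = [A + (X′X)⁻¹X′]y is built with AX = 0, so it is unbiased, but, as Equation (2.24) predicts, it has larger sampling variances than OLS:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
B = np.array([1.0, 0.5])
sigma = 2.0

M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # MX = 0
C = rng.normal(size=(k, n))                        # arbitrary fixed k x n matrix
A = 0.05 * C @ M                                   # guarantees AX = 0, so b* is unbiased

W_ols = np.linalg.inv(X.T @ X) @ X.T               # OLS weighting matrix
W_alt = A + W_ols                                  # weights of the alternative estimator

b_ols, b_alt = [], []
for _ in range(5000):
    y = X @ B + rng.normal(0, sigma, n)
    b_ols.append(W_ols @ y)
    b_alt.append(W_alt @ y)

b_ols, b_alt = np.array(b_ols), np.array(b_alt)
print(b_ols.mean(axis=0), b_alt.mean(axis=0))      # both close to B: unbiased
print(b_ols.var(axis=0), b_alt.var(axis=0))        # the alternative has larger variances
```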
po

2.4 Estimating Linear Functions of the OLS Parameters

We have shown how to estimate B, that is, each of its elements. Suppose we want to estimate some linear function of the elements of B, that is, of B1, B2, B3, . . . , Bk. More specifically, suppose we want to estimate t′B, where t′ is a 1 × k vector of real numbers and B is a k × 1 vector of the parameters in B. It can be shown that the BLUE of t′B is t′b, where b is the least-squares estimator of B (see also Appendix C).

What this means is that whether we estimate all the elements of B, or one of its elements, or estimate a linear combination (t′B), we can use the OLS regression.
Let λ = t′B. By choosing t appropriately, we can make λ equal to any element of B, or to the sum of the elements of B, or to any other combination that might be of interest to researchers. Suppose in Equation (1.2) we want the weight on B1 to equal 4, the weight on B5 to equal −1, and the rest of the weights to be all zeros. In other words, we want λ = 4B1 − B5. Here, t′ = (4, 0, 0, 0, −1, 0, 0, 0, . . .). Using the definition of variance, we can now find the variance of the estimated λ (= λ̂), which is


var(λ̂) = t′[var(b)]t = σ²t′(X′X)⁻¹t   (2.25)

But keep in mind that in general the variance of λ̂ depends on every element of the covariance matrix of b, the estimator of B. However, if some elements of the vector t are equal to zero, var(λ̂) does not depend on the corresponding rows and columns of the covariance matrix σ²(X′X)⁻¹.

As an example, consider λ = 4B1 − B5. In this case,

var(λ̂) = t1²var(b1) + t5²var(b5) + 2t1t5cov(b1, b5)
        = 16var(b1) + var(b5) − 8cov(b1, b5)   (2.26)

Notice that in this example only the variances of b1 and b5 and their covariance are involved, as the weights on the other coefficients in the k-variable regression (1.2) are zero. But if there are more nonzero weights, the variances and pairwise covariances of the corresponding coefficients will also be involved in computing the variance of the estimated linear combination t′b.
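Numerically, Equation (2.25) is just a quadratic form in the estimated covariance matrix of b. The sketch below is illustrative, not part of the original text; it assumes NumPy, a made-up design matrix, and that S² is used in place of the unknown σ². It computes var(λ̂) for λ = 4B1 − B5 both ways:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 120, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.3, 0.0, 2.0]) + rng.normal(0, 1.5, n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = (e @ e) / (n - k)                       # S^2, estimate of sigma^2
cov_b = s2 * XtX_inv                         # estimated cov(b)

t = np.array([4.0, 0.0, 0.0, 0.0, -1.0])     # weights for lambda = 4*B1 - B5
lam_hat = t @ b
var_lam = t @ cov_b @ t                      # t' cov(b) t, as in Equation (2.25)
print(lam_hat, var_lam)

# Equivalent expanded form, Equation (2.26); B1 and B5 are the 1st and 5th coefficients.
print(16 * cov_b[0, 0] + cov_b[4, 4] - 8 * cov_b[0, 4])
```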
,o
2.5 Large-Sample Properties of OLS Estimators

2.5.1 Consistency of OLS Estimators

We have shown that the OLS estimators of the CLRM are unbiased, which is a small-sample, or finite-sample, property. We can also show that the OLS estimators are consistent, that is, they converge to their true values as the sample size increases indefinitely. Convergence is a large-sample property.

Proof: A sufficient condition for an unbiased estimator to be consistent is for its variance to converge to zero as the sample size n increases indefinitely. For the OLS estimator b, we have already shown that its variance is

cov(b) = σ²(X′X)⁻¹   (2.8)

which we can write as

cov(b) = (σ²/n)(n⁻¹X′X)⁻¹   (2.27)

To see the behavior of this expression as n → ∞, we have

plim cov(b) = plim [(σ²/n)(n⁻¹X′X)⁻¹]
            = plim (σ²/n) · plim (n⁻¹X′X)⁻¹   (2.28)

where plim is the probability limit (see Appendix C for details) and all limits are taken as n → ∞.


We have assumed that the elements of the matrix X are bounded, which means the second term in the preceding equation is bounded for all n. Therefore, the second term above can be replaced by a matrix of finite constants. Now, the first term in Equation (2.28) tends to zero as n increases indefinitely. As a result,

plim cov(b) = 0   (2.29)

which establishes that b is a consistent estimator of B. In establishing this result, we have used some of the properties of the plim operator.
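Consistency can be visualized by watching the coefficient variances shrink as n grows. This is a rough sketch under assumed values (NumPy, a simple bivariate design, σ² = 4), not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2 = 4.0                                      # assumed error variance

for n in (25, 100, 400, 1600):
    # Bivariate design: intercept plus one regressor, with larger samples each pass.
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
    cov_b = sigma2 * np.linalg.inv(X.T @ X)       # cov(b) = sigma^2 (X'X)^(-1)
    print(n, np.diag(cov_b))                      # both variances shrink roughly like 1/n
```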

2.5.2 Consistency of the OLS Estimator of the Error Variance

We have proved that S² is an unbiased estimator of σ². Assuming the values of ui are independent and identically distributed (iid), we can prove that S² is also a consistent estimator of σ². The proof is as follows:

$$
\begin{aligned}
S^2 &= \frac{(y - Xb)'(y - Xb)}{n - k} \\
    &= \frac{u'[I - X(X'X)^{-1}X']u}{n - k} \qquad (2.30) \\
    &= \left(\frac{n}{n - k}\right)\left[\frac{u'u}{n} - \frac{u'X}{n}\left(\frac{X'X}{n}\right)^{-1}\frac{X'u}{n}\right]
\end{aligned}
$$

Note: e = My = Mu, where M = [I − X(X′X)⁻¹X′]. Also note how the terms are manipulated by multiplying or dividing them by the sample size or the adjusted sample size without affecting the basic relationships.
Taking the plim of both sides of Equation (2.30), we obtain

$$
\begin{aligned}
\operatorname{plim} S^2 &= \operatorname{plim}\left(\frac{n}{n-k}\right)\left[\operatorname{plim}\frac{u'u}{n} - \operatorname{plim}\frac{u'X}{n}\cdot\operatorname{plim}\left(\frac{X'X}{n}\right)^{-1}\cdot\operatorname{plim}\frac{X'u}{n}\right] \\
&= 1\cdot(\sigma^2 - 0\cdot Q^{-1}\cdot 0), \quad \text{where } \operatorname{plim}\left(\frac{X'X}{n}\right)^{-1} = Q^{-1} \qquad (2.31) \\
&= \sigma^2
\end{aligned}
$$

which establishes the result. Note that for large n, (n − k) ≈ n.


In deriving the preceding result, we have used Khinchine's theorem (see Appendix B) as well as the properties of the plim.
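A quick numerical check of this consistency result follows; it is an illustration with simulated iid normal errors (NumPy assumed), not from the original text:

```python
import numpy as np

rng = np.random.default_rng(11)
sigma2, B = 4.0, np.array([1.0, 0.5])

for n in (50, 500, 5000, 50000):
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
    y = X @ B + rng.normal(0, np.sqrt(sigma2), n)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = (e @ e) / (n - X.shape[1])          # S^2 for this sample
    print(n, round(s2, 3))                    # settles ever closer to sigma^2 = 4
```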

2.5.3 Independence of the OLS Estimators and the Residual Term, e

What this says is that each element of b is uncorrelated with each element of the least-squares residual vector e. The proof is as follows:

Recall that

b = B + (X′X)⁻¹X′u   (2.7)
e = Mu   (2.11)

are both linear functions of the error term u. Now the covariance of b and e is

cov(b, e) = (X′X)⁻¹X′var(u)M′
          = σ²(X′X)⁻¹X′M′
          = 0, since MX = 0 and hence X′M′ = 0   (2.32)

This shows that b and e are uncorrelated. It may be noted that if we assume that u is normally distributed, b and e are not only uncorrelated but also independent. In Chapter 3, we will consider the normal linear regression model, which explicitly assumes that the error term u is normally distributed, and will see the consequences of the normality assumption.
2.5.4 Large-Sample Distribution of b: Asymptotic Normality of the OLS Estimators

It can be shown that⁸

b ~asy N(B, σ²(X′X)⁻¹)   (2.33)

where asy means asymptotically (i.e., as n → ∞). In other words, as the sample size n increases indefinitely, b is approximately normally distributed with mean equal to B and variance equal to σ²(X′X)⁻¹. Each element of b is individually normally distributed with variance equal to the appropriate element of the variance matrix σ²(X′X)⁻¹. This result holds whether u is normally distributed or not.

It may be noted that if the errors ui are not iid, even then b is asymptotically normally distributed as in (2.33) under certain conditions.⁹

⁸ The proof is rather complicated and can be found in Theil, H. (1971). Principles of econometrics (pp. 380–381). New York, NY: Wiley; see also Mittlehammer, R. C. (1996). Mathematical statistics for economics and business (pp. 443–447). New York, NY: Springer.
⁹ See Mittlehammer, R. C. (1996). Mathematical statistics for economics and business (p. 445). New York, NY: Springer.

2.5.5 Asymptotic Normality of S²

If, in addition to the classical assumptions, it is assumed that the values of ui are iid and have bounded fourth-order moments about the origin, S² is asymptotically normally distributed. These results also hold true even if the ui values are not iid.¹⁰

¹⁰ See Mittlehammer, R. C. (1996). Mathematical statistics for economics and business (pp. 448–449). New York, NY: Springer.
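The asymptotic normality results in Sections 2.5.4 and 2.5.5 can be illustrated even with decidedly non-normal errors. The sketch below is illustrative only (not from the original text); it assumes NumPy and uses uniform errors, which are iid with bounded fourth moments, and shows that the simulated distribution of the slope estimator is close to the normal distribution implied by Equation (2.33):

```python
import numpy as np

rng = np.random.default_rng(9)
n, reps = 500, 4000
B = np.array([1.0, 0.5])
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])     # fixed regressors
sigma2 = (2.0 ** 2) / 12.0                                    # variance of U(-1, 1) errors

slopes = np.empty(reps)
for r in range(reps):
    u = rng.uniform(-1.0, 1.0, n)                             # non-normal (uniform) errors
    y = X @ B + u
    slopes[r] = np.linalg.solve(X.T @ X, X.T @ y)[1]

theory_sd = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])    # from sigma^2 (X'X)^(-1)
print(slopes.std(ddof=1), theory_sd)                          # nearly equal
# Standardized slopes should look standard normal, e.g. about 95% within +/- 1.96:
z = (slopes - B[1]) / theory_sd
print(np.mean(np.abs(z) < 1.96))                              # approximately 0.95
```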

2.6 Summary

The CLRM, y = XB + u, is the foundation of regression analysis. It is based on several assumptions. The basic assumptions are that (1) the data matrix X is nonstochastic, (2) it is of full column rank, (3) the expected value of the error term is zero, and (4) the covariance matrix of the error term is E(uu′) = σ²I. This means the error variance is constant and equal to σ² and that the error terms are mutually uncorrelated.

We used the method of OLS to estimate the parameters of an LRM. One reason for using OLS is that it does not require us to make assumptions about the probability distribution of the error term, and it is comparatively easy to compute. Parameters of the CLRM estimated by OLS are called OLS estimators. OLS estimators have several desirable statistical properties: (1) they are unbiased, and (2) among all linear unbiased estimators of B, they have minimum variance. This is called the Gauss–Markov theorem. These are small-sample properties.

OLS estimators have these asymptotic, or large-sample, properties: (1) the OLS estimators of B as well as the estimator of the error variance are consistent estimators, and (2) the OLS estimators asymptotically follow the normal distribution.
Exercises

2.1 Consider the bivariate regression: Yi = B1 + B2Xi + ui. Under the classical linear regression assumptions, show that

a. cov(b1, b2) = −X̄σ²/Σ(Xi − X̄)²

b. cov(Ȳ, b2) = 0

2.2 Show that for the model in Exercise 2.1,

$$\text{RSS} = \frac{\Sigma x_i^2 \Sigma y_i^2 - (\Sigma x_i y_i)^2}{\Sigma x_i^2}$$

where RSS is the residual sum of squares and

xi = (Xi − X̄); yi = (Yi − Ȳ); xiyi = (Xi − X̄)(Yi − Ȳ)
2.3 Verify the following properties of OLS estimators:

a. The OLS regression line (plane) passes through the sample means of the regressand and the regressors.
b. The mean values of the actual Y and the estimated Y (= Ŷ) are the same.
c. In the CLRM with intercept, the mean value of the residuals (e) is zero.
d. As a result of the preceding property, the k-variable sample CLRM can be expressed as

yi = b2x2i + b3x3i + ⋯ + bkxki + ei

where yi = (Yi − Ȳ); xki = (Xki − X̄k)

2.4 Consider the following bivariate regression model:

Yi* = B1* + B2*Xi* + ui

where

Yi* = (Yi − Ȳ)/sY;  Xi* = (Xi − X̄)/sX

where sY and sX are the sample standard deviations of Y and X. Yi* and Xi* are known as standardized variables, often known as Z scores. Since the units of measurement of the Z scores in the numerator and the denominator are the same, they are called "pure" or "unitless" numbers.


a. Show that a standardized variable has a zero mean and unit variance.

b. What are the formulas to estimate B1* and B2*?

c. What is the relationship between B1* and B1 and between B2* and B2?
2.5 The sample correlation coefficient between variables Y and X, rXY, is defined as

$$r_{XY} = \frac{\Sigma x_i y_i}{\sqrt{\Sigma x_i^2 \Sigma y_i^2}}$$

where

xi = (Xi − X̄); yi = (Yi − Ȳ)

If we standardize the variables as in Exercise 2.4, does it affect the correlation coefficient between X and Y? Show the necessary calculations.
,o
2.6 Consider variables X1, X2, and X3. Now consider the following correlation coefficients:

r12 = correlation coefficient between X1 and X2
r13 = correlation coefficient between X1 and X3
r23 = correlation coefficient between X2 and X3

$$r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{1 - r_{13}^2}\sqrt{1 - r_{23}^2}}$$

r12.3 is called the partial correlation coefficient between X1 and X2, holding constant the influence of the variable X3. The concept of partial correlation is akin to the concept of a partial regression coefficient.

a. What other partial correlation coefficients can you compute?
b. If we standardize the three variables as in Exercise 2.4, would the correlation coefficients among the standardized variables be different from those among the unstandardized variables?
c. Would the partial correlation coefficients be affected by standardizing the variables? Explain.
2.7 Consider the following LRM:
Yi = B1 + B2X2i + B3X3i + B4X4i + B5X5i + ui


How would you test the following hypotheses?

a. B2 = B3 = B4 = B5 = B, that is, all partial regression coefficients are the same.
b. B2 = B3 and B4 = B5
c. B2 + B3 = 2B4
2.8 Remember that the hat matrix, H, is expressed as

H = X(X′X)⁻¹X′

Show that the residual vector e can also be expressed as

e = (I − H)y

2.9 Prove that the matrices H and (I − H) are idempotent.
2.10* For the following matrix, compute its eigenvalues:

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

(*Optional)
2.11 Consider the following regression model (see Chapter 7, Equation (7.30)):

Yi = B1 + B2Xi + B3Xi² + ui

Models like this are called polynomial regression models, here a second-degree polynomial.
a. Is this an LRM?
b. Can OLS be used to estimate the parameters of this model?
c. Since Xi² is the square of Xi, does this model suffer from perfect collinearity?

2.12 Consider the following model:

Yi = B1 + B2X2i + B3X3i + B4X4i + ui

You are told that B2 = 1.


a. In this case, is it legitimate to estimate the following regression?

(Yi − X2i) = B1 + B3X3i + B4X4i + ui

This model is called a restricted linear regression, whereas the preceding model is called an unrestricted linear regression (see Chapter 4, Appendix 4A for further details).

b. How would you estimate the restricted regression, taking into account the restriction that B2 = 1?
