
Refresher Course for Economics Students

Course: Econometrics¹
Prof.: Dr. Ricardo Quineche Uribe²
T.A.: Rita Huarancca Delgado³

¹ This content cannot be shared, uploaded, or distributed. This content has not been written for publication.
² The contact email address is [email protected].
³ The contact email address is [email protected].

Contents

1 Introduction

2 Large sample theory
  2.1 Existence of moments
  2.2 Convergence in Probability
  2.3 Markov's inequality
  2.4 Weak Law of Large Numbers (WLLN)
  2.5 Consistency of estimators
  2.6 Convergences in marginal probabilities imply convergence in joint probability
  2.7 Continuous Mapping Theorem (CMT) for convergence in probability
  2.8 Convergence in moments
    2.8.1 Properties
  2.9 Cauchy-Schwarz Inequality
  2.10 Jensen's Inequality
  2.11 Convergence in Distribution
    2.11.1 Properties
  2.12 Continuous Mapping Theorem (CMT) for convergence in distribution
  2.13 Slutsky's Lemma
  2.14 Cramér-Wold Device
  2.15 Central Limit Theorem (CLT)
    2.15.1 Univariate CLT
    2.15.2 Multivariate CLT
  2.16 Delta Method
    2.16.1 For one variable
    2.16.2 Multivariate version
  2.17 Example 1: Constructing an estimator of σ²(P)
  2.18 Example 2: Applying Delta Method: univariate
  2.19 Example 3: Applying Delta Method: multivariate
  2.20 Constructing a confidence set
    2.20.1 Not relying on Asymptotics
    2.20.2 Relying on Asymptotics
    2.20.3 Coverage Property
  2.21 Example 4: Constructing confidence sets with Markov's inequality
  2.22 Example 5: Constructing confidence sets using (M)CLT
  2.23 Hypothesis testing
    2.23.1 Consistency in level
    2.23.2 p-value of a test
    2.23.3 Steps
  2.24 Example 6: Test statistic for the sample mean
  2.25 Testing multidimensional hypothesis
  2.26 Example 7: Testing multidimensional hypothesis

3 Conditional expectations
  3.1 Expected Value Rule
  3.2 Expected Value Rule for conditional expectation
    3.2.1 Given an event A
    3.2.2 Given that a random variable Y takes on a specific value
    3.2.3 As a random variable
  3.3 The mean of E[X | Y]: Law of Iterated Expectations (LIE)
  3.4 Properties of conditional expectations
  3.5 About Independence
    3.5.1 Mean independence ⇏ independence
    3.5.2 Mean independence ⇒ zero covariance
    3.5.3 Uncorrelatedness ⇏ mean independence

4 Linear Regressions
  4.1 Three interpretations of linear regressions
    4.1.1 Linear Conditional Expectation
    4.1.2 "Best" Linear Approximation to Conditional Expectation / "Best" Linear Predictor of Y | X
    4.1.3 Causal model interpretation
  4.2 Linear Regression when E[Xu] = 0
    4.2.1 Solving for β
    4.2.2 Solving for subvectors of β
    4.2.3 Solving for β2 when X′1 = 1
    4.2.4 Estimating β
    4.2.5 Unbiasedness of the natural estimator of β
    4.2.6 Consistency of the natural estimator of β
    4.2.7 Limiting distribution of the natural estimator of β
    4.2.8 Limiting distribution of the natural estimator of β if E[u | X] = 0
    4.2.9 Estimating Ω
    4.2.10 Gauss Markov Theorem
    4.2.11 Inference: Testing a single linear restriction
    4.2.12 Inference: Testing multiple linear restrictions (Wald Test)
    4.2.13 Tests of non-linear restrictions
  4.3 Linear Regression when E[Xu] ≠ 0
    4.3.1 Motivating examples for when E[Xu] ≠ 0
    4.3.2 What happens to the OLS estimator if E[Xu] ≠ 0?
    4.3.3 Solving for β: Instrumental Variables
    4.3.4 Estimating β
    4.3.5 Consistency of the TSLS estimator
    4.3.6 Limiting distribution of the TSLS estimator
    4.3.7 Efficiency of the TSLS estimator
    4.3.8 Estimating Ω

3 Conditional expectations 18
3.1 Expected Value Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Expected Value Rule for conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.1 Given an event A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.2 Given that a random variable Y takes on an specific value . . . . . . . . . . . . . . . . 19
3.2.3 As a random variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 The mean of E[X | Y ]: Law of Iterated Expectations (LIE) . . . . . . . . . . . . . . . . . . . 19
3.4 Properties of conditional expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 About Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5.1 Mean independence ⇏ independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5.2 Mean independence ⇒ zero covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5.3 Uncorrelatedness ⇏ mean independence . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4 Linear Regressions 21
4.1 Three interpretations of linear regressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.1 Linear Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.2 ”Best” Linear Approximation to Conditional Expectation / ”Best” Linear Predictor
of Y | X . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.3 Causal model interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Linear Regression when E[Xu] = 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.1 Solving for β . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.2 Solving for subvectors of β . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.3 Solving for β 2 when X′1 = 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2.4 Estimating β . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.5 Unbiasedness of the natural estimator of β . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.6 Consistency of the natural estimator of β . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2.7 Limiting distribution of the natural estimator of β . . . . . . . . . . . . . . . . . . . . 28
4.2.8 Limiting distribution of the natural estimator of β if E[u | X] = 0 . . . . . . . . . . . . 29
4.2.9 Estimating Ω . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2.10 Gauss Markov Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.11 Inference: Testing a single linear restriction . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.12 Inference: Testing multiple linear restrictions (Wald Test) . . . . . . . . . . . . . . . . 33
4.2.13 Tests of non-linear restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Linear Regression when E[Xu] ̸= 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.1 Motivating examples for when E[Xu] ̸= 0 . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.2 What happens to the OLS estimator if E[Xu] ̸= 0 ? . . . . . . . . . . . . . . . . . . . 36
4.3.3 Solving for β: Instrumental Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.4 Estimating β . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3.5 Consistency of the TSLS estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.6 Limiting distribution of the TSLS estimator . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.7 Efficiency of the TSTS estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.8 Estimating Ω . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

1 Introduction
In econometrics, there are three typical problems we wish to address.
Suppose X1 , X2 , . . . , Xn are independently and identically distributed according to the cumulative distribu-
tion function (CDF) P . Then, our goal is to “learn” some “features” of P from the data.
i Estimate θ(P): Construct a function called an estimator, θ̂n = θ̂n(X1, X2, . . . , Xn), that provides a
"best guess" for θ(P). For example, µ(P) = E[Xi], σ²(P) = Var_P[Xi], or the regression coefficients β(P).
ii Test a hypothesis about θ(P ): For example, a test could be of the form θ(P ) = θ0 , and we want to
construct a function, ϕn = ϕn (X1 , X2 , . . . , Xn ) ∈ [0, 1], that determines the probability with which you
reject the hypothesis.
iii Construct a confidence region for θ(P ): Construct a random set, Cn = Cn (X1 , X2 , . . . , Xn ), that contains
θ(P ) with some pre-specified probability.
Studying finite sample properties of the above is typically difficult (unless we make some strong assumptions
about P ). This is why we wish to study large-sample properties.

2 Large sample theory


2.1 Existence of moments
Definition 1.1. We say that the mean of a random vector X, denoted E[X], exists if E[|X|] < ∞. If X is a
random variable on R, then we can split E[X] into positive and negative parts:

E[X] = E[X⁺] − E[X⁻],

where
X⁺ := max{X, 0},  X⁻ := max{−X, 0}.

Then, notice that
E[|X|] = E[X⁺] + E[X⁻].

Thus, the requirement that E[|Xi|] < ∞ is equivalent to requiring that E[Xi⁺] < ∞ and E[Xi⁻] < ∞ (i.e.
they both exist), since E[Xi⁺], E[Xi⁻] ≥ 0 by construction.

2.2 Convergence in Probability


Let {Xn : n ≥ 1} and X be random vectors on R^k. Xn is said to converge in probability to X, denoted
Xn →p X, if, as n → ∞,
P(|Xn − X| > ε) → 0, ∀ε > 0,
where |·| is the usual Euclidean norm⁴. Equivalently,

lim_{n→∞} P(|Xn − X| > ε) = 0 for any ε > 0.

⁴ On the n-dimensional Euclidean space Rⁿ, the intuitive notion of the length of a vector x = (x1, x2, . . . , xn) is captured
by the positive root of |x| := √(x1² + · · · + xn²).

2.3 Markov’s inequality
For any random variable X,
P(|X| > ε) ≤ E[|X|^q] / ε^q,  ∀q, ε > 0,
where |·| is the usual Euclidean norm.

2.4 Weak Law of Large Numbers (WLLN)


Recall X̄n ≡ (1/n) Σ_{i=1}^n Xi. If
1. {Xn : n ≥ 1} is a sequence of iid random variables on R with CDF P
2. E[Xi] ≡ µ(P) < ∞ (i.e., the mean exists)
Then,
(1/n) Σ_{i=1}^n Xi →p E[Xi].
That is,
X̄n →p µ(P).
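
A quick numerical illustration of the WLLN (a minimal Python sketch; the Exponential population with mean 0.5 is an assumed example, not part of the derivation):

import numpy as np

rng = np.random.default_rng(0)
mu = 0.5  # true mean of the assumed Exponential population

# |X̄n - mu| shrinks as n grows, illustrating X̄n ->p mu(P).
for n in (10, 100, 10_000, 1_000_000):
    x = rng.exponential(scale=mu, size=n)   # iid draws from P
    print(n, abs(x.mean() - mu))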

2.5 Consistency of estimators


• Consistency of an estimator means that, as the sample size gets large, the estimate gets closer and closer
  to the true value of the parameter.
• Let X1, X2, . . . , Xn be iid ∼ P on R and suppose that µ(P) exists.
• We wish to construct an estimator of µ(P).
• As we are able to apply the WLLN, a natural estimator that converges to µ(P) is X̄n:
  X̄n →p µ(P)
• That is, X̄n is consistent for µ(P).
• This is a good strategy for constructing consistent estimators: we construct the estimator for the
  unknown parameter by trying to take advantage of the WLLN.
• Recall that consistency requires X̄n to converge to the true/population value of the distribution, µ(P).
  Thus, although X̄n + 1 would converge in probability to a constant, it is not consistent for µ(P).
• Consistency is related, but not equivalent, to unbiasedness. Unbiasedness requires that
  E[X̄n] = µ(P),
  i.e. it is not an asymptotic property. Note that unbiasedness does not imply consistency, nor does
  consistency imply unbiasedness.

2.6 Convergences in marginal probabilities imply convergence in joint probability
Suppose {Xn : n ≥ 1}, {Yn : n ≥ 1}, X and Y are random variables and that Xn →p X and Yn →p Y. Then,
(Xn, Yn) →p (X, Y).
This is a very useful property of convergence in probability, especially when combined with the Continuous
Mapping Theorem.

2.7 Continuous Mapping Theorem (CMT) for convergence in probability
Let {Xn : n ≥ 1} and X be random vectors on R^k. Suppose that g : R^k → R^d is continuous at each point in
the set C ⊆ R^k such that P(X ∈ C) = 1. Then,
Xn →p X ⇒ g(Xn) →p g(X).
• Recall that if {Xn : n ≥ 1}, {Yn : n ≥ 1}, X and Y are random variables and that Xn →p X and
  Yn →p Y, then
  (Xn, Yn) →p (X, Y).
• The Continuous Mapping Theorem then allows us to define a function g(Xn, Yn) so that
  g(Xn, Yn) →p g(X, Y).
• For example, g could be Xn + Yn, Xn·Yn, Xn/Yn (if Y ≠ 0), etc.
• We will use this property many times in proving consistency.

2.8 Convergence in moments

Let {Xn : n ≥ 1} and X be random vectors on R^k. Xn is said to converge in qth moment to X, with
q ≥ 1, denoted Xn →Lq X, if, as n → ∞,
E[|Xn − X|^q] → 0.

2.8.1 Properties
• Convergence in moments implies convergence in probability: Let {Xn : n ≥ 1} and X be random
  vectors on R^k. Then,
  Xn →Lq X ⇒ Xn →p X.
• Convergence in probability does not imply convergence in moments.

2.9 Cauchy-Schwarz Inequality

For any random variables u and v with E[u²] < ∞ and E[v²] < ∞,
E[uv]² ≤ E[u²] E[v²].
Furthermore, the inequality is strict unless there exists α such that P(u = αv) = 1.

2.10 Jensen's Inequality

Let I ⊆ R be a convex set (i.e. an interval on R) and f : I → R a convex function. Then, for any random
variable X such that P(X ∈ I) = 1, if:
1. E[|X|] < ∞
2. E[|f(X)|] < ∞
then
f(E[X]) ≤ E[f(X)].
If f is a concave function (i.e. −f is convex), then the inequality above is reversed.

2.11 Convergence in Distribution
Let {Xn : n ≥ 1} and X be random vectors on R^k. Xn is said to converge in distribution to X, denoted
Xn →d X, if, for all x at which P(X ≤ x) is continuous, as n → ∞,
P(Xn ≤ x) → P(X ≤ x).

2.11.1 Properties
• Convergence in probability implies convergence in distribution.
• Convergence in distribution does not imply convergence in probability (unless the convergence in
  distribution is to a constant).
• If
  1. Xn →d b
  2. b is a constant
  then
  Xn →p b.
• If
  1. Xn →d X
  2. Xn − Yn →p 0 (i.e. Yn is asymptotically equivalent to Xn)
  then
  Yn →d X.
• Marginal convergence in distribution does not imply joint convergence in distribution (unless one of
  the marginal convergences in distribution is to a constant).
• If
  1. Xn →d X
  2. Yn →d b
  3. b is a constant,
  then
  (Xn, Yn) →d (X, b).

2.12 Continuous Mapping Theorem (CMT) for convergence in distribution

Let {Xn : n ≥ 1} and X be random vectors on R^k. Suppose that g : R^k → R^d is continuous at each point in
the set C such that P(X ∈ C) = 1. Then,
Xn →d X ⇒ g(Xn) →d g(X).

2.13 Slutsky's Lemma
Let {Xn : n ≥ 1}, {Yn : n ≥ 1} and X be random vectors, and c ∈ R^k a constant. If
1. Xn →d X
2. Yn →d c
then
Xn′Yn →d X′c
and
Xn + Yn →d X + c.

2.14 Cramér-Wold Device

Let {Xn : n ≥ 1} and X be random vectors. Then,
Xn →d X ⇔ t′Xn →d t′X, ∀t ∈ R^k.

2.15 Central Limit Theorem (CLT)

2.15.1 Univariate CLT:
If
1. X1, X2, . . . , Xn are iid ∼ P on R
2. σ²(P) < ∞
Then,
√n (X̄n − µ(P)) →d N(0, σ²(P)).

2.15.2 Multivariate CLT:

It follows straightforwardly by using the Cramér-Wold Device to generalise the univariate central limit theorem. If
1. X1, X2, . . . , Xn are iid ∼ P on R^k
2. Σ(P) < ∞ (i.e. every entry in the variance-covariance matrix is finite)
Then,
√n (X̄n − µ(P)) →d N(0, Σ(P)).
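
A minimal simulation sketch of the univariate CLT (illustrative only; the Uniform(0, 1) population, so that µ(P) = 1/2 and σ²(P) = 1/12, is an assumed example):

import numpy as np

rng = np.random.default_rng(1)
n, reps = 500, 20_000
mu, sigma2 = 0.5, 1.0 / 12.0          # mean and variance of Uniform(0, 1)

# Compute sqrt(n)*(X̄n - mu) across many replications; its sample variance
# should be close to sigma^2(P), as the CLT predicts.
draws = rng.uniform(size=(reps, n))
z = np.sqrt(n) * (draws.mean(axis=1) - mu)
print("simulated variance:", z.var(), "theoretical sigma^2(P):", sigma2)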

2.16 Delta Method

2.16.1 For one variable:
Let {Xn : n ≥ 1} and X be random variables, c a constant, and τn a sequence of constants such that τn → ∞
and τn(Xn − c) →d X. Let g(x) be a function that is continuous and differentiable at c. Denote by gx(c) the
derivative of g(x) evaluated at c. Then,
τn (g(Xn) − g(c)) →d gx(c) X.
In particular, if X ∼ N(0, σ²(P)), then
τn (g(Xn) − g(c)) →d N(0, gx(c)² σ²(P)).

2.16.2 Multivariate version:
Let {Xn : n ≥ 1} and X be random vectors, c ∈ R^k a constant, and τn a sequence of constants such that
τn → ∞ and τn(Xn − c) →d X. Let g : R^k → R^d be a function that is continuous and differentiable at c.
Denote by Dg(c) the d × k matrix of partial derivatives of g evaluated at c. Then,
τn (g(Xn) − g(c)) →d Dg(c) X.
In particular, if X ∼ N(0, Σ), then
τn (g(Xn) − g(c)) →d N(0, Dg(c) Σ Dg(c)′).

2.17 Example 1: Constructing an estimator of σ²(P)

Let X1, X2, . . . , Xn be iid ∼ P on R. Suppose that E[Xi²] < ∞. We wish to construct an estimator of
σ²(P) := Var(Xi).
• A nice way to come up with an estimator is to try to apply the WLLN. Thus, as a starting point, it is good
  to locate E[Xi] so we can use X̄n.
• Recall
  Var(Xi) = E[Xi²] − (E[Xi])².
• Then, a natural estimator is:
  s²n := (1/n) Σ_{i=1}^n Xi² − ( (1/n) Σ_{i=1}^n Xi )².
• Is it consistent?
• Since E[Xi²] < ∞, then E[Xi] < ∞. Then, since Xi is iid for i = 1, . . . , n and E[Xi] < ∞, as n → ∞,
  by the WLLN:
  X̄n →p E[X].
  Also, since Xi² is iid for i = 1, . . . , n and E[Xi²] < ∞, as n → ∞, by the WLLN:
  (1/n) Σ_{i=1}^n Xi² →p E[Xi²].
• We may write s²n as a function g, parameterised as
  s²n = g( (1/n) Σ_{i=1}^n Xi², X̄n ),
  where
  g(a, b) = a − b².
• Since g(·) is continuous, by the CMT,
  s²n →p E[Xi²] − E[X]² = σ²(P).
• This shows that s²n is a consistent estimator of σ²(P).
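
A quick numerical check of the estimator above (a sketch; the N(1, 4) population is an assumed example):

import numpy as np

rng = np.random.default_rng(2)
true_var = 4.0

for n in (100, 10_000, 1_000_000):
    x = rng.normal(loc=1.0, scale=2.0, size=n)     # iid sample with Var(X) = 4
    s2 = np.mean(x**2) - np.mean(x)**2             # s_n^2 = (1/n)ΣX_i^2 - (X̄_n)^2
    print(n, s2, "->", true_var)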

2.18 Example 2: Applying Delta Method: univariate
Suppose X1, X2, . . . , Xn are iid ∼ P = Bernoulli(q), with q ∈ (0, 1). Then, by the Central Limit Theorem,
√n (X̄n − q) →d N(0, g(q)),
where g(q) = q(1 − q). A natural estimator for g(q) is g(X̄n). Since Dg(q) = 1 − 2q, the Delta Method
implies that
√n (g(X̄n) − g(q)) →d Dg(q) N(0, g(q)) = N(0, (1 − 2q)² q(1 − q)).
When q = 1/2, we have that
√n (g(X̄n) − g(q)) →d N(0, 0) = 0.
That is, √n (g(X̄n) − g(q)) →p 0. We get a degenerate limit distribution because the Delta Method considers
a Taylor expansion only up to the first order.
To get a non-degenerate limiting distribution, we use a second-order Taylor expansion, which gives us:
g(X̄n) − g(q) = Dg(q)(X̄n − q) + (D²g(q)/2)(X̄n − q)² + R(X̄n − q),
where R(0) = 0 and R(h) = o(h²); i.e. as h → 0, R(h)/h² → 0.
Since D²g(q) = −2, when q = 1/2, the above simplifies to
g(X̄n) − g(q) = −(X̄n − q)² + R(X̄n − q).
Multiplying both sides by n, we obtain
n (g(X̄n) − g(q)) = −n (X̄n − q)² + n R(X̄n − q).
Consider the first term on the right-hand side. Rearranging and using the Continuous Mapping Theorem
gives that:
−n (X̄n − q)² = −(√n (X̄n − q))² →d −( N(0, 1/4) )² = −( (1/2) N(0, 1) )² = −(1/4) χ²₁.
We can also write the second term as
n R(X̄n − q) = n (X̄n − q)² · b(X̄n − q),
where b(h) := R(h)/h² for h ≠ 0 and b(0) := 0, so that b is continuous at X̄n = q. Recall that as h → 0,
R(h)/h² → 0, and the fact that X̄n →p q. Thus, we know that
b(X̄n − q) →p 0 ⇒ b(X̄n − q) →d 0.
And, from before, we have already shown that
n (X̄n − q)² →d (1/4) χ²₁.
Therefore, combining the two terms via Slutsky's Lemma, when q = 1/2,
n (g(X̄n) − g(q)) →d −(1/4) χ²₁.
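
A simulation sketch of both cases above (illustrative; q = 0.3 for the regular case and q = 1/2 for the degenerate one are assumed values):

import numpy as np

rng = np.random.default_rng(3)
n, reps = 2_000, 20_000

def g(p):
    return p * (1.0 - p)

for q in (0.3, 0.5):
    xbar = rng.binomial(n, q, size=reps) / n
    # Regular case: sqrt(n)*(g(X̄n) - g(q)) is approximately N(0, (1-2q)^2 q(1-q)).
    print(q, "var of sqrt(n)*(g(xbar)-g(q)):",
          (np.sqrt(n) * (g(xbar) - g(q))).var(),
          "theory:", (1 - 2 * q) ** 2 * q * (1 - q))
    if q == 0.5:
        # Degenerate case: n*(g(X̄n) - g(q)) is approximately -(1/4)*chi2_1, with mean -1/4.
        print("mean of n*(g(xbar)-g(q)):", (n * (g(xbar) - g(q))).mean(), "theory: -0.25")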

2.19 Example 3: Applying Delta Method: multivariate
Let X1, . . . , Xn be an iid sequence of random variables with distribution Exp(λ), λ > 0. (Here, Exp(λ) denotes
the exponential distribution with parameter λ, whose pdf is given by
f(x) = λ exp(−λx) if x ≥ 0, and f(x) = 0 if x < 0.
In particular, the mean and the variance of such a random variable are 1/λ and 1/λ², respectively.) Inde-
pendently of X1, . . . , Xn, let Y1, . . . , Yn be an iid sequence of random variables with distribution Exp(µ),
µ > 0.

Deriving the limiting distribution for log(Ȳn/X̄n):
As always, write out the expression:
log(Ȳn/X̄n) = log(Ȳn) − log(X̄n).
Since we will be splicing things together, we should work in a multivariate setting. Since the Xi's and
Yi's are iid with finite variances, and Xi ⊥ Yi, by the multivariate CLT,
√n ( (X̄n, Ȳn)′ − (1/λ, 1/µ)′ ) →d N( 0, diag(1/λ², 1/µ²) ).
Let g(x, y) = log(y/x), which is continuous and differentiable if x ≠ 0 and y/x > 0. Note that
Dg(x, y) = ( (x/y) ∂(y/x)/∂x , (x/y) ∂(y/x)/∂y ) = ( −1/x , 1/y ).
Then, by the Delta Method, evaluating Dg at (1/λ, 1/µ),
√n ( log(Ȳn/X̄n) − log( (1/µ)/(1/λ) ) ) →d N( 0, (−λ, µ) diag(1/λ², 1/µ²) (−λ, µ)′ ) = N(0, 2).

2.20 Constructing a confidence set


A confidence set/region of level 1 − α for µ(P), denoted Cn := Cn(X1, X2, . . . , Xn), is such that the
probability that the true mean is contained in the set is at least 1 − α, α ∈ (0, 1); i.e.

P(µ(P) ∈ Cn) ≥ 1 − α.

Cn would then be the confidence set of level 1 − α for µ(P). For example, if α = 5%, then Cn is a random
interval that contains the true mean with probability at least 95%.
We can construct a confidence set either relying or not relying on asymptotics.

2.20.1 Not relying on Asymptotics


We can use Markov's inequality to construct confidence intervals since:
• Markov's inequality is:
  P(|X| > ε) ≤ E[|X|^q] / ε^q,  ∀q, ε > 0.
• Using X̄n − E[X] as the argument with q = 2:
  P(|X̄n − µ(P)| > ε) ≤ E[(X̄n − µ(P))²] / ε² = Var[X̄n] / ε².
• Then, by construction,
  P(|X̄n − µ(P)| ≤ ε) ≥ 1 − Var[X̄n] / ε².
• If we could somehow choose ε = ε̄ so that
  1 − Var[X̄n] / ε̄² = 1 − α,
• then
  P(|X̄n − µ(P)| ≤ ε̄) ≥ 1 − α.
• That is,
  P(X̄n − ε̄ ≤ µ(P) ≤ X̄n + ε̄) ≥ 1 − α,
• which is equivalent to:
  P(µ(P) ∈ Cn) ≥ 1 − α
  with
  Cn := [X̄n − ε̄, X̄n + ε̄].
A recipe that almost always works is the following:
1. Write down Markov's inequality for X̄n − E[X] with q = 2.
2. Find an upper bound for the RHS of Markov's inequality such that the unknown parameters disappear.
3. Equate that upper bound to α.
4. Using the expression obtained above, solve for ε as a function of α, say ε = g(α).
5. Then, we can define the confidence set as
   Cn := [X̄n − g(α), X̄n + g(α)].
Equivalently, we can write the confidence set as
   Cn := { x ∈ R : |X̄n − x| ≤ g(α) }.

2.20.2 Relying on Asymptotics


A confidence set/region of level 1 − α for g(P), denoted Cn := Cn(X1, X2, . . . , Xn), is such that the proba-
bility that the true parameter g(P) is contained in the set is at least 1 − α, α ∈ (0, 1); i.e.

P(g(P) ∈ Cn) ≥ 1 − α.

Below, we list the steps to construct a confidence set/region using the Central Limit Theorem, which relies on
asymptotic properties.
1. Construct a consistent estimator for g(P). (Hint: use the WLLN and CMT.)
2. Find the limiting distribution of ĝ(P). (Hint: use the CLT, MCLT, Delta Method.)
3. Derive a standard distribution. (Hint: use Slutsky's Lemma.)
4. Then, if the standard distribution is the standard normal one, we can write the confidence set as
   Cn := { x ∈ R : |√n (ĝ(P) − x) / s_{g(P)}| ≤ z_{1−α/2} }.
   Equivalently, we can write the confidence set as
   cn := z_{1−α/2} s_{g(P)} / √n,
   and
   Cn := [ĝ(P) − cn, ĝ(P) + cn].

2.20.3 Coverage Property


The coverage property to be shown is that P(µ(P) ∈ Cn) → 1 − α as n → ∞.

2.21 Example 4: Constructing confidence sets with Markov’s inequality


Suppose that X1, X2, . . . , Xn are iid copies of X, where X has the CDF P = Bernoulli(r) with r ∈ (0, 1).
Thus, the mean and the variance of the Bernoulli distribution are given by µ(P) = r and σ²(P) = r(1 − r),
respectively.
Let α ∈ (0, 1). We would like to construct a nontrivial (random) set Cn := Cn(X1, X2, . . . , Xn) such that
the probability that Cn contains the true mean is at least 1 − α; i.e.

P(µ(P) ∈ Cn) ≥ 1 − α.
• We will use the recipe we learned in Section 2.20.1.
• 1. Write down Markov's inequality for X̄n − E[X] with q = 2:
  P(|X̄n − µ(P)| > ε) ≤ E[(X̄n − µ(P))²] / ε² = Var[X̄n] / ε²
  = (1/n²) Var[Σ_{i=1}^n Xi] / ε²
  = (1/n²) ( Σ_{i=1}^n Var[X] + 2 Σ_{i<j} Cov[Xi, Xj] ) / ε²
  = (1/n²) n Var(X) / ε²   (the covariance terms are zero under iid sampling)
  = r(1 − r) / (n ε²).
• 2. Find an upper bound for the RHS of Markov's inequality such that the unknown parameters
  disappear:
  Note that r(1 − r) is maximised at 1/4 when r = 1/2. Thus,
  P(|X̄n − µ(P)| > ε) ≤ r(1 − r) / (n ε²) ≤ 1 / (4 n ε²), ∀r.
• 3. Equate that upper bound to α:
  α = 1 / (4 n ε²).
• 4. Solve for ε as a function of α:
  ε = 1 / √(4 α n).
• 5. Then, we can define the confidence set as
  Cn := [ X̄n − 1/√(4 α n), X̄n + 1/√(4 α n) ].
  Equivalently, we can write the confidence set as
  Cn := { x ∈ R : |X̄n − x| ≤ 1/√(4 α n) }.
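
A short sketch of this construction in code (illustrative; the simulated Bernoulli(0.4) sample is an assumption for the example):

import numpy as np

rng = np.random.default_rng(4)
n, alpha = 400, 0.05
x = rng.binomial(1, 0.4, size=n)          # iid Bernoulli sample (r = 0.4 assumed)

xbar = x.mean()
eps = 1.0 / np.sqrt(4 * alpha * n)        # epsilon = 1/sqrt(4*alpha*n) from the recipe
print("Markov-based confidence set:", (xbar - eps, xbar + eps))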

2.22 Example 5: Constructing confidence sets using (M)CLT


We have previously constructed a confidence set for Bernoulli-distributed iid random variables using Markov's
Inequality (see Example 4), which did not rely on asymptotics. Below, we construct a confidence set/region
using the Central Limit Theorem, which relies on asymptotic properties.
Let X1, X2, . . . , Xn be iid ∼ P = Bernoulli(q), where q ∈ (0, 1). Let α be given. We wish to construct a
confidence region for µ(P) = q at level 1 − α.
• Step 1: In this case, g(P) = µ(P) = E(X). Thus, by the WLLN:
  X̄n →p µ(P) = q.
• Step 2: By the CLT:
  √n (X̄n − µ(P)) →d N(0, σ²(P)).
• Step 3: Since σ²(P) = q(1 − q), a natural estimator for σ²(P) is
  s²n = X̄n (1 − X̄n).
  Thus, we can write s²n as a function f(·), parameterised as s²n = f(X̄n), where f(a) = a(1 − a). Since
  f(·) is continuous, by the Continuous Mapping Theorem and the WLLN:
  s²n →p σ²(P).
  Since σ²(P) > 0, we can define a function g(·), parameterised as g(a) = 1/√a. Since g(·) is continuous, by
  the Continuous Mapping Theorem and the WLLN:
  g(s²n) = 1/sn →p g(σ²(P)) = 1/σ(P).
  Then, by Slutsky's Lemma,
  √n (X̄n − µ(P)) / √(s²n) →d N(0, 1).
  That is,
  √n (X̄n − µ(P)) / √( X̄n (1 − X̄n) ) →d N(0, 1).
• Step 4: Define
  cn := z_{1−α/2} √( X̄n (1 − X̄n) ) / √n,
  and
  Cn := [X̄n − cn, X̄n + cn].
  We can write the confidence region in the following equivalent way:
  Cn := { x ∈ R : |√n (X̄n − x)| / √( X̄n (1 − X̄n) ) ≤ z_{1−α/2} }.
Stating and verifying the coverage property that your confidence region satisfies:
• The coverage property is that P(µ(P) ∈ Cn) → 1 − α.
• To verify it,
  P(µ(P) ∈ Cn) = P(X̄n − cn ≤ µ(P) ≤ X̄n + cn)
  = P(|X̄n − µ(P)| ≤ cn)
  = P( |X̄n − µ(P)| ≤ z_{1−α/2} √( X̄n (1 − X̄n) ) / √n )
  = P( |√n (X̄n − µ(P))| / √( X̄n (1 − X̄n) ) ≤ z_{1−α/2} )
  → P(|Z| ≤ z_{1−α/2})
  = P(z_{α/2} ≤ Z ≤ z_{1−α/2})
  = Φ(z_{1−α/2}) − Φ(z_{α/2})
  = (1 − α/2) − α/2 = 1 − α.

2.23 Hypothesis testing


We wish to apply the concept we developed above to test hypotheses about µ(P ). For example, we wish to
test the null hypothesis,
H0 : µ(P ) ≤ 0,
against the alternative hypothesis,
H1 : µ(P ) > 0.
There are two types of errors:
1. Type I error: rejecting H0 when it is true (i.e. a false rejection/false positive).
2. Type II error: not rejecting H0 when it is false (i.e. a false negative).
Generally, it is not possible to minimise both the type I and type II error probabilities simultaneously (the
type II error probability, moreover, depends on which distribution satisfying H1 is considered).

2.23.1 Consistency in level
• We restrict attention to tests of the form:
  ϕn = 1{Tn > cn},
  where Tn is the test statistic, a function of the data such that "large" values provide evidence
  against H0; and cn is the critical value, which gives the definition of "large".
• E[ϕn] is the probability of rejecting H0, called the power function of the test.
• For µ(P) satisfying H0, E[ϕn] is the probability of a type I error.
• For µ(P) satisfying H1, 1 − E[ϕn] is the probability of a type II error.
We say that a test ϕn = ϕn(X1, X2, . . . , Xn) ∈ [0, 1] is consistent in level if
lim sup_{n→∞} E[ϕn] ≤ α,
for µ(P) satisfying H0, where α ∈ (0, 1) is the significance level of the test.
• Therefore, the requirement of consistency in level is given by
  lim sup_{n→∞} E[ϕn] = lim sup_{n→∞} E[1{Tn > cn}] = lim sup_{n→∞} P(Tn > cn) ≤ α
  for µ(P) satisfying H0.

2.23.2 p-value of a test


The p-value of a test is the smallest value of α for which we reject the null hypothesis

p̂n := inf {α ∈ (0, 1) : Tn > cn } ,

where cn depends on α.

2.23.3 Steps
A recipe that almost always works is the following:
1. Express the parameter in H0 as a function of expectations.
2. Find a consistent estimator. (Hint: CMT, WLLN.)
3. Find the limiting distribution of the estimator. (Hint: CLT, MCLT, Delta Method.)
4. Standardize the distribution under the null. (Hint: Slutsky's Lemma.)
5. Construct the test ϕn according to H1, i.e. taking into account whether the alternative is "greater
   than", "less than", or "not equal to".

2.24 Example 6: Test statistic for the sample mean


Suppose that X1, X2, . . . , Xn are iid ∼ P on R and that σ²(P) < ∞. Construct an estimator for the mean.
Consider the following null and alternative hypotheses,

H0 : µ(P) = 0
H1 : µ(P) ≠ 0.

Construct a test statistic to test the above.
• First, we need to come up with an estimator for what is being tested under the null.
• As in Example 1, we try to use the WLLN to construct a consistent estimator. In this case it is
  straightforward since we are testing the mean.
• Then, by the WLLN the consistent estimator is X̄n.
• Now, we need to find a limiting distribution for the estimator and try to come up with a standard
  distribution. For this, we try to use the CLT. In this case, it is straightforward since it is only the
  estimator of the mean.
• By the CLT,
  √n (X̄n − µ(P)) →d N(0, σ²(P)).
• We are close to getting a standard normal distribution, but we need to take out σ²(P). For this, we will
  use Slutsky's Lemma.
• We know (from Example 1) that s²n converges in probability to σ²(P), which implies that s²n →d σ²(P).
  In particular, we use the fact that f(x) = 1/√x is a continuous function when x > 0. Then, by the CMT,
  1/sn →p 1/σ(P), which implies convergence in distribution. Thus, we have that
  1. √n (X̄n − µ(P)) →d N(0, σ²(P))
  2. 1/sn →d 1/σ(P)
  Therefore, by Slutsky's Lemma:
  (1/sn) √n (X̄n − µ(P)) →d (1/σ(P)) N(0, σ²(P)).
  That is,
  √n (X̄n − µ(P)) / sn →d N(0, 1).
• Now, under the null (µ(P) = 0):
  √n X̄n / sn →d N(0, 1).
• To conduct the test at level α, the critical value is based on the standard normal distribution. But
  since the test is two-sided, we would reject whenever we see that Tn is either too low or too high. So,
  with significance level α,
  cn := Φ⁻¹(1 − α/2) =: z_{1−α/2},
  where Φ is the CDF of N(0, 1). Note that Φ(x) = P(Z ≤ x) and we want to choose cn such that the
  probability of Z being less than cn is 1 − α/2 (and the probability of Z being greater than cn is α/2);
  i.e. cn is the (1 − α/2)-th quantile of the standard normal distribution.
• So a natural candidate for a test statistic is
  Tn := √n X̄n / sn,
  and the test rejects when |Tn| > z_{1−α/2}.
• Showing it is consistent in level: That is, we wish to show that
  lim sup_{n→∞} E[ϕn] ≤ α
  under the null. Notice that
  E[ϕn] = P( |√n X̄n / sn| > z_{1−α/2} ).
  We also have that
  √n X̄n / sn →d N(0, 1).
  Noting that |·| is a continuous operation, by the CMT,
  |√n X̄n / sn| →d |N(0, 1)|.
  Then, by the definition of convergence in distribution, as n → ∞:
  P( |√n X̄n / sn| ≥ z_{1−α/2} ) → P(|Z| ≥ z_{1−α/2}).
  Thus, by taking the limsup of both sides, we obtain
  lim sup_{n→∞} E_P[ϕn] ≤ P(|Z| ≥ z_{1−α/2})
  = P(Z < z_{α/2}) + P(Z > z_{1−α/2})
  = α/2 + α/2 = α.
  That is, the test is consistent in level α.
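
A compact sketch of this test in code (illustrative; the N(0.1, 1) sample is an assumption chosen so that the null is actually false):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n, alpha = 200, 0.05
x = rng.normal(loc=0.1, scale=1.0, size=n)        # iid sample; true mean 0.1

xbar, sn = x.mean(), x.std(ddof=0)                # X̄n and s_n (the analog estimator)
Tn = np.sqrt(n) * xbar / sn                       # T_n = sqrt(n) X̄n / s_n
reject = abs(Tn) > norm.ppf(1 - alpha / 2)        # reject when |T_n| > z_{1-alpha/2}
pval = 2 * (1 - norm.cdf(abs(Tn)))                # two-sided p-value
print(Tn, reject, pval)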

2.25 Testing multidimensional hypothesis


Let X1, X2, . . . , Xn be iid ∼ P on R^k with Σ(P) < ∞ being the k × k variance-covariance matrix. Note that
g(P) is a k × 1 vector in this context. The test is

H0 : g(P) = 0,
H1 : g(P) ≠ 0.

The steps described in Section 2.23.3 still apply, but we have to take into consideration the fact that we
are now dealing with a variance-covariance matrix instead of a single parameter, which has consequences
when standardizing the limiting distribution of ĝ(P).

2.26 Example 7: Testing multidimensional hypothesis


Let X1, X2, . . . , Xn be iid ∼ P on R^k with Σ(P) < ∞ being the k × k variance-covariance matrix. Note that
µ(P) is a k × 1 vector in this context. The test is

H0 : µ(P) = 0,
H1 : µ(P) ≠ 0.

From the CLT, since the Xi's are iid,
√n (X̄n − µ(P)) →d z ∼ N(0, Σ(P)).
A useful fact to remember is that if Σ(P) is invertible, and z is a k × 1 vector, then
z ∼ N(0, Σ(P)) ⇒ z′ Σ⁻¹(P) z ∼ χ²_k.
In this case, we therefore have that
n (X̄n − µ(P))′ Σ⁻¹(P) (X̄n − µ(P)) →d χ²_k.
Although it appears that we may use this to construct a test statistic, notice that we do not in fact know
Σ(P). However, we can estimate Σ(P). Recall that:
Σ(P) = E[XX′] − E[X]E[X′].
Thus, a natural estimator is:
Σ̂n = (1/n) Σ_{i=1}^n Xi Xi′ − ( (1/n) Σ_{i=1}^n Xi ) ( (1/n) Σ_{i=1}^n Xi′ ).
By the WLLN,
(1/n) Σ_{i=1}^n Xi Xi′ →p E[XX′],
X̄n →p E[X],
(1/n) Σ_{i=1}^n Xi′ →p E[X′].
We can express Σ̂n as a continuous function g(a, b, c) = a − bc of these averages. Since we are given that
Σ(P) < ∞, which also implies that E[|X|] < ∞, by the CMT:
Σ̂n →p Σ(P).
Then, since we have assumed that Σ(P) is invertible, by the Continuous Mapping Theorem, we have that
Σ̂n⁻¹ →p Σ⁻¹(P).
Thus, by Slutsky's Lemma:
n (X̄n − µ(P))′ Σ̂n⁻¹ (X̄n − µ(P)) →d χ²_k.
Under the null:
Tn := n X̄n′ Σ̂n⁻¹ X̄n →d χ²_k.
Then, the test is given by:
ϕn = 1{Tn > cn}
with
Tn := n X̄n′ Σ̂n⁻¹ X̄n,
cn := c_{k,1−α},
where c_{k,1−α} is the (1 − α)-th quantile of χ²_k. Notice that we are not using |·| because we are dealing with a
χ²_k distribution and not with a standard normal one.
To show that the test is consistent in level,
lim sup_{n→∞} E_P[ϕn] = lim sup_{n→∞} P( n X̄n′ Σ̂n⁻¹ X̄n > c_{k,1−α} )
≤ P(T ≥ c_{k,1−α}) = 1 − (1 − α) = α,
where T ∼ χ²_k.
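
A sketch of the multivariate test in code (illustrative; a bivariate normal sample with true mean zero is assumed, so H0 holds):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n, k, alpha = 500, 2, 0.05
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.3], [0.3, 2.0]], size=n)

xbar = x.mean(axis=0)
Sigma_hat = (x.T @ x) / n - np.outer(xbar, xbar)       # Σ̂_n = (1/n)ΣX_iX_i' - X̄_n X̄_n'
Tn = n * xbar @ np.linalg.inv(Sigma_hat) @ xbar        # T_n = n X̄_n' Σ̂_n^{-1} X̄_n
print(Tn, "reject H0:", Tn > chi2.ppf(1 - alpha, df=k))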

3 Conditional expectations
Let (Y, X) be random variables such that Y ∈ R, X ∈ R.

3.1 Expected Value Rule
E[X] = Σ_x x pX(x)

3.2 Expected Value Rule for conditional expectation


3.2.1 Given an event A
E[X | A] = Σ_x x pX|A(x)

3.2.2 Given that a random variable Y takes on a specific value

E[X | Y = y] = Σ_x x pX|Y(x | y)

3.2.3 As a random variable


Notice that from the above we have:
E[X | Y = y] = g(y).
This is a fixed value. However, if we allow y to range over the possible values that Y can take, then the
conditional expectation of the random variable X given the random variable Y is not a fixed number but a
random variable:
E[X | Y] = g(Y).
Notice that since g(Y) is a random variable, it has its own distribution, mean, variance, etc.

3.3 The mean of E[X | Y ]: Law of Iterated Expectations (LIE)


E[E[X | Y]] = E[g(Y)]
E[E[X | Y]] = Σ_y g(y) P_Y(y)
E[E[X | Y]] = Σ_y E[X | Y = y] P_Y(y)
By the Total Expectation Theorem:
E[X] = Σ_y E[X | Y = y] P_Y(y)

Then, the LIE states that:

E[E[X | Y]] = E[X]

3.4 Properties of conditional expectations


i If Y = f (X), then E[Y | X] = f (X).
ii E[Y + X | Z] = E[Y | Z] + E[X | Z].
iii E[f (X)Y | X] = f (X)E[Y | X].
iv If P(Y ≥ 0) = 1, then P(E[Y | X] ≥ 0) = 1.
v Independence: If X ⊥ Y , then E[Y | X] = E[Y ]
vi Mean independence: If E[Y | X] = c (i.e. Y is mean independent of X), then, by the Law of Iterated
Expectations,
E[E[Y | X]] = E[Y] = E[c] = c.

3.5 About Independence
Note that independence implies mean independence, which in turn, implies uncorrelatedness. The converse
statements do not hold.

3.5.1 Mean independence ⇏ independence


Suppose
Y | X ∼ N(0, σ²X).
Then E[Y | X] = 0, so Y is mean independent of X. However, Y and X are clearly not independent.

3.5.2 Mean independence ⇒ zero covariance


Suppose E[Y | X] = c. Then

Cov[X, Y] = E[XY] − E[X]E[Y]
(LIE) = E[E[XY | X]] − E[X]E[E[Y | X]]
= E[X E[Y | X]] − E[X]E[E[Y | X]]
= E[Xc] − E[X]E[c]
= cE[X] − cE[X] = 0.

3.5.3 Uncorrelatedness ⇏ mean independence


Suppose X ∼ N(0, 1) and Y = X². Then,

Cov[X, Y] = E[X³] − E[X]E[X²] = 0 − 0 = 0.

However, E[Y | X] = E[X² | X] = X².
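
A quick simulation of this example (a sketch; sample-based, so the covariance is only approximately zero):

import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=1_000_000)
y = x ** 2

# Cov[X, Y] is approximately 0, yet E[Y | X] = X^2 clearly varies with X.
print("sample Cov[X, Y]:", np.cov(x, y)[0, 1])
print("E[Y | X > 1] vs E[Y | X <= 1]:", y[x > 1].mean(), y[x <= 1].mean())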

4 Linear Regressions
Let (Y, X, u) be random variables/vectors where Y, u ∈ R and X ∈ R^{k+1}, such that
X = (1, X1, . . . , Xk)′,
β = (β0, β1, . . . , βk)′,
and
Y = X′β + u.

4.1 Three interpretations of linear regressions


4.1.1 Linear Conditional Expectation
• E[Y | X] = E[X′ β | X] + E[u | X].
• Then,
E[Y | X] = X′ β + E[u | X]

• Assume E[u | X] = 0
• Then,
E[Y | X] = X′ β

• The parameter β has no causal interpretation; it is simply a convenient way of summarising a feature
  of the joint distribution of Y and X.

4.1.2 "Best" Linear Approximation to Conditional Expectation / "Best" Linear Predictor of Y | X
Define u := Y − X′β. For the "Best" Linear Approximation to the Conditional Expectation interpretation,
consider
min_{b ∈ R^{k+1}} E[ (E[Y | X] − X′b)² ].
For the "Best" Linear Predictor of Y | X interpretation, consider
min_{b ∈ R^{k+1}} E[ (Y − X′b)² ].
Notice that any solution to the first minimization problem is a solution to the second one (and vice versa).
Defining v := E[Y | X] − Y,
E[ (E[Y | X] − X′b)² ] = E[ (v + Y − X′b)² ]
= E[v²] + 2E[v(Y − X′b)] + E[ (Y − X′b)² ]
= E[v²] + 2E[vY] − 2E[vX′b] + E[ (Y − X′b)² ].
Notice that
E[vX′b] = E[(E[Y | X] − Y)X′b]
= E[E[Y | X]X′b] − E[YX′b]
(LIE) = E[YX′b] − E[YX′b]
= 0.
Then,
E[ (E[Y | X] − X′b)² ] = E[v²] + 2E[vY] + E[ (Y − X′b)² ].
Notice that the first two terms of the above equation are constants with respect to b. Hence, minimising the
first problem with respect to b must give the same solution as minimising the second one with respect to b.
Let β be a solution to either of the minimization problems. Define u := Y − X′β; then we can interpret
Y = X′β + u as either the "Best" Linear Approximation to the Conditional Expectation or the "Best" Linear
Predictor of Y | X, which we denote as BLP(Y | X).
As before, no causal interpretation can be drawn from β. To derive properties of u, consider the first-order
condition of the minimization problem:
0 = 2E[X(Y − X′β)] ⇒ E[Xu] = 0.
4.1.3 Causal model interpretation


Assume that Y = g(X, u), where X represents observed determinants of Y , and u represents the unobserved
determinants of Y .
In other words, the assumption says that if we are given X and u, then we can compute Y .
Then, the effect of a unit change in Xi on Y , holding X−i and u constant is ∂g/∂Xi . Thus, if we assume
that g(X, u) = X′ β + u, then
∂g(X, u)
= βi
∂Xi
However, nothing has been said about the relationship between the observed and unobserved determinants
of Y, e.g. about E[Xi u] or E[u | X].

4.2 Linear Regression when E[Xu] = 0


Define (Y, X, u) and β as above and assume that:
• E[Y²] < ∞
• E[XX′] exists
• There is no perfect collinearity in X, so that E[XX′]⁻¹ exists. X is said to be perfectly collinear if there is
  c ≠ 0, c ∈ R^{k+1}, such that P(c′X = 0) = 1.
• E[Xu] = 0
4.2.1 Solving for β
Using the fact that E[Xu] = 0:
0 = E[Xu]
= E[X(Y − X′β)]
= E[XY] − E[XX′]β
⇒ β = E[XX′]⁻¹ E[XY].
What happens if E[XX′] is not invertible? This means that there exist multiple solutions for β.
Although having multiple solutions is not a problem under the first and second interpretations, it does
matter for the last interpretation; i.e. multiple solutions are a problem in cases where we are interested in
causal relationships.

4.2.2 Solving for subvectors of β


Suppose we partition X into subvectors (X1)_{k1×1} and (X2)_{k2×1}, and β accordingly into (β1)_{k1×1} and
(β2)_{k2×1}. Then,
Y = X′β + u = X1′β1 + X2′β2 + u.
We also have that, by assumption,
E[Xu] = 0 ⇒ E[X1u] = 0_{k1×1} and E[X2u] = 0_{k2×1}.

One way to solve this is:
β = E[XX′]⁻¹ E[XY],
i.e.
(β1 ; β2) = ( E[X1X1′]  E[X1X2′] ; E[X2X1′]  E[X2X2′] )⁻¹ ( E[X1Y] ; E[X2Y] ).

Is there a way to find β1 without finding β2?

Frisch-Waugh Theorem for the population:
First, define
Ỹ := Y − BLP(Y | X2) = Y − X2′γ,
X̃1 := X1 − BLP(X1 | X2),
where BLP(X1 | X2) collects the best linear predictors of the components of X1 given X2.
Then β1 = β̃ can be found from
Ỹ = X̃1′β̃ + ũ,
where, since E[X̃1ũ] = 0:
β̃ = E[X̃1X̃1′]⁻¹ E[X̃1Ỹ] = Var[X̃1]⁻¹ Cov[X̃1, Ỹ].

Proof:
First note that since E[XX′] is invertible by assumption, and X̃1 is a linear combination of X, E[X̃1X̃1′] is
invertible.
β̃ = E[X̃1X̃1′]⁻¹ E[X̃1Ỹ]
= E[X̃1X̃1′]⁻¹ E[X̃1(Y − BLP(Y | X2))]
= E[X̃1X̃1′]⁻¹ ( E[X̃1Y] − E[X̃1 BLP(Y | X2)] ).
Since X̃1 is the residual from regressing X1 on X2, X̃1 is orthogonal to X2:
E[X̃1 BLP(Y | X2)] = E[X̃1(X2′γ)] = E[X̃1X2′]γ = 0_{k1×1}.
Hence,
β̃ = E[X̃1X̃1′]⁻¹ E[X̃1Y]
= E[X̃1X̃1′]⁻¹ E[X̃1(X1′β1 + X2′β2 + u)]
= E[X̃1X̃1′]⁻¹ ( E[X̃1X1′]β1 + E[X̃1X2′]β2 + E[X̃1u] ).
Notice that E[X̃1X2′]β2 = 0 since E[X̃1X2′] = 0_{k1×k2}.
Now, let us analyze E[X̃1u]:
E[X̃1u] = E[(X1 − BLP(X1 | X2))u] = E[X1u] − E[BLP(X1 | X2)u] = 0,
since BLP(X1 | X2) is a linear combination of X2 and E[X2u] = 0.
Thus,
β̃ = E[X̃1X̃1′]⁻¹ E[X̃1X1′]β1
= E[X̃1X̃1′]⁻¹ E[X̃1(X̃1 + BLP(X1 | X2))′]β1
= β1 + E[X̃1X̃1′]⁻¹ E[X̃1 BLP(X1 | X2)′]β1.
Notice that every entry of the matrix E[X̃1 BLP(X1 | X2)′] is the expectation of a component X̃1i times a
linear combination of the components of X2, and E[X̃1i X2j] = 0 for all i = 1, 2, . . . , k1 and j = 1, 2, . . . , k2.
Hence E[X̃1 BLP(X1 | X2)′] = 0, and therefore

β̃ = β1.

Notice that we showed the following in the proof above:

• E[X̃1u] = 0_{k1×1}: this comes from the assumption that E[Xu] = 0, which holds by construction of the BLP.
• E[X̃1X2′] = 0_{k1×k2}: the "residual" from regressing X1 on X2 is orthogonal to X2.
• E[X̃1 BLP(Y | X2)] = 0_{k1×1}: the "residual" from "regressing" X1 on X2 is orthogonal to BLP(Y | X2),
  i.e. to the fitted value from "regressing" Y on X2.
• E[X̃1 BLP(X1 | X2)′] = 0_{k1×k1}: the "residual" from "regressing" X1 on X2 is orthogonal to BLP(X1 | X2),
  i.e. to the fitted value from "regressing" X1 on X2.
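
A sample-analog check of the Frisch-Waugh result (a numerical sketch; the simulated design and coefficient values are assumptions):

import numpy as np

rng = np.random.default_rng(8)
n = 100_000
x2 = rng.normal(size=(n, 2))                         # X2 block
x1 = 0.5 * x2[:, [0]] + rng.normal(size=(n, 1))      # X1, correlated with X2
u = rng.normal(size=n)
y = x1[:, 0] * 1.5 + x2 @ np.array([2.0, -1.0]) + u  # Y = 1.5*X1 + X2'(2, -1) + u

X = np.column_stack([x1, x2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]     # coefficient on X1 from the joint regression

# FWL: residualize Y and X1 on X2, then regress the residuals on each other.
py = x2 @ np.linalg.lstsq(x2, y, rcond=None)[0]
px1 = x2 @ np.linalg.lstsq(x2, x1, rcond=None)[0]
beta_fwl = np.linalg.lstsq(x1 - px1, y - py, rcond=None)[0]
print(beta_full[0], beta_fwl[0])                     # the two estimates of beta_1 coincide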

4.2.3 Solving for β 2 when X′1 = 1


From:
Y = X1′β1 + X2′β2 + u,
if X1 = 1 (the constant regressor), then we have the following linear regression:
Y = β1 + X2′β2 + u,
where β1 is the intercept. From the FWT we know that in general β2 = β̃ can be found from:
Ỹ = X̃2′β̃ + ũ,
where
Ỹ := Y − BLP(Y | X1) = Y − X1′γ,
X̃2 := X2 − BLP(X2 | X1) = X2 − X1′δ.
For this particular case, BLP(Y | X1) solves min_{β0} E[(Y − β0)²], which gives the first-order condition
E[Y − β0] = 0 ⇒ β0 = E[Y]. Similarly for BLP(X2 | X1). Substituting into the expressions above, we have:
Ỹ = Y − E[Y],
X̃2 = X2 − E[X2].
Thus,
β̃ = E[(X2 − E[X2])(X2 − E[X2])′]⁻¹ E[(X2 − E[X2])(Y − E[Y])]
= Var[X2]⁻¹ Cov[X2, Y].
As β2 = β̃, we have that, when there is an intercept, the vector β2 can be found via the FWT as:
β2 = Var[X2]⁻¹ Cov[X2, Y].
4.2.4 Estimating β
Let (Y, X, u) be random variables/vectors where Y, u ∈ R and X ∈ R^{k+1}, where
X = (X0, X1, . . . , Xk)′, X0 = 1,
β = (β0, β1, . . . , βk)′,
is a parameter such that
Y = X′β + u.
Assume that:
1. E[Xu] = 0
2. E[XX′] < ∞
3. there is no perfect collinearity in X
4. (Y, X) ∼ P and we have an iid sample (Y¹, X¹), (Y², X²), . . . , (Yⁿ, Xⁿ) ∼ P, where superscripts denote
   the sample index.
Recall that
β = E[XX′]⁻¹ E[XY].
Then, a natural estimator for β is
β̂n = ( (1/n) Σ_{i=1}^n XⁱXⁱ′ )⁻¹ ( (1/n) Σ_{i=1}^n XⁱYⁱ ).
This (analog) estimator is called the ordinary least squares (OLS) estimator. The name comes from the
fact that β̂n solves
β̂n = argmin_{b ∈ R^{k+1}} (1/n) Σ_{i=1}^n (Yⁱ − Xⁱ′b)².
To see this, notice that the first-order condition is given by
(1/n) Σ_{i=1}^n Xⁱ(Yⁱ − Xⁱ′β̂n) = 0_{(k+1)×1}
⇒ ( (1/n) Σ_{i=1}^n XⁱXⁱ′ ) β̂n = (1/n) Σ_{i=1}^n XⁱYⁱ.
We define the fitted values to be
Ŷⁱ = Xⁱ′β̂n
and the fitted residuals to be
ûⁱ := Yⁱ − Ŷⁱ = Yⁱ − Xⁱ′β̂n.
With these definitions, we realise that the first-order condition can be written as
Σ_{i=1}^n Xⁱûⁱ = 0_{(k+1)×1},
i.e. Σ_{i=1}^n Xjⁱ ûⁱ = 0 for each j = 0, 1, . . . , k.
Note that since the first component of Xⁱ equals 1 for all i, the first-order condition implies that
Σ_{i=1}^n ûⁱ = 0
when we have a constant in the regression.
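
A minimal sketch of the analog (OLS) estimator in code (illustrative; the data-generating process is an assumption):

import numpy as np

rng = np.random.default_rng(9)
n, beta = 10_000, np.array([1.0, 2.0, -0.5])

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # X^i = (1, X1^i, X2^i)'
u = rng.normal(size=n)
y = X @ beta + u

# beta_hat = ( (1/n) Σ X^i X^i' )^{-1} ( (1/n) Σ X^i Y^i )
beta_hat = np.linalg.solve((X.T @ X) / n, (X.T @ y) / n)
resid = y - X @ beta_hat                                     # fitted residuals û^i
print(beta_hat, "sum of residuals ~ 0:", resid.sum())        # residuals sum to ~0 (constant included)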

4.2.5 Unbiasedness of the natural estimator of β


E[Xu] = 0 is not enough for β̂n to be unbiased. We need E[u | X] = 0 (i.e. u is mean independent of X).
Recall that
β̂n = ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ XⁱYⁱ ).
Then, substituting Yⁱ = Xⁱ′β + uⁱ,
β̂n = ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ Xⁱ(Xⁱ′β + uⁱ) )
= ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ XⁱXⁱ′ ) β + ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ Xⁱuⁱ )
= β + ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ Xⁱuⁱ ).
Taking the conditional expectation given X¹, X², . . . , Xⁿ:
E[β̂n | X¹, X², . . . , Xⁿ] = β + ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ E[ (1/n) Σᵢ Xⁱuⁱ | X¹, X², . . . , Xⁿ ]
= β + ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ Xⁱ E[uⁱ | X¹, X², . . . , Xⁿ] ).
Since u is mean independent of X (and the sample is iid),
E[β̂n | X¹, X², . . . , Xⁿ] = β.
By the LIE:
E[β̂n] = E[ E[β̂n | X¹, X², . . . , Xⁿ] ] = β.

4.2.6 Consistency of the natural estimator of β
Consistency. Without requiring further assumptions, we can show that the OLS estimator is consistent.
Recall that
β̂n = ( (1/n) Σ_{i=1}^n XⁱXⁱ′ )⁻¹ ( (1/n) Σ_{i=1}^n XⁱYⁱ ).
First, since E[XⁱXⁱ′] = E[XX′] exists and the Xⁱ's are iid,
(1/n) Σ_{i=1}^n XⁱXⁱ′ →p E[XX′].
Since we are given that there is no perfect collinearity in X, E[XX′] is invertible. Then, by the
Continuous Mapping Theorem,
( (1/n) Σ_{i=1}^n XⁱXⁱ′ )⁻¹ →p E[XX′]⁻¹.
Similarly, since E[XⁱYⁱ] = E[XY] = E[X(X′β + u)] = E[XX′]β exists, and the (Xⁱ, Yⁱ)'s are iid,
(1/n) Σ_{i=1}^n XⁱYⁱ →p E[XY].
Since convergence in marginal probabilities implies convergence in joint probability:
( ( (1/n) Σ_{i=1}^n XⁱXⁱ′ )⁻¹ , (1/n) Σ_{i=1}^n XⁱYⁱ ) →p ( E[XX′]⁻¹ , E[XY] ).
Noting that matrix multiplication is a continuous operation, by the Continuous Mapping Theorem,
β̂n = ( (1/n) Σ_{i=1}^n XⁱXⁱ′ )⁻¹ ( (1/n) Σ_{i=1}^n XⁱYⁱ ) →p E[XX′]⁻¹ E[XY] = β.

4.2.7 Limiting distribution of the natural estimator of β


Suppose further that Var[Xu] exists. Then
√n (β̂n − β) →d N(0, Ω),
where
Ω = E[XX′]⁻¹ Var[Xu] E[XX′]⁻¹.
Proof: Substituting Yⁱ = Xⁱ′β + uⁱ into β̂n yields
β̂n = ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ Xⁱ(Xⁱ′β + uⁱ) )
= β + ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ Xⁱuⁱ )
⇒ √n (β̂n − β) = ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ √n ( (1/n) Σᵢ Xⁱuⁱ ).
By assumption, Var[Xu] exists and the Xⁱuⁱ's are iid; then by the Central Limit Theorem:
√n ( (1/n) Σᵢ Xⁱuⁱ ) →d N(0, Var[Xu]),
where we used the fact that E[Xu] = 0 by assumption. We already showed previously that
( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ →p E[XX′]⁻¹.
Hence, by Slutsky's Lemma,
√n (β̂n − β) = ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ √n ( (1/n) Σᵢ Xⁱuⁱ ) →d E[XX′]⁻¹ N(0, Var[Xu]).
Thus,
√n (β̂n − β) →d N(0, E[XX′]⁻¹ Var[Xu] E[XX′]⁻¹).

4.2.8 Limiting distribution of the natural estimator of β if E[u | X] = 0


Suppose, in addition, that E[u | X] = 0 and Var[u | X] = σ² (i.e. u is homoscedastic). Then
√n (β̂n − β) →d N(0, σ² E[XX′]⁻¹).
Proof:
Notice that
Var[Xu] = E[(Xu − E[Xu])(Xu − E[Xu])′]
(since E[Xu] = 0) = E[XX′u²]
(LIE) = E[XX′ E[u² | X]].
Since
Var[u | X] = E[(u − E[u | X])² | X] = E[u² | X] − E[u | X]² = E[u² | X] = σ²,
we can write
Var[Xu] = E[XX′] σ².
Thus,
Ω = E[XX′]⁻¹ Var[Xu] E[XX′]⁻¹ = E[XX′]⁻¹ E[XX′] σ² E[XX′]⁻¹ = σ² E[XX′]⁻¹.

4.2.9 Estimating Ω
Recall:
√n (β̂n − β) →d N(0, Ω),
where
Ω = E[XX′]⁻¹ Var[Xu] E[XX′]⁻¹.
Since we do not observe u, we do not know Ω. Here, we focus on deriving a consistent estimator of Ω.

Under homoscedasticity:
Suppose u is homoscedastic, so that
E[u | X] = 0, Var[u | X] = σ².
Recall that, in this case, Ω simplifies to
Ω = σ² E[XX′]⁻¹.
So, a natural estimator for Ω is:
Ω̃n := ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ σ̂n²,  with σ̂n² := (1/n) Σᵢ (ûⁱ)².
We know from before (see the consistency of OLS) that
( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ →p E[XX′]⁻¹.
Thus, it remains to show that σ̂n² is consistent.
Note that if we could observe uⁱ and define σ̂n² using uⁱ, then it would be easy to show that σ̂n² is consistent
(use the WLLN). But we cannot observe uⁱ.
However, we can still introduce uⁱ to show that σ̂n² is consistent. To do so, write out ûⁱ while adding and
subtracting Xⁱ′β:
ûⁱ = Yⁱ − Xⁱ′β̂n = Yⁱ − Xⁱ′β + Xⁱ′β − Xⁱ′β̂n = uⁱ − Xⁱ′(β̂n − β).
This implies that
(ûⁱ)² = (uⁱ)² − 2uⁱXⁱ′(β̂n − β) + ( Xⁱ′(β̂n − β) )²,
which, in turn, gives that
σ̂n² = (1/n) Σᵢ (uⁱ)² − 2 (1/n) Σᵢ uⁱXⁱ′(β̂n − β) + (1/n) Σᵢ ( Xⁱ′(β̂n − β) )².
Let us consider each term in turn:
• (1/n) Σᵢ (uⁱ)²: Since the uⁱ's are iid, with E[u] = 0 and E[u²] < ∞, the WLLN immediately gives us that
  (1/n) Σᵢ (uⁱ)² →p Var[u].
• (1/n) Σᵢ uⁱXⁱ′(β̂n − β): Notice that (β̂n − β) can be taken outside of the summation since it does
  not depend on i. Then, we use the fact that β̂n →p β and (1/n) Σᵢ uⁱXⁱ′ →p E[uX′] = 0, and the
  Continuous Mapping Theorem (as well as the fact that convergence in marginal probabilities implies
  convergence in joint probability), to conclude that
  ( (1/n) Σᵢ uⁱXⁱ′ ) (β̂n − β) →p 0.
• (1/n) Σᵢ ( Xⁱ′(β̂n − β) )²: Since the term is non-negative, it is sufficient to show that an upper
  bound converges to zero. With a and b being (k + 1) × 1 vectors, the Cauchy-Schwarz inequality tells us that
  ( Σⱼ aⱼbⱼ )² ≤ ( Σⱼ aⱼ² ) ( Σⱼ bⱼ² ).
  So, for each i,
  ( Xⁱ′(β̂n − β) )² = ( Σ_{j=0}^k Xjⁱ(β̂n,j − βj) )² ≤ ( Σ_{j=0}^k (Xjⁱ)² ) ( Σ_{j=0}^k (β̂n,j − βj)² ) = |Xⁱ|² |β̂n − β|²,
  where |·| is the Euclidean norm. Summing across i and dividing by n, we obtain that
  (1/n) Σᵢ ( Xⁱ′(β̂n − β) )² ≤ ( (1/n) Σᵢ |Xⁱ|² ) |β̂n − β|².
  Notice that E[XX′] < ∞ implies that E[Xj²] < ∞ for all j = 0, 1, . . . , k, since the diagonal entries of
  E[XX′] are E[X0²], E[X1²], . . . , E[Xk²].
  Since the Xjⁱ's are iid, by the WLLN, we know that (1/n) Σᵢ (Xjⁱ)² →p E[Xj²]. Then, by the fact that
  convergence in marginal probabilities implies convergence in joint probability, using the Continuous
  Mapping Theorem,
  (1/n) Σᵢ |Xⁱ|² = Σ_{j=0}^k (1/n) Σᵢ (Xjⁱ)² →p Σ_{j=0}^k E[Xj²].
  Therefore, by the CMT and the fact that β̂n →p β,
  ( (1/n) Σᵢ |Xⁱ|² ) |β̂n − β|² →p 0.
Thus, we obtain the result that
σ̂n² →p Var[u] = σ².
Therefore, we have that (since convergence in marginal probabilities implies convergence in joint probability)
( ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ , σ̂n² ) →p ( E[XX′]⁻¹ , σ² ),
and applying the Continuous Mapping Theorem (with g(a, b) = ab, which is continuous) yields that
Ω̃n = ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ σ̂n² →p σ² E[XX′]⁻¹ = Ω.

Under heteroscedasticity:
A natural estimator here is
Ω̂n = ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ XⁱXⁱ′(ûⁱ)² ) ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹,
where ûⁱ := Yⁱ − Ŷⁱ = Yⁱ − Xⁱ′β̂n.
We already showed that ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ →p E[XX′]⁻¹, so our focus here is the term in the middle.
Since the Xⁱ's and uⁱ's are both iid and, by assumption, Var[Xu] exists, applying the WLLN (together with
an argument like the one above to replace ûⁱ with uⁱ) gives us that
(1/n) Σᵢ XⁱXⁱ′(ûⁱ)² →p E[XX′u²].
Hence, by the Continuous Mapping Theorem, Ω̂n →p E[XX′]⁻¹ E[XX′u²] E[XX′]⁻¹ = Ω.
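
A sketch of both variance estimators in code (illustrative; the simulated design with heteroscedastic errors is an assumption):

import numpy as np

rng = np.random.default_rng(10)
n, beta = 20_000, np.array([1.0, 2.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = rng.normal(size=n) * (0.5 + np.abs(X[:, 1]))          # heteroscedastic error
y = X @ beta + u

beta_hat = np.linalg.solve((X.T @ X) / n, (X.T @ y) / n)
uhat = y - X @ beta_hat

Q_inv = np.linalg.inv((X.T @ X) / n)                      # ( (1/n) Σ X^i X^i' )^{-1}
omega_homo = Q_inv * np.mean(uhat**2)                     # Ω̃_n = Q^{-1} σ̂_n^2
meat = (X * uhat[:, None]**2).T @ X / n                   # (1/n) Σ X^i X^i' (û^i)^2
omega_robust = Q_inv @ meat @ Q_inv                       # Ω̂_n (sandwich form)
print(np.sqrt(np.diag(omega_homo) / n))                   # homoscedasticity-based std. errors
print(np.sqrt(np.diag(omega_robust) / n))                 # heteroscedasticity-robust std. errors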

4.2.10 Gauss Markov Theorem


Suppose further that E[u | X] = 0 and Var[u | X] = σ² (i.e. u is homoscedastic). Then β̂n is BLUE: the "best"
linear unbiased estimator of β, in the sense of having the "smallest" variance among all linear unbiased estimators.

4.2.11 Inference: Testing a single linear restriction


Suppose we wish to test, at level α,
H0 : r′β = c,
H1 : r′β ≠ c,
where r is a (k + 1) × 1 vector and c is a scalar. For example, if r = (1, 0, . . . , 0)′, then the test is of β0 = c.
Alternatively, if r = (1, −1, 0, 0, . . . , 0)′, then the test is of β0 − β1 = c. Recall that
√n (β̂n − β) →d N(0, Ω).
Since linear operations are continuous, by the Delta Method with g(a) = r′a and Dg(a) = r′,
√n (r′β̂n − r′β) →d N(0, r′Ωr).
Note that, if r ≠ 0, then r′Ωr > 0 (since Ω is positive definite), so that, by the Continuous Mapping Theorem,
(r′Ω̂n r)⁻¹ →p (r′Ωr)⁻¹.
Then, by Slutsky's Lemma,
√n (r′β̂n − r′β) / √(r′Ω̂n r) →d N(0, 1).
Thus, we can consider the following test statistic (under the null)
Tn = √n (r′β̂n − c) / √(r′Ω̂n r)
and the test is
ϕn = 1{|Tn| > z_{1−α/2}}.
By construction, this test is consistent in level α.

4.2.12 Inference: Testing multiple linear restrictions (Wald Test)


Suppose we wish to test, at level α,
H0 : Rβ = c,
H1 : Rβ ≠ c,
where R is p × (k + 1) and c is p × 1; i.e. there are p linear restrictions. For example, this allows us to test
(β0, β2 + β3, β1 − β2)′ = (1, 0, 2)′
by setting
R = [ 1 0 0 0 0 · · · ; 0 0 1 1 0 · · · ; 0 1 −1 0 0 · · · ],  a 3 × (k + 1) matrix.
Just as we required r ≠ 0, here, to rule out redundant restrictions (e.g. 2β1 + 2β2 = 2c and β1 + β2 = c),
we require the rows of R to be linearly independent.
This means that, given that Ω is invertible by assumption, RΩR′ is also invertible. To see why the linear
independence of the rows matters, suppose
Ω = [ a b ; c d ]
and that this is invertible. If R = I2, whose rows are linearly independent, then RΩR′ = Ω, which is invertible.
However, if R = [ 2 2 ; 1 1 ], whose rows are linearly dependent, then
RΩR′ = [ 4(a + b + c + d)  2(a + b + c + d) ; 2(a + b + c + d)  a + b + c + d ],
which is singular.
By the Continuous Mapping Theorem (linear operations are continuous),
√n (Rβ̂n − Rβ) →d N(0, RΩR′).
By the Continuous Mapping Theorem, we also have that
(RΩ̂nR′)⁻¹ →p (RΩR′)⁻¹.
Recall that if z ∼ N(0, V) with V an invertible m × m matrix, then z′V⁻¹z ∼ χ²_m. Using this (together with
Slutsky's Lemma), we obtain that
n (Rβ̂n − Rβ)′ (RΩ̂nR′)⁻¹ (Rβ̂n − Rβ) →d χ²_p.
The test statistic is then given by
Tn = n (Rβ̂n − c)′ (RΩ̂nR′)⁻¹ (Rβ̂n − c)
and
ϕn = 1{Tn > c_{p,1−α}},
where c_{p,1−α} is the (1 − α)-th quantile of the χ²_p distribution. By construction, this is consistent in level α.
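
A sketch of the Wald test in code (illustrative; the simulated design and the two restrictions being tested, which hold under the assumed DGP, are assumptions):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(11)
n, beta = 20_000, np.array([1.0, 2.0, -0.5])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.solve((X.T @ X) / n, (X.T @ y) / n)
uhat = y - X @ beta_hat
Q_inv = np.linalg.inv((X.T @ X) / n)
omega = Q_inv @ ((X * uhat[:, None]**2).T @ X / n) @ Q_inv   # robust Ω̂_n

# H0: R beta = c with two restrictions: beta_1 = 2 and beta_1 + beta_2 = 1.5 (true here).
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
c = np.array([2.0, 1.5])
diff = R @ beta_hat - c
Tn = n * diff @ np.linalg.inv(R @ omega @ R.T) @ diff        # Wald statistic
print(Tn, "reject at 5%:", Tn > chi2.ppf(0.95, df=len(c)))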

4.2.13 Tests of non-linear restrictions


Suppose we wish to test, at level α,
H0 : f(β) = c,
H1 : f(β) ≠ c,
where f : R^{k+1} → R^p is continuously differentiable at β. Denote by Dβf(β) the p × (k + 1) matrix of partial
derivatives of f, evaluated at β, with linearly independent rows. Then, by the Delta Method,
√n (f(β̂n) − f(β)) →d N(0, Dβf(β) Ω Dβf(β)′),
where we note that Dβf(β) Ω Dβf(β)′ is non-singular (for the same reason as before). By the Continuous
Mapping Theorem,
Dβf(β̂n) Ω̂n Dβf(β̂n)′ →p Dβf(β) Ω Dβf(β)′.
Then we have that
n (f(β̂n) − f(β))′ ( Dβf(β̂n) Ω̂n Dβf(β̂n)′ )⁻¹ (f(β̂n) − f(β)) →d χ²_p.
The test statistic is then given by
Tn = n (f(β̂n) − c)′ ( Dβf(β̂n) Ω̂n Dβf(β̂n)′ )⁻¹ (f(β̂n) − c)
and
ϕn = 1{Tn > c_{p,1−α}},
where c_{p,1−α} is the (1 − α)-th quantile of the χ²_p distribution. By construction, this is consistent in level α.

4.3 Linear Regression when E[Xu] ̸= 0
Suppose that we have (Y, X, u) where Y, u ∈ R and X ∈ Rk+1 , with X0 = 1, and β such that

Y = X′ β + u.

• Any Xj with E [Xj u] = 0 is said to be exogenous.


• Any Xj with E [Xj u] ̸= 0 is said to be endogenous.

4.3.1 Motivating examples for when E[Xu] ̸= 0


The examples below show that even if the ”true” model is such that E[Xu] = 0, we can still have endogeneity
problems in the estimated model if:
i the estimated model omits relevant variables
ii there are measurement errors
iii there are simultaneity problems
Omitted Variables:
Suppose k = 2 and that the true model is given by

Y = β0 + β1 X1 + β2 X2 + u

Assume that the model is causal but we still have that E[Xu] = 0. However, suppose we cannot observe X2
so that we estimate the model as
Y = β0∗ + β1∗ X1 + u∗ .
Using the true model, we can rewrite the estimated model as:

Y = (β0 + E [X2 ] β2 ) + β1 X1 + (u + (X2 − E [X2 ]) β2 ).


| {z } |{z} | {z }
=β0∗ =β1∗ =u∗

Then, note that


E [u∗ ] = E[u] + E [X2 − E [X2 ]] β2 = 0
however,
Cov [X1 , u∗ ] = Cov [X1 , u + (X2 − E [X2 ]) β2 ]
= Cov [X1 , X2 ] β2 .
Thus, Cov [X1 , u ] ̸= 0 if Cov [X1 , X2 ] ̸= 0 and/or β2 ̸= 0. Hence, we realise that E [X1 u∗ ] ̸= 0 in general so

that X1 is an endogenous variable in the estimated model.


Measurement error
Suppose we partition X into X0 = 1 and X1 ∈ Rk , and partition β correspondingly. The true model is given
by
Y = β0 + X′1 β 1 + u

We suppose that this is the causal model and that E[Xu] = 0. However, assume that X1 is unobserved and
that we instead observe
X̂1 = X1 + v,
where E[v] = 0, Cov[u, v] = 0 and Cov [X1 , v] = 0. The model we therefore estimate is

Y = β0∗ + X̂′1 β ∗1 + u∗ .

35
Using the true model, we can rewrite the estimated model as

Y = β0 + X̂′1 β 1 + (u − v′ β 1 ) .
| {z }
=u∗

Then, note that


E [u∗ ] = E[u] − E [v′ ] β 1 = 0,
however, h i
Cov X̂1 , u∗ = Cov [X1 + v, u − v′ β 1 ]
= − Var[v]β 1
Thus,
h unless
i Var[v] = 0 (i.e. measurement error is a constant) or β 1 = 0 (i.e. X1 are unrelated to Y ), then
E X̂1 u∗ ̸= 0; i.e. X̂1 is endogenous.

Simultaneity
Let superscript d denote demand-side variables and superscript s denote supply-side variables. Consider the following demand and supply equations:

Q^d = β0^d + β1^d P̃ + u^d,
Q^s = β0^s + β1^s P̃ + u^s,

with E[u^d] = E[u^s] = 0 and E[u^d u^s] = 0. What we observe in the data is the equilibrium outcome determined by the market-clearing condition, Q^d = Q^s; i.e.

β0^d + β1^d P̃ + u^d = β0^s + β1^s P̃ + u^s

⇒ P̃ = (β0^s − β0^d + u^s − u^d) / (β1^d − β1^s).

Thus, since P̃ contains u^s and u^d, E[P̃ u^d] ≠ 0 and E[P̃ u^s] ≠ 0; i.e. P̃ is endogenous in both equations.
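A short simulation (arbitrary illustrative parameters) confirms that the equilibrium price is correlated with the demand shock, so OLS applied to the demand equation does not recover β1^d.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
b0d, b1d = 10.0, -1.0    # demand: downward sloping
b0s, b1s = 2.0, 1.0      # supply: upward sloping

ud = rng.normal(size=n)
us = rng.normal(size=n)

# Equilibrium price solves Qd = Qs
p = (b0s - b0d + us - ud) / (b1d - b1s)
q = b0d + b1d * p + ud

# Price is correlated with the demand shock, so OLS on the demand
# equation does not recover b1d
print(np.cov(p, ud)[0, 1])                       # clearly non-zero (about 0.5 here)
X = np.column_stack([np.ones(n), p])
print(np.linalg.lstsq(X, q, rcond=None)[0][1])   # roughly 0 with these values, not -1
```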

4.3.2 What happens to the OLS estimator if E[Xu] ̸= 0 ?


In general, when E[Xu] ≠ 0, the OLS estimator is both biased and inconsistent. To see the bias, note that when E[Xu] ≠ 0 we cannot have E[u | X] = 0 (since E[u | X] = 0 would imply E[Xu] = 0), so the conditional-mean assumption used to establish unbiasedness fails.
Similarly, when E[Xu] ≠ 0, notice that

E[XX′]^{-1} E[XY] = E[XX′]^{-1} E[X(X′β + u)]
                  = β + E[XX′]^{-1} E[Xu],

where the last term is non-zero.

Hence, although we would still have that β̂_n →p E[XX′]^{-1}E[XY], we no longer have that β̂_n →p β. That is, the OLS estimator is inconsistent.

4.3.3 Solving for β: Instrumental Variables


We maintain the same assumptions as before; however, suppose further that there exist instruments Z ∈ R^{l+1} such that
• (instrument exogeneity). E[Zu] = 0.
• (instrument relevance/rank condition). E[ZX′] has rank k + 1.
 

• (order condition). l + 1 ≥ k + 1 (a necessary condition for rank condition).
• Z includes all exogenous Xj (in particular, Z0 = X0 = 1 ).
• E [ZZ′ ] and E [ZX′ ] exist.
• No perfect collinearity in Z.
Now we solve for β. First, note that

E[Zu] = 0 ⇒ E [Z (Y − X′ β)] = 0
⇒ E[ZY] = E [ZX′ ] β

When β is exactly identified (l + 1 = k + 1):


Suppose first that l + 1 = k + 1. Then E[ZX′] is an invertible square matrix so that, from the above equation, we can immediately obtain

β = E[ZX′]^{-1} E[ZY].

When β is over-identified (l + 1 > k + 1):


Now suppose that l + 1 > k + 1. Then E[ZX′] is not square and hence not invertible. However, the following result will allow us to proceed:
Suppose there is no perfect collinearity in Z. Then

rank(E[ZX′]) = rank(Π),

where Π is such that BLP(X | Z) = Π′Z. In particular, if rank(E[ZX′]) = k + 1, then

Π′E[ZX′] = Π′E[Z(Π′Z)′] = Π′E[ZZ′]Π

is invertible.
Multiplying both sides of E[ZY] = E[ZX′]β by Π′:

Π′E[ZY] = Π′E[ZX′]β.

Since BLP(X | Z) = Π′Z:

Π′E[ZY] = Π′E[ZZ′]Πβ.

As we showed that Π′E[ZZ′]Π is invertible, we can then write the solution for β as

β = (Π′E[ZX′])^{-1} Π′E[ZY],

where

Π = E[ZZ′]^{-1} E[ZX′].

(Rank Inequality). For any conformable matrices A and B,

rank(AB) ≤ min{rank(A), rank(B)}.

Proof. From the fact that BLP(X | Z) = Π′Z,

X = Π′Z + v,

with E[Zv′] = 0 (we use the fact that there is no perfect collinearity in Z for the existence of Π, so that E[ZZ′] is invertible). Hence,

E[ZX′] = E[Z(Π′Z + v)′] = E[ZZ′]Π + E[Zv′] = E[ZZ′]Π.

Using the Rank Inequality,

rank(E[ZX′]) = rank(E[ZZ′]Π) ≤ min{rank(E[ZZ′]), rank(Π)} ≤ rank(Π).

Conversely, since E[ZZ′] is invertible,

rank(Π) = rank(E[ZZ′]^{-1} E[ZZ′]Π) ≤ rank(E[ZZ′]Π) = rank(E[ZX′]),

so that rank(E[ZX′]) = rank(Π).

4.3.4 Estimating β
Suppose we have the following:
• (Y, X, Z, u) such that X ∈ R^{k+1} with X0 = 1 and Z ∈ R^{l+1} with Z0 = 1, and Z includes any exogenous Xj's.
• E[ZZ′] and E[ZX′] exist

• there is no perfect collinearity in Z


• E[Zu] = 0
• rank (E [ZX′ ]) = k + 1
• sample is iid:
(Y^1, X^1, Z^1), (Y^2, X^2, Z^2), . . . , (Y^n, X^n, Z^n) are iid copies of (Y, X, Z).

We consider two cases:


1. Instrumental variables (IV) estimator: in the exactly identified case l + 1 = k + 1
2. Two-stage least squares (TSLS) estimator: in the over-identified case l + 1 > k + 1
IV estimator
Recall that, when l + 1 = k + 1:

β = E[ZX′]^{-1} E[ZY].

So the natural estimator is

β̂_n = [(1/n) Σ_{i=1}^n Z^i X^{i′}]^{-1} [(1/n) Σ_{i=1}^n Z^i Y^i].
Note that this implies:

(1/n) Σ_{i=1}^n Z^i (Y^i − X^{i′} β̂_n) = 0
⇒ (1/n) Σ_{i=1}^n Z^i û^i = 0,

where û^i = Y^i − X^{i′} β̂_n.


The IV estimator can also be written as

β̂_n = (Z′X)^{-1} Z′Y,

where Z = (Z^1, Z^2, . . . , Z^n)′ and X and Y are stacked analogously. In this case, defining Û = (û^1, û^2, . . . , û^n)′ = Y − Xβ̂_n, we realise that

Z′Û = 0.
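
In matrix form the estimator is a one-liner. A minimal sketch (illustrative only; Z, X, Y are assumed to be numpy arrays holding the stacked data, with l + 1 = k + 1):

```python
import numpy as np

def iv_estimator(Z, X, Y):
    """Exactly identified IV estimator: beta_hat = (Z'X)^{-1} Z'Y.

    Z and X are (n, k+1) arrays with the same number of columns, Y is (n,).
    """
    beta_hat = np.linalg.solve(Z.T @ X, Z.T @ Y)
    u_hat = Y - X @ beta_hat       # residuals; Z.T @ u_hat is zero by construction
    return beta_hat, u_hat
```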

TSLS estimator
Recall that, when l + 1 ≥ k + 1,

β = (Π′E[ZX′])^{-1} Π′E[ZY]
  = (Π′E[ZZ′]Π)^{-1} Π′E[ZY],

where BLP(X | Z) = Π′Z, so that

Π = E[ZZ′]^{-1} E[ZX′].

So the natural estimator is

β̂¹_n = [Π̂′_n ((1/n) Σ_{i=1}^n Z^i X^{i′})]^{-1} [Π̂′_n ((1/n) Σ_{i=1}^n Z^i Y^i)]
     = [(1/n) Σ_{i=1}^n Π̂′_n Z^i X^{i′}]^{-1} [(1/n) Σ_{i=1}^n Π̂′_n Z^i Y^i]

or

β̂²_n = [Π̂′_n ((1/n) Σ_{i=1}^n Z^i Z^{i′}) Π̂_n]^{-1} [Π̂′_n ((1/n) Σ_{i=1}^n Z^i Y^i)],

where

Π̂_n = [(1/n) Σ_{i=1}^n Z^i Z^{i′}]^{-1} [(1/n) Σ_{i=1}^n Z^i X^{i′}].
To see that β̂¹_n = β̂²_n, recall that X^i = Π̂′_n Z^i + v̂^i, so that

β̂¹_n = [Π̂′_n ((1/n) Σ_{i=1}^n Z^i (Π̂′_n Z^i + v̂^i)′)]^{-1} [Π̂′_n ((1/n) Σ_{i=1}^n Z^i Y^i)]
     = [Π̂′_n ((1/n) Σ_{i=1}^n Z^i Z^{i′}) Π̂_n]^{-1} [Π̂′_n ((1/n) Σ_{i=1}^n Z^i Y^i)] = β̂²_n,

where we use the fact that

(1/n) Σ_{i=1}^n Z^i v̂^{i′} = 0

by construction.
Notice also that:
• β̂¹_n can be interpreted as the IV estimator in which the instrument Z^i is replaced with Π̂′_n Z^i.
• β̂²_n has the two-stage least squares interpretation, as illustrated in the sketch below: (i) we first regress X^i on Z^i to obtain the fitted values Π̂′_n Z^i; (ii) we then regress Y^i on Π̂′_n Z^i.
• if Π̂_n is invertible (i.e. l + 1 = k + 1), then the IV and TSLS estimators coincide exactly.
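
The two-stage interpretation translates directly into code. A minimal sketch (illustrative; Z, X, Y are the stacked data matrices) that also checks numerically that the two expressions for the estimator coincide:

```python
import numpy as np

def tsls(Z, X, Y):
    """Two-stage least squares: Z is (n, l+1), X is (n, k+1), Y is (n,)."""
    # First stage: regress X on Z to obtain fitted values X_hat = Z @ Pi_hat
    Pi_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)
    X_hat = Z @ Pi_hat
    # "IV with instrument Pi_hat' Z^i": (X_hat' X)^{-1} X_hat' Y
    beta_1 = np.linalg.solve(X_hat.T @ X, X_hat.T @ Y)
    # Second-stage regression of Y on the fitted values: (X_hat' X_hat)^{-1} X_hat' Y
    beta_2 = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ Y)
    # The two coincide because Z'(X - X_hat) = 0 by the first-stage normal equations
    assert np.allclose(beta_1, beta_2)
    return beta_2
```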

4.3.5 Consistency of the TSLS estimator


Recall:

β = (E[Π′ZX′])^{-1} E[Π′ZY]
  = (E[Π′ZZ′Π])^{-1} E[Π′ZY],

and the TSLS estimator:

β̂_n = [(1/n) Σ_{i=1}^n Π̂′_n Z^i X^{i′}]^{-1} [(1/n) Σ_{i=1}^n Π̂′_n Z^i Y^i]
    = [((1/n) Σ_{i=1}^n Π̂′_n Z^i Z^{i′}) Π̂_n]^{-1} [(1/n) Σ_{i=1}^n Π̂′_n Z^i Y^i],

where

Π̂_n = [(1/n) Σ_{i=1}^n Z^i Z^{i′}]^{-1} [(1/n) Σ_{i=1}^n Z^i X^{i′}].

Since the Z^i and X^i are iid and E[ZZ′] and E[ZX′] exist, by the WLLN,

(1/n) Σ_{i=1}^n Z^i X^{i′} →p E[ZX′],
(1/n) Σ_{i=1}^n Z^i Z^{i′} →p E[ZZ′].

By the Continuous Mapping Theorem, since there is no perfect collinearity in Z (i.e. E[ZZ′] is invertible),

[(1/n) Σ_{i=1}^n Z^i Z^{i′}]^{-1} →p E[ZZ′]^{-1}.

By the Continuous Mapping Theorem again,

Π̂_n →p Π.

Since the (Z^i, Y^i) are iid,

(1/n) Σ_{i=1}^n Z^i Y^i →p E[ZY].

By the Continuous Mapping Theorem,

Π̂′_n ((1/n) Σ_{i=1}^n Z^i Y^i) →p Π′E[ZY] = E[Π′ZY].

By the Continuous Mapping Theorem,

Π̂′_n ((1/n) Σ_{i=1}^n Z^i Z^{i′}) Π̂_n →p Π′E[ZZ′]Π.

Since rank(E[ZX′]) = rank(Π) = k + 1, E[Π′ZZ′Π] is invertible, so, by the Continuous Mapping Theorem,

[Π̂′_n ((1/n) Σ_{i=1}^n Z^i Z^{i′}) Π̂_n]^{-1} →p (E[Π′ZZ′Π])^{-1}.

Applying the Continuous Mapping Theorem once more, we obtain that

β̂_n →p β = (E[Π′ZZ′Π])^{-1} E[Π′ZY].

4.3.6 Limiting distribution of the TSLS estimator
Assume further that Var[Zu] exists. Then,

√n (β̂_n − β) →d N(0, Ω),

where

Ω = (E[Π′ZZ′Π])^{-1} Π′Var[Zu]Π (E[Π′ZZ′Π])^{-1}.

To see this, substituting the expression Y^i = X^{i′}β + u^i gives

β̂_n = [(1/n) Σ_{i=1}^n Π̂′_n Z^i Z^{i′} Π̂_n]^{-1} [(1/n) Σ_{i=1}^n Π̂′_n Z^i (X^{i′}β + u^i)]
     = [(1/n) Σ_{i=1}^n Π̂′_n Z^i X^{i′}]^{-1} [(1/n) Σ_{i=1}^n Π̂′_n Z^i X^{i′}] β + [(1/n) Σ_{i=1}^n Π̂′_n Z^i X^{i′}]^{-1} [(1/n) Σ_{i=1}^n Π̂′_n Z^i u^i]
     = β + [(1/n) Σ_{i=1}^n Π̂′_n Z^i X^{i′}]^{-1} [(1/n) Σ_{i=1}^n Π̂′_n Z^i u^i],

where the second equality uses (1/n) Σ_i Π̂′_n Z^i Z^{i′} Π̂_n = (1/n) Σ_i Π̂′_n Z^i X^{i′}, shown above.

Rearranging and multiplying by √n yields

√n (β̂_n − β) = [(1/n) Σ_{i=1}^n Π̂′_n Z^i X^{i′}]^{-1} Π̂′_n √n ((1/n) Σ_{i=1}^n Z^i u^i).

Recall that we already showed that

[(1/n) Σ_{i=1}^n Π̂′_n Z^i X^{i′}]^{-1} Π̂′_n = [Π̂′_n ((1/n) Σ_{i=1}^n Z^i Z^{i′}) Π̂_n]^{-1} Π̂′_n →p (Π′E[ZZ′]Π)^{-1} Π′.

Since Var[Zu] < ∞ and the (Z^i, X^i, Y^i) are iid, by the Central Limit Theorem,

√n ((1/n) Σ_{i=1}^n Z^i u^i) →d N(0, Var[Zu]).

By Slutsky's Lemma,

[Π̂′_n ((1/n) Σ_{i=1}^n Z^i Z^{i′}) Π̂_n]^{-1} Π̂′_n √n ((1/n) Σ_{i=1}^n Z^i u^i) →d (Π′E[ZZ′]Π)^{-1} Π′ N(0, Var[Zu]).

That is,

√n (β̂_n − β) →d N(0, Ω),

where Ω = (E[Π′ZZ′Π])^{-1} Π′Var[Zu]Π (E[Π′ZZ′Π])^{-1}.


4.3.7 Efficiency of the TSLS estimator


Recall that we solve for β using Π′E[Zu] = 0. Alternatively, we could have used any other (l + 1) × (k + 1) matrix Γ such that rank(Γ′E[ZX′]) = k + 1, and used it to solve for β:

β̃_n = [(1/n) Σ_{i=1}^n Γ′ Z^i X^{i′}]^{-1} [(1/n) Σ_{i=1}^n Γ′ Z^i Y^i].

As before, we will have that

√n (β̃_n − β) →d N(0, Ω̃),

where

Ω̃ = (E[Γ′ZX′])^{-1} Γ′Var[Zu]Γ ((E[Γ′ZX′])^{-1})′

(the transpose on the last term is needed because E[Γ′ZX′] might not be symmetric).
However, if E[u | Z] = 0 and Var[u | Z] = σ², then Γ = Π is the "best" choice in the sense that Ω ≤ Ω̃ (in the positive semi-definite sense) for any Γ that satisfies rank(Γ′E[ZX′]) = k + 1.

4.3.8 Estimating Ω
The natural estimator here is

Ω̂_n = A ((1/n) Σ_{i=1}^n Z^i Z^{i′} (û^i)²) A′,

where A := [Π̂′_n ((1/n) Σ_{i=1}^n Z^i Z^{i′}) Π̂_n]^{-1} Π̂′_n. Note that we already showed that A converges in probability (to (Π′E[ZZ′]Π)^{-1}Π′).
The argument showing that (1/n) Σ_{i=1}^n Z^i Z^{i′} (û^i)² →p Var[Zu] is analogous to the one used to show that (1/n) Σ_{i=1}^n X^i X^{i′} (û^i)² →p Var[Xu] in the case of the OLS estimator.
Crucially, û^i is the residual computed with the original regressors and the TSLS estimate, û^i = Y^i − X^{i′}β̂_n; it is not the residual from the second-stage regression of Y^i on the fitted values Π̂′_n Z^i.
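
Putting the pieces together, a minimal sketch of Ω̂_n and the implied standard errors (illustrative only; it follows the formulas above, with the residuals computed from X and the TSLS estimate):

```python
import numpy as np

def tsls_with_se(Z, X, Y):
    """TSLS estimate plus a heteroskedasticity-robust estimate of Omega."""
    n = Z.shape[0]
    Pi_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)                    # first stage
    A = np.linalg.solve(Pi_hat.T @ (Z.T @ Z / n) @ Pi_hat, Pi_hat.T)
    beta_hat = A @ (Z.T @ Y / n)                                  # TSLS estimate
    u_hat = Y - X @ beta_hat                     # residuals use X, not Z @ Pi_hat
    meat = (Z * u_hat[:, None] ** 2).T @ Z / n   # (1/n) sum Z^i Z^i' (u^i)^2
    Omega_hat = A @ meat @ A.T
    se = np.sqrt(np.diag(Omega_hat) / n)         # standard errors of beta_hat
    return beta_hat, Omega_hat, se
```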
