
Refresher Course for Economics Students

Course: Econometrics¹
Prof.: Dr. Ricardo Quineche Uribe²
T.A.: Rita Huarancca Delgado³

¹ This content cannot be shared, uploaded, or distributed. This content has not been written for publication.
² The contact email address is [email protected].
³ The contact email address is [email protected].

Contents

1 Introduction

2 Large sample theory
  2.1 Existence of moments
  2.2 Convergence in Probability
  2.3 Markov's inequality
  2.4 Weak Law of Large Numbers (WLLN)
  2.5 Consistency of estimators
  2.6 Convergences in marginal probabilities imply convergence in joint probability
  2.7 Continuous Mapping Theorem (CMT) for convergence in probability
  2.8 Convergence in moments
    2.8.1 Properties
  2.9 Cauchy-Schwarz Inequality
  2.10 Jensen's Inequality
  2.11 Convergence in Distribution
    2.11.1 Properties
  2.12 Continuous Mapping Theorem (CMT) for convergence in distribution
  2.13 Slutsky's Lemma
  2.14 Cramér-Wold Device
  2.15 Central Limit Theorem (CLT)
    2.15.1 Univariate CLT
    2.15.2 Multivariate CLT
  2.16 Delta Method
    2.16.1 For one variable
    2.16.2 Multivariate version
  2.17 Example 1: Constructing an estimator of σ²(P)
  2.18 Example 2: Applying Delta Method: univariate
  2.19 Example 3: Applying Delta Method: multivariate
  2.20 Constructing a confidence set
    2.20.1 Not relying on Asymptotics
    2.20.2 Relying on Asymptotics
    2.20.3 Coverage Property
  2.21 Example 4: Constructing confidence sets with Markov's inequality
  2.22 Example 5: Constructing confidence sets using (M)CLT
  2.23 Hypothesis testing
    2.23.1 Consistency in level
    2.23.2 p-value of a test
    2.23.3 Steps
  2.24 Example 6: Test statistic for the sample mean
  2.25 Testing multidimensional hypothesis
  2.26 Example 7: Testing multidimensional hypothesis

3 Conditional expectations
  3.1 Expected Value Rule
  3.2 Expected Value Rule for conditional expectation
    3.2.1 Given an event A
    3.2.2 Given that a random variable Y takes on a specific value
    3.2.3 As a random variable
  3.3 The mean of E[X | Y]: Law of Iterated Expectations (LIE)
  3.4 Properties of conditional expectations
  3.5 About Independence
    3.5.1 Mean independence ⇏ independence
    3.5.2 Mean independence ⇒ zero covariance
    3.5.3 Uncorrelatedness ⇏ mean independence

4 Linear Regressions
  4.1 Three interpretations of linear regressions
    4.1.1 Linear Conditional Expectation
    4.1.2 "Best" Linear Approximation to Conditional Expectation / "Best" Linear Predictor of Y | X
    4.1.3 Causal model interpretation
  4.2 Linear Regression when E[Xu] = 0
    4.2.1 Solving for β
    4.2.2 Solving for subvectors of β
    4.2.3 Solving for β2 when X′1 = 1
    4.2.4 Estimating β
    4.2.5 Unbiasedness of the natural estimator of β
    4.2.6 Consistency of the natural estimator of β
    4.2.7 Limiting distribution of the natural estimator of β
    4.2.8 Limiting distribution of the natural estimator of β if E[u | X] = 0
    4.2.9 Estimating Ω
    4.2.10 Gauss Markov Theorem
    4.2.11 Inference: Testing a single linear restriction
    4.2.12 Inference: Testing multiple linear restrictions (Wald Test)
    4.2.13 Tests of non-linear restrictions
  4.3 Linear Regression when E[Xu] ≠ 0
    4.3.1 Motivating examples for when E[Xu] ≠ 0
    4.3.2 What happens to the OLS estimator if E[Xu] ≠ 0?
    4.3.3 Solving for β: Instrumental Variables
    4.3.4 Estimating β
    4.3.5 Consistency of the TSLS estimator
    4.3.6 Limiting distribution of the TSLS estimator
    4.3.7 Efficiency of the TSLS estimator
    4.3.8 Estimating Ω

3 Conditional expectations 18
3.1 Expected Value Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Expected Value Rule for conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.1 Given an event A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.2 Given that a random variable Y takes on an specific value . . . . . . . . . . . . . . . . 19
3.2.3 As a random variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 The mean of E[X | Y ]: Law of Iterated Expectations (LIE) . . . . . . . . . . . . . . . . . . . 19
3.4 Properties of conditional expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 About Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5.1 Mean independence ⇏ independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5.2 Mean independence ⇒ zero covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5.3 Uncorrelatedness ⇏ mean independence . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4 Linear Regressions 21
4.1 Three interpretations of linear regressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.1 Linear Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.2 ”Best” Linear Approximation to Conditional Expectation / ”Best” Linear Predictor
of Y | X . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.3 Causal model interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Linear Regression when E[Xu] = 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.1 Solving for β . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.2 Solving for subvectors of β . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.3 Solving for β 2 when X′1 = 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2.4 Estimating β . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.5 Unbiasedness of the natural estimator of β . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.6 Consistency of the natural estimator of β . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2.7 Limiting distribution of the natural estimator of β . . . . . . . . . . . . . . . . . . . . 28
4.2.8 Limiting distribution of the natural estimator of β if E[u | X] = 0 . . . . . . . . . . . . 29
4.2.9 Estimating Ω . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2.10 Gauss Markov Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.11 Inference: Testing a single linear restriction . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.12 Inference: Testing multiple linear restrictions (Wald Test) . . . . . . . . . . . . . . . . 33
4.2.13 Tests of non-linear restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Linear Regression when E[Xu] ̸= 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.1 Motivating examples for when E[Xu] ̸= 0 . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.2 What happens to the OLS estimator if E[Xu] ̸= 0 ? . . . . . . . . . . . . . . . . . . . 36
4.3.3 Solving for β: Instrumental Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.4 Estimating β . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3.5 Consistency of the TSLS estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.6 Limiting distribution of the TSLS estimator . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.7 Efficiency of the TSTS estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.8 Estimating Ω . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

1 Introduction
In econometrics, there are three typical problems we wish to address.
Suppose X1 , X2 , . . . , Xn are independently and identically distributed according to the cumulative distribu-
tion function (CDF) P . Then, our goal is to “learn” some “features” of P from the data.
i Estimate θ(P): Construct a function called an estimator, θ̂n = θ̂n(X1, X2, . . . , Xn), that provides a
"best guess" for θ(P). For example, µ(P) = E[Xi], σ²(P) = Var_P[Xi], or the regression coefficients β(P).
ii Test a hypothesis about θ(P ): For example, a test could be of the form θ(P ) = θ0 , and we want to
construct a function, ϕn = ϕn (X1 , X2 , . . . , Xn ) ∈ [0, 1], that determines the probability with which you
reject the hypothesis.
iii Construct a confidence region for θ(P ): Construct a random set, Cn = Cn (X1 , X2 , . . . , Xn ), that contains
θ(P ) with some pre-specified probability.
Studying finite sample properties of the above is typically difficult (unless we make some strong assumptions
about P ). This is why we wish to study large-sample properties.

2 Large sample theory


2.1 Existence of moments
Definition 1.1. We say that the mean of a random vector X, denoted E[X], exists if E[|X|] < ∞. If X is a
random variable on R, then we can split E[X] into positive and negative parts:

E[X] = E[X⁺] − E[X⁻],

where
X⁺ := max{X, 0},  X⁻ := max{−X, 0}.

Then, notice that
E[|X|] = E[X⁺] + E[X⁻].

Thus, the requirement that E[|Xi|] < ∞ is equivalent to requiring that E[Xi⁺] < ∞ and E[Xi⁻] < ∞ (i.e.
they both exist), since E[Xi⁺], E[Xi⁻] ≥ 0 by construction.

2.2 Convergence in Probability


Let {Xn : n ≥ 1} and X be random vectors on R^k. Xn is said to converge in probability to X, denoted
Xn →p X, if, as n → ∞,
P(|Xn − X| > ε) → 0, ∀ε > 0,
where |·| is the usual Euclidean norm⁴. Equivalently,

lim_{n→∞} P(|Xn − X| > ε) = 0 for any ε > 0.

⁴ On the n-dimensional Euclidean space Rⁿ, the intuitive notion of the length of a vector x = (x1, x2, . . . , xn) is captured
by the positive root of |x| := √(x1² + · · · + xn²).

2.3 Markov’s inequality
For any random variable X,
P(|X| > ε) ≤ E[|X|^q] / ε^q,  ∀q, ε > 0,
where |·| is the usual Euclidean norm.

2.4 Weak Law of Large Numbers (WLLN)


Recall X̄n ≡ (1/n) Σ_{i=1}^n Xi. If
1. {Xn : n ≥ 1} is a sequence of iid random variables on R with CDF P
2. E[Xi] ≡ µ(P) < ∞ (i.e., the mean exists)
Then,
(1/n) Σ_{i=1}^n Xi →p E[Xi].
That is,
X̄n →p µ(P).
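
A quick numerical illustration of the WLLN (a minimal Python sketch; the Exponential population with mean 0.5 is an assumed example, not part of the derivation):

import numpy as np

rng = np.random.default_rng(0)
mu = 0.5  # true mean of the assumed Exponential population

# |X̄n - mu| shrinks as n grows, illustrating X̄n ->p mu(P).
for n in (10, 100, 10_000, 1_000_000):
    x = rng.exponential(scale=mu, size=n)   # iid draws from P
    print(n, abs(x.mean() - mu))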

2.5 Consistency of estimators


• Consistency of an estimator means that, as the sample size gets large, the estimate gets closer and closer
  to the true value of the parameter.
• Let X1, X2, . . . , Xn be iid ∼ P on R and suppose that µ(P) exists.
• We wish to construct an estimator of µ(P).
• As we are able to apply the WLLN, a natural estimator that converges to µ(P) is X̄n:
  X̄n →p µ(P)
• That is, X̄n is consistent for µ(P).
• This is a good strategy for constructing consistent estimators: we construct the estimator for the
  unknown parameter by trying to take advantage of the WLLN.
• Recall that consistency requires X̄n to converge to the true/population value of the distribution, µ(P).
  Thus, although X̄n + 1 would converge in probability to a constant, it is not consistent for µ(P).
• Consistency is related, but not equivalent, to unbiasedness. Unbiasedness requires that
  E[X̄n] = µ(P),
  i.e. it is not an asymptotic property. Note that unbiasedness does not imply consistency, nor does
  consistency imply unbiasedness.

2.6 Convergences in marginal probabilities imply convergence in joint probability
Suppose {Xn : n ≥ 1}, {Yn : n ≥ 1}, X and Y are random variables and that Xn →p X and Yn →p Y. Then,
(Xn, Yn) →p (X, Y).
This is a very useful property of convergence in probability, especially when combined with the Continuous
Mapping Theorem.

2.7 Continuous Mapping Theorem (CMT) for convergence in probability
Let {Xn : n ≥ 1} and X be random vectors on R^k. Suppose that g : R^k → R^d is continuous at each point in
the set C ⊆ R^k such that P(X ∈ C) = 1. Then,
Xn →p X ⇒ g(Xn) →p g(X).
• Recall that if {Xn : n ≥ 1}, {Yn : n ≥ 1}, X and Y are random variables and that Xn →p X and
  Yn →p Y, then
  (Xn, Yn) →p (X, Y).
• The Continuous Mapping Theorem then allows us to define a function g(Xn, Yn) so that
  g(Xn, Yn) →p g(X, Y).
• For example, g could be Xn + Yn, Xn·Yn, Xn/Yn (if Y ≠ 0), etc.
• We will use this property many times in proving consistency.

2.8 Convergence in moments

Let {Xn : n ≥ 1} and X be random vectors on R^k. Xn is said to converge in qth moment to X, with
q ≥ 1, denoted Xn →Lq X, if, as n → ∞,
E[|Xn − X|^q] → 0.

2.8.1 Properties
• Convergence in moments implies convergence in probability: Let {Xn : n ≥ 1} and X be random
  vectors on R^k. Then,
  Xn →Lq X ⇒ Xn →p X.
• Convergence in probability does not imply convergence in moments.

2.9 Cauchy-Schwarz Inequality

For any random variables u and v with E[u²] < ∞ and E[v²] < ∞,
E[uv]² ≤ E[u²] E[v²].
Furthermore, the inequality is strict unless there exists α such that P(u = αv) = 1.

2.10 Jensen's Inequality

Let I ⊆ R be a convex set (i.e. an interval on R) and f : I → R a convex function. Then, for any random
variable X such that P(X ∈ I) = 1, if:
1. E[|X|] < ∞
2. E[|f(X)|] < ∞
then
f(E[X]) ≤ E[f(X)].
If f is a concave function (i.e. −f is convex), then the inequality above is reversed.

2.11 Convergence in Distribution
Let {Xn : n ≥ 1} and X be random vectors on R^k. Xn is said to converge in distribution to X, denoted
Xn →d X, if, for all x at which P(X ≤ x) is continuous, as n → ∞,
P(Xn ≤ x) → P(X ≤ x).

2.11.1 Properties
• Convergence in probability implies convergence in distribution.
• Convergence in distribution does not imply convergence in probability (unless the convergence in
  distribution is to a constant).
• If
  1. Xn →d b
  2. b is a constant
  then
  Xn →p b.
• If
  1. Xn →d X
  2. Xn − Yn →p 0 (i.e. Yn is asymptotically equivalent to Xn)
  then
  Yn →d X.
• Marginal convergence in distribution does not imply joint convergence in distribution (unless one of
  the marginal convergences in distribution is to a constant).
• If
  1. Xn →d X
  2. Yn →d b
  3. b is a constant,
  then
  (Xn, Yn) →d (X, b).

2.12 Continuous Mapping Theorem (CMT) for convergence in distribution

Let {Xn : n ≥ 1} and X be random vectors on R^k. Suppose that g : R^k → R^d is continuous at each point in
the set C such that P(X ∈ C) = 1. Then,
Xn →d X ⇒ g(Xn) →d g(X).

2.13 Slutsky's Lemma
Let {Xn : n ≥ 1}, {Yn : n ≥ 1} and X be random vectors, and c ∈ R^k a constant. If
1. Xn →d X
2. Yn →d c
then
Xn′Yn →d X′c
and
Xn + Yn →d X + c.

2.14 Cramér-Wold Device

Let {Xn : n ≥ 1} and X be random vectors. Then,
Xn →d X ⇔ t′Xn →d t′X, ∀t ∈ R^k.

2.15 Central Limit Theorem (CLT)

2.15.1 Univariate CLT:
If
1. X1, X2, . . . , Xn are iid ∼ P on R
2. σ²(P) < ∞
Then,
√n (X̄n − µ(P)) →d N(0, σ²(P)).

2.15.2 Multivariate CLT:

It follows straightforwardly by using the Cramér-Wold Device to generalise the univariate central limit theorem. If
1. X1, X2, . . . , Xn are iid ∼ P on R^k
2. Σ(P) < ∞ (i.e. every entry in the variance-covariance matrix is finite)
Then,
√n (X̄n − µ(P)) →d N(0, Σ(P)).
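
A minimal simulation sketch of the univariate CLT (illustrative only; the Uniform(0, 1) population, so that µ(P) = 1/2 and σ²(P) = 1/12, is an assumed example):

import numpy as np

rng = np.random.default_rng(1)
n, reps = 500, 20_000
mu, sigma2 = 0.5, 1.0 / 12.0          # mean and variance of Uniform(0, 1)

# Compute sqrt(n)*(X̄n - mu) across many replications; its sample variance
# should be close to sigma^2(P), as the CLT predicts.
draws = rng.uniform(size=(reps, n))
z = np.sqrt(n) * (draws.mean(axis=1) - mu)
print("simulated variance:", z.var(), "theoretical sigma^2(P):", sigma2)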

2.16 Delta Method

2.16.1 For one variable:
Let {Xn : n ≥ 1} and X be random variables, c a constant, and τn a sequence of constants such that τn → ∞
and τn(Xn − c) →d X. Let g(x) be a function that is continuous and differentiable at c. Denote by gx(c) the
derivative of g(x) evaluated at c. Then,
τn (g(Xn) − g(c)) →d gx(c) X.
In particular, if X ∼ N(0, σ²(P)), then
τn (g(Xn) − g(c)) →d N(0, gx(c)² σ²(P)).

2.16.2 Multivariate version:
Let {Xn : n ≥ 1} and X be random vectors, c ∈ R^k a constant, and τn a sequence of constants such that
τn → ∞ and τn(Xn − c) →d X. Let g : R^k → R^d be a function that is continuous and differentiable at c.
Denote by Dg(c) the d × k matrix of partial derivatives of g evaluated at c. Then,
τn (g(Xn) − g(c)) →d Dg(c) X.
In particular, if X ∼ N(0, Σ), then
τn (g(Xn) − g(c)) →d N(0, Dg(c) Σ Dg(c)′).

2.17 Example 1: Constructing an estimator of σ²(P)

Let X1, X2, . . . , Xn be iid ∼ P on R. Suppose that E[Xi²] < ∞. We wish to construct an estimator of
σ²(P) := Var(Xi).
• A nice way to come up with an estimator is to try to apply the WLLN. Thus, as a starting point, it is good
  to locate E[Xi] so we can use X̄n.
• Recall
  Var(Xi) = E[Xi²] − (E[Xi])².
• Then, a natural estimator is:
  s²n := (1/n) Σ_{i=1}^n Xi² − ( (1/n) Σ_{i=1}^n Xi )².
• Is it consistent?
• Since E[Xi²] < ∞, then E[Xi] < ∞. Then, since Xi is iid for i = 1, . . . , n and E[Xi] < ∞, as n → ∞,
  by the WLLN:
  X̄n →p E[X].
  Also, since Xi² is iid for i = 1, . . . , n and E[Xi²] < ∞, as n → ∞, by the WLLN:
  (1/n) Σ_{i=1}^n Xi² →p E[Xi²].
• We may write s²n as a function g, parameterised as
  s²n = g( (1/n) Σ_{i=1}^n Xi², X̄n ),
  where
  g(a, b) = a − b².
• Since g(·) is continuous, by the CMT,
  s²n →p E[Xi²] − E[X]² = σ²(P).
• This shows that s²n is a consistent estimator of σ²(P).
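
A quick numerical check of the estimator above (a sketch; the N(1, 4) population is an assumed example):

import numpy as np

rng = np.random.default_rng(2)
true_var = 4.0

for n in (100, 10_000, 1_000_000):
    x = rng.normal(loc=1.0, scale=2.0, size=n)     # iid sample with Var(X) = 4
    s2 = np.mean(x**2) - np.mean(x)**2             # s_n^2 = (1/n)ΣX_i^2 - (X̄_n)^2
    print(n, s2, "->", true_var)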

2.18 Example 2: Applying Delta Method: univariate
Suppose X1, X2, . . . , Xn are iid ∼ P = Bernoulli(q), with q ∈ (0, 1). Then, by the Central Limit Theorem,
√n (X̄n − q) →d N(0, g(q)),
where g(q) = q(1 − q). A natural estimator for g(q) is g(X̄n). Since Dg(q) = 1 − 2q, the Delta Method
implies that
√n (g(X̄n) − g(q)) →d Dg(q) N(0, g(q)) = N(0, (1 − 2q)² q(1 − q)).
When q = 1/2, we have that
√n (g(X̄n) − g(q)) →d N(0, 0) = 0.
That is, √n (g(X̄n) − g(q)) →p 0. We get a degenerate limit distribution because the Delta Method considers
a Taylor expansion only up to the first order.
To get a non-degenerate limiting distribution, we use a second-order Taylor expansion, which gives us:
g(X̄n) − g(q) = Dg(q)(X̄n − q) + (D²g(q)/2)(X̄n − q)² + R(X̄n − q),
where R(0) = 0 and R(h) = o(h²); i.e. as h → 0, R(h)/h² → 0.
Since D²g(q) = −2, when q = 1/2, the above simplifies to
g(X̄n) − g(q) = −(X̄n − q)² + R(X̄n − q).
Multiplying both sides by n, we obtain
n (g(X̄n) − g(q)) = −n (X̄n − q)² + n R(X̄n − q).
Consider the first term on the right-hand side. Rearranging and using the Continuous Mapping Theorem
gives that:
−n (X̄n − q)² = −(√n (X̄n − q))² →d −( N(0, 1/4) )² = −( (1/2) N(0, 1) )² = −(1/4) χ²₁.
We can also write the second term as
n R(X̄n − q) = n (X̄n − q)² · b(X̄n − q),
where b(h) := R(h)/h² for h ≠ 0 and b(0) := 0, so that b is continuous at X̄n = q. Recall that as h → 0,
R(h)/h² → 0, and the fact that X̄n →p q. Thus, we know that
b(X̄n − q) →p 0 ⇒ b(X̄n − q) →d 0.
And, from before, we have already shown that
n (X̄n − q)² →d (1/4) χ²₁.
Therefore, combining the two terms via Slutsky's Lemma, when q = 1/2,
n (g(X̄n) − g(q)) →d −(1/4) χ²₁.
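
A simulation sketch of both cases above (illustrative; q = 0.3 for the regular case and q = 1/2 for the degenerate one are assumed values):

import numpy as np

rng = np.random.default_rng(3)
n, reps = 2_000, 20_000

def g(p):
    return p * (1.0 - p)

for q in (0.3, 0.5):
    xbar = rng.binomial(n, q, size=reps) / n
    # Regular case: sqrt(n)*(g(X̄n) - g(q)) is approximately N(0, (1-2q)^2 q(1-q)).
    print(q, "var of sqrt(n)*(g(xbar)-g(q)):",
          (np.sqrt(n) * (g(xbar) - g(q))).var(),
          "theory:", (1 - 2 * q) ** 2 * q * (1 - q))
    if q == 0.5:
        # Degenerate case: n*(g(X̄n) - g(q)) is approximately -(1/4)*chi2_1, with mean -1/4.
        print("mean of n*(g(xbar)-g(q)):", (n * (g(xbar) - g(q))).mean(), "theory: -0.25")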

2.19 Example 3: Applying Delta Method: multivariate
Let X1, . . . , Xn be an iid sequence of random variables with distribution Exp(λ), λ > 0. (Here, Exp(λ) denotes
the exponential distribution with parameter λ, whose pdf is given by
f(x) = λ exp(−λx) if x ≥ 0, and f(x) = 0 if x < 0.
In particular, the mean and the variance of such a random variable are 1/λ and 1/λ², respectively.) Inde-
pendently of X1, . . . , Xn, let Y1, . . . , Yn be an iid sequence of random variables with distribution Exp(µ),
µ > 0.

Deriving the limiting distribution for log(Ȳn/X̄n):
As always, write out the expression:
log(Ȳn/X̄n) = log(Ȳn) − log(X̄n).
Since we will be splicing things together, we should work in a multivariate setting. Since the Xi's and
Yi's are iid with finite variances, and Xi ⊥ Yi, by the multivariate CLT,
√n ( (X̄n, Ȳn)′ − (1/λ, 1/µ)′ ) →d N( 0, diag(1/λ², 1/µ²) ).
Let g(x, y) = log(y/x), which is continuous and differentiable if x ≠ 0 and y/x > 0. Note that
Dg(x, y) = ( (x/y) ∂(y/x)/∂x , (x/y) ∂(y/x)/∂y ) = ( −1/x , 1/y ).
Then, by the Delta Method, evaluating Dg at (1/λ, 1/µ),
√n ( log(Ȳn/X̄n) − log( (1/µ)/(1/λ) ) ) →d N( 0, (−λ, µ) diag(1/λ², 1/µ²) (−λ, µ)′ ) = N(0, 2).

2.20 Constructing a confidence set


A confidence set/region of level 1 − α for µ(P), denoted Cn := Cn(X1, X2, . . . , Xn), is such that the
probability that the true mean is contained in the set is at least 1 − α, α ∈ (0, 1); i.e.

P(µ(P) ∈ Cn) ≥ 1 − α.

Cn would then be the confidence set of level 1 − α for µ(P). For example, if α = 5%, then Cn is a random
interval that contains the true mean with probability at least 95%.
We can construct a confidence set either relying or not relying on asymptotics.

2.20.1 Not relying on Asymptotics


We can use Markov's inequality to construct confidence intervals since:
• Markov's inequality is:
  P(|X| > ε) ≤ E[|X|^q] / ε^q,  ∀q, ε > 0.
• Using X̄n − E[X] as the argument with q = 2:
  P(|X̄n − µ(P)| > ε) ≤ E[(X̄n − µ(P))²] / ε² = Var[X̄n] / ε².
• Then, by construction,
  P(|X̄n − µ(P)| ≤ ε) ≥ 1 − Var[X̄n] / ε².
• If we could somehow choose ε = ε̄ so that
  1 − Var[X̄n] / ε̄² = 1 − α,
• then
  P(|X̄n − µ(P)| ≤ ε̄) ≥ 1 − α.
• That is,
  P(X̄n − ε̄ ≤ µ(P) ≤ X̄n + ε̄) ≥ 1 − α,
• which is equivalent to:
  P(µ(P) ∈ Cn) ≥ 1 − α
  with
  Cn := [X̄n − ε̄, X̄n + ε̄].
A recipe that almost always works is the following:
1. Write down Markov's inequality for X̄n − E[X] with q = 2.
2. Find an upper bound for the RHS of Markov's inequality such that the unknown parameters disappear.
3. Equate that upper bound to α.
4. Using the expression obtained above, solve for ε as a function of α, say ε = g(α).
5. Then, we can define the confidence set as
   Cn := [X̄n − g(α), X̄n + g(α)].
Equivalently, we can write the confidence set as
   Cn := { x ∈ R : |X̄n − x| ≤ g(α) }.

2.20.2 Relying on Asymptotics


A confidence set/region of level 1 − α for g(P), denoted Cn := Cn(X1, X2, . . . , Xn), is such that the proba-
bility that the true parameter g(P) is contained in the set is at least 1 − α, α ∈ (0, 1); i.e.

P(g(P) ∈ Cn) ≥ 1 − α.

Below, we list the steps to construct a confidence set/region using the Central Limit Theorem, which relies on
asymptotic properties.
1. Construct a consistent estimator for g(P). (Hint: use the WLLN and CMT.)
2. Find the limiting distribution of ĝ(P). (Hint: use the CLT, MCLT, Delta Method.)
3. Derive a standard distribution. (Hint: use Slutsky's Lemma.)
4. Then, if the standard distribution is the standard normal one, we can write the confidence set as
   Cn := { x ∈ R : |√n (ĝ(P) − x) / s_{g(P)}| ≤ z_{1−α/2} }.
   Equivalently, we can write the confidence set as
   cn := z_{1−α/2} s_{g(P)} / √n,
   and
   Cn := [ĝ(P) − cn, ĝ(P) + cn].

2.20.3 Coverage Property


The coverage property to be shown is that P(µ(P) ∈ Cn) → 1 − α as n → ∞.

2.21 Example 4: Constructing confidence sets with Markov’s inequality


Suppose that X1, X2, . . . , Xn are iid copies of X, where X has the CDF P = Bernoulli(r) with r ∈ (0, 1).
Thus, the mean and the variance of the Bernoulli distribution are given by µ(P) = r and σ²(P) = r(1 − r),
respectively.
Let α ∈ (0, 1). We would like to construct a nontrivial (random) set Cn := Cn(X1, X2, . . . , Xn) such that
the probability that Cn contains the true mean is at least 1 − α; i.e.

P(µ(P) ∈ Cn) ≥ 1 − α.
• We will use the recipe we learned in Section 2.20.1.
• 1. Write down Markov's inequality for X̄n − E[X] with q = 2:
  P(|X̄n − µ(P)| > ε) ≤ E[(X̄n − µ(P))²] / ε² = Var[X̄n] / ε²
  = (1/n²) Var[Σ_{i=1}^n Xi] / ε²
  = (1/n²) ( Σ_{i=1}^n Var[X] + 2 Σ_{i<j} Cov[Xi, Xj] ) / ε²
  = (1/n²) n Var(X) / ε²   (the covariance terms are zero under iid sampling)
  = r(1 − r) / (n ε²).
• 2. Find an upper bound for the RHS of Markov's inequality such that the unknown parameters
  disappear:
  Note that r(1 − r) is maximised at 1/4 when r = 1/2. Thus,
  P(|X̄n − µ(P)| > ε) ≤ r(1 − r) / (n ε²) ≤ 1 / (4 n ε²), ∀r.
• 3. Equate that upper bound to α:
  α = 1 / (4 n ε²).
• 4. Solve for ε as a function of α:
  ε = 1 / √(4 α n).
• 5. Then, we can define the confidence set as
  Cn := [ X̄n − 1/√(4 α n), X̄n + 1/√(4 α n) ].
  Equivalently, we can write the confidence set as
  Cn := { x ∈ R : |X̄n − x| ≤ 1/√(4 α n) }.
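
A short sketch of this construction in code (illustrative; the simulated Bernoulli(0.4) sample is an assumption for the example):

import numpy as np

rng = np.random.default_rng(4)
n, alpha = 400, 0.05
x = rng.binomial(1, 0.4, size=n)          # iid Bernoulli sample (r = 0.4 assumed)

xbar = x.mean()
eps = 1.0 / np.sqrt(4 * alpha * n)        # epsilon = 1/sqrt(4*alpha*n) from the recipe
print("Markov-based confidence set:", (xbar - eps, xbar + eps))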

2.22 Example 5: Constructing confidence sets using (M)CLT


We have previously constructed a confidence set for Bernoulli-distributed iid random variables using Markov's
Inequality (see Example 4), which did not rely on asymptotics. Below, we construct a confidence set/region
using the Central Limit Theorem, which relies on asymptotic properties.
Let X1, X2, . . . , Xn be iid ∼ P = Bernoulli(q), where q ∈ (0, 1). Let α be given. We wish to construct a
confidence region for µ(P) = q at level 1 − α.
• Step 1: In this case, g(P) = µ(P) = E(X). Thus, by the WLLN:
  X̄n →p µ(P) = q.
• Step 2: By the CLT:
  √n (X̄n − µ(P)) →d N(0, σ²(P)).
• Step 3: Since σ²(P) = q(1 − q), a natural estimator for σ²(P) is
  s²n = X̄n (1 − X̄n).
  Thus, we can write s²n as a function f(·), parameterised as s²n = f(X̄n), where f(a) = a(1 − a). Since
  f(·) is continuous, by the Continuous Mapping Theorem and the WLLN:
  s²n →p σ²(P).
  Since σ²(P) > 0, we can define a function g(·), parameterised as g(a) = 1/√a. Since g(·) is continuous, by
  the Continuous Mapping Theorem and the WLLN:
  g(s²n) = 1/sn →p g(σ²(P)) = 1/σ(P).
  Then, by Slutsky's Lemma,
  √n (X̄n − µ(P)) / √(s²n) →d N(0, 1).
  That is,
  √n (X̄n − µ(P)) / √( X̄n (1 − X̄n) ) →d N(0, 1).
• Step 4: Define
  cn := z_{1−α/2} √( X̄n (1 − X̄n) ) / √n,
  and
  Cn := [X̄n − cn, X̄n + cn].
  We can write the confidence region in the following equivalent way:
  Cn := { x ∈ R : |√n (X̄n − x)| / √( X̄n (1 − X̄n) ) ≤ z_{1−α/2} }.
Stating and verifying the coverage property that your confidence region satisfies:
• The coverage property is that P(µ(P) ∈ Cn) → 1 − α.
• To verify it,
  P(µ(P) ∈ Cn) = P(X̄n − cn ≤ µ(P) ≤ X̄n + cn)
  = P(|X̄n − µ(P)| ≤ cn)
  = P( |X̄n − µ(P)| ≤ z_{1−α/2} √( X̄n (1 − X̄n) ) / √n )
  = P( |√n (X̄n − µ(P))| / √( X̄n (1 − X̄n) ) ≤ z_{1−α/2} )
  → P(|Z| ≤ z_{1−α/2})
  = P(z_{α/2} ≤ Z ≤ z_{1−α/2})
  = Φ(z_{1−α/2}) − Φ(z_{α/2})
  = (1 − α/2) − α/2 = 1 − α.

2.23 Hypothesis testing


We wish to apply the concept we developed above to test hypotheses about µ(P ). For example, we wish to
test the null hypothesis,
H0 : µ(P ) ≤ 0,
against the alternative hypothesis,
H1 : µ(P ) > 0.
There are two types of errors:
1. Type I error: rejecting H0 when it is true (i.e. a false rejection/false positive).
2. Type II error: not rejecting H0 when it is false (i.e. a false negative).
Generally, it is not possible to minimise both the type I and type II error probabilities simultaneously (the
type II error probability, moreover, depends on which distribution satisfying H1 is considered).

2.23.1 Consistency in level
• We restrict attention to tests of the form:
  ϕn = 1{Tn > cn},
  where Tn is the test statistic, a function of the data such that "large" values provide evidence
  against H0; and cn is the critical value, which gives the definition of "large".
• E[ϕn] is the probability of rejecting H0, called the power function of the test.
• For µ(P) satisfying H0, E[ϕn] is the probability of a type I error.
• For µ(P) satisfying H1, 1 − E[ϕn] is the probability of a type II error.
We say that a test ϕn = ϕn(X1, X2, . . . , Xn) ∈ [0, 1] is consistent in level if
lim sup_{n→∞} E[ϕn] ≤ α,
for µ(P) satisfying H0, where α ∈ (0, 1) is the significance level of the test.
• Therefore, the requirement of consistency in level is given by
  lim sup_{n→∞} E[ϕn] = lim sup_{n→∞} E[1{Tn > cn}] = lim sup_{n→∞} P(Tn > cn) ≤ α
  for µ(P) satisfying H0.

2.23.2 p-value of a test


The p-value of a test is the smallest value of α for which we reject the null hypothesis

p̂n := inf {α ∈ (0, 1) : Tn > cn } ,

where cn depends on α.

2.23.3 Steps
A recipe that almost always works is the following:
1. Express the parameter in H0 as a function of expectations.
2. Find a consistent estimator. (Hint: CMT, WLLN.)
3. Find the limiting distribution of the estimator. (Hint: CLT, MCLT, Delta Method.)
4. Standardize the distribution under the null. (Hint: Slutsky's Lemma.)
5. Construct the test ϕn according to H1, i.e. taking into account whether the alternative is "greater
   than", "less than", or "not equal to".

2.24 Example 6: Test statistic for the sample mean


Suppose that X1, X2, . . . , Xn are iid ∼ P on R and that σ²(P) < ∞. Construct an estimator for the mean.
Consider the following null and alternative hypotheses,

H0 : µ(P) = 0
H1 : µ(P) ≠ 0.

Construct a test statistic to test the above.
• First, we need to come up with an estimator for what is being tested under the null.
• As in Example 1, we try to use the WLLN to construct a consistent estimator. In this case it is
  straightforward since we are testing the mean.
• Then, by the WLLN the consistent estimator is X̄n.
• Now, we need to find a limiting distribution for the estimator and try to come up with a standard
  distribution. For this, we try to use the CLT. In this case, it is straightforward since it is only the
  estimator of the mean.
• By the CLT,
  √n (X̄n − µ(P)) →d N(0, σ²(P)).
• We are close to getting a standard normal distribution, but we need to take out σ²(P). For this, we will
  use Slutsky's Lemma.
• We know (from Example 1) that s²n converges in probability to σ²(P), which implies that s²n →d σ²(P).
  In particular, we use the fact that f(x) = 1/√x is a continuous function when x > 0. Then, by the CMT,
  1/sn →p 1/σ(P), which implies convergence in distribution. Thus, we have that
  1. √n (X̄n − µ(P)) →d N(0, σ²(P))
  2. 1/sn →d 1/σ(P)
  Therefore, by Slutsky's Lemma:
  (1/sn) √n (X̄n − µ(P)) →d (1/σ(P)) N(0, σ²(P)).
  That is,
  √n (X̄n − µ(P)) / sn →d N(0, 1).
• Now, under the null (µ(P) = 0):
  √n X̄n / sn →d N(0, 1).
• To conduct the test at level α, the critical value is based on the standard normal distribution. But
  since the test is two-sided, we would reject whenever we see that Tn is either too low or too high. So,
  with significance level α,
  cn := Φ⁻¹(1 − α/2) =: z_{1−α/2},
  where Φ is the CDF of N(0, 1). Note that Φ(x) = P(Z ≤ x) and we want to choose cn such that the
  probability of Z being less than cn is 1 − α/2 (and the probability of Z being greater than cn is α/2);
  i.e. cn is the (1 − α/2)-th quantile of the standard normal distribution.
• So a natural candidate for a test statistic is
  Tn := √n X̄n / sn,
  and the test rejects when |Tn| > z_{1−α/2}.
• Showing it is consistent in level: That is, we wish to show that
  lim sup_{n→∞} E[ϕn] ≤ α
  under the null. Notice that
  E[ϕn] = P( |√n X̄n / sn| > z_{1−α/2} ).
  We also have that
  √n X̄n / sn →d N(0, 1).
  Noting that |·| is a continuous operation, by the CMT,
  |√n X̄n / sn| →d |N(0, 1)|.
  Then, by the definition of convergence in distribution, as n → ∞:
  P( |√n X̄n / sn| ≥ z_{1−α/2} ) → P(|Z| ≥ z_{1−α/2}).
  Thus, by taking the limsup of both sides, we obtain
  lim sup_{n→∞} E_P[ϕn] ≤ P(|Z| ≥ z_{1−α/2})
  = P(Z < z_{α/2}) + P(Z > z_{1−α/2})
  = α/2 + α/2 = α.
  That is, the test is consistent in level α.
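
A compact sketch of this test in code (illustrative; the N(0.1, 1) sample is an assumption chosen so that the null is actually false):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n, alpha = 200, 0.05
x = rng.normal(loc=0.1, scale=1.0, size=n)        # iid sample; true mean 0.1

xbar, sn = x.mean(), x.std(ddof=0)                # X̄n and s_n (the analog estimator)
Tn = np.sqrt(n) * xbar / sn                       # T_n = sqrt(n) X̄n / s_n
reject = abs(Tn) > norm.ppf(1 - alpha / 2)        # reject when |T_n| > z_{1-alpha/2}
pval = 2 * (1 - norm.cdf(abs(Tn)))                # two-sided p-value
print(Tn, reject, pval)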

2.25 Testing multidimensional hypothesis


Let X1, X2, . . . , Xn be iid ∼ P on R^k with Σ(P) < ∞ being the k × k variance-covariance matrix. Note that
g(P) is a k × 1 vector in this context. The test is

H0 : g(P) = 0,
H1 : g(P) ≠ 0.

The steps described in Section 2.23.3 still apply, but we have to take into consideration the fact that we
are now dealing with a variance-covariance matrix instead of a single parameter, which has consequences
when standardizing the limiting distribution of ĝ(P).

2.26 Example 7: Testing multidimensional hypothesis


Let X1, X2, . . . , Xn be iid ∼ P on R^k with Σ(P) < ∞ being the k × k variance-covariance matrix. Note that
µ(P) is a k × 1 vector in this context. The test is

H0 : µ(P) = 0,
H1 : µ(P) ≠ 0.

From the CLT, since the Xi's are iid,
√n (X̄n − µ(P)) →d z ∼ N(0, Σ(P)).
A useful fact to remember is that if Σ(P) is invertible, and z is a k × 1 vector, then
z ∼ N(0, Σ(P)) ⇒ z′ Σ⁻¹(P) z ∼ χ²_k.
In this case, we therefore have that
n (X̄n − µ(P))′ Σ⁻¹(P) (X̄n − µ(P)) →d χ²_k.
Although it appears that we may use this to construct a test statistic, notice that we do not in fact know
Σ(P). However, we can estimate Σ(P). Recall that:
Σ(P) = E[XX′] − E[X]E[X′].
Thus, a natural estimator is:
Σ̂n = (1/n) Σ_{i=1}^n Xi Xi′ − ( (1/n) Σ_{i=1}^n Xi ) ( (1/n) Σ_{i=1}^n Xi′ ).
By the WLLN,
(1/n) Σ_{i=1}^n Xi Xi′ →p E[XX′],
X̄n →p E[X],
(1/n) Σ_{i=1}^n Xi′ →p E[X′].
We can express Σ̂n as a continuous function g(a, b, c) = a − bc of these averages. Since we are given that
Σ(P) < ∞, which also implies that E[|X|] < ∞, by the CMT:
Σ̂n →p Σ(P).
Then, since we have assumed that Σ(P) is invertible, by the Continuous Mapping Theorem, we have that
Σ̂n⁻¹ →p Σ⁻¹(P).
Thus, by Slutsky's Lemma:
n (X̄n − µ(P))′ Σ̂n⁻¹ (X̄n − µ(P)) →d χ²_k.
Under the null:
Tn := n X̄n′ Σ̂n⁻¹ X̄n →d χ²_k.
Then, the test is given by:
ϕn = 1{Tn > cn}
with
Tn := n X̄n′ Σ̂n⁻¹ X̄n,
cn := c_{k,1−α},
where c_{k,1−α} is the (1 − α)-th quantile of χ²_k. Notice that we are not using |·| because we are dealing with a
χ²_k distribution and not with a standard normal one.
To show that the test is consistent in level,
lim sup_{n→∞} E_P[ϕn] = lim sup_{n→∞} P( n X̄n′ Σ̂n⁻¹ X̄n > c_{k,1−α} )
≤ P(T ≥ c_{k,1−α}) = 1 − (1 − α) = α,
where T ∼ χ²_k.
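
A sketch of the multivariate test in code (illustrative; a bivariate normal sample with true mean zero is assumed, so H0 holds):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n, k, alpha = 500, 2, 0.05
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.3], [0.3, 2.0]], size=n)

xbar = x.mean(axis=0)
Sigma_hat = (x.T @ x) / n - np.outer(xbar, xbar)       # Σ̂_n = (1/n)ΣX_iX_i' - X̄_n X̄_n'
Tn = n * xbar @ np.linalg.inv(Sigma_hat) @ xbar        # T_n = n X̄_n' Σ̂_n^{-1} X̄_n
print(Tn, "reject H0:", Tn > chi2.ppf(1 - alpha, df=k))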

3 Conditional expectations
Let (Y, X) be random variables such that Y ∈ R, X ∈ R.

3.1 Expected Value Rule
E[X] = Σ_x x pX(x)

3.2 Expected Value Rule for conditional expectation


3.2.1 Given an event A
E[X | A] = Σ_x x pX|A(x)

3.2.2 Given that a random variable Y takes on a specific value

E[X | Y = y] = Σ_x x pX|Y(x | y)

3.2.3 As a random variable


Notice that from the above we have:
E[X | Y = y] = g(y).
This is a fixed value. However, if we allow y to range over the possible values that Y can take, then the
conditional expectation of the random variable X given the random variable Y is not a fixed number but a
random variable:
E[X | Y] = g(Y).
Notice that since g(Y) is a random variable, it has its own distribution, mean, variance, etc.

3.3 The mean of E[X | Y ]: Law of Iterated Expectations (LIE)


E[E[X | Y]] = E[g(Y)]
E[E[X | Y]] = Σ_y g(y) P_Y(y)
E[E[X | Y]] = Σ_y E[X | Y = y] P_Y(y)
By the Total Expectation Theorem:
E[X] = Σ_y E[X | Y = y] P_Y(y)

Then, the LIE states that:

E[E[X | Y]] = E[X]

3.4 Properties of conditional expectations


i If Y = f (X), then E[Y | X] = f (X).
ii E[Y + X | Z] = E[Y | Z] + E[X | Z].
iii E[f (X)Y | X] = f (X)E[Y | X].
iv If P(Y ≥ 0) = 1, then P(E[Y | X] ≥ 0) = 1.
v Independence: If X ⊥ Y , then E[Y | X] = E[Y ]
vi Mean independence: If E[Y | X] = c (i.e. Y is mean independent of X), then, by the Law of Iterated
Expectations,
E[E[Y | X]] = E[Y] = E[c] = c.

3.5 About Independence
Note that independence implies mean independence, which in turn, implies uncorrelatedness. The converse
statements do not hold.

3.5.1 Mean independence ⇏ independence


Suppose
Y | X ∼ N(0, σ²X).
Then E[Y | X] = 0, so Y is mean independent of X. However, Y and X are clearly not independent.

3.5.2 Mean independence ⇒ zero covariance


Suppose E[Y | X] = c. Then

Cov[X, Y] = E[XY] − E[X]E[Y]
(LIE) = E[E[XY | X]] − E[X]E[E[Y | X]]
= E[X E[Y | X]] − E[X]E[E[Y | X]]
= E[Xc] − E[X]E[c]
= cE[X] − cE[X] = 0.

3.5.3 Uncorrelatedness ⇏ mean independence


Suppose X ∼ N(0, 1) and Y = X². Then,

Cov[X, Y] = E[X³] − E[X]E[X²] = 0 − 0 = 0.

However, E[Y | X] = E[X² | X] = X².
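
A quick simulation of this example (a sketch; sample-based, so the covariance is only approximately zero):

import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=1_000_000)
y = x ** 2

# Cov[X, Y] is approximately 0, yet E[Y | X] = X^2 clearly varies with X.
print("sample Cov[X, Y]:", np.cov(x, y)[0, 1])
print("E[Y | X > 1] vs E[Y | X <= 1]:", y[x > 1].mean(), y[x <= 1].mean())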

4 Linear Regressions
Let (Y, X, u) be random variables/vectors where Y, u ∈ R and X ∈ R^{k+1}, such that
X = (1, X1, . . . , Xk)′,
β = (β0, β1, . . . , βk)′,
and
Y = X′β + u.

4.1 Three interpretations of linear regressions


4.1.1 Linear Conditional Expectation
• E[Y | X] = E[X′ β | X] + E[u | X].
• Then,
E[Y | X] = X′ β + E[u | X]

• Assume E[u | X] = 0
• Then,
E[Y | X] = X′ β

• The parameter β has no causal interpretation; it is simply a convenient way of summarising a feature
  of the joint distribution of Y and X.

4.1.2 "Best" Linear Approximation to Conditional Expectation / "Best" Linear Predictor of Y | X
Define u := Y − X′β. For the "Best" Linear Approximation to the Conditional Expectation interpretation,
consider
min_{b ∈ R^{k+1}} E[ (E[Y | X] − X′b)² ].
For the "Best" Linear Predictor of Y | X interpretation, consider
min_{b ∈ R^{k+1}} E[ (Y − X′b)² ].
Notice that any solution to the first minimization problem is a solution to the second one (and vice versa).
Defining v := E[Y | X] − Y,
E[ (E[Y | X] − X′b)² ] = E[ (v + Y − X′b)² ]
= E[v²] + 2E[v(Y − X′b)] + E[ (Y − X′b)² ]
= E[v²] + 2E[vY] − 2E[vX′b] + E[ (Y − X′b)² ].
Notice that
E[vX′b] = E[(E[Y | X] − Y)X′b]
= E[E[Y | X]X′b] − E[YX′b]
(LIE) = E[YX′b] − E[YX′b]
= 0.
Then,
E[ (E[Y | X] − X′b)² ] = E[v²] + 2E[vY] + E[ (Y − X′b)² ].
Notice that the first two terms of the above equation are constants with respect to b. Hence, minimising the
first problem with respect to b must give the same solution as minimising the second one with respect to b.
Let β be a solution to either of the minimization problems. Define u := Y − X′β; then we can interpret
Y = X′β + u as either the "Best" Linear Approximation to the Conditional Expectation or the "Best" Linear
Predictor of Y | X, which we denote as BLP(Y | X).
As before, no causal interpretation can be drawn from β. To derive properties of u, consider the first-order
condition of the minimization problem:
0 = 2E[X(Y − X′β)] ⇒ E[Xu] = 0.
4.1.3 Causal model interpretation


Assume that Y = g(X, u), where X represents observed determinants of Y , and u represents the unobserved
determinants of Y .
In other words, the assumption says that if we are given X and u, then we can compute Y .
Then, the effect of a unit change in Xi on Y , holding X−i and u constant is ∂g/∂Xi . Thus, if we assume
that g(X, u) = X′ β + u, then
∂g(X, u)
= βi
∂Xi
However, nothing has been said about the relationship between the observed and unobserved determinants
of Y, e.g. about E[Xi u] or E[u | X].

4.2 Linear Regression when E[Xu] = 0


Define (Y, X, u) and β as above and assume that:
• E[Y²] < ∞
• E[XX′] exists
• There is no perfect collinearity in X, so that E[XX′]⁻¹ exists. X is said to be perfectly collinear if there is
  c ≠ 0, c ∈ R^{k+1}, such that P(c′X = 0) = 1.
• E[Xu] = 0
4.2.1 Solving for β
Using the fact that E[Xu] = 0:
0 = E[Xu]
= E[X(Y − X′β)]
= E[XY] − E[XX′]β
⇒ β = E[XX′]⁻¹ E[XY].
What happens if E[XX′] is not invertible? This means that there exist multiple solutions for β.
Although having multiple solutions is not a problem under the first and second interpretations, it does
matter for the last interpretation; i.e. multiple solutions are a problem in cases where we are interested in
causal relationships.

4.2.2 Solving for subvectors of β


Suppose we partition X into subvectors (X1)_{k1×1} and (X2)_{k2×1}, and β accordingly into (β1)_{k1×1} and
(β2)_{k2×1}. Then,
Y = X′β + u = X1′β1 + X2′β2 + u.
We also have that, by assumption,
E[Xu] = 0 ⇒ E[X1u] = 0_{k1×1} and E[X2u] = 0_{k2×1}.

One way to solve this is:
β = E[XX′]⁻¹ E[XY],
i.e.
(β1 ; β2) = ( E[X1X1′]  E[X1X2′] ; E[X2X1′]  E[X2X2′] )⁻¹ ( E[X1Y] ; E[X2Y] ).

Is there a way to find β1 without finding β2?

Frisch-Waugh Theorem for the population:
First, define
Ỹ := Y − BLP(Y | X2) = Y − X2′γ,
X̃1 := X1 − BLP(X1 | X2),
where BLP(X1 | X2) collects the best linear predictors of the components of X1 given X2.
Then β1 = β̃ can be found from
Ỹ = X̃1′β̃ + ũ,
where, since E[X̃1ũ] = 0:
β̃ = E[X̃1X̃1′]⁻¹ E[X̃1Ỹ] = Var[X̃1]⁻¹ Cov[X̃1, Ỹ].

Proof:
First note that since E[XX′] is invertible by assumption, and X̃1 is a linear combination of X, E[X̃1X̃1′] is
invertible.
β̃ = E[X̃1X̃1′]⁻¹ E[X̃1Ỹ]
= E[X̃1X̃1′]⁻¹ E[X̃1(Y − BLP(Y | X2))]
= E[X̃1X̃1′]⁻¹ ( E[X̃1Y] − E[X̃1 BLP(Y | X2)] ).
Since X̃1 is the residual from regressing X1 on X2, X̃1 is orthogonal to X2:
E[X̃1 BLP(Y | X2)] = E[X̃1(X2′γ)] = E[X̃1X2′]γ = 0_{k1×1}.
Hence,
β̃ = E[X̃1X̃1′]⁻¹ E[X̃1Y]
= E[X̃1X̃1′]⁻¹ E[X̃1(X1′β1 + X2′β2 + u)]
= E[X̃1X̃1′]⁻¹ ( E[X̃1X1′]β1 + E[X̃1X2′]β2 + E[X̃1u] ).
Notice that E[X̃1X2′]β2 = 0 since E[X̃1X2′] = 0_{k1×k2}.
Now, let us analyze E[X̃1u]:
E[X̃1u] = E[(X1 − BLP(X1 | X2))u] = E[X1u] − E[BLP(X1 | X2)u] = 0,
since BLP(X1 | X2) is a linear combination of X2 and E[X2u] = 0.
Thus,
β̃ = E[X̃1X̃1′]⁻¹ E[X̃1X1′]β1
= E[X̃1X̃1′]⁻¹ E[X̃1(X̃1 + BLP(X1 | X2))′]β1
= β1 + E[X̃1X̃1′]⁻¹ E[X̃1 BLP(X1 | X2)′]β1.
Notice that every entry of the matrix E[X̃1 BLP(X1 | X2)′] is the expectation of a component X̃1i times a
linear combination of the components of X2, and E[X̃1i X2j] = 0 for all i = 1, 2, . . . , k1 and j = 1, 2, . . . , k2.
Hence E[X̃1 BLP(X1 | X2)′] = 0, and therefore

β̃ = β1.

Notice that we showed the following in the proof above:

• E[X̃1u] = 0_{k1×1}: this comes from the assumption that E[Xu] = 0, which holds by construction of the BLP.
• E[X̃1X2′] = 0_{k1×k2}: the "residual" from regressing X1 on X2 is orthogonal to X2.
• E[X̃1 BLP(Y | X2)] = 0_{k1×1}: the "residual" from "regressing" X1 on X2 is orthogonal to BLP(Y | X2),
  i.e. to the fitted value from "regressing" Y on X2.
• E[X̃1 BLP(X1 | X2)′] = 0_{k1×k1}: the "residual" from "regressing" X1 on X2 is orthogonal to BLP(X1 | X2),
  i.e. to the fitted value from "regressing" X1 on X2.
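
A sample-analog check of the Frisch-Waugh result (a numerical sketch; the simulated design and coefficient values are assumptions):

import numpy as np

rng = np.random.default_rng(8)
n = 100_000
x2 = rng.normal(size=(n, 2))                         # X2 block
x1 = 0.5 * x2[:, [0]] + rng.normal(size=(n, 1))      # X1, correlated with X2
u = rng.normal(size=n)
y = x1[:, 0] * 1.5 + x2 @ np.array([2.0, -1.0]) + u  # Y = 1.5*X1 + X2'(2, -1) + u

X = np.column_stack([x1, x2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]     # coefficient on X1 from the joint regression

# FWL: residualize Y and X1 on X2, then regress the residuals on each other.
py = x2 @ np.linalg.lstsq(x2, y, rcond=None)[0]
px1 = x2 @ np.linalg.lstsq(x2, x1, rcond=None)[0]
beta_fwl = np.linalg.lstsq(x1 - px1, y - py, rcond=None)[0]
print(beta_full[0], beta_fwl[0])                     # the two estimates of beta_1 coincide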

4.2.3 Solving for β 2 when X′1 = 1


From:
Y = X1′β1 + X2′β2 + u,
if X1 = 1 (the constant regressor), then we have the following linear regression:
Y = β1 + X2′β2 + u,
where β1 is the intercept. From the FWT we know that in general β2 = β̃ can be found from:
Ỹ = X̃2′β̃ + ũ,
where
Ỹ := Y − BLP(Y | X1) = Y − X1′γ,
X̃2 := X2 − BLP(X2 | X1) = X2 − X1′δ.
For this particular case, BLP(Y | X1) solves min_{β0} E[(Y − β0)²], which gives the first-order condition
E[Y − β0] = 0 ⇒ β0 = E[Y]. Similarly for BLP(X2 | X1). Substituting into the expressions above, we have:
Ỹ = Y − E[Y],
X̃2 = X2 − E[X2].
Thus,
β̃ = E[(X2 − E[X2])(X2 − E[X2])′]⁻¹ E[(X2 − E[X2])(Y − E[Y])]
= Var[X2]⁻¹ Cov[X2, Y].
As β2 = β̃, we have that, when there is an intercept, the vector β2 can be found via the FWT as:
β2 = Var[X2]⁻¹ Cov[X2, Y].
4.2.4 Estimating β
Let (Y, X, u) be random variables/vectors where Y, u ∈ R and X ∈ R^{k+1}, where
X = (X0, X1, . . . , Xk)′, X0 = 1,
β = (β0, β1, . . . , βk)′,
is a parameter such that
Y = X′β + u.
Assume that:
1. E[Xu] = 0
2. E[XX′] < ∞
3. there is no perfect collinearity in X
4. (Y, X) ∼ P and we have an iid sample (Y¹, X¹), (Y², X²), . . . , (Yⁿ, Xⁿ) ∼ P, where superscripts denote
   the sample index.
Recall that
β = E[XX′]⁻¹ E[XY].
Then, a natural estimator for β is
β̂n = ( (1/n) Σ_{i=1}^n XⁱXⁱ′ )⁻¹ ( (1/n) Σ_{i=1}^n XⁱYⁱ ).
This (analog) estimator is called the ordinary least squares (OLS) estimator. The name comes from the
fact that β̂n solves
β̂n = argmin_{b ∈ R^{k+1}} (1/n) Σ_{i=1}^n (Yⁱ − Xⁱ′b)².
To see this, notice that the first-order condition is given by
(1/n) Σ_{i=1}^n Xⁱ(Yⁱ − Xⁱ′β̂n) = 0_{(k+1)×1}
⇒ ( (1/n) Σ_{i=1}^n XⁱXⁱ′ ) β̂n = (1/n) Σ_{i=1}^n XⁱYⁱ.
We define the fitted values to be
Ŷⁱ = Xⁱ′β̂n
and the fitted residuals to be
ûⁱ := Yⁱ − Ŷⁱ = Yⁱ − Xⁱ′β̂n.
With these definitions, we realise that the first-order condition can be written as
Σ_{i=1}^n Xⁱûⁱ = 0_{(k+1)×1},
i.e. Σ_{i=1}^n Xjⁱ ûⁱ = 0 for each j = 0, 1, . . . , k.
Note that since the first component of Xⁱ equals 1 for all i, the first-order condition implies that
Σ_{i=1}^n ûⁱ = 0
when we have a constant in the regression.
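
A minimal sketch of the analog (OLS) estimator in code (illustrative; the data-generating process is an assumption):

import numpy as np

rng = np.random.default_rng(9)
n, beta = 10_000, np.array([1.0, 2.0, -0.5])

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # X^i = (1, X1^i, X2^i)'
u = rng.normal(size=n)
y = X @ beta + u

# beta_hat = ( (1/n) Σ X^i X^i' )^{-1} ( (1/n) Σ X^i Y^i )
beta_hat = np.linalg.solve((X.T @ X) / n, (X.T @ y) / n)
resid = y - X @ beta_hat                                     # fitted residuals û^i
print(beta_hat, "sum of residuals ~ 0:", resid.sum())        # residuals sum to ~0 (constant included)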

4.2.5 Unbiasedness of the natural estimator of β


E[Xu] = 0 is not enough for β̂n to be unbiased. We need E[u | X] = 0 (i.e. u is mean independent of X).
Recall that
β̂n = ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ XⁱYⁱ ).
Then, substituting Yⁱ = Xⁱ′β + uⁱ,
β̂n = ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ Xⁱ(Xⁱ′β + uⁱ) )
= ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ XⁱXⁱ′ ) β + ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ Xⁱuⁱ )
= β + ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ Xⁱuⁱ ).
Taking the conditional expectation given X¹, X², . . . , Xⁿ:
E[β̂n | X¹, X², . . . , Xⁿ] = β + ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ E[ (1/n) Σᵢ Xⁱuⁱ | X¹, X², . . . , Xⁿ ]
= β + ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ Xⁱ E[uⁱ | X¹, X², . . . , Xⁿ] ).
Since u is mean independent of X (and the sample is iid),
E[β̂n | X¹, X², . . . , Xⁿ] = β.
By the LIE:
E[β̂n] = E[ E[β̂n | X¹, X², . . . , Xⁿ] ] = β.

4.2.6 Consistency of the natural estimator of β
Consistency. Without requiring further assumptions, we can show that the OLS estimator is consistent.
Recall that
β̂n = ( (1/n) Σ_{i=1}^n XⁱXⁱ′ )⁻¹ ( (1/n) Σ_{i=1}^n XⁱYⁱ ).
First, since E[XⁱXⁱ′] = E[XX′] exists and the Xⁱ's are iid,
(1/n) Σ_{i=1}^n XⁱXⁱ′ →p E[XX′].
Since we are given that there is no perfect collinearity in X, E[XX′] is invertible. Then, by the
Continuous Mapping Theorem,
( (1/n) Σ_{i=1}^n XⁱXⁱ′ )⁻¹ →p E[XX′]⁻¹.
Similarly, since E[XⁱYⁱ] = E[XY] = E[X(X′β + u)] = E[XX′]β exists, and the (Xⁱ, Yⁱ)'s are iid,
(1/n) Σ_{i=1}^n XⁱYⁱ →p E[XY].
Since convergence in marginal probabilities implies convergence in joint probability:
( ( (1/n) Σ_{i=1}^n XⁱXⁱ′ )⁻¹ , (1/n) Σ_{i=1}^n XⁱYⁱ ) →p ( E[XX′]⁻¹ , E[XY] ).
Noting that matrix multiplication is a continuous operation, by the Continuous Mapping Theorem,
β̂n = ( (1/n) Σ_{i=1}^n XⁱXⁱ′ )⁻¹ ( (1/n) Σ_{i=1}^n XⁱYⁱ ) →p E[XX′]⁻¹ E[XY] = β.

4.2.7 Limiting distribution of the natural estimator of β


Suppose further that Var[Xu] exists. Then
√n (β̂n − β) →d N(0, Ω),
where
Ω = E[XX′]⁻¹ Var[Xu] E[XX′]⁻¹.
Proof: Substituting Yⁱ = Xⁱ′β + uⁱ into β̂n yields
β̂n = ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ Xⁱ(Xⁱ′β + uⁱ) )
= β + ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ Xⁱuⁱ )
⇒ √n (β̂n − β) = ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ √n ( (1/n) Σᵢ Xⁱuⁱ ).
By assumption, Var[Xu] exists and the Xⁱuⁱ's are iid; then by the Central Limit Theorem:
√n ( (1/n) Σᵢ Xⁱuⁱ ) →d N(0, Var[Xu]),
where we used the fact that E[Xu] = 0 by assumption. We already showed previously that
( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ →p E[XX′]⁻¹.
Hence, by Slutsky's Lemma,
√n (β̂n − β) = ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ √n ( (1/n) Σᵢ Xⁱuⁱ ) →d E[XX′]⁻¹ N(0, Var[Xu]).
Thus,
√n (β̂n − β) →d N(0, E[XX′]⁻¹ Var[Xu] E[XX′]⁻¹).

4.2.8 Limiting distribution of the natural estimator of β if E[u | X] = 0


Suppose, in addition, that E[u | X] = 0 and Var[u | X] = σ² (i.e. u is homoscedastic). Then
√n (β̂n − β) →d N(0, σ² E[XX′]⁻¹).
Proof:
Notice that
Var[Xu] = E[(Xu − E[Xu])(Xu − E[Xu])′]
(since E[Xu] = 0) = E[XX′u²]
(LIE) = E[XX′ E[u² | X]].
Since
Var[u | X] = E[(u − E[u | X])² | X] = E[u² | X] − E[u | X]² = E[u² | X] = σ²,
we can write
Var[Xu] = E[XX′] σ².
Thus,
Ω = E[XX′]⁻¹ Var[Xu] E[XX′]⁻¹ = E[XX′]⁻¹ E[XX′] σ² E[XX′]⁻¹ = σ² E[XX′]⁻¹.

4.2.9 Estimating Ω
Recall:
√n (β̂n − β) →d N(0, Ω),
where
Ω = E[XX′]⁻¹ Var[Xu] E[XX′]⁻¹.
Since we do not observe u, we do not know Ω. Here, we focus on deriving a consistent estimator of Ω.

Under homoscedasticity:
Suppose u is homoscedastic, so that
E[u | X] = 0, Var[u | X] = σ².
Recall that, in this case, Ω simplifies to
Ω = σ² E[XX′]⁻¹.
So, a natural estimator for Ω is:
Ω̃n := ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ σ̂n²,  with σ̂n² := (1/n) Σᵢ (ûⁱ)².
We know from before (see the consistency of OLS) that
( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ →p E[XX′]⁻¹.
Thus, it remains to show that σ̂n² is consistent.
Note that if we could observe uⁱ and define σ̂n² using uⁱ, then it would be easy to show that σ̂n² is consistent
(use the WLLN). But we cannot observe uⁱ.
However, we can still introduce uⁱ to show that σ̂n² is consistent. To do so, write out ûⁱ while adding and
subtracting Xⁱ′β:
ûⁱ = Yⁱ − Xⁱ′β̂n = Yⁱ − Xⁱ′β + Xⁱ′β − Xⁱ′β̂n = uⁱ − Xⁱ′(β̂n − β).
This implies that
(ûⁱ)² = (uⁱ)² − 2uⁱXⁱ′(β̂n − β) + ( Xⁱ′(β̂n − β) )²,
which, in turn, gives that
σ̂n² = (1/n) Σᵢ (uⁱ)² − 2 (1/n) Σᵢ uⁱXⁱ′(β̂n − β) + (1/n) Σᵢ ( Xⁱ′(β̂n − β) )².
Let us consider each term in turn:
• (1/n) Σᵢ (uⁱ)²: Since the uⁱ's are iid, with E[u] = 0 and E[u²] < ∞, the WLLN immediately gives us that
  (1/n) Σᵢ (uⁱ)² →p Var[u].
• (1/n) Σᵢ uⁱXⁱ′(β̂n − β): Notice that (β̂n − β) can be taken outside of the summation since it does
  not depend on i. Then, we use the fact that β̂n →p β and (1/n) Σᵢ uⁱXⁱ′ →p E[uX′] = 0, and the
  Continuous Mapping Theorem (as well as the fact that convergence in marginal probabilities implies
  convergence in joint probability), to conclude that
  ( (1/n) Σᵢ uⁱXⁱ′ ) (β̂n − β) →p 0.
• (1/n) Σᵢ ( Xⁱ′(β̂n − β) )²: Since the term is non-negative, it is sufficient to show that an upper
  bound converges to zero. With a and b being (k + 1) × 1 vectors, the Cauchy-Schwarz inequality tells us that
  ( Σⱼ aⱼbⱼ )² ≤ ( Σⱼ aⱼ² ) ( Σⱼ bⱼ² ).
  So, for each i,
  ( Xⁱ′(β̂n − β) )² = ( Σ_{j=0}^k Xjⁱ(β̂n,j − βj) )² ≤ ( Σ_{j=0}^k (Xjⁱ)² ) ( Σ_{j=0}^k (β̂n,j − βj)² ) = |Xⁱ|² |β̂n − β|²,
  where |·| is the Euclidean norm. Summing across i and dividing by n, we obtain that
  (1/n) Σᵢ ( Xⁱ′(β̂n − β) )² ≤ ( (1/n) Σᵢ |Xⁱ|² ) |β̂n − β|².
  Notice that E[XX′] < ∞ implies that E[Xj²] < ∞ for all j = 0, 1, . . . , k, since the diagonal entries of
  E[XX′] are E[X0²], E[X1²], . . . , E[Xk²].
  Since the Xjⁱ's are iid, by the WLLN, we know that (1/n) Σᵢ (Xjⁱ)² →p E[Xj²]. Then, by the fact that
  convergence in marginal probabilities implies convergence in joint probability, using the Continuous
  Mapping Theorem,
  (1/n) Σᵢ |Xⁱ|² = Σ_{j=0}^k (1/n) Σᵢ (Xjⁱ)² →p Σ_{j=0}^k E[Xj²].
  Therefore, by the CMT and the fact that β̂n →p β,
  ( (1/n) Σᵢ |Xⁱ|² ) |β̂n − β|² →p 0.
Thus, we obtain the result that
σ̂n² →p Var[u] = σ².
Therefore, we have that (since convergence in marginal probabilities implies convergence in joint probability)
( ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ , σ̂n² ) →p ( E[XX′]⁻¹ , σ² ),
and applying the Continuous Mapping Theorem (with g(a, b) = ab, which is continuous) yields that
Ω̃n = ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ σ̂n² →p σ² E[XX′]⁻¹ = Ω.

Under heteroscedasticity:
A natural estimator here is
Ω̂n = ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ ( (1/n) Σᵢ XⁱXⁱ′(ûⁱ)² ) ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹,
where ûⁱ := Yⁱ − Ŷⁱ = Yⁱ − Xⁱ′β̂n.
We already showed that ( (1/n) Σᵢ XⁱXⁱ′ )⁻¹ →p E[XX′]⁻¹, so our focus here is the term in the middle.
Since the Xⁱ's and uⁱ's are both iid and, by assumption, Var[Xu] exists, applying the WLLN (together with
an argument like the one above to replace ûⁱ with uⁱ) gives us that
(1/n) Σᵢ XⁱXⁱ′(ûⁱ)² →p E[XX′u²].
Hence, by the Continuous Mapping Theorem, Ω̂n →p E[XX′]⁻¹ E[XX′u²] E[XX′]⁻¹ = Ω.
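
A sketch of both variance estimators in code (illustrative; the simulated design with heteroscedastic errors is an assumption):

import numpy as np

rng = np.random.default_rng(10)
n, beta = 20_000, np.array([1.0, 2.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = rng.normal(size=n) * (0.5 + np.abs(X[:, 1]))          # heteroscedastic error
y = X @ beta + u

beta_hat = np.linalg.solve((X.T @ X) / n, (X.T @ y) / n)
uhat = y - X @ beta_hat

Q_inv = np.linalg.inv((X.T @ X) / n)                      # ( (1/n) Σ X^i X^i' )^{-1}
omega_homo = Q_inv * np.mean(uhat**2)                     # Ω̃_n = Q^{-1} σ̂_n^2
meat = (X * uhat[:, None]**2).T @ X / n                   # (1/n) Σ X^i X^i' (û^i)^2
omega_robust = Q_inv @ meat @ Q_inv                       # Ω̂_n (sandwich form)
print(np.sqrt(np.diag(omega_homo) / n))                   # homoscedasticity-based std. errors
print(np.sqrt(np.diag(omega_robust) / n))                 # heteroscedasticity-robust std. errors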

4.2.10 Gauss Markov Theorem


Suppose further that E[u | X] = 0 and Var[u | X] = σ² (i.e. u is homoscedastic). Then β̂n is BLUE: the "best"
linear unbiased estimator of β, in the sense of having the "smallest" variance among all linear unbiased estimators.

4.2.11 Inference: Testing a single linear restriction


Suppose we wish to test, at level α,
H0 : r′β = c,
H1 : r′β ≠ c,
where r is a (k + 1) × 1 vector and c is a scalar. For example, if r = (1, 0, . . . , 0)′, then the test is of β0 = c.
Alternatively, if r = (1, −1, 0, 0, . . . , 0)′, then the test is of β0 − β1 = c. Recall that
√n (β̂n − β) →d N(0, Ω).
Since linear operations are continuous, by the Delta Method with g(a) = r′a and Dg(a) = r′,
√n (r′β̂n − r′β) →d N(0, r′Ωr).
Note that, if r ≠ 0, then r′Ωr > 0 (since Ω is positive definite), so that, by the Continuous Mapping Theorem,
(r′Ω̂n r)⁻¹ →p (r′Ωr)⁻¹.
Then, by Slutsky's Lemma,
√n (r′β̂n − r′β) / √(r′Ω̂n r) →d N(0, 1).
Thus, we can consider the following test statistic (under the null)
Tn = √n (r′β̂n − c) / √(r′Ω̂n r)
and the test is
ϕn = 1{|Tn| > z_{1−α/2}}.
By construction, this test is consistent in level α.

4.2.12 Inference: Testing multiple linear restrictions (Wald Test)


Suppose we wish to test, at level α,
H0 : Rβ = c,
H1 : Rβ ≠ c,
where R is p × (k + 1) and c is p × 1; i.e. there are p linear restrictions. For example, this allows us to test
(β0, β2 + β3, β1 − β2)′ = (1, 0, 2)′
by setting
R = [ 1 0 0 0 0 · · · ; 0 0 1 1 0 · · · ; 0 1 −1 0 0 · · · ],  a 3 × (k + 1) matrix.
Just as we required r ≠ 0, here, to rule out redundant restrictions (e.g. 2β1 + 2β2 = 2c and β1 + β2 = c),
we require the rows of R to be linearly independent.
This means that, given that Ω is invertible by assumption, RΩR′ is also invertible. To see why the linear
independence of the rows matters, suppose
Ω = [ a b ; c d ]
and that this is invertible. If R = I2, whose rows are linearly independent, then RΩR′ = Ω, which is invertible.
However, if R = [ 2 2 ; 1 1 ], whose rows are linearly dependent, then
RΩR′ = [ 4(a + b + c + d)  2(a + b + c + d) ; 2(a + b + c + d)  a + b + c + d ],
which is singular.
By the Continuous Mapping Theorem (linear operations are continuous),
√n (Rβ̂n − Rβ) →d N(0, RΩR′).
By the Continuous Mapping Theorem, we also have that
(RΩ̂nR′)⁻¹ →p (RΩR′)⁻¹.
Recall that if z ∼ N(0, V) with V an invertible m × m matrix, then z′V⁻¹z ∼ χ²_m. Using this (together with
Slutsky's Lemma), we obtain that
n (Rβ̂n − Rβ)′ (RΩ̂nR′)⁻¹ (Rβ̂n − Rβ) →d χ²_p.
The test statistic is then given by
Tn = n (Rβ̂n − c)′ (RΩ̂nR′)⁻¹ (Rβ̂n − c)
and
ϕn = 1{Tn > c_{p,1−α}},
where c_{p,1−α} is the (1 − α)-th quantile of the χ²_p distribution. By construction, this is consistent in level α.
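
A sketch of the Wald test in code (illustrative; the simulated design and the two restrictions being tested, which hold under the assumed DGP, are assumptions):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(11)
n, beta = 20_000, np.array([1.0, 2.0, -0.5])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.solve((X.T @ X) / n, (X.T @ y) / n)
uhat = y - X @ beta_hat
Q_inv = np.linalg.inv((X.T @ X) / n)
omega = Q_inv @ ((X * uhat[:, None]**2).T @ X / n) @ Q_inv   # robust Ω̂_n

# H0: R beta = c with two restrictions: beta_1 = 2 and beta_1 + beta_2 = 1.5 (true here).
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
c = np.array([2.0, 1.5])
diff = R @ beta_hat - c
Tn = n * diff @ np.linalg.inv(R @ omega @ R.T) @ diff        # Wald statistic
print(Tn, "reject at 5%:", Tn > chi2.ppf(0.95, df=len(c)))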

4.2.13 Tests of non-linear restrictions


Suppose we wish to test, at level α,
H0 : f(β) = c,
H1 : f(β) ≠ c,
where f : R^{k+1} → R^p is continuously differentiable at β. Denote by Dβf(β) the p × (k + 1) matrix of partial
derivatives of f, evaluated at β, with linearly independent rows. Then, by the Delta Method,
√n (f(β̂n) − f(β)) →d N(0, Dβf(β) Ω Dβf(β)′),
where we note that Dβf(β) Ω Dβf(β)′ is non-singular (for the same reason as before). By the Continuous
Mapping Theorem,
Dβf(β̂n) Ω̂n Dβf(β̂n)′ →p Dβf(β) Ω Dβf(β)′.
Then we have that
n (f(β̂n) − f(β))′ ( Dβf(β̂n) Ω̂n Dβf(β̂n)′ )⁻¹ (f(β̂n) − f(β)) →d χ²_p.
The test statistic is then given by
Tn = n (f(β̂n) − c)′ ( Dβf(β̂n) Ω̂n Dβf(β̂n)′ )⁻¹ (f(β̂n) − c)
and
ϕn = 1{Tn > c_{p,1−α}},
where c_{p,1−α} is the (1 − α)-th quantile of the χ²_p distribution. By construction, this is consistent in level α.

4.3 Linear Regression when E[Xu] ̸= 0
Suppose that we have (Y, X, u) where Y, u ∈ R and X ∈ Rk+1 , with X0 = 1, and β such that

Y = X′ β + u.

• Any Xj with E [Xj u] = 0 is said to be exogenous.


• Any Xj with E [Xj u] ̸= 0 is said to be endogenous.

4.3.1 Motivating examples for when E[Xu] ̸= 0


The examples below show that even if the ”true” model is such that E[Xu] = 0, we can still have endogeneity
problems in the estimated model if:
i the estimated model omits relevant variables
ii there are measurement errors
iii there are simultaneity problems
Omitted Variables:
Suppose k = 2 and that the true model is given by

Y = β0 + β1 X1 + β2 X2 + u

Assume that the model is causal but we still have that E[Xu] = 0. However, suppose we cannot observe X2
so that we estimate the model as
Y = β0∗ + β1∗ X1 + u∗ .
Using the true model, we can rewrite the estimated model as:

Y = (β0 + E [X2 ] β2 ) + β1 X1 + (u + (X2 − E [X2 ]) β2 ).


| {z } |{z} | {z }
=β0∗ =β1∗ =u∗

Then, note that


E [u∗ ] = E[u] + E [X2 − E [X2 ]] β2 = 0
however,
Cov [X1 , u∗ ] = Cov [X1 , u + (X2 − E [X2 ]) β2 ]
= Cov [X1 , X2 ] β2 .
Thus, Cov [X1 , u ] ̸= 0 if Cov [X1 , X2 ] ̸= 0 and/or β2 ̸= 0. Hence, we realise that E [X1 u∗ ] ̸= 0 in general so

that X1 is an endogenous variable in the estimated model.


Measurement error
Suppose we partition X into X0 = 1 and X1 ∈ Rk , and partition β correspondingly. The true model is given
by
Y = β0 + X′1 β 1 + u

We suppose that this is the causal model and that E[Xu] = 0. However, assume that X1 is unobserved and
that we instead observe
X̂1 = X1 + v,
where E[v] = 0, Cov[u, v] = 0 and Cov [X1 , v] = 0. The model we therefore estimate is

Y = β0∗ + X̂′1 β ∗1 + u∗ .

35
Using the true model, we can rewrite the estimated model as

Y = β0 + X̂′1 β 1 + (u − v′ β 1 ) .
| {z }
=u∗

Then, note that


E [u∗ ] = E[u] − E [v′ ] β 1 = 0,
however, h i
Cov X̂1 , u∗ = Cov [X1 + v, u − v′ β 1 ]
= − Var[v]β 1
Thus,
h unless
i Var[v] = 0 (i.e. measurement error is a constant) or β 1 = 0 (i.e. X1 are unrelated to Y ), then
E X̂1 u∗ ̸= 0; i.e. X̂1 is endogenous.

Simultaneity
Let superscript d denote demand-side variables and superscript s denote supply-side variables. Consider the following demand and supply equations:

Q^d = β0^d + β1^d P̃ + u^d,
Q^s = β0^s + β1^s P̃ + u^s,

with E[u^d] = E[u^s] = 0 and E[u^d u^s] = 0. What we observe in the data is the equilibrium outcome determined by the market-clearing condition, Q^d = Q^s; i.e.

β0^d + β1^d P̃ + u^d = β0^s + β1^s P̃ + u^s

⇒ P̃ = (β0^s − β0^d + u^s − u^d) / (β1^d − β1^s).

Thus, since P̃ contains u^s and u^d, E[P̃ u^d] ≠ 0 and E[P̃ u^s] ≠ 0; i.e. P̃ is endogenous in both equations.
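A short simulation (arbitrary illustrative parameters) confirms that the equilibrium price is correlated with the demand shock, so OLS applied to the demand equation does not recover β1^d.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
b0d, b1d = 10.0, -1.0    # demand: downward sloping
b0s, b1s = 2.0, 1.0      # supply: upward sloping

ud = rng.normal(size=n)
us = rng.normal(size=n)

# Equilibrium price solves Qd = Qs
p = (b0s - b0d + us - ud) / (b1d - b1s)
q = b0d + b1d * p + ud

# Price is correlated with the demand shock, so OLS on the demand
# equation does not recover b1d
print(np.cov(p, ud)[0, 1])                       # clearly non-zero (about 0.5 here)
X = np.column_stack([np.ones(n), p])
print(np.linalg.lstsq(X, q, rcond=None)[0][1])   # roughly 0 with these values, not -1
```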

4.3.2 What happens to the OLS estimator if E[Xu] ̸= 0 ?


In general, when E[Xu] ≠ 0, the OLS estimator is both biased and inconsistent. To see the bias, note that when E[Xu] ≠ 0 we cannot have E[u | X] = 0 (since E[u | X] = 0 would imply E[Xu] = 0), so the conditional-mean assumption used to establish unbiasedness fails.
Similarly, when E[Xu] ≠ 0, notice that

E[XX′]^{-1} E[XY] = E[XX′]^{-1} E[X(X′β + u)]
                  = β + E[XX′]^{-1} E[Xu],

where the last term is non-zero.

Hence, although we would still have that β̂_n →p E[XX′]^{-1}E[XY], we no longer have that β̂_n →p β. That is, the OLS estimator is inconsistent.

4.3.3 Solving for β: Instrumental Variables


We maintain the same assumptions as before; however, suppose further that there exist instruments Z ∈ R^{l+1} such that
• (instrument exogeneity). E[Zu] = 0.
• (instrument relevance/rank condition). E[ZX′] has rank k + 1.
 

• (order condition). l + 1 ≥ k + 1 (a necessary condition for rank condition).
• Z includes all exogenous Xj (in particular, Z0 = X0 = 1 ).
• E [ZZ′ ] and E [ZX′ ] exist.
• No perfect collinearity in Z.
Now we solve for β. First, note that

E[Zu] = 0 ⇒ E [Z (Y − X′ β)] = 0
⇒ E[ZY] = E [ZX′ ] β

When β is exactly identified (l + 1 = k + 1):


Suppose first that l + 1 = k + 1. Then E[ZX′] is an invertible square matrix so that, from the above equation, we can immediately obtain

β = E[ZX′]^{-1} E[ZY].

When β is over-identified (l + 1 > k + 1):


Now suppose that l + 1 > k + 1. Then E[ZX′] is not square and hence not invertible. However, the following result will allow us to proceed:
Suppose there is no perfect collinearity in Z. Then

rank(E[ZX′]) = rank(Π),

where Π is such that BLP(X | Z) = Π′Z. In particular, if rank(E[ZX′]) = k + 1, then

Π′E[ZX′] = Π′E[Z(Π′Z)′] = Π′E[ZZ′]Π

is invertible.
Multiplying both sides of E[ZY] = E[ZX′]β by Π′:

Π′E[ZY] = Π′E[ZX′]β.

Since BLP(X | Z) = Π′Z:

Π′E[ZY] = Π′E[ZZ′]Πβ.

As we showed that Π′E[ZZ′]Π is invertible, we can then write the solution for β as

β = (Π′E[ZX′])^{-1} Π′E[ZY],

where

Π = E[ZZ′]^{-1} E[ZX′].

(Rank Inequality). For any conformable matrices A and B,

rank(AB) ≤ min{rank(A), rank(B)}.

Proof. From the fact that BLP(X | Z) = Π′Z,

X = Π′Z + v,

with E[Zv′] = 0 (we use the fact that there is no perfect collinearity in Z for the existence of Π, so that E[ZZ′] is invertible). Hence,

E[ZX′] = E[Z(Π′Z + v)′] = E[ZZ′]Π + E[Zv′] = E[ZZ′]Π.

Using the Rank Inequality,

rank(E[ZX′]) = rank(E[ZZ′]Π) ≤ min{rank(E[ZZ′]), rank(Π)} ≤ rank(Π).

Conversely, since E[ZZ′] is invertible,

rank(Π) = rank(E[ZZ′]^{-1} E[ZZ′]Π) ≤ rank(E[ZZ′]Π) = rank(E[ZX′]),

so that rank(E[ZX′]) = rank(Π).

4.3.4 Estimating β
Suppose we have the following:
• (Y, X, Z, u) such that X ∈ R^{k+1} with X0 = 1 and Z ∈ R^{l+1} with Z0 = 1, and Z includes any exogenous Xj's.
• E[ZZ′] and E[ZX′] exist

• there is no perfect collinearity in Z


• E[Zu] = 0
• rank (E [ZX′ ]) = k + 1
• sample is iid:
(Y^1, X^1, Z^1), (Y^2, X^2, Z^2), . . . , (Y^n, X^n, Z^n) are iid copies of (Y, X, Z).

We consider two cases:


1. Instrumental variables (IV) estimator: in the exactly identified case l + 1 = k + 1
2. Two-stage least squares (TSLS) estimator: in the over-identified case l + 1 > k + 1
IV estimator
Recall that, when l + 1 = k + 1:

β = E[ZX′]^{-1} E[ZY].

So the natural estimator is

β̂_n = [(1/n) Σ_{i=1}^n Z^i X^{i′}]^{-1} [(1/n) Σ_{i=1}^n Z^i Y^i].
Note that this implies:

(1/n) Σ_{i=1}^n Z^i (Y^i − X^{i′} β̂_n) = 0
⇒ (1/n) Σ_{i=1}^n Z^i û^i = 0,

where û^i = Y^i − X^{i′} β̂_n.


The IV estimator can also be written as

β̂_n = (Z′X)^{-1} Z′Y,

where Z = (Z^1, Z^2, . . . , Z^n)′ and X and Y are stacked analogously. In this case, defining Û = (û^1, û^2, . . . , û^n)′ = Y − Xβ̂_n, we realise that

Z′Û = 0.
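
In matrix form the estimator is a one-liner. A minimal sketch (illustrative only; Z, X, Y are assumed to be numpy arrays holding the stacked data, with l + 1 = k + 1):

```python
import numpy as np

def iv_estimator(Z, X, Y):
    """Exactly identified IV estimator: beta_hat = (Z'X)^{-1} Z'Y.

    Z and X are (n, k+1) arrays with the same number of columns, Y is (n,).
    """
    beta_hat = np.linalg.solve(Z.T @ X, Z.T @ Y)
    u_hat = Y - X @ beta_hat       # residuals; Z.T @ u_hat is zero by construction
    return beta_hat, u_hat
```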

TSLS estimator
Recall that, when l + 1 ≥ k + 1,

β = (Π′E[ZX′])^{-1} Π′E[ZY]
  = (Π′E[ZZ′]Π)^{-1} Π′E[ZY],

where BLP(X | Z) = Π′Z, so that

Π = E[ZZ′]^{-1} E[ZX′].

So the natural estimator is

β̂¹_n = [Π̂′_n ((1/n) Σ_{i=1}^n Z^i X^{i′})]^{-1} [Π̂′_n ((1/n) Σ_{i=1}^n Z^i Y^i)]
     = [(1/n) Σ_{i=1}^n Π̂′_n Z^i X^{i′}]^{-1} [(1/n) Σ_{i=1}^n Π̂′_n Z^i Y^i]

or

β̂²_n = [Π̂′_n ((1/n) Σ_{i=1}^n Z^i Z^{i′}) Π̂_n]^{-1} [Π̂′_n ((1/n) Σ_{i=1}^n Z^i Y^i)],

where

Π̂_n = [(1/n) Σ_{i=1}^n Z^i Z^{i′}]^{-1} [(1/n) Σ_{i=1}^n Z^i X^{i′}].
To see that β̂¹_n = β̂²_n, recall that X^i = Π̂′_n Z^i + v̂^i, so that

β̂¹_n = [Π̂′_n ((1/n) Σ_{i=1}^n Z^i (Π̂′_n Z^i + v̂^i)′)]^{-1} [Π̂′_n ((1/n) Σ_{i=1}^n Z^i Y^i)]
     = [Π̂′_n ((1/n) Σ_{i=1}^n Z^i Z^{i′}) Π̂_n]^{-1} [Π̂′_n ((1/n) Σ_{i=1}^n Z^i Y^i)] = β̂²_n,

where we use the fact that

(1/n) Σ_{i=1}^n Z^i v̂^{i′} = 0

by construction.
Notice also that:
• β̂¹_n can be interpreted as the IV estimator in which the instrument Z^i is replaced with Π̂′_n Z^i.
• β̂²_n has the two-stage least squares interpretation, as illustrated in the sketch below: (i) we first regress X^i on Z^i to obtain the fitted values Π̂′_n Z^i; (ii) we then regress Y^i on Π̂′_n Z^i.
• if Π̂_n is invertible (i.e. l + 1 = k + 1), then the IV and TSLS estimators coincide exactly.
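
The two-stage interpretation translates directly into code. A minimal sketch (illustrative; Z, X, Y are the stacked data matrices) that also checks numerically that the two expressions for the estimator coincide:

```python
import numpy as np

def tsls(Z, X, Y):
    """Two-stage least squares: Z is (n, l+1), X is (n, k+1), Y is (n,)."""
    # First stage: regress X on Z to obtain fitted values X_hat = Z @ Pi_hat
    Pi_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)
    X_hat = Z @ Pi_hat
    # "IV with instrument Pi_hat' Z^i": (X_hat' X)^{-1} X_hat' Y
    beta_1 = np.linalg.solve(X_hat.T @ X, X_hat.T @ Y)
    # Second-stage regression of Y on the fitted values: (X_hat' X_hat)^{-1} X_hat' Y
    beta_2 = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ Y)
    # The two coincide because Z'(X - X_hat) = 0 by the first-stage normal equations
    assert np.allclose(beta_1, beta_2)
    return beta_2
```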

4.3.5 Consistency of the TSLS estimator


Recall:

β = (E[Π′ZX′])^{-1} E[Π′ZY]
  = (E[Π′ZZ′Π])^{-1} E[Π′ZY],

and the TSLS estimator:

β̂_n = [(1/n) Σ_{i=1}^n Π̂′_n Z^i X^{i′}]^{-1} [(1/n) Σ_{i=1}^n Π̂′_n Z^i Y^i]
    = [((1/n) Σ_{i=1}^n Π̂′_n Z^i Z^{i′}) Π̂_n]^{-1} [(1/n) Σ_{i=1}^n Π̂′_n Z^i Y^i],

where

Π̂_n = [(1/n) Σ_{i=1}^n Z^i Z^{i′}]^{-1} [(1/n) Σ_{i=1}^n Z^i X^{i′}].

Since the Z^i and X^i are iid and E[ZZ′] and E[ZX′] exist, by the WLLN,

(1/n) Σ_{i=1}^n Z^i X^{i′} →p E[ZX′],
(1/n) Σ_{i=1}^n Z^i Z^{i′} →p E[ZZ′].

By the Continuous Mapping Theorem, since there is no perfect collinearity in Z (i.e. E[ZZ′] is invertible),

[(1/n) Σ_{i=1}^n Z^i Z^{i′}]^{-1} →p E[ZZ′]^{-1}.

By the Continuous Mapping Theorem again,

Π̂_n →p Π.

Since the (Z^i, Y^i) are iid,

(1/n) Σ_{i=1}^n Z^i Y^i →p E[ZY].

By the Continuous Mapping Theorem,

Π̂′_n ((1/n) Σ_{i=1}^n Z^i Y^i) →p Π′E[ZY] = E[Π′ZY].

By the Continuous Mapping Theorem,

Π̂′_n ((1/n) Σ_{i=1}^n Z^i Z^{i′}) Π̂_n →p Π′E[ZZ′]Π.

Since rank(E[ZX′]) = rank(Π) = k + 1, E[Π′ZZ′Π] is invertible, so, by the Continuous Mapping Theorem,

[Π̂′_n ((1/n) Σ_{i=1}^n Z^i Z^{i′}) Π̂_n]^{-1} →p (E[Π′ZZ′Π])^{-1}.

Applying the Continuous Mapping Theorem once more, we obtain that

β̂_n →p β = (E[Π′ZZ′Π])^{-1} E[Π′ZY].

4.3.6 Limiting distribution of the TSLS estimator
Assume further that Var[Zu] exists. Then,

√n (β̂_n − β) →d N(0, Ω),

where

Ω = (E[Π′ZZ′Π])^{-1} Π′Var[Zu]Π (E[Π′ZZ′Π])^{-1}.

To see this, substituting the expression Y^i = X^{i′}β + u^i gives

β̂_n = [(1/n) Σ_{i=1}^n Π̂′_n Z^i Z^{i′} Π̂_n]^{-1} [(1/n) Σ_{i=1}^n Π̂′_n Z^i (X^{i′}β + u^i)]
     = [(1/n) Σ_{i=1}^n Π̂′_n Z^i X^{i′}]^{-1} [(1/n) Σ_{i=1}^n Π̂′_n Z^i X^{i′}] β + [(1/n) Σ_{i=1}^n Π̂′_n Z^i X^{i′}]^{-1} [(1/n) Σ_{i=1}^n Π̂′_n Z^i u^i]
     = β + [(1/n) Σ_{i=1}^n Π̂′_n Z^i X^{i′}]^{-1} [(1/n) Σ_{i=1}^n Π̂′_n Z^i u^i],

where the second equality uses (1/n) Σ_i Π̂′_n Z^i Z^{i′} Π̂_n = (1/n) Σ_i Π̂′_n Z^i X^{i′}, shown above.

Rearranging and multiplying by √n yields

√n (β̂_n − β) = [(1/n) Σ_{i=1}^n Π̂′_n Z^i X^{i′}]^{-1} Π̂′_n √n ((1/n) Σ_{i=1}^n Z^i u^i).

Recall that we already showed that

[(1/n) Σ_{i=1}^n Π̂′_n Z^i X^{i′}]^{-1} Π̂′_n = [Π̂′_n ((1/n) Σ_{i=1}^n Z^i Z^{i′}) Π̂_n]^{-1} Π̂′_n →p (Π′E[ZZ′]Π)^{-1} Π′.

Since Var[Zu] < ∞ and the (Z^i, X^i, Y^i) are iid, by the Central Limit Theorem,

√n ((1/n) Σ_{i=1}^n Z^i u^i) →d N(0, Var[Zu]).

By Slutsky's Lemma,

[Π̂′_n ((1/n) Σ_{i=1}^n Z^i Z^{i′}) Π̂_n]^{-1} Π̂′_n √n ((1/n) Σ_{i=1}^n Z^i u^i) →d (Π′E[ZZ′]Π)^{-1} Π′ N(0, Var[Zu]).

That is,

√n (β̂_n − β) →d N(0, Ω),

where Ω = (E[Π′ZZ′Π])^{-1} Π′Var[Zu]Π (E[Π′ZZ′Π])^{-1}.


4.3.7 Efficiency of the TSLS estimator


Recall that we solve for β using Π′E[Zu] = 0. Alternatively, we could have used any other (l + 1) × (k + 1) matrix Γ such that rank(Γ′E[ZX′]) = k + 1, and used it to solve for β:

β̃_n = [(1/n) Σ_{i=1}^n Γ′ Z^i X^{i′}]^{-1} [(1/n) Σ_{i=1}^n Γ′ Z^i Y^i].

As before, we will have that

√n (β̃_n − β) →d N(0, Ω̃),

where

Ω̃ = (E[Γ′ZX′])^{-1} Γ′Var[Zu]Γ ((E[Γ′ZX′])^{-1})′

(the transpose on the last term is needed because E[Γ′ZX′] might not be symmetric).
However, if E[u | Z] = 0 and Var[u | Z] = σ², then Γ = Π is the "best" choice in the sense that Ω ≤ Ω̃ (in the positive semi-definite sense) for any Γ that satisfies rank(Γ′E[ZX′]) = k + 1.

4.3.8 Estimating Ω
The natural estimator here is

Ω̂_n = A ((1/n) Σ_{i=1}^n Z^i Z^{i′} (û^i)²) A′,

where A := [Π̂′_n ((1/n) Σ_{i=1}^n Z^i Z^{i′}) Π̂_n]^{-1} Π̂′_n. Note that we already showed that A converges in probability (to (Π′E[ZZ′]Π)^{-1}Π′).
The argument showing that (1/n) Σ_{i=1}^n Z^i Z^{i′} (û^i)² →p Var[Zu] is analogous to the one used to show that (1/n) Σ_{i=1}^n X^i X^{i′} (û^i)² →p Var[Xu] in the case of the OLS estimator.
Crucially, û^i is the residual computed with the original regressors and the TSLS estimate, û^i = Y^i − X^{i′}β̂_n; it is not the residual from the second-stage regression of Y^i on the fitted values Π̂′_n Z^i.
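
Putting the pieces together, a minimal sketch of Ω̂_n and the implied standard errors (illustrative only; it follows the formulas above, with the residuals computed from X and the TSLS estimate):

```python
import numpy as np

def tsls_with_se(Z, X, Y):
    """TSLS estimate plus a heteroskedasticity-robust estimate of Omega."""
    n = Z.shape[0]
    Pi_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)                    # first stage
    A = np.linalg.solve(Pi_hat.T @ (Z.T @ Z / n) @ Pi_hat, Pi_hat.T)
    beta_hat = A @ (Z.T @ Y / n)                                  # TSLS estimate
    u_hat = Y - X @ beta_hat                     # residuals use X, not Z @ Pi_hat
    meat = (Z * u_hat[:, None] ** 2).T @ Z / n   # (1/n) sum Z^i Z^i' (u^i)^2
    Omega_hat = A @ meat @ A.T
    se = np.sqrt(np.diag(Omega_hat) / n)         # standard errors of beta_hat
    return beta_hat, Omega_hat, se
```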
