Econometrics
Course: Econometrics
Prof.: Dr. Ricardo Quineche Uribe
T.A.: Rita Huarancca Delgado
Contents
1 Introduction
3 Conditional expectations
3.1 Expected Value Rule
3.2 Expected Value Rule for conditional expectation
3.2.1 Given an event A
3.2.2 Given that a random variable Y takes on a specific value
3.2.3 As a random variable
3.3 The mean of E[X | Y]: Law of Iterated Expectations (LIE)
3.4 Properties of conditional expectations
3.5 About Independence
3.5.1 Mean independence ⇏ independence
3.5.2 Mean independence ⇒ zero covariance
3.5.3 Uncorrelatedness ⇏ mean independence
4 Linear Regressions
4.1 Three interpretations of linear regressions
4.1.1 Linear Conditional Expectation
4.1.2 “Best” Linear Approximation to Conditional Expectation / “Best” Linear Predictor of Y | X
4.1.3 Causal model interpretation
4.2 Linear Regression when E[Xu] = 0
4.2.1 Solving for β
4.2.2 Solving for subvectors of β
4.2.3 Solving for β2 when X1 = 1
4.2.4 Estimating β
4.2.5 Unbiasedness of the natural estimator of β
4.2.6 Consistency of the natural estimator of β
4.2.7 Limiting distribution of the natural estimator of β
4.2.8 Limiting distribution of the natural estimator of β if E[u | X] = 0
4.2.9 Estimating Ω
4.2.10 Gauss-Markov Theorem
4.2.11 Inference: Testing a single linear restriction
4.2.12 Inference: Testing multiple linear restrictions (Wald Test)
4.2.13 Tests of non-linear restrictions
4.3 Linear Regression when E[Xu] ≠ 0
4.3.1 Motivating examples for when E[Xu] ≠ 0
4.3.2 What happens to the OLS estimator if E[Xu] ≠ 0?
4.3.3 Solving for β: Instrumental Variables
4.3.4 Estimating β
4.3.5 Consistency of the TSLS estimator
4.3.6 Limiting distribution of the TSLS estimator
4.3.7 Efficiency of the TSLS estimator
4.3.8 Estimating Ω
1 Introduction
In econometrics, there are three typical problems we wish to address.
Suppose X1, X2, . . . , Xn are independently and identically distributed according to the cumulative distribution function (CDF) P. Then, our goal is to “learn” some “features” θ(P) of P from the data.
i Estimate θ(P): Construct a function called an estimator, θ̂n = θ̂n(X1, X2, . . . , Xn), that provides a “best guess” for θ(P). For example, the mean µ(P) = EP[Xi], the variance σ²(P) = VarP[Xi], or regression coefficients β(P).
ii Test a hypothesis about θ(P ): For example, a test could be of the form θ(P ) = θ0 , and we want to
construct a function, ϕn = ϕn (X1 , X2 , . . . , Xn ) ∈ [0, 1], that determines the probability with which you
reject the hypothesis.
iii Construct a confidence region for θ(P ): Construct a random set, Cn = Cn (X1 , X2 , . . . , Xn ), that contains
θ(P ) with some pre-specified probability.
Studying finite sample properties of the above is typically difficult (unless we make some strong assumptions
about P ). This is why we wish to study large-sample properties.
E[X] = E[X+] − E[X−],
where
X+ := max{X, 0}, X− := max{−X, 0}.
Then, notice that
E[|X|] = E[X+] + E[X−].
Thus, the requirement that E[|Xi|] < ∞ is equivalent to requiring that E[Xi+] < ∞ and E[Xi−] < ∞ (i.e. they both exist), since E[Xi+], E[Xi−] ≥ 0 by construction.
2.3 Markov’s inequality
For any random variable X,
P(|X| > ε) ≤ E[|X|^q] / ε^q,  for all q, ε > 0,
where | · | is the usual Euclidean norm.
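As a quick illustration (my own sketch, not part of the notes), the following Python snippet checks Markov’s inequality by simulation; the exponential distribution, q and ε below are arbitrary choices made only for the example.

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # here |X| = X since X >= 0

q, eps = 2.0, 1.5
lhs = np.mean(np.abs(x) > eps)                   # simulated P(|X| > eps)
rhs = np.mean(np.abs(x) ** q) / eps ** q         # E[|X|^q] / eps^q
print(f"P(|X| > {eps}) is approx {lhs:.4f} <= bound {rhs:.4f}")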
2.7 Continuous Mapping Theorem (CMT) for convergence in probability
Let {Xn : n ≥ 1} and X be random vectors on R^k. Suppose that g : R^k → R^d is continuous at each point in the set C ⊆ R^k such that P(X ∈ C) = 1. Then,
Xn →p X ⇒ g(Xn) →p g(X).
• Recall that if {Xn : n ≥ 1}, {Yn : n ≥ 1}, X and Y are random variables such that Xn →p X and Yn →p Y, then
(Xn, Yn) →p (X, Y).
• The Continuous Mapping Theorem then allows us to apply a continuous function g to the pair and conclude that
g(Xn, Yn) →p g(X, Y).
2.8.1 Properties
• Convergence in moments implies convergence in probability: Let {Xn : n ≥ 1} and X be random vectors on R^k. Then,
Xn →Lq X ⇒ Xn →p X.
Cauchy-Schwarz inequality:
E[uv]² ≤ E[u²] E[v²].
Furthermore, the inequality is strict unless there exists α such that P(u = αv) = 1.
2.11 Convergence in Distribution
Let {Xn : n ≥ 1} and X be random vectors on R^k. Xn is said to converge in distribution to X, denoted Xn →d X, if, for all x at which P(X ≤ x) is continuous, as n → ∞,
P(Xn ≤ x) → P(X ≤ x).
2.11.1 Properties
• Convergence in probability implies convergence in distribution.
• Convergence in distribution does not imply convergence in probability (unless the convergence in distribution is to a constant).
• If
1. Xn →d b
2. b is a constant,
then
Xn →p b.
• If
1. Xn →d X
2. Yn →d Xn,
then
Yn →d X.
• Marginal convergence in distribution does not imply joint convergence in distribution (unless one of the marginal limits is a constant).
• If
1. Xn →d X
2. Yn →d b
3. b is a constant,
then
(Xn, Yn) →d (X, b).
2.13 Slutsky’s Lemma
Let {Xn : n ≥ 1}, {Yn : n ≥ 1} and X be random vectors, and c ∈ R^k a constant. If
1. Xn →d X
2. Yn →d c,
then
Xn′Yn →d X′c
and
Xn + Yn →d X + c.
In particular, if X ∼ N(0, σ²(P)), then
τn(g(Xn) − g(c)) →d N(0, gx(c)² σ²(P)).
2.16.2 Multivariate version
Let {Xn : n ≥ 1} and X be random vectors, c ∈ R^k a constant, and τn a sequence of constants such that τn → ∞ and τn(Xn − c) →d X. Let g : R^k → R^d be a function that is continuous and differentiable at c. Denote by Dg(c) the d × k matrix of partial derivatives of g evaluated at c. Then,
τn(g(Xn) − g(c)) →d Dg(c)X.
In particular, if X ∼ N(0, Σ), then
τn(g(Xn) − g(c)) →d N(0, Dg(c)ΣDg(c)′).
• Is it consistent?
• Since E[Xi²] < ∞, we have E[|Xi|] < ∞. Then, since the Xi are iid for i = 1, . . . , n and E[|Xi|] < ∞, as n → ∞, by the WLLN:
X̄n →p E[X].
Also, since the Xi² are iid for i = 1, . . . , n and E[Xi²] < ∞, as n → ∞, by the WLLN:
(1/n) Σ_{i=1}^n Xi² →p E[Xi²].
Write the estimator as s²n = (1/n) Σ_{i=1}^n Xi² − X̄n² = g((1/n) Σ_{i=1}^n Xi², X̄n), where
g(a, b) = a − b².
Since g is continuous, by the Continuous Mapping Theorem (and the fact that marginal convergence in probability implies joint convergence in probability),
s²n →p E[X²] − E[X]² = Var[X] = σ²(P).
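A small simulation sketch (my own illustration, not from the notes; the Bernoulli design and sample sizes are arbitrary) showing s²n settling at Var[X] as n grows.

import numpy as np

rng = np.random.default_rng(1)
p_true = 0.3                                  # assumed Bernoulli parameter for the illustration
for n in (100, 10_000, 1_000_000):
    x = rng.binomial(1, p_true, size=n)
    s2 = np.mean(x ** 2) - np.mean(x) ** 2    # s2 = g((1/n) sum Xi^2, Xbar_n) with g(a, b) = a - b^2
    print(n, s2, p_true * (1 - p_true))       # s2 should approach Var[X] = p(1 - p)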
2.18 Example 2: Applying Delta Method: univariate
Suppose X1, X2, . . . , Xn are iid with distribution P = Bernoulli(q), with q ∈ (0, 1). Then, by the Central Limit Theorem,
√n(X̄n − q) →d N(0, g(q)),
where g(q) = q(1 − q). A natural estimator for g(q) is g(X̄n). Since Dg(q) = 1 − 2q, the Delta Method implies that
√n(g(X̄n) − g(q)) →d Dg(q) N(0, g(q)) = N(0, (1 − 2q)² q(1 − q)).
When q = 1/2 we have Dg(q) = 0, so this limiting distribution is degenerate at zero; to obtain a non-degenerate limit, consider a second-order Taylor expansion of g around q:
g(X̄n) − g(q) = Dg(q)(X̄n − q) + (D²g(q)/2)(X̄n − q)² + R(X̄n − q),
where R(0) = 0 and R(h) = o(h²); i.e. as h → 0,
R(h)/h² → 0.
Since D²g(q) = −2, when q = 1/2 the above simplifies to
g(X̄n) − g(q) = −(X̄n − q)² + R(X̄n − q).
Multiplying both sides by n, we obtain
n(g(X̄n) − g(q)) = −n(X̄n − q)² + nR(X̄n − q).
Consider the first term on the right-hand side. Rearranging and using the Continuous Mapping Theorem gives:
−n(X̄n − q)² = −(√n(X̄n − q))² →d −[N(0, 1/4)]² =d −[(1/2)N(0, 1)]² =d −(1/4)χ²₁.
We can also write the second term as
nR(X̄n − q) = n(X̄n − q)² b(X̄n − q),
where b(h) := R(h)/h² for h ≠ 0 and b(0) := 0, so that b is continuous at zero (i.e. at X̄n = q). Recall that R(h)/h² → 0 as h → 0, and that X̄n →p q. Thus,
b(X̄n − q) →p 0 ⇒ b(X̄n − q) →d 0.
And, from before, we have already shown that
n(X̄n − q)² →d (1/4)χ²₁.
Combining the two terms (by Slutsky’s Lemma, nR(X̄n − q) →p 0), we conclude that
n(g(X̄n) − g(q)) →d −(1/4)χ²₁.
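A Monte Carlo sketch of the q = 1/2 case (my own illustration; the sample size and number of replications are arbitrary): the quantiles of n(g(X̄n) − g(q)) should be close to those of −(1/4)χ²₁.

import numpy as np

rng = np.random.default_rng(2)
q, n, reps = 0.5, 5_000, 20_000
xbar = rng.binomial(n, q, size=reps) / n          # reps independent sample means
stat = n * (xbar * (1 - xbar) - q * (1 - q))      # n (g(Xbar_n) - g(q)) with g(a) = a(1 - a)
target = -0.25 * rng.chisquare(1, size=reps)      # draws from -(1/4) chi^2_1
print(np.quantile(stat, [0.1, 0.5, 0.9]))
print(np.quantile(target, [0.1, 0.5, 0.9]))       # the two rows should be close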
2.19 Example 3: Applying Delta Method: multivariate
Let X1, . . . , Xn be an iid sequence of random variables with distribution Exp(λ) with λ > 0. (Here, Exp(λ) denotes the exponential distribution with parameter λ, whose pdf is given by
f(x) = λ exp(−λx) if x ≥ 0, and f(x) = 0 if x < 0.
In particular, the mean and the variance of such a random variable are 1/λ and 1/λ², respectively.) Independently of X1, . . . , Xn, let Y1, . . . , Yn be an iid sequence of random variables with distribution Exp(µ) with µ > 0.
Deriving the limiting distribution for log(Ȳn/X̄n):
As always, write out the expression:
log(Ȳn/X̄n) = log Ȳn − log X̄n.
Since it looks like we will be splicing things together, we should work in a multivariate setting. Since the Xi’s and Yi’s are iid with finite variances, and Xi is independent of Yi, by the multivariate CLT,
√n( (X̄n, Ȳn)′ − (1/λ, 1/µ)′ ) →d N( 0, [ 1/λ²  0 ; 0  1/µ² ] ).
Let g(x, y) = log(y/x), which is continuous and differentiable if x ≠ 0 and y/x > 0. Note that
Dg(x, y) = ( (x/y) ∂(y/x)/∂x , (x/y) ∂(y/x)/∂y ) = ( −1/x , 1/y ).
Then, by the Delta Method,
√n( log(Ȳn/X̄n) − log((1/µ)/(1/λ)) ) →d N( 0, [−λ  µ] [ 1/λ²  0 ; 0  1/µ² ] [−λ  µ]′ ) = N(0, 2).
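A simulation sketch (my own; λ, µ, n and the number of replications are arbitrary assumptions) checking that √n(log(Ȳn/X̄n) − log(λ/µ)) is approximately N(0, 2).

import numpy as np

rng = np.random.default_rng(3)
lam, mu, n, reps = 2.0, 5.0, 1_000, 5_000
x = rng.exponential(1 / lam, size=(reps, n))   # Xi ~ Exp(lambda), mean 1/lambda
y = rng.exponential(1 / mu, size=(reps, n))    # Yi ~ Exp(mu), mean 1/mu
stat = np.sqrt(n) * (np.log(y.mean(axis=1) / x.mean(axis=1)) - np.log(lam / mu))
print(stat.mean(), stat.var())                 # should be close to 0 and 2, respectively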
P(µ(P) ∈ Cn) ≥ 1 − α.
Cn would then be a confidence set of level 1 − α for µ(P). For example, if α = 5%, then Cn is an interval that contains the true mean with probability at least 95%.
We can construct a confidence set either relying on asymptotics or not.
• Using X̄n − E[X] as the argument, with q = 2:
P(|X̄n − µ(P)| > ε) ≤ E[(X̄n − µ(P))²] / ε² = Var[X̄n] / ε².
• Then, by construction,
P(|X̄n − µ(P)| ≤ ε) ≥ 1 − Var[X̄n] / ε².
• If we could somehow make
1 − Var[X̄n] / ε̄² = 1 − α,
• then
P(|X̄n − µ(P)| ≤ ε̄) ≥ 1 − α.
• That is,
P(X̄n − ε̄ ≤ µ(P) ≤ X̄n + ε̄) ≥ 1 − α.
P (g(P ) ∈ Cn ) ≥ 1 − α.
Below, we list the steps to construct a confidence set/region using the Central Limit Theorem, which relies on asymptotic properties.
1. Construct a consistent estimator for g(P). (Hint: use WLLN and CMT.)
2. Find the limiting distribution of ĝ(P). (Hint: use CLT, MCLT, Delta Method.)
3. Derive a standard distribution. (Hint: use Slutsky’s Lemma.)
4. Then, if the standard distribution is the standard normal one, we can write the confidence set as
Cn := { x ∈ R : |√n (ĝ(P) − x) / s_g(P)| ≤ z_{1−α/2} }.
Equivalently, we can write the confidence set as
cn := z_{1−α/2} s_g(P) / √n,
and
Cn := [ĝ(P) − cn, ĝ(P) + cn].
P (g(P ) ∈ Cn ) ≥ 1 − α.
• We will use the recipe we have learned in section 2.18.1.
• 1. Write down Markov’s inequality for X̄n − E[X] with q = 2:
P(|X̄n − µ(P)| > ε) ≤ E[(X̄n − µ(P))²] / ε² = Var[X̄n] / ε² = Var[(1/n) Σ_{i=1}^n Xi] / ε²
= ( (1/n²) Σ_{i=1}^n Var[X] + (1/n²) Σ_{i≠j} Cov[Xi, Xj] ) / ε²
= (1/n²) n Var[X] / ε²   (the covariance terms are zero since the Xi are iid)
= r(1 − r) / (n ε²).
• 2. Find an upper bound for the RHS of Markov’s inequality such that the unknown parameters disappear:
Note that r(1 − r) is maximised at 1/4 when r = 1/2. Thus,
P(|X̄n − µ(P)| > ε) ≤ r(1 − r) / (n ε²) ≤ 1 / (4 n ε²),  for all r.
• 4. Using the expression obtained above, rewrite so that ε = g(α):
ε = 1 / √(4αn).
(A numerical sketch of this conservative interval is given below.)
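The following Python sketch (my own; α, n and the data-generating value of r are arbitrary assumptions) computes the conservative, non-asymptotic interval X̄n ± 1/√(4αn).

import numpy as np

rng = np.random.default_rng(4)
alpha, n, r_true = 0.05, 1_000, 0.4
x = rng.binomial(1, r_true, size=n)
xbar = x.mean()
eps = 1 / np.sqrt(4 * alpha * n)            # from setting 1/(4 n eps^2) = alpha
print("conservative CI:", (xbar - eps, xbar + eps))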
• Step 2: By the CLT:
√n(X̄n − µ(P)) →d N(0, σ²(P)).
Thus, we can write s²n as a function f(·), parameterised as s²n = f(X̄n), where f(a) = a(1 − a). Since f(·) is continuous, by the Continuous Mapping Theorem and the WLLN:
s²n →p σ²(P).
Since σ²(P) > 0, we can define a function g(·), parameterised as g(a) = 1/√a. Since g(·) is continuous, by the Continuous Mapping Theorem and the WLLN:
g(s²n) = 1/sn →p g(σ²(P)) = 1/σ(P).
• Step 4: Define
cn := z_{1−α/2} √(X̄n(1 − X̄n)) / √n,
and
Cn := [X̄n − cn, X̄n + cn].
We can write the confidence region in the following equivalent way:
Cn := { x ∈ R : |√n(X̄n − x) / √(X̄n(1 − X̄n))| ≤ z_{1−α/2} }.
(A numerical sketch of this interval is given below.)
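A sketch of the CLT-based interval in Python (again my own illustration; the simulated data and α are assumptions), using scipy’s normal quantile for z_{1−α/2}.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
alpha, n, r_true = 0.05, 1_000, 0.4
x = rng.binomial(1, r_true, size=n)
xbar = x.mean()
z = norm.ppf(1 - alpha / 2)                          # z_{1 - alpha/2}
cn = z * np.sqrt(xbar * (1 - xbar)) / np.sqrt(n)
print("asymptotic CI:", (xbar - cn, xbar + cn))      # Cn = [Xbar_n - cn, Xbar_n + cn]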
Stating and verifying the coverage property that your confidence region satisfies:
• The coverage property is that P(µ(P) ∈ Cn) → 1 − α.
• To verify it,
P(µ(P) ∈ Cn) = P(X̄n − cn ≤ µ(P) ≤ X̄n + cn)
= P(|X̄n − µ(P)| ≤ cn)
= P(|X̄n − µ(P)| ≤ z_{1−α/2} √(X̄n(1 − X̄n)) / √n)
= P(|√n(X̄n − µ(P)) / √(X̄n(1 − X̄n))| ≤ z_{1−α/2})
→ P(|z| ≤ z_{1−α/2})
= P(z_{α/2} ≤ z ≤ z_{1−α/2})
= Φ(z_{1−α/2}) − Φ(z_{α/2})
= (1 − α/2) − α/2
= 1 − α.
2.23.1 Consistency in level
• We restrict attention to tests of the form:
ϕn = 1{Tn > cn},
where Tn is the test statistic, a function of the data such that “large” values provide evidence against H0; and cn is the critical value, which gives the definition of “large”.
• E[ϕn] is the probability of rejecting H0, called the power function of the test.
• For µ(P) satisfying H0, E[ϕn] is the probability of type I error.
• For a µ(P) satisfying H1, 1 − E[ϕn] is the probability of type II error.
We say that a test ϕn = ϕn(X1, X2, . . . , Xn) ∈ [0, 1] is consistent in level if
lim sup_{n→∞} E[ϕn] ≤ α
for µ(P) satisfying H0, where α ∈ (0, 1) is the significance level of the test.
• Therefore, the requirement of consistency in level is given by
lim sup_{n→∞} E[ϕn] = lim sup_{n→∞} E[1{Tn > cn}] = lim sup_{n→∞} P(Tn > cn) ≤ α,
where cn depends on α.
2.23.3 Steps
A recipe that almost always works is the following one:
1. Express the parameter in H0 as a function of Expectations.
2. Find a consistent estimator. (Hint: CMT, WLLN)
3. Find the limiting distribution of the estimator. (Hint: CLT, MCLT, Delta Method)
4. Standardize the distribution under the null. (Hint: Slutsky’s Lemma)
5. Construct the test ϕn according to H1, that is, taking into account whether the alternative is greater than, less than, or not equal to the hypothesised value.
H0 : µ(P ) = 0
H1 : µ(P ) ̸= 0.
• First, we need to come up with an estimator for what is being tested under the null.
• As in Example 1, we try to use the WLLN to construct a consistent estimator. In this case it is straightforward, since we are testing the mean.
• Then, by the WLLN, the consistent estimator is X̄n.
• Now, we need to find a limiting distribution for the estimator and try to arrive at a standard distribution. For this, we use the CLT. In this case, it is straightforward since it is only the estimator of the mean.
• By the CLT,
√n(X̄n − µ(P)) →d N(0, σ²(P)).
• We are close to getting a standard normal distribution, but we need to get rid of σ²(P). For this, we will use Slutsky’s Lemma.
• We know (from Example 1) that s²n converges in probability to σ²(P), which implies that s²n →d σ²(P). In particular, we use the fact that f(x) = 1/√x is a continuous function when x > 0. Then, by the CMT, 1/sn →p 1/σ(P), which implies convergence in distribution. Thus, we have that
1. √n(X̄n − µ(P)) →d N(0, σ²(P))
2. 1/sn →d 1/σ(P)
Therefore, by Slutsky’s Lemma:
(1/sn) √n(X̄n − µ(P)) →d (1/σ(P)) N(0, σ²(P)).
That is,
√n(X̄n − µ(P)) / sn →d N(0, 1).
• To conduct the test at level α, the critical value is based on the standard normal distribution. But since the test is two-sided, we reject whenever Tn is either too low or too high. So, with significance level α,
cn := Φ⁻¹(1 − α/2) := z_{1−α/2},
where Φ is the CDF of N(0, 1). Note that Φ(x) = P(z ≤ x), and we want to choose cn such that the probability of z being greater than cn is α/2 (and the probability of z being less than cn is 1 − α/2); i.e. cn is the (1 − α/2)-th quantile of the standard normal distribution.
• So a natural candidate for a test statistic is
Tn := √n X̄n / sn
under the null. Notice that
E[ϕn] = P(|√n X̄n / sn| > z_{1−α/2}).
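A sketch of this two-sided test in Python (illustrative only; the simulated data are an assumption). It computes Tn = √n X̄n / sn and rejects when |Tn| > z_{1−α/2}.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
alpha, n = 0.05, 500
x = rng.normal(loc=0.0, scale=2.0, size=n)       # data generated under H0: mu(P) = 0
xbar = x.mean()
s_n = x.std()                                    # sqrt of (1/n) sum (Xi - Xbar)^2
t_n = np.sqrt(n) * xbar / s_n
reject = np.abs(t_n) > norm.ppf(1 - alpha / 2)
print(f"T_n = {t_n:.3f}, reject H0: {reject}")   # rejects only about 5% of the time under H0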
H0 : g(P ) = 0,
H1 : g(P ) ̸= 0.
The steps described in Section 2.22.3 still apply, but we have to take into consideration the fact that now we are dealing with a variance-covariance matrix instead of a single parameter, which has consequences when standardizing the limiting distribution of ĝ(P).
H0 : µ(P ) = 0,
H1 : µ(P ) ̸= 0.
In this case, we therefore have that
n(X̄n − µ(P))′ Σ(P)⁻¹ (X̄n − µ(P)) →d χ²_k.
Although it appears that we may use this to construct a test statistic, notice that we do not in fact know Σ(P). However, we can estimate Σ(P). Recall that:
where T ∼ χ²_k.
3 Conditional expectations
Let (Y, X) be random variables such that Y ∈ R, X ∈ R.
3.1 Expected Value Rule
E[X] = Σ_x x pX(x)
3.5 About Independence
Note that independence implies mean independence, which in turn, implies uncorrelatedness. The converse
statements do not hold.
Then E[Y | X] = 0, so that Y is mean independent of X. However, Y and X are clearly not independent.
Cov[X, Y] = E[X³] − E[X] E[X²] = 0 − 0 = 0.
However, E[Y | X] = E[X² | X] = X².
4 Linear Regressions
Let (Y, X, u) be random variables/vectors where Y, u ∈ R and X ∈ R^{k+1}, such that
X = (1, X1, . . . , Xk)′,
β = (β0, β1, . . . , βk)′,
and
Y = X′β + u.
• Assume E[u | X] = 0.
• Then,
E[Y | X] = X′β.
• The parameter β has no causal interpretation; it is simply a convenient way of summarising a feature of the distribution of Y and X.
Notice that any solution to the first minimization problem is a solution to the second one (and vice versa):
E[(E[Y | X] − X′b)²] = E[(E[Y | X] − Y + Y − X′b)²]   (define v := E[Y | X] − Y)
= E[v²] + 2E[v(Y − X′b)] + E[(Y − X′b)²]
= E[v²] + 2E[vY] − 2E[vX′b] + E[(Y − X′b)²].
Notice that
E[vX′b] = E[(E[Y | X] − Y)X′b]
= E[E[Y | X]X′b] − E[Y X′b]
(LIE) = E[Y X′b] − E[Y X′b]
= 0.
Then,
E[(E[Y | X] − X′b)²] = E[v²] + 2E[vY] + E[(Y − X′b)²].
Notice that the first two terms of the above equation are constants with respect to b. Hence, minimising the first minimization problem with respect to b must give the same solution as minimising the second one with respect to b.
Let β be a solution to either of the minimization problems. Define u := Y − X′β; then we can interpret Y = X′β + u as either the “Best” Linear Approximation to the Conditional Expectation or the “Best” Linear Predictor of Y given X, which we denote BLP(Y | X).
As before, no causal interpretation can be drawn from β. To derive the properties of u, consider the first-order condition of the minimization problem:
0 = 2E[X(Y − X′β)] = 2E[Xu],
so that E[Xu] = 0.
• E[XX′] exists.
• There is no perfect collinearity in X: E[XX′]⁻¹ < ∞ (i.e. E[XX′] is invertible). X is said to be perfectly collinear if there is c ≠ 0, c ∈ R^{k+1}, such that P(c′X = 0) = 1.
• E[Xu] = 0.
4.2.1 Solving for β
Using the fact that E[Xu] = 0:
0 = E[Xu] = E[X(Y − X′β)] = E[XY] − E[XX′]β
⇒ β = E[XX′]⁻¹ E[XY].
What happens if E[XX′] is not invertible? This means that there exist multiple solutions for β. Although having multiple solutions is not a problem under the first and second interpretations, it would matter for the last interpretation; i.e. multiple solutions are a problem in cases where we are interested in causal relationships.
Proof:
First note that since E[XX′] is invertible by assumption, and X̃1 is a linear combination of X, E[X̃1X̃1′] is invertible.
β̃ = E[X̃1X̃1′]⁻¹ E[X̃1Ỹ]
= E[X̃1X̃1′]⁻¹ E[X̃1(Y − BLP(Y | X2))]
= E[X̃1X̃1′]⁻¹ ( E[X̃1Y] − E[X̃1 BLP(Y | X2)] ).
Since X̃1 is the residual from regressing X1 on X2, X̃1 is orthogonal to X2:
E[X̃1 BLP(Y | X2)] = E[X̃1(X2′γ)] = E[X̃1X2′]γ = 0_{k1×1}.
Hence,
β̃ = E[X̃1X̃1′]⁻¹ E[X̃1Y]
= E[X̃1X̃1′]⁻¹ E[X̃1(X1′β1 + X2′β2 + u)]
= E[X̃1X̃1′]⁻¹ ( E[X̃1X1′]β1 + E[X̃1X2′]β2 + E[X̃1u] ).
Notice that E[X̃1X2′]β2 = 0 since E[X̃1X2′] = 0.
Now, let’s analyze E[X̃1u]:
E[X̃1u] = E[(X1 − BLP(X1 | X2))u] = E[X1u] − E[(X2′δ)u] = E[X1u] − δ′E[X2u] = 0.
Thus,
β̃ = E[X̃1X̃1′]⁻¹ E[X̃1X1′]β1
= E[X̃1X̃1′]⁻¹ E[X̃1(X̃1 + BLP(X1 | X2))′]β1
= β1 + E[X̃1X̃1′]⁻¹ E[X̃1 BLP(X1 | X2)′]β1.
Notice that
E[X̃1 BLP(X1 | X2)′] = E[X̃1(X2′δ)′] = E[X̃1δ′X2]
      [ δ1X̃11   δ2X̃11   ···  δk2X̃11  ]
= E(  [ δ1X̃12   δ2X̃12   ···  δk2X̃12  ]        (X2)_{k2×1} )
      [    ⋮         ⋮       ⋱        ⋮   ]
      [ δ1X̃1k1  δ2X̃1k1  ···  δk2X̃1k1 ]_{k1×k2}
= 0_{k1×1},
where the last equality holds since E[X̃1i X2j] = 0 for all i = 1, 2, . . . , k1 and j = 1, 2, . . . , k2. Therefore β̃ = β1.
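A numerical sketch of this result (my own; the simulated design is an assumption): the coefficient on X1 from the full regression coincides with the coefficient from regressing the residualized Ỹ on the residualized X̃1.

import numpy as np

rng = np.random.default_rng(7)
n = 50_000
x2 = rng.normal(size=(n, 2))                        # the "X2" block of regressors
x1 = 0.5 * x2[:, [0]] + rng.normal(size=(n, 1))     # the "X1" block, correlated with X2
x = np.column_stack([np.ones(n), x1, x2])           # full regressor vector (1, X1, X2)
y = x @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

def blp_coef(X, Y):
    # sample analog of E[XX']^{-1} E[XY]
    return np.linalg.solve(X.T @ X, X.T @ Y)

b_full = blp_coef(x, y)                             # full regression
w = np.column_stack([np.ones(n), x2])               # residualize on the constant and X2
x1_t = x1.ravel() - w @ blp_coef(w, x1.ravel())     # X1 tilde
y_t = y - w @ blp_coef(w, y)                        # Y tilde
b_fwl = blp_coef(x1_t.reshape(-1, 1), y_t)
print(b_full[1], b_fwl[0])                          # the two numbers coincide up to rounding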
4.2.3 Solving for β2 when X1 = 1
Y = β1 + X2′β2 + u,
where β1 is the intercept. From the FWT we know that in general β2 = β̃ can be found from
Ỹ = X̃2′β̃ + ũ,
where
Ỹ := Y − BLP(Y | X1) = Y − X1′γ,
X̃2 := X2 − BLP(X2 | X1) = X2 − X1′δ.
For this particular case, BLP[Y | X1] solves min_{β0} E[(Y − β0)²], which gives the first-order condition E[(Y − β0)] = 0 ⇒ E[Y] = β0. Similarly for BLP[X2 | X1]. Substituting the expressions above, we have:
Ỹ = Y − E[Y],
X̃2 = X2 − E[X2].
Thus,
β̃ = E[(X2 − E[X2])(X2 − E[X2])′]⁻¹ E[(X2 − E[X2])(Y − E[Y])]
= Var[X2]⁻¹ Cov[X2, Y].
As β2 = β̃, we have that when there is an intercept, the vector β2 can be found by the FWT as
β2 = Var[X2]⁻¹ Cov[X2, Y].
4.2.4 Estimating β
Let (Y, X, u) be random variables/vectors where Y, u ∈ R and X ∈ R^{k+1}, where
X = (X0, X1, . . . , Xk)′, X0 = 1,
β = (β0, β1, . . . , βk)′.
Given an iid sample (Y^i, X^i), i = 1, . . . , n, the analog estimator replaces population expectations with sample averages:
β̂n := ( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ ( (1/n) Σ_{i=1}^n X^i Y^i ).
This (analog) estimator is called the ordinary least squares (OLS) estimator. The name comes from the fact that β̂n solves
β̂n = arg min_{b ∈ R^{k+1}} (1/n) Σ_{i=1}^n (Y^i − X^{i′}b)².
Note that since X0^i = 1 for all i, the first-order condition implies that
Σ_{i=1}^n û^i = 0.
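A minimal sketch of the analog (OLS) estimator in Python (the simulated design and names are my own assumptions); the last printed line illustrates the first-order condition Σ û^i = 0.

import numpy as np

rng = np.random.default_rng(8)
n, beta = 10_000, np.array([1.0, 2.0, -0.5])
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # X^i = (1, X1^i, X2^i)'
u = rng.normal(size=n)                                       # E[u | X] = 0 by construction
y = x @ beta + u

# beta_hat = ((1/n) sum X^i X^i')^{-1} ((1/n) sum X^i Y^i)
beta_hat = np.linalg.solve(x.T @ x / n, x.T @ y / n)
u_hat = y - x @ beta_hat
print(beta_hat)       # close to (1, 2, -0.5)
print(u_hat.sum())    # numerically zero, since X0^i = 1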
4.2.5 Unbiasedness of the natural estimator of β
Substituting Y^i = X^{i′}β + u^i and taking expectations conditional on X^1, X^2, . . . , X^n:
E[β̂n | X^1, X^2, . . . , X^n] = β + ( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ E[ (1/n) Σ_{i=1}^n X^i u^i | X^1, X^2, . . . , X^n ]
= β + ( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ ( (1/n) Σ_{i=1}^n E[X^i u^i | X^1, X^2, . . . , X^n] )
= β + ( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ ( (1/n) Σ_{i=1}^n X^i E[u^i | X^1, X^2, . . . , X^n] ).
Since u is mean independent of X:
E[β̂n | X^1, X^2, . . . , X^n] = β.
By the LIE:
E[β̂n] = E[ E[β̂n | X^1, X^2, . . . , X^n] ] ⇔ E[β̂n] = β.
4.2.6 Consistency of the natural estimator of β
Without requiring further assumptions, we can show that the OLS estimator is consistent. Recall that
β̂n = ( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ ( (1/n) Σ_{i=1}^n X^i Y^i ).
Since the X^i X^{i′} are iid and E[XX′] exists, by the WLLN,
(1/n) Σ_{i=1}^n X^i X^{i′} →p E[XX′].
Since we are given that there is no perfect collinearity in X, E[XX′] is invertible. Then, by the Continuous Mapping Theorem,
( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ →p E[XX′]⁻¹.
Similarly, since E[X^i Y^i] = E[XY] = E[X(X′β + u)] = E[XX′]β exists, and the (X^i, Y^i) are iid,
(1/n) Σ_{i=1}^n X^i Y^i →p E[X^i Y^i].
Since convergence in marginal probabilities implies convergence in joint probabilities:
( ( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ , (1/n) Σ_{i=1}^n X^i Y^i ) →p ( E[XX′]⁻¹ , E[X^i Y^i] ).
Noting that multiplication is a continuous operation, and given the above, by the Continuous Mapping Theorem,
β̂n = ( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ ( (1/n) Σ_{i=1}^n X^i Y^i ) →p E[XX′]⁻¹ E[XY] = β.
4.2.7 Limiting distribution of the natural estimator of β
Assume further that Var[Xu] exists. Then,
√n(β̂n − β) →d N(0, Ω),
where
Ω = E[XX′]⁻¹ Var[Xu] E[XX′]⁻¹.
Proof: Substituting Y^i = X^{i′}β + u^i into β̂n yields
β̂n = ( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ ( (1/n) Σ_{i=1}^n X^i (X^{i′}β + u^i) )
= β + ( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ ( (1/n) Σ_{i=1}^n X^i u^i )
⇒ √n(β̂n − β) = ( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ √n ( (1/n) Σ_{i=1}^n X^i u^i ).
By assumption, Var[Xu] exists and the X^i u^i are iid; then, by the Central Limit Theorem:
√n ( (1/n) Σ_{i=1}^n X^i u^i ) →d N(0, Var[Xu]),
where we used the fact that E[Xu] = 0 by assumption. We already showed previously that
( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ →p E[XX′]⁻¹.
Thus, by Slutsky’s Lemma,
√n(β̂n − β) →d N(0, E[XX′]⁻¹ Var[Xu] E[XX′]⁻¹).
4.2.8 Limiting distribution of the natural estimator of β if E[u | X] = 0
If, in addition, E[u | X] = 0 and Var[u | X] = σ², then Ω = σ² E[XX′]⁻¹.
Proof:
Notice that
Var[Xu] = E[(Xu − E[Xu])(Xu − E[Xu])′]
[Since E[Xu] = 0] = E[XX′u²]
[LIE] = E[XX′ E[u² | X]].
Since
Var[u | X] = E[(u − E[u | X])² | X] = E[u² | X] − E[u | X]² = E[u² | X] = σ²,
we can write
Var[Xu] = E[XX′σ²] = σ² E[XX′].
Thus,
Ω = E[XX′]⁻¹ Var[Xu] E[XX′]⁻¹
= E[XX′]⁻¹ σ² E[XX′] E[XX′]⁻¹
= σ² E[XX′]⁻¹.
4.2.9 Estimating Ω
Recall:
√n(β̂n − β) →d N(0, Ω),
where
Ω = E[XX′]⁻¹ Var[Xu] E[XX′]⁻¹.
Since we do not observe u, we do not know Ω. Here, we focus on deriving a consistent estimator of Ω.
Under homoscedasticity:
Suppose u is homoscedastic, so that
E[u | X] = 0, Var[u | X] = σ².
A natural estimator is then Ω̃n := ( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ σ̂n², where σ̂n² := (1/n) Σ_{i=1}^n (û^i)². We have already shown that
( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ →p E[X^i X^{i′}]⁻¹.
For σ̂n², writing û^i = u^i − X^{i′}(β̂n − β) gives
σ̂n² = (1/n) Σ_{i=1}^n (u^i)² − 2 (1/n) Σ_{i=1}^n u^i X^{i′}(β̂n − β) + (1/n) Σ_{i=1}^n [X^{i′}(β̂n − β)]²,
and we analyze each term. By the WLLN,
(1/n) Σ_{i=1}^n (u^i)² →p Var[u].
• (1/n) Σ_{i=1}^n u^i X^{i′}(β̂n − β): Notice that (β̂n − β) can be taken outside of the summation since it does not depend on i. Then, we use the fact that β̂n →p β and (1/n) Σ_{i=1}^n u^i X^{i′} →p E[uX′] = 0, and the Continuous Mapping Theorem (as well as the fact that convergence in marginal probabilities implies convergence in joint probabilities) to conclude that
(1/n) Σ_{i=1}^n u^i X^{i′} (β̂n − β) →p 0.
• (1/n) Σ_{i=1}^n [X^{i′}(β̂n − β)]²: Since the term is non-negative, it is sufficient to show that an upper bound for it converges to zero. With u and v being (k + 1) × 1 vectors, the Cauchy-Schwarz inequality tells us that
( Σ_j uj vj )² ≤ ( Σ_j uj² ) ( Σ_j vj² ),
i.e. (u′v)² ≤ |u|² |v|², where | · | is the Euclidean norm. Applying this with u = X^i and v = β̂n − β, summing across i, and dividing by n, we obtain
(1/n) Σ_{i=1}^n [X^{i′}(β̂n − β)]² ≤ (1/n) Σ_{i=1}^n |X^i|² |β̂n − β|² = ( (1/n) Σ_{i=1}^n |X^i|² ) |β̂n − β|².
Notice that E[XX′] < ∞ implies that E[Xj²] < ∞ for all j = 0, 1, . . . , k, since the diagonal of
             [ X0²    X0X1   ···  X0Xk ]
E[XX′] = E(  [ X1X0   X1²    ···  X1Xk ]  ) < ∞
             [   ⋮        ⋮       ⋱      ⋮   ]
             [ XkX0   XkX1   ···  Xk²  ]
collects exactly these second moments. Since the Xj^i are iid, by the WLLN, we know that (1/n) Σ_{i=1}^n (Xj^i)² →p E[Xj²]. Then, by the fact that convergence in marginal probabilities implies convergence in joint probabilities, using the Continuous Mapping Theorem,
Σ_{j=0}^k (1/n) Σ_{i=1}^n (Xj^i)² = (1/n) Σ_{i=1}^n |X^i|² →p Σ_{j=0}^k E[Xj²].
Therefore, by the CMT and the fact that β̂n →p β,
( (1/n) Σ_{i=1}^n |X^i|² ) |β̂n − β|² →p 0.
Therefore, we have that (since convergence in marginal probabilities implies convergence in joint probabilities)
( ( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ , σ̂n² ) →p ( E[X^i X^{i′}]⁻¹ , Var[u] ),
and applying the Continuous Mapping Theorem (with g(a, b) = ab, which is continuous) yields that
Ω̃n = ( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ σ̂n² →p E[XX′]⁻¹ Var[u] = σ² E[XX′]⁻¹ = Ω.
Under heteroscedasticity:
A natural estimator here is
Ω̂n = ( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ ( (1/n) Σ_{i=1}^n X^i X^{i′} (û^i)² ) ( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹.
We already showed that ( (1/n) Σ_{i=1}^n X^i X^{i′} )⁻¹ →p E[X^i X^{i′}]⁻¹, so our focus here is the term in the middle. Since the X^i’s and u^i’s are both iid and, by assumption, Var[Xu] exists, applying the WLLN gives us that
(1/n) Σ_{i=1}^n X^i X^{i′} (û^i)² →p E[X^i X^{i′} (u^i)²].
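A sketch of this heteroscedasticity-robust (sandwich) estimator Ω̂n in Python (my own illustration; the heteroscedastic design below is an arbitrary assumption).

import numpy as np

rng = np.random.default_rng(9)
n = 10_000
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
u = rng.normal(size=n) * (1 + np.abs(x[:, 1]))       # heteroscedastic errors (assumed design)
y = x @ np.array([1.0, 2.0, -0.5]) + u

sxx = x.T @ x / n                                    # (1/n) sum X^i X^i'
beta_hat = np.linalg.solve(sxx, x.T @ y / n)
u_hat = y - x @ beta_hat
meat = (x * u_hat[:, None] ** 2).T @ x / n           # (1/n) sum X^i X^i' (u_hat^i)^2
omega_hat = np.linalg.inv(sxx) @ meat @ np.linalg.inv(sxx)
se = np.sqrt(np.diag(omega_hat) / n)                 # robust standard errors of beta_hat
print(beta_hat, se)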
4.2.11 Inference: Testing a single linear restriction
Consider testing H0: r′β = c against H1: r′β ≠ c, for a known r ∈ R^{k+1}, r ≠ 0, and a known scalar c. Since linear operations are continuous, by the Delta Method with g(a) = r′a and Dg(a) = r,
√n(r′β̂n − r′β) →d N(0, r′Ωr).
Note that, if r ≠ 0, then r′Ωr > 0 (because Ω is positive definite), so that, by the Continuous Mapping Theorem,
(r′Ω̂n r)⁻¹ →p (r′Ωr)⁻¹.
Then, by Slutsky’s Lemma,
√n(r′β̂n − r′β) / √(r′Ω̂n r) →d N(0, 1).
Thus, we can consider the following test statistic (under the null)
Tn = √n(r′β̂n − c) / √(r′Ω̂n r),
and the test is
ϕn = 1{|Tn| > z_{1−α/2}}.
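A sketch of this single-restriction test in Python (the simulated design, the restriction r′β = c and all names are illustrative assumptions); it reuses the robust Ω̂n construction from above.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(10)
n, beta = 10_000, np.array([1.0, 2.0, -0.5])
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = x @ beta + rng.normal(size=n)

sxx = x.T @ x / n
beta_hat = np.linalg.solve(sxx, x.T @ y / n)
u_hat = y - x @ beta_hat
meat = (x * u_hat[:, None] ** 2).T @ x / n
omega_hat = np.linalg.inv(sxx) @ meat @ np.linalg.inv(sxx)

r, c, alpha = np.array([0.0, 1.0, 0.0]), 2.0, 0.05            # H0: r'beta = c, i.e. beta_1 = 2
t_n = np.sqrt(n) * (r @ beta_hat - c) / np.sqrt(r @ omega_hat @ r)
print(np.abs(t_n) > norm.ppf(1 - alpha / 2))                  # True means reject H0 (here H0 holds)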
4.2.12 Inference: Testing multiple linear restrictions (Wald Test)
Multiple linear restrictions of the form H0: Rβ = c can be expressed, for example, by setting
    [ 1  0   0  0  0  ··· ]
R = [ 0  0   1  1  0  ··· ]
    [ 0  1  −1  0  0  ··· ]   (3 × (k + 1)).
Just as we required r ≠ 0, here, to rule out redundant restrictions (e.g. 2β1 + 2β2 = 2c and β1 + β2 = c), we require the rows of R to be linearly independent.
This means that, given that Ω is invertible by assumption, RΩR′ is also invertible. To see this, suppose
Ω = [ a  b ]
    [ c  d ].
By the Continuous Mapping Theorem, we also have that
(RΩ̂nR′)⁻¹ →p (RΩR′)⁻¹.
Recall that, if
√n(β̂n − β) →d z ∼ N(0, Ω),
then, under H0: Rβ = c with p linearly independent restrictions, the Wald statistic
Tn := n (Rβ̂n − c)′ (RΩ̂nR′)⁻¹ (Rβ̂n − c) →d χ²_p,
and the test is
ϕn = 1{Tn > c_{p,1−α}},
where c_{p,1−α} is the (1 − α)-th quantile of the χ²_p distribution. By construction, this test is consistent in level α.
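A sketch of the Wald test in Python for two joint restrictions (R, c, the data-generating design and all names are assumptions made only for the illustration).

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(11)
n, beta = 10_000, np.array([1.0, 2.0, -0.5])
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = x @ beta + rng.normal(size=n)

sxx = x.T @ x / n
beta_hat = np.linalg.solve(sxx, x.T @ y / n)
u_hat = y - x @ beta_hat
meat = (x * u_hat[:, None] ** 2).T @ x / n
omega_hat = np.linalg.inv(sxx) @ meat @ np.linalg.inv(sxx)

R = np.array([[0.0, 1.0, 0.0],          # H0: beta_1 = 2 and beta_2 = -0.5 (p = 2 restrictions)
              [0.0, 0.0, 1.0]])
c = np.array([2.0, -0.5])
diff = R @ beta_hat - c
t_n = n * diff @ np.linalg.solve(R @ omega_hat @ R.T, diff)   # Wald statistic
print(t_n > chi2.ppf(0.95, df=2))                             # True means reject at the 5% level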
where we note that Dβf(β) Ω Dβf(β)′ is non-singular (for the same reason as before). By the Continuous Mapping Theorem, then
Dβf(β̂n) Ω̂n Dβf(β̂n)′ →p Dβf(β) Ω Dβf(β)′,
and
ϕn = 1{Tn > c_{p,1−α}},
where c_{p,1−α} is the (1 − α)-th quantile of the χ²_p distribution. By construction, this is consistent in level α.
4.3 Linear Regression when E[Xu] ≠ 0
Suppose that we have (Y, X, u) where Y, u ∈ R and X ∈ R^{k+1}, with X0 = 1, and β such that
Y = X′β + u.
4.3.1 Motivating examples for when E[Xu] ≠ 0
Omitted variables
Y = β0 + β1X1 + β2X2 + u
Assume that the model is causal but we still have that E[Xu] = 0. However, suppose we cannot observe X2
so that we estimate the model as
Y = β0∗ + β1∗ X1 + u∗ .
Using the true model, we can rewrite the estimated model as:
Y = β0 + β1X1 + (β2X2 + u),
so that u* = β2X2 + u; if X1 and X2 are correlated, then E[X1u*] ≠ 0.
Measurement error
We suppose that Y = β0 + X1′β1 + u is the causal model and that E[Xu] = 0. However, assume that X1 is unobserved and that we instead observe
X̂1 = X1 + v,
where E[v] = 0, Cov[u, v] = 0 and Cov[X1, v] = 0. The model we therefore estimate is
Y = β0* + X̂1′β1* + u*.
Using the true model, we can rewrite the estimated model as
Y = β0 + X̂1′β1 + (u − v′β1),
where u* := u − v′β1.
Simultaneity
Let superscript d denote demand-side variables and superscript s denote supply-side variables. Consider the following demand and supply equations:
Q^d = β0^d + β1^d P̃ + u^d,
Q^s = β0^s + β1^s P̃ + u^s,
with E[u^d] = E[u^s] = 0 and E[u^d u^s] = 0. What we observe in the data is the equilibrium outcome determined by the market-clearing condition, Q^d = Q^s; i.e.
4.3.2 What happens to the OLS estimator if E[Xu] ≠ 0?
β̂n →p E[XX′]⁻¹ E[XY] = β + E[XX′]⁻¹ E[Xu],   where the second term is ≠ 0.
Hence, although we would still have that β̂n →p E[XX′]⁻¹ E[XY], we no longer have that β̂n →p β. That is, the OLS estimator is inconsistent.
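A small simulation sketch (my own; the omitted-variable design is arbitrary) showing the OLS estimate settling away from the causal coefficient when E[Xu] ≠ 0.

import numpy as np

rng = np.random.default_rng(12)
n = 200_000
x2 = rng.normal(size=n)                               # omitted regressor
x1 = 0.8 * x2 + rng.normal(size=n)                    # observed regressor, correlated with x2
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)    # true causal coefficient on x1 is 2

x = np.column_stack([np.ones(n), x1])                 # estimated model omits x2
beta_star = np.linalg.solve(x.T @ x, x.T @ y)
print(beta_star[1])                                   # settles at a value different from 2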
• (Order condition) l + 1 ≥ k + 1 (a necessary condition for the rank condition).
• Z includes all exogenous Xj (in particular, Z0 = X0 = 1 ).
• E [ZZ′ ] and E [ZX′ ] exist.
• No perfect collinearity in Z.
Now we solve for β. First, note that
E[Zu] = 0 ⇒ E[Z(Y − X′β)] = 0 ⇒ E[ZY] = E[ZX′]β.
To proceed: suppose there is no perfect collinearity in Z; then E[ZZ′] is invertible.
Multiplying both sides of E[ZY] = E[ZX′]β by Π′:
Π′E[ZY] = Π′E[ZX′]β.
Since BLP(X | Z) = Π′Z:
Π′E[ZY] = Π′E[ZZ′]Πβ.
As we showed that Π′E[ZZ′]Π is invertible, we can rewrite the solution for β as:
β = (Π′E[ZX′])⁻¹ Π′E[ZY],
where
Π = E[ZZ′]⁻¹ E[ZX′].
The first-stage projection can be written as
X = Π′Z + v,
with E[Zv′] = 0 (we use the fact that there is no collinearity in Z for the existence of Π, so that E[ZZ′] is invertible). Hence,
E[ZX′] = E[Z(Π′Z + v)′] = E[ZZ′Π] + E[Zv′] = E[ZZ′]Π,
so that rank(E[ZX′]) = rank(E[ZZ′]Π) ≤ rank(Π).
4.3.4 Estimating β
Suppose we have the following:
• (Y, X, Z, u) such that X ∈ R^{k+1} with X0 = 1 and Z ∈ R^{l+1} with Z0 = 1, and Z includes any exogenous Xj’s.
• E[ZZ′] and E[ZX′] exist.
ZÛ = 0.
TSLS estimator
Recall that, when l + 1 ≥ k + 1,
β = (Π′E[ZX′])⁻¹ Π′E[ZY]
= (Π′E[ZZ′]Π)⁻¹ Π′E[ZY],
where BLP(X | Z) = Π′Z, so that
Π = E[ZZ′]⁻¹ E[ZX′].
The sample analog of the first expression is
β̂n^(1) = ( (1/n) Σ_{i=1}^n Π̂n′Z^iX^{i′} )⁻¹ ( (1/n) Σ_{i=1}^n Π̂n′Z^iY^i ),
or, using the second expression,
β̂n^(2) = ( Π̂n′ ( (1/n) Σ_{i=1}^n Z^iZ^{i′} ) Π̂n )⁻¹ ( Π̂n′ (1/n) Σ_{i=1}^n Z^iY^i ),
where
Π̂n = ( (1/n) Σ_{i=1}^n Z^iZ^{i′} )⁻¹ ( (1/n) Σ_{i=1}^n Z^iX^{i′} ).
To see that β̂n^(1) = β̂n^(2), recall that X^i = Π̂n′Z^i + v̂^i, so that
β̂n^(1) = ( (1/n) Σ_{i=1}^n Π̂n′Z^i(Π̂n′Z^i + v̂^i)′ )⁻¹ ( (1/n) Σ_{i=1}^n Π̂n′Z^iY^i )
= ( Π̂n′ ( (1/n) Σ_{i=1}^n Z^iZ^{i′} ) Π̂n )⁻¹ ( Π̂n′ (1/n) Σ_{i=1}^n Z^iY^i ) = β̂n^(2),
using that (1/n) Σ_{i=1}^n Z^i v̂^{i′} = 0 by the first-stage orthogonality condition.
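A sketch of the TSLS estimator in Python (a simulated design with one endogenous regressor and two instruments; all names and parameter values are illustrative assumptions).

import numpy as np

rng = np.random.default_rng(13)
n = 200_000
z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])    # Z^i = (1, Z1, Z2)', l + 1 = 3
e = rng.normal(size=n)                                        # common shock creating endogeneity
x1 = z[:, 1] + 0.5 * z[:, 2] + e + rng.normal(size=n)         # endogenous regressor
x = np.column_stack([np.ones(n), x1])                         # X^i = (1, X1)', k + 1 = 2
u = e + rng.normal(size=n)                                    # Cov(X1, u) != 0, Cov(Z, u) = 0
y = x @ np.array([1.0, 2.0]) + u

pi_hat = np.linalg.solve(z.T @ z / n, z.T @ x / n)            # Pi_hat = ((1/n) sum Z Z')^{-1} ((1/n) sum Z X')
x_hat = z @ pi_hat                                            # first-stage fitted values Pi_hat' Z^i
beta_tsls = np.linalg.solve(x_hat.T @ x / n, x_hat.T @ y / n) # ((1/n) sum Pi_hat' Z X')^{-1} ((1/n) sum Pi_hat' Z Y)
beta_ols = np.linalg.solve(x.T @ x / n, x.T @ y / n)
print(beta_tsls)   # close to (1, 2)
print(beta_ols)    # the OLS coefficient on X1 is pulled away from 2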
4.3.5 Consistency of the TSLS estimator
Recall the TSLS estimator:
β̂n = ( (1/n) Σ_{i=1}^n Π̂n′Z^iX^{i′} )⁻¹ ( (1/n) Σ_{i=1}^n Π̂n′Z^iY^i )
= ( Π̂n′ ( (1/n) Σ_{i=1}^n Z^iZ^{i′} ) Π̂n )⁻¹ ( (1/n) Σ_{i=1}^n Π̂n′Z^iY^i ),
where
Π̂n = ( (1/n) Σ_{i=1}^n Z^iZ^{i′} )⁻¹ ( (1/n) Σ_{i=1}^n Z^iX^{i′} ).
Since Z^i and X^i are iid and E[ZZ′] and E[ZX′] exist, by the WLLN,
(1/n) Σ_{i=1}^n Z^iX^{i′} →p E[ZX′],
(1/n) Σ_{i=1}^n Z^iZ^{i′} →p E[ZZ′].
By the Continuous Mapping Theorem, since there is no perfect collinearity in Z (i.e. E[ZZ′] is invertible),
( (1/n) Σ_{i=1}^n Z^iZ^{i′} )⁻¹ →p E[ZZ′]⁻¹,
and therefore Π̂n →p Π.
Since rank(E[ZX′]) = rank(Π) = k + 1, E[Π′ZZ′Π] is invertible, so, by the Continuous Mapping Theorem,
( Π̂n′ ( (1/n) Σ_{i=1}^n Z^iZ^{i′} ) Π̂n )⁻¹ →p (E[Π′ZZ′Π])⁻¹.
Applying the Continuous Mapping Theorem once more, we obtain that
β̂n →p (E[Π′ZZ′Π])⁻¹ E[Π′ZY] = β.
4.3.6 Limiting distribution of the TSLS estimator
Assume further that Var[Zu] exists. Then,
√n(β̂n − β) →d N(0, Ω),
where
Ω = E[Π′ZZ′Π]⁻¹ Π′ Var[Zu] Π E[Π′ZZ′Π]⁻¹.
Since Var[Zu] < ∞ and the (Z^i, X^i, Y^i) are iid, by the Central Limit Theorem,
√n ( (1/n) Σ_{i=1}^n Z^iu^i ) →d N(0, Var[Zu]).
By Slutsky’s Lemma,
( Π̂n′ ( (1/n) Σ_{i=1}^n Z^iZ^{i′} ) Π̂n )⁻¹ Π̂n′ √n ( (1/n) Σ_{i=1}^n Z^iu^i ) →d (E[Π′ZZ′Π])⁻¹ Π′ N(0, Var[Zu]).
That is,
√n(β̂n − β) →d N(0, Ω),
where Ω = E[Π′ZZ′Π]⁻¹ Π′ Var[Zu] Π E[Π′ZZ′Π]⁻¹.
4.3.7 Efficiency of the TSLS estimator
As before, we will have that
√n(β̃n − β) →d N(0, Ω̃),
where
Ω̃ = (E[Γ′ZX′])⁻¹ Γ′ Var[Zu] Γ ((E[Γ′ZX′])⁻¹)′
(the transpose in the last term is there as the term might not be symmetric).
However, if E[u | Z] = 0 and Var[u | Z] = σ², then Γ = Π is the “best” choice, in the sense that Ω ≤ Ω̃ for any Γ that satisfies rank(Γ′E[ZX′]) = k + 1.
4.3.8 Estimating Ω
The natural estimator here is
Ω̂n = A ( (1/n) Σ_{i=1}^n Z^iZ^{i′} (û^i)² ) A′,
where A := ( Π̂n′ ( (1/n) Σ_{i=1}^n Z^iZ^{i′} ) Π̂n )⁻¹ Π̂n′. Note that we already showed that A converges in probability.
The argument showing that (1/n) Σ_{i=1}^n Z^iZ^{i′} (û^i)² →p Var[Zu] is isomorphic to the one we used to show that (1/n) Σ_{i=1}^n X^iX^{i′} (û^i)² →p Var[Xu] in the case of the OLS estimator.
Crucially, û^i is the residual from the regression of Y^i on X^i, evaluated at the TSLS estimate, and does not involve Z^i. That is,
û^i = Y^i − X^{i′} β̂n.