Lecture 2
Jürgen Meinecke
Lecture 2 of 12
Research School of Economics, Australian National University
1 / 42
Roadmap
2 / 42
Given a bunch of random variables $X_1, \dots, X_K, Y$, we wanted to
express $Y$ as a linear combination of $X_1, \dots, X_K$
A fancy way of saying the same thing:
We want to project $Y$ onto the subspace spanned by $X_1, \dots, X_K$
That projection is labeled $P_{\mathrm{sp}(X_1,\dots,X_K)} Y$ or $\hat{Y}$
Instead of $P_{\mathrm{sp}(X_1,\dots,X_K)}$, we may simply write $P_X$,
where $X := (X_1, \dots, X_K)'$
(Aside: the $X_i$ can enter non-linearly, for example $X_2 := X_1^2$)
3 / 42
Viewing $X_1, \dots, X_K, Y$ as elements of a Hilbert space, we
learned the generic characterization using the inner product:
Using the orthonormal basis $\tilde{X}_1, \dots, \tilde{X}_K$
(such that $\mathrm{sp}(\tilde{X}_1, \dots, \tilde{X}_K) = \mathrm{sp}(X)$)
$$\hat{Y} = P_X Y = \sum_{i=1}^{K} \langle \tilde{X}_i, Y \rangle \tilde{X}_i = \sum_{i=1}^{K} E(\tilde{X}_i \cdot Y)\, \tilde{X}_i = \sum_{i=1}^{K} \beta_i^* X_i$$
4 / 42
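To make the orthonormal-basis formula concrete, here is a minimal numerical sketch of my own (not from the slides; all names and numbers are illustrative). It simulates a large sample, uses the sample average as a stand-in for the inner product $\langle A, B \rangle = E(AB)$, builds an orthonormal basis by Gram–Schmidt, and checks that $\sum_i \langle \tilde{X}_i, Y \rangle \tilde{X}_i$ coincides with $X'\beta^*$ where $\beta^* = (E(XX'))^{-1} E(XY)$.

import numpy as np

rng = np.random.default_rng(0)
N, K = 200_000, 3

# simulate regressors (X1 is a constant) and an outcome Y
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
Y = 1.0 + 0.5 * X[:, 1] - 2.0 * X[:, 2] + rng.normal(size=N)

def inner(a, b):
    return np.mean(a * b)          # sample analogue of <A, B> = E(AB)

# Gram-Schmidt: orthonormal basis spanning the same space as X1, ..., XK
Xtilde = []
for k in range(K):
    v = X[:, k].copy()
    for q in Xtilde:
        v -= inner(q, v) * q       # remove the part already spanned
    Xtilde.append(v / np.sqrt(inner(v, v)))

# projection via the orthonormal basis: sum_i <Xtilde_i, Y> Xtilde_i
Y_hat_basis = sum(inner(q, Y) * q for q in Xtilde)

# projection via beta* = (E(XX'))^{-1} E(XY), expectations -> sample means
beta_star = np.linalg.solve(X.T @ X / N, X.T @ Y / N)
Y_hat_beta = X @ beta_star

print(np.max(np.abs(Y_hat_basis - Y_hat_beta)))   # ~0 up to floating point

Both routes compute the same orthogonal projection, so the difference is numerical noise only.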
For general $K$, we use matrices to express $\beta^* := (\beta_1^*, \dots, \beta_K^*)'$
Let $X := (X_1, X_2, \dots, X_K)'$ be a $K \times 1$ vector
$$\hat{Y} = P_X Y = \sum_{i=1}^{K} \beta_i^* X_i =: X'\beta^*,$$
where $\beta^* := \left(E(XX')\right)^{-1} E(XY)$ is a $K \times 1$ vector
Linear algebra detour: for generic column vectors
$$x := \begin{pmatrix} x_1 \\ \vdots \\ x_K \end{pmatrix}, \quad y := \begin{pmatrix} y_1 \\ \vdots \\ y_K \end{pmatrix}, \qquad \sum_{i=1}^{K} x_i y_i = x'y = y'x,$$
5 / 42
When X1 = 1, β∗ can be expressed via covariances
Corollary
When $X = (1, X_2, \dots, X_K)'$, then the projection coefficients are
$$(\beta_2^*, \dots, \beta_K^*)' = \Sigma_{XX}^{-1} \Sigma_{XY}$$
$$\beta_1^* = EY - \beta_2^* EX_2 - \dots - \beta_K^* EX_K,$$
where
$$\Sigma_{XX} := \begin{pmatrix} \sigma_2^2 & \sigma_{23} & \dots & \sigma_{2K} \\ \sigma_{32} & \sigma_3^2 & \dots & \sigma_{3K} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{K2} & \sigma_{K3} & \dots & \sigma_K^2 \end{pmatrix}, \qquad \Sigma_{XY} := \begin{pmatrix} \mathrm{Cov}(X_2, Y) \\ \mathrm{Cov}(X_3, Y) \\ \vdots \\ \mathrm{Cov}(X_K, Y) \end{pmatrix}$$
6 / 42
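A quick numerical check of the corollary, as a sketch of my own (simulated data, illustrative coefficients): the slope coefficients from the covariance formula should match the ones from $\beta^* = (E(XX'))^{-1} E(XY)$, with expectations replaced by sample averages.

import numpy as np

rng = np.random.default_rng(1)
N = 500_000

X2 = rng.normal(size=N)
X3 = 0.6 * X2 + rng.normal(size=N)            # correlated regressors
Y = 2.0 + 1.5 * X2 - 0.7 * X3 + rng.normal(size=N)
X = np.column_stack([np.ones(N), X2, X3])     # X1 = 1 (intercept)

# direct formula: beta* = (E(XX'))^{-1} E(XY), sample analogues
beta_direct = np.linalg.solve(X.T @ X / N, X.T @ Y / N)

# covariance formula for the slopes, then back out the intercept
Sigma_XX = np.cov(np.column_stack([X2, X3]), rowvar=False)
Sigma_XY = np.array([np.cov(X2, Y)[0, 1], np.cov(X3, Y)[0, 1]])
slopes = np.linalg.solve(Sigma_XX, Sigma_XY)
intercept = Y.mean() - slopes @ np.array([X2.mean(), X3.mean()])

print(beta_direct)            # approx [2.0, 1.5, -0.7]
print(intercept, slopes)      # same numbers (up to the 1/(N-1) covariance divisor)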
Proof that the projection error is orthogonal to the space spanned by $X$:
$$\begin{aligned} E(X(Y - P_X Y)) &= E\left(X(Y - X' \cdot E(XX')^{-1} E(XY))\right) \\ &= E\left(XY - XX' \cdot E(XX')^{-1} E(XY)\right) \\ &= E(XY) - E(XX') E(XX')^{-1} E(XY) \\ &= 0 \end{aligned}$$
Defining $u := Y - P_X Y$, we can therefore write $Y = X'\beta^* + u$ where $E(Xu) = 0$
An undergraduate course in econometrics (or regression
analysis) typically starts with this linear model
7 / 42
Using the linear projection representation
$$Y = X'\beta^* + u,$$
where $E(uX) = 0$ and $\beta^* = \left(E(XX')\right)^{-1} E(XY)$, and where
(i) $X_1, \dots, X_K, Y \in L^2$
(ii) $E(XX') > 0$ (positive definite)
(aka, no perfect multicollinearity)
Once you learn that $E(Xu) = 0$ you know that $\beta^*$ must be the
projection coefficient
You have learned that it exists and is unique
It is important to understand that the definition of the linear
projection model is not restrictive
In particular, $E(uX) = 0$ is not an assumption, it is definitional
To drive home this point, suppose I claim
$$Y = X'\theta + w$$
9 / 42
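To illustrate why $E(uX) = 0$ is definitional rather than restrictive, here is a small simulation of my own (illustrative names and numbers): the true relationship between $Y$ and $X_2$ is deliberately nonlinear, yet once $u := Y - X'\beta^*$ is formed with $\beta^* = (E(XX'))^{-1} E(XY)$ (sample analogues below), the projection error is uncorrelated with every regressor by construction.

import numpy as np

rng = np.random.default_rng(2)
N = 1_000_000

X2 = rng.uniform(-2, 2, size=N)
Y = np.exp(X2) + rng.normal(size=N)        # true relationship is nonlinear
X = np.column_stack([np.ones(N), X2])      # we still project on (1, X2) only

# beta* = (E(XX'))^{-1} E(XY), expectations replaced by sample means
beta_star = np.linalg.solve(X.T @ X / N, X.T @ Y / N)
u = Y - X @ beta_star                      # projection error, by definition

print(beta_star)       # best linear approximation to exp(X2)
print(X.T @ u / N)     # E(Xu) ~ 0 for every regressor, by construction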
We accept and understand now that the unique projection
coefficient exists
Let’s say we’re interested in knowing the value of β∗
We just learned that $(\beta_2^*, \dots, \beta_K^*)' = \Sigma_{XX}^{-1} \Sigma_{XY}$
Do we know the objects on the rhs?
These are population variances and covariances
We don’t know these, therefore we don’t know β∗
How else could we quantify β∗ ?
10 / 42
Roadmap
11 / 42
Let’s indulge ourselves and take a short detour to think about
estimation in an abstract way
This subsection is based on Stachurski, A Primer in Econometric
Theory, chapters 8.1 and 8.2
We’re dealing with a random variable Z with distribution P
We’re interested in a feature of P
Definition (Feature)
Let $Z \in L^2$ and $P \in \mathcal{P}$ where $\mathcal{P}$ is a class of distributions on $Z$.
A feature of $P$ is an object of the form
$$\gamma(P) \quad \text{for some } \gamma : \mathcal{P} \to S$$
12 / 42
For some reason we are interested in γ(P)
If we knew P then we may be able to derive γ(P)
Example: $P$ is standard normal and $\gamma(P) = \int z \, P(dz) = 0$
(mean of the standard normal distribution)
But we typically don’t know P
If all we’re interested in is γ(P) then we may not need to know
P (unless the feature we’re interested in is P itself)
Instead, we use a random sample to make an inference about a
feature of P
13 / 42
Definition (Random Sample)
The random variables Z1 , . . . , ZN are called a random sample
of size N from the population P if Z1 , . . . , ZN are mutually
independent and all have probability distribution P.
14 / 42
Definition (Statistic)
A statistic is any function $g : \times_{i=1}^{N} \mathbb{R}^K \to S$ that maps the
sample $\{Z_1, \dots, Z_N\}$ into $S$.
15 / 42
A statistic becomes an estimator when linked to a feature γ(P)
Definition (Estimator)
An estimator γ̂ is a statistic used to infer some feature γ(P) of
an unknown distribution P.
16 / 42
Earlier example: P is the standard normal distribution
(but let’s pretend we don’t know this, as is usually the case)
So Z ∼ N (0, 1)
And we're interested in $EZ$ so we set $\gamma(P) = EZ = \int z \, P(dz)$
We have available a random sample {Z1 , . . . , ZN }
Each Zi ∼ N (0, 1), but we don’t know this
But we do know: all Zi are iid
So they must all have the same mean EZi
What would be an estimator for EZ?
Aside: there are infinitely many
What would be a good estimator for EZi ?
(perhaps not so many anymore)
17 / 42
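As a small illustration of my own (not from the slides), here are a few of the infinitely many statistics one could offer as estimators for $EZ$ on one simulated sample; the next subsection gives a principled way to pick one.

import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=500)            # pretend we don't know the distribution

# all of these are statistics; all could be offered as estimators for EZ
candidates = {
    "sample mean": Z.mean(),
    "sample median": np.median(Z),
    "first observation": Z[0],
    "mean of first 10": Z[:10].mean(),
}
for name, value in candidates.items():
    print(f"{name:18s} {value: .3f}")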
Analogy Principle
18 / 42
Definition (Empirical Distribution)
The empirical distribution PN of the sample {Z1 , . . . , ZN } is
the discrete distribution that puts equal probability 1/N on
each sample point Zi , i = 1, . . . , N.
19 / 42
We wanted to estimate $\gamma(P) := \int z \, P(dz)$
According to the analogy principle, we should use $\int z \, P_N(dz)$
By definition, the empirical distribution is discrete, therefore
$$\int z \, P_N(dz) = \sum_{i=1}^{N} Z_i / N =: \bar{Z}_N$$
20 / 42
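As a plug-in sketch of my own (illustrative distribution and feature): the empirical distribution puts mass $1/N$ on each observation, so any feature written as an integral against $P$ is estimated by the same integral against $P_N$, which is simply a sample average.

import numpy as np

rng = np.random.default_rng(4)
Z = rng.exponential(scale=2.0, size=10_000)   # pretend P is unknown; true EZ = 2

# integrating z against P_N: every sample point carries probability 1/N
plugin_mean = np.sum(Z * (1 / len(Z)))        # identical to Z.mean()

# the analogy principle applies to other features too, e.g. P(Z <= 1)
plugin_cdf_at_1 = np.mean(Z <= 1.0)           # empirical CDF evaluated at 1

print(plugin_mean, plugin_cdf_at_1)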
Roadmap
21 / 42
Recall the linear projection representation
$$Y = X'\beta^* + u,$$
22 / 42
Given the random sample $(X_i, Y_i)$, $i = 1, \dots, N$, we can write the
linear projection representation as
$$Y_i = X_i'\beta^* + u_i,$$
23 / 42
Combining findings from last lecture and assignment 1:
$$\beta^* = \operatorname*{argmin}_{b \in \mathbb{R}^K} E\left(Y - X'b\right)^2 = \operatorname*{argmin}_{b \in \mathbb{R}^K} E\left(Y_i - X_i'b\right)^2 \qquad (1)$$
24 / 42
If we define β∗ like so:
$$\beta^* := \operatorname*{argmin}_{b \in \mathbb{R}^K} E\left(Y_i - X_i'b\right)^2,$$
then the analogy principle suggests the estimator that minimizes the sample analogue
$$\hat{\beta}_{OLS} := \operatorname*{argmin}_{b \in \mathbb{R}^K} \frac{1}{N} \sum_{i=1}^{N} \left(Y_i - X_i'b\right)^2$$
25 / 42
When you solve this you get
$$\hat{\beta}_{OLS} = \left(\frac{1}{N} \sum_{i=1}^{N} X_i X_i'\right)^{-1} \left(\frac{1}{N} \sum_{i=1}^{N} X_i Y_i\right)$$
26 / 42
The second way of defining an estimator for $\beta^*$ is via
$$\beta^* = \left(E(X_i X_i')\right)^{-1} E(X_i Y_i)$$
Replacing the population expectations with their sample averages again yields $\hat{\beta}_{OLS}$
28 / 42
The OLS estimator does have a compact matrix representation
Let $X := (X_1, X_2, \dots, X_N)'$ be the $N \times K$ matrix collecting all $X_i$
Let $Y := (Y_1, Y_2, \dots, Y_N)'$ be the $N \times 1$ vector collecting all $Y_i$
Then $\sum_i X_i X_i' = X'X$ and $\sum_i X_i Y_i = X'Y$
The well-known matrix representation of $\hat{\beta}_{OLS}$ follows:
$$\hat{\beta}_{OLS} = (X'X)^{-1} X'Y$$
29 / 42
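A minimal sketch of my own (simulated data, illustrative coefficients) confirming that the moment form and the matrix form of $\hat{\beta}_{OLS}$ are the same calculation, and that both agree with a standard least-squares routine.

import numpy as np

rng = np.random.default_rng(5)
N, K = 1_000, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
Y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=N)

# moment form: (1/N sum X_i X_i')^{-1} (1/N sum X_i Y_i)
Sxx = sum(np.outer(x, x) for x in X) / N
Sxy = sum(x * y for x, y in zip(X, Y)) / N
beta_moment = np.linalg.solve(Sxx, Sxy)

# matrix form: (X'X)^{-1} X'Y
beta_matrix = np.linalg.solve(X.T @ X, X.T @ Y)

# library routine for comparison
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_moment, beta_matrix, beta_lstsq, sep="\n")   # all three agree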
Now let’s turn to the question: How good is β̂OLS ?
What is goodness?
In the next few weeks we’ll consider things such as
• bias
• variance (small sample and large sample)
• consistency
• distribution (large sample)
30 / 42
Roadmap
31 / 42
Definition (Convergence in Probability)
A sequence of random variables Z1 , Z2 , . . . converges in
probability to a random variable $Z$ if for all $\varepsilon > 0$,
$$\lim_{N \to \infty} P(|Z_N - Z| > \varepsilon) = 0.$$
We write $Z_N \xrightarrow{p} Z$ and say that $Z$ is the probability limit (plim)
of $Z_N$.
32 / 42
Definition (Bounded in Probability)
A sequence of random variables Z1 , Z2 , . . . is bounded in
probability if for all $\varepsilon > 0$, there exists $b_\varepsilon \in \mathbb{R}$ and an integer
$N_\varepsilon$ such that
$$P(|Z_N| \geq b_\varepsilon) < \varepsilon \quad \text{for all } N \geq N_\varepsilon$$
We write $Z_N = O_p(1)$.
33 / 42
Lemma
If $Z_N = c + o_p(1)$ for some $c \in \mathbb{R}$, then $Z_N = O_p(1)$.
Proposition
Let $W_N = o_p(1)$, $X_N = o_p(1)$, $Y_N = O_p(1)$, and $Z_N = O_p(1)$.
Then $W_N + X_N = o_p(1)$, $Y_N + Z_N = O_p(1)$, $Y_N Z_N = O_p(1)$, and $W_N Y_N = o_p(1)$.
34 / 42
We’ve got a few more tricks up our sleeves
Theorem (Slutsky Theorem)
If $Z_N = c + o_p(1)$ and $g(\cdot)$ is continuous at $c$ then
$g(Z_N) = g(c) + o_p(1)$.
35 / 42
Theorem (Weak Law of Large Numbers (WLLN))
Let Z1 , Z2 , . . . be independent and identically distributed random
variables with $EZ_i = \mu_Z$ and $\mathrm{Var}\, Z_i = \sigma_Z^2 < \infty$.
Define $\bar{Z}_N := \sum_{i=1}^{N} Z_i / N$. Then
$$\bar{Z}_N - \mu_Z \xrightarrow{p} 0.$$
In words:
sample mean converges in probability to population mean
Proving the WLLN is easy, using Chebyshev’s inequality
36 / 42
Lemma (Chebyshev's Inequality)
Let $Z$ be a random variable with $EZ^2 < \infty$ and let $g(\cdot)$ be a
nonnegative function. Then for any $c > 0$
$$P\left(g(Z) \geq c\right) \leq \frac{E(g(Z))}{c}.$$
Apply the lemma to $\bar{Z}_N$ with $g(z) = (z - \mu_Z)^2$ and $c = \varepsilon^2$:
$$P\left(|\bar{Z}_N - \mu_Z| > \varepsilon\right) \leq \frac{E(\bar{Z}_N - \mu_Z)^2}{\varepsilon^2} = \frac{\mathrm{Var}\, \bar{Z}_N}{\varepsilon^2} = \frac{\sigma_Z^2}{N \varepsilon^2},$$
which converges to zero as $N \to \infty$
We have used the fact $\mathrm{Var}\, \bar{Z}_N = \sigma_Z^2 / N$
37 / 42
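A Monte Carlo sketch of my own (illustrative parameter values): it approximates $P(|\bar{Z}_N - \mu_Z| > \varepsilon)$ by simulation for growing $N$ and compares it with the Chebyshev bound $\sigma_Z^2 / (N \varepsilon^2)$ from the proof; the bound is crude (it can exceed 1) but both go to zero.

import numpy as np

rng = np.random.default_rng(6)
mu, sigma, eps, reps = 1.0, 2.0, 0.1, 1_000

for N in [10, 100, 1_000, 10_000]:
    Z = rng.normal(mu, sigma, size=(reps, N))
    Zbar = Z.mean(axis=1)
    prob = np.mean(np.abs(Zbar - mu) > eps)    # Monte Carlo estimate
    bound = sigma**2 / (N * eps**2)            # Chebyshev bound from the proof
    print(f"N={N:6d}  P(|Zbar - mu| > eps) ~ {prob:.4f}   bound = {bound:8.3f}")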
This takes us back to the analogy principle
Remember earlier:
We wanted to estimate the feature $\gamma(P) := EZ = \int z \, P(dz)$
According to the analogy principle, we should use $\int z \, P_N(dz)$
This led to the estimator $\hat{\gamma} = \sum_{i=1}^{N} Z_i / N$
Immediately by the WLLN: $\hat{\gamma} \xrightarrow{p} \gamma(P)$
Definition (Consistency of an Estimator)
An estimator $\hat{\gamma}$ for $\gamma := \gamma(P)$ is called consistent if $\hat{\gamma} \xrightarrow{p} \gamma$.
38 / 42
Roadmap
39 / 42
Let’s first show that the OLS estimator is consistent
Recall the result $\hat{\beta}_{OLS} := \left(\sum_{i=1}^{N} X_i X_i'\right)^{-1} \sum_{i=1}^{N} X_i Y_i$
Using $Y_i = X_i'\beta^* + u_i$,
$$\hat{\beta}_{OLS} = \beta^* + \left(\frac{1}{N} \sum_{i=1}^{N} X_i X_i'\right)^{-1} \left(\frac{1}{N} \sum_{i=1}^{N} X_i u_i\right).$$
By the WLLN
$$\frac{1}{N} \sum_{i=1}^{N} X_i X_i' = E(X_i X_i') + o_p(1)$$
40 / 42
For the other factor on the rhs:
$$\frac{1}{N} \sum_{i=1}^{N} X_i u_i = E(X_i u_i) + o_p(1) = 0 + o_p(1) = o_p(1)$$
By the Slutsky Theorem (matrix inversion is continuous at $E(X_i X_i')$, which is invertible),
$$\left(\frac{1}{N} \sum_{i=1}^{N} X_i X_i'\right)^{-1} = \left(E(X_i X_i')\right)^{-1} + o_p(1) = O_p(1)$$
It follows
$$\hat{\beta}_{OLS} = \beta^* + O_p(1) \cdot o_p(1) = \beta^* + o_p(1)$$
41 / 42
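A simulation sketch of my own (illustrative coefficients and error distribution) showing the consistency result in action: $\hat{\beta}_{OLS}$ gets closer to $\beta^*$ as the sample size grows.

import numpy as np

rng = np.random.default_rng(7)
beta_star = np.array([1.0, 0.5, -2.0])

def ols(N):
    X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
    u = rng.normal(size=N)                  # E(X_i u_i) = 0 by construction
    Y = X @ beta_star + u
    return np.linalg.solve(X.T @ X, X.T @ Y)

for N in [50, 500, 5_000, 50_000]:
    err = np.max(np.abs(ols(N) - beta_star))
    print(f"N={N:6d}  max |beta_hat - beta*| = {err:.4f}")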
But what is the distribution of β̂OLS ?
• that’s a tricky one
• $\hat{\beta}_{OLS} = \beta^* + (X'X)^{-1} X'u$, what's the distribution of the
second term on the rhs?
• short answer: we have no idea
• there’s some suspicion that β̂OLS may have an exact normal
distribution if u is normally distributed
• but we don’t know what the distribution of u is
42 / 42