Advanced Econometrics I

Jürgen Meinecke
Lecture 2 of 12
Research School of Economics, Australian National University

1 / 42
Roadmap

Projections (rinse and repeat)


Linear Projections in L2 — General Case

Ordinary Least Squares Estimation

2 / 42
Given a bunch of random variables X_1, …, X_K, Y, we wanted to
express Y as a linear combination of X_1, …, X_K
A fancy way of saying the same thing:
We want to project Y onto the subspace spanned by X_1, …, X_K
That projection is labeled P_{sp(X_1,…,X_K)} Y or Ŷ
Instead of P_{sp(X_1,…,X_K)}, we may simply write P_X,
where X := (X_1, …, X_K)′
(Aside: the X_i can enter non-linearly, for example X_2 := X_1²)

3 / 42
Viewing X_1, …, X_K, Y as elements of a Hilbert space, we
learned the generic characterization using the inner product:
Using the orthonormal basis X̃_1, …, X̃_K
(such that sp(X̃_1, …, X̃_K) = sp(X))

Ŷ = P_X Y = ∑_{i=1}^K ⟨X̃_i, Y⟩ X̃_i
          = ∑_{i=1}^K E(X̃_i · Y) X̃_i
          = ∑_{i=1}^K β*_i X_i

For example, when X_1 = 1 (constant term) and K = 2, we saw

β*_2 = Cov(X_2, Y) / Var(X_2)
β*_1 = EY − β*_2 EX_2
4 / 42
For general K, we use matrices to express β* := (β*_1, …, β*_K)′
Let X := (X_1, X_2, …, X_K)′ be a K × 1 vector

Ŷ = P_X Y = ∑_{i=1}^K β*_i X_i =: X′β*,

where β* := (E(XX′))^{-1} E(XY) is a K × 1 vector
Linear algebra detour: for generic column vectors, written in
compact notation as x := (x_1, …, x_K)′ and y := (y_1, …, y_K)′,

∑_{i=1}^K x_i y_i = x′y = y′x

5 / 42
When X_1 = 1, β* can be expressed via covariances
Corollary
When X = (1, X_2, …, X_K)′, then the projection coefficients are

(β*_2, …, β*_K)′ = Σ_XX^{-1} Σ_XY
β*_1 = EY − β*_2 EX_2 − ⋯ − β*_K EX_K,

where

          ⎡ σ_2²   σ_23   …   σ_2K ⎤             ⎡ Cov(X_2, Y) ⎤
          ⎢ σ_32   σ_3²   …   σ_3K ⎥             ⎢ Cov(X_3, Y) ⎥
Σ_XX  :=  ⎢   ⋮      ⋮     ⋱     ⋮ ⎥ ,   Σ_XY := ⎢      ⋮      ⎥
          ⎣ σ_K2   σ_K3   …   σ_K² ⎦             ⎣ Cov(X_K, Y) ⎦

Σ_XX is the matrix that collects the variances of X on the diagonal and
the covariances on the off-diagonal
Σ_XY is the vector that collects the covariances between X and Y

6 / 42
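Aside (not on the slides): a minimal numpy sketch to check the Corollary numerically. The simulated design, coefficient values, and seed are made up for illustration; with a large N the covariance-based formula and the moment formula (E(XX′))^{-1} E(XY) produce the same numbers.

```python
# A sketch (not on the slides): check the Corollary on simulated data.
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000                       # large N so sample moments approximate population moments
X2 = rng.normal(size=N)
X3 = 0.5 * X2 + rng.normal(size=N)  # made-up design with correlated regressors
Y = 1.0 + 2.0 * X2 - 1.5 * X3 + rng.normal(size=N)

# Moment formula: beta* = (E(XX'))^{-1} E(XY), with X = (1, X2, X3)'
X = np.column_stack([np.ones(N), X2, X3])
beta_moment = np.linalg.solve(X.T @ X / N, X.T @ Y / N)

# Covariance formula: (beta*_2, ..., beta*_K)' = Sigma_XX^{-1} Sigma_XY
Sigma_XX = np.cov(np.column_stack([X2, X3]), rowvar=False)
Sigma_XY = np.array([np.cov(X2, Y)[0, 1], np.cov(X3, Y)[0, 1]])
slopes = np.linalg.solve(Sigma_XX, Sigma_XY)
intercept = Y.mean() - slopes @ np.array([X2.mean(), X3.mean()])

print(beta_moment)               # roughly [1.0, 2.0, -1.5]
print(np.r_[intercept, slopes])  # the same numbers, up to sampling noise
```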
Proof that the projection error is orthogonal to the space spanned by X:

E(X(Y − P_X Y)) = E(X(Y − X′ · E(XX′)^{-1} E(XY)))
                = E(XY − XX′ · E(XX′)^{-1} E(XY))
                = E(XY) − E(XX′) E(XX′)^{-1} E(XY)
                = 0

This result justifies the following linear model representation:

Y = P_X Y + (Y − P_X Y)
  =: P_X Y + u
  = X′β* + u,

where E(Xu) = 0
An undergraduate course in econometrics (or regression
analysis) typically starts with this linear model
7 / 42
Using the linear projection representation

Y = X′β* + u

Once you learn that E(Xu) = 0 you know that β* must be the
projection coefficient
You have learned that it exists and is unique
It is important to understand that the definition of the linear
projection model is not restrictive
In particular, E(uX) = 0 is not an assumption, it is definitional
To drive home this point, suppose I claim

Y = X′θ + w

Next I tell you that E(wX) = 0
You therefore conclude that θ = β* = (E(XX′))^{-1} E(XY)
8 / 42
In summary
Definition (Linear Projection Model)
Given

(i) X_1, …, X_K, Y ∈ L2
(ii) E(XX′) > 0 (positive definite)
(aka, no perfect multicollinearity)

Then the linear projection model is given by

Y = X′β* + u,

where E(uX) = 0 and β* = (E(XX′))^{-1} E(XY).

9 / 42
We accept and understand now that the unique projection
coefficient exists
Let’s say we’re interested in knowing the value of β*
We just learned that (β*_2, …, β*_K)′ = Σ_XX^{-1} Σ_XY
Do we know the objects on the rhs?
These are population variances and covariances
We don’t know these, therefore we don’t know β*
How else could we quantify β*?

10 / 42
Roadmap

Projections (rinse and repeat)

Ordinary Least Squares Estimation


The Problem of Estimation
Definition of the OLS Estimator
Basic Asymptotic Theory (part 1 of 2)
Large Sample Properties of the OLS Estimator

11 / 42
Let’s indulge ourselves and take a short detour to think about
estimation in an abstract way
This subsection is based on Stachurski’s A Primer in Econometric
Theory, chapters 8.1 and 8.2
We’re dealing with a random variable Z with distribution P
We’re interested in a feature of P
Definition (Feature)
Let Z ∈ L2 and P ∈ 𝒫 where 𝒫 is a class of distributions on Z.
A feature of P is an object of the form

γ(P) for some γ : 𝒫 → S

Here S is an arbitrarily flexible space (usually ℝ)


Examples of features: means, moments, variances, covariances

12 / 42
For some reason we are interested in γ(P)
If we knew P then we may be able to derive γ(P)
Example: P is standard normal and γ(P) = ∫ Z P(dZ) = 0
(mean of the standard normal distribution)
But we typically don’t know P
If all we’re interested in is γ(P) then we may not need to know
P (unless the feature we’re interested in is P itself)
Instead, we use a random sample to make an inference about a
feature of P

13 / 42
Definition (Random Sample)
The random variables Z1 , . . . , ZN are called a random sample
of size N from the population P if Z1 , . . . , ZN are mutually
independent and all have probability distribution P.

The joint distribution of Z_1, …, Z_N is P^N by independence


We sometimes say that Z1 , . . . , ZN are iid copies of Z
We sometimes say that Z1 , . . . , ZN are iid random variables
By the way: Zi could be vectors or matrices too

14 / 42
Definition (Statistic)
A statistic is any function g : ×_{i=1}^N ℝ^K → S that maps the
sample data somewhere: g(Z_1, …, Z_N).

The definition of a statistic is deliberately broad


It is a function that maps the sample data somewhere
Where to? Depends on the feature γ(P) you’re interested in
There are countless examples
Illustration: let K = 1 (i.e., univariate)
sample mean: g(Z_1, …, Z_N) = ∑_{i=1}^N Z_i / N =: Z̄_N
sample variance: g(Z_1, …, Z_N) = ∑_{i=1}^N (Z_i − Z̄_N)² / N
sample min: g(Z_1, …, Z_N) = min{Z_1, …, Z_N}
answer to everything: g(Z_1, …, Z_N) = 42

15 / 42
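Aside (not on the slides): a small Python sketch of the four example statistics. The function names and the simulated data are mine; the sample variance divides by N to match the slide.

```python
# A sketch (not on the slides) of the four example statistics for K = 1.
import numpy as np

def sample_mean(z):
    return np.sum(z) / len(z)                            # Z̄_N

def sample_variance(z):
    return np.sum((z - sample_mean(z)) ** 2) / len(z)    # divides by N, as on the slide

def sample_min(z):
    return np.min(z)

def answer_to_everything(z):
    return 42                                            # a valid statistic, just not a useful one

z = np.random.default_rng(1).normal(size=500)            # some made-up sample data
print(sample_mean(z), sample_variance(z), sample_min(z), answer_to_everything(z))
```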
A statistic becomes an estimator when linked to a feature γ(P)
Definition (Estimator)
An estimator γ̂ is a statistic used to infer some feature γ(P) of
an unknown distribution P.

In other words: an estimator is a statistic with a purpose

16 / 42
Earlier example: P is the standard normal distribution
(but let’s pretend we don’t know this, as is usually the case)
So Z ∼ N (0, 1)
And we’re interested in EZ so we set γ(P) = EZ = ∫ Z P(dZ)
We have available a random sample {Z1 , . . . , ZN }
Each Zi ∼ N (0, 1), but we don’t know this
But we do know: all Zi are iid
So they must all have the same mean EZi
What would be an estimator for EZ?
Aside: there are infinitely many
What would be a good estimator for EZi ?
(perhaps not so many anymore)

17 / 42
Analogy Principle

A good way to create estimators is the analogy principle


Goldberger explains the main idea of it:
the analogy principle of estimation. . . proposes that population
parameters be estimated by sample statistics which have the same
property in the sample as the parameters do in the population
(Goldberger, 1968, as cited in Manski, 1988)
That is very unspecific, of course
Manski (1988) wrote an entire book on analog estimation and
explains the analogy principle precisely and comprehensively
But we can illustrate it using our earlier framework

18 / 42
Definition (Empirical Distribution)
The empirical distribution PN of the sample {Z1 , . . . , ZN } is
the discrete distribution that puts equal probability 1/N on
each sample point Zi , i = 1, . . . , N.

Definition (Analogy Principle)


To estimate γ(P) use γ̂ := γ(PN ).

How do we use this in our example?

19 / 42
We wanted to estimate γ(P) := ∫ Z P(dZ)
According to the analogy principle, we should use ∫ Z P_N(dZ)
By definition, the empirical distribution is discrete, therefore

∫ Z P_N(dZ) = ∑_{i=1}^N Z_i / N =: Z̄_N

This is, of course, the sample average and we use the
conventional notation Z̄_N
The analogy principle results in the estimator γ̂ = ∑_{i=1}^N Z_i / N

How can we use the analogy principle to estimate β*?

20 / 42
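Aside (not on the slides): a small Python sketch of this step; applying the same functional γ to the empirical distribution P_N (weight 1/N on each Z_i) reproduces the sample average. The names and the simulated data are made up for illustration.

```python
# A sketch (not on the slides) of the analogy principle for the mean: apply the same
# functional gamma to the empirical distribution P_N, which puts mass 1/N on each Z_i.
import numpy as np

def gamma_mean(points, probs):
    # gamma(P) = integral of Z dP, written for a discrete distribution (points, probs)
    return np.sum(probs * points)

rng = np.random.default_rng(2)
Z = rng.normal(size=1000)            # pretend we do not know these are N(0,1) draws
probs = np.full(len(Z), 1 / len(Z))  # empirical distribution: weight 1/N on each Z_i

gamma_hat = gamma_mean(Z, probs)     # the analog estimator gamma(P_N)
print(gamma_hat, np.isclose(gamma_hat, Z.mean()))  # identical to the sample average
```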
Roadmap

Projections (rinse and repeat)

Ordinary Least Squares Estimation


The Problem of Estimation
Definition of the OLS Estimator
Basic Asymptotic Theory (part 1 of 2)
Large Sample Properties of the OLS Estimator

21 / 42
Recall linear projection representation

Y = X′β* + u,

where X := (X_1, …, X_K)′, and X_1, …, X_K, Y ∈ L2

We saw that E(uX) = 0 implied

β* = (E(XX′))^{-1} E(XY)

In other words: β* is the projection coefficient
We want to estimate β*
For that purpose we have available a random sample
(X_1, Y_1), …, (X_N, Y_N)
We simply write that (X_i, Y_i), i = 1, …, N is a random sample
These are iid copies of the ordered pair (X, Y)

22 / 42
Given the random sample (X_i, Y_i), i = 1, …, N we can write the
linear projection representation as

Y_i = X_i′β* + u_i,

where X_i := (X_{i1}, …, X_{iK})′ is the K-dimensional column vector
that contains copies of X_1, …, X_K for observation i
Because E(uX) = 0 we have E(u_i X_i) = 0

23 / 42
Combining findings from last lecture and assignment 1:

β* = argmin_{b ∈ ℝ^K} E[(Y − X′b)²]
   = argmin_{b ∈ ℝ^K} E[(Y_i − X_i′b)²]          (1)
   = E(X_i X_i′)^{-1} E(X_i Y_i)                  (2)

Equations (1) and (2) motivate two succinct analog estimators
for β*:

(1) the ordinary least squares estimator;
(2) the method of moments estimator

Let’s look at both

24 / 42
If we define β* like so:

β* := argmin_{b ∈ ℝ^K} E[(Y_i − X_i′b)²],

then the analogy principle suggests the estimator

argmin_{b ∈ ℝ^K} ∑_{i=1}^N (Y_i − X_i′b)²

This seems very sensible and deserves a famous definition

Definition (Ordinary Least Squares (OLS) Estimator)
The ordinary least squares estimator is

β̂_OLS := argmin_{b ∈ ℝ^K} ∑_{i=1}^N (Y_i − X_i′b)²

It is obvious how this estimator obtained its name

25 / 42
When you solve this you get

β̂_OLS = ( (1/N) ∑_{i=1}^N X_i X_i′ )^{-1} ( (1/N) ∑_{i=1}^N X_i Y_i )

Most people, when writing vectors, use the default column
notation, meaning that if I tell you that X_i is a K-dimensional
vector, you automatically know it is a K × 1 vector

26 / 42
The second way of defining an estimator for β*, via:

β* = E(X_i X_i′)^{-1} E(X_i Y_i)

The analogy principle suggests the estimator

( (1/N) ∑_{i=1}^N X_i X_i′ )^{-1} ( (1/N) ∑_{i=1}^N X_i Y_i )

This also seems very sensible and deserves a familiar name:

Definition (Method of Moments (MM) Estimator)
Applying the analogy principle results in

β̂_MM = ( (1/N) ∑_{i=1}^N X_i X_i′ )^{-1} ( (1/N) ∑_{i=1}^N X_i Y_i )

You immediately see that β̂_OLS = β̂_MM
I’ll simply refer to it as the OLS estimator
27 / 42
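Aside (not on the slides): a numerical sketch of β̂_OLS = β̂_MM. It assumes numpy and scipy are available; the simulated design, coefficient values, and seed are arbitrary choices for illustration.

```python
# A sketch (not on the slides): on made-up simulated data, the minimizer of the sum of
# squared residuals (OLS) and the sample-moment formula (MM) coincide.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
N, K = 400, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])  # X_i = (1, X_i2, X_i3)'
beta_star = np.array([1.0, 2.0, -0.5])
Y = X @ beta_star + rng.normal(size=N)

# OLS: numerically minimize b -> sum_i (Y_i - X_i'b)^2
ssr = lambda b: np.sum((Y - X @ b) ** 2)
beta_ols = minimize(ssr, x0=np.zeros(K)).x

# MM: ((1/N) sum_i X_i X_i')^{-1} ((1/N) sum_i X_i Y_i)
beta_mm = np.linalg.solve(X.T @ X / N, X.T @ Y / N)

print(np.round(beta_ols, 3))
print(np.round(beta_mm, 3))   # identical up to optimizer tolerance
```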
Two slides on notation
Notice that X_i and Y_i are sometimes treated as random
variables, and sometimes as realizations (or observations)
When you see something like E(X_i Y_i) then both X_i and Y_i are
random variables
Expectations are usually taken over random variables
But when you see something like ∑_{i=1}^N X_i Y_i then both X_i and Y_i
are realizations of the random variables
That is, they are observed values that the random variables
have taken on
The context will tell you what role X_i and Y_i play

28 / 42
The OLS estimator does have a compact matrix representation
Let X := (X_1, X_2, …, X_N)′ be the N × K matrix collecting all X_i
Let Y := (Y_1, Y_2, …, Y_N)′ be the N × 1 vector collecting all Y_i
Then ∑ X_i X_i′ = X′X and ∑ X_i Y_i = X′Y
The well known matrix representation of β̂_OLS follows:

β̂_OLS = (X′X)^{-1} X′Y

Digression: switch of notation alert!
We are assigning a new meaning to the symbols X and Y:
• Until here, X was a K × 1 vector
  From now on, X denotes an N × K matrix
• Until here, Y was a scalar
  From now on, Y denotes an N × 1 vector

29 / 42
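Aside (not on the slides): a quick numpy check, on made-up simulated data, that the summation form and the compact matrix form (X′X)^{-1} X′Y give the same numbers.

```python
# A quick check (not on the slides) that the summation form and the compact matrix
# form give the same numbers on freshly simulated data.
import numpy as np

rng = np.random.default_rng(4)
N, K = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)

# Summation form: sum_i X_i X_i' and sum_i X_i Y_i, built row by row
XX = sum(np.outer(X[i], X[i]) for i in range(N))
XY = sum(X[i] * Y[i] for i in range(N))
beta_sum = np.linalg.solve(XX, XY)

# Matrix form: (X'X)^{-1} X'Y
beta_mat = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.allclose(beta_sum, beta_mat))   # True
```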
Now let’s turn to the question: How good is β̂OLS ?
What is goodness?
In the next few weeks we’ll consider things such as

• bias
• variance (small sample and large sample)
• consistency
• distribution (large sample)

30 / 42
Roadmap

Projections (rinse and repeat)

Ordinary Least Squares Estimation


The Problem of Estimation
Definition of the OLS Estimator
Basic Asymptotic Theory (part 1 of 2)
Large Sample Properties of the OLS Estimator

31 / 42
Definition (Convergence in Probability)
A sequence of random variables Z_1, Z_2, … converges in
probability to a random variable Z if for all ε > 0,

lim_{N→∞} P(|Z_N − Z| > ε) = 0.

We write Z_N →p Z and say that Z is the probability limit (plim)
of Z_N.

Oftentimes the probability limit Z is a degenerate random
variable that takes on a constant value everywhere
If a sequence converges in probability to zero, we have special
notation:
Definition
If Z_N →p 0 we write Z_N = o_p(1).

32 / 42
Definition (Bounded in Probability)
A sequence of random variables Z_1, Z_2, … is bounded in
probability if for all ε > 0, there exists b_ε ∈ ℝ and an integer
N_ε such that

P(|Z_N| ≥ b_ε) < ε for all N ≥ N_ε

We write Z_N = O_p(1).

33 / 42
Lemma
If Z_N = c + o_p(1) then Z_N = O_p(1) for c ∈ ℝ.

Proposition
Let W_N = o_p(1), X_N = o_p(1), Y_N = O_p(1), and Z_N = O_p(1).

W_N + X_N = o_p(1)    W_N + Y_N = O_p(1)    Y_N + Z_N = O_p(1)
W_N · X_N = o_p(1)    W_N · Y_N = o_p(1)    Y_N · Z_N = O_p(1)

34 / 42
We’ve got a few more tricks up our sleeves
Theorem (Slutsky Theorem)
If Z_N = c + o_p(1) and g(·) is continuous at c then
g(Z_N) = g(c) + o_p(1).

In short: g(c + o_p(1)) = g(c) + o_p(1)
That’s a reason to like the plim: it passes through nonlinear
functions (which is not true for expectation operators)
Corollary
1/(c + o_p(1)) = 1/c + o_p(1) whenever c ≠ 0.

All the definitions on the previous four slides also apply


element by element to sequences of random vectors or matrices

35 / 42
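Aside (not on the slides): a small simulation sketch of why this is convenient; g(Z̄_N) settles at g(c) for continuous g, even though E(g(Z)) generally differs from g(EZ). The choice g = exp and the normal design are arbitrary.

```python
# A simulation sketch (not on the slides): g(Z̄_N) settles at g(c) for continuous g,
# even though E(g(Z)) generally differs from g(EZ). Here g = exp and Z ~ N(mu, 1).
import numpy as np

rng = np.random.default_rng(5)
mu = 1.0
for N in (100, 10_000, 1_000_000):
    Z = rng.normal(loc=mu, scale=1.0, size=N)
    print(N, np.exp(Z.mean()))     # approaches exp(1) ≈ 2.72 as N grows

# For contrast: E(exp(Z)) = exp(mu + 1/2) ≈ 4.48 for Z ~ N(mu, 1), not exp(mu)
Z = rng.normal(loc=mu, scale=1.0, size=1_000_000)
print(np.exp(Z).mean(), np.exp(mu + 0.5))
```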
Theorem (Weak Law of Large Numbers (WLLN))
Let Z_1, Z_2, … be independent and identically distributed random
variables with EZ_i = µ_Z and Var Z_i = σ_Z² < ∞.
Define Z̄_N := ∑_{i=1}^N Z_i / N. Then

Z̄_N − µ_Z →p 0.

Or, equivalently, Z̄_N = µ_Z + o_p(1).

Equivalently we can state: (1/N) ∑_{i=1}^N Z_i →p µ_Z

In words:
the sample mean converges in probability to the population mean
Proving the WLLN is easy, using Chebyshev’s inequality

36 / 42
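Aside (not on the slides): a quick Monte Carlo sketch of the WLLN in action; the mean, variance, and seed are arbitrary.

```python
# A Monte Carlo sketch (not on the slides): the sample mean of iid draws concentrates
# around the population mean as N grows.
import numpy as np

rng = np.random.default_rng(6)
mu, sigma = 2.0, 3.0
for N in (10, 1_000, 100_000, 10_000_000):
    Z = rng.normal(loc=mu, scale=sigma, size=N)
    print(N, Z.mean())   # deviations from mu shrink roughly like sigma / sqrt(N)
```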
Lemma (Chebyshev’s Inequality)
Let Z be a random variable with EZ² < ∞ and let g(·) be a
nonnegative function. Then for any c > 0

P(g(Z) ≥ c) ≤ E(g(Z)) / c.

Here we’re interested in bounding lim_{N→∞} P(|Z̄_N − µ_Z| > ε)

P(|Z̄_N − µ_Z| > ε) = P((Z̄_N − µ_Z)² > ε²)
                    ≤ E(Z̄_N − µ_Z)² / ε²
                    = Var(Z̄_N) / ε²
                    = σ_Z² / (N · ε²),

which converges to zero as N → ∞
We have used the fact Var(Z̄_N) = σ_Z² / N
37 / 42
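Aside (not on the slides): a simulation sketch confirming that the Monte Carlo frequency of |Z̄_N − µ_Z| > ε sits below the bound σ_Z²/(Nε²), and that both shrink with N; all numbers are made up for illustration.

```python
# A simulation sketch (not on the slides): the Monte Carlo frequency of |Z̄_N - mu| > eps
# stays below the bound sigma^2 / (N * eps^2), and both shrink as N grows.
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, eps, reps = 0.0, 1.0, 0.1, 5_000
for N in (10, 100, 1_000):
    Zbar = rng.normal(loc=mu, scale=sigma, size=(reps, N)).mean(axis=1)
    freq = np.mean(np.abs(Zbar - mu) > eps)   # Monte Carlo estimate of the probability
    bound = sigma**2 / (N * eps**2)           # the Chebyshev bound from this slide
    print(N, freq, min(bound, 1.0))
```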
This takes us back to the analogy principle
Remember earlier:
We wanted to estimate the feature γ(P) := EZ = ∫ Z P(dZ)
According to the analogy principle, we should use ∫ Z P_N(dZ)
This led to the estimator γ̂ = ∑_{i=1}^N Z_i / N
Immediately by the WLLN: γ̂ →p γ(P)
Definition (Consistency of an Estimator)
An estimator γ̂ for γ := γ(P) is called consistent if γ̂ →p γ.

Intuition: if the sample size is large, the sample mean is almost
equal to the population mean
So there is some hope that the analogy principle leads to
consistent estimators

38 / 42
Roadmap

Projections (rinse and repeat)

Ordinary Least Squares Estimation


The Problem of Estimation
Definition of the OLS Estimator
Basic Asymptotic Theory (part 1 of 2)
Large Sample Properties of the OLS Estimator

39 / 42
Let’s first show that the OLS estimator is consistent
Recall the result β̂_OLS := ( ∑_{i=1}^N X_i X_i′ )^{-1} ∑_{i=1}^N X_i Y_i
Using Y_i = X_i′β* + u_i,

β̂_OLS = β* + ( (1/N) ∑_{i=1}^N X_i X_i′ )^{-1} ( (1/N) ∑_{i=1}^N X_i u_i ).

By the WLLN

(1/N) ∑_{i=1}^N X_i X_i′ = E(X_i X_i′) + o_p(1)

Assuming that E(X_i X_i′) is positive definite (inverse exists) and
using Slutsky’s theorem, and a matrix version of the earlier
Lemma that c + o_p(1) = O_p(1):

( (1/N) ∑_{i=1}^N X_i X_i′ )^{-1} = E(X_i X_i′)^{-1} + o_p(1) = O_p(1) + o_p(1) = O_p(1)

40 / 42
For the other factor on the rhs:

(1/N) ∑_{i=1}^N X_i u_i = E(X_i u_i) + o_p(1) = 0 + o_p(1) = o_p(1)

It follows

β̂_OLS = β* + O_p(1) · o_p(1)
      = β* + o_p(1)

In words: β̂_OLS converges in probability to β*
This means β̂_OLS is a consistent estimator for the projection
coefficient β*
It illustrates the benefit of the analogy principle when it works

41 / 42
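Aside (not on the slides): a Monte Carlo sketch of this consistency result; the data-generating process, the t-distributed errors, and the seed are my own choices for illustration.

```python
# A Monte Carlo sketch (not on the slides): beta_hat_OLS drifts toward beta* as N grows.
import numpy as np

rng = np.random.default_rng(8)
beta_star = np.array([1.0, 2.0, -0.5])

def ols(N):
    X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
    u = rng.standard_t(df=5, size=N)        # non-normal errors; E(X_i u_i) = 0 still holds
    Y = X @ beta_star + u
    return np.linalg.solve(X.T @ X, X.T @ Y)

for N in (50, 5_000, 500_000):
    print(N, np.round(ols(N), 4))           # estimates settle at (1.0, 2.0, -0.5)
```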
But what is the distribution of β̂_OLS?
• that’s a tricky one
• β̂_OLS = β* + (X′X)^{-1} X′u, what’s the distribution of the
  second term on the rhs?
• short answer: we have no idea
• there’s some suspicion that β̂_OLS may have an exact normal
  distribution if u is normally distributed
• but we don’t know what the distribution of u is

42 / 42
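Aside (not on the slides): a simulation sketch, under the normal-error scenario mentioned in the suspicion above, of what the sampling distribution of a slope estimate looks like; the design, sample size, and seed are made up.

```python
# A simulation sketch (not on the slides), under the normal-error scenario from the last
# bullet: the sampling distribution of a slope estimate looks very much like a normal.
import numpy as np

rng = np.random.default_rng(9)
beta_star, N, reps = np.array([1.0, 2.0]), 100, 5_000
slopes = np.empty(reps)
for r in range(reps):
    X = np.column_stack([np.ones(N), rng.normal(size=N)])
    Y = X @ beta_star + rng.normal(size=N)              # normally distributed u
    slopes[r] = np.linalg.solve(X.T @ X, X.T @ Y)[1]    # keep the slope estimate

# Compare standardized empirical quantiles with standard normal quantiles
z = (slopes - slopes.mean()) / slopes.std()
print(np.quantile(z, [0.025, 0.5, 0.975]))              # close to (-1.96, 0.00, 1.96)
```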
