
RS – Lecture 1

Lecture 1 – Review I

CLM - Assumptions

• Typical Assumptions
(A1) DGP: y = Xβ + ε is correctly specified.
(A2) E[ε|X] = 0
(A3) Var[ε|X] = σ2 IT
(A4) X has full column rank – rank(X) = k, where T ≥ k.

• Assumption (A1) is called correct specification. We know how the data are generated.

• Assumption (A2) is called the regression assumption. From (A2) we get:
(i) E[ε|X] = 0 => E[y|X] = f(X, θ) + E[ε|X] = f(X, θ)
(ii) Using the Law of Iterated Expectations (LIE):
E[ε] = EX[E[ε|X]] = EX[0] = 0

Least Squares Estimation - Assumptions

• From Assumption (A3) we get
Var[ε|X] = σ2 IT => Var[ε] = σ2 IT

This assumption implies
(i) homoscedasticity => E[εi2|X] = σ2 for all i.
(ii) no serial/cross correlation => E[εi εj|X] = 0 for i ≠ j.

• From Assumption (A4) => the k independent variables in X are linearly independent. Then, the kxk matrix X′X will also have full rank –i.e., rank(X′X) = k.

Least Squares Estimation – f.o.c.

• Objective function: S(xi, θ) = Σi εi2

• We want to minimize S w.r.t. θ. The f.o.c. deliver the normal equations:
-2 Σi [yi - f(xi, θLS)] f′(xi, θLS) = -2 (y - Xb)′X = 0

• Solving for b delivers the OLS estimator:
b = (X′X)-1 X′y

Note: (i) b = bOLS (Ordinary LS; ordinary = linear).
(ii) b is a (linear) function of the data (yi, xi).
(iii) X′(y - Xb) = X′y - X′X(X′X)-1X′y = X′e = 0 => e ⊥ X.
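• As a quick illustration of the normal equations, here is a minimal Python/numpy sketch (the sample size, regressors and parameter values are made up for the example):

```python
# Minimal sketch: OLS via the normal equations b = (X'X)^{-1} X'y.
import numpy as np

rng = np.random.default_rng(0)
T, k = 100, 3                      # illustrative sample size and number of regressors
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
beta = np.array([1.0, 2.0, -0.5])  # hypothetical true parameters
y = X @ beta + rng.normal(size=T)

b = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X'X) b = X'y
e = y - X @ b                           # residuals
print(b)                                # close to beta
print(X.T @ e)                          # ~0: e is orthogonal to X (property (iii))
```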

OLS Estimation - Properties

• Under the typical assumptions, we can establish properties for b:
1) E[b|X] = β
2) Var[b|X] = E[(b - β)(b - β)′|X] = (X′X)-1 X′E[εε′|X] X(X′X)-1 = σ2 (X′X)-1
3) b is BLUE (or MVLUE) => The Gauss-Markov theorem.
4) If (A5) ε|X ~ N(0, σ2 IT) => b|X ~ N(β, σ2(X′X)-1)
=> bk|X ~ N(βk, σ2 (X′X)kk-1)
(the marginals of a multivariate normal are also normal.)

• Estimating σ2
Under (A5), E[e′e|X] = (T - k)σ2
The unbiased estimator of σ2 is s2 = e′e/(T - k).
=> there is a degrees of freedom correction.

Goodness of Fit of the Regression

• After estimating the model, we judge the adequacy of the model. There are two ways to do this:
- Visual: plots of fitted values and residuals, histograms of residuals.
- Numerical measures: R2, adjusted R2, AIC, BIC, etc.

• Numerical measures. We call them goodness-of-fit measures. Most popular: R2.
R2 = SSR/TSS = b′X′M0Xb/y′M0y = 1 - e′e/y′M0y

Note: R2 is bounded by zero and one only if:
(a) There is a constant term in X --we need e′M0X = 0!
(b) The line is computed by linear least squares.
Also note: (c) R2 never falls when regressors are added to the regression.

Adjusted R-squared

• R2 is modified with a penalty for the number of parameters: the adjusted R2
Adjusted R2 = 1 - [(T-1)/(T-k)](1 - R2) = 1 - [(T-1)/(T-k)] RSS/TSS
            = 1 - [RSS/(T-k)] [(T-1)/TSS]
=> maximizing adjusted R2 <=> minimizing [RSS/(T-k)] = s2

• The degrees of freedom --i.e., (T-k)-- adjustment assumes something about "unbiasedness."

• Adjusted R2 includes a penalty for variables that do not add much fit. It can fall when a variable is added to the equation.

• It will rise when a variable, say z, is added to the regression if and only if the t-ratio on z is larger than one in absolute value.

Other Goodness of Fit Measures

• There are other goodness-of-fit measures that also incorporate penalties for the number of parameters (degrees of freedom).

• Information Criteria
- Amemiya: [e′e/(T - k)] × (1 + k/T)
- Akaike Information Criterion (AIC)
AIC = -(2/T)(ln L - k)      L: Likelihood
=> if normality: AIC = ln(e′e/T) + (2/T) k (+ constants)
- Bayes-Schwarz Information Criterion (BIC)
BIC = -(2/T) ln L + [ln(T)/T] k
=> if normality: BIC = ln(e′e/T) + [ln(T)/T] k (+ constants)
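• A minimal numpy sketch of these fit measures, computed from an OLS fit on simulated data (dimensions and coefficients are made up):

```python
# Sketch: R2, adjusted R2 and (up to constants) AIC/BIC under normality.
import numpy as np

rng = np.random.default_rng(1)
T, k = 100, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=T)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
RSS = e @ e
TSS = ((y - y.mean()) ** 2).sum()          # y'M0y: deviations from the mean

R2 = 1 - RSS / TSS
adj_R2 = 1 - (T - 1) / (T - k) * (1 - R2)
AIC = np.log(RSS / T) + 2 * k / T          # ln(e'e/T) + (2/T)k
BIC = np.log(RSS / T) + np.log(T) * k / T  # ln(e'e/T) + [ln(T)/T]k
print(R2, adj_R2, AIC, BIC)
```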

Maximum Likelihood Estimation

• We assume the errors, ε, follow a distribution. Then, we select the parameters of the distribution to maximize the likelihood of the observed sample.

Example: The errors, ε, follow the normal distribution:
(A5) ε|X ~ N(0, σ2 IT)

• Then, we can write the joint pdf of y as
f(yt) = (2πσ2)-1/2 exp[-(yt - xt′β)2/(2σ2)]

L = f(y1, y2, ..., yT |β, σ2) = Πt=1,...,T (2πσ2)-1/2 exp[-(yt - xt′β)2/(2σ2)] = (2πσ2)-T/2 exp[-e′e/(2σ2)]

Taking logs, we have the log likelihood function
ln L = -(T/2) ln 2π - (T/2) ln σ2 - e′e/(2σ2)

• Let θ = (β, σ). Then, we want to
Maxθ ln L(θ|y, X) = -(T/2) ln 2π - (T/2) ln σ2 - (1/(2σ2)) (y - Xβ)′(y - Xβ)

• Then, the f.o.c.:
∂ln L/∂β = (1/σ2) (X′y - X′Xβ) = 0
∂ln L/∂σ2 = -T/(2σ2) + (1/(2σ4)) (y - Xβ)′(y - Xβ) = 0

Note: The f.o.c. deliver the normal equations for β! The solution to the normal equations, β̂MLE, is also the LS estimator, b. That is,
β̂MLE = b = (X′X)-1 X′y;   σ̂2MLE = e′e/T

• Nice result for b: ML estimators have very good properties!
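• A minimal sketch of this result, maximizing the normal log-likelihood numerically and checking that the β estimates coincide with OLS (data are simulated; scipy's BFGS optimizer is a choice of convenience, not part of the derivation):

```python
# Sketch: numerical MLE for the normal linear model; beta_MLE should match b_OLS.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
T, k = 200, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=T)

def neg_loglik(theta):
    beta, log_sigma2 = theta[:k], theta[k]     # parameterize sigma^2 in logs to keep it positive
    sigma2 = np.exp(log_sigma2)
    e = y - X @ beta
    return 0.5 * T * np.log(2 * np.pi * sigma2) + e @ e / (2 * sigma2)

res = minimize(neg_loglik, x0=np.zeros(k + 1), method="BFGS")
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
e_ols = y - X @ b_ols
print(res.x[:k], b_ols)                        # beta_MLE ~ b
print(np.exp(res.x[k]), e_ols @ e_ols / T)     # sigma2_MLE ~ e'e/T
```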

Properties of ML Estimators

(1) Efficiency. Under general conditions, we have that
Var(θ̂MLE) ≥ [n I(θ)]-1
The right-hand side is the Cramer-Rao lower bound (CR-LB). If an estimator can achieve this bound, ML will produce it.

(2) Consistency.
The score Sn(X; θ) and (θ̂MLE - θ) converge together to zero (i.e., in expectation).

(3) Theorem: Asymptotic Normality
Let the likelihood function be L(X1, X2, ..., Xn|θ). Under general conditions, the MLE of θ is asymptotically distributed as
θ̂MLE →a N(θ, [n I(θ)]-1)

(4) Sufficiency. If a single sufficient statistic exists for θ, the MLE of θ must be a function of it. That is, θ̂MLE depends on the sample observations only through the value of a sufficient statistic.

(5) Invariance. The ML estimate is invariant under functional transformations. That is, if θ̂MLE is the MLE of θ and if g(θ) is a function of θ, then g(θ̂MLE) is the MLE of g(θ).

Specification Errors: Omitted Variables

• Omitting relevant variables: Suppose the correct model is
y = X1β1 + X2β2 + ε    -i.e., with two sets of variables.
But, we compute OLS omitting X2. That is,
y = X1β1 + ε    <= the "short regression."

Some easily proved results:
(1) E[b1|X] = E[(X1′X1)-1X1′y] = β1 + (X1′X1)-1X1′X2β2 ≠ β1
=> Unless X1′X2 = 0, b1 is biased. The bias can be huge.
(2) Var[b1|X] ≤ Var[b1.2|X] => smaller variance when we omit X2.
(3) MSE => b1 may be more "precise."

Specification Errors: Irrelevant Variables

• Irrelevant variables: Suppose the correct model is
y = X1β1 + ε
But, we estimate y = X1β1 + X2β2 + ε
Let's compute OLS with X1, X2. This is called the "long regression."

Some easily proved results:
(1) Since the variables in X2 are truly irrelevant, then β2 = 0, so E[b1.2|X] = β1 => No bias.
(2) Inefficiency: Bigger variance.
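• A minimal simulation of the omitted-variable bias in the "short regression" above (the coefficients and the correlation between X1 and X2 are made up):

```python
# Sketch: bias of the short regression when the omitted X2 is correlated with X1.
import numpy as np

rng = np.random.default_rng(3)
T = 5000
x1 = rng.normal(size=T)
x2 = 0.8 * x1 + rng.normal(size=T)          # X2 correlated with X1, so X1'X2 != 0
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=T)

X_short = np.column_stack([np.ones(T), x1])
X_long = np.column_stack([np.ones(T), x1, x2])
b_short = np.linalg.solve(X_short.T @ X_short, X_short.T @ y)
b_long = np.linalg.solve(X_long.T @ X_long, X_long.T @ y)
print(b_short[1])   # biased: roughly 1.0 + 0.8*2.0 = 2.6
print(b_long[1])    # close to the true value 1.0
```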

Linear Restrictions

• Q: How do linear restrictions affect the properties of the least squares estimator?

Model (DGP):          y = Xβ + ε
Theory (information): Rβ - q = 0

Restricted LS estimator: b* = b - (X′X)-1R′[R(X′X)-1R′]-1(Rb - q)

1. Unbiased? YES. E[b*|X] = β
2. Efficiency? NO. Var[b*|X] < Var[b|X]
3. b* may be more "precise."
   Precision = MSE = variance + squared bias.
4. Recall: e′e = (y - Xb)′(y - Xb) ≤ e*′e* = (y - Xb*)′(y - Xb*)
=> Restrictions cannot increase R2 => R2 ≥ R2*

The General Linear Hypothesis: H0: Rβ - q = 0

• We have J joint hypotheses. Let R be a Jxk matrix and q be a Jx1 vector.

• Two approaches to testing (unifying point: OLS is unbiased):

(1) Is Rb - q close to 0? Basing the test on the discrepancy vector m = Rb - q, and using the Wald statistic:
W = m′(Var[m|X])-1m,   with Var[m|X] = R[σ2(X′X)-1]R′
W = (Rb - q)′{R[σ2(X′X)-1]R′}-1(Rb - q)
Under the usual assumptions and assuming σ2 is known, W ~ χJ2.
In general, σ2 is unknown, so we use s2 = e′e/(T-k):
W* = (Rb - q)′{R[s2(X′X)-1]R′}-1(Rb - q) = (Rb - q)′{R[σ2(X′X)-1]R′}-1(Rb - q)/(s2/σ2)
F = (W/J) / [(T-k)(s2/σ2)/(T-k)] = W*/J ~ FJ,T-k

The General Linear Hypothesis: H0: Rβ - q = 0

(2) We know that imposing the restrictions leads to a loss of fit. R2 must go down. Does it go down a lot? -i.e., significantly?

Recall (i) e* = y - Xb* = e - X(b* - b)
       (ii) b* = b - (X′X)-1R′[R(X′X)-1R′]-1(Rb - q)
=> e*′e* - e′e = (Rb - q)′[R(X′X)-1R′]-1(Rb - q)

Recall
- W = (Rb - q)′{R[σ2(X′X)-1]R′}-1(Rb - q) ~ χJ2 (if σ2 known)
- e′e/σ2 ~ χT-k2

Then,
F = (e*′e* - e′e)/J / [e′e/(T-k)] ~ FJ,T-k
Or
F = {(R2 - R*2)/J} / [(1 - R2)/(T-k)] ~ FJ,T-k

Example: Testing H0: Rβ - q = 0

• In the linear model
y = Xβ + ε = β1 + X2β2 + X3β3 + X4β4 + ε

• We want to test if the slopes on X3, X4 are equal to zero. That is,
H0: β3 = β4 = 0
H1: β3 ≠ 0 or β4 ≠ 0 or both β3 and β4 ≠ 0

• We can use F = (e*′e* - e′e)/J / [e′e/(T-k)] ~ FJ,T-k.

Restricted model:   Y = β1 + β2X2 + ε                   => RSSR
Unrestricted model: Y = β1 + β2X2 + β3X3 + β4X4 + ε     => RSSU

F(cost in df, unconstrained df) = [(RSSR - RSSU)/(kU - kR)] / [RSSU/(T - kU)]
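• A minimal numpy/scipy sketch of this restricted-vs-unrestricted F test (simulated data; the coefficient values are made up and H0 is true in the example):

```python
# Sketch: F test of H0: beta3 = beta4 = 0 via RSS_R and RSS_U.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
T = 120
X = np.column_stack([np.ones(T), rng.normal(size=(T, 3))])   # constant, X2, X3, X4
y = X @ np.array([1.0, 0.8, 0.0, 0.0]) + rng.normal(size=T)  # H0 is true here

def rss(Xm):
    b = np.linalg.solve(Xm.T @ Xm, Xm.T @ y)
    e = y - Xm @ b
    return e @ e

RSS_U, k_U = rss(X), X.shape[1]
RSS_R, k_R = rss(X[:, :2]), 2            # restricted model: constant and X2 only
J = k_U - k_R
F = ((RSS_R - RSS_U) / J) / (RSS_U / (T - k_U))
p_value = 1 - stats.f.cdf(F, J, T - k_U)
print(F, p_value)
```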

Functional Form: Chow Test

• Assumption (A1) restricts f(X,β) to be a linear function: f(X,β) = Xβ. But, within the framework of OLS estimation, we can be more flexible:
(1) We can impose non-linear functional forms, as long as they are linear in the parameters (intrinsic linear model).
(2) We can use qualitative variables (dummies) to create non-linearities (splines, changes in regime, etc.). A Chow test (an F-test) can be used to check for regimes/categories or structural breaks:
(a) Run OLS with no distinction between regimes. Keep RSSR.
(b) Run two separate OLS, one for each regime (unrestricted regression). Keep RSS1 and RSS2 => RSSU = RSS1 + RSS2.
(c) Run a standard F-test (testing Restricted vs. Unrestricted models):
F = [(RSSR - RSSU)/(kU - kR)] / [RSSU/(T - kU)] = [(RSSR - [RSS1 + RSS2])/k] / [(RSS1 + RSS2)/(T - 2k)]

Functional Form: Ramsey's RESET Test

• To test the specification of the functional form, we can use the RESET test. From a regression, we keep the fitted values, ŷ = Xb.

• Then, we add ŷ2 to the regression specification; it should pick up quadratic and interactive nonlinearity:
y = Xβ + ŷ2γ + ε

• We test H0 (linear functional form): γ = 0
          H1 (non-linear functional form): γ ≠ 0
=> t-test on the OLS estimator of γ.

• If the t-statistic for ŷ2 is significant => evidence of nonlinearity.
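• A minimal sketch of the RESET idea (simulated data with a quadratic DGP; the numbers are made up):

```python
# Sketch: regress y on X, add the squared fitted values, and t-test their coefficient.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
T = 200
x = rng.normal(size=T)
y = 1.0 + 0.5 * x + 0.7 * x**2 + rng.normal(size=T)    # true DGP is nonlinear in x

X = np.column_stack([np.ones(T), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b

X_aug = np.column_stack([X, y_hat**2])                 # augmented regression
b_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
e_aug = y - X_aug @ b_aug
s2 = e_aug @ e_aug / (T - X_aug.shape[1])
se = np.sqrt(s2 * np.linalg.inv(X_aug.T @ X_aug)[2, 2])
t_stat = b_aug[2] / se                                 # t-test on the y_hat^2 coefficient
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), T - X_aug.shape[1]))
print(t_stat, p_value)                                 # significant => evidence of nonlinearity
```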

Prediction Intervals

• Prediction: Given x0 => predict y0.
(1) Estimate: E[y|X, x0] = β′x0
(2) Prediction: y0 = β′x0 + ε0

• Predictor: ŷ0 = b′x0 + estimate of ε0. (Estimate of ε0 = 0, but with variance.)

• Forecast error: We predict y0 with ŷ0 = b′x0.
ŷ0 - y0 = b′x0 - β′x0 - ε0 = (b - β)′x0 - ε0
=> Var[(ŷ0 - y0)|x0] = E[(ŷ0 - y0)(ŷ0 - y0)′|x0] = x0′Var[(b - β)|x0]x0 + σ2

• How do we estimate this? Two cases:
(1) If x0 is a vector of constants => Form C.I. as usual.
(2) If x0 has to be estimated => Complicated (what is the variance of the product?). Use bootstrapping.

Forecast Variance

• Variance of the forecast error is
σ2 + x0′Var[b|x0]x0 = σ2 + σ2[x0′(X′X)-1x0]

If the model contains a constant term, this is
Var[e0] = σ2 [1 + 1/n + Σj=1,...,K-1 Σk=1,...,K-1 (x0j - x̄j)(x0k - x̄k)(Z′M0Z)jk]
(where Z is X without x1 = ι), i.e., in terms of squares and cross products of deviations from the means.

Note: Large σ2, small n, and large deviations from the means decrease the precision of the forecast.

• Interpretation: Forecast variance is smallest in the middle of our "experience" and increases as we move outside it.
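• A minimal sketch of the forecast variance and a 95% prediction interval for a fixed x0 (simulated data; the model and x0 are made up):

```python
# Sketch: Var[e0] = s2 * (1 + x0'(X'X)^{-1} x0) and the implied prediction interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
T, k = 80, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=T)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (T - k)

x0 = np.array([1.0, 1.5])                        # hypothetical new observation
y0_hat = x0 @ b
var_e0 = s2 * (1 + x0 @ np.linalg.solve(X.T @ X, x0))
half = stats.t.ppf(0.975, T - k) * np.sqrt(var_e0)
print(y0_hat - half, y0_hat + half)              # 95% prediction interval for y0
```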

Forecasting performance of a model: Tests and measures of performance

• Evaluation of a model's predictive accuracy for individual (in-sample and out-of-sample) observations.

• Evaluation of a model's predictive accuracy for a group of (in-sample and out-of-sample) observations.

• Chow prediction test.

Evaluation of forecasts

• Summary measures of out-of-sample forecast accuracy (forecasts ŷi of yi over i = T+1, ..., T+m, with forecast errors ei = ŷi - yi):

Mean Error = (1/m) Σi=T+1,...,T+m (ŷi - yi) = (1/m) Σi ei

Mean Absolute Error (MAE) = (1/m) Σi=T+1,...,T+m |ŷi - yi| = (1/m) Σi |ei|

Mean Squared Error (MSE) = (1/m) Σi=T+1,...,T+m (ŷi - yi)2 = (1/m) Σi ei2

Root Mean Squared Error (RMSE) = sqrt[(1/m) Σi=T+1,...,T+m ei2]

Theil's U-stat: U = sqrt[(1/m) Σi=T+1,...,T+m ei2] / sqrt[(1/T) Σi=1,...,T yi2]
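• A minimal sketch of these summary measures for a hypothetical hold-out sample (the actuals and forecasts below are made-up numbers):

```python
# Sketch: Mean Error, MAE, MSE, RMSE and Theil's U for m hold-out forecasts.
import numpy as np

y_in = np.array([2.0, 2.2, 1.9, 2.5, 2.4])       # in-sample y, i = 1,...,T (made up)
y = np.array([2.6, 2.3, 2.8, 2.7])               # hold-out actuals, i = T+1,...,T+m
y_hat = np.array([2.5, 2.5, 2.6, 2.9])           # model forecasts (made up)
e = y_hat - y

mean_error = e.mean()
mae = np.abs(e).mean()
mse = (e ** 2).mean()
rmse = np.sqrt(mse)
theil_u = rmse / np.sqrt((y_in ** 2).mean())     # Theil's U as defined above
print(mean_error, mae, mse, rmse, theil_u)
```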

CLM: Asymptotics

• To get exact results for OLS, we rely on (A5) ε|X ~ iid N(0, σ2 IT). But, in many situations (A5) is unrealistic. Then, we study the behavior of b (and the test statistics) when T → ∞ -i.e., large samples.

• New assumptions:
(1) {xi, εi}, i = 1, 2, ..., T, is a sequence of independent observations.
- X is stochastic, but independent of the process generating ε.
- We require that X have finite means and variances. Similar requirement for ε, but we also require E[ε] = 0.
(2) Well behaved X:
plim(X′X/T) = Q    (Q a pd matrix of finite elements)
=> (not too much dependence in X).

CLM: New Assumptions

• Now, we have a new set of assumptions in the CLM:
(A1) DGP: y = Xβ + ε.
(A2′) X stochastic, but E[X′ε] = 0 and E[ε] = 0.
(A3) Var[ε|X] = σ2 IT
(A4′) plim(X′X/T) = Q (p.d. matrix with finite elements, rank = k)

• We want to study the large sample properties of OLS:
Q1: Is b consistent? Is s2? YES & YES
Q2: What is the distribution of b?  b →a N(β, (σ2/T)Q-1)
Q3: What about the distribution of the tests?
=> tT = [(zT - µ)/sT] →d N(0, 1)
=> W = (zT - µ)′ST-1(zT - µ) →d χ2rank(ST)
=> F →d χ2rank(Var[m])

Asymptotic Tests: Small sample behavior?

• The p-values from asymptotic tests are approximate for small samples. They may be very bad. Their performance depends on:
(1) Sample size, T.
(2) Distribution of the error terms, ε.
(3) The number of regressors, k, and their properties.
(4) The relationship between the error terms and the regressors.

• A simulation/bootstrap can help.

• Bootstrap tests tend to perform better than tests based on approximate asymptotic distributions.

• The errors committed by both asymptotic and bootstrap tests diminish as T increases.

The Delta Method

• It is used to obtain the asymptotic distribution of a non-linear function of a RV (usually, an estimator).
Tools: (1) A first-order Taylor series expansion
       (2) Slutsky's theorem.

• Let xn be a RV, with plim xn = θ and Var(xn) = σ2 < ∞.
We use the CLT to obtain n½(xn - θ)/σ →d N(0, 1)

• Goal: g(xn) →a ?   (g(xn) is a continuous, differentiable function, independent of n.)

Steps:
(1) Taylor series approximation around θ:
g(xn) = g(θ) + g′(θ)(xn - θ) + higher order terms
We assume the higher order terms are negligible --as n grows, they vanish.

The Delta Method

(2) Use Slutsky's theorem:
plim g(xn) = g(θ)
plim g′(xn) = g′(θ)

Then, as n grows, g(xn) ≈ g(θ) + g′(θ)(xn - θ)
=> n½[g(xn) - g(θ)] ≈ g′(θ)[n½(xn - θ)]
=> n½[g(xn) - g(θ)]/σ ≈ g′(θ)[n½(xn - θ)/σ]

The asymptotic distribution of (g(xn) - g(θ)) is given by that of [n½(xn - θ)/σ], which is a standard normal. Then,
n½[g(xn) - g(θ)] →a N(0, [g′(θ)]2 σ2)

After some work ("inversion"), we obtain:
g(xn) →a N(g(θ), [g′(θ)]2 σ2/n)

Delta Method: Example

Let xn →a N(θ, σ2/n)

Q: g(xn) = δ/xn →a ?   (δ is a constant)

First, calculate the first two moments of g(xn):
g(xn) = δ/xn    => plim g(xn) = δ/θ
g′(xn) = -δ/xn2 => plim g′(xn) = -δ/θ2

Recall the delta method formula: g(xn) →a N(g(θ), [g′(θ)]2 σ2/n).

Then,
g(xn) →a N(δ/θ, (δ2/θ4) σ2/n)
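• A minimal simulation check of this example (θ, σ, δ and n below are made-up values):

```python
# Sketch: compare simulated moments of g(x_n) = delta / x_n with the delta-method formula.
import numpy as np

rng = np.random.default_rng(7)
theta, sigma, delta, n = 2.0, 1.0, 3.0, 500
reps = 20000

x_n = theta + sigma / np.sqrt(n) * rng.standard_normal(reps)   # x_n ~ N(theta, sigma^2/n)
g = delta / x_n

g_theta = delta / theta                                        # delta-method mean
var_dm = (delta / theta**2) ** 2 * sigma**2 / n                # [g'(theta)]^2 sigma^2 / n
print(g.mean(), g_theta)
print(g.var(), var_dm)
```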

The IV Problem

• What makes b consistent when X′ε/T →p 0 is that approximating (X′ε/T) by 0 is reasonably accurate in large samples.

• Now, we challenge the assumption that {xi, εi} is a sequence of independent observations.

• Now, we assume plim(X′ε/T) ≠ 0 => This is the IV Problem!

• Q: When might X be correlated with ε?
- Correlated shocks across linked equations
- Simultaneous equations
- Errors in variables
- Model has a lagged dependent variable and a serially correlated error term

• We start with our linear model
y = Xβ + ε.

• Now, we assume plim(X′ε/T) ≠ 0 and plim(X′X/T) = Q.

• Then, plim b = β + plim(X′X/T)-1 plim(X′ε/T) = β + Q-1 plim(X′ε/T) ≠ β
=> b is not a consistent estimator of β.

• New assumption: we have l instrumental variables, Z, such that
plim(Z′X/T) ≠ 0 but plim(Z′ε/T) = 0

Instrumental Variables: Assumptions

• To get a consistent estimator of β, we also assume:
{xi, zi, εi} is a sequence of RVs, with:
E[X′X] = Qxx (pd and finite)    (LLN => plim(X′X/T) = Qxx)
E[Z′Z] = Qzz (finite)           (LLN => plim(Z′Z/T) = Qzz)
E[Z′X] = Qzx (pd and finite)    (LLN => plim(Z′X/T) = Qzx)
E[Z′ε] = 0                      (LLN => plim(Z′ε/T) = 0)

• Following the same idea as in OLS, we get a system of equations:
W′Z′X bIV = W′Z′y

• We have two cases where estimation is possible:
- Case 1: l = k -i.e., number of instruments = number of regressors.
- Case 2: l > k -i.e., number of instruments > number of regressors.

Instrumental Variables: Estimation

• To get the IV estimator, we start from the system of equations:
W′Z′X bIV = W′Z′y

• Case 1: l = k -i.e., number of instruments = number of regressors.
- Z has the same dimensions as X: Txk => Z′X is a kxk matrix.
- In this case, W is irrelevant; say, W = I.
- Then,
bIV = (Z′X)-1Z′y

IV Estimators

• Properties of bIV

(1) Consistency:
bIV = (Z′X)-1Z′y = (Z′X)-1Z′(Xβ + ε)
    = (Z′X/T)-1(Z′X/T)β + (Z′X/T)-1 Z′ε/T
    = β + (Z′X/T)-1 Z′ε/T →p β    (under the assumptions)

(2) Asymptotic normality:
√T(bIV - β) = √T(Z′X)-1Z′ε = (Z′X/T)-1 √T(Z′ε/T)
Using the Lindeberg-Feller CLT: √T(Z′ε/T) →d N(0, σ2Qzz)
Then, √T(bIV - β) →d N(0, σ2Qzx-1QzzQxz-1)
Est. Asy. Var[bIV] = σ̂2(Z′X)-1Z′Z(X′Z)-1

• Properties of σ̂2 under IV estimation:
- We define σ̂2:
σ̂2 = (1/T) Σi e2IV,i = (1/T) Σi (yi - xi′bIV)2
where eIV = y - XbIV = y - X(Z′X)-1Z′y = [I - X(Z′X)-1Z′]y = Mzx y
- Then,
σ̂2 = eIV′eIV/T = ε′Mzx′Mzxε/T
   = ε′ε/T - 2 ε′X(Z′X)-1Z′ε/T + ε′Z(X′Z)-1X′X(Z′X)-1Z′ε/T
=> plim σ̂2 = plim(ε′ε/T) - 2 plim[(ε′X/T)(Z′X/T)-1(Z′ε/T)] + plim[(ε′Z/T)(X′Z/T)-1(X′X/T)(Z′X/T)-1(Z′ε/T)] = σ2
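• A minimal simulation of the just-identified case (l = k), with one endogenous regressor and one instrument; the DGP below is made up for illustration:

```python
# Sketch: b_IV = (Z'X)^{-1} Z'y is consistent while b_OLS is not when x is endogenous.
import numpy as np

rng = np.random.default_rng(12)
T = 5000
z = rng.normal(size=T)
u = rng.normal(size=T)                     # shock common to x and the error term
x = 0.8 * z + u + rng.normal(size=T)
y = 1.0 + 2.0 * x + (u + rng.normal(size=T))

X = np.column_stack([np.ones(T), x])
Z = np.column_stack([np.ones(T), z])       # same dimensions as X => W is irrelevant

b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)   # b_IV = (Z'X)^{-1} Z'y
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(b_ols[1])                            # inconsistent: drifts away from 2.0
print(b_iv[1])                             # close to 2.0
```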

IV Estimators: 2SLS (2-Stage Least Squares)

• Case 2: l > k -i.e., number of instruments > number of regressors.
- This is the usual case. We could throw away l - k instruments, but throwing away information is never optimal.
- The IV normal equations are a system of l equations:
Z′y = Z′Xβ + Z′ε
Note: We cannot approximate all the Z′ε by 0 simultaneously. There will be at least l - k non-zero residuals. (Similar setup to a regression!)

- From the IV normal equations => W′Z′X bIV = W′Z′y
- We define a different IV estimator.
- Let ZW = Z(Z′Z)-1Z′X = PZX = X̂
- Then, X′PZX bIV = X′PZy
bIV = (X′PZX)-1X′PZy = (X′PZPZX)-1X′PZPZy = (X̂′X̂)-1X̂′ŷ

• We can easily derive properties for bIV:
bIV = (X′PZX)-1X′PZy = (X′PZPZX)-1X′PZPZy = (X̂′X̂)-1X̂′y = (X̂′X̂)-1X̂′ŷ

(1) bIV is consistent.
(2) bIV is asymptotically normal.
- This estimator is also called GIVE (Generalized IV Estimator).

• Interpretations of bIV
bIV = b2SLS = (X̂′X̂)-1X̂′y    This is the 2SLS interpretation.
bIV = (X̂′X)-1X̂′y            This is the usual IV estimator with Z = X̂.
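• A minimal numpy sketch of 2SLS with l > k, using the projection PZX = X̂ (the DGP, with one endogenous regressor and two instruments, is made up):

```python
# Sketch: 2SLS = OLS of y on X_hat = P_Z X (first stage, then second stage).
import numpy as np

rng = np.random.default_rng(8)
T = 2000
z1, z2 = rng.normal(size=T), rng.normal(size=T)
u = rng.normal(size=T)                        # common shock creating endogeneity
x = 0.6 * z1 + 0.4 * z2 + u + rng.normal(size=T)
y = 1.0 + 2.0 * x + (u + rng.normal(size=T))  # error correlated with x through u

X = np.column_stack([np.ones(T), x])          # k = 2
Z = np.column_stack([np.ones(T), z1, z2])     # l = 3 > k

X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)           # first stage: P_Z X
b_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)  # second stage
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(b_ols[1])    # inconsistent
print(b_2sls[1])   # close to 2.0
```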

Asymptotic Efficiency

• The variance of the IV estimator is larger than that of OLS. (A large sample type of Gauss-Markov result is at work.)
(1) OLS is inconsistent.
(2) Mean squared error is uncertain:
MSE[estimator|β] = Variance + squared bias.

• IV may be better or worse. It depends on the data: X and ε.

Problems with 2SLS

• Z′X/T may not be sufficiently large. The covariance matrix for the IV estimator is
Asy. Cov(b) = σ2[(Z′X)(Z′Z)-1(X′Z)]-1
- If Z′X/T goes to 0 (weak instruments), the variance explodes.

• When there are many instruments, X̂ is too close to X; 2SLS becomes OLS.

• Popular misconception: "If only one variable in X is correlated with ε, the other coefficients are consistently estimated." False.
=> The problem is "smeared" over the other coefficients.

• What are the finite sample properties of bIV? Since we do not have the condition E[ε|X] = 0, we cannot conclude that bIV is unbiased, or that it has a Var[b2SLS] equal to its asymptotic covariance matrix.
=> In fact, b2SLS can have very bad small-sample properties.

Endogeneity Test (Hausman)

              X Exogenous               X Endogenous
OLS           Consistent, Efficient     Inconsistent
2SLS          Consistent, Inefficient   Consistent

• Base a test on d = b2SLS - bOLS
- We can use a Wald statistic: d′[Var(d)]-1d

Note: Under H0 (plim(X′ε/T) = 0): bOLS = b2SLS = b
Also, under H0: Var[b2SLS] = V2SLS > Var[bOLS] = VOLS
=> Under H0, one estimator is efficient, the other one is not.

• Q: What to use for Var(d)?
- Hausman (1978): V = Var(d) = V2SLS - VOLS
H = (b2SLS - bOLS)′[V2SLS - VOLS]-1(b2SLS - bOLS) →d χ2rank(V)

Endogeneity Test: The Wu Test

• The Hausman test is complicated to calculate.
• Simplification: The Wu test.
• Consider a regression y = Xβ + ε, an array of proper instruments Z, and an array of instruments W that includes Z plus other variables that may be either clean or contaminated.
• Wu test for H0: X is clean. Setup:
(1) Regress X on Z. Keep the fitted values X̂ = Z(Z′Z)-1Z′X.
(2) Using W as instruments, do a 2SLS regression of y on X; keep RSS1.
(3) Do a 2SLS regression of y on X and a subset of m columns of X̂ that are linearly independent of X. Keep RSS2.
(4) Do an F-test: F = [(RSS1 - RSS2)/m]/[RSS2/(T-k)].

Endogeneity Test: The Wu Test

• Under H0: X is clean, the F statistic has an approximate Fm,T-k distribution.

• Davidson and MacKinnon (1993, p. 239) point out that the DWH test really tests whether possible endogeneity of the right-hand-side variables not contained in the instruments makes any difference to the coefficient estimates.

• These types of exogeneity tests are usually known as DWH (Durbin, Wu, Hausman) tests.

Endogeneity Test: Augmented DWH Test

• Davidson and MacKinnon (1993) suggest an augmented regression test (DWH test), obtained by including the residuals of each endogenous right-hand side variable.

• Model: y = Xβ + Uγ + ε; we suspect X is endogenous.

• Steps for the augmented regression DWH test:
1. Regress x on the IVs (Z) and U:
x = ZΠ + Uφ + υ => save residuals vx
2. Do an augmented regression: y = Xβ + Uγ + vxδ + ε
3. Do a t-test of δ. If the estimate of δ, say d, is significantly different from zero, then OLS is not consistent.
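• A minimal sketch of the augmented-regression test with one suspect regressor and one instrument (no extra exogenous U; the simulated DGP is made up):

```python
# Sketch: DWH augmented regression -- add the first-stage residuals and t-test delta.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
T = 1000
z = rng.normal(size=T)
u = rng.normal(size=T)
x = 0.7 * z + u + rng.normal(size=T)        # endogenous: shares the shock u
y = 1.0 + 2.0 * x + (u + rng.normal(size=T))

# Step 1: regress x on a constant and z, keep residuals v_x
Z = np.column_stack([np.ones(T), z])
v_x = x - Z @ np.linalg.solve(Z.T @ Z, Z.T @ x)

# Step 2: augmented regression of y on [1, x, v_x]
Xa = np.column_stack([np.ones(T), x, v_x])
b = np.linalg.solve(Xa.T @ Xa, Xa.T @ y)
e = y - Xa @ b
s2 = e @ e / (T - Xa.shape[1])
se_delta = np.sqrt(s2 * np.linalg.inv(Xa.T @ Xa)[2, 2])

# Step 3: t-test of delta (the coefficient on v_x)
t_stat = b[2] / se_delta
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), T - Xa.shape[1]))
print(t_stat, p_value)                       # significant => OLS is not consistent
```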

Measurement Error

• DGP: y* = βx* + ε,   ε ~ iid D(0, σε2)

• But, we do not observe or measure correctly x*. We observe x, y:
x = x* + u,   u ~ iid D(0, σu2)   - no correlation with ε, v
y = y* + v,   v ~ iid D(0, σv2)   - no correlation with ε, u

• Let's consider two cases:

CASE 1 - Only x* is measured with error (y = y*):
y = β(x - u) + ε = βx + ε - βu = βx + w
E[x′w] = E[(x* + u)′(ε - βu)] = -βσu2 ≠ 0
=> CLM assumptions violated => OLS inconsistent!

CASE 2 - Only y* is measured with error:
y* = y - v = βx* + ε
=> y = βx* + ε + v = βx* + (ε + v)

• Q: What happens when y is regressed on x?
A: Nothing! We have our usual OLS problem since ε and v are independent of each other and of x*. CLM assumptions are not violated!
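• A minimal simulation of Case 1 (attenuation bias); β and the error variances are made-up values:

```python
# Sketch: regressing y on a mismeasured x biases the slope toward zero.
import numpy as np

rng = np.random.default_rng(10)
T, beta = 10000, 2.0
x_star = rng.normal(size=T)
y = beta * x_star + rng.normal(size=T)          # y measured without error (Case 1)
x = x_star + rng.normal(scale=0.8, size=T)      # x = x* + u, measurement error u

b_clean = (x_star @ y) / (x_star @ x_star)      # slope using the true x*
b_noisy = (x @ y) / (x @ x)                     # slope using the mismeasured x
print(b_clean)                                  # ~ 2.0
print(b_noisy)                                  # ~ beta * var(x*)/(var(x*)+var(u)) < 2.0
```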

Finding an Instrument: Not Easy

• The IV problem requires data on variables (Z) such that
(1) Cov(x, Z) ≠ 0   - relevance condition
(2) Cov(Z, ε) = 0   - valid (exogeneity) condition

Then, we do a first-stage regression to obtain fitted values of X:
x = ZΠ + Uδ + V,   V ~ N(0, σV2 I)
Then, using the fitted values, we estimate and do tests on β.

• Finding a Z that meets both requirements is not easy.
- The valid condition is not that complicated to meet.
- The relevance condition is more complicated: we need a Z correlated with X. But, the explanatory power of Z may not be enough to allow inference on β. In this case, we say Z is a weak instrument.

Weak Instruments: Finance application

• Finance example: the consumption CAPM.
• In both linear and nonlinear versions of the model, IVs are weak --see Neeley, Roy, and Whiteman (2001), and Yogo (2004).

• In the linear model in Yogo (2004):
X (endogenous variable): consumption growth.
Z (the IVs): twice lagged nominal interest rates, inflation, consumption growth, and log dividend-price ratio.

• But, log consumption is close to a random walk, so consumption growth is difficult to predict. This leads to the IVs being weak.
=> Yogo (2004) finds F-statistics for H0: Π = 0 in the 1st stage regression that lie between 0.17 and 3.53 for different countries.

Weak Instruments: Summary

• Even if the instrument is "good" --i.e., it meets the relevance condition--, matters can be made far worse with IV as opposed to OLS ("the cure can be worse...").

• Weak correlation between the IV and the endogenous regressor can pose severe finite-sample bias.

• Even small Cov(Z, ε) will cause inconsistency, and this will be exacerbated when Cov(X, Z) is small.

• Large T will not help. A&K and consumption CAPM tests have very large samples!

Weak Instruments: Detection and Remedies

• Symptom: The relevance condition, plim(Z′X/T) not zero, is close to being violated.

• Detection of weak IV:
- Standard F test in the 1st stage regression of xk on Z. Staiger and Stock (1997) suggest that F < 10 is a sign of problems.
- Low partial R2X,Z.
- Large Var[bIV], as well as potentially severe finite-sample bias.

• Remedy:
- Not much -- most of the discussion is about the condition, not what to do about it.
- Use LIML? Requires a normality assumption. Probably not too restrictive. (Text, 375-77)

Weak Instruments: Detection and Remedies

• Symptom: The valid condition, plim(Z′ε/T) zero, is close to being violated.

• Detection of instrument exogeneity problems:
- Endogenous IVs: inconsistency of bIV that makes it no better (and probably worse) than bOLS.
- Durbin-Wu-Hausman test: endogeneity of the problem regressor(s).

• Remedy:
- Avoid endogenous weak instruments. (Also avoid weak IVs!)
- General problem: It is not easy to find good instruments in theory and in practice. Find natural experiments.

M-Estimation

• An extremum estimator is one obtained as the optimizer of a criterion function, q(z, b).
Examples:
OLS:  b = arg max (-e′e/T)
MLE:  bMLE = arg max ln L = Σi=1,...,T ln f(yi, xi, b)
GMM:  bGMM = arg max -g(yi, xi, b)′ W g(yi, xi, b)

• There are two classes of extremum estimators:
- M-estimators: The objective function is a sample average or a sum.
- Minimum distance estimators: The objective function is a measure of a distance.

• "M" stands for a maximum or minimum estimator --Huber (1967).

M-Estimation

• The objective function is a sample average or a sum. For example, we want to minimize a population (first) moment:
minb E[q(z, β)]

- Using the LLN, we move from the population first moment to the sample average:
Σi q(zi, b)/T →p E[q(z, β)]
- We want to obtain: b = argmin Σi q(zi, b)   (or divided by T)
- In general, we solve the f.o.c. (or zero-score condition):
Zero-score: Σi ∂q(zi, b)/∂b′ = 0
- To check the s.o.c., we define the (pd) Hessian:
H = Σi ∂2q(zi, b)/∂b∂b′

• If s(z, b) = ∂q(z, b)/∂b′ exists (almost everywhere), we solve
Σi s(zi, bM)/T = 0    (*)

• If, in addition, EX[s(z, b)] = ∂/∂b′ EX[q(z, b)] -i.e., differentiation and integration are exchangeable-, then
EX[∂q(z, β)/∂β′] = 0.

• Under these assumptions, the M-estimator is said to be of ψ-type (ψ = s(z, b), the score). Often, bM is taken to be the solution of (*) without checking whether it is indeed a minimum.

• Otherwise, the M-estimator is of ρ-type (ρ = q(z, β)).

M-Estimation: LS & ML

• Least Squares
- DGP: y = f(x, β) + ε,  z = [y, x]
- q(z; β) = S(β) = ε′ε = Σi=1,...,T (yi - f(xi; β))2
- Now, we move from population to sample moments:
q(z; b) = S(b) = e′e = Σi=1,...,T (yi - f(xi; b))2
- bNLLS = argmin S(b)

• Maximum Likelihood
- Let f(xi, β) be the pdf of the data.
- L(x, β) = Πi=1,...,T f(xi; β)
- log L(x, β) = Σi=1,...,T ln f(xi; β)
- Now, we move from population to sample moments:
q(z, b) = -log L(x, b)
- bMLE = argmin [-log L(x; b)]

M-Estimators: Properties

• Under general assumptions, M-estimators are:
- Consistent: bM →p b0
- Asymptotically normal: bM →a N(b0, Var[b0])
- Var[bM] = (1/T) H0-1V0 H0-1
- If the model is correctly specified: -H = V. Then, Var[bM] = (1/T) V0-1.

- H and V are evaluated at b0:
H = Σi [∂2q(zi, b)/∂b∂b′]
V = Σi [∂q(zi, b)/∂b][∂q(zi, b)/∂b′]

Nonlinear Least Squares: Example

Example: Minβ S(β) = ½ Σi [yi - f(xi, β)]2

• From the f.o.c., we cannot solve for β explicitly. But, using some steps, we can still minimize the RSS to obtain estimates of β.

• Nonlinear regression algorithm:
1. Start by guessing a plausible value for β, say β0.
2. Calculate RSS for β0 => get RSS(β0).
3. Make small changes to β0 => get β1.
4. Calculate RSS for β1 => get RSS(β1).
5. If RSS(β1) < RSS(β0) => β1 becomes your new starting point.
6. Repeat steps 3-5 until RSS(βj) cannot be lowered => get βj.
=> βj is the (nonlinear) least squares estimate.

NLLS: Linearization

• We start with a nonlinear model: yi = f(xi, β) + εi

• We expand the regression around some point, β0:
f(xi, β) ≈ f(xi, β0) + Σk [∂f(xi, β0)/∂βk0](βk - βk0)
         = f(xi, β0) + Σk xi0k (βk - βk0)
         = [f(xi, β0) - Σk xi0k βk0] + Σk xi0k βk
         = fi0 + xi0′β
where
fi0 = f(xi, β0) - xi0′β0   (fi0 does not depend on unknowns)

Now, f(xi, β) is (approximately) linear in the parameters. That is,
yi = fi0 + xi0′β + ε0i   (ε0i = εi + linearization error i)
=> y0i = yi - fi0 = xi0′β + ε0i

NLLS: Linearization

• We linearized f(xi, β) to get:
y = f0 + X0β + ε0   (ε0 = ε + linearization error)
=> y0 = y - f0 = X0β + ε0

• Now, we can do OLS:
bNLLS = (X0′X0)-1X0′y0

Note: The X0 are called pseudo-regressors.

• In general, we get different bNLLS for different β0. An algorithm can be used to get the best bNLLS.

• We will resort to numerical optimization to find the bNLLS.

• Compute the asymptotic covariance matrix for the NLLS estimator as usual:
Est. Var[bNLLS|X0] = s2NLLS (X0′X0)-1
s2NLLS = [y - f(X, bNLLS)]′[y - f(X, bNLLS)]/(T - k)

• Since the results are asymptotic, we do not need a degrees of freedom correction. However, a df correction is usually included.


Gauss-Newton Algorithm

• bNLLS depends on β0. That is,
bNLLS(β0) = (X0′X0)-1X0′y0

• We use a Gauss-Newton algorithm to find the bNLLS. Recall GN:
βj+1 = βj + (J′J)-1J′ε    -- J: Jacobian = ∂f(xi; β)/∂β

• Given a bNLLS at step j, b(j), we find the bNLLS for step j+1 by:
b(j+1) = b(j) + [X0(j)′X0(j)]-1X0(j)′e0(j)
where the columns of X0(j) are the derivatives ∂f(xi, b(j))/∂b(j)′ and e0(j) = y - f[x, b(j)].

• The update vector is the vector of slopes in the regression of the residuals on X0. The update is zero when they are orthogonal. (Just like OLS.)

Box-Cox Transformation

• A simple transformation that allows non-linearities in the CLM:
y = f(xi, β) + ε = Σk xk(λ) βk + ε
xk(λ) = (xkλ - 1)/λ;    limλ→0 (xkλ - 1)/λ = ln xk

• For a given λ, OLS can be used. An iterative process can be used to estimate λ. The OLS s.e. have to be corrected. Not a very efficient method.

• NLLS or MLE will work fine.

• We can have a more general Box-Cox transformation model:
y(λ1) = Σk xk(λ2) βk + ε
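• A minimal Gauss-Newton sketch for a simple exponential regression y = b1·exp(b2·x) + ε (the model, data and starting values are made up for illustration):

```python
# Sketch: Gauss-Newton iterations b(j+1) = b(j) + (X0'X0)^{-1} X0'e0.
import numpy as np

rng = np.random.default_rng(11)
T = 200
x = rng.uniform(0, 2, size=T)
y = 1.5 * np.exp(0.8 * x) + 0.2 * rng.normal(size=T)

b = np.array([1.0, 0.5])                                 # starting values beta0
for _ in range(25):
    f = b[0] * np.exp(b[1] * x)                          # f(x, b)
    X0 = np.column_stack([np.exp(b[1] * x),              # df/db1
                          b[0] * x * np.exp(b[1] * x)])  # df/db2 (pseudo-regressors)
    e0 = y - f
    step = np.linalg.solve(X0.T @ X0, X0.T @ e0)         # regress residuals on X0
    b = b + step
    if np.max(np.abs(step)) < 1e-8:                      # update ~ 0 => X0 orthogonal to residuals
        break
print(b)                                                 # close to (1.5, 0.8)
```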

Testing non-linear restrictions

• Testing linear restrictions works as before.

• Non-linear restrictions change the usual tests. We want to test:
H0: R(β) = 0
where R(β) is a non-linear function, with rank[∂R(β)/∂β′] = rank[G(β)] = J.

• Let m = R(bNLLS) - 0. Then,
W = m′(Var[m|X])-1m = R(bNLLS)′(Var[R(bNLLS)|X])-1R(bNLLS)

But, we do not know the distribution of R(bNLLS); we know the distribution of bNLLS. Then, we linearize R(bNLLS) around β (= b0):
R(bNLLS) ≈ R(β) + G(bNLLS)(bNLLS - β)

• Recall √T(bM - b0) →d N(0, Var[b0]), where Var[b0] = H(β)-1V(β)H(β)-1.
=> √T[R(bNLLS) - R(β)] →d N(0, G(β) Var[b0] G(β)′)
=> Var[R(bNLLS)] = (1/T) G(β) Var[b0] G(β)′

• Then,
W = T R(bNLLS)′{G(bNLLS) Var[bNLLS] G(bNLLS)′}-1 R(bNLLS)
=> W →d χJ2
