Dealing With and Understanding Endogeneity: Enrique Pinzón
Enrique Pinzón
StataCorp LP
Model building
Endogeneity contradicts:
- Unobservables have no effect or explanatory power
- The covariates cause the outcome of interest
Endogeneity prevents us from making causal claims
Endogeneity is a fundamental concern of social scientists (first to the party)
$$ y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki} + \varepsilon_i $$
$$ E(\varepsilon_i \mid x_{1i}, \ldots, x_{ki}) = 0 $$
$$ E(\varepsilon \mid X) \neq 0 $$
$$ E(X'\varepsilon) \neq 0 $$
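A short worked step may help show why this condition matters for least squares. It is standard algebra rather than something taken from the slides, and it assumes the linear model $y = X\beta + \varepsilon$ with $E(x_i x_i')$ invertible, where $x_i$ (my notation) is the column vector of regressors for observation $i$:

```latex
\begin{aligned}
\hat{\beta}_{OLS} &= (X'X)^{-1}X'y
  = \beta + \left(\tfrac{1}{n}X'X\right)^{-1}\tfrac{1}{n}X'\varepsilon \\
\operatorname{plim}\,\hat{\beta}_{OLS} &= \beta + \left[E(x_i x_i')\right]^{-1}E(x_i\varepsilon_i)
\end{aligned}
```

When $E(x_i\varepsilon_i) \neq 0$, the second term does not vanish as the sample grows, so OLS does not recover $\beta$.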
The true model is
$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon $$
$$ E(\varepsilon \mid x_1, x_2) = 0 $$
but we fit the model that omits $x_2$:
$$ y = \beta_0 + \beta_1 x_1 + \nu $$
We know that
$$ \nu = \beta_2 x_2 + \varepsilon $$
and
$$ E(\nu \mid x_1) = \beta_2 E(x_2 \mid x_1) $$
$E(\nu \mid x_1) = 0$ only if $\beta_2 = 0$ or $E(x_2 \mid x_1) = 0$, that is, loosely, only if $x_2$ and $x_1$ are unrelated.
. clear
. set obs 10000
number of observations (_N) was 0, now 10,000
. set seed 111
. // Generating a common component for x1 and x2
. generate a = rchi2(1)
. // Generating x1 and x2
. generate x1 = rnormal() + a
. generate x2 = rchi2(2)-3 + a
. generate e = rchi2(1) - 1
. // Generating the outcome
. generate y = 1 - x1 + x2 + e
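The simulated data can be used to see the omitted-variable problem directly. The sketch below repeats the data-generating commands in do-file form and then fits the long and the short regression; the regressions and the scalar names `b_long` and `b_short` are my additions. In this design Cov(x1, x2) = Var(a) = 2 and Var(x1) = 1 + 2 = 3, so the short-regression slope should settle near -1 + 2/3 = -1/3 instead of -1.

```stata
clear
set obs 10000
set seed 111

// Same data-generating process as above
generate a  = rchi2(1)            // common component of x1 and x2
generate x1 = rnormal() + a
generate x2 = rchi2(2) - 3 + a
generate e  = rchi2(1) - 1
generate y  = 1 - x1 + x2 + e

// Long regression: the coefficient on x1 should be close to -1
regress y x1 x2
scalar b_long = _b[x1]

// Short regression (x2 omitted): the coefficient on x1 should be close to -1/3
regress y x1
scalar b_short = _b[x1]

display "long: " b_long "   short: " b_short
```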
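The demand and supply equations for the market are given by
$$ Q_d = \beta P_d + \varepsilon_d $$
$$ Q_s = \theta P_s + \varepsilon_s $$
We assume:
- All quantities are scalars
- $\beta < 0$ and $\theta > 0$
- $E(\varepsilon_d) = E(\varepsilon_s) = E(\varepsilon_d \varepsilon_s) = 0$
- $E(\varepsilon_d^2) \equiv \sigma_d^2$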
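For OLS on the demand equation to be valid we would need
$$ E(P_d \varepsilon_d) = 0 $$
Using our equilibrium conditions ($Q_d = Q_s$, $P_d = P_s$), which give $P_d = (\varepsilon_s - \varepsilon_d)/(\beta - \theta)$, and the fact that $\varepsilon_s$ and $\varepsilon_d$ are uncorrelated we get
$$
\begin{aligned}
E(P_d \varepsilon_d) &= E\left[\frac{\varepsilon_s - \varepsilon_d}{\beta - \theta}\,\varepsilon_d\right] \\
&= \frac{E(\varepsilon_s \varepsilon_d)}{\beta - \theta} - \frac{E(\varepsilon_d^2)}{\beta - \theta} \\
&= -\frac{E(\varepsilon_d^2)}{\beta - \theta} \\
&= -\frac{\sigma_d^2}{\beta - \theta}
\end{aligned}
$$
Because $\beta < 0$ and $\theta > 0$, this is strictly positive whenever $\sigma_d^2 > 0$, so the condition fails.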
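A small simulation can make the result concrete. This is only a sketch under assumptions that are not in the slides: the parameter values beta = -1 and theta = 1, standard normal shocks, and the variable and scalar names are all hypothetical. With these values the derivation implies E(P ε_d) = -1/(beta - theta) = 0.5, and the OLS slope of quantity on price converges to 0 rather than to -1.

```stata
clear
set seed 111
set obs 10000

// Hypothetical parameter values (not from the slides)
scalar beta  = -1                 // demand slope, beta < 0
scalar theta =  1                 // supply slope, theta > 0

generate ed = rnormal()           // demand shock, sigma_d^2 = 1
generate es = rnormal()           // supply shock, independent of ed

// Equilibrium price and quantity implied by Qd = Qs and Pd = Ps
generate P = (es - ed)/(scalar(beta) - scalar(theta))
generate Q = scalar(beta)*P + ed

// Sample analogue of E(P*ed); the population value here is 0.5
generate Ped = P*ed
summarize Ped
scalar m_Ped = r(mean)

// OLS on the demand equation does not recover beta = -1
regress Q P, noconstant
```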
$$ y = \sin(x) + \varepsilon $$
$$ E(\varepsilon \mid x) = 0 $$
$$ y = x\beta + \nu $$
$$ y = x\beta - x\beta + \sin(x) + \varepsilon $$
$$ \nu \equiv \sin(x) - x\beta + \varepsilon $$
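Taking the conditional expectation of the linear model's error makes the problem explicit. This step is not on the slides; it follows from the definition of $\nu$ and from $E(\varepsilon \mid x) = 0$:

```latex
E(\nu \mid x) = \sin(x) - x\beta + E(\varepsilon \mid x) = \sin(x) - x\beta
```

No single value of $\beta$ makes $\sin(x) - x\beta$ equal to zero for every $x$, so $E(\nu \mid x) \neq 0$: functional-form misspecification behaves like endogeneity.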
$$ E(y \mid X_1, y_2 \geq 0) = X_1\beta + E(\varepsilon \mid X_1, y_2 \geq 0) $$
$$ E(y \mid X) = X_1\beta + E(\varepsilon \mid X) $$
$$ E(\varepsilon \mid X) \neq 0 $$
. clear
. set seed 111
. quietly set obs 20000
.
. // Generating Endogenous Components
.
. matrix C = (1, .8\ .8, 1)
. quietly drawnorm e v, corr (C)
.
. // Generating exogenous variables
.
. generate x1 = rbeta(2 ,3)
. generate x2 = rbeta(2 ,3)
. generate x3 = rnormal()
. generate x4 = rchi2(1)
.
. // Generating outcome variables
.
. generate y1 = x1 - x2 + e
. generate y2 = 2 + x3 - x4 + v
. quietly replace y1 = . if y2 <=0
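One way to see the selection problem in these data is to look at the outcome error within the selected sample. The sketch below repeats the simulation in do-file form and adds two checks of my own (the scalar name `m_sel` is hypothetical). Because x1 and x2 are generated independently of the selection equation in this design, I would expect the damage to show up mainly in the intercept, whose true value is 0.

```stata
clear
set seed 111
quietly set obs 20000

// Correlated errors for the outcome and selection equations
matrix C = (1, .8 \ .8, 1)
quietly drawnorm e v, corr(C)

// Exogenous variables
generate x1 = rbeta(2, 3)
generate x2 = rbeta(2, 3)
generate x3 = rnormal()
generate x4 = rchi2(1)

// Outcome is observed only when y2 > 0
generate y1 = x1 - x2 + e
generate y2 = 2 + x3 - x4 + v
quietly replace y1 = . if y2 <= 0

// Mean of e among selected observations: no longer zero
summarize e if y2 > 0
scalar m_sel = r(mean)

// OLS on the selected sample; the true intercept is 0
regress y1 x1 x2
```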
. clear
. set seed 111
. set obs 10000
number of observations (_N) was 0, now 10,000
. generate a = rchi2(2)
. generate e = rchi2(1) -3 + a
. generate v = rchi2(1) -3 + a
. generate x2 = rnormal()
. generate z = rnormal()
. generate x1 = 1 - z + x2 + v
. generate y = 1 - x1 + x2 + e
. reg y x1 x2
      Source |       SS           df       MS      Number of obs   =    10,000
-------------+----------------------------------   F(2, 9997)      =   1571.70
       Model |  12172.8278         2  6086.41388   Prob > F        =    0.0000
    Residual |  38713.3039     9,997  3.87249214   R-squared       =    0.2392
-------------+----------------------------------   Adj R-squared   =    0.2391
       Total |  50886.1317     9,999  5.08912208   Root MSE        =    1.9679
. quietly regress x1 z x2
. predict double x1hat
(option xb assumed; fitted values)
. preserve
. replace x1 = x1hat
(10,000 real changes made)
. quietly regress y x1 x2
. estimates store manual
. restore
Instrumented: x1
Instruments: x2 z
. estimates store tsls
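The command that produced the "Instrumented / Instruments" footer is not shown in this extract. Judging by that footer and the stored name `tsls`, it was presumably a call to `ivregress 2sls`; the exact call below is my assumption. The sketch puts the manual two-step and the built-in estimator side by side on the same simulated data.

```stata
clear
set seed 111
set obs 10000
generate a  = rchi2(2)
generate e  = rchi2(1) - 3 + a
generate v  = rchi2(1) - 3 + a
generate x2 = rnormal()
generate z  = rnormal()
generate x1 = 1 - z + x2 + v
generate y  = 1 - x1 + x2 + e

// Manual two-step: replace x1 with its first-stage fitted values
quietly regress x1 z x2
predict double x1hat
preserve
replace x1 = x1hat
quietly regress y x1 x2
estimates store manual
restore

// Built-in 2SLS (assumed form of the command behind the footer)
ivregress 2sls y x2 (x1 = z)
estimates store tsls

// Point estimates should agree; the manual standard errors should not be trusted
estimates table manual tsls, b se
```

The manual second stage computes its residuals from the fitted values of x1 rather than from x1 itself, which is why only the `ivregress` standard errors are appropriate.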
$$ E[g(x, \theta)] = 0 $$
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  y <-       |
          x1 |  -1.015205   .0252942   -40.14   0.000    -1.064781   -.9656292
          x2 |   1.005596   .0348808    28.83   0.000     .9372314    1.073961
       _cons |   1.042625   .0357962    29.13   0.000     .9724656    1.112784
-------------+----------------------------------------------------------------
  x1 <-      |
          x2 |   .9467476   .0244521    38.72   0.000     .8988225    .9946728
           z |   -.987925   .0241963   -40.83   0.000    -1.035349   -.9405011
       _cons |   1.011304   .0243764    41.49   0.000     .9635269    1.059081

             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb           |
          x1 |  -1.015205   .0252261   -40.24   0.000    -1.064647   -.9657627
          x2 |   1.005596   .0362111    27.77   0.000      .934624    1.076569
       _cons |   1.042625   .0363351    28.69   0.000     .9714094     1.11384
-------------+----------------------------------------------------------------
xpi          |
          x2 |   .9467476   .0251266    37.68   0.000     .8975004    .9959949
           z |   -.987925   .0233745   -42.27   0.000    -1.033738   -.9421118
       _cons |   1.011304   .0243761    41.49   0.000     .9635274     1.05908
Where
$$ \varepsilon = y - (\beta_0 + x_1\beta_1 + x_2\beta_2) $$
$$ \nu = x_1 - (\pi_0 + x_2\pi_1 + z\pi_2) $$
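For readers who want to try the method-of-moments route, here is a minimal single-equation sketch. It is not the two-equation (`xb` / `xpi`) specification whose output appears above, which this extract does not show; it is the simpler textbook version, assuming the moment conditions E[(1, x2, z)'ε] = 0. The parameter names `b0`, `b1`, `b2` are mine. With as many instruments as parameters, the estimates should match 2SLS.

```stata
clear
set seed 111
set obs 10000
generate a  = rchi2(2)
generate e  = rchi2(1) - 3 + a
generate v  = rchi2(1) - 3 + a
generate x2 = rnormal()
generate z  = rnormal()
generate x1 = 1 - z + x2 + v
generate y  = 1 - x1 + x2 + e

// Residual function for the structural equation; the instruments are
// x2, z, and the constant that gmm adds by default
gmm (y - {b1}*x1 - {b2}*x2 - {b0}), instruments(x2 z)

// Parameters are stored in order of appearance: b1, b2, b0
matrix b = e(b)
matrix list b
```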
. clear
. set seed 111
. quietly set obs 20000
.
. // Generating Endogenous Components
.
. matrix C = (1, .4\ .4, 1)
. quietly drawnorm e v, corr (C)
.
. // Generating exogenous variables
.
. generate x1 = rbeta(2 ,3)
. generate x2 = rbeta(2 ,3)
. generate x3 = rnormal()
. generate x4 = rchi2(1)
.
. // Generating outcome variables
.
. generate y1 = -1 - x1 - x2 + e
. generate y2 = (1 + x3 - x4)*.5 + v
. quietly replace y1 = . if y2 <=0
. generate yp = y1 !=.
$$
\begin{aligned}
E(y \mid X_1, y_2 \geq 0) &= X_1\beta + E(\varepsilon \mid X_1, y_2 \geq 0) \\
&= X_1\beta + \beta_s \frac{\phi(Z\gamma)}{\Phi(Z\gamma)}
\end{aligned}
$$
In other words, regress $y$ on $X_1$ and $\phi(Z\gamma)/\Phi(Z\gamma)$.
y1           |
          x1 |  -1.117284   .0464766   -24.04   0.000    -1.208377   -1.026192
          x2 |  -1.049901   .0458861   -22.88   0.000    -1.139836   -.9599656
       _cons |  -.9559192   .0329022   -29.05   0.000    -1.020406    -.891432
-------------+----------------------------------------------------------------
select       |
          x3 |   .4990633   .0104891    47.58   0.000      .478505    .5196216
          x4 |  -.4785327   .0101864   -46.98   0.000    -.4984976   -.4585677
       _cons |   .4807396   .0125354    38.35   0.000     .4561707    .5053084
-------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =   208.78   Prob > chi2 = 0.0000
. estimates store heckman
. quietly probit yp x3 x4
. matrix A = e(b)
. quietly predict double xb, xb
. quietly generate double mills = normalden(xb)/normal(xb)
. quietly regress y1 x1 x2 mills
. matrix B = A, _b[x1], _b[x2], _b[_cons], _b[mills]
x1 -1.117284 -1.1172841
.04647661 .04647661
x2 -1.0499007 -1.0499007
.04588611 .04588611
L .72875877
.02963515
_cons -.95591918 -.95592061
.03290222 .03290166
legend: b/se
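The `heckman` call behind the stored results is not shown in this extract. The LR test line suggests the maximum likelihood version, and the `select` equation label suggests the `select(varlist)` form, so the call below is my best guess rather than the author's command. The sketch rebuilds the simulated data and fits the model; the true slopes are -1 and the true error correlation is 0.4.

```stata
clear
set seed 111
quietly set obs 20000

// Correlated errors for the outcome and selection equations
matrix C = (1, .4 \ .4, 1)
quietly drawnorm e v, corr(C)

// Exogenous variables
generate x1 = rbeta(2, 3)
generate x2 = rbeta(2, 3)
generate x3 = rnormal()
generate x4 = rchi2(1)

// Outcome observed only when y2 > 0
generate y1 = -1 - x1 - x2 + e
generate y2 = (1 + x3 - x4)*.5 + v
quietly replace y1 = . if y2 <= 0

// Maximum likelihood Heckman selection model (assumed form of the call);
// missing values of y1 mark the unselected observations
heckman y1 x1 x2, select(x3 x4)
estimates store heckman

display "estimated rho = " e(rho)
```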
. clear
. set seed 111
. set obs 10000
number of observations (_N) was 0, now 10,000
. generate a = rchi2(2)
. generate e = rchi2(1) -3 + a
. generate v = rchi2(1) -3 + a
. generate x2 = rnormal()
. generate z = rnormal()
. generate x1 = 1 - z + x2 + v
. generate y = 1 - x1 + x2 + e
$$ y = X_1\beta_1 + X_2\beta_2 + \varepsilon $$
$$ X_2 = X_1\Pi_1 + Z\Pi_2 + \nu $$
$$ \varepsilon = \nu\rho + \eta $$
$$ E(\eta \mid X_1, X_2) = 0 $$
$$ y = X_1\beta_1 + X_2\beta_2 + \nu\rho + \eta $$
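The last equation suggests a two-step recipe: estimate the first-stage residual and include it as an extra regressor. Below is a minimal sketch on the simulated data, where `x1` plays the role of the endogenous regressor and `x2` the exogenous one (the reverse of the X1/X2 labels in the equations). The variable name `vhat` is mine. In this design Cov(e, v)/Var(v) = 4/6, so the coefficient on the residual should be near 0.67.

```stata
clear
set seed 111
set obs 10000
generate a  = rchi2(2)
generate e  = rchi2(1) - 3 + a
generate v  = rchi2(1) - 3 + a
generate x2 = rnormal()
generate z  = rnormal()
generate x1 = 1 - z + x2 + v
generate y  = 1 - x1 + x2 + e

// Step 1: first stage for the endogenous regressor, keep the residual
regress x1 x2 z
predict double vhat, residuals

// Step 2: add the residual as a control function
regress y x1 x2 vhat
```

The point estimates on x1 and x2 coincide with 2SLS in this linear case. The reported standard errors ignore that `vhat` is estimated, so they would need a correction (for example a bootstrap over both steps) before being used for inference.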
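$$ y_1^* = y_2\beta + x\Pi + \varepsilon $$
$$ y_2 = x\gamma_1 + z\gamma_2 + \nu $$
$$ y_1 = j \quad \text{if } \kappa_{j-1} < y_1^* < \kappa_j $$
$$ \kappa_0 = -\infty < \kappa_1 < \ldots < \kappa_k = \infty $$
$$ \varepsilon \sim N(0, 1) $$
$$ \operatorname{cov}(\nu, \varepsilon) \neq 0 $$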
(StataCorp LP) October 20, 2016 Barcelona 55 / 59
gsem Representation
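$$ y_{1gsem}^* = y_2 b + x\pi + t + L\alpha $$
$$ t \sim N(0, 1) $$
$$ L \sim N(0, 1) $$
Where $y_{1gsem}^* = M y_1^*$ and $M$ is a constant. Noting that
$$
\begin{aligned}
y_{1gsem}^* &= M y_1^* \\
y_2 b + x\pi + t + L\alpha &= y_2 M\beta + xM\Pi + M\varepsilon \\
M\varepsilon &= t + L\alpha \\
M^2 \operatorname{Var}(\varepsilon) &= \operatorname{Var}(t + L\alpha) \\
M^2 &= 1 + \alpha^2 \\
M &= \sqrt{1 + \alpha^2}
\end{aligned}
$$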
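The practical consequence of the scaling is a rule for recovering the original parameters from the gsem ones. The step below is my own reading of the derivation, not something stated on the slide; it matches coefficients on $y_2$ and $x$ in the second line and assumes $t$ and $L$ are independent, which is what makes $\operatorname{Var}(t + L\alpha) = 1 + \alpha^2$:

```latex
b = M\beta, \quad \pi = M\Pi
\quad\Longrightarrow\quad
\beta = \frac{b}{\sqrt{1 + \alpha^2}}, \qquad \Pi = \frac{\pi}{\sqrt{1 + \alpha^2}}
```

Since $y_{1gsem}^* = M y_1^*$, the cutpoints on the gsem scale would be rescaled by the same factor.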
Ordered Probit with Endogeneity: Simulation
. clear
. set seed 111
. set obs 10000
number of observations (_N) was 0, now 10,000
. forvalues i = 1/5 {
2. gen x`i' = rnormal()
3. }
.
. mat C = [1,.5 \ .5, 1]
. drawnorm e1 e2, cov(C)
.
. gen y2 = 0
. forvalues i = 1/5 {
2. quietly replace y2 = y2 + x`i'
3. }
. quietly replace y2 = y2 + e2
.
. gen y1star = y2 + x1 + x2 + e1
. gen xb1 = y2 + x1 + x2
.
. gen y1 = 4
.
. quietly replace y1 = 3 if xb1 + e1 <=.8
. quietly replace y1 = 2 if xb1 + e1 <=.3
. quietly replace y1 = 1 if xb1 + e1 <=-.3
. quietly replace y1 = 0 if xb1 + e1 <=-.8
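The estimation command for this simulation is not part of the extract. The sketch below shows one way the gsem representation could be fitted. The specific parameterization is an assumption on my part: a shared latent variable `L` with unit variance whose loading is constrained to the same value `a` in both equations, which can only represent a positive error correlation (the simulated correlation is 0.5). The `nlcom` step divides the gsem coefficient by sqrt(1 + a^2) to return to the original scale, on which the coefficient on `y2` is 1.

```stata
clear
set seed 111
set obs 10000
forvalues i = 1/5 {
    gen x`i' = rnormal()
}
mat C = [1, .5 \ .5, 1]
drawnorm e1 e2, cov(C)

// Endogenous regressor
gen y2 = x1 + x2 + x3 + x4 + x5 + e2

// Ordered outcome with cutpoints -.8, -.3, .3, .8
gen xb1 = y2 + x1 + x2
gen y1 = 4
quietly replace y1 = 3 if xb1 + e1 <= .8
quietly replace y1 = 2 if xb1 + e1 <= .3
quietly replace y1 = 1 if xb1 + e1 <= -.3
quietly replace y1 = 0 if xb1 + e1 <= -.8

// Assumed parameterization: common loading a, Var(L) fixed at 1
gsem (y1 <- y2 x1 x2 L@a, oprobit) (y2 <- x1 x2 x3 x4 x5 L@a), var(L@1)

// Rescale the gsem coefficient on y2 to the original ordered-probit scale
nlcom (beta_y2: _b[y1:y2]/sqrt(1 + _b[y1:L]^2))
```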