ARE 213 Notes




ARE 213 Applied Econometrics
UC Berkeley Department of Agricultural and Resource Economics
Ordinary Least Squares and Agnostic Regression:
Why We Do the Things We Do[1]

Linear regression is the bread and butter of econometrics, and we begin the course with a quick review of it. But this review is unlikely to resemble anything you saw in ARE 212 or earlier, primarily because it focuses on what linear regression is, rather than on what you would like it to be. That is to say, as long as you satisfy certain trivial conditions (e.g., your matrix of regressors is full rank), you can always run a linear regression. And there is absolutely nothing wrong with doing that, regardless of problems involving endogeneity or omitted variables or measurement error, as long as you interpret the results appropriately. To paraphrase the National Rifle Association: regressions don't give biased inferences; people interpreting those regressions give biased inferences. Or, in the words of Sergeant Joe Friday, "All we want are the facts." This lecture will focus on what we refer to as agnostic regression.

1 The Conditional Expectation Function
Consider a dependent variable, y_i, and a vector of explanatory variables, x_i. We are interested
in the relationship between the dependent variable and the explanatory variables. (Josh
Angrist: What matters for empirical work, as in life, is relationships.) There are several
possible reasons that we may be interested in this relationship, including:
1. Description: What is the observed relationship between y and x?

2. Prediction: Can we use x to create a good forecast of y?

3. Causality: What happens to y if we experimentally manipulate x?

[1] These notes are heavily derived from Josh Angrist's Empirical Strategies lecture notes. Any errors in transcription or interpretation are my own. The answer to this question is not going to be the Gauss-Markov Theorem.

It is generally the last item that causes rants about exogeneity conditions and so forth.
We will ignore all that negative energy for the moment and, instead of worrying about what
you can't infer, focus on what you can infer. Think positive!
Of course, few real-world relationships are deterministic. Recognizing this fact, we focus
on relationships that hold on average, or in expectation. Given our variables y and x,
we may be interested in the conditional expectation of y given x. That is to say, given a
particular value of x, where is the distribution of y centered? This relationship is given by
the Conditional Expectation Function, or the CEF.

E[y_i | x_i] = h(x_i)

We define the CEF residual as:

ε_i = y_i - h(x_i),  where  E[ε_i | x_i] = 0

Note that, because ε_i is the CEF residual, E[ε_i | x_i] = 0 holds by definition; we do not require any exogeneity assumptions regarding x_i.

Proof.

E[ε_i | x_i] = E[y_i - h(x_i) | x_i] = E[y_i | x_i] - E[h(x_i) | x_i] = E[y_i | x_i] - h(x_i) = E[y_i | x_i] - E[y_i | x_i] = 0

To recap, the CEF residual always has zero conditional expectation. By definition. No assumptions necessary. Always.

Theorem. CEF residuals are mean-independent of the arguments in the CEF (x_i). They are therefore orthogonal to any function of the conditioning variables.

Proof. Iterated expectations.

E[ε_i f(x_i)] = E[ E[ε_i f(x_i) | x_i] ] = E[ E[ε_i | x_i] f(x_i) ] = E[0] = 0

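These properties are easy to verify numerically. The following is a minimal sketch (assuming Python with numpy; the quadratic CEF is an arbitrary choice for illustration, not something from these notes) that constructs a CEF residual and checks that its sample covariance with several functions of x_i is approximately zero.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

x = rng.uniform(-2, 2, size=n)
eps = rng.normal(size=n)             # CEF residual by construction: E[eps | x] = 0
y = x**2 + eps                       # CEF: E[y | x] = h(x) = x^2

resid = y - x**2                     # epsilon_i = y_i - h(x_i)

# Sample analogs of E[eps * f(x)] for several functions f of the conditioning variable
for f in (lambda v: v, np.sin, np.exp, lambda v: v**3):
    print(round(float(np.mean(resid * f(x))), 3))   # each is approximately 0
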
More importantly, the CEF is the best function of x that exists for predicting y (where "best" is defined in terms of expected squared loss).

Theorem. E[y_i | x_i] = argmin_g E[(y_i - g(x_i))^2]. In other words, the CEF is the function that minimizes the expected squared deviations from y_i. We say that the CEF is the minimum mean-square error (MMSE) predictor for y_i given x_i.

Proof.

E[(y_i - g(x_i))^2] = E[((y_i - E[y_i|x_i]) + (E[y_i|x_i] - g(x_i)))^2]

= E[(y_i - E[y_i|x_i])^2 + 2(y_i - E[y_i|x_i])(E[y_i|x_i] - g(x_i)) + (E[y_i|x_i] - g(x_i))^2]

= E[ E[ (y_i - E[y_i|x_i])^2 + 2(y_i - E[y_i|x_i])(E[y_i|x_i] - g(x_i)) + (E[y_i|x_i] - g(x_i))^2 | x_i ] ]

= E[ E[(y_i - E[y_i|x_i])^2 | x_i] + 2(E[y_i|x_i] - E[y_i|x_i])(E[y_i|x_i] - g(x_i)) + (E[y_i|x_i] - g(x_i))^2 ]

= E[ E[(y_i - E[y_i|x_i])^2 | x_i] ] + E[(E[y_i|x_i] - g(x_i))^2]

It should be clear that choosing g(x_i) such that g(x_i) = E[y_i|x_i] minimizes the second term in the last line. The first term in the last line does not contain g(x_i) and is therefore unaffected by our choice of g(x_i). The CEF, E[y_i|x_i], therefore solves min_g E[(y_i - g(x_i))^2].

2 Regression and the CEF: Why We Regress
Clearly the CEF has some desirable properties in terms of summarizing the relationship between x_i and y_i and making predictions about y_i given x_i. In particular, we have seen that it is the MMSE predictor of y_i. But what does this have to do with linear regression, and why might we want to use linear regression?

2.1 Reason the First: Regression-CEF Theorem
Theorem. If the CEF is linear, then the regression of y_i on x_i estimates the CEF. Formally, if E[y_i | x_i] = x_i'β, then β = E[x_i x_i']^{-1} E[x_i y_i] (which is what the regression coefficient converges to).

Proof.

E[x_i x_i']^{-1} E[x_i y_i] = E[x_i x_i']^{-1} E[ E[x_i y_i | x_i] ] = E[x_i x_i']^{-1} E[ x_i E[y_i | x_i] ]

= E[x_i x_i']^{-1} E[x_i x_i'β] = E[x_i x_i']^{-1} E[x_i x_i'] β = β

Of course, there is no reason the CEF has to be linear. Two of the most common sufficient conditions for a linear CEF are: (1) joint normality of x_i and y_i or (2) a saturated model for discrete regressors. A saturated model is one in which you estimate a separate parameter for each point in the support of x_i (e.g., you have a separate dummy variable for each unique value of the vector x_i in your data set). This is more common in empirical work than joint normality.

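To make the saturated case concrete, here is a small sketch (assuming Python with numpy; the discrete regressor and its cell means are invented for illustration): a regression on a full set of dummies reproduces the CEF exactly, because each fitted value is simply the corresponding cell mean.

import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.integers(0, 3, size=n)                           # discrete regressor with support {0, 1, 2}
y = np.array([1.0, 4.0, 2.5])[x] + rng.normal(size=n)    # CEF: E[y | x = k] is 1.0, 4.0, or 2.5

# Saturated design: an intercept plus dummies for x = 1 and x = 2
X = np.column_stack([np.ones(n), x == 1, x == 2]).astype(float)
b = np.linalg.lstsq(X, y, rcond=None)[0]

fitted_cef = np.array([b[0], b[0] + b[1], b[0] + b[2]])
cell_means = np.array([y[x == k].mean() for k in range(3)])
print(np.allclose(fitted_cef, cell_means))               # True: saturated regression = CEF
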
In most cases, however, the CEF is not linear. But we still run regressions anyway. Why do we do this? One reason is that it is computationally tractable and that we understand its properties both when it is correctly specified and under misspecification (or, at least, we understand its properties under misspecification better than we understand the properties of other estimators). Nevertheless, there are good theoretical reasons to regress as well.

2.2 Reason the Second: BLP Theorem
Theorem. If you want to predict y_i, and you limit yourself to linear functions of x_i, then x_i'β = x_i' E[x_i x_i']^{-1} E[x_i y_i] is the best linear predictor (BLP) of y_i in a MMSE sense. Formally, β = E[x_i x_i']^{-1} E[x_i y_i] = argmin_b E[(y_i - x_i'b)^2].

Proof.

∂E[(y_i - x_i'b)^2]/∂b = -2 E[x_i (y_i - x_i'b)] = 0

E[x_i y_i] - E[x_i x_i'] b = 0

b = E[x_i x_i']^{-1} E[x_i y_i] = β

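The sample analog of β is exactly the familiar OLS formula. As a quick check, the sketch below (assuming Python with numpy; the data-generating process is arbitrary) computes the sample version of E[x_i x_i']^{-1} E[x_i y_i] directly and confirms that it matches a least-squares solver.

import numpy as np

rng = np.random.default_rng(2)
n = 50_000

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])

# Sample analog of beta = E[x x']^{-1} E[x y]
b_moments = np.linalg.solve(X.T @ X, X.T @ y)
b_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(b_moments, b_lstsq))   # True
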
If you're limiting yourself to linear combinations of x_i, then linear regression gives you the best predictor of y_i. Of course, this isn't a big surprise given that the OLS estimator is derived by minimizing the sample analog of E[(y_i - x_i'b)^2]. Regardless, this property is nice if you're in the business of forecasting, but it's not as useful if your interest is in estimating the CEF as a summary of the underlying relationship between y_i and x_i. Which brings us to our third reason to regress (arguably the best reason).

2.3 Reason the Third: Regression Approximation Theorem
Theorem. The MMSE linear approximation to the CEF is β = E[x_i x_i']^{-1} E[x_i y_i]. Formally, β = E[x_i x_i']^{-1} E[x_i y_i] = argmin_b E[(E[y_i|x_i] - x_i'b)^2].

Proof.

∂E[(E[y_i|x_i] - x_i'b)^2]/∂b = -2 E[x_i (E[y_i|x_i] - x_i'b)] = 0

E[ E[x_i y_i | x_i] ] - E[x_i x_i'] b = 0

b = E[x_i x_i']^{-1} E[x_i y_i] = β

So regression provides the best linear approximation to the CEF, even when the CEF is non-linear. Regression can therefore give you a pretty decent approximation of the CEF as long as you don't try to extrapolate beyond the support of x_i.

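A small simulation makes the approximation point concrete (again only a sketch, assuming Python with numpy; the exponential CEF and the support [0, 2] are arbitrary choices): the fitted line tracks the CEF reasonably well inside the support of x_i and badly outside it.

import numpy as np

rng = np.random.default_rng(3)
n = 200_000

x = rng.uniform(0, 2, size=n)                  # support of x is [0, 2]
y = np.exp(x) + rng.normal(size=n)             # nonlinear CEF: E[y | x] = exp(x)

X = np.column_stack([np.ones(n), x])
a, b = np.linalg.lstsq(X, y, rcond=None)[0]

for x0 in (0.5, 1.0, 1.5, 3.0):                # 3.0 lies outside the support
    print(x0, round(a + b * x0, 2), round(float(np.exp(x0)), 2))
# Inside [0, 2] the linear fit is close to the CEF; at x = 3 it is far off.
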
3 Discussion
If your object of interest is the CEF, then linear regression is a good tool for estimating it. Specifically, it is the best linear predictor in terms of minimizing the mean squared error from the CEF. More importantly, this result depends on absolutely nothing. In particular, it does not depend on:

- Whether your data are i.i.d.

- Whether you treat your regressors as random variables or fixed quantities.

- Whether your regressors are correlated with the CEF residuals (by definition, they are not, since the residuals are mean-independent of any function of the conditioning variables).

- Whether the CEF is linear or not.

- Whether your dependent variable is continuous, discrete, non-negative, or anything else.

Regression is therefore remarkably robust as an estimation tool, provided that you interpret it for what it actually is (an approximation of the conditional expectation function) rather than what you might like it to be (an estimate of a causal relationship). So if you're only interested in description or prediction, we can probably end the class right here.

4 Application: Predicting College Success
Geiser and Santelices (2007) use high school GPA, standardized test scores (SAT), and other covariates to predict college performance (college GPA) using linear regression for UC freshmen entering between Fall 1996 and Fall 1999. The results from this exercise are listed in Table 4 of their article, reproduced below.

Table 4: Relative Contribution of Admissions Factors in Predicting Cumulative Fourth-Year GPA
(Standardized Regression Coefficients)

         HS     SAT I   SAT I   SAT II   SAT II  SAT II    Parents'   Family  School             % Explained
         GPA    Verbal  Math    Writing  Math    3rd Test  Education  Income  AP Rank  Number    Variance
Model 1  0.41   x       x       x        x       x         0.12       0.03    0.08     59,637    20.4%
Model 2  x      0.28    0.10    x        x       x         0.03       0.02    0.01     59,420    13.4%
Model 3  x      x       x       0.30     0.04    0.12      0.05       0.02    -0.01    58,879    16.9%
Model 4  0.36   0.23    0.00    x        x       x         0.05       0.02    0.05     59,321    24.7%
Model 5  0.33   x       x       0.24     -0.05   0.10      0.06       0.02    0.04     58,791    26.3%
Model 6  x      0.06    -0.01   0.26     0.04    0.12      0.04       0.02    -0.01    58,627    17.0%
Model 7  0.34   0.08    -0.02   0.19     -0.04   0.09      0.05       0.02    0.04     58,539    26.5%

Notes: Boldface in the original indicates coefficients statistically significant at the 99% confidence level. Source: UC Corporate Student System data on first-time freshmen entering between Fall 1996 and Fall 1999.

They find that, in this sample, high school GPA is a more effective predictor of college GPA than any other measure. In particular, it is much more effective than SAT I (the standard SAT). This can be seen in at least two ways. First, in comparing Model 1, which uses high school GPA as a predictor, and Model 2,
which uses SAT I as a predictor, we see that Model 1 has a much higher R^2; in other words, high school GPA explains much more of the variation in college GPA than SAT I score does. (Note: this is probably the only time in this course that you will hear reference to R^2. In general it is not an interesting statistic for answering policy-relevant questions.) We also see, in Model 7, that the standardized coefficient on high school GPA is substantially larger than the standardized coefficient(s) on SAT I (a standardized coefficient is a normal regression coefficient that has been rescaled to indicate how many standard deviations y changes with a one standard deviation change in x). Moving up one standard deviation in the high school GPA distribution is therefore much more beneficial for college GPA (in a predictive sense) than moving up one standard deviation in the SAT I score distribution.

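For reference, a standardized coefficient can be obtained either by z-scoring the variables before running the regression or by rescaling an ordinary coefficient by the ratio of standard deviations. The sketch below (hypothetical variables and numbers, assuming Python with numpy; this is not the authors' specification) shows that the two routes agree.

import numpy as np

rng = np.random.default_rng(4)
n = 10_000

hs_gpa = rng.normal(3.3, 0.4, size=n)                      # hypothetical predictor
col_gpa = 0.8 * hs_gpa + rng.normal(0, 0.5, size=n)        # hypothetical outcome

X = np.column_stack([np.ones(n), hs_gpa])
b = np.linalg.lstsq(X, col_gpa, rcond=None)[0][1]          # ordinary slope coefficient

# Standardized coefficient: SDs of y per one-SD change in x
b_std_rescaled = b * hs_gpa.std() / col_gpa.std()

Xz = np.column_stack([np.ones(n), (hs_gpa - hs_gpa.mean()) / hs_gpa.std()])
yz = (col_gpa - col_gpa.mean()) / col_gpa.std()
b_std_direct = np.linalg.lstsq(Xz, yz, rcond=None)[0][1]

print(round(b_std_rescaled, 3), round(b_std_direct, 3))    # identical
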
Does this relationship answer any interesting, policy-relevant questions? Arguably, yes.
If you are a UC admissions officer, and you are tasked with reducing acceptance rates due
to state budget cuts (sadly, this scenario is likely to occur), then you can use the regression
results to predict which students are least likely to succeed. We know from the previous
theorems that the CEF provides the MMSE prediction of y (college GPA) and that regression
provides the MMSE linear approximation to the CEF. So in a predictive sense you are likely
to do well (at least relative to alternative choices), and in this case what you care about is
prediction.
These results also have policy relevance in that the University of California would like
to maintain a diverse student body but is not allowed to give any weight to ethnicity as
an admission criterion. UC administrators are aware, however, that weighting SATs more
heavily (as is traditionally done) tends to favor Caucasians (and possibly Asians?), while
weighting high school GPA more heavily tends to favor African Americans and Latinos (in
a relative sense). But will putting more weight on high school GPA and less weight on SAT
scores result in a lower quality student body? The results from Table 4 indicate that it will
not; in fact, if anything, it may result in a higher quality student body.
Are the estimated relationships causal? Highly unlikely. Even after controlling for parental education and income, there are probably unobserved individual, family, neighborhood, and peer characteristics that affect college success and are correlated with high school GPA and SAT scores.[2] The regression results, however, are still useful for prediction and have interesting applications in policy-relevant questions.

One can still take issue with the results along multiple dimensions. For example, should some adjustment be done to GPA to reflect the student's choice of major?[3] Might there be other variables collected from the applicants that could improve the predictive power of the model? Nevertheless, the fact remains that the results are useful and interesting despite the fact that the coefficients do not have causal interpretations. This example makes appropriate use of a descriptive relationship estimated via linear regression, which is probably more than can be said for the vast majority of empirical applications in economics.

[2] In fact, it doesn't even make sense to talk about an experimental manipulation of high school GPA or SAT scores. The effects on college success will almost surely depend on whether the treatment entails raising these attributes through cram sessions or through mentoring programs or through intensive intervention earlier in life. The treatment is better defined as the actual intervention than as "raising GPA by one point" or "increasing SAT scores by 100 points."

[3] I would strongly recommend against getting into this debate; it will be a great way to alienate a lot of colleagues very quickly.

ARE 213 Applied Econometrics
UC Berkeley Department of Agricultural and Resource Economics
Introduction to Causality and Research Design:
No Causation Without Manipulation[1]

Causal effects are of interest to economists (and other social scientists) because we would often like to know what the effects of manipulating a particular program or policy are. Take, for example, the return to schooling, possibly the most heavily analyzed quantity in labor economics (maybe even in all applied microeconomics!). Using CPS data, it is easy to estimate the relationship between schooling and earnings; we can, for example, use linear regression to approximate the expected value of earnings conditional upon years of schooling (see previous lecture). However, this only reveals to us how these two variables covary in the US population. It does not, in general, reveal what effects a policy manipulation that increased schooling by one year for each student in the US might have on earnings. Using the terms that you learned in ARE 212, years of schooling is endogenously determined by individual students and their parents. If you are interested in predicting how earnings change when you draw a different individual with a higher level of education from the US population, then it is perfectly reasonable to apply the regression coefficient. The policy manipulation, however, refers to an exogenous change in years of schooling. There is therefore no reason that the regression coefficient estimated using data in which schooling is endogenously determined should correspond to the effect of an exogenous change in years of schooling. Note that it is not the case that one quantity is right and the other is wrong; which one is correct depends on what question you are trying to answer. Rather, it's simply the case that the two quantities are different, and one cannot be substituted for the other.

[1] This phrase comes from Holland (1986). For those unfamiliar with recent American history, it refers to the American Colonial protest, "No Taxation Without Representation." That phrase now appears on license plates in the District of Columbia as "Taxation Without Representation," because D.C. residents pay the same Federal income taxes that you and I do but do not have the luxury of voting representation in the House of Representatives or the Senate.

Despite the central role that causality plays in answering policy-relevant questions (since
a policy intervention implies, almost by definition, some sort of external manipulation), many econometrics courses do not formally present or discuss a model of causality. Instead, they often begin by presenting a structural model of some economic phenomenon, which is implied to have an underlying causal interpretation, and then proceed to discuss the cases in which linear regression (or some other estimator) will estimate this model. This presentation, however, sometimes leaves students thinking that a regression is inherently wrong or useless if it doesn't provide unbiased or consistent estimates of an underlying structural model (which, as we observed in the previous lecture, is certainly not the case). It is also true that in some (many?) cases the structural parameters themselves do not correspond to any meaningful causal effect without further transformations or assumptions.

The causal model we discuss today has come to be known as the Rubin Causal Model (RCM), in reference to Rubin (1974) and subsequent publications. The RCM relies heavily upon the notion of potential outcomes, that is to say, possible outcomes under different values of a variable we shall refer to as the treatment, and it is useful for two reasons. First, it is useful for understanding many common estimation techniques, such as instrumental variables, regression discontinuity design, propensity score matching, etc. More importantly, however, it can be useful in framing or understanding what question you are trying to answer or what effect you are trying to estimate. If the quantity cannot be conceptualized as arising from an experimental manipulation of some type of treatment, then it cannot be estimated from a randomized trial, and the techniques that we learn which simulate randomized experiments will be inappropriate.[2]

[2] Of course, the question may still be of interest, but you will have to find a different (possibly easier!) way to answer it, and you should understand that the answer will not correspond to the effect of a policy intervention.

1 The Rubin Causal Model
Suppose that we have N units, i = 1, ..., N, drawn randomly from a large population. We are interested in the effect of some binary treatment variable, D_i, on an outcome, Y_i. We
refer to D_i = 1 as the treatment condition and D_i = 0 as the control condition. Given these two possibilities, treatment and control, we postulate the existence of two potential outcomes for each unit: Y_i(0) under the control condition and Y_i(1) under the treatment condition.[3] The key here is that, although we will never observe both Y_i(0) and Y_i(1) (we will observe at most one or the other, but never both), it is theoretically possible that we could observe either. In Holland's terminology, every unit must be potentially exposable to every value of the treatment variable. If you cannot conceptualize both Y_i(0) and Y_i(1) for the same unit, then D does not correspond to a treatment that is potentially manipulable and we cannot talk about the causal effect of manipulating D without further defining the problem. Holland, for example, argues that race is not something to which each unit is potentially exposable; we do not in general think of race as being something that we can experimentally manipulate, and it is unclear what it would mean to ask what my potential outcomes would be if I changed my race to be, for example, African-American.

Using the notation above, we define the causal effect of treatment D = 1 on outcome Y for unit i as:

Y_i(1) - Y_i(0) = τ_i

Alternatively, we often refer to τ_i as the treatment effect for unit i. Several things are important to note here. First, the effect of a treatment is always defined in a relative sense; in this case it is the effect of the treatment condition D = 1 relative to the potential outcome that would have occurred under the control condition D = 0. In medicine, D = 1 might correspond to giving a drug (e.g., Lipitor) to a patient, while D = 0 corresponds to giving a placebo to the patient. In our field, D = 1 might correspond to implementing a specific carbon tax in California, while D = 0 corresponds to not doing so.[4] Second, the effect of the treatment need not be constant across different units, as indicated by the fact that τ is indexed by i; many (probably most) treatments have heterogeneous effects. Finally, we will never observe both Y_i(1) and Y_i(0) for any given unit. This is because, although it is not evident in the notation, treatments also involve a time dimension. When we write D = 1 and D = 0, we implicitly mean that we are applying the treatment or control condition at a specific point in time. In the medical example, if we administer Lipitor to a patient on his 55th birthday, we cannot simultaneously not administer Lipitor to him at the exact same moment. In the environmental policy example, if we implement a carbon tax in California for the 2011 fiscal year, we cannot simultaneously not implement that carbon tax in California during the same fiscal year. We might choose not to implement the tax in 2010 or 2012, just as we might choose not to administer Lipitor to the patient on his 54th or 56th birthdays, but since other factors affecting the unit can change during the interim period, we are not guaranteed of observing the outcome that would have occurred had we implemented the control condition in 2011 (or on the 55th birthday).

[3] Note that the notation here is slightly different than in the excellent Holland (1986) article. In Holland's article, the subscript of Y_t(i) corresponds to treatment/control while the argument inside the parentheses corresponds to the unit number (1, ..., N in our case). In our notation, the subscript corresponds to the unit number while the argument inside the parentheses corresponds to treatment/control. We also deviate slightly from Cameron and Trivedi's notation in that they use subscripts for both treatment/control and unit number. We do this because our notation corresponds to the notation used in seminal articles such as Angrist, Imbens, and Rubin (1996).

This inability to observe both Y_i(0) and Y_i(1) for any given unit leads to the following theorem:

Fundamental Problem of Causal Inference: It is impossible to observe the value of Y_i(0) and Y_i(1) on the same unit i and, therefore, it is impossible to observe τ_i, the effect for unit i of the treatment on Y_i. (Holland 1986)

The Fundamental Problem of Causal Inference would appear to rule out any precise estimation of τ_i, and, at the unit level, it is true that we can never observe the exact treatment effect. However, all is not lost. As we mentioned in Lecture 1, we are often interested in relationships that hold on average, or in expectation. In this context, it is possible to estimate quantities of interest. We define the average causal effect or average treatment effect (ATE) of the treatment relative to the control as the expected value of the difference Y_i(1) - Y_i(0), or

τ = E[Y_i(1) - Y_i(0)] = E[Y_i(1)] - E[Y_i(0)]

With the appropriate research design, it is possible to estimate the ATE.

[4] If the treatment variable can take on more than two values (e.g., 0, 1, or 2), then multiple treatment effects exist for each unit (e.g., Y_i(1) - Y_i(0) and Y_i(2) - Y_i(1)), and these effects need not be equal, just as the relationship between a dependent variable and an explanatory variable need not be linear.

2 Estimation of Treatment Effects: The Randomized Controlled Trial
For each unit i, there exist the quantities (Y_i(0), Y_i(1), D_i). However, we only observe (Y_i, D_i), where

Y_i = (1 - D_i) Y_i(0) + D_i Y_i(1)

The distinction between what exists conceptually and what we can actually observe is subtle but tremendously important. Although we can only observe Y_i(0) for untreated units and Y_i(1) for treated units, we can conceive of the counterfactual quantities Y_i(1) for untreated units (i.e., the outcome that control unit i would have realized under the treatment condition) and Y_i(0) for treated units (i.e., the outcome that treated unit i would have realized under the control condition). Understanding the distinction between the observed Y_i and the unobserved-but-still-existent counterfactual quantities (Y_i(0) or Y_i(1)) will be crucial in subsequent derivations in this course.

By definition,

E[Y_i | D_i = 1] = E[Y_i(1) | D_i = 1]

E[Y_i | D_i = 0] = E[Y_i(0) | D_i = 0]

Note that in general E[Y_i(0) | D_i = 0] ≠ E[Y_i(0) | D_i = 1] (and E[Y_i(1) | D_i = 1] ≠ E[Y_i(1) | D_i = 0]). That is to say, people who select into the control condition generally have
different outcomes under the control condition (Y_i(0)) than people who do not select into the control condition. Thus, the average control outcome for the control units, E[Y_i(0) | D_i = 0], need not equal the average control outcome for all units, E[Y_i(0)], which is a combination of both control and treated units. The fact that we do not observe control outcomes (Y_i(0)) for any of the treated units, however, does not prevent us from imagining the existence of these counterfactual outcomes. In the context of our medical example, Y is cholesterol level and D represents treatment with Lipitor. Patients who choose to take Lipitor (D_i = 1) are likely to have high cholesterol levels in the absence of Lipitor (i.e., Y_i(0) is high, though we do not observe Y_i(0) for them). Patients who choose not to take Lipitor (D_i = 0) are likely to have low cholesterol levels in the absence of Lipitor (i.e., Y_i(0) is low, and for these patients we observe Y_i(0) since Y_i = (1 - D_i)Y_i(0) + D_i Y_i(1) = Y_i(0)). The average untreated cholesterol level for patients not taking Lipitor, E[Y_i(0) | D_i = 0], is therefore less than both the average untreated cholesterol level for treated patients, E[Y_i(0) | D_i = 1], and the average untreated cholesterol level for all patients, E[Y_i(0)].

There is, however, an important case in which E[Y_i(0) | D_i = 0] = E[Y_i(0) | D_i = 1] = E[Y_i(0)] (and E[Y_i(1) | D_i = 1] = E[Y_i(1) | D_i = 0] = E[Y_i(1)]). Suppose that the treatment assignment, D, is randomly assigned. In that case, D is independent of both Y(0) and Y(1). The conditional distribution of Y_i(0) (and Y_i(1)) given D_i is therefore equal to the unconditional distribution, and it must be the case that

E[Y_i(0) | D_i = 0] = E[Y_i(0)]

E[Y_i(1) | D_i = 1] = E[Y_i(1)]

The average causal effect, τ, is thus

τ = E[Y_i(1)] - E[Y_i(0)] = E[Y_i(1) | D_i = 1] - E[Y_i(0) | D_i = 0] = E[Y_i | D_i = 1] - E[Y_i | D_i = 0]

We can easily estimate τ by taking the difference between the average value of Y_i in the treatment group and the average value of Y_i in the control group. Because it allows estimation of the ATE, the randomized controlled trial is considered the gold standard of evidence in medicine, and in many areas of social science as well.

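The logic is easy to see in a simulation (a minimal sketch, assuming Python with numpy; the potential-outcome distributions and the selection rule are invented for illustration): with self-selection the naive difference in means is biased for the ATE, while under random assignment it recovers it.

import numpy as np

rng = np.random.default_rng(5)
n = 500_000

y0 = rng.normal(0, 1, size=n)                    # potential outcome under control
y1 = y0 + 2.0 + rng.normal(0, 1, size=n)         # potential outcome under treatment; ATE = 2

# Selection on y0: units with high untreated outcomes opt into treatment
d_selected = (y0 + rng.normal(0, 1, size=n) > 0).astype(int)
# Random assignment: D independent of (Y(0), Y(1))
d_random = rng.integers(0, 2, size=n)

def diff_in_means(d):
    y = np.where(d == 1, y1, y0)                 # observed outcome: Y = (1 - D)Y(0) + DY(1)
    return y[d == 1].mean() - y[d == 0].mean()

print(round(diff_in_means(d_selected), 2))       # biased, noticeably above 2
print(round(diff_in_means(d_random), 2))         # close to the ATE of 2
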
In some instances we may be willing to assume that E[Y_i(0) | D_i = 0] = E[Y_i(0) | D_i = 1] = E[Y_i(0)] but not that E[Y_i(1) | D_i = 1] = E[Y_i(1) | D_i = 0] = E[Y_i(1)]. In other words, we may be willing to assume that the untreated potential outcomes are mean-independent of the treatment assignment, but not that the treated potential outcomes are mean-independent of the treatment assignment. This is equivalent to saying that there is no selection into treatment based on the level of untreated outcomes, but there is selection into treatment based on the potential gains of being treated. You could probably write down an economic model that would give this result, but to be honest I doubt it would be a palatable assumption in most empirical settings. Regardless, under this slightly weaker assumption, you can still identify

τ_TOT = E[Y_i(1) | D_i = 1] - E[Y_i(0) | D_i = 1]

This quantity is commonly referred to as the effect of the treatment on the treated, or TOT (treatment-on-treated) or ATOT (average treatment-on-treated) or some other strange permutation of those letters. It is the causal effect of the treatment on those who select into treatment.

3 The Stable Unit Treatment Value Assumption: SUTVA
Beyond the assumption of random assignment of D, there is an implicit assumption embedded in the previous section that is known rather awkwardly as the stable unit treatment value assumption, or SUTVA. Let D be an N × 1 column vector that contains the treatment values for all N units. Formally, SUTVA states that

If D_i = D'_i, then Y_i(D) = Y_i(D').

We have not yet defined what Y_i(D) is, but it is exactly analogous to our definition of Y_i(D_i) (i.e., Y_i(0) and Y_i(1)). That is to say, Y_i(D) is the potential outcome for unit i under treatment regime D. Now, instead of just specifying whether unit i is receiving the treatment or the control, we are specifying values of D_i for all units in the sample. For this reason, SUTVA is often referred to as the "no interference" assumption, since it states that unit i's potential outcomes are unaffected by whether unit j (j ≠ i) is treated or untreated. A classic example of SUTVA not holding is the case of vaccines. If D_i represents inoculation of unit i with the measles vaccine, and Y_i represents whether unit i gets measles, clearly Y_i(D_i) depends on the values of the entire vector D. In particular, if D_j = 1 for all j ≠ i, then Y_i(0) will likely be 0 despite the fact that unit i is unprotected, because there are no other unprotected units to spread the disease to unit i. If D_j = 0 for all j ≠ i, however, then Y_i(0) might change to 1. Another example of SUTVA not holding is the carbon tax scenario presented earlier. If Y_c(0) is the average temperature in California (in the year 2100) in the absence of a California carbon tax, it should be clear that Y_c(0) depends on whether other states or countries implement carbon taxes. If SUTVA does not hold, then there is not just one treatment effect, τ_i, per unit but rather a multitude of treatment effects (one for each different permutation of D). More importantly, it may be impossible to estimate the treatment effect relative to the "no intervention" scenario (i.e., the scenario in which D_i = 0 for all units i), because as soon as one unit is treated, all are potentially affected (so it is impossible to construct an unbiased estimate of E[Y_i(0)] with the data).

Rubin (1986) discusses SUTVA in the context of poorly defined treatments. That is to say, he focuses on cases in which, even if D = D', it is still the case that Y_i(D) ≠ Y_i(D'). This occurs because the treatment, and in particular the assignment mechanism, is not precisely defined, so even though D = D', it's really not the same treatment (we will discuss some examples shortly). In subsequent years, however, interest has focused on the "no interference" aspect of SUTVA; in many cases, treating one unit indirectly affects other units, and SUTVA does not hold.

4 Applications and Discussion
4.1 Poorly Defined Treatments
When does it make sense to talk about D as a cause and when does it not? Holland (1986)
has a nice discussion on pp. 954-955 that I urge you to read, as does Rubin's comment on that article. In Holland's example, there are three hypothetical scenarios:
(1) She scored highly on the exam because she is female.
(2) She scored highly on the exam because she studied.
(3) She scored highly on the exam because her teacher tutored her.
In scenario (3), it is clear that the treatment is well-defined: the teacher tutors her. We can easily conceive of manipulating whether or not this tutoring occurs. In scenario (1), Holland argues that the student's gender cannot be considered a cause because we cannot manipulate it. It is certainly the case that the treatment is not well-defined in this case and cannot fit within the causal framework, although Rubin points out that further refinements could allow the scenario to fit within the RCM. For example, if we said, "She scored highly on the exam because she received sex reassignment surgery," then we would have a clearly defined treatment. Scenario (2) is the most problematic because it involves a voluntary activity that the student can choose to do. Although we could certainly conceive of an intervention that might prevent the student from studying (anesthesia, for example, would be a pretty good bet), it is hard to imagine a manipulation that would force the student to study (or at least force her to study as well as she would if she voluntarily studied). Since we cannot manipulate this attribute (studying for the exam), we cannot think of it as a cause, at least not within the potential outcomes framework. Hence Holland's phrase, "No Causation Without Manipulation." It should also be clear from this discussion why there is such a close linkage between the potential outcomes framework and policy relevance. If you can't conceive of manipulating a particular attribute, then by definition you cannot design a policy that would manipulate that attribute!

4.2 Poorly Defined Treatments II: Assignment Matters
While labor, public, and development economics have clearly moved more towards reduced-form empirical work that often attempts to uncover causal effects of interventions (as defined by the RCM), industrial organization has become highly structural and rarely uses the potential outcomes framework at all. Although there is something to be said for hysteresis (as well as available data sets), I believe to a significant degree this divergence is due to the difficulty in defining treatments in much of the IO context. In short, it can be important to think about how the assignment mechanism affects not just selection into the treatment, but the actual effect of the treatment itself.

In many strategic situations with incomplete information, we are interested in how a unit reacts to the actions of other units, not just because those actions directly affect the unit in question, but because the action may reveal new information to the unit. For example, consider natural resource extraction scenarios (oil drilling, fisheries, etc.) in which the location of the resources is unknown. We may want to test whether extractor i takes cues from the location of other extractors in order to learn about the distribution of resources. In this case, the assignment mechanism for the location of other extractors (the treatment) becomes very important. Random assignment of the other extractors will not allow us to estimate the effect in question because extractor i will react differently if (s)he realizes that the other extractors are being randomly assigned. This can be true even if extractor i does not know which extractors are being randomly assigned; the only way his or her behavior will be unchanged is if (s)he does not realize that there is any (additional) randomization being applied.[5] It can therefore be exceedingly difficult to find valid natural experiments in these types of situations. In general, any time that an action sends a signal, and our interest is in estimating the response to that signal, randomization of that action will not yield the estimate that we are looking for (unless the signal's target has no idea that the randomization is being introduced). This may be one reason why the potential outcomes framework, with its emphasis on RCTs, is relatively uncommon in IO. Assignment matters.

[5] The same thing applies to randomly revealing valid location information to some of the other extractors. Extractor i's behavior can change if (s)he knows that additional valid information is being inserted into the game.

4.3 Effects of Causes vs. Causes of Effects
Holland writes that an emphasis on the effects of causes rather than on the causes of effects is, in itself, an important consequence of bringing statistical reasoning to bear on the analysis of causation, and directly opposes more traditional analyses of causation. This distinction between effects of causes and causes of effects may seem somewhat pedantic. It is not.

A concrete example should help clarify the distinction. Consider the obesity epidemic, an issue of great importance to both agricultural economists (on the input side) and health economists (on the output side). Researchers in different fields cite a myriad of causes of this epidemic: increased consumption of snack foods, larger portions at restaurants, more frequent consumption at restaurants, more sedentary jobs, the introduction of high fructose corn syrup, etc. Under these explanations, however, it is rarely clear what the counterfactual is. Take, for example, the increased consumption of snack foods over the last 30 years. One possible counterfactual is what would have happened if people had not increased their consumption of snack foods while everything else remained unchanged. It is unclear, however, what policy manipulation could enforce that counterfactual scenario. Although it is easy to imagine limiting snack food consumption (through a quota or a tax, for example), it is impossible to imagine doing so while simultaneously preventing individuals from compensating in any other manner. Another possible counterfactual is to imagine a world in which the complex set of technologies and changes in consumer preferences that led to the increase in snacking had been inhibited from developing. Even if it were possible to imagine this sort of manipulation, however, there is no guarantee that total caloric consumption would fall by exactly the amount that snack food consumption has increased (in fact, it almost surely would not). It becomes clear that the answer to the question of what has caused the obesity epidemic is "every single thing about the world that affects weight and has changed since 1970." But even this turns out to be an incomplete answer, because the distribution of weight in 1970 is itself a cause of the distribution of weight today, so the obesity epidemic is in fact caused by everything in the history of the world that has ever had an effect on weight. As Holland (1986) notes, there is really no definable answer to this type of question.

Another example may make the conundrum even clearer. What caused the loss of life in New Orleans during the flooding from Katrina? Was it the hurricane? The fact that the levees were not constructed well enough? The fact that not all residents chose to (or had the means to) evacuate? The fact that Bush appointed Michael Brown, a former commissioner of the International Arabian Horse Association, as head of FEMA (where, according to Bush, he proceeded to do "a heck of a job" during the rescue effort)? The fact that a pumping system was built in the early 20th century that allowed development of below-sea-level areas? The fact that the city even existed at all following the Louisiana Purchase? The list of possibilities is endless.

In contrast to the causes of effects, which is effectively an unlimited exercise in accounting and description, the effects of causes are clearly defined under the RCM. Even if we cannot measure them with existing data, we can at least conceive of what they are.

5 Additional References
Rubin, Donald. "Estimating Causal Effects of Treatments in Randomized and Non-randomized Studies." Journal of Educational Psychology, 1974, 66, 688-701.
ARE 213 Applied Econometrics
UC Berkeley Department of Agricultural and Resource Economics
LaLonde Paper Detour:
Truncation Models and Heckman Selection Models
LaLonde (1986) presents a regression of the form Y_i = δD_i + X_iβ + rH_i + u_i that he says comes from a Heckman Selection Model. How is this equation derived and where does it come from?

1 Truncation Model
Consider first a data truncation model of the form:

y*_i = x_iβ + u_i

Assume that you observe a unit i if and only if y*_i > 0. Otherwise, the data are missing; the distribution of y_i is truncated. Will a regression of y_i on x_i yield an unbiased estimate of β? In general, no. To see this, suppose that β is positive. In that case, if x is large, then you are likely to observe y*_i even when u_i is small. But if x is small, you will only observe y*_i when u_i is large. So the observations with small x will tend to have larger u, and the observations with large x will tend to have smaller u. This will cause your estimate of β to be attenuated towards zero when you run the regression of y on x. Truncation models of this form produce attenuation bias.

Can we correct for it? Yes, if we make strict distributional assumptions (big "if" there). Suppose that u ~ N(0, 1).[1] What is E[y*_i | x_i, y*_i > 0]? This is equivalent to asking, what is E[u_i | x_i, x_iβ + u_i > 0]? First, we need to compute the conditional density of u_i given that x_iβ + u_i > 0.

[1] We can alternatively assume that u has some constant variance σ^2. The form of the derivation is unchanged, but you just have to normalize by dividing through by σ and then keep track of the σ's as you go.

P(u_i < u | x_i, -x_iβ < u_i) = P(u_i < u and -x_iβ < u_i) / P(-x_iβ < u_i) = P(-x_iβ < u_i < u) / P(-x_iβ < u_i)

P(u_i < u | x_i, -x_iβ < u_i) = [Φ(u) - Φ(-x_iβ)] / Φ(x_iβ)  if -x_iβ < u,  and 0 otherwise

Thus the conditional density of u_i given that the observation is observed (i.e., x_iβ + u_i > 0) is the derivative of the function above with respect to u:

f(u | x_i, -x_iβ < u_i) = φ(u) / Φ(x_iβ)  if -x_iβ < u,  and 0 otherwise     (1)

So what is E[u_i | x_i, x_iβ + u_i > 0]?

E[u_i | x_i, x_iβ + u_i > 0] = ∫ u φ(u)/Φ(x_iβ) du  (integrating over u from -x_iβ to ∞)

= (1/Φ(x_iβ)) ∫ u (1/√(2π)) exp(-u^2/2) du

= (1/Φ(x_iβ)) [ -(1/√(2π)) exp(-u^2/2) ] evaluated from -x_iβ to ∞

= 0 + (1/Φ(x_iβ)) (1/√(2π)) exp(-(x_iβ)^2/2) = φ(x_iβ) / Φ(x_iβ)     (2)

So E[u_i | x_i, x_iβ + u_i > 0] = φ(x_iβ)/Φ(x_iβ). We therefore know that in our observed data,

E[y_i | x_i] = x_iβ + φ(x_iβ)/Φ(x_iβ)

How do we estimate this given that it is a nonlinear function? Two possibilities are nonlinear least squares (NLLS) and maximum likelihood estimation (MLE). For NLLS, observe that we can minimize the sum of squared residuals for the function above just like we minimize the sum of squared residuals when deriving OLS:

min over β:  Σ_{i=1}^{N} ( y_i - [x_iβ + φ(x_iβ)/Φ(x_iβ)] )^2

Of course, unlike with OLS, in general we cannot solve this problem analytically, so you will have to use a computer to numerically solve for the value of β that minimizes that function.

Alternatively, we could use MLE. Note that P(Y_i < y_i | observed) = P(x_iβ + u_i < y_i | observed) = P(u_i < y_i - x_iβ | observed). From equation (1) we know the likelihood function for an individual observation is φ(y_i - x_iβ)/Φ(x_iβ), so the likelihood for all the data is:

Π_{i=1}^{N} φ(y_i - x_iβ) / Φ(x_iβ)

Maximizing this function with respect to β, or more practically, maximizing the log of it, will produce the MLE for β. Again, you will have to do this via computer rather than analytically.

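Both approaches are easy to sketch (assuming Python with numpy and scipy; the parameter values are arbitrary). The log-likelihood below is just the log of the truncated density above, summed over the observed units; OLS on the truncated sample illustrates the attenuation.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(6)
n = 20_000
beta_true = np.array([0.5, 1.0])                 # intercept and slope; u ~ N(0, 1)

x = np.column_stack([np.ones(n), rng.normal(size=n)])
y_star = x @ beta_true + rng.normal(size=n)

keep = y_star > 0                                # truncation: unit i observed only if y* > 0
y, X = y_star[keep], x[keep]

print(np.linalg.lstsq(X, y, rcond=None)[0])      # OLS on the truncated data: slope attenuated

def neg_loglik(b):
    xb = X @ b
    # log of phi(y - xb) / Phi(xb), summed over observed units
    return -np.sum(norm.logpdf(y - xb) - norm.logcdf(xb))

res = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print(res.x)                                     # close to beta_true
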
2 Heckman Selection Model
We just reviewed the truncation model, which is about missing data, where the missingness is nonrandom. The Heckman Selection Model that LaLonde presents is also about missing data; in this case it is the counterfactual potential outcomes, Y_i(1) and Y_i(0), that are missing. Consider a model similar to LaLonde's of the form:

y_i(0) = x_iβ + u_i

y_i(1) = δ + x_iβ + u_i

Suppose that the following rule for selecting into the treatment holds, where z_i contains all the variables in x_i and possibly some extra regressors:

We observe y_i(0) if d*_i = z_iγ + ν_i < 0 (i.e., D_i = 0)

We observe y_i(1) if d*_i = z_iγ + ν_i > 0 (i.e., D_i = 1)

Further suppose that both u_i and ν_i are distributed N(0, 1) with some nonzero covariance, and that (u_i, ν_i) ⊥ z_i. Note that we have a missing data problem: we want to know δ = y_i(1) - y_i(0), but y_i(1) is missing when D_i = 0 and y_i(0) is missing when D_i = 1. So what is
E[u_i | z_i, D_i = 1]? First, note that since u_i and ν_i are joint normal, E[u_i | ν_i, z_i] = E[u_i | ν_i] = ρν_i (the first equality is because (u_i, ν_i) ⊥ z_i, the second equality is due to the Regression-CEF Theorem from the first lecture). Therefore:

E[u_i | z_i, D_i = d] = E[ E[u_i | ν_i, z_i, D_i = d] | z_i, D_i = d ]

= E[ E[u_i | ν_i, z_i] | z_i, D_i = d ]

= E[ ρν_i | z_i, D_i = d ] = ρ E[ν_i | z_i, D_i = d]

The first step is just iterated expectations. In the second step, we can drop D_i from the inner expectation because it is a deterministic function of z_i and ν_i (fixing z_i and ν_i completely determines D_i). In the third step, we substitute in using the equality in the paragraph above.

Therefore, E[u_i | z_i, D_i = 1] = ρ E[ν_i | z_i, D_i = 1] = ρ E[ν_i | z_i, z_iγ + ν_i > 0]. Applying our result from equation (2), we know that:

E[ν_i | z_i, z_iγ + ν_i > 0] = φ(z_iγ) / Φ(z_iγ)

Thus E[y_i | z_i, D_i = 1] = δ + x_iβ + ρ φ(z_iγ)/Φ(z_iγ). A similar set of derivations shows that

E[y_i | z_i, D_i = 0] = x_iβ - ρ φ(z_iγ)/(1 - Φ(z_iγ))

E[y_i | z_i, D_i] is thus:

E[y_i | z_i, D_i] = δD_i + x_iβ + ρ[ D_i φ(z_iγ)/Φ(z_iγ) - (1 - D_i) φ(z_iγ)/(1 - Φ(z_iγ)) ]

This is virtually identical to what LaLonde shows on p. 615, though for some reason the denominators on his bias correction terms are switched around.[2] Regardless, we can define H_i = D_i φ(z_iγ)/Φ(z_iγ) - (1 - D_i) φ(z_iγ)/(1 - Φ(z_iγ)) and estimate the following equation:

y_i = δD_i + x_iβ + rH_i + v_i

To estimate H_i, we can run a probit of D_i on z_i; this will give us the fitted values of z_iγ that we need to construct an estimate of H_i. Then we can regress y_i on D_i, X_i, and the estimated H_i. Hence it is a two-step procedure.

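A compact sketch of the two-step procedure on simulated data (assuming Python with numpy, scipy, and statsmodels; the variable names and parameter values are mine, not LaLonde's): a probit first stage yields the fitted index, which is used to build the correction term H_i for the second-stage OLS.

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(7)
n = 20_000

# Jointly standard normal errors with correlation 0.6
u, v = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=n).T

x = rng.normal(size=n)
z_extra = rng.normal(size=n)                        # extra regressor in z (an exclusion)
d = (0.5 * x + 1.0 * z_extra + v > 0).astype(int)   # selection equation
y = 2.0 * d + 1.5 * x + u                           # outcome; true treatment effect = 2

# Step 1: probit of D on z, then build the correction term H
Z = sm.add_constant(np.column_stack([x, z_extra]))
zg = sm.Probit(d, Z).fit(disp=0).fittedvalues       # fitted index from the probit
H = d * norm.pdf(zg) / norm.cdf(zg) - (1 - d) * norm.pdf(zg) / (1 - norm.cdf(zg))

# Step 2: OLS of y on D, x, and H
X2 = sm.add_constant(np.column_stack([d, x, H]))
print(sm.OLS(y, X2).fit().params)                   # coefficient on d is close to 2
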
Is the procedure really that useful? Probably not. On the one hand, if z_i contains more elements than x_i, then you can simply run 2SLS, with the elements in z_i that are excluded from x_i as your instruments; the only thing the Heckman model is really getting you is a nonlinear first stage. The problem is that if the distributional assumptions or the functional form assumptions are wrong, you can still get an inconsistent estimate even if you have a good instrument. On the other hand, if z_i = x_i, then you are identified purely off of getting the functional form of E[y_i | x_i] correct, and modeling the choice equation correctly, and making the correct distributional assumptions about the residuals, which is a very tenuous form of identification. Nevertheless, the model has a long history in econometrics (and it is central to the work for which Heckman was awarded the Nobel), so you should at least be aware of it.

[2] I have run Stata simulations to confirm that my derivation gives consistent estimates under the stated assumptions, while LaLonde's does not.

ARE 213 Applied Econometrics
UC Berkeley Department of Agricultural and Resource Economics
Cautionary Notes:
It's a Harsh World Out There
We have seen in the past two lectures that:
(a) it is almost always valid to run a regression as long as you interpret it correctly, and
(b) randomized experiments are generally the preferred method for estimating causal effects under the potential outcomes framework.

Under what conditions will a linear regression, the bread and butter of econometrics, approximate a randomized experiment? Loosely speaking, we need it to be the case that x_i is as good as randomly assigned, i.e., x_i needs to be uncorrelated with unobserved factors that determine y_i after controlling for other observable factors. This type of research design is often referred to as "selection on observables." How often does it hold up in practice? Not as often as we would like.

1 LaLonde (1986): The NSW
LaLonde (1986) analyzes a randomized experiment evaluating a job training program, the National Supported Work Demonstration (NSW). The NSW, operated by Manpower Demonstration Research Corporation (MDRC), "admitted into the program AFDC women, ex-drug addicts, ex-criminal offenders, and high school dropouts of both sexes" (LaLonde 1986, p. 605).[1] While the NSW is shown to increase post-training earnings by $800-$900 (1982 dollars), that is not the main focus of the article. Instead, LaLonde uses the experimental estimates as a benchmark to test whether typical econometric techniques can reproduce the same results. The short answer is that they cannot.

[1] It is unclear from LaLonde's description how the MDRC administrators chose which applicants would enter the experiment. Given that there were only 6,616 trainees distributed between 10 cities, there was presumably a scarcity of slots relative to potential applicants.

LaLonde needs a simulated control group in order to conduct his exercise; if he applies any sensible estimator to the experimental data (treated and control groups), he will get reasonable estimates because the treatment is randomly assigned. He therefore constructs a series of simulated control groups using data from the PSID and the CPS (merged with SSA data). This is somewhat unusual in that the treated individuals and the control individuals are drawn from two entirely separate data sets, but it is not unreasonable for his purposes.

LaLonde begins the benchmarking exercise by applying a series of differences-in-differences type estimators. The basic model is:

(1)   y_i,1979 - y_i,1975 = δD_i + (ε_i,1979 - ε_i,1975)

This specification differences out any unobserved individual effects that are constant over time; it is equivalent to including individual fixed effects in the cross-sectional regression. Identification comes from comparing the change in earnings for those that participated in training to the change in earnings for those that did not participate. LaLonde also supplements this model with a regression that, instead of differencing, controls for pre-treatment earnings. This specification is more flexible in that it does not restrict the coefficient on pre-treatment earnings to be one.

(2)   y_i,1979 = δD_i + γ y_i,1975 + X_iβ + ε_i,1979

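For concreteness, here is how one might run these two specifications (a generic sketch in Python with pandas and statsmodels on simulated data; the variable names are hypothetical and this is not LaLonde's actual implementation).

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for a training-program data set
rng = np.random.default_rng(8)
n = 5_000
df = pd.DataFrame({
    "treat": rng.integers(0, 2, size=n),
    "age": rng.integers(18, 55, size=n),
    "educ": rng.integers(8, 16, size=n),
})
df["earn75"] = 3_000 + 200 * df["educ"] + rng.normal(0, 2_000, size=n)
df["earn79"] = df["earn75"] + 850 * df["treat"] + rng.normal(0, 2_000, size=n)

# Specification (1): first differences (equivalent to individual fixed effects)
df["d_earn"] = df["earn79"] - df["earn75"]
spec1 = smf.ols("d_earn ~ treat", data=df).fit()

# Specification (2): control for pre-treatment earnings and covariates instead
spec2 = smf.ols("earn79 ~ treat + earn75 + age + educ", data=df).fit()

print(spec1.params["treat"], spec2.params["treat"])   # both near the simulated effect of 850
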
Table 1 presents estimates from these two specifications. The "Pre-Treatment Differences" column presents differences between the income of treatment and control groups (or simulated control groups) in 1975, before the training program starts. If the treatment is randomly assigned, this difference should be close to zero, and for the true controls it is. LaLonde presents eight simulated control groups; for brevity I present the two control groups per gender that were closest to the experimental sample in terms of pre-treatment income. Table 1 is therefore more favorable to the nonexperimental estimates than LaLonde's equivalent tables. Nevertheless, we observe large variations between the experimental estimates and the nonexperimental estimates.

The "Diffs-in-Diffs" column presents estimates using the first-differences specification presented above (equation 1). The experimental benchmarks are $833 for females and $847 for males. The nonexperimental estimates range from -$1,637 to $3,145, and in only one case does the nonexperimental confidence interval contain the experimental point estimate. The last two columns of Table 1 apply the model presented in equation 2 (first without additional covariates, and then with additional covariates). These models perform slightly better; three of the eight point estimates get reasonably close to the experimental benchmarks (one of them gets quite close). Nevertheless, the pre-treatment differences are actually the worst for the samples that produce the closest results, so there is no reason to believe that an objective econometrician would reliably choose point estimates close to the experimental benchmarks.

Table 1: One-Stage Estimates

               Pre-Treatment                    Controlling For      Fully
Estimator      Differences     Diffs-in-Diffs   Previous Earnings    Adjusted

Females
Controls       -17             833              843                  854
               (122)           (323)            (308)                (312)
PSID-3         -77             3,145            3,070                2,919
               (202)           (557)            (531)                (592)
CPS-4          -1,189          2,126            1,222                827
               (249)           (654)            (637)                (814)

Males
Controls       39              847              897                  662
               (383)           (560)            (467)                (506)
PSID-3         455             242              629                  397
               (539)           (884)            (757)                (1,103)
CPS-3          337             -1,637           -1,396               1,466
               (343)           (631)            (582)                (984)

Notes: Standard errors in parentheses. Source: LaLonde (1986).

LaLonde then considers the performance of more advanced two-stage estimators. In particular, he applies the Heckman selection correction model from Heckman (1978). The Heckman selection correction models two equations separately: the participation equation (the first stage, a non-linear probit model) and the earnings equation (the second stage). In this sense it is not unlike two-stage least squares, but there are a couple of key differences. First, it uses a control function approach to solve the endogeneity problem. Specifically, it uses estimates from the first-stage equation as a regressor to control for the expected value of the earnings residual conditional on participation and the determinants of participation. Second, because it specifies the participation (treatment) dummy as a non-linear function of the covariates, it is possible to identify the training coefficient without any instruments (i.e., exclusion restrictions). Nevertheless, LaLonde experiments with several (questionable) instruments to see how well this model performs.
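The following is a rough Python sketch of the two-step logic on simulated data, not LaLonde's exact implementation: a probit participation equation, a control-function term built from the probit index, and an OLS earnings equation that includes it. The second-stage standard errors ignore the first-stage estimation and would need to be corrected in practice.

```python
import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
Z = rng.normal(size=(n, 2))                               # determinants of participation
u, e = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], n).T
D = (Z @ np.array([1.0, -0.5]) + u > 0).astype(float)     # treatment (probit selection)
y = 2.0 + 1.0 * D + e                                     # outcome; true training effect = 1.0

# First stage: probit of D on Z.
W = sm.add_constant(Z)
probit = sm.Probit(D, W).fit(disp=0)
xb = W @ probit.params

# Control function: E[u | D, Z], the generalized residual from the probit.
lam = np.where(D == 1, norm.pdf(xb) / norm.cdf(xb),
                       -norm.pdf(xb) / (1 - norm.cdf(xb)))

# Second stage: outcome on D and the correction term.
X2 = sm.add_constant(np.column_stack([D, lam]))
second = sm.OLS(y, X2).fit()
print(second.params)   # [constant, effect of D, coefficient on the correction term]
```

Testing whether the coefficient on the correction term differs from zero is the exogeneity test described in the next paragraph.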
Table 2: Two-Stage Estimates

                          Females                        Males
                  Training    Participation     Training    Participation
Controls              861           284             889          -876
                     (318)       (2,385)           (840)       (2,601)
5 Sketchy IVs       1,102          -606             -22        -1,437
                     (323)         (480)           (584)         (449)
3 Sketchy IVs       1,256          -823
                     (405)         (410)
2 Sketchy IVs       1,564          -552              13        -1,484
                     (604)         (569)           (584)         (450)
No Instrument       1,747          -526             213        -1,364
                     (620)         (568)           (588)         (452)

Notes: Standard errors in parentheses. Source: LaLonde (1986).
The results from the Heckman two-step estimator are reported in Table 2. The Heckman correction allows you to test the exogeneity of the treatment indicator by testing whether the coefficient on the selection correction term in the second stage is significantly different from zero. For brevity I present only results for the samples that displayed the least evidence of selection into the treatment. The two-step estimators perform somewhat better than the one-step estimators, but the results are still not encouraging. On the positive side, the confidence intervals for all but one of the nonexperimental estimates contain the experimental point estimates. But the standard errors are so large that much of this encouraging performance is primarily due to the fact that the confidence intervals are huge. Male nonexperimental estimates are particularly bad, ranging from -$1,333 to $213 (see LaLonde's Table 6 for the full set of results). It seems likely that if additional data were available, the nonexperimental estimates would converge to different values than the experimental estimates.

When LaLonde's paper was published in 1986, it caused significant consternation among applied researchers trying to estimate causal effects. It is probably not an overstatement to say that it sparked the pursuit of clean, transparent research designs that continues to this day.
2 Freedman (1991): A Natural Experiment
Freedman (1991) offers a critique of linear regression applications along with an example of an historical natural experiment as an alternative research design. Freedman begins with four possible views of regression, progressing from the most optimistic to the most pessimistic:

(1) Regression usually works, although it is (like anything else) imperfect and may sometimes go wrong.
(2) Regression sometimes works in the hands of skillful practitioners, but it isn't suitable for routine use.
(3) Regression might work, but hasn't yet.
(4) Regression can't work.

Source: Freedman (1991), p. 292.
Freedman professes that his own view falls between (2) and (3). I'm not sure exactly what (3) entails; the properties of linear regression are pretty well established, so if it were going to work, I would think it would have done so by now. But, like Freedman, I agree that "good examples [of causal estimates from regression] are quite hard to find."
In contrast to regression models (and more sophisticated models), Freedman presents the work of John Snow on cholera in the 1850s (that is to say, Snow conducted the work during the 1850s, on cholera at that time). Snow postulated that unsanitary water caused cholera outbreaks (at the time it was believed that cholera arose from poisonous particles in the air). Snow had several pieces of circumstantial evidence to support his position, but in order to prove his hypothesis he observed that water distribution in London gave rise to a natural experiment.

In the area that Snow was studying, two water supply companies, Southwark and Vauxhall Company and Lambeth Company, competed for customers. One company (Lambeth) drew water upstream of the sewage discharge points in the River Thames, while the other (Southwark and Vauxhall) drew water downstream of the discharge points. Both companies had pipes running down virtually every street and alley, and which houses chose which company appeared to be virtually random. Snow wrote, "Each company supplies both rich and poor, both large houses and small; there is no difference either in the condition or occupation of the persons receiving the water of the different Companies." In today's terminology, Snow would say that the observable attributes (covariates) were balanced across the two companies. Having convinced himself that the choice of water company was nearly random, he examined the cholera death rate for customers of both companies.

The cholera results, presented in Table 3, are striking. Death rates for the downstream company are over eight times higher than death rates for the upstream company. Given the sample size, and the fact that the customers of both companies are spatially intermixed, it is clear that these results are highly significant despite the absence of standard errors. As Freedman writes (p. 298):
"As a piece of statistical technology, Table [3] is by no means remarkable. But the story it tells is very persuasive. The force of the argument results from the clarity of the prior reasoning, the bringing together of many different lines of evidence, and the amount of shoe leather Snow was willing to use to get the data."
Table 3: Snow's Table IX

                              Number of     Deaths from     Deaths per
                               Houses         Cholera      10,000 Houses
Southwark and Vauxhall         40,046          1,263            315
Lambeth                        26,107             98             37
Rest of London                256,423          1,422             59

Notes: Source: Freedman (1991).
Freedman's emphasis is that the finding's credibility is due to the persuasive research design in conjunction with an impressive data set, rather than the sophistication of the statistical modeling technique. The implication is that, if you can't get data that have some sort of clean variation in the treatment of interest, then you can't convincingly identify a causal effect, no matter how fancy an estimation technique or theoretical model you apply. My personal view is in line with Freedman's, but it is certainly a matter considered open for debate within economics/econometrics. Regardless, as a piece of empirical evidence, Snow's 150-year-old study is clearly more credible than the vast majority of articles published today in economics (or other social sciences).
Freedman gives several examples of unconvincing regression studies; these aren't really worth reading since they are so common. He also stresses the importance of replication in the context of a response to LaLonde (1986) by Heckman and Hotz (pp. 306-307). This is an important point that unfortunately gets little recognition within economics (along this dimension, the medical literature is ahead of us); our field generally assigns little value to replicative studies unless they produce remarkably different results. The problem is that modern computing power allows the estimation of dozens, if not hundreds, of different models on the same data set. Given that researchers can only be expected to report the results of a few of these models, it is difficult to tell whether a result represents a true underlying relationship, or whether it is simply a statistical artifact of the ability to choose between so many different estimates. The only way to truly test the finding is to verify the result in different contexts with different data sets. But the incentives are not aligned to encourage researchers to engage in that type of work.

With these caveats in mind, we begin our study of estimation techniques developed to uncover causal effects. Virtually all of these techniques have historical roots that precede the LaLonde (1986) paper by years, if not decades. Nevertheless, their increased popularity is likely in part a reaction to LaLonde's study.
3 Additional References
Heckman, James. "Dummy Endogenous Variables in a Simultaneous Equations System." Econometrica, 1978, 46, 931-59.
ARE 213 Applied Econometrics
UC Berkeley Department of Agricultural and Resource Economics
Selection on Observables Designs:
Part I, Regression Adjustment
This set of lecture notes begins our discussion of what we refer to as "selection on observables" designs. The key assumption underlying these designs is that the treatment assignment is "ignorable," which you can interpret as "as good as randomly assigned" after you condition on a set of observable factors. There are a variety of estimation techniques available in this scenario: standard linear regression, flexible nonparametric regression, matching estimators, and propensity score estimators. The underlying (untestable) assumption of all of these estimators, however, is that you observe all of the factors that affect treatment assignment and are correlated with the potential outcomes. In other words, to the extent that there is systematic selection into treatment, this selection is only a function of the observable variables. Hence, if you can control for the effects of these variables on the probability of selection, then you can produce consistent estimates of causal effects. The flip side is that, if you don't observe all the determinants of selection, then these methods do not, in general, produce estimates with a causal interpretation. This important fact is often overlooked by applied practitioners who focus on the sophistication of the estimation technique (matching is somewhat en vogue these days). In my opinion, the underlying selection on observables assumption is too strong to hold in most cases, so these methods are probably applied more often than they should be. Nevertheless, in some cases the assumption is palatable (or at least defensible), and in those cases these techniques can be quite helpful.[1]

[1] For example, consider a case in which individuals apply for some program or job, and then are assigned to different areas/departments/treatments/whatever based upon the data in their applications. In this scenario, the researcher can observe all of the non-random factors that affected selection (i.e. the data in the applications), and the selection on observables assumption clearly makes sense.
1 Regression Adjustment
The key underlying assumption motivating regression adjustment (and the other selection on observables designs) is that the treatment is independent of the potential outcomes (particularly the untreated potential outcomes) after conditioning on the observable covariates, X_i. We write this assumption as:

(Y_i(1), Y_i(0)) ⊥ D_i | X_i

This assumption is referred to as the "unconfoundedness" assumption, the "selection on observables" assumption, or the "conditional independence" assumption. When combined with an assumption about overlap, 0 < P(D_i = 1|X_i) < 1, it is referred to as "strongly ignorable treatment assignment."
How does this fit in with what we previously learned about regression? We can translate the potential outcomes framework into a classical linear regression model by defining Y_i(0) and Y_i(1) as follows and assuming constant treatment effects.

Y_i(0) = α + η_i
Y_i(1) = Y_i(0) + ρ + ξ_i

The constant treatment effects assumption amounts to ξ_i = 0. Under these definitions, we have

Y_i = α + ρ·D_i + η_i

If D_i is randomly assigned, then D_i ⊥ Y_i(0), so D_i ⊥ η_i (our typical regression orthogonality assumption). What happens when we relax the assumption to Y_i(0) ⊥ D_i | X_i? Consider rewriting our expression for Y_i as
Y_i = α + ρ·D_i + γ·h(X_i) + ν_i

where h(X_i) ≡ E[D_i|X_i] and ν_i = η_i − γ·E[D_i|X_i]. Note that h(X_i) will not be estimated by running a regression of Y_i on D_i and X_i, even if E[D_i|X_i] is linear, because the regression of Y_i on D_i and X_i estimates (or approximates) E[Y_i|D_i, X_i]. Using partitioned regression, however, we know that partialing out h(X_i) from D_i and then running a bivariate regression of Y_i on the partialed-out D_i generates the same estimate of ρ as the multiple regression of Y_i on D_i and h(X_i). So rewrite our expression for Y_i once again as:

Y_i = α + ρ·D̃_i + u_i

where[2]

D̃_i = D_i − E[D_i|X_i]
u_i = η_i + ρ·E[D_i|X_i]

Note that this identity comes straight from the fact that Y_i = α + ρ·D_i + η_i. The standard orthogonality condition that we need for a regression of Y_i on D̃_i to generate unbiased estimates of ρ can be written as E[u_i·D̃_i] = 0. Under what circumstances does this condition hold?

E[u_i·D̃_i] = E[η_i·D̃_i] + ρ·E[E[D_i|X_i]·D̃_i]
            = E[η_i·D̃_i] + 0 = E[η_i·(D_i − E[D_i|X_i])]
            = E[η_i·D_i] − E[η_i·E[D_i|X_i]]
            = E[η_i·D_i] − E[E[η_i|X_i]·E[D_i|X_i]]
            = E[E[η_i·D_i|X_i]] − E[E[η_i|X_i]·E[D_i|X_i]]
            = E[E[η_i|X_i]·E[D_i|X_i]] − E[E[η_i|X_i]·E[D_i|X_i]]
            = 0

The second-to-last equality is where unconfoundedness does the work: conditional on X_i, η_i and D_i are independent, so E[η_i·D_i|X_i] = E[η_i|X_i]·E[D_i|X_i].

[2] Formally, D̃ = (I − P_{H(X)})D, where P_{H(X)} = H(H'H)^{-1}H' and H_i = E[D_i|X_i]. The fully transformed regression model would be Ỹ_i = α + ρ·D̃_i + ũ_i, where the ~ operator is defined such that Ã represents variable A after partialing out H. However, since the projection matrix P_H is idempotent, the regression coefficient (D̃'D̃)^{-1}(D̃'Ỹ) is unaffected by whether or not we partial out H from Y.
So if we have the unconfoundedness assumption, and if we know the CEF h(X_i) = E[D_i|X_i], and if we have a homogeneous treatment effect, then we can get consistent estimates of ρ from a regression of Y_i on D_i and h(X_i). Of course, we rarely know h(X_i), but if the CEF is linear then we know that simply including X_i as a set of additional control variables will be sufficient to estimate the CEF (see the Regression-CEF Theorem from the previous notes).[3] Even if the CEF is not linear, including X_i as a set of regressors provides the MMSE linear approximation to the CEF (see the Regression Approximation Theorem from the previous notes). The accuracy of this approximation will often depend on how much extrapolation we are asking of the linear approximation.

[3] Note that the joint-normality case that gives you a linear CEF is guaranteed not to hold in this case because D_i is clearly not normal. Basically, the CEF E[D|X] is not going to be linear unless it is saturated in X.
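As a small illustration of the argument above (my own simulation, not from the notes): when selection is only on an observable X, the naive difference in means is badly biased, while regressions that control for X, or for the CEF E[D|X] itself, recover the constant effect ρ.

```python
import numpy as np

rng = np.random.default_rng(42)
n, rho = 50_000, 2.0
X = rng.normal(size=n)

# Selection on observables: D depends on X, and Y(0) also depends on X.
p = 1 / (1 + np.exp(-1.5 * X))            # E[D | X], nonlinear in X
D = rng.binomial(1, p)
Y0 = 1.0 + 3.0 * X + rng.normal(size=n)   # eta_i = 3*X + noise
Y = Y0 + rho * D                          # constant treatment effect rho = 2

def ols(y, Xmat):
    Xmat = np.column_stack([np.ones(len(y)), Xmat])
    return np.linalg.lstsq(Xmat, y, rcond=None)[0]

naive = Y[D == 1].mean() - Y[D == 0].mean()           # badly biased
adj_linear = ols(Y, np.column_stack([D, X]))[1]       # control for X directly
adj_cef = ols(Y, np.column_stack([D, p]))[1]          # control for h(X) = E[D|X]

print(f"naive: {naive:.2f}  linear in X: {adj_linear:.2f}  control for h(X): {adj_cef:.2f}")
```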
For example, suppose that we assume a linear model of the form:

Y_i = α + ρ·D_i + X_i'β + u_i

We can transform this model to look like our standard model that does not contain covariates:

Y_i − X_i'β = Y_i* = α + ρ·D_i + u_i

In this model we estimate the treatment effect as the difference in means between the treated and control groups. Replacing actual values of the coefficients with their estimates we have:

ρ̂ = Ȳ*_T − Ȳ*_C = (Ȳ_T − β̂'X̄_T) − (Ȳ_C − β̂'X̄_C)
   = (Ȳ_T − Ȳ_C) − β̂'(X̄_T − X̄_C)

From the last line, you can see that if X̄_T ≈ X̄_C, then the precise specification of our linear approximation will generally not be that important. However, if X̄_T is far from X̄_C, i.e. there is not much overlap in the distributions of X for the treated and control samples, then we are performing an extrapolation that will depend heavily on whether we get the functional form right. Of course, as X becomes multidimensional, it becomes harder to define what constitutes X̄_T ≈ X̄_C. The overlap issue is one that we will revisit in nonparametric regression and propensity score matching. For the time being, however, note that the most important assumption is the unconfoundedness assumption (Y_i(0) ⊥ D_i | X_i); without this assumption we have nothing. In contrast, without the linear CEF assumption, we can still fall back upon the Regression Approximation Theorem.
2 Regression Adjustment Application: Krueger (1993)
and DiNardo and Pischke (1997)
Krueger (1993) is an example of a carefully executed regression adjustment application that nevertheless fails to identify a causal effect (ideally I would have an example of a carefully executed regression adjustment application that does identify a causal effect, but those are regrettably rare and, regardless, it's generally impossible to know that a paper got the right answer).[4] Krueger uses CPS data to examine the wage premium for using computers at work. His primary specification is:

ln(W_i) = X_i'β + α·C_i + ε_i

C_i is the treatment of interest, a dummy variable that is unity if an employee uses a computer at work and zero otherwise. W_i corresponds to the employee's hourly wage, and X_i contains other variables that might affect both wages and computer usage. When X_i contains no covariates, Krueger estimates α̂ = 0.33 (using 1989 CPS data). When X_i contains a rich set of covariates (education, experience, race, gender, marital status, etc.), Krueger estimates α̂ = 0.19. When X_i contains a rich set of covariates plus eight occupation dummies, Krueger estimates α̂ = 0.16. Including 48 two-digit industry dummies reduces the coefficient by another 20 percent.

[4] More accurately, Krueger's regressions give us no reason to believe that he identifies a causal effect, although we may still believe that he has done so based on prior knowledge.
As a descriptive exercise, the results are surely valid. Do they have a causal interpretation, i.e. does α̂ represent the return to teaching a computer-illiterate worker how to use a computer? Krueger cautions that "a critical concern in interpreting the OLS regressions reported above is that workers who use computers on the job may be abler workers, and therefore may have earned higher wages even in the absence of computer technology" (pp. 42-43). He presents four empirical facts to argue in favor of a causal interpretation. First, he controls for computer use at home, arguing that this should reduce selection bias. The coefficient on computer use at work is unaffected. Second, he conducts a survey demonstrating that temporary agencies report paying higher wages for computer-literate secretaries (note that this could still be an artifact of selection) and find it profitable to offer computer training. Third, he uses a different data set to confirm the results. Fourth, he demonstrates that occupations which adopted computers most quickly showed higher wage growth. Based on this evidence, he argues, reasonably, that the computer use coefficient may have a causal interpretation (and that increased computer usage can account for one-third to one-half of the increase in the return to education during the late 1980s).
In 1997, John DiNardo and Steve Pischke revisited the question in a paper entitled "The Returns to Computer Use Revisited: Have Pencils Changed the Wage Structure Too?" DiNardo and Pischke replicate Krueger's methodology using a German data set. They find a similar association between computer use and wages, but unlike Krueger they have additional data on usage of office tools such as calculators, telephones, and pencils. Many of these tools demonstrate returns that are almost as high as the return associated with computer usage; in the first specification they present, for example, the coefficient on computer usage is 0.11 while the coefficient on pencil usage is 0.12. To clarify their argument, DiNardo and Pischke define the causal effect of computer usage on wages to be the effect of randomly assigning computer skills to some employees but not to others.[5] Within this framework, it becomes clear that the return to pencils (and hence possibly to computers) must be illusory: virtually every German worker knows how to use a pencil, whereas only 60 percent of jobs involve using a pencil, so the ability to use a pencil cannot be a scarce skill that demands a high premium. It is unlikely that the pencil-using jobs are paying higher wages simply because it is expensive to hire employees that know how to use pencils.

Although DiNardo and Pischke do not prove that the computer premium is illusory, they do demonstrate that Krueger's research design is unable to distinguish between causal relationships and relationships that are due to selection. Is this because the regression approximation of E[C|X] is inaccurate, or is it because the selection on observables assumption does not hold, i.e. it is not true that W_i(0) ⊥ C_i | X_i? Almost surely it is the latter. In the following lectures we will learn methods that can deal with the former, but it is important to keep in mind that these methods are no better than regression if the selection on observables assumption does not hold.

[5] Technically, even this treatment is not truly a treatment, since you can't assign skills; you can only assign training that you hope will create those skills. Nevertheless, DiNardo and Pischke elucidate their argument greatly by clearly specifying the counterfactual they have in mind.

3 Detecting Bias: Altonji, Elder, and Taber (2005)

Altonji, Elder, and Taber (2005) examine the effects of attending a Catholic school on student outcomes. They lack random variation in Catholic school attendance, so they instead regression-adjust their estimates based upon the observed characteristics of the students and their families. This approach is not novel, but they make a nice contribution by applying a variant of the omitted variables bias formula to estimate the potential bias from unobserved characteristics. The critical assumption underlying this estimate of potential bias is that, loosely speaking, the relationship between the treatment and the unobserved characteristics is no stronger than the relationship between the treatment and the observed characteristics.
Consider a model of the form:

Y_i = α + τ·D_i + X_i'β + u_i

D is the treatment of interest, and X are observed determinants of Y that may be correlated with D. Define β and u_i such that u_i is orthogonal to X_i (i.e., β reflects both the causal effect of X on Y and the projection of the unobserved determinants of Y onto X). Using partitioned regression we can rewrite the regression of Y_i on D_i and X_i as:

Ỹ_i = α + τ·D̃_i + u_i

where Ỹ_i and D̃_i are residuals from regressions of Y_i and D_i on X_i respectively. Note that ũ_i = u_i because u_i is already orthogonal to X_i by construction. The bias due to the potential correlation between the partialed-out treatment, D̃_i, and the unobserved determinants, u_i, is:

E[τ̂] − τ = Cov(D̃, Ỹ)/Var(D̃) − τ
          = Cov(D̃, τ·D̃ + u)/Var(D̃) − τ
          = τ·Cov(D̃, D̃)/Var(D̃) + Cov(D̃, u)/Var(D̃) − τ
          = τ·Var(D̃)/Var(D̃) − τ + Cov(D̃, u)/Var(D̃)
          = Cov(D̃, u)/Var(D̃)

What is a reasonable estimate for Cov(D̃, u)/Var(D̃)? Altonji, Elder, and Taber suggest assuming that:

Cov(D, u)/Var(u) = Cov(D, Xβ)/Var(Xβ)   (1)
In other words, assume that the relationship between D_i and u_i (the unobserved determinants of Y_i) is no stronger than the relationship between D_i and X_iβ (the observed determinants of Y_i). This assumption is useful because we can estimate the latter relationship but not the former. Is the assumption reasonable? It implies that a one-unit change in X_iβ is associated with the same change in D as a one-unit change in u_i. This would be true if, for example, the observed determinants of Y_i were randomly chosen from the set of all determinants of Y_i. If the control variables X_i were chosen specifically because they were likely to be correlated with D_i, then the assumption is particularly likely to hold (in a bounding sense). Nevertheless, it is still an assumption.

Given equation (1), it is straightforward to estimate the potential bias. Note that Cov(D, Xβ)/Var(Xβ) = φ, where φ is the regression coefficient from regressing D_i on X_iβ. Thus

Cov(D̃, u)/Var(D̃) = [Var(u)/Var(D̃)] × [Cov(D, u)/Var(u)]
                  = [Var(u)/Var(D̃)] × [Cov(D, Xβ)/Var(Xβ)]
                  = [Var(u)/Var(D̃)] × φ

The equality of Cov(D̃, u) and Cov(D, u) arises from the fact that u_i is uncorrelated with X_i by construction, so its covariance with D̃_i (the residuals of regressing D_i on X_i) is identical to its covariance with D_i. This equality suggests a simple procedure for estimating the potential bias of τ̂:
1. Regress Y_i on X_i. Collect the fitted values from this regression, X_iβ̂. Also save the mean squared error from this regression, σ̂²_u.

2. Regress D_i on X_iβ̂. Save the coefficient from this regression, φ̂.

3. Regress D_i on X_i. Save the mean squared error from this regression, σ̂²_d.

4. Calculate the potential bias as Cov(D̃, u)/Var(D̃) = φ̂·σ̂²_u/σ̂²_d.

5. Compare the potential bias to the actual coefficient estimate τ̂. If τ̂ is much larger than the potential bias (i.e., the ratio is much greater than 1), then it is unlikely that the observed relationship between D_i and Y_i is due solely to selection bias. If τ̂ is of similar magnitude to (or smaller than) the potential selection bias, then it is more plausible that the observed relationship between D_i and Y_i is due solely to selection bias.
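A compact Python sketch of the five steps, using simulated placeholder data; it is only meant to show the mechanics of the bias calculation, not to reproduce Altonji, Elder, and Taber's implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
X = rng.normal(size=(n, 4))                      # observed determinants of Y
gamma = np.array([1.0, 0.5, -0.5, 0.25])
D = (X @ gamma + rng.normal(size=n) > 0).astype(float)
Y = 0.3 * D + X @ gamma + rng.normal(size=n)     # simulated outcome

def ols(y, Z):
    """Return coefficients, fitted values, and mean squared error."""
    Z = np.column_stack([np.ones(len(y)), Z])
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ b
    mse = resid @ resid / (len(y) - Z.shape[1])
    return b, Z @ b, mse

# Step 1: regress Y on X; keep the fitted index and the MSE (sigma2_u).
_, xb, sigma2_u = ols(Y, X)

# Step 2: regress D on the fitted index; keep phi_hat.
bd, _, _ = ols(D, xb)
phi_hat = bd[1]

# Step 3: regress D on X; keep the MSE (an estimate of Var(D_tilde)).
_, _, sigma2_d = ols(D, X)

# Step 4: potential bias.
bias = phi_hat * sigma2_u / sigma2_d

# Step 5: compare to the estimated treatment coefficient.
btau, _, _ = ols(Y, np.column_stack([D, X]))
print("tau_hat:", round(btau[1], 3), " potential bias:", round(bias, 3),
      " ratio:", round(btau[1] / bias, 2))
```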
Note that we impose the null hypothesis that τ = 0 in the calculations above. But even if this hypothesis is violated, it shouldn't affect the potential bias too much unless τ is large enough to generate a high partial R² (uncommon in applied work).

Unfortunately, this procedure is of limited use when the regression of Y_i on X_i has a low R² (a common scenario in applied work). A low R² implies that the ratio of the unexplained variation to explained variation is very high; in other words, the ratio of Var(u) to Var(Xβ) is very large. In this scenario, even a weak relationship between D_i and X_i can still translate into a large potential bias from u_i. Ultimately, you may need to argue that the unobserved factors determining Y_i are less correlated with D_i than the observed factors determining Y_i, but this is not much different from arguing that the selection on observables assumption holds.
ARE 213 Applied Econometrics
UC Berkeley Department of Agricultural and Resource Economics
Selection on Observables Designs:
Part II, Nonparametric Regression
This set of lecture notes discusses nonparametric regression: roughly speaking, regression models that allow us to relax the linearity assumptions that we have maintained to date. There are two reasons why we might be interested in doing this. First, the treatment of interest may be multi-valued, i.e. D_i may not be binary. If D_i is continuous, then we may wish to estimate E[Y_i|D_i] without imposing linearity or other functional form assumptions. If D_i is randomly assigned, then E[Y_i|D_i] will have a causal interpretation.[1] Alternatively, D_i may be binary, and we may believe that the selection on observables assumption holds, i.e. Y_i(0) ⊥ D_i | X_i. If we want to control for E[D|X], and we don't have reason to believe that E[D|X] is a linear function of X, then we may want to use a nonparametric method to control for X (which is equivalent to estimating E[D|X] nonparametrically). We ignore the latter scenario for the moment and concentrate on the former; we will discuss the latter scenario at the end, however, as a segue into matching and the propensity score.

[1] In the medical literature, this function is sometimes referred to as the dose response function.
1 Series Regression
Suppose that we are interested in estimating h(X_i) = E[Y_i|X_i], where X_i is a scalar. This function may have a causal interpretation (if X is randomly assigned), or it may not; that is not the focus at the moment, even though we are still in the Selection on Observables section of the course. As before, we define the CEF residual as:

ε_i = Y_i − E[Y_i|X_i]
Thus Y_i = E[Y_i|X_i] + ε_i, where ε_i is orthogonal to X_i by construction. If E[Y_i|X_i] is a linear function of X_i, then we can estimate E[Y_i|X_i] using a standard linear regression of Y on X. If not, then Xβ gives us the MMSE linear approximation of E[Y_i|X_i]. But what if we want something better than the MMSE linear approximation of E[Y_i|X_i]? Given that we don't know the functional form of E[Y_i|X_i], we are going to have to approximate it using some method. An obvious candidate is a series estimator, analogous to a Taylor series expansion that you might use to approximate a function f(x) near a point a. To review, we can approximate an infinitely differentiable function f(x) in the neighborhood of a using the following expansion:

f(x) ≈ f(a) + [f'(a)/1!]·(x − a) + [f''(a)/2!]·(x − a)² + [f'''(a)/3!]·(x − a)³ + ...

If we choose a = 0 (the Maclaurin series), this expansion becomes

f(x) ≈ f(0) + [f'(0)/1!]·x + [f''(0)/2!]·x² + [f'''(0)/3!]·x³ + ...

This suggests that we might estimate E[Y|X] = h(X) using a high degree polynomial of X; the regression coefficients will simply estimate f(0), f'(0)/1!, f''(0)/2!, etc.

y_i = β_0 + β_1·x_i + β_2·x_i² + ... + β_p·x_i^p + ε_i

If the support of x is of limited range, then the approximation should be accurate, but as the support of x becomes wider it will become less accurate. To address this problem, we include "splines" in the regression that allow the function to change as we move along the support of x. The regression spline model is basically a piecewise polynomial and looks like:

y_i = β_0 + Σ_{j=1}^{p} β_j·x_i^j + Σ_{j=p+1}^{l} β_j·1(x_i > k_{j−p})·(x_i − k_{j−p})^p + ε_i

In other words, the regression spline contains a normal power series up to degree p (x, x², x³, ..., x^p), and then contains splines that kick in at knots k_1, k_2, ..., k_{l−p}. In general, the knots are evenly spaced over the support of x; the number of knots is often determined by looking at the data. Alternatively, one might place knots at various quantiles of x. There is really no set way to choose the knots; like operating the Digital Conveyer, it is more an art than a science.

The most popular form for regression splines is probably the cubic spline. This is implemented as:

y_i = β_0 + β_1·x_i + β_2·x_i² + β_3·x_i³ + β_4·1(x_i > k_1)·(x_i − k_1)³ + ... + β_{3+l}·1(x_i > k_l)·(x_i − k_l)³ + ε_i
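Here is a minimal sketch of the cubic regression spline estimated by OLS, with knots placed at quantiles of x (one of the knot-placement options mentioned above); the data-generating process and the knot locations are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
x = rng.uniform(-3, 3, n)
y = np.sin(2 * x) + 0.5 * x + rng.normal(0, 0.3, n)    # nonlinear CEF plus noise

def cubic_spline_basis(x, knots):
    """Power-series cubic spline basis: 1, x, x^2, x^3, then (x-k)^3 * 1(x>k) for each knot."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.where(x > k, (x - k) ** 3, 0.0) for k in knots]
    return np.column_stack(cols)

knots = np.quantile(x, [0.2, 0.4, 0.6, 0.8])           # knots at quantiles of x
B = cubic_spline_basis(x, knots)
beta = np.linalg.lstsq(B, y, rcond=None)[0]            # plain OLS on the spline basis

# Fitted CEF on a grid of evaluation points.
grid = np.linspace(-3, 3, 61)
fitted = cubic_spline_basis(grid, knots) @ beta
print(np.round(fitted[::10], 2))
```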
The nice thing about series regression is that, if you have any experience with linear regressions, it's very straightforward and easy to implement and understand. It is still focused, however, on estimating the conditional mean, albeit very flexibly. What if you want to estimate an entire distribution, or an entire conditional distribution? Or what if you don't want to have to choose where the knots go? Enter kernel density estimation.
2 Kernel Regression
2.1 Kernel Density Estimation
Suppose that we are trying to empirically estimate a density function f(x). Note that at this point we are not talking about a conditional density function that varies over some argument (e.g., f(y|x)); we are just talking about estimating the marginal density for a single variable, x. One way to do this is to create a histogram. That is to say, divide the support of x into a bunch of evenly spaced bins, count how many observations fall into each bin, and then create a bar graph where the height of each bar is proportional to the number of observations that fall into the corresponding bin. It turns out that the histogram is a special case of the kernel density estimator.[2]

[2] Specifically, it is equivalent to a kernel density estimator that uses a uniform kernel with bandwidth equal to one-half the histogram bin width, evaluated only at the midpoints of the histogram bins.
One problem with a histogram is that, unlike most real density functions, it is discontinuous: the value jumps sharply as you move from one bin to the next. Intuitively, we would prefer an estimator that is smooth as you move across different values of x, like most real densities. How might we do this? Well, one way would be to take some window, say of radius 1 unit (i.e., a total of 2 units wide), and move it along the x-axis. As we move it along, we count how many observations fall within the window. For any given point x*, we define our estimate of f(x*), f̂(x*), as the number of observations that fall within the moving window when it is centered at x* (rescaled so that f̂(x*) integrates to one at the end). This partially solves the smoothing problem, since f̂(x*) will never jump by more than one observation for a small enough change in x. However, f̂(x*) will still be somewhat discontinuous as observations drop into and out of the window. To solve this problem, we could augment our moving window with a moving weighting function (call it a "kernel") that places an increasing amount of weight on an observation as it moves closer to the center of the window. For example, consider a function that places zero weight on an observation that is more than one unit from the window's center, 0.01 weight on an observation that is 0.99 units from the window's center, 0.1 weight on an observation that is 0.9 units from the window's center, 0.2 weight on an observation that is 0.8 units from the window's center, and so on, until the observation gets a weight of 1 when it lies directly at the window's center.[3] By utilizing this weighting function, we ensure that a given observation x_i never abruptly drops out of the moving window; it just smoothly slides out of the window until its influence disappears entirely at one unit from the center or beyond. This, in essence, is what a kernel density estimator is.

[3] This particular weighting function is referred to as the triangle kernel.
Formally, the kernel density estimator is defined as:

f̂_h(x) = (1/(Nh)) · Σ_{i=1}^{N} K((x − x_i)/h)

In this definition, N is the size of the data set (each observation is indexed by i), K(.) is the kernel density function, and h is the kernel bandwidth. How does this estimator work? Basically, it works as I described above, but in the context of the formal definition it works as follows. Suppose that h = 1 and K(u) = 1(|u| < 1)·(1 − |u|), i.e. the triangle kernel that I described above, in which the weight is a linear function of the distance (until a certain point). At a given point x* we have

f̂_1(x*) = (1/N) · Σ_{i=1}^{N} 1(|x* − x_i| < 1)·(1 − |x* − x_i|)

An alternative way of writing this that might be easier to digest is

f̂_1(x*) = (1/N) · Σ_{i=1}^{N} 1(d(x*, x_i) < 1)·(1 − d(x*, x_i))

where d(.) is the Euclidean distance metric. So, at any given point x*, we sum over every observation x_i in the data set (as you might guess, this is rather computationally intensive). If x_i falls more than one unit of distance away from x*, however, it receives a weight of zero and has no input into the sum. If x_i falls less than one unit away from x*, then it receives a positive weight, and the value of that weight increases the closer that x_i is to x*. In this particular case, the weight increases linearly as the distance shrinks, but that is not the case for alternative kernels. In general, however, the weight is weakly increasing as the distance shrinks. Computing this summation at all possible values of x* maps out the entire function f̂_h(x).
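A direct transcription of the estimator into Python, using the triangle kernel and a handful of arbitrary bandwidths; the mixture data are a placeholder.

```python
import numpy as np

def kde_triangle(x_eval, data, h):
    """Kernel density estimate f_hat_h(x) with the triangle kernel."""
    # u has shape (n_eval, n_data): scaled distances (x - x_i) / h.
    u = (x_eval[:, None] - data[None, :]) / h
    K = np.where(np.abs(u) < 1, 1 - np.abs(u), 0.0)    # triangle kernel weights
    return K.sum(axis=1) / (len(data) * h)             # (1 / Nh) * sum_i K(.)

rng = np.random.default_rng(11)
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 0.5, 500)])
grid = np.linspace(-5, 5, 201)

for h in (0.1, 0.5, 2.0):                              # small h: choppy; large h: oversmoothed
    fhat = kde_triangle(grid, data, h)
    print(f"h={h}: integrates to {np.trapz(fhat, grid):.3f}, max density {fhat.max():.3f}")
```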
The preceding discussion ignored the contributions of the bandwidth, h, and the kernel density function, K(.), by choosing h = 1 and K(.) as the triangle kernel. What alternative values might we use for these parameters? The kernel is probably the less important of the two choices. Kernels are defined to integrate to 1; besides the triangle kernel, other common kernels include:

Uniform:       (1/2)·1(|u| < 1)
Epanechnikov:  (3/4)·1(|u| < 1)·(1 − u²)
Gaussian:      (1/√(2π))·exp(−u²/2)
Note that the uniform kernel abruptly drops observations (i.e., they have a weight of 0.5 as long as they're within one unit of distance, and then a weight of 0 as soon as they move beyond one unit), providing less smoothing. The Gaussian kernel, in contrast, never drops out an observation (i.e., all the observations have an effect on f̂_h(x*) at every point x*, although observations that are far from x* have a trivial effect), providing more smoothing.[4]

In most applications, the choice of bandwidth, h, is more important than the choice of kernel. For any given kernel, increasing the bandwidth extends the "reach" of the kernel. Take, for example, the uniform kernel, (1/2)·1(|(x − x_i)/h| < 1). When h = 1, then any observations within a radius of one unit of x will affect the kernel density estimator at x, and all other observations will not. When h = 2, however, then observations within a radius of two units of x will affect the kernel density estimator at x. For any value of h, observations within a radius of h units of x will affect the kernel density estimator at x. The same thing holds for the triangle and Epanechnikov kernels. The Gaussian kernel gives positive weight to all observations at any point x. Nevertheless, increasing h is equivalent to increasing the variance in the Gaussian kernel; this makes the kernel flatter and redistributes weight away from observations close to x towards observations that are further from x (of course, closer observations still receive more weight in absolute terms, but relatively speaking their contribution diminishes).

Increasing h, and thus increasing the reach of the kernel, has the effect of increasing the amount of smoothing in the kernel density estimator. Intuitively, if the kernel bandwidth is miniscule, then at any given point f̂_h(x) will be influenced by just one or two (or zero) observations, so it will vary tremendously as observations move into and out of the kernel's reach. As you increase the bandwidth, however, f̂_h(x) becomes influenced by more and more observations at any given point, and so the addition or deletion of a single observation as you change x has little effect on f̂_h(x). In other words, it looks smooth. So more bandwidth means more smoothing, less bandwidth means less smoothing.

[4] All of these kernels are known as second order kernels, which is to say that their first non-zero moment is the second moment (any good kernel has its first moment, i.e. its mean, equal to zero). Higher order kernels exist, but they are not commonly used.
Smoother seems like it should be better (just watch some beer commercials), and isn't that the purpose of using the kernel density estimator to begin with? Why not just set h to be some very large number? The answer is that there is a tradeoff between reducing variance (smoothing) and increasing bias.[5] Increasing h increases the influence of observations that are far away from x. This introduces bias since, in general, the true density f(x) is not constant; you are performing an extrapolation when you allow observations that are far from x to influence f̂_h(x) at x. In an extreme case, if you set h to be an enormous number, the estimated f̂_h(x) will be basically flat along the support of x.

How do we choose an optimal value for h? Like choosing the number of knots with regression splines, or operating the Digital Conveyer, choosing the value of h is more an art than a science. The preferred method is to simply eyeball it: choose h so that the density is reasonably smooth but still has shape. Alternatively, if you'd really like a formula to do the work for you, you can choose h such that it minimizes the integrated mean squared error (IMSE). The integrated squared error (ISE) is:

ISE(h) = ∫ (f̂_h(x) − f(x))² dx

The IMSE is equal to the mean integrated squared error (MISE):

E[ISE(h)] = ∫ E[(f̂_h(x) − f(x))²] dx

Minimizing this quantity with respect to h yields:

h* = δ · (∫ f''(x)² dx)^(−0.2) · N^(−0.2)

δ varies according to the kernel; δ = (∫ K(u)² du / (∫ u²K(u) du)²)^(0.2). Note that h* depends on the square of the second derivative of f(x) (the true density): if f(x) is highly variable, then ∫ f''(x)² dx will be large, so h* will be smaller. Intuitively, if f(x) fluctuates a lot, then you want a smaller bandwidth because you don't want to extrapolate much from observations that are far from x. Or, to put it more basically, you're not going to approximate a (true) choppy density with an (estimated) smooth density.

Of course, using this formula for h* presupposes that we know f(x), in which case, why are we trying to estimate it? There are two solutions to this problem. First, we can simply assume that f(x) is close to some known distribution, say, the normal density.[6] In that case, h* = 1.364 · δ · N^(−0.2) · s, where s is the sample standard deviation of x. In practice, it is common to use Silverman's plug-in estimate of the optimal bandwidth:

h* = 1.364 · δ · N^(−0.2) · min(s, iqr/1.349)

where iqr is the sample interquartile range (i.e., the distance between the 0.25 quantile and the 0.75 quantile), in order to minimize the effects of extreme outliers.

[5] This is seemingly the tradeoff that we always face in econometrics.

[6] This seems a little stupid because you're presupposing a lot about the thing you're trying to estimate. Hence we won't concentrate on how h* is derived.
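Translated directly into code (the kernel constant δ ≈ 0.776 used below corresponds to a Gaussian kernel and is an assumption on my part, since the formula above leaves δ generic):

```python
import numpy as np

def silverman_bandwidth(x, delta=0.776):
    """Silverman plug-in: h* = 1.364 * delta * N^(-1/5) * min(s, iqr/1.349).

    delta is the kernel constant; 0.776 corresponds to a Gaussian kernel."""
    n = len(x)
    s = x.std(ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 1.364 * delta * n ** (-0.2) * min(s, iqr / 1.349)

rng = np.random.default_rng(13)
x = rng.normal(size=5000)
print(silverman_bandwidth(x))   # roughly 1.06 * N^(-1/5) for normal data
```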
Alternatively, we might think that we should look to the data to get a guess at f(x) when we are choosing h. This method for choosing h is known as cross-validation. We know that the ISE is

ISE(h) = ∫ (f̂_h(x) − f(x))² dx = ∫ f̂_h(x)² dx − 2∫ f̂_h(x)f(x) dx + ∫ f(x)² dx

The last term does not depend on h, so we can omit it from the analysis. We thus have

ISE(h) = ∫ (f̂_h(x) − f(x))² dx = ∫ f̂_h(x)² dx − 2∫ f̂_h(x)f(x) dx + C

Our estimate of the quantity above turns out to be:

CV(h) = (1/(N²h)) · Σ_i Σ_j ∫ K((x_i − x_j)/h − t)·K(t) dt − (2/N) · Σ_{i=1}^{N} f̂_{−i,h}(x_i)

where f̂_{−i,h}(x_i) is the kernel density estimate of f(x) when omitting observation i from the data set. Note that CV(h) no longer depends on knowing f(x); the CV method thus works by choosing h such that we minimize CV(h). Intuitively, the fact that h appears in the denominator of 1/(N²h) puts upward pressure on h. The fact that it appears in ∫ K((x_i − x_j)/h − t)·K(t) dt puts downward pressure on h. To see this, write the term as ∫ K(a − t)·K(t) dt = E[K(a − t)], where we think of t as a random variable with density K(t). As h gets large, a approaches 0, and E[K(a − t)] goes to E[K(t)]. This makes E[K(a − t)] large because, by definition, K(a − t) takes its maximum value at the points at which K(t) has the most probability mass. So a large h makes ∫ K((x_i − x_j)/h − t)·K(t) dt large but 1/(N²h) small, and there's a balancing act.
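A sketch of this cross-validation criterion for a Gaussian kernel, for which the convolution term ∫K(a − t)K(t)dt has the closed form exp(−a²/4)/(2√π) (that closed form is my addition, not derived in the notes); h is then chosen by a simple grid search.

```python
import numpy as np

def cv_score(h, x):
    """CV(h) = (1/(N^2 h)) sum_ij Kbar((x_i-x_j)/h) - (2/N) sum_i fhat_{-i,h}(x_i)."""
    n = len(x)
    a = (x[:, None] - x[None, :]) / h
    # First term: Gaussian-kernel convolution, Kbar(a) = exp(-a^2/4) / (2*sqrt(pi)).
    term1 = np.exp(-a**2 / 4).sum() / (2 * np.sqrt(np.pi) * n**2 * h)
    # Second term: leave-one-out density estimate at each x_i.
    K = np.exp(-a**2 / 2) / np.sqrt(2 * np.pi)
    loo = (K.sum(axis=1) - K.diagonal()) / ((n - 1) * h)
    return term1 - 2 * loo.mean()

rng = np.random.default_rng(17)
x = rng.normal(size=800)
grid = np.linspace(0.05, 1.0, 40)
scores = [cv_score(h, x) for h in grid]
print("CV-chosen bandwidth:", grid[int(np.argmin(scores))])
```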
In practice, cross-validation is computationally burdensome on any data set of moderate to large size. Estimating a kernel density on a data set of approximately 10,000 observations takes about 15 seconds on my MacBook Pro Core Duo using Stata/MP (both cores going, baby). Doing this for hundreds or thousands of different values of h is clearly a time-consuming task.[7]

In most situations it is thus best to just eyeball the choice of h: perhaps start with the Silverman plug-in estimate and then see what happens as you change h in either direction. Regardless of what method you use to choose h, it is important to explore the sensitivity of your kernel density estimate to changes in the bandwidth. For example, what happens if you double h or halve h? Any critical reader is going to ask you whether your estimate is sensitive to the choice of bandwidth.

[7] Note that the number of locations at which f̂_{−i,h}(x_i) will need to be calculated is heavily dependent on the size of the data set.
2.2 Kernel Regression
All of our discussion so far has centered on estimating a univariate density. In mean-land (where all we focus on is means), this is equivalent to focusing on estimating E[Y]. But what if we want to estimate a multivariate density or a conditional density? The latter is equivalent to estimating E[Y|X] in mean-land. We will get to that in a minute, but first consider the case in which we want to estimate a bivariate density, f(x, y). The multivariate kernel density estimator is a simple extension of the univariate kernel density estimator. In the bivariate case we have:

f̂_h(x, y) = (1/(Nh²)) · Σ_{i=1}^{N} K((x − x_i)/h, (y − y_i)/h)
Of course, we now need a bivariate kernel. Multivariate kernels are generally just the product of several univariate kernels. For example, bivariate versions of the uniform, triangle, and Epanechnikov kernels are:

Uniform:       (1/4)·1(|u_1| < 1)·1(|u_2| < 1)
Triangle:      ((1 − |u_1|)·1(|u_1| < 1)) · ((1 − |u_2|)·1(|u_2| < 1))
Epanechnikov:  (9/16)·(1(|u_1| < 1)·(1 − u_1²)) · (1(|u_2| < 1)·(1 − u_2²))

So a multivariate kernel density estimator using the triangle kernel is:

f̂_h(x, y) = (1/(Nh²)) · Σ_{i=1}^{N} ((1 − |(x − x_i)/h|)·1(|(x − x_i)/h| < 1)) · ((1 − |(y − y_i)/h|)·1(|(y − y_i)/h| < 1))

This is just a natural extension of the univariate kernel density estimator; the same intuition still holds. It's possible (and often desirable) to choose different bandwidths for different variables; the h² term becomes h_1·h_2, the h associated with x becomes h_1, and the h associated with y becomes h_2. There is, however, a dark side to this increase in dimensionality, but we'll worry about that later.
Although estimating bivariate (or multivariate) densities is of some interest, our real focus lies in estimating conditional densities, f(y|x), and conditional expectations, E[y|x]. The conditional density f(y|x) equals f(x, y)/f(x). Thus we can estimate f(y|x) as

f̂_h(y|x) = f̂_h(x, y)/f̂_h(x) = (1/h) · Σ_{i=1}^{N} K((x − x_i)/h, (y − y_i)/h) / Σ_{i=1}^{N} K((x − x_i)/h)
We can then use this conditional density to nonparametrically estimate the conditional expectation of y given x, E[y|x]. We know that E[y|x] = ∫ y·f(y|x) dy. Thus a nonparametric estimator for E[y|x] is:

∫ y·f̂_h(y|x) dy = ∫ y·[Σ_{i=1}^{N} K((x − x_i)/h, (y − y_i)/h)] / [h·Σ_{i=1}^{N} K((x − x_i)/h)] dy

Focus for the moment on the numerator of the fraction above; the denominator is constant with respect to y, so we can pull it out of the integral and set it aside for now. Getting rid of the denominator, and using the fact that the bivariate kernel is the product of two univariate kernels, we have

∫ y·Σ_{i=1}^{N} K((x − x_i)/h, (y − y_i)/h) dy = Σ_{i=1}^{N} K((x − x_i)/h) ∫ y·K((y − y_i)/h) dy

Time to integrate by substitution. Let u = (y − y_i)/h. We get:

Σ_{i=1}^{N} K((x − x_i)/h) ∫ K(u)·(uh + y_i)·h du

So we have

Σ_{i=1}^{N} K((x − x_i)/h) ∫ K(u)·(uh² + h·y_i) du = Σ_{i=1}^{N} K((x − x_i)/h)·(h²·∫ u·K(u) du + h·y_i·∫ K(u) du)

But we know that the first moment of any kernel K(u) equals 0 (i.e., ∫ u·K(u) du = 0) and that any kernel K(u) must integrate to 1 (i.e., ∫ K(u) du = 1). We therefore have

Σ_{i=1}^{N} K((x − x_i)/h)·(h²·∫ u·K(u) du + h·y_i·∫ K(u) du) = h·Σ_{i=1}^{N} K((x − x_i)/h)·y_i

Picking up the denominator that we set aside earlier, we finally have:

∫ y·f̂_h(y|x) dy = Σ_{i=1}^{N} K((x − x_i)/h)·y_i / Σ_{i=1}^{N} K((x − x_i)/h)
Pretty neat, and much simpler, huh? This procedure is called kernel regression, for reasons that should be obvious. For the intuition of what this is doing, note that we could write it as Ê[y|x] = (1/W)·Σ_{i=1}^{N} w(x, x_i)·y_i, where W = Σ_i w(x, x_i). In other words, this is just a weighted mean of the y_i. How are the weights determined? The weights are determined by the kernel function, K((x − x_i)/h). Thus, for a given value x*, Ê[y|x*] is a weighted sum of the y_i's, where the weights are decreasing for observations (y_i, x_i) that lie far from x*. In other words, Ê[y|x*] is essentially the average value of y_i for observations in which x_i is close to x*. Again, the bandwidth determines how much extrapolation occurs: if h is high, then observations for which x_i is far from x* can still influence Ê[y|x*]. This will make Ê[y|x*] smoother at the cost of potentially increasing bias.
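The final formula above translates into a few lines of code; this sketch uses a Gaussian kernel and hand-picked bandwidths purely for illustration.

```python
import numpy as np

def kernel_regression(x_eval, x, y, h):
    """Kernel regression: weighted mean of y_i with kernel weights in x."""
    u = (x_eval[:, None] - x[None, :]) / h
    K = np.exp(-u**2 / 2)                  # Gaussian kernel (constants cancel in the ratio)
    return (K * y).sum(axis=1) / K.sum(axis=1)

rng = np.random.default_rng(19)
n = 1500
x = rng.uniform(0, 10, n)
y = np.log(1 + x) + 0.3 * np.sin(3 * x) + rng.normal(0, 0.2, n)

grid = np.linspace(0.5, 9.5, 10)
for h in (0.1, 0.5, 2.0):                  # more h means more smoothing
    print(f"h={h}:", np.round(kernel_regression(grid, x, y, h), 2))
```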
Multivariate kernel regression is a simple extension of bivariate kernel regression. If we have two righthand side variables, x_1 and x_2, then we can nonparametrically estimate Ê[y|x_1, x_2] as

∫ y·f̂_h(y|x_1, x_2) dy = Σ_{i=1}^{N} K((x_1 − x_{1i})/h)·K((x_2 − x_{2i})/h)·y_i / Σ_{i=1}^{N} K((x_1 − x_{1i})/h)·K((x_2 − x_{2i})/h)
2.3 Lowess Regression
Kernel regression takes a weighted mean of the y_i's that are near x when estimating E[y|x]. In this sense, it is a local constant estimator: E[y|x] is assumed to be constant within a neighborhood near x (of course, for a small enough neighborhood, i.e., a small enough bandwidth, this should be true, but that doesn't mean that we have enough data to work with such a bandwidth). What if we instead used a local regression estimator? That is to say, what if we ran a weighted regression on the observations that are near x when estimating E[y|x]? This is the basic idea underlying Lowess regression; it's basically a kernel regression regression.[8]

[8] You will be forgiven if, at this point, you start having visions of the episode of Da Ali G Show featuring Boutros Boutros Boutros-Ghali.
At first glance, running a local regression may seem somewhat redundant. After all, a regression line always passes through (x̄, ȳ), so if the data used to estimate Ê[y|x*] are centered at x*, then Ê[y|x*] will equal the average of the y_i's in the neighborhood of x* regardless of whether we take a mean of y or run a regression of y on x and evaluate the fitted line at x*. There are two reasons, however, that the local regression estimator can give different estimates than the local constant estimator. First, the data used to estimate the regression are not necessarily centered at x*; they can easily be skewed one direction or the other. Second, we can (and often do) include higher order terms in the regression (e.g., a quadratic or cubic in x).
The algorithm for running a Lowess regression, introduced in Cleveland (1979), is as follows:

(1) For each data point x_i in the data set, run a weighted regression of y_j on x_j and x_j² (or higher order terms of x), where the weight for each observation j in the regression is generated by a tricubic kernel: K(x_i, x_j) = 1(|x_i − x_j|/h_i < 1)·(1 − (|x_i − x_j|/h_i)³)³. Note that for each data point we are running a regression potentially containing the entire data set. The kernel bandwidth, h_i, is the distance to the rth nearest x_j, i.e. the bandwidth varies by x_i so that the regression window always contains r observations. For each point x_i, Ê[y_i|x_i] = β̂_{0i} + β̂_{1i}·x_i + β̂_{2i}·x_i², where the coefficient estimates come from the local regression estimated in the neighborhood of x_i.[9]

(2) Let ε̂_i = y_i − Ê[y_i|x_i]. Define weights δ_j = 1(|ε̂_j/(6s)| < 1)·(1 − (ε̂_j/(6s))²)², where s is the median of |ε̂_i|.

(3) Generate a new set of Ê[y_i|x_i] for each observation i by fitting a weighted regression of the same order as in step (1) using weights δ_j·K(x_i, x_j) (note that i indexes the observation at which the regression is being fit, and j indexes all the observations in that regression).

(4) Repeat steps (2) and (3) t times. For the final pass, compute Ê[y|x] for all values of x (as with kernel regression) rather than just the values x_i that appear in the data set. These are your Lowess regression fitted values.

[9] Note the slight departure from the kernel regression estimates in that we are only estimating the regression at actual points x_i that occur in the data set, not any arbitrary x. This is because we are going to generate residuals in the next step that will be used to downweight outliers.
Lowess regression is a fairly popular way of smoothing data or nonparametrically estimating conditional expectations. It is quite similar to kernel regression, but with three key differences. First, it is a locally weighted regression of the y_i's near x, rather than a locally weighted mean of the y_i's near x. Second, it uses a variable size bandwidth (h_i) that increases in places where the x_i's are sparse and decreases in places where the x_i's are dense. Finally, it uses a multiple pass procedure that downweights outliers when it runs the local regressions (hence the δ_j terms).
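A stripped-down sketch of step (1) only (local quadratic fits with tricube nearest-neighbor weights), skipping the robustness reweighting of steps (2)-(4); it is illustrative rather than Cleveland's full algorithm.

```python
import numpy as np

def lowess_pass(x, y, r):
    """One Lowess pass: local quadratic WLS at each x_i with tricube nearest-neighbor weights."""
    fitted = np.empty_like(y)
    for i, xi in enumerate(x):
        d = np.abs(x - xi)
        h_i = np.sort(d)[r]                          # bandwidth = distance to rth nearest point
        w = np.where(d / h_i < 1, (1 - (d / h_i) ** 3) ** 3, 0.0)   # tricube kernel
        B = np.column_stack([np.ones_like(x), x - xi, (x - xi) ** 2])   # centered at xi
        W = np.sqrt(w)                               # weighted LS via row scaling
        beta = np.linalg.lstsq(B * W[:, None], y * W, rcond=None)[0]
        fitted[i] = beta[0]                          # local fit evaluated at x = xi
    return fitted

rng = np.random.default_rng(23)
n = 400
x = np.sort(rng.uniform(0, 10, n))
y = np.sin(x) + rng.normal(0, 0.3, n)
yhat = lowess_pass(x, y, r=60)                       # window of 60 nearest neighbors
print(np.round(yhat[::40], 2))
```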
3 The Curse of Dimensionality
The nonparametric methods discussed above (series regression, kernel density estimation, kernel regression, and Lowess regression) generally work well when estimating a univariate density or the expectation of Y conditional on a single X. If we are interested in recovering causal estimates, however, then we rarely have just a single explanatory variable (unless the treatment is randomly assigned). Instead, we often assume that the treatment assignment, D_i, is independent of potential outcomes, Y_i(0), conditional on an entire set of covariates, X_i. We demonstrated in the previous notes that, if this selection on observables assumption holds, and we have a constant treatment effect, then it is sufficient to simply control for E[D_i|X_i] when regressing Y_i on D_i. We rarely know the functional form of E[D_i|X_i], but using the nonparametric procedures presented above we can, in principle, estimate E[D_i|X_i] without any functional form assumptions.
In practice, however, we are plagued by the "Curse of Dimensionality." If we want to estimate E[D|X] via kernel regression, then we need to essentially estimate a multivariate density of dimension r, where r is the number of variables in X. However, it turns out that the sparsity of the data grows exponentially with r: the more dimensions you have, the less and less data you have to estimate E[D|X*] at any given point X*.[10] This fact is known as the widely feared Curse of Dimensionality.

It is thus generally infeasible to apply fully nonparametric methods when X contains more than a couple of variables. Although you can sometimes reduce the dimensionality problems by making various parametric assumptions (e.g., assuming an additive structure between the different elements of X when using series estimators), you can never truly defeat the Curse of Dimensionality. It is, after all, a curse.

[10] The same thing holds with series regressions. Essentially, as you add additional variables to X, you have to include increasingly large numbers of interactions between all the powers and splines of the various x variables, and the whole thing starts to blow up.

4 Additional References

Cleveland, William. "Robust Locally Weighted Regression and Smoothing Scatterplots." Journal of the American Statistical Association, 1979, 74, 829-836.
ARE 213 Applied Econometrics
UC Berkeley Department of Agricultural and Resource Economics
Selection on Observables Designs:
Part III, Matching, Dimensionality Reduction, and the Propensity Score
This set of lecture notes discusses matching under the assumption of unconfoundedness (i.e., selection on observables). The idea behind matching is very simple. If Y_i(0), Y_i(1) ⊥ D_i | X_i, then we can estimate τ(x) = E[Y_i(1) − Y_i(0) | X_i = x] because the treatment is effectively randomly assigned after conditioning on X_i. In the last lecture, we learned a variety of techniques for nonparametrically estimating conditional expectations. These techniques (in particular, kernel regression) have a close linkage to a nonparametric technique known as matching. The idea behind matching is to compare treated units (D_i = 1) to control units (D_i = 0) that have similar values of X_i. This guarantees that every treatment-control comparison is performed on units with identical (or close to identical) values of X_i, so we are literally conditioning on X_i = x. Given the selection on observables assumption, we know that D_i is as good as randomly assigned after conditioning on X_i, so we should get causal estimates.
1 Matching
For every treated unit, i.e. every unit with D_i = 1, the goal of the matching estimator is to find a comparison unit among the controls that has similar values of observable characteristics X_i. It is important to note, however, that this comparison unit need not be a single unit; rather, it can be a composite (i.e., a weighted average) of several different control units that have similar values of X_i.[1] Assume that there are N_T treated units and N_C control units. Define N_T sets of weights, with N_C weights in each set: w_i(j) (i = 1, ..., N_T; j = 1, ..., N_C).

[1] There are obvious efficiency gains to doing this, particularly if there are more control units than treated units.
For each set of weights, let Σ_j w_i(j) = 1. Then the generic matching estimator is:

τ̂_M = (1/N_T) Σ_{i ∈ {D=1}} [ y_i − Σ_{j ∈ {D=0}} w_i(j) y_j ]

In other words, we are simply computing the average difference between the treated units and the composite comparison units. The key to this estimator is how you calculate the weights used to construct the composite comparison units, w_i(j). For example, you could set w_i(j) = 1/N_C. In that case, Σ_j w_i(j) y_j simply equals the control group mean, ȳ_C, for all i, and τ̂_M is just the difference in means between the treated and control groups. Obviously that is not a very exciting estimator, and it does not solve the selection on observables problems.

In general, we want to choose w_i(j) so that it measures the "nearness" of X_j to X_i; w_i(j) is what I will call the distance measure. If X is discrete, then in principle you could choose w_i(j) such that it equals one if X_i = X_j and zero otherwise (you would of course have to rescale w_i(j) by Σ_j w_i(j) so that it summed to one for each i). If X is continuous, then that particular measure won't work, but there are several common choices of distance measures. The most popular is probably nearest-neighbor matching. With nearest-neighbor matching, w_i(j) is a function of the Euclidean distance between X_i and X_j. Specifically, w_i(j) equals one for the control unit with the closest X_j to X_i, where closeness is measured by Euclidean distance ((X_i − X_j)'(X_i − X_j)), and zero otherwise. Thus, w_i(j) selects the nearest (control) neighbor j to treated unit i, and τ̂_M computes the mean difference between each treated unit and its nearest control neighbor. This procedure should produce valid causal estimates under the selection on observables assumption, assuming that there is sufficient overlap between the treated and control groups (we will return to this issue shortly).
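To fix ideas, here is a minimal sketch of the nearest-neighbor matching estimator just described, written in Python on simulated data. The data-generating process, sample sizes, and variable names are hypothetical illustrations, not part of the lecture notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): two covariates, selection on observables,
# constant treatment effect of 1.0.
n = 2000
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))  # true propensity score
D = rng.binomial(1, p)
Y = 1.0 * D + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

X_t, Y_t = X[D == 1], Y[D == 1]   # treated units
X_c, Y_c = X[D == 0], Y[D == 0]   # control units

# For each treated unit i, set w_i(j) = 1 for the control unit with the smallest
# Euclidean distance (X_i - X_j)'(X_i - X_j) and 0 otherwise.
diffs = X_t[:, None, :] - X_c[None, :, :]          # N_T x N_C x k differences
dist = np.einsum("ijk,ijk->ij", diffs, diffs)      # squared Euclidean distances
nearest = dist.argmin(axis=1)                      # index of nearest control neighbor

# Generic matching estimator with nearest-neighbor weights:
# tau_M = (1/N_T) * sum_i [ y_i - y_{j(i)} ]
tau_M = np.mean(Y_t - Y_c[nearest])
print(f"Nearest-neighbor matching estimate: {tau_M:.3f}")
```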
Of course, the choice of units for each component of X_i is arbitrary, so it may not make sense to weight each component equally when computing the distance between two points, as the Euclidean distance metric does. A popular alternative to the Euclidean metric is thus the Mahalanobis distance metric, (X_i − X_j)' Σ_x^{-1} (X_i − X_j), where Σ_x is the covariance matrix of X (note the parallel to GLS). Effectively, you are normalizing the components of (X_i − X_j) by the root of the inverse covariance matrix.[2]

[2] The odd thing about Mahalanobis distance is that, depending on the covariance structure, you can end up in situations in which (10, 10) is closer to (0, 0) than (8, 2). I believe this is because the weight in the inverted covariance matrix can become negative.
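As a small illustration, the sketch below swaps the Mahalanobis metric into the nearest-neighbor step; the covariate matrices are again hypothetical, and the resulting `nearest` indices would simply replace the Euclidean ones in the sketch above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical covariate matrices for treated and control units (k = 2 covariates).
X_t = rng.normal(size=(300, 2))    # treated units
X_c = rng.normal(size=(1200, 2))   # control units
X_all = np.vstack([X_t, X_c])

# Mahalanobis distance: (X_i - X_j)' S^{-1} (X_i - X_j), with S the sample covariance of X.
S_inv = np.linalg.inv(np.cov(X_all, rowvar=False))
diffs = X_t[:, None, :] - X_c[None, :, :]                  # N_T x N_C x k
dist = np.einsum("ijk,kl,ijl->ij", diffs, S_inv, diffs)    # N_T x N_C Mahalanobis distances
nearest = dist.argmin(axis=1)   # Mahalanobis nearest control neighbor for each treated unit
```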
If there are substantially more control than treated units, then it seems somewhat inefficient to choose only N_T nearest neighbors, when N_T < N_C.[3] Why not use more of the information in the control group? This is the idea behind kernel matching. With kernel matching, we set w_i(j) = K(X_i − X_j) / Σ_j K(X_i − X_j), where K(.) is one of the kernel functions that we discussed in the previous lecture, such as the triangle kernel or the Epanechnikov kernel. We could also include a bandwidth term, h, to control the kernel's reach. The advantage of kernel matching over nearest-neighbor matching is that, if there are several control units with X_j's in the neighborhood of X_i, then it makes sense to take the average outcome for all of those control units rather than just to pick a single control unit; the former should be more efficient than the latter. Kernel matching allows us to do this kind of averaging and to create composite comparison control units for each match. The kernel weighting function puts the most weight on the closest control units and less weight on controls that are further away.

[3] In fact, we will almost always end up using fewer than N_T distinct control units, because some comparison units will be picked more than once.
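A sketch of kernel matching weights follows, using an Epanechnikov kernel with a bandwidth term h as described above. The data and the particular bandwidth value are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: one covariate, treated and control samples.
X_t = rng.normal(0.3, 1.0, size=200)   # treated covariate values
X_c = rng.normal(0.0, 1.0, size=800)   # control covariate values
Y_t = 1.0 + X_t + rng.normal(size=200)
Y_c = X_c + rng.normal(size=800)

h = 0.5  # bandwidth controlling the kernel's reach (an arbitrary choice here)

def epanechnikov(u):
    """Epanechnikov kernel: K(u) = 0.75 * (1 - u^2) for |u| <= 1, else 0."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

# w_i(j) = K((X_i - X_j)/h) / sum_j K((X_i - X_j)/h): each treated unit is compared
# to a kernel-weighted average of nearby controls rather than a single neighbor.
K = epanechnikov((X_t[:, None] - X_c[None, :]) / h)   # N_T x N_C kernel values
has_support = K.sum(axis=1) > 0                       # treated units with controls nearby
W = K[has_support] / K[has_support].sum(axis=1, keepdims=True)

tau_kernel = np.mean(Y_t[has_support] - W @ Y_c)
print(f"Kernel matching estimate: {tau_kernel:.3f}")
```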
The link to kernel regression should now be clear: effectively, we are creating each comparison unit by estimating a kernel regression at X_i using the sample of controls. We can increase the number of controls used to create the comparison unit by increasing the reach of the kernel (i.e., raising h), and we know that doing so reduces the variance but increases the bias (you're extrapolating from units with X_j far from X_i).[4] However, the topic of kernel regression also brings up memories of our old nemesis, the Curse of Dimensionality.

[4] While the variance reduction holds at any given point X_i, it's not as clear for the mean of all of the composite comparison units, because there's a lot of duplication in observations between composite comparison units as you increase the bandwidth.

The short story is, the Curse of Dimensionality strikes back with a vengeance in the case of matching. As with kernel regression, increasing the dimension of X increases the sparsity of the data. The more variables you have in X, the less likely you are to find a comparison control unit lying close to any given treatment unit; there are simply too many dimensions to match along. What can be done? Enter propensity score matching.
2 Methods of the Propensity Score

Assume that we have unconfoundedness: (Y_i(0), Y_i(1)) ⊥ D_i | X_i. Also assume that the overlap assumption holds: 0 < P(D_i = 1 | X_i) < 1. Combining these two assumptions, we say that the treatment assignment is strongly ignorable. We know that if we condition on X_i, then we can get a consistent estimate of the ATE by simply comparing the difference in means between treated and control units. In practice, however, it is hard to condition on X_i if X_i is high dimensional. Note that this is effectively because the overlap assumption fails in finite samples: for most observations, it is impossible to find a comparison unit with the opposite treatment assignment and the same value of X.

An important result is that, under strongly ignorable treatment assignment, it is sufficient to condition simply on p(X_i) = E[D_i | X_i], also known as the propensity score. Formally, if we assume (Y_i(0), Y_i(1)) ⊥ D_i | X_i, then

(Y_i(0), Y_i(1)) ⊥ D_i | p(X_i)
Proof

We will show that P(D_i = 1 | Y_i(0), Y_i(1), p(X_i)) = P(D_i = 1 | p(X_i)) = p(X_i). This implies independence of D_i and (Y_i(0), Y_i(1)) after conditioning on p(X_i).

P(D_i = 1 | Y_i(0), Y_i(1), p(X_i)) = E[D_i | Y_i(0), Y_i(1), p(X_i)]
  = E[ E[D_i | Y_i(0), Y_i(1), p(X_i), X_i] | Y_i(0), Y_i(1), p(X_i) ]
  = E[ E[D_i | Y_i(0), Y_i(1), X_i] | Y_i(0), Y_i(1), p(X_i) ]    (p(X_i) is a function of X_i)
  = E[ E[D_i | X_i] | Y_i(0), Y_i(1), p(X_i) ]    (by unconfoundedness)
  = E[ p(X_i) | Y_i(0), Y_i(1), p(X_i) ]
  = p(X_i)

For completeness, note that:

P(D_i = 1 | p(X_i)) = E[D_i | p(X_i)]
  = E[ E[D_i | p(X_i), X_i] | p(X_i) ]
  = E[ E[D_i | X_i] | p(X_i) ]
  = E[ p(X_i) | p(X_i) ]
  = p(X_i)

So P(D_i = 1 | Y_i(0), Y_i(1), p(X_i)) = p(X_i) = P(D_i = 1 | p(X_i)). Since D_i is binary, this implies independence of D_i and (Y_i(0), Y_i(1)) after conditioning on p(X_i). In other words, it is sufficient to merely condition on p(X_i); we don't have to condition on X_i.

Why is it sufficient to condition on the propensity score? Our concern is that units selecting into treatment differ in some meaningful way from units that do not select into treatment, and that this difference is consistently related to the probability of entering treatment. If, however, we only compare units with the exact same probability of treatment, then it is impossible for the differences to be consistently related to the probability of treatment.[5] After conditioning on the propensity score, the units are as good as randomly assigned.

[5] If they were, then we would be using them to estimate the propensity score, or so our unconfoundedness assumption claims.
2.1 Estimating the Propensity Score

Before you can condition on the propensity score, p(X_i) = E[D_i | X_i], you have to estimate it. There are several ways to do this, and it's not clear that one method is uniformly superior, so your choice may be context dependent. The easiest way, suggested by Rosenbaum and Rubin (1983), is to use a flexible logit specification (flexible in the sense that there are interactions between the various components of X_i). Alternatively, one could use the kernel regression methods that we discussed in the previous lecture. Finally, Hirano, Imbens, and Ridder (2003) suggest a variant on a series estimator, the series logit estimator. This is similar to Rosenbaum and Rubin's flexible logit, but it also contains higher order terms; how high the order becomes is a function of the sample size and the dimension of X.[6]

[6] The rule for choosing the order is rather arcane (it should be less than N^{1/9} and more than N^{1/24}). You are probably best off simply choosing something reasonable.
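A minimal sketch of estimating the propensity score with a flexible logit (main effects, squares, and pairwise interactions) using statsmodels is shown below. The data, variable names, and the particular set of flexible terms are hypothetical choices for illustration, not a prescription from the notes.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

# Hypothetical data: treatment indicator D and covariate matrix X (k = 3).
n, k = 5000, 3
X = rng.normal(size=(n, k))
D = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * X[:, 0] - X[:, 1] + 0.3 * X[:, 0] * X[:, 2]))))

# Build a "flexible" design: main effects, squares, and pairwise interactions of X.
terms = [X, X ** 2]
for a in range(k):
    for b in range(a + 1, k):
        terms.append((X[:, a] * X[:, b])[:, None])
design = sm.add_constant(np.hstack(terms))

logit_fit = sm.Logit(D, design).fit(disp=0)   # flexible logit for P(D = 1 | X)
pscore = logit_fit.predict(design)            # estimated propensity scores p-hat(X_i)
print(pscore[:5].round(3))
```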
2.2 Regression Adjusting on the Propensity Score

Once you've estimated the propensity score, the next question is what to do with it, i.e. how to condition on it. One obvious candidate is to simply include it as a regressor, i.e. run the regression:

Y_i = α + ρ D_i + γ p(X_i) + u_i

As we saw in an earlier lecture, controlling for the conditional expectation of D (which is equivalent to controlling for the propensity score when D is binary) is sufficient to generate consistent estimates of ρ under unconfoundedness and the assumption of a constant (i.e., homogeneous) treatment effect. More generally, we may want to interact the propensity score with the treatment indicator if we believe that the treatment effects may be heterogeneous (and that the heterogeneity may vary with X in some consistent manner). To see how this type of heterogeneity might affect our estimator, consider the following model:

Y_i(0) = α + β X_i + u_i
Y_i(1) = Y_i(0) + ρ_1 + ρ_2 X_i

Then

Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0) = D_i (Y_i(0) + ρ_1 + ρ_2 X_i) + (1 − D_i) Y_i(0)
    = α + ρ_1 D_i + ρ_2 X_i D_i + β X_i + u_i

The interaction term between D and X suggests that we should interact the propensity score with treatment status in order to capture the treatment effect heterogeneity. It is thus preferable to run the following specification:

Y_i = α + ρ_1 D_i + ρ_2 D_i p(X_i) + γ p(X_i) + u_i

Note that if you run this regression, the estimated treatment effect for any given X_i is ρ_1 + ρ_2 p(X_i), and the average treatment effect is ρ_1 + ρ_2 p̄, where p̄ is the average propensity score. However, there is little to be gained from using propensity score methods in this manner. There is no guarantee that even the interacted propensity score regression produces consistent estimates of meaningful average treatment effects when there is treatment effect heterogeneity.[7] And it may not produce estimates that are much different from a normal linear regression with X_i as controls. In fact, if we estimated the propensity score using a linear probability model, then we would get the exact same estimate of ρ from the first (i.e., non-interacted) regression that we would get from running Y_i on D_i and all of the terms of X_i that go into the linear probability model (and at least in the latter case we would get the standard errors right). The advantage in using the propensity score is really that it enables us to apply less parametric estimators even if X_i is high-dimensional. Of course, we still have to estimate p(X_i), and our ability to do this nonparametrically is limited if X_i is high-dimensional. So, in practice, conditioning on the propensity score, even with superior methods like blocking and weighting (see below), is often not as advantageous as you might initially believe.[8]

[7] I believe the treatment effect heterogeneity has to be a function of p(X_i), rather than of X_i, to be guaranteed consistent estimates with the interacted propensity score regression model.

[8] Imbens, for example, writes: "Although [propensity-score methods] avoid the high-dimensional nonparametric estimation of the two conditional expectations, they require instead the equally high-dimensional nonparametric estimation of the propensity score. In practice the relative merits of [propensity score vs. doing nonparametric regression] will depend on whether the propensity score is more or less smooth than the regression functions, or whether additional information is available about either the propensity score or the regression functions."
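A sketch of the two propensity-score regressions just described (with and without the D × p(X) interaction) follows. The simulated data and first-stage logit are hypothetical, and the last line applies the ρ_1 + ρ_2·p̄ formula from above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)

# Hypothetical data: selection on a single observable x, heterogeneous effect 1 + 0.5*x.
n = 20000
x = rng.normal(size=n)
D = rng.binomial(1, 1 / (1 + np.exp(-1.5 * x)))
Y = 1.0 * D + 0.5 * D * x + x + rng.normal(size=n)

# First stage: estimate the propensity score with a logit of D on x.
phat = sm.Logit(D, sm.add_constant(x)).fit(disp=0).predict(sm.add_constant(x))

# (1) Non-interacted adjustment: Y = a + rho*D + g*phat + u
m1 = sm.OLS(Y, sm.add_constant(np.column_stack([D, phat]))).fit()

# (2) Interacted adjustment: Y = a + rho1*D + rho2*D*phat + g*phat + u
m2 = sm.OLS(Y, sm.add_constant(np.column_stack([D, D * phat, phat]))).fit()
rho1, rho2 = m2.params[1], m2.params[2]
ate_hat = rho1 + rho2 * phat.mean()   # treatment effect averaged over the score

print(m1.params[1], ate_hat)
```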
2.3 Blocking on the Propensity Score

Another candidate for using the propensity score is to block, or stratify, on the propensity score. That is to say, divide the range of the propensity score into K blocks (Dehejia and Wahba use 20 blocks of width 0.05) and place observations in each block according to their estimated propensity scores, p̂(X_i). Within each block k, compute τ̂_k, the difference in means between treated and untreated observations. Finally, combine all K treatment effect estimates as follows:

τ̂ = Σ_{k=1}^{K} τ̂_k · (N_{1k} + N_{0k}) / N

In other words, the average treatment effect is a weighted sum of the block-level treatment effects, with each block's weight equal to the share of observations contained in that block. Choosing the number of blocks is at the researcher's discretion. One popular algorithm is to start with a given number of blocks (e.g., 10), and check whether the covariates are balanced within each block. If they are not, then split the blocks and check again. Continue until the covariates are balanced.[9] If the covariates remain unbalanced within blocks even when the propensity score is balanced, then you may need to estimate the propensity score more flexibly.

[9] Note that if you have many covariates and many blocks, you should not expect 100% of the covariates to have no significant relationship to the treatment status in every block; some coefficients should be significant simply by chance. A more realistic target would be, for example, to find that only 10% of the covariates are significantly related to treatment status at the 10% level.

The overlap assumption becomes prominent when blocking on the score. When a block contains either zero treated units or zero control units, no estimate of the treatment effect exists for that block, and it must be discarded. Furthermore, because the logit specification forces 0 < p̂(X_i) < 1, it may appear that the overlap assumption is satisfied for all units when in fact it is not. To be safe, one should discard all control units with p̂(X_i) less than the minimum p̂(X_i) in the treated group and all treated units with a p̂(X_i) greater than the maximum p̂(X_i) in the control group.[10]

[10] I am assuming that the minimum p(X_i) occurs in the control group and the maximum p(X_i) occurs in the treated group. If not, perform the trimming so that the minimum p(X_i) is (virtually) the same for both groups and the maximum p(X_i) is (virtually) the same for both groups.

Note that blocking on the score is analogous to matching on the score, in that you are only comparing observations with propensity scores that are close to one another. One could formally implement a matching estimator, however, using one of the methods discussed in Section 1. Dehejia and Wahba, for example, use nearest-neighbor matching as an alternative estimator to blocking on the propensity score (both estimators give similar results in most, but not all, cases).
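A sketch of blocking on an estimated score follows, using fixed-width blocks (the default width of 0.05 mirrors the Dehejia-Wahba choice mentioned above). The example data are hypothetical, and a real application would also check covariate balance within blocks before settling on the block structure.

```python
import numpy as np

def blocking_estimate(Y, D, pscore, width=0.05):
    """Stratify on the propensity score and combine within-block differences in means,
    weighting each block by its share of the retained observations. Blocks lacking
    either treated or control units are discarded."""
    edges = np.arange(0.0, 1.0 + width, width)
    blocks = np.digitize(pscore, edges[1:-1])   # block index for each observation
    total, used_n = 0.0, 0
    for k in np.unique(blocks):
        in_k = blocks == k
        n_t, n_c = (D[in_k] == 1).sum(), (D[in_k] == 0).sum()
        if n_t == 0 or n_c == 0:
            continue   # no treatment-control comparison possible in this block
        tau_k = Y[in_k][D[in_k] == 1].mean() - Y[in_k][D[in_k] == 0].mean()
        total += tau_k * in_k.sum()
        used_n += in_k.sum()
    return total / used_n

# Hypothetical example data.
rng = np.random.default_rng(5)
x = rng.normal(size=10000)
p = 1 / (1 + np.exp(-x))
D = rng.binomial(1, p)
Y = 1.0 * D + x + rng.normal(size=10000)
print(blocking_estimate(Y, D, p))
```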
2.4 Weighting with the Propensity Score

Hirano, Imbens, and Ridder (2003) advocate weighting by the inverse of the propensity score as a method to adjust for differences between treated and control units. To understand this estimation procedure, first consider a simple estimator that takes the difference in means between treated and control units (no conditioning on the propensity score or any regressors):

τ̂_naive = ȳ_T − ȳ_C = Σ D_i Y_i / Σ D_i − Σ (1 − D_i) Y_i / Σ (1 − D_i)

This method is biased because E[Y_i(0) | D_i = 1] ≠ E[Y_i(0)]: the units that select into treatment have different (unobserved) control outcomes than the entire population of units (which includes units that do not select into treatment). Hence using the observed control units to estimate the unobserved control outcomes of the treated units is an invalid strategy. Suppose, however, that we knew the propensity score, p(X_i). If we weighted each treated observation by the inverse of p(X_i), we would find:

E[ D_i Y_i / p(X_i) ] = E[ D_i (D_i Y_i(1) + (1 − D_i) Y_i(0)) / p(X_i) ] = E[ D_i Y_i(1) / p(X_i) ]
  = E[ E[ D_i Y_i(1) / p(X_i) | X_i ] ]
  = E[ E[D_i | X_i] E[Y_i(1) | X_i] / p(X_i) ]    (by unconfoundedness, D_i ⊥ Y_i(1) | X_i)
  = E[ p(X_i) E[Y_i(1) | X_i] / p(X_i) ]
  = E[ E[Y_i(1) | X_i] ]
  = E[Y_i(1)]

Likewise, weighting each control observation by the inverse of 1 − p(X_i) gives us:

E[ (1 − D_i) Y_i / (1 − p(X_i)) ] = E[Y_i(0)]
We can implement the weighting scheme with the following estimator:

τ̂_{p(X)} = (1/N) Σ_{i=1}^{N} ( D_i Y_i / p(X_i) − (1 − D_i) Y_i / (1 − p(X_i)) )

What's the intuition behind this reweighting scheme (which, to me, is the least transparent of the methods)? Recall the problem: the propensity score is not balanced across treated and control groups. Treated observations are, on average, those with X_i's that make them more likely to be treated (which is, from our perspective, bad: covariates are not balanced across treated and controls), as well as those that randomly got treated (which is, from our perspective, good).

Consider any given observation i. Suppose that p(X_i) = 0.80. That means that there is an 80% chance that this observation would end up in the treated group and a 20% chance that it would end up in the control group, i.e., it is four times more likely to be in the treated group relative to the control group. Therefore, on average, there are four of these observations in the treated group for every one that is in the control group. Our weighting scheme fixes this imbalance. When this observation with p(X_i) = 0.80 is in the treated group, we weight it by 1/0.8, effectively upweighting it by a factor of 1.25. When it is in the control group, however, we weight it by 1/0.2, effectively upweighting it by a factor of 5. The control group weight is thus 4 times (5/1.25) the treated group weight, and the observation's frequency in the control group is increased by a factor of four relative to its frequency in the treated group. Our weighting scheme thereby ensures that this observation is equally represented (in expectation) in the treated and control groups. The same analysis holds true for any p(X_i) such that 0 < p(X_i) < 1, so the weighting scheme balances the propensity score across treated and control groups.[11]

[11] Note again the importance of the overlap assumption.

The problem with the estimator given above is that there is no guarantee that the weights will sum to one ((1/N) Σ D_i / p(X_i) equals one in expectation, but it need not equal one in any given sample). We can instead normalize each weighted sum by the actual sum of the weights. Doing this, and plugging in the estimated score in place of the true score, gives us our weighting estimator:

τ̂ = ( Σ_{i=1}^{N} D_i Y_i / p̂(X_i) ) / ( Σ_{i=1}^{N} D_i / p̂(X_i) ) − ( Σ_{i=1}^{N} (1 − D_i) Y_i / (1 − p̂(X_i)) ) / ( Σ_{i=1}^{N} (1 − D_i) / (1 − p̂(X_i)) )

The nice thing about this weighting estimator is that it is, according to Hirano, Imbens, and Ridder, efficient.
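Here is a sketch of the normalized weighting estimator above, taking the estimated propensity score as given; the data and names are hypothetical.

```python
import numpy as np

def ipw_estimate(Y, D, phat):
    """Normalized inverse-propensity-score weighting estimator:
    each weighted sum is divided by the realized sum of its weights."""
    w_t = D / phat                  # weights for treated observations
    w_c = (1 - D) / (1 - phat)      # weights for control observations
    return (w_t @ Y) / w_t.sum() - (w_c @ Y) / w_c.sum()

# Hypothetical example: selection on one covariate, constant effect of 1.
rng = np.random.default_rng(6)
x = rng.normal(size=50000)
p = 1 / (1 + np.exp(-x))
D = rng.binomial(1, p)
Y = 1.0 * D + x + rng.normal(size=50000)
print(ipw_estimate(Y, D, p))   # close to 1 when phat equals the true score
```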
2.5 Dual Methods: Two Are Better Than One

One can enhance most of the methods above by combining them with regression. The advantage of this strategy is that incorporating regression can reduce any remaining bias and potentially enhance the precision of the estimator. Furthermore, it produces an estimator that is "doubly robust"; that is to say, if either the propensity score or the regression function is correctly specified, then the estimator will be consistent. In Guido Imbens' words, you have "two chances to get lucky."

To combine the weighting estimator with regression adjustment, simply run a weighted least squares regression of:

Y_i = α + β X_i + ρ D_i + u_i

where the regression weights are defined as:

w_i = √( D_i / p̂(X_i) + (1 − D_i) / (1 − p̂(X_i)) )

This estimator has the double robustness property. Alternatively, we could combine the blocking procedure with regression by running the following regression within each of the K blocks:

Y_i = α_k + X_i β_k + τ_k D_i + u_i

Then combine all of the τ̂_k estimates together as:

τ̂ = Σ_{k=1}^{K} τ̂_k · (N_{1k} + N_{0k}) / N
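A sketch of the weighting-plus-regression combination follows. One assumption of the sketch: statsmodels' WLS applies its weights to the squared residuals, so the weights passed below are w_i² = D_i/p̂ + (1 − D_i)/(1 − p̂), which is equivalent to multiplying the variables by the w_i defined above. The data are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Hypothetical data: selection on x, constant treatment effect of 1.
n = 20000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))
D = rng.binomial(1, p)
Y = 1.0 * D + x + rng.normal(size=n)

phat = sm.Logit(D, sm.add_constant(x)).fit(disp=0).predict(sm.add_constant(x))

# WLS of Y on (1, x, D) with inverse-propensity weights on the squared residuals.
weights = D / phat + (1 - D) / (1 - phat)
design = sm.add_constant(np.column_stack([x, D]))
dr_fit = sm.WLS(Y, design, weights=weights).fit()
print(dr_fit.params[2])   # coefficient on D: the weighting-plus-regression estimate
```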
2.6 Regression Revisited

Recall that in a previous lecture we demonstrated that if you assume unconfoundedness and a constant treatment effect, then you can consistently estimate ρ with the following regression:

Y_i = α + ρ D_i + γ E[D_i | X_i] + u_i

E[D_i | X_i] is, of course, simply the propensity score. If E[D_i | X_i] is linear, then we can estimate it by simply including X_i as regressors.[12] If E[D_i | X_i] is of unknown functional form, then we can in principle estimate it nonparametrically using kernel or series regression, although in practice we may face the Curse of Dimensionality. What if we drop the assumption of constant treatment effects, however? In that case, we need to account for the possibility that the treatment effect heterogeneity may be related to the covariates in some manner (e.g., perhaps some program has differential effects on high school dropouts and college graduates). Consider the simplest case, in which the treatment effect heterogeneity is only a function of X_i:

Y_i(0) = α + g_0(X_i) + ε_i
Y_i(1) = Y_i(0) + τ + g_1(X_i)

Then we get the following regression model:

Y_i = Y_i(1) D_i + Y_i(0) (1 − D_i)
    = (Y_i(0) + τ + g_1(X_i)) D_i + Y_i(0) (1 − D_i)
    = (τ + g_1(X_i)) D_i + Y_i(0)
    = α + τ D_i + g_0(X_i) + g_1(X_i) D_i + ε_i

[12] Note that, strictly speaking, E[D_i | X_i] is only likely to be linear when you have a saturated model.

So with treatment effect heterogeneity, you have to estimate separate series or kernel estimators for treated observations and nontreated observations, even if you assume unconfoundedness. Conceptually, it may be easier to split the sample into treated and untreated observations and estimate the functions separately on each sample. Let f_0(X_i) = α + g_0(X_i) and f_1(X_i) = α + g_0(X_i) + τ + g_1(X_i). Then

E[Y_i(0) | D_i = 0, X_i] = f_0(X_i)
E[Y_i(1) | D_i = 1, X_i] = f_1(X_i)
E[Y_i(1) − Y_i(0) | X_i] = f_1(X_i) − f_0(X_i)
In practice, we would estimate this quantity using:

(1/N) Σ_i [ f̂_1(X_i) − f̂_0(X_i) ]

Since we observe Y_i(1) when D_i = 1 and Y_i(0) when D_i = 0, we set f̂_1(X_i) = Y_i(1) when D_i = 1 and f̂_0(X_i) = Y_i(0) when D_i = 0. When D_i = 0, our estimate of f_1(X_i) is generated by plugging X_i into the regression fit on the treated subsample, i.e. f̂_1(X_i). When D_i = 1, our estimate of f_0(X_i) is generated by plugging X_i into the regression fit on the control subsample, i.e. f̂_0(X_i). Thus our regression estimator is:

τ̂ = (1/N) Σ_i [ ( D_i Y_i + (1 − D_i) f̂_1(X_i) ) − ( (1 − D_i) Y_i + D_i f̂_0(X_i) ) ]
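A sketch of this imputation estimator is below, using simple linear regressions fit separately on the treated and control subsamples (the notes discuss kernel or series fits; linear fits are used here only to keep the sketch short). The data and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical data with a heterogeneous treatment effect of 1 + 0.5*x.
n = 20000
x = rng.normal(size=n)
D = rng.binomial(1, 1 / (1 + np.exp(-x)))
Y = (1.0 + 0.5 * x) * D + x + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x])

# Fit f1 on the treated subsample and f0 on the control subsample.
b1 = np.linalg.lstsq(X1[D == 1], Y[D == 1], rcond=None)[0]
b0 = np.linalg.lstsq(X1[D == 0], Y[D == 0], rcond=None)[0]
f1_hat = np.where(D == 1, Y, X1 @ b1)   # observed Y_i(1) for treated, imputed otherwise
f0_hat = np.where(D == 0, Y, X1 @ b0)   # observed Y_i(0) for controls, imputed otherwise

tau_hat = np.mean(f1_hat - f0_hat)
print(tau_hat)
```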
Though it may not be readily apparent, this estimator highlights the importance of overlap in the distributions of the X_i for treated and control groups. This is probably easiest to see if you consider estimating f_0 and f_1 using kernel regressions. Consider f̂_1(X_i) for some control observation i. f̂_1(X_i) is estimated by running a kernel regression at X_i (which, remember, comes from a control observation) using the treated data. But if the treated and control distributions of X do not have overlap, then there may be no data in the treated group near X_i, and the kernel regression will need to extrapolate from data points that are far from X_i (i.e., it will need to have a large bandwidth).

If you recall from the initial lecture on regression adjustment, we noted that overlap is important in determining whether the precise specification of a regression will impact our estimated treatment effects. Here, we show that overlap is also important if you are using nonparametric estimators. We have also seen that overlap is important for matching (i.e., if you don't have overlap of the covariates, then you can't find matches) and for the propensity score (if p(X_i) gets close to zero or one, then the weights get enormous if you're doing the weighting procedure, or it becomes hard to find matches if you are blocking or matching). The dominant theme is thus that the estimation technique itself is probably not as important as:

(1) Whether the unconfoundedness assumption holds.
(2) Whether there is overlap in the treatment and control distributions of the covariates.
2.7 Assessing Overlap: LaLonde Revisited

We have seen that all of the estimation procedures are sensitive to whether there is overlap in the distributions of the X's for the treated and control groups. In cases in which we have only one or two variables in X, it is fairly easy to assess the overlap of the covariates: simply plot the distribution of X for the treated group and the control group, and see whether they have similar support. This could be done with a histogram (just don't make the bins too wide), or you could use a kernel density estimator.

In higher dimensional cases, inspecting marginal distributions of single covariates is not as informative. It could easily be the case that the marginal distributions overlap for each covariate, and yet the joint distributions do not overlap. For example, suppose that X contains two covariates: income and race. Further suppose that both the treated and control groups contain a mix of blacks and whites and a mix of rich and poor. When inspecting the marginal distributions of income and race, we appear to have overlap. Suppose, however, that all rich whites appear in the treated group, and that all poor blacks appear in the control group. Obviously, we do not have overlap in important parts of the joint distribution. However, because both the treated and control groups contain rich blacks and poor whites, we appear to have overlap when examining each of the marginal distributions.
One nice thing about the propensity score methods is that they allow you to assess overlap using the propensity score itself; recall that the overlap assumption for the propensity score is 0 < p(X_i) < 1. Simply plot the distribution of the propensity scores for the treated and control groups, and check to see whether there is sufficient overlap in those distributions (insufficient overlap would be neighborhoods of p(X) that contain many estimated scores from one group and few of the other; often these areas will be near the extreme values of 0 and 1). Of course, the accuracy of your assessment will depend on the accuracy of your specification of the score. In this case, it is better to use procedures that do less smoothing (e.g., lower bandwidth with kernel estimators, more higher order terms, interactions, and knots with series estimators); otherwise you may artificially produce overlap in the propensity score distributions by extrapolating from areas in which the propensity score has positive support to areas in which it does not.

If there are some points in the propensity score distributions of the two samples that do not overlap, then you should trim one or both of the samples to address the problem. In practice, this entails discarding some observations with propensity scores above or below a certain level. Imbens performs a reanalysis of the LaLonde (1986) data to demonstrate how this is done. He begins with a table that summarizes the values of key covariates for the treated group, the control group, and the CPS sample from which LaLonde draws his simulated control groups. Table 1 summarizes the values of several of the covariates.
Table 1: Summary Statistics

              Controls        Treated                    CPS
              Mean   S.D.     Mean   S.D.   Diff/SD      Mean    S.D.   Diff/SD
Age           25.05  7.06     25.82  7.16    0.11        33.23   11.05  -0.67
Black         0.83   0.38     0.84   0.36    0.04        0.07    0.26    2.80
Education     10.09  1.61     10.35  2.01    0.14        12.03   2.87   -0.59
Hispanic      0.11   0.31     0.06   0.24   -0.17        0.07    0.26   -0.05
Married       0.15   0.36     0.19   0.39    0.09        0.71    0.45   -1.15
Earnings 74   2.11   5.69     2.10   4.89   -0.00        14.02   9.57   -1.24
Earnings 75   1.27   3.10     1.53   3.22    0.08        13.65   9.27   -1.30
Unempl 74     0.75   0.43     0.71   0.46   -0.09        0.12    0.32    1.77
Unempl 75     0.68   0.47     0.60   0.49   -0.18        0.11    0.31    1.54

Source: Imbens (2007).

Note that, as expected (given the random assignment), the differences between the control group and the treated group are small: less than 0.2 standard deviations in all cases. The differences between the CPS sample and the treated group, however, are substantial. All but one of them is greater than 0.5 standard deviations, and in one case the difference reaches 2.8 standard deviations. Roughly speaking, any difference in means in excess of 0.25 standard deviations is considered large (i.e., imbalanced); estimation methods relying on linear regression will require substantial extrapolation. Note that, unlike the t-statistic, this metric is not a function of the sample size. Using a t-statistic as a measure of balance can be misleading since, as the sample grows, any nontrivial difference in means will necessarily generate a large t-statistic. This property of the t-stat is somewhat deceptive for our purposes because a larger sample is actually more desirable (in that it allows us more freedom to discard observations for which there is no overlap).
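A sketch of the normalized-difference balance metric discussed above (a difference in means scaled by a standard deviation rather than by a standard error) follows. Scaling conventions differ; the version below divides by the standard deviation pooled across the two groups, which is one common choice and an assumption of the sketch rather than something specified in the notes.

```python
import numpy as np

def normalized_difference(x_treat, x_comp):
    """Difference in means divided by the pooled standard deviation.
    Unlike a t-statistic, this does not mechanically grow with the sample size."""
    diff = x_treat.mean() - x_comp.mean()
    pooled_sd = np.sqrt((x_treat.var(ddof=1) + x_comp.var(ddof=1)) / 2)
    return diff / pooled_sd

# Hypothetical example: a covariate that is poorly balanced across groups.
rng = np.random.default_rng(9)
x_treat = rng.normal(10.3, 2.0, size=185)
x_comp = rng.normal(12.0, 2.9, size=15992)
print(round(normalized_difference(x_treat, x_comp), 2))   # flag values beyond roughly 0.25
```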
Table 2 presents results for a wide variety of estimators using the treated data and the full CPS comparison data set, with 1975 earnings as the dependent variable. Since the program did not begin until after 1975, we can be sure that any significant result in this table is simply due to selection bias. Though all the estimators (OLS, propensity score methods, matching, and dual methods) do much better than a simple difference in means between the treated group and the full CPS group (the estimated treatment effect decreases by 77 to 91 percent, depending on the method), they all incorrectly reject the null hypothesis of no treatment effect at a high level of statistical significance. Ironically, the two OLS procedures actually perform better than the more sophisticated procedures in this case. Nevertheless, the take-away message is that none of the procedures performs well if there is not overlap in the covariate distributions between the treated and control samples. In this case, all methods are performing substantial extrapolation in some form or another.

Table 2: Estimates with Earnings 75 As Outcome

                                Effect    S.E.    t-stat
Simple Diff                     -12.12    0.68    -17.8
OLS (parallel)                   -1.15    0.36     -3.2
OLS (separate)                   -1.11    0.36     -3.1
Propensity Score Weighting       -1.17    0.26     -4.5
Propensity Score Blocking        -2.80    0.56     -5.0
Propensity Score Regression      -1.68    0.79     -2.1
Propensity Score Matching        -1.31    0.46     -2.9
Matching                         -1.33    0.41     -3.2
Weighting and Regression         -1.23    0.24     -5.2
Blocking and Regression          -1.30    0.50     -2.6
Matching and Regression          -1.34    0.42     -3.2

Source: Imbens (2007).
Figures 1 through 6, taken from Imbens (2007), plot the distributions of the (estimated) propensity scores for each of the samples. Figures 1 and 2 plot histograms of the propensity scores for the control sample and the treated sample respectively.[13] As expected, given the random assignment, there is a high degree of overlap between these two propensity score distributions.

[Figure 1: histogram of the propensity score for controls, experimental full sample.]
[Figure 2: histogram of the propensity score for treated, experimental full sample.]

[13] In Dehejia and Wahba (1999) the propensity score is generated by running a logit regression in which the dependent variable is an indicator that equals unity if a unit is in the treated group and zero if a unit is in the CPS sample, and the covariates are the ones listed in the table plus other unlisted covariates, with higher order and interaction terms. In Figures 1 and 2, I believe that Imbens may actually estimate the propensity score using only the experimental treated and experimental control groups, which seems strange because the treatment status was randomly assigned. However, one could think of it as being similar to controlling for covariates even when you have random assignment. Regardless, the random assignment explains why virtually all of the p-scores fall between 0.2 and 0.6.

Figures 3 and 4 plot histograms of the propensity scores for the CPS comparison sample and the treated sample respectively. In this case, the propensity score is generated by running a logit regression in which the dependent variable is an indicator that equals unity if a unit is in the treated group and zero if a unit is in the CPS sample. Note the marked lack of overlap between the two distributions: very few of the CPS comparison units have propensity scores exceeding 0.05, while the treated units have propensity scores ranging from 0 to 0.70.

[Figure 3: histogram of the propensity score for controls, CPS full sample.]
[Figure 4: histogram of the propensity score for treated, CPS full sample.]

To address this clear lack of overlap, Imbens implements a simple 0.1 rule in which he drops all observations in either group with propensity scores less than 0.1 or greater than 0.9. This rule reduces the treated population by 24%, from 185 to 141. It reduces the simulated control population by 98%, from 15,992 to 313. Figures 5 and 6 plot the propensity score distributions from the trimmed CPS sample and the trimmed treatment group. Note that the overlap of the two distributions is substantially improved.[14]

[Figure 5: histogram of the propensity score for controls, CPS selected (trimmed) sample.]
[Figure 6: histogram of the propensity score for treated, CPS selected (trimmed) sample.]

[14] Both groups appear to have some propensity scores that fall below 0.1, which one would think would be impossible given the trimming rule. It's likely that Imbens reestimated the propensity scores after he did the trimming; then it would be possible to get estimated propensity scores below 0.1.
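A sketch of the 0.1 trimming rule just described: discard any observation, treated or control, whose estimated propensity score falls below 0.1 or above 0.9. The array names and simulated scores are hypothetical.

```python
import numpy as np

def trim_on_pscore(Y, D, phat, lo=0.1, hi=0.9):
    """Keep only observations with lo <= phat <= hi (the simple 0.1 rule)."""
    keep = (phat >= lo) & (phat <= hi)
    return Y[keep], D[keep], phat[keep]

# Hypothetical usage: after trimming, re-check balance and re-run the estimators
# (and, as footnote [14] suggests, possibly re-estimate the score on the trimmed sample).
rng = np.random.default_rng(10)
phat = rng.beta(0.4, 3.0, size=16177)      # skewed scores, mimicking poor overlap
D = rng.binomial(1, phat)
Y = rng.normal(size=16177) + D
Y_t, D_t, p_t = trim_on_pscore(Y, D, phat)
print(len(Y), "->", len(Y_t), "observations after trimming")
```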
Table 3 presents summary statistics for the treated group and the trimmed CPS sample. Note that there is now much better balance of the covariates between the two groups. All but two of the covariates now have differences in means that are less than 0.25 standard deviations.

Table 3: Summary Statistics

              CPS Controls      Treated
              Mean   S.D.       Mean   S.D.   Diff/SD
Age           26.60  10.97      25.69  7.29   -0.09
Black         0.94   0.23       0.99   0.12    0.21
Education     10.66  2.81       10.26  2.11   -0.15
Hispanic      0.06   0.23       0.01   0.12   -0.21
Married       0.22   0.42       0.13   0.33   -0.24
Earnings 74   1.96   4.08       1.34   3.72   -0.15
Earnings 75   0.57   0.50       0.80   0.40    0.49
Unempl 74     0.92   1.57       0.75   1.48   -0.11
Unempl 75     0.55   0.50       0.69   0.46    0.28

Source: Imbens (2007).
Table 4 presents results for a wide variety of estimators using the trimmed data, in which 24% of treated observations and 98% of CPS observations have been discarded. Two models are run, one using the 1975 earnings data and one using the 1978 earnings data. For the former outcome, we expect no treatment effect (the program did not begin until after 1975). For the latter outcome, Imbens doesn't give the experimental benchmark, but it appears that he is using Dehejia and Wahba's "RE74 Earnings Sample," in which case the experimental benchmark is approximately $1,600 to $1,800.

With the trimmed data, virtually all of the estimators (simple differences, OLS, propensity score methods, matching, and dual methods) do very well. No estimator, including the simple difference in means between the two samples, finds a significant difference in the pre-treatment data, and the estimated treatment effects using the 1978 earnings data range from 1.73 to 2.23 (the experimental benchmark is approximately 1.7). The one exception is the p-score matching estimator, which estimates a treatment effect of only 0.65 using the 1978 earnings data. So, in a sample with a high degree of overlap in the two propensity score distributions, we see that almost all estimators perform well. The important step is thus trimming the samples to achieve overlap; the choice of estimator after that point is a second order consideration. Of course, this is all still conditional on the selection on observables assumption...

Table 4: Estimates Using Trimmed Data

                                Earn 75 Outcome            Earn 78 Outcome
                                Mean    S.E.   t-stat      Mean   S.E.   t-stat
Simple Diff                     -0.17   0.16   -1.1        1.73   0.68   2.6
OLS (parallel)                  -0.09   0.14   -0.7        2.10   0.71   3.0
OLS (separate)                  -0.19   0.14   -1.4        2.18   0.72   3.0
Propensity Score Weighting      -0.16   0.15   -1.0        1.86   0.75   2.5
Propensity Score Blocking       -0.25   0.25   -1.0        1.73   1.23   1.4
Propensity Score Regression     -0.07   0.17   -0.4        2.09   0.73   2.9
Propensity Score Matching       -0.01   0.21   -0.1        0.65   1.19   0.5
Matching                        -0.10   0.20   -0.5        2.10   1.16   1.8
Weighting and Regression        -0.14   0.14   -1.1        1.96   0.77   2.5
Blocking and Regression         -0.25   0.25   -1.0        1.73   1.22   1.4
Matching and Regression         -0.11   0.19   -0.6        2.23   1.16   1.9

Source: Imbens (2007).
3 Additional References

Hirano, Keisuke, Guido Imbens, and Geert Ridder. "Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score." Econometrica, 2003, 71, 1161-1189.
M. Anderson, Lecture Notes 7, ARE 213 Fall 2012 1
ARE 213 Applied Econometrics
UC Berkeley Department of Agricultural and Resource Economics
Cautionary Notes on Selection on Observables Designs:
There Is No Santa Claus

We have reviewed a variety of estimation procedures that are valid under the selection on observables assumption: linear regression, nonparametric regression, matching, and propensity score methods. We have seen that the most important feature for these designs is that the treatment and control groups (either the original ones, or the ones you construct through trimming) have good overlap in terms of their distributions of covariates or propensity scores. So, if you have a rich set of covariates, and you have good overlap, then almost surely you will be able to reproduce the results of an experiment, right? Bwahahahahahahaha. No.

1 Arceneaux, Gerber, and Green: The NSW Euphoria Antidote

It was recently discovered (by political scientists, no less) that the National Supported Work Demonstration was in fact not the only randomized experiment ever run in the history of humankind (shocking, I know). Arceneaux, Gerber, and Green (2006) (henceforth AGG) perform an exercise similar to LaLonde's and Dehejia and Wahba's exercises using data from a large-scale voter mobilization effort (this type of effort is often referred to as a Get Out the Vote campaign). In this effort, households are randomly called and encouraged to vote. Although the calling assignment is random, whether a household is actually contacted is non-random: people often do not answer their phones. Regressing a household's voting behavior on whether or not that household was contacted can thus give biased estimates of the causal effect of encouragement on voting. One way to correct for this bias is to use the original random calling assignment as an instrument for actual contact/encouragement. This estimator will consistently estimate the causal effect of encouragement on voting for households for whom the original calling assignment changed whether or not they were contacted (i.e., households that actually got contacted). In this case, that means that the instrumental variables estimator will estimate TOT, the effect of the treatment (being contacted and encouraged) on the treated (those who were contacted and encouraged). I refer to these estimates as the experimental estimates.

Alternatively, however, we could try to use a matching estimator to condition on the observed covariates and, in that manner, estimate TOT. Specifically, we could find a match for every treated unit, and compare the difference in voter participation for the treated units and their matched pairs. We could then benchmark the results of this matching estimator, which should be valid under the selection on observables assumption, against the experimental estimates. This is exactly what AGG do.

The AGG data have at least one important advantage over the NSW data: the AGG sample size is massive. There are approximately 60,000 treated individuals and almost two million control individuals. All individuals (treated and control) were taken from voter registration lists, which contain detailed information on voting histories and demographic characteristics. Once included in the study, individuals were randomly assigned to treatment or control groups. Obviously, most people (97%) were assigned to the control group.
The first column of Table 1 reports experimental benchmark estimates (i.e., IV estimates). These estimates suggest that voter encouragement raises the probability of voting by approximately 0.3 to 0.5 percentage points. These estimates are precisely estimated and are not significantly different than zero; voter encouragement appears to have no appreciable effect on voting behavior, at least for this population.

Table 1: Effect of Voter Encouragement on Voting

                             Experimental    OLS          Matching
Sample w/o Unlisted No.      0.5             2.7          2.8
                             (0.4)           (0.3)        (0.3)
N                            1,905,320       1,905,320    22,711
Sample w/ Unlisted No.       0.3             4.4          4.4
                             (0.5)           (0.3)        (0.3)
N                            2,474,927       2,474,927    23,467

Source: Arceneaux, Gerber, and Green (2006). Parentheses contain standard errors.

The second column of Table 1 reports OLS estimates that regress voting behavior on whether an individual was contacted, conditioning on a variety of covariates. These covariates include age, household size, gender, contest indicators, county indicators, and two years of previous voting behavior.[1] This is roughly comparable to LaLonde (1986) and Dehejia and Wahba (1999), who have age, education, marital status, gender, race, and two years of prior earnings. In particular, both studies include two years' worth of pre-treatment outcomes, which DW stress as being very important. The OLS estimates range from 2.7 to 4.4 and are highly significant (t-statistics of 9 to 14). These estimates imply that voter encouragement raises the probability of voting by 2.7 to 4.4 percentage points. Clearly the OLS estimates are biased; does this bias occur because of a lack of overlap in the covariates between the treated and untreated groups?

[1] Though AGG do not show the covariate balance across treated versus untreated individuals, there must be substantial imbalance (i.e., covariates can predict whether an individual answers her phone) because controlling for covariates has a strong effect on the OLS estimates.

The third column of Table 1 reports matching estimates that find an exact match in the control group for each treated unit (this is possible because all covariates are discrete and the control group is enormous). They are able to match about 91% of observations using exact matches and 99.9% of observations using slightly less exact matches (e.g., coding age in 3-year intervals and dropping some geographic indicators). The overlap assumption therefore appears to be satisfied, in the sense that close matches can be found for virtually all observations. Nevertheless, matching estimates range from 2.8 to 4.4 and are highly significant. Matching therefore does not appear to solve the selection bias problem, even with excellent overlap in the covariate distributions of the treated and control observations.
2 Conclusions

The AGG experiment demonstrates that even in cases that seem well-suited to the selection on observables design, estimates can be biased pretty badly. The main problem that AGG face (or would face, if they didn't have the experimental estimates) is that it's hard to make the case that they observe all of the important factors determining whether a caller makes contact with an individual or whether an individual turns out to vote. Of course, one could say the same thing about the NSW data, so it's difficult for the real-world econometrician to determine when the selection on observables design does or does not hold.[2] An alternative to the selection on observables design is, of course, the selection on unobservables design. The next section of the course focuses on this category of research designs.

[2] My personal belief is that it's most plausible when the selection was performed by an individual or body that observes the same data that is available to the researcher, e.g., a college admissions officer who does not conduct interviews or read long personal essays. In cases in which units are self-selecting (which to be honest is most cases), it's far less plausible.
M. Anderson, Lecture Notes 8, ARE 213 Fall 2012 1
ARE 213 Applied Econometrics
UC Berkeley Department of Agricultural and Resource Economics
Transitioning From Selection on Observables to Selection on Unobservables:
Bridging the Gap

The purpose of these notes is to draw the practical distinctions between the selection on observables (SOO) designs that we have studied and the selection on unobservables (SOU) designs that we are about to study.

1 Selection on Observables vs. Selection on Unobservables

Formally, we know that selection on observables designs revolve around the assumption that Y_i(0), Y_i(1) ⊥ D_i | X_i: the potential outcomes are independent of the treatment conditional on observable covariates. Selection on unobservables designs relax this assumption. Of course, you can't get something for nothing, so you have to replace it with a new assumption, often the existence of some variable Z_i with the properties Y_i(0), Y_i(1) ⊥ Z_i and Cov(Z_i, D_i) ≠ 0. From a theoretical perspective, the latter strategy looks more challenging than the former, since you now have two assumptions to satisfy rather than one. But how do they differ in practice?

To a first approximation, all research designs that aim to estimate the causal effect of a treatment, D, on some outcome, Y, hinge upon isolating "good" variation in D with which to estimate the effect of D on Y. By "good" variation, we mean variation that is as good as randomly assigned. "Bad" variation is the complement of good variation: it is variation in D that is correlated with the potential outcomes, Y_i(0) and Y_i(1).

The rationale behind a SOO design is that you can identify all of the bad variation in D. You can then eliminate that bad variation, possibly using some nonparametric technique, and estimate the effect of D on Y using the remaining (good) variation.

The rationale behind a SOU design is that you can identify some subset of the good variation in D. You can then set aside that good variation in D and use it to estimate the effect of D on Y, throwing away all the remaining variation in D. Note that some of the remaining variation in D that you throw away will likely be good variation. That's okay; it only affects the precision of our estimates (and our ability to make general statements about average treatment effects).

In practice, then, it becomes clear that the two SOU assumptions are not necessarily harder to satisfy than the single SOO assumption. In fact, they may be easier. This is because SOU only requires you to identify a (possibly small) subset of one type of variation, whereas SOO requires you to identify all of the other type of variation.
2 Selection on Observables: I Know Everything About Everything

Without a doubt, bad research designs populate both the SOO world and the SOU world. That being said, on the margin I think that SOO designs are harder to implement in a credible manner.[1] From a theoretical perspective, a SOO design effectively amounts to making the claim that you know everything of any importance that systematically affects selection into treatment. This is, in my opinion, generally a dubious claim. From a practical perspective, it can be difficult to falsify SOO designs. You cannot compare the covariate balance across the treatment assignment in a SOO design, because the covariates are assumed to be correlated with the treatment (that's why you have to condition on them).[2]

[1] The key words here are "on the margin." A decent SOO design is undoubtedly preferable to a weak SOU design.

[2] You can still test for effects on pre-treatment outcomes, as we saw in the Imbens (2007) exercise with the LaLonde data, but that test is also available in SOU designs. Furthermore, if you are using pre-treatment outcomes as a matching/conditioning variable (which you generally would do), then pre-treatment outcomes will be uncorrelated with the treatment almost by construction.
2.1 Properties of SOO Covariates

A question that often comes up in class is what properties the covariates used in SOO designs should have. For example, should they be "exogenous," and if so, what does that mean? Is it simply sufficient that they be predetermined? Or do they need to have coefficients to which we can attach causal interpretations?

The answer is that they should be predetermined, but they do not have to be as good as randomly assigned. It may be easiest to understand this in the context of propensity score methods. Consider blocking or matching on the propensity score, p(X). Clearly X cannot be endogenous in the sense that changing D changes X. For example, suppose that D is admission to a four year college, and X includes high school performance measures (SAT, GPA, etc.). Would it be acceptable if X included senior year GPA? No, because that is not predetermined: it is (potentially) affected by college admission. In particular, a student that gets admitted to college may slack off more during second semester of senior year than a student that gets rejected and plans on reapplying after two years of community college. Therefore, two students with the same values of X, including senior year GPA, and thus the same propensity scores p(X), could differ in terms of expected potential outcomes if one student has been admitted and the other has been rejected. In this case, the admitted student will probably be, on average, a better student than the rejected student, because her GPA was artificially lowered through being admitted, even though the two students look identical in terms of observed covariates. Conditioning on endogenous variables thereby violates the key SOO assumption, Y_i(0), Y_i(1) ⊥ D_i | X_i.

So covariates do have to be predetermined, "exogenous" in the sense that they cannot be endogenously determined by the treatment. Many econometrics texts, however, use exogeneity interchangeably with the notion of random assignment or independence of the potential outcomes (e.g., x_i is independent of ε_i, where ε_i in the regression models corresponds to the component of Y_i(0) that varies by unit). The covariates do not have to be exogenous in the sense that they are as good as randomly assigned. In other words, we do not need to be able to attach a causal interpretation to the coefficients on X that we get when estimating the propensity score. In fact, if X were randomly assigned, we wouldn't need to remove the variation in D that is related to X; it would be good variation rather than bad variation. So, our model of the propensity score may not have a causal interpretation (in fact, some components of X may not even be things that we can conceive of manipulating). This ties into our point in the first lecture that there is nothing wrong with any given regression per se; it's just a question of how you want to apply it and what kind of interpretation you want to give it.

That being said, the most convincing SOO designs are, in my opinion, ones in which the propensity score model does have a causal interpretation. For instance, consider the college admissions example above. If we had all the data that goes onto college applications (assume that it's all quantifiable data; there is no essay or admissions interview), then we could estimate the probability of being admitted to college conditional on all of these covariates. If we then performed an experimental manipulation of SAT score or high school GPA, in the sense that we took an individual's file and replaced her GPA with a higher GPA, we would expect our model to accurately predict the effect of this change on her chances of getting into college (at least in expectation). So in that sense the coefficients have a causal interpretation, even though the covariates themselves are not randomly assigned (again, we wouldn't need to condition on them if they were!).

For comparison, consider the propensity score model that Dehejia and Wahba estimate. In their model, the selection is being performed by the individual. One covariate that they use is marital status. Their propensity score coefficients do not have any causal interpretation: if we forced an individual to get married or divorced (at least on paper), it's unlikely that their probability of participation would change (in expectation) by the amount implied by the DW propensity score model. Rather, marital status is used as a proxy for other attributes that also affect program participation, and in that sense conditioning on it helps remove bad variation from D. The problem, however, is that you no longer know all of the variables that affect treatment (like you did in the college admission case). Now you are forced to believe that you have included a wide enough range of covariates to proxy for everything that needs to be proxied for and soak up all of the bad variation in D. This assumption may or may not be credible; as we saw in AGG, it can easily fail.

The intuition for other SOO designs, e.g. linear regression or matching, is effectively the same. You can control for or match on covariates that are predetermined, but they do not have to be randomly assigned (again, if they were, you wouldn't need to control for them). They cannot, however, be endogenously determined by D; in that case they are outcomes rather than covariates. To take a very simple example, assume that D corresponds to vegetarianism, Y corresponds to heart disease, and X corresponds to cholesterol level. You do not want to "control for" cholesterol level when estimating the effect of vegetarianism on heart disease, because vegetarianism will affect cholesterol levels directly, so controlling for cholesterol level will likely lead you to underestimate the effect on heart disease. Predetermined characteristics, however, can be controlled for, despite the fact that doing so will not generally produce coefficients on those characteristics that have a causal interpretation.
3 Selection on Unobservables: I Know A Little Bit About A Little Bit

In contrast to SOO designs, with SOU designs you do not have to claim that you can identify all of the good variation in D or all of the bad variation in D. All you have to do is identify some source of clean variation in D, and then use that source of clean variation to estimate the effect of D on Y.[3] Of course, you can still control for covariates in SOU designs to increase precision (just as you could control for them in a randomized experiment). The rules are the same: endogenous covariates that could be affected by the treatment are not okay, covariates that are predetermined (but not necessarily randomly assigned) are okay. The nice thing with SOU designs is that the covariates should be uncorrelated with the good variation in D that you identified, so it's possible to conduct falsification tests using the covariates.[4]

[3] Finding this source of variation is easier said than done.

[4] A special case that combines SOO and SOU designs would be if you had an instrument Z that, conditional on X, was independent of the potential outcomes Y_i(0) and Y_i(1). Obviously in those cases you do not expect covariates to be balanced across Z.

The primary danger in SOU designs is that the clean source of variation that you identify in D may be a small fraction of the total variation in D. This necessarily means that your estimate will have large standard errors (at least in a relative sense). Furthermore, small changes in P(D = 1) imply small changes in Y; to rescale the estimate to correspond to changing D from 0 to 1, you need to inflate it by a large number. But that means that small amounts of bias in your estimator also get inflated (this is, in essence, the weak instruments issue). So SOU designs have their perils as well.
4 Where Does Structural Estimation Fit?

That's a good question; I don't think it fits cleanly in one category or the other. That being said, in terms of the subtitles I suppose it is closer to SOO in that it presumes extremely detailed knowledge of the statistical and economic processes at work. The Heckman Selection Model that we saw earlier is a very simple example of a structural estimator: it puts a lot of, well, structure on the problem in the form of distributional and functional form assumptions. The identification hinges crucially upon whether these assumptions hold.
Table 1 presents the results of a set of Monte Carlo simulations. These simulations are run using a simplification of the selection model that LaLonde presents. The basic structure of the model is:

y_i = D_i + x_i + u_i    (1)

D_i = 1(z_i + x_i + v_i > 0)    (2)

In addition, we assume that u_i and v_i are distributed normally with positive covariance (i.e., there is positive selection). Both z and x are initially assumed to be uncorrelated with u and v. For convenience, I will refer to equation (1) as the structural equation and equation (2) as the selection equation, though you could argue that both equations are structural.
I simulate the results for three estimators. The OLS estimator just regresses y on D and x (and z when appropriate). The IV estimator uses z as an instrument for D (when possible). The Heckman Selection estimator runs a probit of D on z and x, and then uses the coefficients from that probit to construct the selection adjustment term in LaLonde (1986); call it H. It then runs a regression of y on D, x, and H (and z when appropriate).

Column (1) reports the coefficients using the base model outlined above (all simulations use 100,000 observations). As expected, the OLS coefficient, 1.826, is biased upwards due to positive selection on D. The IV and Heckman Selection coefficients, 1.003 and 1.007 respectively, are both unbiased. This is unsurprising since the necessary assumptions for both estimators are satisfied in the base model.
Table 1: Coefficient on D From Simulation

                       (1)      (2)      (3)      (4)      (5)      (6)
OLS                  1.826    1.847    2.878    1.702    2.881    3.094
                   (0.007)  (0.009)  (0.009)  (0.018)  (0.009)  (0.009)
IV                   1.003      N/A    1.007    1.046    1.044      N/A
                   (0.014)           (0.034)  (0.041)  (0.033)
Heckman Selection    1.007    1.003    1.177   -0.070    1.156    1.818
                   (0.011)  (0.016)  (0.025)  (0.038)  (0.025)  (0.075)

Notes: Parentheses contain standard errors.
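To make the setup concrete, the following is a minimal Python sketch of the kind of simulation behind column (1). It is not the exact code used to generate Table 1: the error correlation of 0.5 and the unit coefficients in the selection index are illustrative assumptions, and the "Heckman" step here is the standard two-step control-function version (probit followed by an inverse Mills ratio term).

    # A minimal sketch of the column (1) simulation; parameter values are
    # illustrative assumptions, not the exact values behind Table 1.
    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n = 100_000

    # u, v jointly normal with positive covariance (positive selection)
    u, v = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=n).T
    x, z = rng.normal(size=n), rng.normal(size=n)

    D = (z + x + v > 0).astype(float)   # selection equation (2)
    y = D + x + u                       # structural equation (1), true effect = 1

    ones = np.ones(n)
    X = np.column_stack([ones, D, x])   # outcome-equation regressors
    Z = np.column_stack([ones, z, x])   # instruments / probit regressors

    # OLS of y on D and x (biased upward under positive selection)
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

    # IV: just-identified 2SLS using z as an instrument for D
    b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

    # Two-step selection correction: probit of D on z and x, build the
    # adjustment term H (inverse Mills ratio), then regress y on D, x, and H
    w = Z @ sm.Probit(D, Z).fit(disp=0).params
    H = np.where(D == 1, norm.pdf(w) / norm.cdf(w),
                 -norm.pdf(w) / (1 - norm.cdf(w)))
    b_heck = np.linalg.lstsq(np.column_stack([X, H]), y, rcond=None)[0]

    print("coefficient on D -- OLS: %.3f, IV: %.3f, Heckman: %.3f"
          % (b_ols[1], b_iv[1], b_heck[1]))

With normal errors and a valid exclusion restriction, the OLS estimate lands above the true value of 1 while the IV and two-step estimates center near 1, mirroring the pattern in column (1).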
Column (2) reports the coefficients using a modification of the base model: now the structural equation is y = D + x + z + u. Also, the conditional expectations of u and v are assumed to be linear functions of z and x (i.e., z and x are now correlated with u and v, albeit in a very elementary manner). The OLS of y on D, x, and z is again biased; the coefficient is 1.847. The IV coefficient now cannot be estimated because there is no variable in the selection equation that is excluded from the structural equation. The Heckman Selection coefficient, however, remains estimable and unbiased, at 1.003. Thus we see the advantage of this structural procedure: if our functional form and distributional assumptions are exactly correct, we can identify the parameter of interest even without a clean source of variation in the form of exclusion restrictions or unconfoundedness assumptions. Note, however, that this identification is entirely dependent on getting the assumptions exactly right. If I change the conditional expectation of v to be quadratic in x or the log of z, the Heckman Selection estimate can easily change to 3 or 4 (in fact, it does much worse than OLS in this scenario).

Column (3) reports the coefficients using a minor modification of the base model: everything is exactly the same, except now the distribution of u and v has a bimodal shape rather than a Gaussian shape.[5] OLS continues to be biased, with a coefficient estimate of 2.878. IV is unbiased, because the exclusion restriction for z is still valid regardless of the distributional shape for u and v; the coefficient is 1.007. The Heckman Selection coefficient, however, is now upwardly biased, at 1.177, despite the fact that there is a valid exclusion restriction, the structural and selection equations are perfectly modeled, and everything is exactly as expected, except that we got the distributions of u and v wrong. Thus we see that the Heckman Selection model is sensitive to distributional assumptions even under virtually perfect conditions.

[5] This might happen if, for example, you had several large clusters in the data.
Column (4) reports the coefficients using another modification of the base model: the structural equation is changed to y = D + x^3 + u. u and v remain normally distributed, however. OLS, at 1.702, is again biased. IV is still unbiased (albeit less precise), at 1.046; the exclusion restriction on z still holds despite fiddling with the functional form. The Heckman Selection coefficient is now heavily biased downwards, at -0.070. Thus we see that the Heckman estimator is sensitive to functional form assumptions (in fact, it is more biased than OLS in this case).

Column (5) reports the coefficients using a modification that combines non-normal errors, as in Column (3), and a different form for the structural equation: y = D + ln(x + 5) + u. OLS is biased at 2.881. IV remains unbiased at 1.044 (again, the exclusion restriction still holds). To help the Heckman procedure, I modify it so that it includes cubics in x and z and an interaction between x and z when estimating the selection equation and a cubic in x when estimating the structural equation. We can see that incorporating a flexible functional form helps minimize the bias in the Heckman coefficient; it is 1.156.[6] This should not be surprising: when we flexibly model x in the structural equation, more of the identification should come from the exclusion of z from the structural equation, which is the source of the IV identification. Nevertheless, the Heckman estimator is still biased upwards by about 15%.

[6] Without the flexible terms it is centered around 1.21.
Column (6) reports the coefficients using a modification that combines non-normal errors, as in Column (3), with a different form for the structural equation: y = D + ln(x + 5) + ln(z + 5) + u. Note that z is no longer excluded from the structural equation. OLS, at 3.094, remains biased. IV cannot be estimated because there is no exclusion restriction. I now modify the Heckman Selection model so that it includes cubics in x and z and an interaction between x and z when estimating the selection equation and a cubic in x when estimating the structural equation. Nevertheless, it is still biased, estimating a coefficient of 1.818. At this point all of the identification is coming off of functional form and distributional assumptions, which are very difficult to get exactly right.

This exercise actually uses pretty mundane functional form modifications in that they are smooth and continuous. Breaking the functional form assumptions in more exotic ways (e.g., allowing selection into the sample on x, which could be achieved by simply changing ln(x + 5) to ln(x)) makes the Heckman procedure perform much worse (while IV is, of course, insensitive to selection on x). So in general you probably would not want to use this particular estimation procedure in the context of estimating causal effects: it's either dominated by simple IV (if you have a good exclusion restriction), or it's identified off of detailed assumptions that are unlikely to be defensible.
Of course, in many cases structural estimators (by which I mean estimators which make strong(er) assumptions about functional form or distributions) are the only realistic path. For example, consider estimating a simple model positing that heart disease is a function of smoking status and blood pressure. Since blood pressure is a function of smoking, it is impossible to think of an experimental manipulation that changes smoking but does not affect blood pressure (put another way, you can't condition on an outcome). The only way to estimate this model is to impose more structure: you essentially need to use other variation in blood pressure to estimate how blood pressure affects heart disease, and then write down a model in which the effect of blood pressure on heart disease from this other source of variation is assumed to be exactly equal to the effect of blood pressure on heart disease from smoking. If you want to get inside the "black box" to explore how a treatment works (i.e., what the channels of causation are), then you're going to have to use some sort of structural equation model (conceptually, this is similar to path diagrams that you might see applied in sociology or psychology journals, for example). The thing to keep in mind is that, ultimately, whether you get the functional form right will heavily influence whether you get the estimates right. In that sense, these techniques presume a lot of a priori knowledge, just as SOO designs do.
M. Anderson, Lecture Notes 9, ARE 213 Fall 2012

ARE 213 Applied Econometrics
UC Berkeley Department of Agricultural and Resource Economics
Selection On Unobservables Designs:
Part 1, Fixed Effects and Random Effects Models[1]
Panel data models loosely qualify under the rubric of selection on unobservables designs because they assume that individual-specific time series variation is a valid source of variation for identifying causal effects (i.e. it is as good as randomly assigned).

Panel, or longitudinal, data sets consist of repeated observations for the same units: firms, individuals, or other economic agents. Typically the observations are at different points in time. Let Y_it denote the outcome for unit i in period t, and X_it a vector of explanatory variables. The index i denotes the unit and runs from 1 to N, and the index t denotes time and runs from 1 to T. Typically T is relatively small (as small as two), and N is relatively large. As a result, when we try to approximate sampling distributions for estimators, we typically approximate them assuming that N goes to infinity, keeping T fixed.

Here we will mainly look at balanced panels, where T is the same for each unit. An unbalanced panel has potentially different numbers of observations for each unit. This may arise because of units dropping out of the sample; a practical example would be firms going out of business.

The core research design issue that panel data addresses is the possibility that individual units may differ in important, unobserved ways that affect their outcomes in a manner that is constant over time. From a statistical standpoint, however, the key issue with panel data is that Y_it and Y_is tend to be correlated even conditional on the covariates X_it and X_is.[2]

[1] These notes are partially derived from Guido Imbens' old ARE 213 notes. Some passages have been quoted directly.

[2] Even if you are unconcerned about whether your estimates have a causal interpretation, the statistical issue still exists. In the context of the model below, if you assume that X_it is scalar and let β = Cov(X, Y)/Var(X), you will still need to account for the fact that Y_it and Y_is are correlated.
Consider these possibilities in a linear model setting:

Y_{it} = X_{it}'\beta + c_i + \varepsilon_{it}.

Statistically, the presence of c_i, the unobserved individual effect, creates a correlation between Y_it and Y_is even if ε_it is uncorrelated over time and units. If we assume that E[ε_it | X_i1, ..., X_iT, c_i] = 0, then we can condition on the c_i to estimate the effect of x on y. For the moment, however, we focus on the statistical issue.

The two main approaches to dealing with such issues are fixed effects and random effects. The labels are, unfortunately, rather deceptive. The key distinction is whether we model the correlation between the individual effects, c_i, and the covariates, X_it, or whether we assume that they are independent. Better labels would therefore be correlated and uncorrelated random effects, but the labels fixed and random effects are, by this point, fixed.

In both cases we assume that the vectors of individual outcomes are independent across individuals.

The first assumption we make for random effects (and also for fixed effects) is

Assumption 1 (Strict Exogeneity)
E[\varepsilon_{it} | X_{i1}, \ldots, X_{iT}, c_i] = 0.

Second,

Assumption 2 (Uncorrelated Effects)
E[c_i | X_{i1}, \ldots, X_{iT}] = 0.

Note that these two assumptions imply (somewhat unrealistically) that the composite error term, c_i + ε_it, is uncorrelated with the explanatory variables X_it. Thus an OLS regression of Y_it on X_it will produce causal estimates, assuming that the X_it can be interpreted as treatments.

We will examine some of the ideas and concepts with data from the Card and Krueger (1994) minimum wage paper, which examines the effect of a minimum wage increase in New Jersey on fast-food employment, using Pennsylvania restaurants as a control group. We will examine regressions of full-time employment on wages:

emp_{it} = \beta_0 + \beta_1 wage_{it} + v_{it}.
1 OLS

Given the strict exogeneity and uncorrelated effects assumptions, we can write

Y_{it} = X_{it}'\beta + v_{it},

with v_it = c_i + ε_it, the composite error term, uncorrelated with the covariates. Thus we can use OLS to estimate β:

\hat{\beta}_{ols} = (X'X)^{-1}(X'Y) = \Big( \sum_{i,t} X_{it} X_{it}' \Big)^{-1} \Big( \sum_{i,t} X_{it} Y_{it} \Big).

Under our two assumptions, OLS gives us consistent estimates of β. However, the OLS standard errors will be incorrect because they assume independence across observations. In fact, the existence of c_i implies that observations in different time periods within a given cross-sectional unit will be positively correlated, even if the c_i terms are independent across units (as we assumed). We can write the OLS standard errors as:

V(\hat{\beta}) = \sigma^2_v \Big( \sum_{i,t} X_{it} X_{it}' \Big)^{-1} = \sigma^2_v \Big( \sum_i X_i' X_i \Big)^{-1},

where X_i is the T × K matrix with tth row equal to X_it' (i.e., X_i contains all of the X_it for a given cross-sectional unit i).
We can get robust (clustered) variances that account for the correlated structure of the data by first estimating the residuals as

\hat{v}_{it} = Y_{it} - X_{it}'\hat{\beta}.

The robust variance is then

\hat{V}(\hat{\beta}) = \Big( \sum_i X_i' X_i \Big)^{-1} \Big( \sum_i X_i' \hat{v}_i \hat{v}_i' X_i \Big) \Big( \sum_i X_i' X_i \Big)^{-1},

where \hat{v}_i is the T-vector with tth element equal to \hat{v}_{it}.

The variance estimator above is what you get when you specify the ", cluster(unit)" option in Stata, where unit is the variable containing the cross-sectional identifier. We will discuss the properties of this variance estimator in further detail in a later lecture.
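For concreteness, here is a minimal numpy sketch of this sandwich formula. The stacked array layout and the names y, X, and unit are assumptions of the sketch, not anything specific to the CK data; Stata's clustered option additionally applies a small finite-sample degrees-of-freedom adjustment on top of this.

    # A minimal sketch of the clustered ("sandwich") variance estimator above,
    # assuming y (NT,), X (NT, K), and a cluster-id array unit (NT,).
    import numpy as np

    def ols_with_clustered_se(y, X, unit):
        XtX_inv = np.linalg.inv(X.T @ X)
        beta = XtX_inv @ (X.T @ y)
        resid = y - X @ beta

        # "Meat" of the sandwich: sum over clusters of X_i' v_i v_i' X_i
        meat = np.zeros((X.shape[1], X.shape[1]))
        for g in np.unique(unit):
            idx = unit == g
            score = X[idx].T @ resid[idx]      # K-vector for cluster g
            meat += np.outer(score, score)

        V = XtX_inv @ meat @ XtX_inv           # clustered variance matrix
        return beta, np.sqrt(np.diag(V))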
We apply the clustered variance estimator to the CK data. In these data, the restaurant is the cross-sectional unit i, and there are only two time periods (pre-minimum wage increase and post-minimum wage increase). We find

emp_it = 12.4430 + 1.1103 wage_it
    (OLS)        (4.4951)  (0.9328)
    (Clustered)  (3.9921)  (0.8284)

The clustered variance estimates are, in this case, slightly smaller than the OLS variance estimates, suggesting that correlation within restaurants over time is not seriously biasing the conventional standard errors.[3] This is due in part to the fact that there are only two time periods, so the potential bias in the standard errors is limited.[4]

[3] There are, however, other issues with the standard errors in this research design which we will discuss later.

[4] Particularly because the treatment is negatively correlated within units lying in New Jersey! Again, we will discuss these issues in more depth later.
2 Random Effects (GLS)

Recall that in conventional (i.e., non-panel) data sets with heteroskedastic residuals, we have two options for estimation. One option is to simply estimate the coefficients via OLS (which remains consistent) and use the Eicker-Huber-White robust standard errors to get the standard errors right. The other is to model the heteroskedasticity as a function of X and to then use weighted least squares (a special case of GLS) to estimate the coefficients and standard errors. Both procedures should produce consistent coefficient estimates and standard errors, but the latter is theoretically more efficient than the former (assuming you can model the heteroskedasticity correctly).

The situation with panel data is not different. As we saw above, under our two assumptions we can use OLS to estimate the coefficients and then correct the standard errors using the clustered standard errors. However, in principle we should be able to leverage the correlated structure of the data to produce a GLS estimator that gets the standard errors right and is more efficient than OLS. This, in essence, is what the random effects (RE) estimator is.

To exploit some of the random effects structure, define

\Omega = E[v_i v_i'],

where v_i = (v_{i1}, \ldots, v_{iT})', i.e. it is the vector of all residuals for unit i. In other words, Ω is the within-unit variance/covariance matrix of the residuals; it defines how residuals for the same unit across different time periods are correlated with each other.

The RE error structure consists of an assumption that implies that all residuals within unit i are equally correlated with each other:

Assumption 3 (Random Effects Error Structure)

\Omega = \sigma^2_\varepsilon I_T + \sigma^2_c \iota_T \iota_T' =
\begin{pmatrix}
\sigma^2_\varepsilon + \sigma^2_c & \sigma^2_c & \cdots & \sigma^2_c \\
\sigma^2_c & \sigma^2_\varepsilon + \sigma^2_c & \cdots & \sigma^2_c \\
\vdots & & \ddots & \vdots \\
\sigma^2_c & \sigma^2_c & \cdots & \sigma^2_\varepsilon + \sigma^2_c
\end{pmatrix}.

Here ι_T is a column vector of dimension T with all elements equal to one, i.e. ι_T ι_T' is a T × T matrix full of ones. However, we continue to assume that the residuals across different units are uncorrelated with each other.[5]

[5] Thus if we sorted observations first by i and then by t, the variance/covariance matrix for all of the data would look like a block diagonal matrix, with Ω on the diagonal.
We can exploit the error structure by estimating the variance/covariance matrix Ω and then using weighted least squares:

\hat{\beta}_{RE} = \Big( \sum_i X_i' \hat{\Omega}^{-1} X_i \Big)^{-1} \Big( \sum_i X_i' \hat{\Omega}^{-1} Y_i \Big).

Here X_i is the T × K matrix with tth row equal to X_it'. The consistency of this estimator, like that for the OLS estimator, does not depend on the random effects error structure (we are, after all, simply reweighting the data; we are still assuming that all variation in X is valid variation for estimating β).

The estimator for Ω is

\hat{\Omega} =
\begin{pmatrix}
\hat{\sigma}^2_\varepsilon + \hat{\sigma}^2_c & \hat{\sigma}^2_c & \cdots & \hat{\sigma}^2_c \\
\hat{\sigma}^2_c & \hat{\sigma}^2_\varepsilon + \hat{\sigma}^2_c & \cdots & \hat{\sigma}^2_c \\
\vdots & & \ddots & \vdots \\
\hat{\sigma}^2_c & \hat{\sigma}^2_c & \cdots & \hat{\sigma}^2_\varepsilon + \hat{\sigma}^2_c
\end{pmatrix}.

To get estimates for σ²_ε and σ²_c, first estimate β by OLS, then calculate the residuals

\hat{v}_{it} = Y_{it} - X_{it}'\hat{\beta}.
Estimate the residual variance as

\hat{\sigma}^2_v = \frac{1}{NT - K} \sum_{i=1}^{N} \sum_{t=1}^{T} \hat{v}_{it}^2.

Note that this is simply the mean of the sum of squared residuals for the entire data set (with a degrees of freedom adjustment applied); nothing fancy here, despite the double-sum notation.

Then estimate the variance of the unobserved individual effect as

\hat{\sigma}^2_c = \frac{1}{NT(T-1)/2 - K} \sum_{i=1}^{N} \sum_{t=1}^{T-1} \sum_{s=t+1}^{T} \hat{v}_{it} \hat{v}_{is}.

This is just the mean of the products of residuals from different time periods within the same unit (with a degrees of freedom adjustment). The NT(T − 1)/2 term in the denominator is divided by 2 because the last sum starts at t + 1 which, on average, is halfway to T.

Finally, estimate

\hat{\sigma}^2_\varepsilon = \hat{\sigma}^2_v - \hat{\sigma}^2_c

(or zero if this is negative).
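The whole procedure, from pooled OLS through the variance components to the GLS step, fits in a short sketch. The balanced-panel array layout below (Y of shape (N, T), X of shape (N, T, K)) and the function name are illustrative assumptions.

    # A minimal sketch of the RE (GLS) estimator for a balanced panel.
    import numpy as np

    def random_effects(Y, X):
        N, T, K = X.shape
        Xs, Ys = X.reshape(N * T, K), Y.reshape(N * T)

        # Step 1: pooled OLS and residuals
        b_ols = np.linalg.lstsq(Xs, Ys, rcond=None)[0]
        v = (Ys - Xs @ b_ols).reshape(N, T)

        # Step 2: variance components
        sig2_v = (v ** 2).sum() / (N * T - K)
        cross = sum(v[:, t] @ v[:, s]
                    for t in range(T - 1) for s in range(t + 1, T))
        sig2_c = cross / (N * T * (T - 1) / 2 - K)
        sig2_e = max(sig2_v - sig2_c, 0.0)

        # Step 3: GLS with Omega = sig2_e * I_T + sig2_c * (ones matrix)
        Oinv = np.linalg.inv(sig2_e * np.eye(T) + sig2_c * np.ones((T, T)))
        A = sum(X[i].T @ Oinv @ X[i] for i in range(N))
        b = sum(X[i].T @ Oinv @ Y[i] for i in range(N))
        return np.linalg.solve(A, b), sig2_v, sig2_c, sig2_e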
Applying this procedure to the Card and Krueger data we get \hat{\sigma}^2_v = 79.8707, \hat{\sigma}^2_c = 43.4978, and \hat{\sigma}^2_\varepsilon = 36.3729.

If the model is correct, including our specification for Ω, then the variance for \hat{\beta}_{RE} is

V(\hat{\beta}) = \Big( \sum_i X_i' \hat{\Omega}^{-1} X_i \Big)^{-1}.

If the specification for Ω is not correct, we can still apply the random effects estimator (it's still consistent under our two original assumptions, it's just not efficient anymore) and use the robust (clustered) variance estimator to get the standard errors right:

V(\hat{\beta}) = \Big( \sum_i X_i' \hat{\Omega}^{-1} X_i \Big)^{-1} \Big( \sum_i X_i' \hat{\Omega}^{-1} \hat{v}_i \hat{v}_i' \hat{\Omega}^{-1} X_i \Big) \Big( \sum_i X_i' \hat{\Omega}^{-1} X_i \Big)^{-1}.

Note that the clustered variance estimator relaxes the assumption that the off-diagonal elements of Ω are all the same. In other words, it allows for different correlations between different time periods, whereas the conventional RE standard errors assume that the correlation between different observations for the same unit is always the same, regardless of how far apart the time periods are.

Applying the random effects estimator to the Card and Krueger data gives us

emp_it = 11.6952 + 1.2659 wage_it
    (GLS)        (3.8631)  (0.7995)
    (Clustered)  (3.2317)  (0.6701)

Note that the random effects point estimates are still in the same ballpark as the OLS point estimates. If we believe our assumptions, then the similarity of the point estimates is not surprising; both estimators are consistent under these assumptions.
3 Feasible Generalized Least Squares (FGLS)

In the conventional random effects model, we make fairly strong assumptions about the structure of the within-unit covariance matrix, Ω. An alternative form of Feasible Generalized Least Squares (FGLS) that is less restrictive does not rely on assuming the exact structure of the covariance matrix of the residuals. Instead, it estimates the structure of the covariance matrix using an estimator that is similar to the robust clustered variance estimator that we have been using for the clustered standard errors.

Again we start with OLS estimates, which are consistent but not efficient in a wide range of settings. Estimate the residuals \hat{v}_i as \hat{v}_{it} = Y_{it} - X_{it}'\hat{\beta}_{ols}. Then estimate the residual covariance matrix as

\hat{\Omega} = \frac{1}{N} \sum_{i=1}^{N} \hat{v}_i \hat{v}_i'.

Note that \hat{v}_i \hat{v}_i' forms an estimate of the T × T covariance matrix from the data for each cross-sectional unit. Unlike the conventional random effects estimator, it does not impose the constraint that all of the off-diagonal elements be equal.

Finally, estimate β as

\hat{\beta}_{FGLS} = \Big( \sum_i X_i' \hat{\Omega}^{-1} X_i \Big)^{-1} \Big( \sum_i X_i' \hat{\Omega}^{-1} Y_i \Big).

The advantage of the FGLS estimator relative to the RE estimator in the case with more than two periods is that it allows for a more flexible correlation structure. The disadvantage is that if the RE restrictions are (close to being) satisfied, then you may introduce a lot of extra noise by not exploiting them. This will be particularly true if N is not too large; then the quantity \frac{1}{N}\sum_{i=1}^{N} \hat{v}_i \hat{v}_i' will not be a very accurate estimate of Ω. This is because, regardless of the size of T, each element in the \hat{\Omega} matrix is estimated by only N observations. The RE restriction that all of the off-diagonal elements are equal allows us, in contrast, to estimate the off-diagonal elements using NT(T − 1)/2 observations.

With only two periods FGLS gives us results identical to those for the RE estimator because there is only one unique off-diagonal element in the Ω matrix. Thus the RE structure does not restrict the full variance/covariance matrix of the residuals at all.
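A minimal sketch of this FGLS procedure, under the same illustrative (N, T) / (N, T, K) array layout as the random effects sketch above:

    # A minimal sketch of the FGLS estimator for a balanced panel.
    import numpy as np

    def fgls(Y, X):
        N, T, K = X.shape
        Xs, Ys = X.reshape(N * T, K), Y.reshape(N * T)

        # Step 1: pooled OLS residuals
        b_ols = np.linalg.lstsq(Xs, Ys, rcond=None)[0]
        v = (Ys - Xs @ b_ols).reshape(N, T)

        # Step 2: unrestricted estimate of the T x T residual covariance matrix,
        # (1/N) * sum_i v_i v_i'
        Omega_hat = (v[:, :, None] * v[:, None, :]).mean(axis=0)
        Oinv = np.linalg.inv(Omega_hat)

        # Step 3: GLS using the estimated covariance matrix
        A = sum(X[i].T @ Oinv @ X[i] for i in range(N))
        b = sum(X[i].T @ Oinv @ Y[i] for i in range(N))
        return np.linalg.solve(A, b)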
We can test for the presence of individual unobserved effects using the statistic

S = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \sum_{t=1}^{T-1} \sum_{s=t+1}^{T} \hat{v}_{it} \hat{v}_{is} \approx \sqrt{N}\, \hat{\sigma}^2_c \cdot T(T-1)/2.

If σ²_c = 0, then \hat{\sigma}^2_c will plim to 0, and S will be asymptotically normal:

S \sim N(0, V).

Note that we are effectively treating each of the i units as being independent of each other, and saying that the sum of the unique off-diagonal elements in \hat{\Omega} for each unit has some variance V. In other words, think of the random variable as:

S_i = \sum_{t=1}^{T-1} \sum_{s=t+1}^{T} \hat{v}_{it} \hat{v}_{is}.

Then S = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} S_i, and we know the CLT will apply since the S_i are distributed i.i.d. (we can even relax the "identical" assumption; it's the independence part that is important). All we need to do, then, is estimate Var(S_i) = V.

We can easily estimate V using method of moments as

\hat{V} = \frac{1}{N} \sum_{i=1}^{N} S_i^2 = \frac{1}{N} \sum_{i=1}^{N} \Big( \sum_{t=1}^{T-1} \sum_{s=t+1}^{T} \hat{v}_{it} \hat{v}_{is} \Big)^2.

Under the null hypothesis, \hat{\sigma}^2_c will plim to 0; under the alternative hypothesis, \hat{\sigma}^2_c will plim to something else. Thus our test consists of testing whether S/\sqrt{\hat{V}} is distributed N(0, 1).[6]

In the Card and Krueger data, \hat{\sigma}^2_c = 43.4978, so S = 823.0176, \hat{V} = 21,455, and the test statistic is 5.6188, so we can reject the hypothesis that σ²_c = 0.

[6] Note that the denominator here is estimated using the second moment rather than the second centered moment. Even so, S/\sqrt{\hat{V}} will go to infinity as N gets large under the alternative hypothesis, because S is proportional to \sqrt{N}\hat{\sigma}^2_c while \hat{V} just converges to a fixed quantity.
The OLS, RE, and FGLS estimators all maintain the uncorrelatedness assumption, i.e.
E[c
i
|X
1i
, . . . , X
iT
] = 0.
From a research design perspective, then, these estimators are not that interesting be-
cause they do not allow us to relax the assumption that our variables of interest, X
it
, are
uncorrelated with the composite error term, v
it
= c
i
+
it
, i.e. the Xs that were interested
in still need to be as good as randomly assigned with respect to the errors.
6
Note that the denominator here is estimated using the second moment rather than the second centered
moment. Even so, S/
_

V will go to innity as N gets large under the alternate hypothesis, because S is


proportional to

N
2
c
while

V just converges to a xed quantity.
The Fixed Effects model (FE) allows us to relax this uncorrelatedness assumption. We now only assume strict exogeneity:

E[\varepsilon_{it} | X_{i1}, \ldots, X_{iT}, c_i] = 0.

In other words, we only need assume that ε_it is mean-independent of X_i after conditioning on the individual effects c_i. Any part of the composite error that is time-invariant will get folded into c_i, so we can relax the unconfoundedness assumption for any component in the error term that varies across individuals but not over time. For statistical inference, we actually make a strong assumption (that we will relax later):

E[\varepsilon_i \varepsilon_i' | X_i, c_i] = \sigma^2_\varepsilon I_T.

The idea behind fixed effects is that we want to either estimate the c_i parameters (so that we can control for them) or just get rid of them altogether (so that we don't have to worry about them). We first consider the traditional fixed effects estimator, which consists of simply throwing in a whole mess of dummy variables, one for each unit i.[7] Formally, we implement this by adding an N-dimensional vector of covariates, R_it, with its jth element for unit i in period t equal to:

R_{it,j} = 1(i = j)

In other words, the first element of R equals unity if a given observation corresponds to unit 1 and zero otherwise, the second element of R equals unity if a given observation corresponds to unit 2 and zero otherwise, and the Nth element of R equals unity if a given observation corresponds to unit N and zero otherwise. One dummy for each cross-sectional unit.

[7] Of course, to avoid perfect collinearity, you will have to exclude one of the dummy variables when running the regression.
We can then estimate the FE model using the following linear regression:

Y_{it} = X_{it}'\beta + R_{it}'c + \varepsilon_{it}

where c is now an N × 1 column vector containing all N of the c_i terms. Our estimates of β and c will now be unbiased (given our assumptions), but our estimates of c are not consistent because we do not get more observations with which to estimate each coefficient c_i as we increase N. If this is not intuitive, consider a special case in which X_it does not exist. In that case, \hat{c}_i = \sum_t Y_{it}/T = \bar{Y}_i, and the precision of \bar{Y}_i does not increase as T stays fixed and N increases.
4.1 Within Estimator

In practice, it can be hard to estimate the FE model if N is very large. If N were 10,000, for example, you would effectively be asking the computer to invert a greater than 10,000 × 10,000 matrix when you included R as a matrix of regressors. Fortunately, it turns out that there is a simple demeaning transformation that we can apply to the data that gives us estimates of β that are numerically identical to those produced by the FE estimator. This estimator is known as the within estimator because it identifies β using within-individual variation. Because it generates the same estimates as the FE estimator, people sometimes use the terms FE and within estimator interchangeably.

Define the unit-specific averages for unit i as

\bar{Y}_i = \frac{1}{T} \sum_{t=1}^{T} Y_{it}, \quad \text{and} \quad \bar{X}_i = \frac{1}{T} \sum_{t=1}^{T} X_{it}.

Then define the deviations from the unit-specific means as

\tilde{Y}_{it} = Y_{it} - \bar{Y}_i, \quad \text{and} \quad \tilde{X}_{it} = X_{it} - \bar{X}_i.
The within estimator is based on running the regression:

\tilde{Y}_{it} = \tilde{X}_{it}'\beta + \tilde{\varepsilon}_{it}

Note that we no longer have to invert the mega-matrix in order to run this regression. The key here is that the c_i terms disappear because \tilde{c}_i = c_i - c_i = 0. We are thus left needing to only satisfy the orthogonality condition:

E[\tilde{\varepsilon}_{it} | \tilde{X}_{it}] = 0

This holds true under the strict exogeneity assumption that we began with (you can use iterated expectations to condition on c_i if you want to show this).

Intuitively, it should be clear that the within estimator is equivalent to the FE estimator if you consider doing partitioned regression. If you regress each variable in X on all of the individual specific dummy variables, R, then the coefficients on the dummies will be equal to the individual specific means (recall that regressing a variable on a column of ones estimates the mean of that variable; regressing a variable on an indicator variable for unit i estimates the mean of that variable for unit i). Thus, after we partial out R from X, we will have \tilde{X} = X - R\bar{X}, where \bar{X} is an N × K matrix in which the ith row consists of \bar{X}_i'. Thus \tilde{X}_{it} = X_{it} - \bar{X}_i, which is identical to the within transformation.
We can also write the within estimator in matrix form by defining the T × T matrix:

A = I_T - \iota_T \iota_T'/T =
\begin{pmatrix}
1 - \frac{1}{T} & -\frac{1}{T} & \cdots & -\frac{1}{T} \\
-\frac{1}{T} & 1 - \frac{1}{T} & \cdots & -\frac{1}{T} \\
\vdots & \vdots & \ddots & \vdots \\
-\frac{1}{T} & -\frac{1}{T} & \cdots & 1 - \frac{1}{T}
\end{pmatrix}

Then \tilde{X}_i = A X_i.

One notable property of A is that it is idempotent, so technically you could recover the within estimates by only applying the transformation to the X_i (you wouldn't have to do it to the Y_i).[8] If you wanted to apply the transformation to the entire NT × K matrix of data, X, then it would be

\tilde{X} = \mathbf{A} X

where \mathbf{A} is a block-diagonal NT × NT matrix containing A as its diagonal elements.

[8] Note that AA = (I_T - \iota_T\iota_T'/T)(I_T - \iota_T\iota_T'/T) = I_T - 2\iota_T\iota_T'/T + \iota_T(\iota_T'\iota_T)\iota_T'/T^2 = I_T - \iota_T\iota_T'/T = A.

We should note that the OLS standard errors will be incorrect when running the within estimator because \tilde{\varepsilon}_{it} = \varepsilon_{it} - \bar{\varepsilon}_i will be correlated across different observations within the same unit. To get the correct standard errors when using the within estimator, multiply the OLS standard errors by \sqrt{T/(T-1)}.
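A minimal sketch of the within transformation follows, again assuming a balanced panel stored as Y (N, T) and X (N, T, K); these array-layout choices are illustrative. Regressing Y on X plus a full set of unit dummies would return the same coefficient vector.

    # A minimal sketch of the within (demeaning) estimator for a balanced panel.
    import numpy as np

    def within_estimator(Y, X):
        N, T, K = X.shape
        # Subtract unit-specific means (the within transformation)
        Y_dm = Y - Y.mean(axis=1, keepdims=True)
        X_dm = X - X.mean(axis=1, keepdims=True)
        # Pool the demeaned data and run OLS; remember to rescale the OLS
        # standard errors by sqrt(T / (T - 1)) as noted above.
        beta = np.linalg.lstsq(X_dm.reshape(N * T, K), Y_dm.reshape(N * T),
                               rcond=None)[0]
        return beta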
4.2 Differencing Estimators

Another class of estimators that can get rid of the c_i terms are differencing estimators. The most common differencing estimator is the first differences estimator. First define:

\Delta Y_{it} = Y_{it} - Y_{it-1}, \quad \Delta X_{it} = X_{it} - X_{it-1}, \quad \Delta\varepsilon_{it} = \varepsilon_{it} - \varepsilon_{it-1}

Then we can run the regression

\Delta Y_{it} = \Delta X_{it}'\beta + \Delta\varepsilon_{it}

using data from time periods 2, ..., T. Note that the c_i terms drop out because they are constant within any cross-sectional unit over time. OLS on this regression will produce consistent estimates, though the standard errors will need to be adjusted for serial correlation (one easy way would be to use the robust clustered variance estimator).
When T = 2, \hat{\beta}_{FE} and \hat{\beta}_\Delta are numerically identical. You will be asked to show this in a problem set exercise.

More generally, we can define a differencing estimator over s periods as:

\Delta_s Y_{it} = Y_{it} - Y_{it-s}, \quad \Delta_s X_{it} = X_{it} - X_{it-s}, \quad \Delta_s\varepsilon_{it} = \varepsilon_{it} - \varepsilon_{it-s}

Then run the regression

\Delta_s Y_{it} = \Delta_s X_{it}'\beta + \Delta_s\varepsilon_{it}

using data from time periods s + 1, ..., T. When we choose s = T − 1 (the largest possible value for s) we refer to the resulting estimator as the long differences estimator (it's the longest differencing estimator we can apply to the data).
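A minimal sketch of the s-period differencing estimator, under the same illustrative array layout as the earlier sketches; s = 1 gives first differences and s = T − 1 gives long differences.

    # A minimal sketch of the s-period differencing estimator.
    import numpy as np

    def difference_estimator(Y, X, s=1):
        N, T, K = X.shape
        dY = Y[:, s:] - Y[:, :-s]              # Y_it - Y_{i,t-s}, t = s+1, ..., T
        dX = X[:, s:, :] - X[:, :-s, :]
        beta = np.linalg.lstsq(dX.reshape(-1, K), dY.reshape(-1), rcond=None)[0]
        return beta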
4.3 Fixed Effects vs. Random Effects

We know that the FE estimator is equivalent to the within estimator; it estimates β using only within-individual variation.[9] This sort of qualifies as a selection on unobservables design in that we think that within-individual variation is a source of "good" variation in X, while between-individual variation is not a source of "good" variation in X. We get rid of the between-individual variation in X by including fixed effects or applying the within transformation.

The complement of the within estimator is, of course, the between estimator, \hat{\beta}_B; it estimates β using only between-individual variation. It can be easily implemented by running the regression:

\bar{Y}_i = \bar{X}_i'\beta + \bar{v}_i

[9] In practice we often include a dummy variable for each time period so that we remove aggregate time trends as well.
It turns out that the RE estimator is a weighted average of the within estimator (i.e., the FE estimator) and the between estimator. First define the within-individual and between-individual sums of squares:

S^w_{xx} = \sum_i \sum_t (X_{it} - \bar{X}_i)(X_{it} - \bar{X}_i)'

S^b_{xx} = \sum_i T(\bar{X}_i - \bar{X})(\bar{X}_i - \bar{X})'

Then the RE estimator equals

\hat{\beta}_{RE} = \hat{F}_w \hat{\beta}_{FE} + (I - \hat{F}_w)\hat{\beta}_B

where

\hat{F}_w = \Big[ S^w_{xx} + \frac{\hat{\sigma}^2_\varepsilon}{\hat{\sigma}^2_\varepsilon + T\hat{\sigma}^2_c} S^b_{xx} \Big]^{-1} S^w_{xx}
If X contains a single regressor and σ²_c = 0, then \hat{F}_w is simply the proportion of total variation in X that is within-individual variation, and each estimator receives a weight equal to its proportion of the total variation in X. If σ²_c > 0, then \hat{F}_w increases, i.e. the within estimator begins to receive more weight at the expense of the between estimator. The intuition here is that, if σ²_c = 0, then \bar{v}_i is an average of several independent variables, so the idiosyncratic shocks will tend to cancel each other out. If σ²_c > 0, however, then \bar{v}_i contains an individual-specific shock (c_i) that affects all observations for individual i. This individual-specific shock will not get cancelled out when taking the mean over different observations for the same individual, so the between-individual data will have more noise than it would if σ²_c = 0. Hence \hat{\beta}_B has a higher variance, and the RE estimator puts less weight on \hat{\beta}_B than it would if σ²_c = 0.

It should now be obvious why random effects is more efficient than fixed effects: RE uses both within-individual and between-individual variation for estimating β while FE uses only within-individual variation for estimating β. If c_i is not correlated with X_i, then there is no reason to throw out all that between-individual variation.[10] On the other hand, it is also clear why RE and OLS are biased if the uncorrelated effects assumption is dropped, while FE remains unbiased. RE and OLS are averages of the within and between estimators. If E[c_i | X_i] ≠ 0, then the between estimator will be biased, and RE and OLS will be biased.

[10] OLS can also be written as a weighted mean of the within and between estimators. In the case of OLS, the formula is identical to the formula for \hat{\beta}_{RE} except that it does not contain any of the σ² terms. This is not surprising: if we assume σ²_c = 0, then the variance-covariance matrix in the RE model looks exactly like the standard OLS covariance matrix. Accordingly, assuming σ²_c = 0 in \hat{F}_w gives us the OLS weights.
4.4 Measurement Error

Measurement error often becomes more problematic when working with panel data. Consider the classical measurement error scenario in which the true variable, x*_i, is measured with error. The observed variable is x_i = x*_i + u_i, where u_i has zero mean and is uncorrelated with x*_i or the error term ε_i. In this model, it is straightforward to show that \hat{\beta}_{OLS} converges to

\beta \frac{\sigma^2_{x^*}}{\sigma^2_{x^*} + \sigma^2_u}.

In other words, the degree of attenuation bias is a function of the signal-to-noise ratio.
Now consider the panel data estimators in the context of a panel with T = 2. In this panel, the FE/within estimator and the first differences estimator are numerically identical (technically the long differences estimator is also identical, since the longest possible lag is one time period). Consider a one-regressor case with classical measurement error.

plim(\hat{\beta}_{FE}) = plim(\hat{\beta}_\Delta) = \frac{Cov(\Delta x, \Delta y)}{Var(\Delta x)} = \frac{Cov(\Delta x^* + \Delta u, \beta\Delta x^* + \Delta\varepsilon)}{Var(\Delta x^* + \Delta u)}

Substituting in \Delta x^* = x^*_{it} - x^*_{it-1} and \Delta u = u_{it} - u_{it-1} and simplifying the expression above gives

plim(\hat{\beta}_\Delta) = \beta \frac{\sigma^2_{x^*}(1 - \rho_x)}{\sigma^2_{x^*}(1 - \rho_x) + \sigma^2_u(1 - \rho_u)}
where

\sigma^2_{x^*} = Var(x^*_{it}) \quad \text{and} \quad \sigma^2_u = Var(u_{it})

\rho_x = \frac{Cov(x^*_{it}, x^*_{it-1})}{Var(x^*_{it})} \quad \text{and} \quad \rho_u = \frac{Cov(u_{it}, u_{it-1})}{Var(u_{it})}

Suppose that we estimate β using only cross-sectional variation; in particular, suppose that we use the between estimator, \hat{\beta}_B. In that case, doing a similar derivation to the one above, we will find that:

plim(\hat{\beta}_B) = \beta \frac{\sigma^2_{x^*}(1 + \rho_x)}{\sigma^2_{x^*}(1 + \rho_x) + \sigma^2_u(1 + \rho_u)}
When is the attenuation bias from using the within-individual variation in the panel data (i.e., the FE/first differences estimator) more severe than the attenuation bias from using the cross-sectional variation in the panel data (i.e., the between estimator)? In other words, when is

\frac{\sigma^2_{x^*}(1 + \rho_x)}{\sigma^2_{x^*}(1 + \rho_x) + \sigma^2_u(1 + \rho_u)} > \frac{\sigma^2_{x^*}(1 - \rho_x)}{\sigma^2_{x^*}(1 - \rho_x) + \sigma^2_u(1 - \rho_u)} ?

It's possible to solve this by cranking through a ton of algebra, but it's easier to simply note that when \rho_x = \rho_u, the two sides of the inequality are equal. But increasing \rho_x unambiguously increases the left side (the top rises faster than the bottom) and decreases the right side (the top falls faster than the bottom). And decreasing \rho_u unambiguously increases the left side and decreases the right side. So the inequality above is satisfied if and only if:

\rho_x > \rho_u

Thus we see that attenuation bias gets worse from using FE/first differences (in comparison to cross-sectional variation) when the inter-period correlation in x^*_{it} is greater than the inter-period correlation in u_{it}. This condition is likely to hold if u_{it} truly is random noise.
Intuitively, if x^*_{it} is highly correlated across time periods within a given individual, while u_{it} is relatively uncorrelated across time periods, then removing the individual-specific mean \bar{x}^*_i removes a lot of variation from the signal, but removing \bar{u}_i does not remove much noise. Hence the signal-to-noise ratio often gets worse when using FE or differencing estimators (relative to RE or OLS, both of which leverage some degree of between-individual variation).
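A quick simulation makes the comparison concrete. The parameter values below are illustrative assumptions (σ²_{x*} = σ²_u = 1, ρ_x = 0.9, ρ_u = 0, β = 1); plugging them into the formulas above predicts a first-differences plim of roughly 0.09 and a between plim of roughly 0.66.

    # A quick check of the attenuation comparison above (T = 2 panel).
    import numpy as np

    rng = np.random.default_rng(0)
    N, beta, rho_x = 500_000, 1.0, 0.9

    c = rng.normal(size=N)                                        # individual effect
    x1 = rng.normal(size=N)                                       # true x*, period 1
    x2 = rho_x * x1 + np.sqrt(1 - rho_x**2) * rng.normal(size=N)  # true x*, period 2
    u1, u2 = rng.normal(size=(2, N))                              # classical ME, rho_u = 0
    y1 = c + beta * x1 + rng.normal(size=N)
    y2 = c + beta * x2 + rng.normal(size=N)
    xo1, xo2 = x1 + u1, x2 + u2                                   # observed regressors

    def slope(x, y):
        C = np.cov(x, y)
        return C[0, 1] / C[0, 0]

    b_fd = slope(xo2 - xo1, y2 - y1)               # first differences (= FE when T = 2)
    b_bt = slope((xo1 + xo2) / 2, (y1 + y2) / 2)   # between estimator

    print(f"first differences: {b_fd:.3f}   between: {b_bt:.3f}")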
When T > 2, FE and differencing estimators continue to have problems with attenuation bias, but the degree of bias now differs because the two estimators are no longer numerically identical. In general, if you are willing to assume that the correlation between x^*_{it} and x^*_{is} is higher when s and t are closer, then first differences should have more attenuation bias than fixed effects, and long differences should have less attenuation bias than fixed effects. A sensible strategy for exploring whether you have measurement error issues might then be:

(1) First estimate \hat{\beta}_{RE} and \hat{\beta}_{FE}. If the two are very close, then stop (you don't appear to have any issues).

(2) If |\hat{\beta}_{FE}| < |\hat{\beta}_{RE}| (which is often the case), estimate β using first differences (\hat{\beta}_{FD}) and long differences (\hat{\beta}_{LD}).

(3) If |\hat{\beta}_{FD}| < |\hat{\beta}_{FE}| < |\hat{\beta}_{LD}|, then you may have a measurement error problem (in which case you would probably prefer \hat{\beta}_{LD}). If FE, first differences, and long differences are all similar, it is more likely that FE is less than RE because the uncorrelated effects assumption on c_i is false.

Of course, there is no guarantee that |\hat{\beta}_{FD}| < |\hat{\beta}_{FE}| < |\hat{\beta}_{LD}| implies a measurement error problem. You could alternatively argue that the long differences estimator is more vulnerable to omitted variables bias from individual-specific trends than the FE and first differences estimators.
4.5 Application: Deschenes and Greenstone (2007)

Deschenes and Greenstone (2007) is an application that highlights both the potential advantages and weaknesses of the fixed effects approach. They use panel data on agricultural profits from 1978 to 2002 (collected at 5 year intervals) to estimate the effects of climate change on agriculture. Early work on this question examined the cross-sectional relationship between climate (e.g., temperature and rainfall) and the value of agricultural land. These estimates of the effect of climate on agricultural profits are valid (and, in fact, clearly superior to fixed effects estimates) if there is no correlation between average climate across different counties and unobserved characteristics that may affect agricultural profits. For example, states in the Midwest have different climates than states in the Northeast, and both have different climates than California. Cross-sectional estimates will be valid if we believe that other factors that affect agricultural profits and differ across states are uncorrelated with the average climate of a state.
We may not want to make the assumption that cross-sectional variation in average climate is uncorrelated with other factors affecting agricultural profits. Deschenes and Greenstone apply a fixed effects estimator to their panel data to remove the cross-sectional variation. Specifically, they include county fixed effects in their model, so that all of the variation occurs at the within-county level over time.[11] Since weather is pretty close to as good as randomly assigned on a year-to-year basis, this fixed effects model is likely to recover the causal effect of short-term climate fluctuations on agricultural profits.

[11] They also include state-by-year fixed effects to control for aggregate yearly shocks by state. Their basic concern is that there could be long term time trends in weather at the region or state level; including the state-by-year effects will flexibly control for these trends.
Running OLS on the pooled county-level data with a variety of controls, Deschenes and Greenstone estimate that a 5 degree (F) increase in temperatures and an 8 percent increase in precipitation could reduce agricultural land values by $75.1 billion (t = 2.7). When they include soil and socioeconomic covariates, however, the estimate changes to an increase of $0.7 billion (t = 0.0), and when they include state fixed effects as well the estimate changes to an increase of $110.8 billion (t = 4.7).[12] OLS estimates therefore vary widely depending on the inclusion or exclusion of certain covariates.

[12] Note that even with state fixed effects, there is still some cross-sectional variation because the cross-sectional unit is the county.

Running models that include county and state-by-year fixed effects, Deschenes and Greenstone estimate that a 5 degree (F) increase in temperatures and an 8 percent increase in precipitation could increase agricultural profits by $0.7 billion (t = 1.7).[13] Controlling for various covariates or experimenting with different specifications has minimal impact in this model; the estimated effect always remains close to zero and statistically insignificant. Note that the dependent variable has now shifted from land values to profits, since random year-to-year fluctuations in weather should have minimal effect on long term land values. Translating into land values using a 5 percent discount rate, the 95% confidence interval ranges from -$2 billion to $30 billion. This is much tighter than the range reported above for the cross-sectional models (which doesn't even account for sampling variation), and it rules out strong negative effects. In this sense, moving to the fixed effects specification seems to give more stable estimates that are a priori more credibly identified.

[13] Deschenes and Greenstone's preferred estimate uses climate predictions from the Hadley 2 climate change model rather than the benchmark 5 degrees/8 percent scenario. Using the Hadley model, the estimated effect changes to $1.3 billion.
The downside to the fixed effects model, however, is that it only estimates the causal effect of short-term climate fluctuations on agricultural profits. There are good reasons to believe that the effects of a long-term increase in temperatures would be markedly different than the effects of a short-term increase. On the one hand, farmers might engage in adaptations in the long run that are unprofitable in the short run. This could reduce any negative effects that we observe in the short term. On the other hand, some short-run adaptations, e.g. drawing down reservoirs or storing crops, are not feasible in the long run. Furthermore, factors such as changes in the distribution of pests can take years to materialize. It is thus possible that the effects of long-term climate change are substantially worse than the effects of short-term climate fluctuations.[14] In many applications, including this one, it can therefore be problematic to extrapolate too far off of estimates based on short-term, year-to-year variation.

[14] For additional discussion of the limitations, see Fisher, Hanemann, Roberts, and Schlenker (2007). In particular, there appears to be significant measurement error in Deschenes and Greenstone's coding of weather. This may cause attenuation bias in some of the specifications.
4.6 Fixed Effects in Other Contexts

Although the focus of these notes is on panel data, the fixed effects estimator can be useful in other contexts as well. For example, Currie and Thomas (1995) use a fixed effects estimator to determine the effect of the Head Start program (an early intervention program available to children in poorer families) on test scores and health outcomes. Because children in poor families are more likely to enroll in the program, Currie and Thomas include mother fixed effects to control for any family-specific characteristics that affect all children similarly. The FE estimates thus use within-family variation, comparing outcomes for siblings that attended Head Start to outcomes for siblings that did not attend Head Start, to estimate the effects of Head Start.

Table 1 presents estimation results for both whites and African Americans. The first set of rows presents the coefficient from a bivariate regression of Peabody Picture Vocabulary Test (PPVT) scores on a Head Start indicator. Head Start appears to have a negative effect on whites (almost surely due to selection) and no effect on blacks. The second set of rows presents the coefficient from a regression of PPVT scores on a Head Start indicator, controlling for other covariates such as household income, mother's education, etc. The coefficient for whites changes to imply no effect of Head Start, while the coefficient for African Americans remains substantially unchanged. The third set of rows presents the coefficient from a regression of PPVT scores on a Head Start indicator, including mother fixed effects and some child-specific covariates (including household income at the time the child was three years old). Now the coefficient for whites is finally positive and significant, though for African Americans there is still no effect. Applying the fixed effects model has a substantial effect on the (white) estimates, even in comparison to an OLS regression with a rich set of covariates.
Table 1: Effects of Head Start on PPVT Scores

                   Whites    Blacks
OLS Unadjusted      -5.62      1.04
                   (1.57)    (1.22)
OLS Adjusted        -0.38      0.74
                   (1.45)    (1.14)
Fixed Effects        5.88      0.25
                   (1.52)    (1.36)

Source: Currie and Thomas (1995). Parentheses contain standard errors. Data are from NLSY.
5 Additional References

Fisher, A., M. Hanemann, M. Roberts, and W. Schlenker (2010). "The Economic Impacts of Climate Change: Evidence from Agricultural Output and Random Fluctuations in Weather: Comment." Forthcoming, American Economic Review.
M. Anderson, Lecture Notes 10, ARE 213 Fall 2012

ARE 213 Applied Econometrics
UC Berkeley Department of Agricultural and Resource Economics
Selection On Unobservables Designs:
Part 2, Differences-in-Differences and Case Studies with Synthetic Controls
The most common research design for policy analysis with panel data is the differences-in-differences model. In its simplest incarnation, the diffs-in-diffs model entails identifying two cross-sectional units (states, cities, countries, etc.), one of which was exposed to a policy change (or some other treatment) and the other of which was not. With longitudinal data, we collect information on the two units both before the policy change and after the policy change. To estimate the effect of the policy on a given outcome, we simply compare the change in the outcome for the treated unit to the change in the outcome for the control unit.

1 Differences-in-Differences
Suppose that we observe two states, s = 0 and s = 1, one of which is affected by a policy change and the other of which was not. Further suppose that we observe these states for two time periods, t = 0 (pre-policy change) and t = 1 (post-policy change). Formally, for some outcome Y_ist that we observe at the individual level, the differences-in-differences estimator is

(\bar{Y}_{11} - \bar{Y}_{10}) - (\bar{Y}_{01} - \bar{Y}_{00})

where \bar{Y}_{st} = \frac{1}{N_{st}} \sum_i Y_{ist}. To examine the strengths and weaknesses of this estimator, write Y_{ist} = \bar{Y} + \delta D_{st} + \gamma_s + \lambda_t + \varepsilon_{st} + u_{ist}. Note that the inclusion of ε_st guarantees that \bar{u}_{st} = 0.

(\bar{Y}_{11} - \bar{Y}_{10}) - (\bar{Y}_{01} - \bar{Y}_{00})
= [(\bar{Y} + \delta + \gamma_1 + \lambda_1 + \varepsilon_{11}) - (\bar{Y} + \gamma_1 + \lambda_0 + \varepsilon_{10})] - [(\bar{Y} + \gamma_0 + \lambda_1 + \varepsilon_{01}) - (\bar{Y} + \gamma_0 + \lambda_0 + \varepsilon_{00})]
= (\delta + \lambda_1 - \lambda_0 + \varepsilon_{11} - \varepsilon_{10}) - (\lambda_1 - \lambda_0 + \varepsilon_{01} - \varepsilon_{00})
= \delta + (\varepsilon_{11} - \varepsilon_{10}) - (\varepsilon_{01} - \varepsilon_{00})
The key assumption for identifying δ will therefore be E[ε_11 − ε_10] = E[ε_01 − ε_00]. In other words, the outcomes for the two states must have similar trajectories over the two time periods absent any treatment effect. Any factor that is specific to state s but does not change over time, or changes over time but changes in equal amount for both states, is netted out in the diffs-in-diffs estimator.

It is important to note, however, that the condition above only guarantees that we will identify δ in expectation. Because we only observe a single observation for each of the ε_st terms in the expression above, there is no guarantee that the noise from ε_st will not swamp our estimate of the treatment effect; we cannot appeal to the law of large numbers as we do when we have N independent observations. This is an issue that we will return to shortly.
The diffs-in-diffs estimator can also be easily implemented within a regression framework. Consider running the regression:

Y_{ist} = \alpha + \delta D_{st} + \gamma 1(s = 1) + \lambda 1(t = 1) + \varepsilon_{st} + u_{ist}

In other words, simply regress Y on a treatment indicator, a state dummy, and a time dummy. The state dummy controls for between-state differences in Y that are constant over time, and the time dummy controls for between-time period differences in Y that are identical across states. Identification of δ again comes from the assumption that ε_st is uncorrelated with the treatment indicator (which is equal to the interaction between the state dummy and the time dummy) conditional on the state dummy and the time dummy. Note that in the regression format, it is easy to control for individual-level covariates. You can also see the standard errors issue in this framework. If we use the typical OLS standard errors that assume independence across all observations, we are effectively claiming that the only error in our estimator is sampling error that arises because we do not observe the entire population of each state. However, if σ²_ε ≠ 0, i.e. there are state-specific shocks that vary over time, then this independence assumption is violated, and our standard errors will be wrong.
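A minimal sketch of this regression version in Python (statsmodels) follows, using simulated individual-level data; the true effect of 1.0 and the other parameter values are illustrative assumptions.

    # A minimal sketch of the two-state, two-period diffs-in-diffs regression.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 4000
    s = rng.integers(0, 2, n)            # state: 0 = control, 1 = treated
    t = rng.integers(0, 2, n)            # period: 0 = pre, 1 = post
    D = s * t                            # treated state in the post period
    y = 1.0 * D + 0.5 * s + 0.3 * t + rng.normal(size=n)

    df = pd.DataFrame({"y": y, "s": s, "t": t, "D": D})

    # The coefficient on D is the diffs-in-diffs estimate.
    did = smf.ols("y ~ D + s + t", data=df).fit()
    print(did.params["D"])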
1.1 Triple Differences

A diffs-in-diffs research design can sometimes be made more compelling by adding another layer of differencing to the estimator, resulting in a triple-diffs estimator. For example, consider a policy change in state 1 in time period 1 that only affects persons 65 years and older. In that case, we might use individuals aged 55-64 as an additional control group. In practice, we would implement this with a triple differences estimator. Let \bar{Y}_{sta} be defined as above, but with a = 0 signifying persons of age 55-64 and a = 1 signifying persons of age 65 and older. Then the triple differences estimator is:

[(\bar{Y}_{111} - \bar{Y}_{110}) - (\bar{Y}_{101} - \bar{Y}_{100})] - [(\bar{Y}_{011} - \bar{Y}_{010}) - (\bar{Y}_{001} - \bar{Y}_{000})]

In other words, we compare the evolution of the gap between 65+ year olds and 55-64 year olds in the treated state to the evolution of the gap between 65+ year olds and 55-64 year olds in the control state. The advantage of this triple-diffs structure is that it allows us to relax our assumptions on ε_st. We no longer need to assume that outcomes for both states would evolve similarly in expectation; we now need only assume that, to the extent that outcomes evolve differently in state s = 1 than state s = 0, the differences affect age groups a = 1 and a = 0 similarly.
We can easily implement this triple-diffs estimator within the regression framework. The key is to put in an indicator for every main effect or interaction up to, but not including, the level at which the treatment varies. Thus we include main effects for age, state, and time, as well as all possible two-way interactions between each of those indicators. The regression looks like:

Y_{ista} = \alpha + \delta D_{sta} + \beta_1 1(s = 1) + \beta_2 1(t = 1) + \beta_3 1(a = 1) + \beta_4 1(s = 1)1(t = 1) + \beta_5 1(s = 1)1(a = 1) + \beta_6 1(t = 1)1(a = 1) + \varepsilon_{sta} + u_{ista}
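A minimal sketch of this triple-differences regression, again on simulated data with illustrative parameter values; the formula term "s*t + s*a + t*a" expands to exactly the main effects and two-way interactions described above, with D as the three-way interaction treatment.

    # A minimal sketch of the triple-differences regression.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 8000
    s = rng.integers(0, 2, n)            # state
    t = rng.integers(0, 2, n)            # period
    a = rng.integers(0, 2, n)            # age group (1 = 65+)
    D = s * t * a                        # old individuals, treated state, post-change
    y = 1.0 * D + 0.4 * s + 0.3 * t + 0.2 * a + 0.1 * s * t + rng.normal(size=n)

    df = pd.DataFrame({"y": y, "s": s, "t": t, "a": a, "D": D})

    # Main effects plus all two-way interactions; the coefficient on D is the
    # triple-differences estimate.
    dd = smf.ols("y ~ D + s*t + s*a + t*a", data=df).fit()
    print(dd.params["D"])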
1.2 Applications: Card (1990), Card & Krueger (1994), Kellogg & Wolff (2008)

Two canonical examples of diffs-in-diffs papers are Card's (1990) study of the Mariel Boatlift and Card and Krueger's (1994) study of the minimum wage increase in New Jersey.[1] The Mariel Boatlift occurred from May to September of 1980 when Cuba allowed any citizen wishing to emigrate to the United States free passage from the port of Mariel. Approximately 125,000 Cuban immigrants arrived in Miami during this time period, increasing the local labor force by about 7%.

[1] The term "differences-in-differences" is thrown around often, but to my knowledge there is no formal definition for what classifies as a diffs-in-diffs paper. Arguably many panel data papers that control for both individual-specific effects and aggregate time effects are using some form of double differencing estimator, but that doesn't mean that we'd necessarily refer to them as diffs-in-diffs papers. In my mind, a diffs-in-diffs paper generally uses some sort of variation in the treatment that occurs at an aggregated level, e.g. the city or state level. We therefore tend not to worry so much about individuals selecting into the treatment (it's unlikely that most will move in response to just one shock), but rather we worry that the treatment was implemented in one area rather than another for some non-random reason (e.g., legislative endogeneity).

Card examines wage and employment outcomes for various groups of natives, particularly blacks and lower-skilled workers; the latter group is more likely to be in direct competition with the newly arrived immigrants (who were relatively low-skilled). He compares the evolution of these outcomes over the 1979 to 1981 period in Miami to their evolution in four comparison cities: Atlanta, Los Angeles, Houston, and Tampa-St. Petersburg. For blacks, the difference in log wages between Miami and comparison cities changes from -0.15 in 1979 to -0.11 in 1981, so the diffs-in-diffs estimate for log wages is 0.04 (with a standard error of about the same size). The difference in the employment-to-population ratio between Miami and comparison cities changes from 0.00 in 1979 to 0.02 in 1981, so the diffs-in-diffs estimate for employment is 0.02. Estimates for unemployment rates and low-skilled blacks show similar patterns. Overall, there is no evidence that immigrants harm natives' labor market outcomes.[2]

[2] Interestingly, for Americans of Cuban descent, Card finds evidence of some increase in unemployment rates, but no effect on wages.
Card and Krueger (1994) study the impact of a 19% increase in the New Jersey minimum wage in 1992. They survey fast-food restaurants before and after the change on both sides of the New Jersey-Pennsylvania border. Examining average employment per store in the diffs-in-diffs framework they find:

(Emp_{NJ,1} - Emp_{NJ,0}) - (Emp_{PA,1} - Emp_{PA,0}) = (21.03 - 20.44) - (21.17 - 23.33) = 2.75

Thus we see that, if anything, the minimum wage increase appears to have raised employment (though the increase is not statistically significant). Because the findings run counter to the predictions of economic theory, they have been heavily scrutinized and criticized, though they survive a battery of robustness checks and have been replicated in other settings. One arguably valid criticism, however, concerns the standard errors. As we saw earlier, if the ε_st terms have non-zero variance, then we cannot accurately compute the standard errors because we do not have enough observations to compute the variance for the state-by-time level shocks. In essence, despite the fact that the Card (1990) and Card and Krueger (1994) studies contain hundreds or even thousands of observations, both are essentially case studies in that the treatments vary only at the city or state level and the studies contain only a few cities or states.[3] We will return to this issue shortly.

[3] This fact is not lost on the authors. The title of Card and Krueger (1994), for example, reads, "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania."
Kellogg and Wolff (2008) provide a nice example of a triple differences research design. Their interest is in estimating the effect of Daylight Savings Time (DST) on electricity usage. DST may reduce energy usage because, for example, it aligns the hours at which people are awake with the hours at which the sun is up, thus reducing lighting needs. On the other hand, it may increase energy usage because people wake up when the sun rises (as opposed to after it has risen) and need to heat their homes during this time.
Kellogg and Wolff leverage an extension to DST in Australia that was put in place for the Summer 2000 Olympics. Some Australian states, including New South Wales (where the Olympics were held) and Victoria, extended DST beyond the date at which it normally terminates. Other states, including South Australia, did not. They compare the change in electricity usage for Victoria (the treated state) to the change in electricity usage for South Australia (the control state). They are concerned, however, that electricity usage might be trending differently in these two states for reasons unrelated to the DST extension. To address this concern, they observe that DST should not affect electricity usage during the middle of the day, when the sun is always in the sky regardless of whether you are on DST or standard time. The midday hours thus provide an extra control group that should be unaffected by DST. This allows them to implement a triple differences estimator. Specifically, they define the treated portion of the day as the hours from 0:00 to 12:00 and 14:30 to 24:00. They define the control portion of the day as the hours from 12:00 to 14:30. Using a simple differences-in-differences estimator with electricity usage during the treated portion of the day as the outcome, they find that electricity usage fell by 0.4% in Victoria as compared to South Australia. However, they also find that electricity usage during the control portion of the day fell by 0.2% in Victoria as compared to South Australia. The triple differences estimator takes the difference between these two double differences estimators; thus their final estimate is that DST reduced electricity usage by 0.2% (with a standard error of 1.5%). The identifying assumption here is that, if Victoria and South Australia are trending differently from each other, these differential trends still have the same proportional effect on electricity usage from 12:00-14:30 and electricity usage from 0:00-12:00/14:30-24:00.
To increase the precision of their estimates, they also implement the triple differences
estimator in a regression framework. The regression framework allows them to control for
other determinants of electricity usage (e.g., day of week, weather, etc.). This reduces the
unexplained variation in the outcome and thus reduces their standard errors. An observation
in this regression is the half-hour-by-day-by-state. They regress electricity usage on the
treatment variable (one if DST is in effect and it is before 12:00 or after 14:30, zero otherwise)
and day-by-state indicators (which basically correspond to the state-by-time interactions
in Section 1.1), hour-by-state indicators (which basically correspond to the state-by-age
interactions in Section 1.1), hour-by-year indicators (which basically correspond to the time-
by-age interactions in Section 1.1), and other control variables.
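To make that specification concrete, here is a rough sketch of a triple-differences regression on simulated half-hour-by-day-by-state data. It is not the authors' code or data, the magnitudes are invented, and (with only one simulated year) the hour-by-year interactions are omitted.

```python
# A rough sketch (simulated data, not Kellogg and Wolff's) of a triple-differences
# regression: usage on a DST-treatment indicator plus day-by-state and hour-by-state
# dummies. Hour-by-year dummies are omitted because only one year is simulated here.
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
rows = []
for state in ["VIC", "SA"]:
    for day in range(40):                              # 40 days; the extension starts on day 20
        for half_hour in range(48):                    # 48 half-hour blocks per day
            treated_hours = (half_hour < 24) or (half_hour >= 29)   # 0:00-12:00 or 14:30-24:00
            extension = (state == "VIC") and (day >= 20)            # DST extension in Victoria only
            treat = float(treated_hours and extension)
            usage = (100 + 3 * (state == "VIC") + 0.05 * half_hour
                     + 0.5 * (day >= 20) - 0.2 * treat + rng.normal(0, 0.5))
            rows.append((state, day, half_hour, treat, usage))
df = pd.DataFrame(rows, columns=["state", "day", "half_hour", "treat", "usage"])

# Day-by-state and hour-by-state fixed effects as dummy columns.
fe = pd.concat([
    pd.get_dummies(df["state"] + "_day" + df["day"].astype(str)),
    pd.get_dummies(df["state"] + "_hh" + df["half_hour"].astype(str)),
], axis=1)
X = np.column_stack([df["treat"].to_numpy(), fe.to_numpy(dtype=float)])
coef = np.linalg.lstsq(X, df["usage"].to_numpy(), rcond=None)[0]
print(coef[0])   # the triple-differences estimate; close to the simulated effect of -0.2
```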
In the regression framework, they find that DST increases energy usage by 0.02% (if they impose a homogeneous effect across all treated hours) or 0.09% (if they allow for heterogeneous effects of DST across different times of day). The standard error drops to 0.4%, so they are able to rule out substantial electricity savings from DST; savings of 0.5% or higher, for example, are unlikely.
2 Case Studies with Synthetic Controls
The Card (1990) and Card and Krueger (1994) papers are two examples of thoughtful diffs-in-diffs studies, but they also highlight two central issues that affect much of the diffs-in-diffs literature. First, in many cases there are multiple control units available for the researcher to choose from. For example, the choice of Eastern Pennsylvania as a control for Western New Jersey in the Card and Krueger (1994) study seems fairly obvious due to the direct geographic proximity of the two regions. The choice of Atlanta, Houston, Los Angeles, and Tampa as control cities for Miami in the Card (1990) study, however, seems somewhat more arbitrary. Though Card chose these cities because they had relatively large populations of blacks and Hispanics and because they exhibited a pattern of economic growth similar to that in Miami over the late 1970s and early 1980s, a different researcher might have chosen an entirely different set of control cities using a different (but still reasonable) algorithm. Second, as we have noted for both studies, the reported standard errors are not necessarily robust to the possibility of state-by-time (or city-by-time) specific shocks. For instance, perhaps New Jersey simply experienced some positive economic shock in late-1992 (the post-treatment
period); with only two observations on New Jersey, it's impossible for us to even estimate what the variance of New Jersey's statewide economic shocks might be.
Abadie, et al. (2010) present an estimation strategy, synthetic controls, that addresses both of these issues. At its heart, the synthetic controls strategy is basically a combination of diffs-in-diffs and matching: control units are chosen based on how closely they resemble the treated unit in the pre-treatment periods. Furthermore, the strategy turns the large number of available control units into an advantage when estimating standard errors, because it's possible to study the variance of our estimator by constructing placebo estimates for units that were never treated.
Suppose that we observe a single treated unit and J control units over T time periods. Assume the policy intervention occurs at the end of period T_0, so that periods 1, ..., T_0 are pre-intervention, and periods T_0 + 1, ..., T are post-intervention. Let Y_jt represent the outcome of interest for unit j in period t, let D_jt represent the treatment, and let X_jt represent a set of K observed covariates. The question at hand is how to construct a synthetic control group for the treated unit out of the J potential control units.
Let Z_1 = [X̄_1′, Y_{1,1}, Y_{1,T_0/2}, Y_{1,T_0}]′ be a (K + 3) × 1 column vector of covariates and pre-intervention outcomes for the treated unit, where X̄_1 contains the K observed covariates, each averaged over the pre-intervention time period. Let Z_0 be a (K + 3) × J matrix that contains the same averaged covariates and pre-intervention outcomes for all J potential control units (each column corresponds to one of the control units). Our goal is to choose W*, a J × 1 column vector of weights. We will use these weights to combine all J control units into a single synthetic control unit against which to compare the treated unit. You could call Z_1 and Z_0 the matching matrices in the sense that they contain the variables that we are going to use to try to find the combination of control units that best matches the treated unit.
Footnote 4: Technically, some of the reanalyses of the LaLonde data are also combining diffs-in-diffs and matching. However, as I argued in an earlier footnote, I wouldn't necessarily consider the LaLonde paper to represent an archetype of a diffs-in-diffs paper, because it involves selection into treatment at the individual level rather than selection on the aggregate level.
Choose W* such that it minimizes the distance between Z_1 and Z_0 W, subject to w*_j ≥ 0 and Σ_j w*_j = 1, where distance is defined as √((Z_1 − Z_0 W)′ V (Z_1 − Z_0 W)) for some symmetric, positive semidefinite (K + 3) × (K + 3) matrix V. In other words, choose the weights to minimize the distance between the treated unit's covariates (including pre-intervention outcomes) and the synthetic control unit's covariates (including pre-intervention outcomes).

Footnote 5: The weights need to be nonnegative and sum to one because, if they were not, it would be possible to perfectly fit Z_1 and Z_0 W whenever there were more potential control units than elements in Z_1. This also prevents extrapolation outside of the support (i.e., the convex hull) of Z_0.
In practice, there are several likely candidates for the matrix V:
(1) Set V equal to the inverse of a diagonal matrix in which the diagonal element in row
i equals the variance of the covariate/pre-intervention outcome in row i of the Z matrices.
This will minimize the normalized Euclidean distance.
(2) Set V equal to the inverse of the variance-covariance matrix of the covariates/pre-
intervention outcomes in Z. This will minimize Mahalanobis distance.
(3) Set V to minimize the mean squared prediction error of the outcome variable during the pre-intervention period. Formally, let Y^p_1 be a T_0 × 1 column vector of pre-intervention outcomes for the treated unit, and let Ŷ^p_0 = Y^p_0 W(V) be a T_0 × 1 vector of pre-intervention outcomes for the synthetic control unit. Note that Ŷ^p_0 is a function of W, which is itself a function of V. We will choose V such that the choice of V minimizes (Y^p_1 − Ŷ^p_0)′(Y^p_1 − Ŷ^p_0). In other words, we choose V such that the resulting synthetic control provides the best fit for the pre-intervention outcome trajectory in the treated unit. In practice this means that we pick a value of V, solve for the W implied by that V, and see how closely the synthetic control unit produced by that W matches the pre-intervention outcomes of the treated unit. We keep doing this until we find the V that provides the best fit of the pre-intervention outcomes.
The last algorithm is not trivial to implement; fortunately, Abadie, et al. provide code for implementing the synthetic control estimator in Stata (sort of), Matlab, and R on the web.

Footnote 6: This procedure seems somewhat convoluted in the sense that we could simply choose W to minimize (Y^p_1 − Y^p_0 W)′(Y^p_1 − Y^p_0 W) to begin with. My guess is that the reason we don't do this is because Y_jt always contains some error, so leveraging the predictive power of the covariates can potentially give us a better out-of-sample forecast than just matching on pre-intervention outcomes and throwing away all the covariates.
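To make the weight-selection step concrete, the sketch below (my own, with simulated inputs rather than the Abadie, et al. code or data) solves the inner problem: given a candidate V, it chooses nonnegative weights that sum to one so as to minimize the V-weighted distance between Z_1 and Z_0 W.

```python
# A minimal sketch of choosing W given V: simulated Z_1 and Z_0, inverse-variance V
# (candidate (1) above), and a constrained minimizer for the weights.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
K_plus_3, J = 7, 10                       # rows of Z (covariates + pre-period outcomes), control units
Z0 = rng.normal(size=(K_plus_3, J))       # control units' matching variables (columns = units)
Z1 = Z0[:, :3] @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 0.05, K_plus_3)   # treated unit
V = np.diag(1.0 / Z0.var(axis=1))         # candidate (1): inverse-variance diagonal V

def sq_distance(w):
    # Squared V-weighted distance; minimizing it is equivalent to minimizing its square root.
    gap = Z1 - Z0 @ w
    return gap @ V @ gap

constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]   # weights sum to one
bounds = [(0.0, 1.0)] * J                                        # weights are nonnegative
w0 = np.full(J, 1.0 / J)
res = minimize(sq_distance, w0, method="SLSQP", bounds=bounds, constraints=constraints)
print(np.round(res.x, 3))   # the weights defining the synthetic control unit
```

Nesting this step inside a search over V (candidate (3) above) is what makes the full algorithm nontrivial.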
Once you have estimated W and constructed the synthetic control unit Y_0 W, you could estimate a diffs-in-diffs model in which you compare the treated unit to the synthetic control unit. In practice, however, it is generally more informative to simply graph the outcomes for
the treated unit against those for the synthetic control unit over all time periods, 1, ..., T.
Abadie, et al. present a case study of California's Prop 99, passed at the end of 1988, which increased the cigarette tax and implemented other anti-tobacco measures. They apply the synthetic control group estimator to a data set that contains 30 years of data (18 years pre-intervention, 12 years post-intervention) and 50 states. After discarding from the control state donor pool other states that implemented tobacco control programs or raised their cigarette taxes by more than 50 cents, they retain 38 states. Estimating a synthetic control unit for California using 3 years of pre-intervention data (1975, 1980, and 1988) and a variety of covariates (e.g., retail cigarette price, per capita income, per capita beer consumption), Abadie, et al. find positive weights for five states: Colorado, Connecticut, Montana, Nevada, and Utah. All other states receive a weight of zero.

Footnote 7: This may seem surprising until you consider that weights are always constrained to be zero or greater.
Figure 1 shows the evolution of cigarette sales in California versus Synthetic California.
There is a clear break at the passage of Prop 99, but this break is somewhat deceptive.
We constructed the synthetic control unit so that it tracked the treated unit closely in the
pre-intervention period, so of course the two units will diverge more in the post-intervention period than they do in the pre-intervention period, even if the treatment has no effect. Fortunately, the plethora of untreated control units gives us a sample with which to conduct statistical inference and determine whether the post-intervention divergence is significant or not.
Abadie, et al. suggest a variant on the exact permutation test (which we will discuss
in greater detail when we talk about standard errors). Suppose that we randomly chose to
implement Prop 99 in California. In this scenario, there is no bias in our estimate of the
effect of Prop 99; our only question is whether the divergence we observe between California and Synthetic California represents a real treatment effect or is simply due to chance. The
key is that the variation in our results can be conceptualized as arising from the variation
in the treatment assignment. In our world, we happened to assign Prop 99 to California,
but in alternative worlds we could have assigned it to Delaware or Texas or Wyoming. If we
assume that Prop 99 had no effect (the null hypothesis), we can pretend to assign it to these
other states that were, in actuality, untreated. In doing so, we can map out the distribution
of the estimator under the null hypothesis. This is exactly what Abadie, et al. do in Figure 2.

[Figure 1: Per-Capita Cigarette Sales in CA vs. Synthetic CA (trends in per-capita cigarette sales, in packs, for California and synthetic California, 1970-2000, with the passage of Proposition 99 marked). Source: Abadie, et al. (2010)]

Figure 2 graphs the difference between the treated state and its synthetic control for
all 38 states in the study. The black line represents California (the actual treated state),
while the gray lines represent all of the control states. These states received a placebo
treatment in the sense that we pretended that they were treated, and estimated the synthetic
control estimator as if they were, but in actuality they received no treatment. The fact that
the California line is at or near the bottom during the post-intervention period strongly
suggests that these results are not simply due to chance. If we had randomly picked an
untreated state and implemented the same procedure, it is unlikely that we would have
found a post-intervention deviation of this magnitude.
[Figure 2: Per-Capita Cigarette Sales Gap (Actual − Synthetic Control): gaps in per-capita cigarette sales, in packs, for California and placebo gaps for all 38 control states, 1970-2000, with the passage of Proposition 99 marked. Source: Abadie, et al. (2010)]

To conduct a formal test, Abadie, et al. compute the mean squared prediction error (i.e., the mean squared gap between the treated state and its synthetic control) for the pre-
and post-intervention periods for all 38 states plus California. They find that the ratio of the post-intervention mean squared prediction error to the pre-intervention mean squared prediction error is higher for California than for any other state. Thus they reject the null hypothesis of no treatment effect at p = 0.026.
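The sketch below (my own, not the authors' code) shows the mechanics of this placebo test: for each unit, form the ratio of post- to pre-intervention mean squared prediction error from its actual-minus-synthetic gap series, then ask how extreme the treated unit's ratio is among all units.

```python
# A minimal sketch of the placebo-based inference described above. The gap series for each
# unit (actual minus synthetic outcome, one row per unit) would come from re-running the
# synthetic control estimator for every state; here only the test itself is defined.
import numpy as np

def mspe_ratio(gap, T0):
    """Post-intervention MSPE divided by pre-intervention MSPE for one unit's gap series."""
    return np.mean(gap[T0:] ** 2) / np.mean(gap[:T0] ** 2)

def permutation_pvalue(gaps, treated_row, T0):
    """Share of units (treated unit included) with an MSPE ratio at least as large."""
    ratios = np.array([mspe_ratio(g, T0) for g in gaps])
    return np.mean(ratios >= ratios[treated_row])

# With 38 placebo states plus California, and California's ratio ranked highest,
# the permutation p-value is 1/39, which matches the 0.026 reported above.
print(1 / 39)
```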
To recap, the synthetic control estimator has two nice properties that can augment many
diffs-in-diffs designs. First, it provides a more rigorous, less ad-hoc way of selecting control
units from a large pool of potential controls. Second, it leverages the large pool of potential
controls to conduct permutation-based inference in a manner that is robust to the possibility
of unit-by-time period specific shocks. In other words, it accounts for the fact that, even if
we observed the entire population for each unit (e.g., state, city, etc.), there would still be
some deviation between the treated unit and its synthetic control because there are aggregate
(i.e., unit-level) shocks that occur at the unit-by-time level.
ARE 213 Applied Econometrics
UC Berkeley Department of Agricultural and Resource Economics
Selection On Unobservables Designs:
Part 3, Instrumental Variables
To date we have studied selection on observables designs and a single selection on unobservables design: panel data with fixed effects or differences-in-differences. In the panel data models, we assume that any unobservable determinants of Y_it that are correlated with treatment assignments are constant over time (and thus get differenced out or absorbed by the fixed effects). This assumption often seems questionable, however; changes within individuals or cross-sectional units do not necessarily occur at random. To address this possibility, we discussed synthetic control methods that essentially combine matching with diffs-in-diffs. Note that this combination basically put us back in the selection on observables world: we essentially assume that we have enough observable characteristics to construct a synthetic control unit whose trajectory of Y_it would match the treated unit's trajectory of Y_it (absent any treatment effect).
We now turn to a true selection on unobservables design: the instrumental variables (IV) estimator. IV methods are a cornerstone of econometrics; these methods date back to the work of Tinbergen and Haavelmo in the 1930s and 1940s. Our understanding of IV methods advanced significantly during the 1990s, however, with seminal work on IV in the context of treatment effect heterogeneity and IV methods in the case of a large number of weak instruments. For the purposes of these notes, I will use the phrase "IV methods" to refer generally to methods using instrumental variables, including IV, two stage least squares (2SLS), and limited information maximum likelihood (LIML).

Footnote 1: Saddam Hussein is reputed to have called IV "the mother of all unobservables designs."
1 Instrumental Variables
1.1 Basic IV
Consider a model of the form

y_i = β_0 + β_1 d_i + ε_i     (1)
I will sometimes refer to this equation as the structural equation. At this point we are not assuming that d_i is binary; it may have more than two points of support, or it may be continuous. The standard condition that we need for a linear regression of y_i on d_i to consistently estimate β_1 is Cov(d_i, ε_i) = 0. This will be true if d_i is randomly assigned, and it could be true in other situations as well. In general, however, it will not be true.
An alternative way to estimate β_1 is via instrumental variables. The goal in IV is to find some subset of the variation in d_i, call it z_i, that is uncorrelated with ε_i (i.e., as good as randomly assigned). Formally, our goal is to find an instrument z_i, not in equation (1), that satisfies the following two properties:

1. Cov(z_i, d_i) ≠ 0

2. Cov(z_i, ε_i) = 0
The first assumption ensures that z_i actually captures some of the variation in d_i. If it doesn't, then it will be of no use to us in estimating the effect of d_i on y_i. The second assumption ensures that z_i is uncorrelated with ε_i (obviously). This assumption is often referred to as the exclusion restriction because it implies that the instrument, z_i, can be excluded from equation (1). If z_i were correlated with ε_i, we would want to include it as a covariate (given that it's also correlated with d_i by our first assumption). This would violate our condition that z_i not be in equation (1).
To fix ideas, let us consider an application. Suppose that we would like to estimate the causal effect of schooling on earnings. One way to estimate this effect would be to regress earnings on schooling and a bunch of other covariates. Is it plausible that the variation in schooling is uncorrelated with everything else that affects earnings, such as unobserved ability? Probably not, even after conditioning on covariates. We generally think that on average people who go to college are different in fundamental ways from people who only complete high school, and specifically we believe that the two groups are probably different in ways that affect earnings (e.g. perhaps the college graduates have higher innate ability).

An alternative way to estimate the effect of schooling on earnings is to find an instrument for schooling that satisfies the criteria above. Suppose that we are lucky and we learn that many years ago the state government of California (the greatest state in the Union, according to my jury duty briefing) decided to start giving out free scholarships to U.C. schools in order to promote higher education. However, because the government has a limited budget, they could not offer the scholarships to everyone, so in the interest of fairness they decided to hold a lottery in which every family in the state was automatically entered, and the scholarships were randomly assigned to lucky families who won the lottery. (Again, to my knowledge this has not actually happened; it's just a hypothetical example to make things more concrete.)
If we let z_i be a dummy variable that is 1 if a family wins the scholarship lottery and 0 otherwise, then z_i is a promising instrument for schooling. We know that the first condition, Cov(z_i, d_i) ≠ 0, will be satisfied because people who win the free scholarships will be more likely to attend college than those who don't, inducing a positive correlation between z_i and d_i. We also know that the second condition for a good instrument, Cov(z_i, ε_i) = 0, is likely to be satisfied because the winners of the lottery were randomly chosen by the state. By definition, no characteristic, other than those characteristics directly affected by the lottery, can possibly be correlated with whether or not a family won the lottery.

Footnote 2: I give an example in Section 1.2 of another treatment, besides schooling, that the lottery might be affecting. This would be a violation of the exclusion restriction.
The IV estimator is:

β̂_IV = (Z′D)^{-1}(Z′Y)

In the general case, Z could contain not only the instrument z_i (the lottery number), but also predetermined covariates x_i (gender, race, parental education, etc.). D would then contain both the treatment of interest, d_i, and the predetermined covariates x_i. In the case in which there are no covariates, we can write β̂_IV = Cov(z_i, y_i)/Cov(z_i, d_i).
It is straightforward to show that β̂_IV is a consistent estimator of β_1 given the assumptions above:

plim(β̂_IV) = plim[(Z′D)^{-1}(Z′Y)]
= plim[(Z′D)^{-1}(Z′Dβ + Z′ε)]
= plim[(Z′D)^{-1}(Z′D)β] + plim[((1/N)Z′D)^{-1}] · plim[(1/N)Z′ε]
= β

where the final equality follows because plim[(1/N)Z′ε] = 0 under the exclusion restriction.
This formal derivation, however, gives limited intuition regarding why or how IV operates.
For intuition, we will turn to alternative methods of implementing the IV estimator.
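As a quick numerical illustration of the formula (my own simulated example, not part of the original notes): when d_i is correlated with the error but z_i is randomly assigned, OLS is biased while (Z′D)^{-1}(Z′Y) recovers β_1.

```python
# A minimal simulated check that the IV matrix formula recovers the structural slope
# when the treatment is endogenous but the instrument is randomly assigned.
import numpy as np

rng = np.random.default_rng(0)
n, beta0, beta1 = 100_000, 1.0, 2.0
z = rng.binomial(1, 0.5, n).astype(float)          # instrument (e.g., a lottery win)
ability = rng.normal(size=n)                       # unobserved confounder
d = 0.5 * z + ability + rng.normal(size=n)         # treatment depends on z and on ability
eps = 2.0 * ability + rng.normal(size=n)           # error correlated with d, not with z
y = beta0 + beta1 * d + eps

D = np.column_stack([np.ones(n), d])               # constant plus treatment
Z = np.column_stack([np.ones(n), z])               # constant plus instrument

beta_ols = np.linalg.solve(D.T @ D, D.T @ y)
beta_iv = np.linalg.solve(Z.T @ D, Z.T @ y)
print(beta_ols[1], beta_iv[1])                     # OLS is biased upward; IV is close to 2.0
```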
1.2 The Reduced Forms and 2SLS
The most popular way to implement the IV estimator is via a two stage procedure known
as two stage least squares (2SLS). If we have one instrument and one variable that we want
to instrument for, 2SLS and IV are the exact same thing (in this case we would say that we
are exactly identified). IV is thus a special case of 2SLS: you can always use 2SLS in any
scenario in which you can use IV, though the reverse is not true. We begin by writing out
the two stages of 2SLS, and then consider what is going on:
1. First Stage: We first estimate a regression of d_i (the variable that we want to instrument for, e.g., schooling in our hypothetical example) on the instrument, z_i (e.g., the lottery number), and all of the predetermined covariates, x_i. This regression looks like:

d_i = π_1 z_i + x_i π_2 + u_i

where z_i and π_1 are scalars, x_i is a 1 × (K + 1) vector that includes all covariates and a 1, and u_i is a residual term. Take the predicted values of d_i (e.g., predicted schooling, d̂_i = π̂_1 z_i + x_i π̂_2) from this regression and use them in place of the actual values of d_i in the second stage.
2. Second Stage: In the second stage, we run the regression that we originally wanted to estimate, but instead of including the variable that we want to instrument for (d_i), we include its predicted values from the first stage (d̂_i). In our example, instead of running earnings on schooling and other covariates, we would run earnings on predicted schooling (from the first stage) and other covariates. Thus the regression looks like:

y_i = β_0 + β_1 d̂_i + x_i β_2 + ε_i

The estimate of β_1 from this regression will be consistent.
Note that both the first and second stages always contain the same set of covariates (you can't exclude certain covariates from the first stage and then include them in the second stage, and you can't exclude covariates from the second stage and include them in the first stage, unless you intend to use them as instruments). In matrices, define Z to be a matrix that includes the instrument (z_i) and the predetermined covariates (x_i). D is a matrix that includes the treatment (d_i) and the predetermined covariates (x_i). Then β̂_2SLS = (D′P_Z D)^{-1}(D′P_Z Y), where P_Z = Z(Z′Z)^{-1}Z′.
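A quick simulated check (mine, not from the notes) that the two-stage recipe and the matrix formula coincide, with an exogenous covariate x_i included in both stages as required:

```python
# A minimal simulated check that manual 2SLS (fitted values from the first stage used in
# the second stage) matches the formula (D' P_Z D)^{-1} D' P_Z Y.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)                     # exogenous covariate
z = rng.normal(size=n)                     # instrument
ability = rng.normal(size=n)               # unobserved confounder
d = 0.8 * z + 0.5 * x + ability + rng.normal(size=n)
y = 1.0 + 2.0 * d + 1.0 * x + 3.0 * ability + rng.normal(size=n)

ones = np.ones(n)
Z = np.column_stack([ones, z, x])          # instrument plus covariates
D = np.column_stack([ones, d, x])          # treatment plus covariates

# Two-stage recipe: regress d on Z, form fitted values, regress y on [1, d_hat, x].
pi_hat = np.linalg.lstsq(Z, d, rcond=None)[0]
d_hat = Z @ pi_hat
second_stage = np.linalg.lstsq(np.column_stack([ones, d_hat, x]), y, rcond=None)[0]

# Matrix formula, computing P_Z D without ever forming the n x n projection matrix.
PZ_D = Z @ np.linalg.lstsq(Z, D, rcond=None)[0]
beta_2sls = np.linalg.solve(PZ_D.T @ D, PZ_D.T @ y)
print(second_stage[1], beta_2sls[1])       # equal to each other and close to the true 2.0
```

(One practical caveat worth remembering: the point estimates from the manual two-stage recipe are right, but its naively computed standard errors are not, which is one reason to use a canned 2SLS routine in applied work.)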
Now that we have introduced the first stage and the second stage, we are almost done, but
before we move on to the next section I will introduce the reduced form equation. Technically
the term reduced form refers to any regression which regresses an endogenous variable (i.e., a not-exogenous variable; in our case y_i and d_i are our two endogenous variables) on all of the exogenous variables (z_i and x_i). So, if you consider the two regressions that we estimated above, you will see that the first stage is in fact a reduced form equation. However, in general I will use the term "reduced form" to refer specifically to the reduced form equation that regresses y_i on all of the exogenous variables (z_i and x_i). So the reduced form in our example is:

y_i = γ_1 z_i + x_i γ_2 + v_i
What does the reduced form measure? The reduced form measures the causal effect of the instrument (z_i) on the outcome variable (y_i). In our example, the coefficient that we get from running the reduced form gives us an estimate of the effect on earnings of winning the scholarship lottery. Note that if z_i is a good instrument, then the causal effect should run only through the variable that is being instrumented for (d_i). In our example, that means that winning the scholarship lottery should raise your income only because it encourages you to get more schooling on average, not for some other reason (e.g., because the parents of lottery winners used the money they saved on college tuition to pay for additional private tutoring for their children).
So there are three equations we want to keep in mind:
1. The first stage, which regresses the variable we're instrumenting for on the instrument(s) and the other exogenous variables. This predicts how the variable we're instrumenting for changes as our instrument changes.

2. The second stage, which regresses y_i on the predicted values from the first stage and the other exogenous variables. This gives us our IV estimate of β_1.

3. The reduced form, which regresses y_i on the instrument and the other exogenous variables. This measures how y_i changes as we change our instrument z_i. Note that we never have to run the reduced form in the 2SLS procedure, but as you will see in the next section, it is a useful concept to keep in mind.
1.3 IV Intuition
At this point, some might ask, "Why not just run the reduced form? Why bother with IV (2SLS) at all? After all, the reduced form gives unbiased predictions, and it's much less complex than this two stage procedure." In other words, why not simply replace the variable we want to instrument for (d_i) with the instrument (z_i)? Actually, this isn't necessarily a bad idea. As Josh Angrist says, "Many papers would do well to stop with the reduced form." The reduced form makes explicit exactly where the identification in the research design is coming from, and it does not suffer from some of the weak instruments issues that we will discuss later. Any time you are dealing with a single instrument, it's a good idea to estimate the reduced form and check whether it conforms to your expectations, even if you don't put it into the paper.
The answer to the question above, however, is that we usually are not interested in measuring the effect of z_i on y_i, which is what the reduced form gives us. Instead, we are interested in measuring the effect of d_i on y_i. That is what IV gives us. In our example, we are interested in measuring the effect of schooling on wages, so we run IV. If we just ran the reduced form, we would get the effect of winning the scholarship lottery on wages. While that may be of some policy interest in evaluating the scholarship program, it is not what we are looking for.
From a linear algebra perspective, 2SLS/IV estimates β by first projecting all of the data onto the subspace spanned by Z (all of the exogenous variables in the regression, i.e., the instrument and the predetermined covariates) and then running the regression of y_i on d_i and x_i after they have been projected onto this subspace. In this sense it should be clear that we are only using the "good" variation in d_i (i.e., the variation in d_i that comes from z_i) to estimate β_1. However, in the case in which you have a single instrument (which is all we have discussed so far), there is an even cleaner interpretation.

Footnote 3: If you are overidentified, i.e. you have more instruments than you need, then this interpretation does not hold anymore, though it is still conceptually useful. In our example, we are just identified (one variable to instrument for, i.e. schooling, and one instrument, i.e. the scholarship lottery), so you can apply the interpretation that I'm about to give.
In the case of one treatment and one instrument, the estimate of β_1 that we get from IV equals the reduced form coefficient rescaled by the first stage coefficient. That is to say:

β̂_{1,IV} = γ̂_1 / π̂_1

What this shows is that the IV estimate is very closely related to the reduced form estimate; in fact, it's exactly proportional to the reduced form estimate. Why is this a useful formulation? Well, consider what each coefficient means.
In our example, the reduced form coefficient (γ_1) measures the effect of winning the scholarship lottery on earnings. But that is not what we want; what we want is the effect of an additional year of schooling on earnings. Because the scholarship lottery only affects earnings due to its effect on increasing schooling (or so we're assuming), the reduced form coefficient represents the effect of an unknown additional amount of schooling on earnings. The problem is that we don't have the units right. If we knew that everyone who won the scholarship lottery got, on average, one more year of school than they otherwise would have, then we could interpret the reduced form coefficient as the causal effect of one more year of schooling on earnings. Why? Well, remember that because the instrument z_i is randomly assigned (i.e. winners are randomly picked), the winners and the losers are on average comparable in every way except that the winners get, on average, one more year of schooling than the losers. So any difference in earnings between the winners (i.e. those with z_i = 1) and the losers (i.e. those with z_i = 0) must be due to the extra year of schooling. Thus the coefficient on z_i in the reduced form (γ_1) is the effect of one more year of schooling on earnings.
In general, however, it is unlikely that the scholarship lottery winners get, on average, exactly one more year of schooling than the losers. So how do we rescale the reduced form coefficient so that we get the units right? The answer is that we divide through by the first stage coefficient, π̂_1. Why does this work? Consider our specific example. The first
stage estimates the effect of winning the scholarship lottery on years of education. So the first stage coefficient, π̂_1, tells you how much, on average, your years of schooling increase if you win the scholarship lottery. So suppose that π̂_1 = 0.5, i.e. that those who win the scholarship lottery get half a year more of education on average than those who do not win the lottery. Also suppose that γ̂_1 = 500, i.e., those who win the scholarship lottery earn $500 more per year on average than those who do not win the scholarship lottery. Then we know that the people winning the lottery are earning $500 more because they have extra schooling (this comes from the reduced form). And we know that they are on average getting an extra 0.5 years of schooling when they win the lottery. So what is the return to one additional year of schooling? It is $500/0.5 (the change in earnings divided by the change in schooling), or $1000. In other words, our estimate of the effect of schooling on earnings is β̂_IV = γ̂_1/π̂_1. So the reduced form coefficient represents the causal effect of some additional amount of schooling on earnings (how much additional schooling is unknown until we see the first stage), and the first stage coefficient rescales that coefficient appropriately to reflect the amount of extra schooling that the instrument (the scholarship lottery) generates.
So far you have taken it on faith that the formula β̂_IV = γ̂_1/π̂_1 is actually true. But it is actually simple to prove. Recall that β̂_IV = (Z′D)^{-1}(Z′Y). If you accept that we can apply partitioned regression to IV just like we can with OLS, then it is trivial to transform the formula for β̂_IV into one in which Z and D are always vectors. If Z contains covariates X, simply redefine Z such that Z̃_1 = M_X Z_1, where Z_1 is a column vector containing only the instrument and M_X is the orthogonal projection matrix for the covariates, M_X = I − X(X′X)^{-1}X′ (X is an N × (K + 1) matrix containing all covariates and a column of ones). Thus we can always write β̂_{1,IV} as

β̂_{1,IV} = (Z̃_1′ D_1)^{-1}(Z̃_1′ Y) = Cov(z̃_i, y_i)/Cov(z̃_i, d_i)

Footnote 4: That partitioned regression works for the 2SLS procedure should be fairly obvious. It is less self-evident that partitioning must work for the IV formula as well.

Footnote 5: Also define the column vector D_1 such that D_1 contains only the treatment, d_i.
Now consider γ̂_1 and π̂_1. The former comes from a regression of y_i on z̃_i, so γ̂_1 = Cov(z̃_i, y_i)/Cov(z̃_i, z̃_i). The latter comes from a regression of d_i on z̃_i, so π̂_1 = Cov(z̃_i, d_i)/Cov(z̃_i, z̃_i). Thus

γ̂_1 / π̂_1 = [Cov(z̃_i, y_i)/Cov(z̃_i, z̃_i)] / [Cov(z̃_i, d_i)/Cov(z̃_i, z̃_i)] = β̂_{1,IV}
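A quick numerical check of this identity (my own simulated example, not from the notes): with one instrument and an exogenous covariate, the ratio of the reduced-form coefficient on z to the first-stage coefficient on z equals the just-identified IV coefficient.

```python
# A minimal simulated check that (reduced form coefficient)/(first stage coefficient)
# equals the just-identified IV estimate when the same covariates appear in every regression.
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=n)                     # exogenous covariate
z = rng.normal(size=n)                     # instrument
u = rng.normal(size=n)                     # unobserved confounder
d = 1.0 * z + 0.5 * x + u + rng.normal(size=n)
y = 0.5 + 2.0 * d - 1.0 * x + 2.0 * u + rng.normal(size=n)

ones = np.ones(n)

def ols(W, v):
    return np.linalg.lstsq(W, v, rcond=None)[0]

gamma1 = ols(np.column_stack([ones, z, x]), y)[1]   # reduced form coefficient on z
pi1 = ols(np.column_stack([ones, z, x]), d)[1]      # first stage coefficient on z

Z = np.column_stack([ones, z, x])
D = np.column_stack([ones, d, x])
beta_iv = np.linalg.solve(Z.T @ D, Z.T @ y)[1]      # just-identified IV coefficient on d
print(gamma1 / pi1, beta_iv)                        # identical up to floating-point error
```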
The takeaway of all of this is that, when working with an IV estimator, the entire experiment is in the reduced form. The reduced form measures the causal impact of the instrument on the outcome; the first stage exists only to rescale that estimate and get the units right. Thus, when applying IV, you should always consider the underlying reduced form that you are running and ascertain whether it makes sense and whether it is identifying the causal effect in the manner that you originally imagined.
1.4 Multiple Instruments
It's often very difficult to find one good instrument, let alone two or more good instruments. Nevertheless, in some cases a single conceptual instrument will be parameterized through multiple variables (we will see an example of this in the next section). In those cases, we say that the equation is "overidentified," in the sense that we have more instruments than we need. It's impossible to incorporate more than one instrument into the IV estimator because β̂_IV = (Z′D)^{-1}(Z′Y); Z and D must have the same number of columns, or else the first half of β̂_IV won't be conformable with the second half. One option would be to simply pick one instrument and discard the rest, but this seems undesirable from an efficiency standpoint because you're throwing away valid information for estimating β. An attractive alternative then is to use 2SLS, which can trivially accommodate more than one instrument.
In the two stage procedure, simply include all instruments in the first stage when you predict the value of d_i. For example, if you have two instruments, z_{1i} and z_{2i}, estimate the first stage as:

d_i = π_1 z_{1i} + π_2 z_{2i} + x_i π_3 + u_i
Then use d̂_i as the regressor in the second stage instead of d_i. In matrices, the formula remains the same: β̂_2SLS = (D′P_Z D)^{-1}(D′P_Z Y). Now Z contains more columns than D, but that doesn't affect the conformability of P_Z (which is an N × N matrix) with D. Under Gauss-Markov type assumptions, 2SLS efficiently combines all of the instruments to estimate β.
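A minimal simulated sketch (not from the notes) of the overidentified case: two instruments, one endogenous regressor, and the same projection-matrix formula as before.

```python
# A minimal simulated 2SLS example with two instruments for a single treatment.
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
z1 = rng.normal(size=n)
z2 = rng.binomial(1, 0.5, n).astype(float)
ability = rng.normal(size=n)
d = 0.6 * z1 + 0.4 * z2 + ability + rng.normal(size=n)
y = 1.0 + 2.0 * d + 2.0 * ability + rng.normal(size=n)

ones = np.ones(n)
Z = np.column_stack([ones, z1, z2])    # more instruments than endogenous regressors
D = np.column_stack([ones, d])

PZ_D = Z @ np.linalg.lstsq(Z, D, rcond=None)[0]       # P_Z D without forming P_Z
beta_2sls = np.linalg.solve(PZ_D.T @ D, PZ_D.T @ y)
print(beta_2sls[1])                                   # close to the true effect of 2.0
```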
2 Applications
We now consider two important applications of instrumental variables. These applications
are particularly helpful when studying IV in the context of heterogeneous treatment effects
and the weak instruments issue (both of which we will cover).
2.1 Medical Trials
For a variety of reasons, medical trials are a fantastic example of an application of instru-
mental variables; I would argue the best, in fact. First of all, they are socially important
(perhaps the most important application of IV to date). Furthermore, they are very clean
in terms of experimental design, so they make a great teaching example for conveying the
intuition behind what the IV estimator is doing. My personal recommendation would be to
use this example whenever possible to guide you in understanding how IV operates.
The model for a medical trial is the same simple regression model that we are accustomed to: y_i = β_0 + β_1 d_i + ε_i. In this case, y_i represents a medical outcome, which could either be a continuous variable such as blood pressure or cholesterol level or a discrete variable such as whether or not you survive (e.g., 1 if you survive, 0 if you do not). The variable d_i is generally a dummy variable that is 1 if you receive the treatment and 0 if you do not. It could alternatively be continuous (for example, it could be the dosage in milligrams of the drug that you receive), but in this example we will assume it is binary (you either take the pill or you do not take the pill). The error term ε_i represents all other factors that affect the health outcome. Note that the regression model corresponds to the potential outcomes model with constant treatment effects (y = d·y_1 + (1 − d)·y_0, with y_0 = β_0 + ε and y_1 = y_0 + β_1).
At this point I will switch to a specific example in order to make the discussion clearer. Let y_i be blood pressure, and let d_i represent a pill that is designed to treat high blood pressure, so d_i = 1 if individual i takes the pill and d_i = 0 if individual i does not take the pill. Our goal is to estimate the effect that the pill has on lowering blood pressure; our hope is that β_1 is large and negative. One way to estimate the effect is to start selling the drug to the general population and then collect some data and run a regression of blood pressure on whether or not you take the pill. However, this estimate will clearly suffer from a selection issue: people who take the pill are the ones who have high blood pressure to begin with! We will likely get a positive estimate of β_1 from this procedure, even if the true β_1 is large and negative. This may be true even after we condition on observable covariates using one of the selection on observables designs we discussed earlier. Therefore, in order to accurately estimate β_1, we design a medical trial in which we randomly assign some patients to the treatment group and assign other patients to the control group. The patients assigned to the treatment group are then given the pill and told to take it, while the patients assigned to the control group are given a placebo (or nothing at all).
Back in the old days (perhaps even older than me), people estimated the effect of the drug by simply subtracting the mean of y_i for the control group from the mean of y_i for the treatment group (in other words, regressing y_i on a variable that is 1 if you are in the treatment group and 0 if you are in the control group). This is what is known as an "intention to treat" analysis, because you are taking the difference between the group that you intend to treat and the group that you do not intend to treat. But there was the problem of non-compliance: some people in the treatment group would fail to take the pill and others in the control group would obtain the pill from another source, even though they were not supposed to. This non-compliance can cause a bias in the estimate of β_1, and it was not immediately clear how to fix this bias until it became obvious that what we were looking at was actually a simple IV problem.
In this case, the instrument z_i is the intention to treat, i.e. z_i = 1 if you are assigned to the treatment group (we intend to treat you), and z_i = 0 if you are assigned to the control group (we do not intend to treat you). It is easy to see that z_i satisfies the two properties of a good instrument. First of all, z_i is uncorrelated with ε_i by construction, because whether you are assigned to the treatment group or the control group is randomly determined, so Cov(z_i, ε_i) = 0. Second, z_i is correlated with d_i, because you are going to be more likely to take the pill if you are in the treatment group, so Cov(z_i, d_i) ≠ 0. Therefore, z_i is a valid instrument for d_i, and the IV estimator gives us a consistent estimate of β_1, the effect of taking the pill on blood pressure.
How does this fix the non-compliance problem that we discussed before? To facilitate
understanding, assume that the non-compliance problem only exists for the people in the
treatment group. That is to say, assume that nobody in the control group takes the pill, but
also assume that only half the people in the treatment group take the pill (i.e., half of the
treatment group fails to comply and does not take the pill, while the other half takes the
pill, as they were supposed to). What will the IV estimate look like?
The first stage will regress d_i on z_i, i.e. regress whether you took the pill on whether you were in the treatment group. So the first stage is:

d_i = π_1 z_i + u_i

Since zero people in the control group took the pill while half the people in the treatment group took the pill, it should be intuitively clear that our estimate for π_1 will be 0.5 (being in the treatment group raises your probability of taking the pill by 50 percentage points, so π̂_1 = 0.5).
Now recall that the IV estimate is the reduced form rescaled by the first stage. In this case, the reduced form is a regression of y_i (your blood pressure) on z_i (whether you were assigned to the treatment or control group). So the reduced form is:

y_i = γ_1 z_i + v_i
Therefore, our IV estimate is β̂_{1,IV} = γ̂_1/π̂_1 = γ̂_1/0.5. How is this fixing the non-complier problem? Well, we know that the reduced form estimates the causal effect of the instrument on y_i, so in our case the reduced form is estimating the effect that being assigned to the treatment group has on blood pressure. If there were a perfect correlation between being assigned to the treatment group and taking the pill (i.e. everyone in the treatment group took the pill, and nobody in the control group took the pill), then the reduced form estimate would be the effect of taking the pill on blood pressure. In that case the first stage would give us π̂_1 = 1, and the IV would be β̂_{1,IV} = γ̂_1/π̂_1 = γ̂_1. In other words, the IV would be the same as the reduced form (which is what we would expect, since both are supposed to be estimating the same thing in this case, i.e., the effect of the pill on blood pressure).
In our example, however, there is not a perfect correlation between being assigned to the treatment group and taking the pill, which is why our first stage estimate is π̂_1 = 0.5, not π̂_1 = 1. So in our case, the reduced form is estimating the effect on your blood pressure of increasing the probability that you take the pill by 50 percentage points. This means that the reduced form is not going to be estimating the full effect of taking the pill. Instead, it's estimating half of the effect of taking the pill. If it helps, imagine that there are 10 people in the treatment group, 5 of whom take the pill and 5 of whom do not, and 10 people in the control group, 0 of whom take the pill. The (expected) mean blood pressure for the treatment group will be (5β_0 + 5(β_0 + β_1))/10 = β_0 + β_1/2, while the (expected) mean blood pressure for the control group will just be β_0. So the reduced form coefficient, γ_1, will be the difference of means between the treatment and control groups, or β_1/2. This is, of course, half the effect
Therefore, the (plim of the) IV estimate will be β_{1,IV} = γ_1/π_1 = (0.5β_1)/0.5 = β_1, which is exactly what we want. We can see that the IV estimate gives us a consistent estimate precisely because it is rescaling the reduced form by the first stage. In our example what this means in practice is that we are rescaling the reduced form to account for the fact that being in the treatment group only increases your probability of taking the pill by 50 percentage points, not by a full 100 percentage points. So the reduced form only represents half the effect of taking the pill, and it must be rescaled by (divided by) 0.5 in order to estimate the full effect of taking the pill.
More generally, what this example demonstrates is that IV functions by taking the estimated causal effect of z_i on y_i (the reduced form) and rescaling it by the estimated causal effect of z_i on d_i (the first stage).
Before we move on, I should note how IV is different than simply taking the mean of y_i for the people in the treatment group who took the pill and subtracting the mean of y_i for the people in the control group who did not take the pill (which, in our example, is the entire control group). The estimator I just described, which I will refer to as the naive estimator, is affected by the same selection issues as a simple OLS regression of y_i on d_i. Specifically, it may be the case that the people in the treatment group who choose not to take the pill do so because their blood pressure was not very high to begin with. Thus the group of people that actually took the pill are the ones that all had high blood pressure to begin with, and we will tend to estimate that the pill does not have much of an effect (because its downward effect is being counteracted by the fact that the people who select to take it all had high blood pressure to begin with).

The IV estimator does not suffer from this selection problem because it does not discard the people in the treatment group who choose not to take the pill. To understand this,
imagine for the moment that there are two types of people in our sample: high blood
pressure types and low blood pressure types. Assume that they occur with equal frequency,
so that when we randomly assign our sample to the treatment and control groups, half of the
treatment group is high blood pressure, half of the treatment group is low blood pressure,
half of the control group is high blood pressure, and half of the control group is low blood
pressure. The half of the treatment group that takes the pill all have high blood pressure, so
when we apply the naive estimator and compare their average blood pressure to the average blood pressure of the control group, we underestimate the effect of the pill because we are comparing a group of high blood pressure people (who took the pill) to a group that is a 50/50 mix of high blood pressure and low blood pressure people (who did not take the pill). In contrast, what IV does is compare the mean of the treatment group (which is half high blood pressure people and half low blood pressure people) to the mean of the control group (which is half high blood pressure people and half low blood pressure people) in the reduced form. It then rescales this difference in means by the first stage to account for the fact that not all of the treated group took the pill. So unlike the naive estimator, which deceptively compares a high blood pressure group to a half-high/half-low blood pressure group, IV compares two comparable groups, and that is why it gives us a consistent estimate of the effect of the pill.
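The following simulation (my own sketch of the hypothetical example above, not from the notes) puts numbers on both points: the intention-to-treat difference is half the true effect, dividing by the first stage recovers the full effect, and the naive takers-versus-controls comparison suggests the pill does essentially nothing.

```python
# One-sided noncompliance: only the high-blood-pressure half of the treatment group takes
# the pill, nobody in the control group does, and the pill lowers blood pressure by -10.
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
z = rng.binomial(1, 0.5, n)                   # randomized assignment (intention to treat)
high_bp = rng.binomial(1, 0.5, n)             # latent type: high blood pressure
d = z * high_bp                               # only high-BP people in the treatment group comply
beta1 = -10.0
y = 140 + 20 * high_bp + beta1 * d + rng.normal(0, 5, n)   # blood pressure

gamma1 = y[z == 1].mean() - y[z == 0].mean()  # reduced form / intention to treat (about -5)
pi1 = d[z == 1].mean() - d[z == 0].mean()     # first stage (about 0.5)
naive = y[d == 1].mean() - y[z == 0].mean()   # takers vs. the control group
print(gamma1, pi1, gamma1 / pi1, naive)       # IV = gamma1/pi1 is about -10; naive is about 0
```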
2.2 Quarter of Birth
The quarter of birth application is perhaps the most-studied example of IV in the economics
literature. This example is taken from Angrist and Krueger (1991). I will discuss the basic
framework and idea; for more details see the article itself. The purpose is to demonstrate
a nice application of IV/2SLS (which may help you think about what a good instrument
looks like) and to familiarize you with the canonical example used in the weak instruments
literature.
The question addressed with this instrument is a familiar one: what is the return to an additional year of schooling? One way to answer this question is to run a standard regression, y_i = β_0 + β_1 d_i + x_i β_2 + ε_i, where y_i is log wages, d_i is years of school, and x_i is a vector of covariates. However, as we know, this regression is likely to give us a biased estimate of β_1 for a variety of reasons, including selection bias and measurement error. The problem is that Cov(d_i, ε_i) ≠ 0; one way to address this problem is to find an instrument z_i that is correlated with d_i (schooling) but uncorrelated with ε_i.
Angrist and Krueger suggest quarter of birth as the instrument. Why use this as an
instrument for schooling? The idea is that states have mandatory schooling laws stipulating
that students must stay in school until a given age (say age 16, for simplicity). However,
the key thing is that these laws dictate the age at which a student may leave school, not
how many years of schooling a student must get. Therefore, if a student starts school at age
6, she will be legally required to receive 10 years of schooling. However, if she starts school
at age 5, she will be legally required to receive 11 years of schooling. Thus, variations in
the age at which a student starts school will result in variations in the amount of schooling
that student is legally required to receive. While this will not make a difference for most people (because most people do not drop out of high school as soon as they are no longer required to be there), it will make a difference for some people, so there should be a nonzero
correlation (albeit a modest one) between the age one starts school and how many years of
schooling one receives.
How does this all pertain to quarter of birth? The quarter in which a student is born can have a large effect on what age the student starts school because the academic calendar begins in September regardless of quarter of birth. Many states require children to start school in the calendar year in which they turn 6. So, for example, a child born in December (fourth quarter) might start school at age 5.7, and thus be required by law to receive a minimum of 10.3 years of schooling (16 minus 5.7). However, a child born in January (first quarter) might start school at age 6.7, and thus be required by law to receive a minimum of only 9.3 years of schooling (16 minus 6.7). Quarter of birth is thereby correlated with legally required years of schooling, and thus quarter of birth is also correlated with actual years of schooling. Quarter of birth therefore satisfies the first property of a good instrument, Cov(z_i, d_i) ≠ 0.
Does quarter of birth satisfy the second property of a good instrument, i.e. Cov(z_i, ε_i) = 0? Potentially, yes (though it turns out not). It seems plausible that the quarter in which one is born might not causally affect one's future wages, except through its effect on schooling (though there could be some strange weather effect on young babies). There is also no obvious reason to think that quarter of birth should be spuriously correlated with anything that affects future wages, particularly if we think that the time of conception is determined in a random manner. It is therefore plausible that quarter of birth and future wages are uncorrelated (except through changes in schooling).
How would we implement the quarter of birth instrument in practice? We would probably
use three instruments: a dummy for the first quarter (z_1), a dummy for the second quarter (z_2), and a dummy for the third quarter (z_3) (we exclude the fourth quarter to avoid the dummy variable trap, i.e. to avoid perfect collinearity with the constant term). So the first stage would be to regress schooling on quarter of birth (assuming there are no additional covariates that we are including):

d_i = π_0 + π_1 z_{1i} + π_2 z_{2i} + π_3 z_{3i} + u_i
Then take the predicted d̂_i from the first stage and use them in the second stage to run the regression:

y_i = β_0 + β_1 d̂_i + u_i

The value of β̂_1 from this regression is our estimate of the effect of schooling on wages.
If the two IV assumptions are true (Cov(z_i, d_i) ≠ 0 and Cov(z_i, ε_i) = 0), then this will be a consistent estimate of the effect of schooling on wages.
There are a couple of things to note in this application. First, Angrist and Krueger implement IV in a couple of different ways. They begin with a Wald estimator, which compares only two groups, people born in the first quarter and people born in the second through fourth quarters. The Wald estimator divides the difference in mean earnings for the two groups by the difference in mean schooling. Given our previous discussion of IV, it should be clear that this procedure is equivalent to doing IV with a binary instrument and no covariates. With this estimator, Angrist and Krueger estimate the return to schooling to be around 0.10 (i.e., one additional year of schooling raises wages by 10 percent) in the 1980 Census. This is higher than the OLS estimate from the same sample, which they find to be around 0.07. However, Angrist and Krueger also implement a 2SLS procedure in which they use dozens of instruments. They produce these instruments by interacting quarter of birth with year of birth (since the effect of quarter of birth on schooling might vary across years). In their 2SLS regressions, they frequently find coefficients closer to the OLS estimate of 0.07 than to the Wald estimate of 0.10. Unbeknownst to them, the culprit behind this pattern is the weak instruments problem, which we will discuss in a subsequent section.
Second, while the quarter of birth instrument is much better than most instruments you will come across (at least in terms of satisfying the exclusion restriction), it is still not impervious to criticism. For example, many babies are conceived shortly after people get married. Some couples are likely to wait until the summer to get married, while other couples are more likely to get married quickly or when it is most convenient. Therefore, couples of the first type would be more likely to have children in the first or second quarter, whereas couples of the latter type would be equally likely to have children in any quarter. If couples of the first type are different in some important way (e.g., perhaps they have higher income on average) than couples of the second type, then that could introduce a correlation between quarter of birth and future wages. Any nonzero correlation between z_i and ε_i would be particularly problematic in this case because the first stage is relatively weak (again, we will discuss this issue in a subsequent section).
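Returning to the Wald estimator described above, here is a minimal sketch on simulated data (not the Angrist and Krueger Census extract): the difference in mean log wages between first-quarter births and everyone else, divided by the corresponding difference in mean schooling.

```python
# A simulated Wald estimator for the quarter-of-birth idea: Q1 births get slightly less
# schooling, the true return to schooling is 0.10, and "ability" is an omitted variable.
import numpy as np

rng = np.random.default_rng(5)
n = 300_000
q1 = rng.binomial(1, 0.25, n)                            # born in the first quarter
ability = rng.normal(size=n)
school = 12 + ability - 0.3 * q1 + rng.normal(0, 2, n)   # Q1 births get a bit less schooling
logwage = 1.0 + 0.10 * school + 0.5 * ability + rng.normal(0, 0.4, n)

wald = (logwage[q1 == 1].mean() - logwage[q1 == 0].mean()) / \
       (school[q1 == 1].mean() - school[q1 == 0].mean())
print(wald)   # roughly 0.10, the simulated return, despite the omitted ability term
```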
3 Heterogeneous Treatment Effects: AIR (1996) and LATE
All of the discussion above concentrates on IV in the context of homogeneous treatment effects. This was the focus of IV estimation for the first 50 years, but it doesn't fit in with our discussion of heterogeneous treatment effects at the beginning of the course. Recall the distinction between ATE (the average treatment effect for a randomly drawn individual in our sample) and TOT (the average treatment effect for a randomly drawn treated individual in our sample). With IV, these distinctions become more interesting. We have sidestepped this discussion so far by assuming homogeneous treatment effects, so ATE is equal to TOT, and both are equal to the average treatment effect for any other sub-population one might think of. If we allow for heterogeneous treatment effects, however, what is it that IV actually estimates? ATE? TOT? The answer, presented in Angrist, Imbens, and Rubin's seminal 1996 paper (henceforth AIR 1996), is neither.
3.1 Intuition

I will proceed somewhat unconventionally by first explaining intuitively what IV estimates in the context of heterogeneous treatment effects and then presenting the mathematical proof. My hope is that understanding the terms and concepts intuitively will make the math easier to interpret. What IV generally estimates is the local average treatment effect, or LATE. LATE is the average treatment effect of d_i on y_i for the units for whom changing the instrument (changing z_i) changes their treatment status (changes d_i). This is somewhat abstract, but it should become clearer in the context of our two examples, the medical trial and the quarter of birth instrument.
What does it mean to say that IV estimates the average treatment effect of d_i on y_i for the units for whom changing z_i changes d_i? In practice, this is best illustrated in the medical trial example. In this example, there are four potential types of people. Note that not all of these types need exist in practice; in fact, we will explicitly rule out one type by assumption when we do the proof. The first type are people who always take the pill, regardless of whether they are assigned to the treatment group or the control group.[6] In the language of AIR 1996, we call these people always-takers. The second type are people who never take the pill, regardless of whether they are assigned to the treatment group or the control group. We call these people never-takers. The third type are people that take the pill if and only if they are assigned to the treatment group. We call these people LATE-compliers. Finally, the fourth type are people who take the pill if and only if they are in the control group. We call this perverse group the LATE-defiers, and we rule them out by assumption.

[6] You might wonder how the control group could get the pill. Think about terms like black market or prescription abuse.

The people for whom changing z_i changes their value of d_i are the people who take the pill if and only if they are in the treatment group, i.e., the LATE-compliers (recall that assignment to treatment versus control group is the instrument in this example). The always-takers are unaffected by the instrument, because they take the treatment regardless of whether they are in the treatment or control group. Likewise, the never-takers are also unaffected by the instrument, because they eschew the treatment regardless of whether they are in the treatment or control group. The defiers are ruled out by assumption. Therefore, the IV estimator estimates the effect of the pill on blood pressure for the people who take the pill if they are in the treatment group but do not take it if they are in the control group. If the effect is homogeneous, then this distinction is irrelevant, but if the effect varies across individuals, then this distinction can become important.
Suppose that there are two types of people: people who respond to the pill and people who do not respond to the pill. This is not a far-fetched assumption: most medical trials find that the treatment is successful in treating some cases, but unsuccessful in treating other cases. So β_1 is negative for people who respond to the pill (remember that we think the pill should lower blood pressure), and β_1 is zero for people who do not respond to the pill. Further suppose that people who respond to the pill know that they will respond to it (don't ask me how), so they always take it, regardless of whether they are in the treatment or the control group. However, the people for whom the treatment has no effect take the pill only if they are in the treatment group (when they are given the pill for free), and not if they are in the control group. We know that IV estimates the effect of the treatment on the LATE-compliers, i.e., the people that take it if and only if they are in the treatment group. Therefore, in this case, IV estimates the effect of the treatment on the people for whom the treatment has no effect, because they are the only ones for whom the instrument changes whether or not they take the pill. So IV will estimate β_1 = 0 in this example, despite the fact that the average treatment effect is negative.[7]

[7] Of course, we could alternatively construct a scenario in which the individuals with no treatment effect are the never-takers and the individuals with a negative treatment effect are the LATE-compliers. In that scenario, IV would produce a negative estimate of β_1, but the magnitude would be larger than ATE.

Does this mean that IV is inconsistent? Not really: it is simply providing a consistent estimate of the local average treatment effect (the average effect for the people for whom changing the instrument changed d_i), not the average treatment effect for the entire population or sample. As long as you interpret IV correctly, then it is not inconsistent. Of course, it may not estimate what you want to estimate (which might be ATE or TOT), but that's the way the cookie crumbles. So the lesson here is that IV is consistent, but that you have to be careful in thinking about exactly what it is estimating. Importantly, IV estimates the average treatment effect for individuals that comply with the instrument. Since different instruments will have different sets of compliers, it follows that different instruments can plim to different values, even if all the instruments under consideration meet the two criteria for valid instruments. This result basically invalidates overidentification tests as a valid scientific testing procedure and has implications for instrumenting for multiple endogenous variables simultaneously.
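The logic of this example is easy to verify numerically. The simulation below is a hypothetical sketch of the pill scenario (all numbers are invented): responders are always-takers, non-responders are LATE-compliers, and the Wald/IV calculation recovers the complier effect of zero rather than the negative ATE.

```python
import numpy as np

# Hypothetical pill example: responders (effect -10) always take the pill,
# non-responders (effect 0) take it only when assigned to treatment.
rng = np.random.default_rng(1)
n = 100_000

responder = rng.random(n) < 0.5
z = rng.integers(0, 2, size=n)              # randomized assignment (the instrument)
d = np.where(responder, 1, z)               # always-takers vs. LATE-compliers
effect = np.where(responder, -10.0, 0.0)
y = 140 + effect * d + rng.normal(scale=5, size=n)   # blood pressure

reduced_form = y[z == 1].mean() - y[z == 0].mean()
first_stage = d[z == 1].mean() - d[z == 0].mean()    # share of compliers (about 0.5)
print("ATE:      ", effect.mean())                   # about -5
print("IV (LATE):", reduced_form / first_stage)      # about 0
```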
Why does IV estimate LATE in our example? As I have reiterated many times, the IV estimator is the reduced form divided by the first stage. So if the IV estimate is 0 in the example I discussed above, that means that the reduced form must be 0. In the medical trial example, the reduced form is the mean blood pressure for the treatment group minus the mean blood pressure for the control group. Since the always-takers take the pill when they are in the treatment group and when they are in the control group, their mean blood pressure will not be any different when they are in the treatment group than it is when they are in the control group. So those people will never contribute anything to moving the reduced form away from zero. The people who can potentially move the reduced form away from zero are the people who take the treatment when they're in the treatment group but do not take it when they are in the control group. But we assumed that those were the people for whom the pill had no effect, so of course their mean blood pressure in the treatment group is not any different than their mean blood pressure in the control group. Thus we get a reduced form of 0 in our example.

If the pill did have an effect for these people, then the reduced form would be capturing that effect, and we would get a nonzero coefficient estimate. That coefficient would represent the total effect of the pill averaged over all of the individuals in the treatment group. In fact, however, only the LATE-compliers were affected. The IV thus rescales the reduced form by the first stage because the first stage estimates, in our example, the fraction of the sample that are LATE-compliers (i.e., the fraction that changed their value of d_i in response to being assigned to the treatment group).
To reiterate, the always-takers and the never-takers do not, in expectation, contribute anything to moving the reduced form away from zero, because for them the treatment indicator is always the same in the treatment group and the control group (and the random assignment procedure balances them, on average, across treatment and control). Thus their mean blood pressure is no different in the treatment group than it is in the control group. Therefore, the only group of people who can move the reduced form away from zero is the group of LATE-compliers, because for them the treatment level actually varies depending on whether they are in the treatment group or in the control group. So if the treatment has an effect for them, then their mean blood pressure will be different in the treatment group than it is in the control group. But by definition, the LATE-compliers are the people for whom changing z_i changes d_i. Thus IV estimates the average treatment effect for the people for whom changing z_i changes d_i, because those are the people who drive the reduced form, and IV is just the reduced form rescaled by the first stage.
Before showing the formal derivation of LATE, I will explain how it applies to the quarter of birth example. Recall that in the quarter of birth example, the instrument works because some people stay in school only as long as they are legally required to, and then they drop out as soon as they reach age 16. These are the people for whom the instrument z_i (quarter of birth) has an effect on d_i (years of school). If it helps, you could literally imagine a 15.5 year old potential dropout who was born in the third quarter thinking to himself, "If only I had been born in the first quarter, then I would be able to drop out of school right now, because I'd already be 16. But instead I have to stay in school until the third quarter and receive 11 years of schooling instead of 10.5 years of schooling!" These people are the equivalent of the LATE-compliers (they don't have to actually think in this manner though!). In contrast, however, for the vast majority of people the instrument (quarter of birth) has no effect on how long they stay in school, because they plan to stay in school long past the age at which they can legally drop out. They are the equivalent of the always-takers.[8]
Since IV estimates the causal effect of d_i (schooling) on y_i (wages) for the people for whom the instrument z_i (quarter of birth) changes their value of d_i, the IV estimate gives us the average effect of schooling on wages for people who drop out as soon as they are no longer legally required to stay in school. So the quarter of birth instrument is really estimating the average effect of an additional year of schooling on wages for high school dropouts. Is there any reason to believe that this is the same effect of schooling that the average person would have? Probably not. On the one hand, it may overestimate the average effect of schooling if we believe that wages are a concave function of schooling, so that the return to schooling falls as you get more schooling.[9] On the other hand, it may underestimate the average effect of schooling if we believe that high school dropouts don't apply themselves in school anyway, so they don't get much out of being in school. Either way, the point is that the IV regression is estimating the average effect of schooling on wages for high school dropouts rather than for the entire population. It consistently estimates this effect, but this effect is probably different than the population average effect of schooling on wages. Thus we need to be careful about how we interpret the result. Finally, note that the reason IV estimates the average effect of schooling on wages for high school dropouts is not because our sample only consists of high school dropouts. The sample is taken from the entire population, but IV only estimates the average effect of d_i on y_i for the LATE-compliers (i.e., the high school dropouts), not the average effect for the entire population. However, if our policy interest pertains to students at risk of dropping out, the average effect for LATE-compliers may be very informative.

[8] The never-takers would be the ones that disregard the law entirely and drop out of school long before they are legally allowed to.

[9] This phenomenon has been referred to as discount rate bias because a simple human capital model implies that an individual should stay in school until her return to schooling equals her discount rate. Students that drop out early do so because they have higher discount rates, and their marginal return to schooling is higher. However, the term discount rate bias is somewhat deceptive in the sense that it's not really an issue of bias but rather an issue of heterogeneous treatment effects and external validity.
3.2 Proof

Let D_i be a binary treatment, Z_i a binary instrument, and Y_i an outcome. Let Z be an N-dimensional vector that contains the value of the instrument, Z_i, for each unit in the data set, and D be a similar vector for the treatment variable. We define the potential outcome Y_i(Z, D) as the potential outcome for unit i under a given vector of values for the instrument and a given vector of values for the treatment. Since D_i is assumed to be affected by Z_i, we also define the potential outcome D_i(Z) as the potential outcome for the treatment under a given vector of values for the instrument.
As discussed above, there are four types of individuals: always-takers, never-takers, LATE-compliers, and LATE-defiers. Table 1 presents each type using the potential outcomes notation. D_i(Z_i) is constant for the never-takers and always-takers: the instrument doesn't affect their choice to get treated or not get treated. D_i(Z_i) changes positively with Z_i for the LATE-compliers: they comply with their intention-to-treat assignment. D_i(Z_i) changes negatively with Z_i for the LATE-defiers: they defy their intention-to-treat assignment.

Table 1: Types of Individuals by D_i(0) and D_i(1)

                  D_i(0) = 0       D_i(0) = 1
  D_i(1) = 0      Never-taker      LATE-defier
  D_i(1) = 1      LATE-complier    Always-taker
We need to make several assumptions before proceeding. First, take as given the Stable Unit Treatment Value Assumption (SUTVA):

1. If Z_i = Z'_i, then D_i(Z) = D_i(Z').
2. If Z_i = Z'_i and D_i = D'_i, then Y_i(Z, D) = Y_i(Z', D').

As we discussed earlier, SUTVA basically amounts to assuming that the treatment is well-defined and that there is no interference between units. The causal effect of Z_i on D_i is D_i(1) - D_i(0). The causal effect of Z_i on Y_i is Y_i(1, D_i(1)) - Y_i(0, D_i(0)).
We assume that Z_i is randomly assigned. We also assume that the exclusion restriction holds, i.e.,

Y(Z, D) = Y(Z', D)   for all Z, Z', D

In other words, for a given value of D_i, it doesn't matter what the value of Z_i is; the instrument only matters insofar as it affects the treatment. Given the exclusion restriction, we can then define the causal effect of D_i on Y_i as Y_i(1) - Y_i(0) (we no longer need to include Z_i as an argument in Y_i because it is irrelevant conditional on D_i). This is the causal effect of interest.
We assume that the instrument has a nonzero effect on the treatment:

E[D_i(1) - D_i(0)] ≠ 0

This is equivalent to our normal IV assumption that the instrument is correlated with the treatment. Finally, we impose a monotonicity assumption stating that the instrument does not change treatment status in opposite directions for different units:

D_i(1) ≥ D_i(0)   for all i = 1, ..., N

The monotonicity assumption rules out the possibility of LATE-defiers, i.e., individuals that change from D_i = 1 (treated) to D_i = 0 (untreated) when their instrument changes from Z_i = 0 (intended to not treat) to Z_i = 1 (intended to treat).
Under these assumptions (which are basically the typical IV assumptions, except for the monotonicity assumption), what does IV estimate? To answer this question, we leverage the fact that the IV estimator can be written as a ratio of the reduced form over the first stage. First, consider the causal effect of Z on Y for unit i:

Y_i(1, D_i(1)) - Y_i(0, D_i(0))
  = [Y_i(1)D_i(1) + Y_i(0)(1 - D_i(1))] - [Y_i(1)D_i(0) + Y_i(0)(1 - D_i(0))]
  = (Y_i(1) - Y_i(0))(D_i(1) - D_i(0))
If the equality of the first and second lines is not clear, recall that in our first lectures we defined Y_i = Y_i(1)D_i + Y_i(0)(1 - D_i). We are doing exactly the same thing here when moving from the first line to the second line. The only difference is that we now have an additional layer of complexity from introducing the instrument Z_i, as D_i is now a function of Z_i.
The result above implies that the causal effect of Z on Y for unit i equals (Y_i(1) - Y_i(0))(D_i(1) - D_i(0)). This formulation is convenient in part because it implies that Z has no causal effect on Y for the always-takers and the never-takers. For these two groups, D_i(1) = D_i(0), so (Y_i(1) - Y_i(0))(D_i(1) - D_i(0)) = 0. This confirms my earlier claim that [the always-takers and never-takers] never contribute anything to moving the reduced form away from zero.
Table 2: Causal Effect of Z on Y by D_i(0) and D_i(1)

                  D_i(0) = 0                                D_i(0) = 1
  D_i(1) = 0      Y_i(1,0) - Y_i(0,0) = 0                   Y_i(1,0) - Y_i(0,1) = -(Y_i(1) - Y_i(0))
                  (Never-taker)                             (LATE-defier)
  D_i(1) = 1      Y_i(1,1) - Y_i(0,0) = Y_i(1) - Y_i(0)     Y_i(1,1) - Y_i(0,1) = 0
                  (LATE-complier)                           (Always-taker)

Source: Angrist, Imbens, and Rubin (1996).
Table 2 summarizes the causal effect of Z on Y for each type. For always-takers and never-takers, there is no effect of Z on Y, as argued above. For LATE-compliers, the effect of Z on Y is Y_i(1) - Y_i(0), i.e., the difference in their potential outcomes under D_i = 1 and D_i = 0. For LATE-defiers, the effect of Z on Y is Y_i(0) - Y_i(1), i.e., the difference in their potential outcomes under D_i = 0 and D_i = 1. It is the opposite of the effect for the LATE-compliers because the instrument (Z) affects the treatment (D) in the opposite manner for defiers vis-a-vis compliers.
What is the average causal effect of Z on Y?

E[Y_i(1, D_i(1)) - Y_i(0, D_i(0))]
  = E[(Y_i(1) - Y_i(0))(D_i(1) - D_i(0))]
  = E[ E[(Y_i(1) - Y_i(0))(D_i(1) - D_i(0)) | D_i(1) - D_i(0)] ]
  = E[ (D_i(1) - D_i(0)) E[Y_i(1) - Y_i(0) | D_i(1) - D_i(0)] ]
  = 1 × E[(Y_i(1) - Y_i(0)) | D_i(1) - D_i(0) = 1] × P(D_i(1) - D_i(0) = 1)
    - 1 × E[(Y_i(1) - Y_i(0)) | D_i(1) - D_i(0) = -1] × P(D_i(1) - D_i(0) = -1)
  = E[(Y_i(1) - Y_i(0)) | D_i(1) - D_i(0) = 1] × P(D_i(1) - D_i(0) = 1)

The equality between the fourth and fifth/sixth lines holds because we do not have to consider individuals for which D_i(1) - D_i(0) = 0 (i.e., the never-takers and always-takers). The last equality holds because we rule out LATE-defiers by assumption, i.e., we assume that P(D_i(1) - D_i(0) = -1) = 0. Thus we conclude that the average causal effect of Z on Y is E[(Y_i(1) - Y_i(0)) | D_i(1) - D_i(0) = 1] × P(D_i(1) - D_i(0) = 1).
We know the IV estimator is equal to the reduced form divided by the first stage, so its limit must equal the ratio of the limits of those two estimators. The limit of the reduced form is the average causal effect of Z on Y, or E[(Y_i(1) - Y_i(0)) | D_i(1) - D_i(0) = 1] × P(D_i(1) - D_i(0) = 1). The limit of the first stage is the average causal effect of Z on D, or E[D_i(1) - D_i(0)] = P(D_i(1) - D_i(0) = 1). Thus the IV estimand is:

( E[(Y_i(1) - Y_i(0)) | D_i(1) - D_i(0) = 1] × P(D_i(1) - D_i(0) = 1) ) / P(D_i(1) - D_i(0) = 1)
  = E[(Y_i(1) - Y_i(0)) | D_i(1) - D_i(0) = 1]

But D_i(1) - D_i(0) = 1 if and only if an individual is a LATE-complier. Thus IV estimates the average effect of D on Y for LATE-compliers.
3.3 The Monotonicity Assumption

All of the assumptions we made above are standard textbook IV assumptions with the exception of the monotonicity assumption, i.e., the assumption that Z only changes D in one direction (or not at all). What happens if the monotonicity assumption is not met? In that case, we cannot drop out the last term in our derivation of the causal effect of Z on Y; the reduced form becomes β_c × P(complier) - β_d × P(defier), where β_c and β_d are the average treatment effects of D on Y for compliers and defiers respectively. The first stage becomes E[D_i(1) - D_i(0)] = 1 × P(complier) - 1 × P(defier). Thus the IV estimand is:

[β_c × P(complier) - β_d × P(defier)] / [P(complier) - P(defier)]

This looks like a simple weighted average; the danger, however, is that the weight for defiers can take on negative values. There is thus no guarantee that the IV estimand need lie between β_c and β_d.[10] So in general it's probably best if you can make a case that the monotonicity assumption holds.

[10] Consider, for example, a case in which β_c = 3, β_d = 1, P(complier) = 2/3, and P(defier) = 1/3.
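Working through the footnote's hypothetical numbers makes the point concrete; the short check below just plugs them into the estimand above.

```python
# Footnote 10's hypothetical numbers: beta_c = 3, beta_d = 1,
# P(complier) = 2/3, P(defier) = 1/3.
beta_c, beta_d = 3.0, 1.0
p_c, p_d = 2 / 3, 1 / 3

iv_estimand = (beta_c * p_c - beta_d * p_d) / (p_c - p_d)
print(iv_estimand)  # 5.0, which lies above both beta_c and beta_d
```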
3.4 Multi-valued Treatments and Instruments

Angrist and Imbens (1995) discuss cases in which the treatment or instrument is not binary. We deal first with the case in which the treatment is not binary. Suppose the treatment, D_i, takes on J + 1 values (J > 1). In that case, we can write the potential outcome Y_i(D_i) as a quantity that has a different value for each value of D_i: Y_i(0), Y_i(1), ..., Y_i(J). Note that the notation is identical to the notation we use when D_i is binary, except that now Y_i(D_i) can take on J + 1 different values instead of just 2 different values. Our instrument is still binary, however, so D_i(Z_i) still has only two possible values: D_i(0) or D_i(1).
There are now J different causal effects of D on Y. There is the causal effect of changing D_i from 0 to 1, the causal effect of changing D_i from 1 to 2, the causal effect of changing D_i from 2 to 3, and so on, through the causal effect of changing D_i from J - 1 to J. In fact, we could imagine even more causal effects by changing D_i by more than one unit (e.g., the effect of changing D_i from 0 to 5), but these additional effects will just be sums of the J causal effects that we already defined.
With a binary instrument and a multi-valued treatment, the IV estimator converges to:

( E[Y_i | Z_i = 1] - E[Y_i | Z_i = 0] ) / ( E[D_i | Z_i = 1] - E[D_i | Z_i = 0] )
  = Σ_{j=1}^{J} w_j E[Y_i(j) - Y_i(j-1) | D_i(1) ≥ j > D_i(0)]

where the weights in the sum are equal to w_j = P(D_i(1) ≥ j > D_i(0)) / Σ_{l=1}^{J} P(D_i(1) ≥ l > D_i(0)).
To interpret the expression above, it's easiest to assume that the instrument induces no more than a single unit change in D_i for any individual i.[11] In that case, for any individual, D_i(1) - D_i(0) is equal to zero or one. If D_i(1) - D_i(0) = 0, then an individual is a noncomplier, and he contributes nothing to the IV estimand. Indeed, you can see that the conditional expectation in the sum is conditioned on D_i(1) being greater than D_i(0). If D_i(1) - D_i(0) = 1, then the individual is a complier, but there are now J different types of compliers. There are compliers that move from D_i(0) = 0 to D_i(1) = 1, compliers that move from D_i(0) = 1 to D_i(1) = 2, and so on. Let us call a complier that moves from D_i(0) = j - 1 to D_i(1) = j a complier of type j. The sum above then takes a weighted average of the average treatment effects for each type of complier. When j = 1, the conditional expectation in the sum is the average treatment effect for individuals for whom D_i = 1 when assigned to the treatment group and D_i = 0 when assigned to the control group. When j = 2, the conditional expectation in the sum is the average treatment effect for individuals for whom D_i = 2 when assigned to the treatment group and D_i = 1 when assigned to the control group. And so on. The weights, w_j, are equal to the share of compliers that are of type j. IV therefore estimates a weighted average of J local average treatment effects (one local average treatment effect for each complier type), with weights equal to the share of compliers that are of type j. Angrist and Imbens describe IV as estimating a weighted average of per-unit average causal effects along the length of an appropriately defined causal response function. They refer to this quantity as the average causal response (ACR).

[11] It's not a problem if the instrument induces a multi-unit change in D_i; it just makes the expression more complicated to interpret because a single individual can now appear in the sum for multiple values of j.

In other words, changing the treatment by one unit has different average effects at different values of the treatment. IV estimates a weighted average of these different effects, and each effect is weighted by its share of the compliers. In the quarter-of-birth schooling example, IV estimates an average effect of moving from 10th to 11th grade or 11th to 12th grade for high school dropouts that comply with compulsory schooling laws. Although the causal effect of schooling on earnings varies from first grade to graduate study, all of the compliers in the quarter-of-birth example are individuals getting between 10 and 12 years of schooling (not counting kindergarten). Hence the weights in the sum above are zero for all j except j = 11 and j = 12.
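The weights w_j in the expression above can also be computed directly in a simulation. The data-generating process below is entirely hypothetical (it just mimics the 10-to-12-years-of-schooling story, with constant per-grade effects) and is only meant to show how the w_j are formed and how they reproduce the IV estimand.

```python
import numpy as np

# Hypothetical multi-valued treatment: 10, 11, or 12 years of schooling,
# with a binary instrument that pushes some students up by one year.
rng = np.random.default_rng(2)
n = 200_000

d0 = rng.choice([10, 11, 12], size=n, p=[0.3, 0.3, 0.4])  # schooling if Z = 0
bump = rng.random(n) < 0.2                                 # would-be compliers
d1 = np.minimum(d0 + bump, 12)                             # schooling if Z = 1
z = rng.integers(0, 2, size=n)
d = np.where(z == 1, d1, d0)

# Hypothetical per-grade effects on log wages: 0.12 for the 11th year, 0.06 for the 12th.
step_effect = {11: 0.12, 12: 0.06}
y = sum(step_effect[j] * (d >= j) for j in (11, 12)) + rng.normal(scale=0.1, size=n)

iv = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())

# ACR weights: w_j is proportional to P(D_i(1) >= j > D_i(0)).
p_cross = {j: np.mean((d1 >= j) & (d0 < j)) for j in (11, 12)}
w = {j: p_cross[j] / sum(p_cross.values()) for j in p_cross}
acr = sum(w[j] * step_effect[j] for j in (11, 12))
print(round(iv, 3), round(acr, 3))   # the two should be close
```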
Now consider the case in which the instrument, Z_i, can take on K + 1 distinct values. There are now K distinct average causal responses (ACRs), one for each point at which the instrument changes. The ACR at point Z_i = k is:

β_{k,k-1} = ( E[Y_i | Z_i = k] - E[Y_i | Z_i = k-1] ) / ( E[D_i | Z_i = k] - E[D_i | Z_i = k-1] )

In other words, we can think about changing the instrument Z_i by one unit at each point k, from k - 1 to k. Changing the instrument by one unit at point k allows us to identify an average causal response at point k, and we label that ACR as β_{k,k-1}. You could alternatively think of each point k as representing a separate binary instrument (in fact, that is how Angrist and Imbens motivate this formula). The IV estimate is then a weighted average of K ACRs. Specifically, the IV estimate converges to:

Σ_{k=1}^{K} λ_k β_{k,k-1}

with weights λ_k defined as:

λ_k = ( E[D_i | Z_i = k] - E[D_i | Z_i = k-1] ) × ( E[D_i | Z_i ≥ k] - E[D_i | Z_i < k] ) × P(Z_i ≥ k) × (1 - P(Z_i ≥ k))
The first part of the expression for λ_k, E[D_i | Z_i = k] - E[D_i | Z_i = k-1], implies that points along the instrument that induce larger changes in D_i (i.e., points that have a stronger first stage) will receive more weight. This should be intuitive since a stronger first stage should give us more leverage in identifying the effect of D_i on Y_i. The second part of the expression for λ_k implies that points of Z_i near the median of Z_i receive more weight, since P(Z_i ≥ k)(1 - P(Z_i ≥ k)) is maximized at the median of Z_i (points that split Z_i such that the average value of D_i is much larger in the upper part of Z_i than in the lower part of Z_i also receive more weight).

In summation, when the instrument Z_i can take on K + 1 distinct values, IV estimates a weighted average of the K average causal responses that correspond to increasing the instrument by one unit at points 1, 2, 3, ..., K. The weight used for the kth ACR is proportional to the strength of the first stage at Z_i = k.
3.5 Summary
We have seen in this section that IV estimates the local average treatment effect, or LATE. This is the average treatment effect for units that are induced by the instrument to change their treatment status. The clear application of this finding is that it allows us to think more precisely about which group of individuals our treatment effect estimate applies to. There are, however, other important implications.

Most importantly, the LATE result implies that, in the presence of treatment effect heterogeneity, different instruments should produce different estimates, even in arbitrarily large samples. The choice of instrument defines the group of LATE-compliers; different instruments therefore estimate the average treatment effect for different groups of LATE-compliers. There is no reason why these averages need be equal for different groups.

The fact that different instruments can produce different treatment effect estimates (even absent sampling error) calls into question the general utility of overidentification tests. These tests compare coefficient estimates produced by different instruments; the idea is that if the instruments are all valid, all the estimates should be equal (up to sampling error). If some instruments are invalid, however, the estimates produced by different instruments may differ. In the context of heterogeneous treatment effects, however, we know that different instruments can produce different coefficient estimates even if all of the instruments are internally valid. Thus it is impossible to ever reject the validity of the instruments, making the overidentification tests scientifically questionable. The same critique holds for the Hausman test, which compares the IV estimate to the OLS estimate. With heterogeneous treatment effects, there is no reason that OLS (which, under ideal conditions, will estimate ATE or TOT) need equal IV (which estimates LATE).
Heterogeneous treatment effects also complicate matters when you have multiple endogenous variables that you want to instrument for. Consider, for example, a simple case in which you wish to simultaneously estimate the effect of education (d_1) and experience (d_2) on earnings (y). The model might look like:

y_i = β_0 + β_1 d_{1i} + β_2 d_{2i} + ε_i

Both treatments are subject to selection issues and are endogenously determined. Instrumenting for education and controlling for experience as a covariate will not give consistent estimates of the effect of education on earnings: it is inappropriate to control for a variable that is affected by the treatment (in general, getting more education will mean getting less job experience). The correct way to estimate the causal effect of education on earnings is to instrument for education and include as covariates only predetermined variables.
If, however, you want to estimate a structural model that contains both education and experience, i.e., you want to know the effect of education when holding experience constant (even though we may not be able to imagine such a scenario in real life), you might find two instruments, one for education (call it z_1) and one for experience (call it z_2).[12] You can then identify β_1 and β_2 by running 2SLS, using both z_1 and z_2 as instruments. Intuitively, 2SLS is using z_2 to estimate β_2, and then using this estimate of β_2 to adjust for the fact that z_1 affects both d_1 and d_2 when estimating β_1 (i.e., the effect of education on earnings holding experience constant).

With homogeneous treatment effects, this strategy is valid. With heterogeneous treatment effects, however, we know that different instruments generally estimate different local average treatment effects. Assuming that z_1 estimates the same treatment effect for d_2 that z_2 estimates is therefore unjustified.[13] In principle, the effect of manipulating education while holding experience constant could be positive for all individuals, yet the 2SLS procedure could generate a negative estimate of β_1 (even ignoring sampling error).

[12] Of course, the education instrument will invariably affect experience. In principle, however, the experience instrument need not affect education.

[13] This is equivalent to assuming that z_1 and z_2 should both produce identical estimates of β_2.
4 Weak Instruments

Weak instruments, that is to say, instruments that are only weakly correlated with the treatment of interest, pose a special set of problems. First, and most importantly, a weak first stage implies that any bias in the reduced form will be amplified in the IV estimate. This is true regardless of the number of instruments one uses. When using many weak instruments, however, a finite sample issue arises and 2SLS becomes biased towards the OLS estimate (conventional standard errors are also inaccurate). Though these issues have been known to some degree for several decades, they were brought to the attention of applied researchers by Bound, Jaeger, and Baker (1995) (henceforth BJB 1995).
4.1 Omitted Variables Bias

Consider a case with a single endogenous variable, d_i, one or more instruments, z_i, and no covariates.[14] We are interested in the causal relationship between d_i and y_i, summarized as

y_i = α + β d_i + ε_i

We have an instrument z_i that we use to predict d_i:

d_i = z_i π + u_i
Consider the consistency of β̂_OLS and β̂_2SLS. For OLS,

plim β̂_OLS = Cov(d_i, y_i) / Var(d_i) = Cov(d_i, β d_i + ε_i) / Var(d_i) = β + σ_{dε} / σ_{dd}

The plim for 2SLS relies on the fact that d̂_i plims to z_i π,

plim β̂_2SLS = Cov(d̂_i, y_i) / Var(d̂_i) = Cov(d̂_i, β d_i + ε_i) / Var(d̂_i) = Cov(z_i π, β(z_i π + u_i) + ε_i) / Var(d̂_i) = β + σ_{εd̂} / σ_{d̂d̂}

[14] As per BJB 1995, the core results remain unchanged by the addition of covariates.
If there is zero covariance between d and ε, then OLS will consistently estimate β. If there is zero covariance between z and ε, then 2SLS will consistently estimate β (note that 2SLS is never unbiased because it is a ratio of two random variables). What happens when these covariances are nonzero, however? Under what conditions will one estimator be more or less inconsistent than the other?

The ratio of the inconsistency in the IV estimator to the inconsistency in the OLS estimator is:

(σ_{εd̂} / σ_{d̂d̂}) / (σ_{εd} / σ_{dd}) = (σ_{εd̂} / σ_{εd}) × (1 / R²_FS)

R²_FS is the R² of the first stage; the equality holds because R²_FS = SSR/SST = σ_{d̂d̂} / σ_{dd}. If we had covariates in the model, the R²_FS term would be the partial R² from the first stage, i.e., the R² from running d_i on z_i after the covariates have been partialled out from both.[15]
From the result above, we see that the relative inconsistency of IV vis-a-vis OLS depends on two quantities. First, it depends on the covariance of d̂_i and ε_i relative to the covariance of d_i and ε_i. If the covariance of the error term and d̂_i increases (relative to the covariance of the error term and d_i), then the inconsistency of IV increases; this is quite intuitive. More interestingly, the relative inconsistency of IV also depends on the inverse of the R² (or partial R², if you have covariates) of the first stage. Thus, if the first stage is weak (i.e., low R²), any violation of the exclusion restriction will be amplified, and IV can become very inconsistent. A first stage (partial) R² of 0.1, for example, will inflate the ratio σ_{εd̂}/σ_{εd} by a factor of 10. Except that things aren't quite that simple.

[15] With covariates in the model, the σ_{εd̂} and σ_{εd} terms are also calculated after the covariates have been partialled out from d_i and z_i.

The complication is that σ_{εd̂} is itself affected by the strength of the first stage. If the first stage is weak, then by definition the variance of d̂ will be relatively low, and so the covariance σ_{εd̂} will tend to be low as well. For tractability, and because it covers the preponderance of meaningful cases, suppose that z_i contains only one instrument. In that case:

(σ_{εd̂} / σ_{εd}) × (1 / R²_FS) = [Cov(z, ε) / Cov(d, ε)] × [Var(d) / (π Var(z))] = (ρ_{εz} / ρ_{εd}) × (1 / R_FS)

The last expression is more useful in the sense that it is expressed in terms that do not depend on the units of measurement for any of the variables in question. The second term, 1/R_FS, confirms that a weak first stage does exacerbate the relative inconsistency of IV vis-a-vis OLS, but the degree of bias is not as strong as originally implied. With a first stage (partial) R² of 0.1, for example, IV will be less inconsistent than OLS as long as the correlation between the instrument, z_i, and the error term, ε_i, is approximately three times less than the correlation between d_i and ε_i. With a first stage (partial) R² of 0.01, however, the correlation between z_i and ε_i needs to be ten times less than the correlation between d_i and ε_i in order for IV to be preferable to OLS.

So, if the first stage is relatively weak, then you should think carefully about whether your exclusion restriction (Cov(z, ε) = 0) holds. Even a modest correlation between the instrument and the structural error term can make IV highly inconsistent if the first stage (partial) R² is low. This is true regardless of whether you have one instrument or many instruments.
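The amplification is easy to see in a small simulation. The data-generating process below is hypothetical and only illustrates the (ρ_{εz}/ρ_{εd}) × (1/R_FS) formula; the specific coefficients were chosen so that the first-stage R² is around 0.01.

```python
import numpy as np

# Hypothetical DGP: d is endogenous, the instrument is only slightly invalid
# (corr(z, eps) around 0.05), but the first stage is weak (R^2 around 0.01).
rng = np.random.default_rng(3)
n = 1_000_000

eps = rng.normal(size=n)                        # structural error
z = 0.05 * eps + rng.normal(size=n)             # mild exclusion-restriction violation
d = 0.1 * z + 0.5 * eps + rng.normal(size=n)    # weak first stage, endogenous d
y = 1.0 * d + eps                               # true beta = 1

beta_ols = np.cov(d, y)[0, 1] / np.cov(d, y)[0, 0]
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]
print("OLS bias:", beta_ols - 1.0)
print("IV bias: ", beta_iv - 1.0)
# Both biases come out around 0.4: the tiny correlation between z and eps is
# amplified by roughly 1/R_FS (about 10 here), as in the formula above.
```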
BJB 1995 analyze the potential for omitted variables bias in Angrist and Krueger (1991) using the just-identified case. Quarter of birth is parameterized as a single indicator variable that equals zero if an individual is born in the first quarter and unity if an individual is born in the second through fourth quarters. With this parameterization, Angrist and Krueger report a first stage coefficient of 0.1: people born in the first quarter have 0.1 years less education than those born in the second through fourth quarters. This is a fairly small effect, but the coefficient is highly significant since the sample numbers in the hundreds of thousands.

BJB note that the difference in mean log per capita family income for young children born in the second through fourth quarters versus those born in the first quarter is 0.024: families of children born in the first quarter have per capita income that is about 2.4% lower than families of children born in the second through fourth quarters. Using an intergenerational correlation coefficient of 0.4 (the standard in the literature at that time; now it is estimated to be even higher), BJB infer that omitted factors might lead to a difference in mean log income of 0.01 between individuals born in the second through fourth quarters and individuals born in the first quarter. Though this differential is quite small, it is important to remember that the first stage is also very small, with a coefficient of 0.1. Thus the bias in the IV estimate will be 10 times the bias in the reduced form estimate: a reduced form bias of 0.01 translates to an IV bias of 0.10. Interestingly, this is very close to the return to education that Angrist and Krueger estimate using the quarter of birth instrument. I am not claiming that their estimate is necessarily wrong, but the relatively weak first stage does mean that the quarter of birth design is not quite as clean as it first appears.
4.1.1 Testing for Covariate Balance

In many IV applications, it is informative to test for balance of covariates across the instrument. This is similar to testing for covariate balance when stratifying on the propensity score: the idea is that if observable factors determining Y_i are balanced across the instrument, then unobservable factors are also likely to be balanced. In terms of testing statistical significance, it is fine to simply regress each covariate on the instrument and examine the significance of the coefficient on Z_i; we are just estimating the reduced form relationship between Z_i and each covariate here. If anything, this will yield a conservative test (in that it assumes the first stage is relatively strong). However, when interpreting the magnitude of this reduced form coefficient, it is important to keep in mind the weak instrument results above. Even a small bias in the reduced form can translate into a large bias in the IV.
4.2 Finite Sample Bias

The first issue, that a weak first stage can amplify any correlation between the instrument and the structural equation error term, is important. Nevertheless, for reasons that are unclear, much of the focus in the last decade regarding weak instruments has pertained to the second issue that BJB 1995 raise: finite sample bias.[16] This issue is, in my opinion, somewhat overblown, but it is important in a subset of cases, so you should be aware of it.

Recall that I said that IV/2SLS is consistent but not unbiased. This occurs because the IV estimator is a ratio of two random variables (the reduced form and the first stage), so we cannot compute the expectation (in fact, there is no guarantee that it even exists). For an arbitrarily large sample, the bias of IV disappears, but of course no real sample is arbitrarily large. The problem is that the first stage is estimated with error: if we knew the true value of the first stage coefficient(s), we could plug these values in and IV/2SLS would be unbiased.

To fix ideas, consider a case in which there is no population first stage: the instruments have zero effect on d. We know that IV partitions the variation in d into two components: the variation in d induced by z and the complement of that variation. In this case, however, because z has no effect on d, the distinction between the two components is entirely arbitrary. In the population, the first stage will be zero, and the component of d that is correlated with z will contain nothing. In any finite sample, however, the first stage will not be zero.

If there is only one (weak) instrument, the IV estimate will be highly unstable. The IV coefficient is the ratio of the reduced form coefficient over the first stage coefficient. Since the first stage coefficient is centered at zero, the IV coefficient can easily realize very large positive or negative values; its distribution may be approximated by a Cauchy distribution. However, unless d is endogenous to a degree that is uncommon in empirical research, there is little chance that finite sample bias will be an issue with just one (or a small number of) instrument(s). This is because the IV standard errors will be very large, and the researcher will correctly conclude that it is not possible to conduct precise statistical inference.

With many weak instruments, finite sample bias becomes problematic. With a large number of instruments, the amount of variation in d that d̂ captures becomes nontrivial; we overfit the first stage, so to speak. If the instruments have no effect on d, however, the partitioning of d into the component determined by z and its complement is meaningless: the variation in d that we think is caused by z is no different than the remaining variation in d that we throw away. It is thus unsurprising that with a large number of weak instruments, β̂_2SLS becomes biased towards β̂_OLS.[17] To complicate matters, the 2SLS standard errors become biased downwards as well: β̂_2SLS is not as precisely estimated as it appears to be.

[16] I suspect the relative focus on the second issue rather than the first is due in part to the fact that the first issue is fairly straightforward: there is not much else to be said.
Angrist and Krueger (1995) demonstrate the many weak instruments problem by generating 180 instruments ((3 quarter of birth dummies × 10 year of birth dummies = 30 dummies) plus (3 quarter of birth dummies × 50 state of birth dummies = 150 dummies)) and then replacing the actual quarter of birth with random draws from a discrete uniform distribution with four points of support. Recall that the IV estimate for Angrist and Krueger (1991) is approximately 0.10 (this is using a small number of instruments) while the OLS estimate is approximately 0.07. The IV estimate produced by the large number of randomly generated instruments is 0.06 with a standard error of 0.014, very close to the OLS estimate and statistically significant. An inattentive researcher might mistakenly believe that this estimate is informative.
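In the same spirit, a quick simulation with purely random instruments (a hypothetical DGP, not Angrist and Krueger's data) shows 2SLS drifting toward OLS as the number of junk instruments grows:

```python
import numpy as np

# Hypothetical DGP: d is endogenous and none of the K instruments has any
# true effect on d, yet 2SLS with large K is pulled toward the OLS estimate.
rng = np.random.default_rng(4)
n, beta = 2_000, 1.0

eps = rng.normal(size=n)
d = 0.8 * eps + rng.normal(size=n)    # endogenous treatment
y = beta * d + eps

def tsls(y, d, Z):
    """Manual 2SLS using the (irrelevant) instruments in Z plus a constant."""
    X = np.column_stack([np.ones(len(d)), Z])
    d_hat = X @ np.linalg.lstsq(X, d, rcond=None)[0]      # overfit first stage
    W = np.column_stack([np.ones(len(d)), d_hat])
    return np.linalg.lstsq(W, y, rcond=None)[0][1]

ols = np.cov(d, y)[0, 1] / np.cov(d, y)[0, 0]
for K in (1, 10, 100, 500):
    Z = rng.normal(size=(n, K))
    print(K, round(tsls(y, d, Z), 3), "vs OLS", round(ols, 3))
# As K grows, the 2SLS estimate approaches the (inconsistent) OLS estimate.
```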
What can be done to address the finite sample bias issue when working with many weak instruments?[18] The simplest solution is to simply reduce the number of instruments being used. Since it is difficult to find one good instrument, let alone many good instruments, the many weak instruments issue often occurs when the researcher interacts the primary instrument with a number of other covariates. In these cases it is straightforward to eliminate the interaction terms: Angrist and Krueger (1991), for example, parameterize the QOB instrument as a single variable (first quarter versus second through fourth quarters) and as 180 different variables.

[17] To take an extreme case, if you had as many instruments as observations, you could fit d perfectly, and β̂_2SLS = β̂_OLS. Things become problematic long before that happens, however.

[18] It is not exactly clear what constitutes a weak first stage. Staiger and Stock (1997) recommend caution when dealing with first stage F-statistics of less than 10.

If it is not possible to reduce the number of instruments, there are two issues to be addressed.
First, the bias towards the OLS estimate must be corrected. Second, the standard errors must be corrected. An easy way to address the first issue is to use the Limited Information Maximum Likelihood estimator (LIML). LIML is derived by assuming that the residuals in the structural and first stage equations are normally distributed and then estimating β and π via maximum likelihood methods.[19] Angrist and Krueger (2001) note that LIML is approximately unbiased in that the median of its sampling distribution is often close to the parameter being estimated.[20] Although this does not completely eliminate finite sample bias, LIML generally performs better than 2SLS in cases with many instruments. Conventional LIML standard errors can still be too small, however. Imbens suggests implementing a correction derived in Bekker (1994). Multiply the conventional LIML standard errors by:

1 + [ (K/N) / (1 - K/N) ] × ( Σ_{i=1}^{N} (z_i π)² / N )^{-1} × ( (1, λ_1) Ω (1, λ_1)′ )

where K is the number of instruments, N is the sample size, z_i is a 1 × K row vector of demeaned instruments, π is a K × 1 column vector of first stage coefficients, λ_1 is the coefficient on d in the structural equation, and Ω is the variance-covariance matrix of the reduced form residual and the first stage residual (the variance of the reduced form residual, v_i, is the upper left element, and the variance of the first stage residual, u_i, is the bottom right element). In practice, we replace these population coefficients and moments with their sample counterparts. λ̂_1 comes from the LIML estimate of λ_1, and Ω̂, v̂_i, and û_i come from OLS estimates of the first stage and reduced form equations.
Intuitively, the standard error adjustment increases in K because additional instruments make it more likely that we will overfit the first stage. It decreases in the regression sum of squares for the first stage (the second term above is the inverse of that sum of squares), because a larger regression sum of squares implies a stronger set of instruments. Finally, it increases in (the magnitude of) the covariance between the reduced form and first stage residuals, because that covariance determines the degree of endogeneity (recall that the first stage residual is the potentially endogenous part of d, while the reduced form residual is the unexplained part of y, so a high covariance between the two components implies a high degree of endogeneity).

[19] In just-identified cases, LIML is numerically identical to 2SLS/IV.

[20] An alternative is to use a split-sample IV estimator that estimates the first stage on a different sample than the second stage. In practice, you would estimate the vector of first stage coefficients, π, using the first sample. Then you would apply this coefficient vector to the instruments in the second sample to construct the fitted value of the treatment variable. This solves the first stage over-fitting problem by essentially forcing the first stage to do an out-of-sample forecast. In practice, however, it's often easier to estimate LIML than it is to estimate some of the SSIV estimators, and there doesn't appear to be a strong consensus among econometricians in favor of the latter over the former.
5 Additional References

Angrist, J. and G. Imbens. "Two-Stage Least Squares Estimation of Average Causal Effects in Models with Variable Treatment Intensity." Journal of the American Statistical Association, 1995, 90, 431-442.

Angrist, J. and A. Krueger. "Split-Sample Instrumental Variables Estimates of the Return to Schooling." Journal of Business and Economic Statistics, 1995, 33, 225-235.

Bekker, P. "Alternative Approximations to the Distribution of Instrumental Variables Estimators." Econometrica, 1994, 62, 657-681.

Staiger, D. and J. Stock. "Instrumental Variables Regression with Weak Instruments." Econometrica, 1997, 65, 557-586.
ARE 213 Applied Econometrics
UC Berkeley Department of Agricultural and Resource Economics

Selection On Unobservables Designs: Part 4, Regression Discontinuity Designs[1]

Regression Discontinuity (RD) designs go back in the evaluation literature at least as far as Thistlethwaite and Campbell (1960). Only in the past decade, however, has RD become popular in economics. Most RD designs are basically special cases of IV. Nevertheless, they are probably my favorite selection on unobservables design because the identification (by which I mean the source of variation in the treatment that we are using to identify the treatment effect) is so transparent.
[1] These notes are heavily derived from Imbens and Lemieux (2008).

1 Introduction

1.1 Background

Suppose that we want to estimate the effect of some binary treatment D_i on an outcome Y_i. Using the potential outcomes framework, we write Y_i(0) as the potential untreated outcome and Y_i(1) as the potential treated outcome; Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0). Now suppose that the value of D_i (i.e., whether or not an individual gets treated) is completely or partially determined by whether some predictor X_i lies above or below a certain threshold, c. The predictor X_i need not be randomly assigned. In fact, we assume that it is related to the potential outcomes Y_i(0) and Y_i(1), but that this relationship is smooth, i.e., Y_i(0) and Y_i(1) do not jump discontinuously as X_i changes. Any discontinuous change in Y_i as X_i crosses c will thus be interpreted as a causal effect of the treatment D_i. We call X_i the running variable.

RD designs often arise in administrative situations in which units are assigned a program, treatment, or award based upon a numerical index being above or below a certain threshold. For example, a politician may be elected if and only if the differential between the vote share that she receives and the vote share that her opponent receives exceeds 0, a student may be assigned to summer school if and only if his performance on a combination of tests falls below a certain threshold, or a toxic waste site may receive cleanup funds if and only if its hazard rating falls above a certain level. In these cases, individuals or units whose indices X lie directly below the threshold c are considered to be comparable to individuals or units whose indices X lie directly above the threshold c, and we can estimate the treatment effect by taking a difference in mean outcomes for units directly above the threshold and units directly below the threshold.
1.2 The Sharp RD Design

There are two types of RD designs: the sharp design and the fuzzy design. In the sharp RD design (SRD), the probability that D = 1 changes from zero to one as the running variable crosses c. In other words, no one with X < c gets treated, and everyone with X ≥ c gets treated. In the fuzzy RD design, the probability of treatment jumps discontinuously as X crosses c, but it does not jump by 100 percentage points. In other words, either some people with X < c get treated, or some people with X ≥ c do not get treated, or (most likely) both. We will focus first on the sharp RD design.

In the sharp RD design, D_i is a deterministic function of X_i: D_i = 1(X_i ≥ c). To estimate the causal effect of D_i on some outcome Y_i, we simply take the difference in mean outcomes on either side of c. Formally, we estimate:

lim_{x↓c} E[Y_i | X_i = x] - lim_{x↑c} E[Y_i | X_i = x] = lim_{x↓c} E[Y_i(1) | X_i = x] - lim_{x↑c} E[Y_i(0) | X_i = x]

This represents the average causal effect of D on Y for individuals with X_i = c. We will call this effect τ_SRD.

τ_SRD = E[Y_i(1) - Y_i(0) | X_i = c]
To justify this interpretation, we need it to be true that Y_i(0) and Y_i(1) are smooth functions of X_i as X_i crosses c.[2] We make this assumption in the form of a conditional expectation.

Assumption 1: E[Y_i(0) | X_i = x] and E[Y_i(1) | X_i = x] are continuous in x.

With this assumption we can write

τ_SRD = lim_{x↓c} E[Y_i | X_i = x] - lim_{x↑c} E[Y_i | X_i = x]

and estimate τ_SRD as the difference between two regression functions estimated in the neighborhood of c.

Since we never observe Y_i(0) for units with X_i = c, we rely upon extrapolating E[Y_i(0) | X_i = c] using units with X_i arbitrarily close to c. The continuity assumption above guarantees that the bias from this extrapolation becomes negligible as we get arbitrarily close to c.
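A common way to implement this in practice is local linear regression on each side of the cutoff. The sketch below is a minimal version on simulated data; the function name, the fixed bandwidth, and the data-generating process are all hypothetical (in applied work the bandwidth would be chosen in a data-driven way).

```python
import numpy as np

def srd_estimate(y, x, c, h):
    """Sharp RD: local linear fit within bandwidth h on each side of cutoff c,
    then the difference of the two intercepts evaluated at the cutoff."""
    def side_fit(mask):
        X = np.column_stack([np.ones(mask.sum()), x[mask] - c])
        return np.linalg.lstsq(X, y[mask], rcond=None)[0][0]   # intercept at x = c
    above = (x >= c) & (x <= c + h)
    below = (x < c) & (x >= c - h)
    return side_fit(above) - side_fit(below)

# Simulated example with a true jump of 2 at c = 0.
rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=20_000)
y = 1 + 0.5 * x + 2 * (x >= 0) + rng.normal(scale=0.5, size=x.size)
print(srd_estimate(y, x, c=0.0, h=0.2))   # should be close to 2
```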
Lee (2008) is an example of the sharp RD design in practice. Lee explores the effects of incumbency using the fact that a politician is elected if and only if he receives more votes than his opponent.[3] Lee uses this fact to compare Congressional districts in which the Democrats won by a few votes in period t to districts in which the Democrats lost by a few votes in period t. He compares party success in these districts in elections held in period t + 1 to estimate the effect of incumbency on the probability of winning. He finds that party success in period t strongly affects party success in period t + 1: the probability of success in t + 1 rises by approximately 50 percentage points.

[2] Technically, it should suffice to simply have Y_i(0) be a smooth function of X_i as X_i crosses c. In that case we would estimate the average effect of the treatment on the treated at X_i = c. But it's hard to imagine an RD scenario in which Y_i(0) is smooth in the running variable at c while Y_i(1) is not smooth in the running variable at c.

[3] This may not truly be a sharp RD design if it includes data from Florida, particularly around 2000.
1.3 The Fuzzy RD Design

The fuzzy RD design (FRD) is similar in concept to the sharp RD design except that D_i is no longer a deterministic function of X_i. Instead, the probability of treatment changes by some nonzero amount as the running variable crosses the threshold c, but this change in probability is less than 100 percentage points. Formally, we write

0 < lim_{x↓c} P(D_i = 1 | X_i = x) - lim_{x↑c} P(D_i = 1 | X_i = x) < 1

This scenario is arguably more common than the sharp RD scenario in that most things in real life are determined by multiple factors, and the influence of the running variable as it crosses the threshold c may be just one of those factors. In the fuzzy RD design there are now two causal effects to be estimated: the effect of crossing the threshold on the probability of treatment and the effect of crossing the threshold on the outcome (in the sharp RD design, the former is known to be 1). Formally, the fuzzy RD estimand is

τ_FRD = [ lim_{x↓c} E[Y_i | X_i = x] - lim_{x↑c} E[Y_i | X_i = x] ] / [ lim_{x↓c} E[D_i | X_i = x] - lim_{x↑c} E[D_i | X_i = x] ]
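In code, this estimand is just the ratio of two estimated discontinuities. The two-line sketch below is hypothetical and reuses the srd_estimate() helper sketched in the sharp RD example above, applying it to the outcome and to the (0/1) treatment within the same bandwidth; it is numerically the same as IV using an above-the-cutoff indicator as the instrument in that window.

```python
def frd_estimate(y, d, x, c, h):
    """Fuzzy RD: jump in the outcome divided by jump in the treatment probability."""
    return srd_estimate(y, x, c, h) / srd_estimate(d.astype(float), x, c, h)
```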
If this estimator looks somewhat familiar, that's because it should. It's the direct analog of an IV estimator in which the instrument is an indicator for whether X_i lies directly above c. Formally, let D_i(x*) be the potential treatment status of unit i for a threshold x* in the neighborhood of c. Note that x* now represents a potential value for the threshold, not the value of the running variable for unit i. We do this because it is often easier to conceive of manipulating the threshold rather than the running variable.[4] D_i(x*) is unity if unit i would take the treatment if the threshold were x*, and zero otherwise. Thus we are considering manipulating the threshold; for example, if individuals are eligible for free health insurance at age 65, one might imagine changing the threshold for eligibility from 65 to 65.1. When the threshold changes, some people who were previously eligible would now be ineligible. In this context, we need the equivalent of the IV monotonicity assumption:

[4] For example, the running variable may be age.
Assumption 2: D_i(x*) is non-increasing in x* at x* = c.

In other words, moving a unit with X_i = c from intended-to-treat to intended-to-not-treat by increasing the threshold x* never results in that unit switching from an untreated status to a treated status (units only drop out of treatment or don't change their status as you ratchet up the threshold; they never drop in to treatment as you decrease the pool of intended-to-treat by increasing the threshold). The monotonicity assumption rules out the possibility of defiers, leaving only always-takers, never-takers, and compliers. We define these groups in a manner similar to that of AIR (1996); a complier is a unit such that

lim_{x*↓X_i} D_i(x*) = 0 and lim_{x*↑X_i} D_i(x*) = 1.

In other words, a complier is a unit that does not take the treatment when the threshold lies just above X_i and takes the treatment when the threshold lies just below X_i. Analogously, a never-taker is a unit that never takes the treatment when the threshold is in the neighborhood of X_i (regardless of whether the threshold lies just above or below X_i), and an always-taker is a unit that always takes the treatment when the threshold is in the neighborhood of X_i (regardless of whether the threshold lies just above or below X_i).
Intuitively, the fuzzy RD design measures the average treatment eect for RD compliers:

FRD
=
lim
xc
E[Y
i
|X
i
= x] lim
xc
E[Y
i
|X
i
= x]
lim
xc
E[D
i
|X
i
= x] lim
xc
E[D
i
|X
i
= x]
= E[Y
i
(1) Y
i
(0)|unit i is a complier and X
i
= c].
The logic is the same as in the IV case the outcomes and treatment statuses for the
always-takers and never-takers do not change as the running variable crosses the threshold,
so they contribute nothing to either the numerator or the denominator of the fuzzy RD
estimator. The only units that have non-zero contribution are the compliers.
M. Anderson, Lecture Notes 12, ARE 213 Fall 2012 6
DiNardo and Lee (2004) test whether unionization has a direct effect on wages using a fuzzy RD design. They leverage the fact that, in the United States, employees vote on whether they want to unionize (this election is not automatically held; typically organizers must first collect cards from a majority of employees requesting an election from the National Labor Relations Board). If a majority of employees vote in favor of unionizing, then the employer is legally required to recognize the union and bargain in good faith. One might think that crossing the threshold of 50% in the election would generate a sharp RD design, but in fact it does not. First, in a few cases a union is not recognized in spite of winning a majority of the vote because the NLRB invalidates the result. More importantly, in many close losses, union organizers try again and ultimately win a subsequent election. Thus, crossing the threshold of 50% in an initial election changes the probability of union recognition, but it does not change it from zero to one. Instead, the data suggest that the probability of eventual and lasting union recognition rises by approximately 80 percentage points. Interestingly, however, there is no discernible effect on employment, firm survival, or wages as the vote share crosses the 50% threshold. Whether this is due to threat effects (which should be particularly pronounced for compliers), union ineffectiveness, or some other mechanism is not entirely clear.

Footnote 5: In practice, many employers engage in illegal tactics to attempt to forestall elections, including firing union activists.

1.4 The FRD, Matching, and Unconfoundedness

The RD context makes an analysis based on unconfoundedness, i.e. Y_i(0), Y_i(1) ⊥ D_i | X_i, seem attractive. In particular, it is intuitively appealing to match units with X_i close to c and then compare the difference in mean outcomes for the treated units in this group and the untreated units in this group; the link to propensity score matching is clear. In the SRD design, this analysis makes sense (in fact, it should reproduce the SRD estimator), but in the FRD design, it does not. That is because, in the FRD design, treated units in the neighborhood of c consist of a mixture of compliers and always-takers, while untreated units in the neighborhood of c consist of a mixture of compliers and never-takers. Thus the treated and untreated units are not, on average, directly comparable. This is the same reason that we do not directly compare treated and untreated units in the context of IV (e.g., in a medical trial); instead we compare those that we intended to treat to those that we did not intend to treat, and rescale the coefficient by our estimate of the proportion of compliers in the sample.

1.5 External Validity

RD estimates are inherently localized. In the SRD design, the effects are estimated for a subpopulation with X_i in the neighborhood of c. In the FRD design, the subpopulation is further restricted; it consists only of compliers in the neighborhood of c. Although the external validity is limited, it is (hopefully) counterbalanced by a relatively high degree of internal validity. Nevertheless, it is important to note and think about the external validity of any RD estimates.
2 Graphical Analysis

2.1 Introduction

A graphical analysis should be the focus of any RD paper. The strength of the RD design revolves around the fact that the treatment assignment rule is known (or at least partially known) and that we should be able to see discontinuous changes in the treatment and the outcome (if there is an effect) as the running variable crosses c. Any RD design that fails to exhibit a visually perceptible break in treatment probability at the discontinuity threshold is basically not credible, regardless of the regression results. Conversely, any break that is visually perceptible will almost surely be statistically significant. So with RD papers, the statistical results really take a back seat to the graphical analysis.

There are three types of graphs in RD analyses, though not all analyses will necessarily include all three types. The first type plots outcomes by the running variable, where outcomes can include both Y_i and D_i. The second type plots covariates by the running variable, and the third type plots the density of the running variable.
2.2 Outcomes by the Running Variable

The first type of graph is basically a histogram-type plot that presents the average value of an outcome at evenly spaced values of the running variable; this is equivalent to running a kernel regression at each of those points using a uniform kernel. Formally, there are two parameters to choose: the binwidth, h, and the number of bins to the left and right of the threshold value, K_0 and K_1. Given these parameters, construct bins (b_k, b_{k+1}] for k = 1, ..., K = K_0 + K_1, where

$$b_k = c - (K_0 - k + 1) \cdot h$$

This simply creates K_0 evenly spaced bins of width h below the threshold value (c), and K_1 evenly spaced bins of width h above the threshold value. For each bin, calculate the number of observations lying in that bin:

$$N_k = \sum_{i=1}^{N} 1(b_k < X_i \leq b_{k+1})$$

Then, calculate the average treatment level in the bin (if you have a fuzzy design):

$$\bar{D}_k = \frac{1}{N_k} \sum_{i=1}^{N} D_i \cdot 1(b_k < X_i \leq b_{k+1})$$

Finally, calculate the average outcome in the bin:

$$\bar{Y}_k = \frac{1}{N_k} \sum_{i=1}^{N} Y_i \cdot 1(b_k < X_i \leq b_{k+1})$$

Footnote 6: Alternatively, one could choose the binwidth h and the support of the histogram; the two methods are equivalent.
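To make the construction concrete, here is a minimal Python sketch of the binned averages described above (the array names y, d, x and the use of numpy are my own illustrative choices; nothing here is prescribed by the notes):

    import numpy as np

    def rd_bin_means(x, y, c, h, K0, K1, d=None):
        """Bin counts and bin means of y (and optionally d) for an RD plot.
        Bins are (b_k, b_{k+1}] and are constructed so that no bin crosses c."""
        edges = c + h * np.arange(-K0, K1 + 1)      # K0 bins below c, K1 bins above
        mids = (edges[:-1] + edges[1:]) / 2.0
        n_bins = K0 + K1
        N_k = np.zeros(n_bins)
        Y_k = np.full(n_bins, np.nan)
        D_k = np.full(n_bins, np.nan)
        for k in range(n_bins):
            in_bin = (x > edges[k]) & (x <= edges[k + 1])
            N_k[k] = in_bin.sum()
            if N_k[k] > 0:
                Y_k[k] = y[in_bin].mean()
                if d is not None:
                    D_k[k] = d[in_bin].mean()
        return mids, N_k, Y_k, D_k

Plotting Y_k (or D_k) against the bin midpoints gives the outcome graphs discussed next; plotting N_k against the midpoints gives the raw material for the density plot in Section 2.4.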
The first plot of interest, particularly in the fuzzy RD design, is that of D̄_k against the midpoint of each of the bins k = 1, ..., K. The question is whether there is a visual discontinuity in the plot of D̄_k at the threshold c. A visual break implies that crossing the threshold has a significant effect on the probability of treatment; this graph is equivalent to the first stage in an IV analysis. Again, if there is no visual break, then it is unlikely that the statistical analysis will find anything either (and even if it does, it won't be very credible).

The second plot of interest is that of Ȳ_k against the midpoint of each of the bins k = 1, ..., K. The focus in this plot is on whether there is a visual discontinuity in the outcome as the running variable crosses the threshold c. A visual break implies that crossing the threshold has a significant effect on the outcome, which in turn implies (under our assumptions) that the treatment has a significant effect on the outcome. This graph is equivalent to the reduced form in an IV analysis. In addition to inspecting the threshold for a discontinuity, you should also inspect whether there are any other discontinuities of similar (or greater) magnitude in Ȳ_k at other values of the running variable. If there are, and if there is not a clear a priori reason to expect these discontinuities, then the research design is called into question; effectively, we have detected a violation of Assumption 1 (smoothness in expected potential outcomes). Finally, note that it is important not to smooth over the threshold value c, i.e., no bin should cross c. Smoothing over c will tend to minimize any discontinuity at the threshold.
2.3 Covariates by the Running Variable

The second type of graph plots the average values of covariates against the running variable using the same methodology as above. Suppose that we have a covariate Z_i that is related to the outcome but should not be affected by the treatment. Plotting Z_i against the running variable will allow us to determine whether it is balanced across the threshold; this is the equivalent of showing covariate balance after matching on the propensity score or demonstrating that covariates are balanced across an instrument. Formally, we calculate

$$\bar{Z}_k = \frac{1}{N_k} \sum_{i=1}^{N} Z_i \cdot 1(b_k < X_i \leq b_{k+1})$$

and then plot Z̄_k against the midpoint of each of the bins k = 1, ..., K. If the research design is valid, then there should not be any discontinuity in Z̄_k as the running variable crosses the threshold c.
Lee (2008) generates plots of outcomes and covariates by the running variable in the context of Congressional elections. As discussed earlier, he is interested in whether incumbency provides a reelection advantage to the party in power. The first graph, Figure 1, plots Ȳ_k against the running variable, with the discontinuity at c = 0 (you win if and only if you get more votes than your opponent). This plot demonstrates that incumbency provides an enormous advantage to the party in power; there is a large, discontinuous break in the probability of a Democrat winning the next election if a Democrat just barely won the last election. The second graph, Figure 2, plots Z̄_k against the running variable, with the threshold at c = 0. In this case, the covariate Z represents the number of previous election victories of a candidate. If the RD design is valid, then there should be no discontinuity at the threshold (i.e., candidates that just barely won an election should not look any different, in terms of past success, than candidates that just barely lost an election). Lee's figure shows that there is no discontinuity in the covariate, increasing the credibility of the design.

Footnote 7: Because Lee's RD design is a sharp RD design, there is little value in plotting D̄_k against the running variable. We already know what it will look like; it will just change from 0 to 1 as we cross c. If this were a fuzzy RD design, however, the plot of D̄_k would be just as important as the plot of Ȳ_k.
[Figure 1: Candidate's Probability of Winning Election t+1, by Margin of Victory in Election t: local averages and parametric (logit) fit. Source: Lee (2008). X-axis: Democratic vote share margin of victory in election t (-0.25 to 0.25); y-axis: probability of winning election t+1.]
[Figure 2: Candidate's Accumulated Number of Past Election Victories, by Margin of Victory in Election t: local averages and parametric (polynomial) fit. Source: Lee (2008). X-axis: Democratic vote share margin of victory in election t (-0.25 to 0.25); y-axis: number of past victories as of election t.]
2.4 Density of the Running Variable

The third type of graph arises from a specification test suggested in McCrary (2008). A primary concern in RD designs is that individuals may be able to game the assignment rule. That is to say, if individuals understand the assignment mechanism and can manipulate their value of the running variable, then they may be able to place themselves just above (or below) the threshold c. In that case, the individuals just above the threshold will disproportionately consist of those gaming the rule, and they will not be directly comparable to the individuals lying just below the threshold. For example, consider a welfare program that activates only when income falls below a threshold c. Shrewd families with income in the neighborhood of c will stop working right before income crosses c, so the observations right below c will disproportionately consist of families of this type. This type of family will not be balanced across the discontinuity threshold. Another example would arise if individuals are assigned on the basis of test scores but can re-take the test as many times as necessary. If the researcher uses an individual's maximum test score as the running variable, motivated individuals who re-take the test many times will be more likely to fall right above the discontinuity threshold than right below it.

Footnote 8: A solution in this case would be to use the individual's first test score as the running variable, assuming these data are available to the researcher.

To address this issue, McCrary suggests a specification test examining the density of the running variable as it crosses c. In practice, this graph can be generated using the same methodology as above, but instead of graphing Ȳ_k or Z̄_k, just plot the number of observations falling in each bin, i.e. N_k = Σ_i 1(b_k < X_i ≤ b_{k+1}). If units are manipulating their values of the running variable to fall just above or below c, then we should observe a discontinuity in the distribution of the running variable as it crosses c. If the distribution of the running variable is smooth as it crosses c, then it's unlikely that individuals are gaming the assignment mechanism.
3 Estimation

3.1 Local Linear Regression and the Sharp RD

We now focus on estimating (rather than graphing) the treatment effect in RD designs. Recall from our earlier lectures that kernel regression can be conceptualized as a local constant estimator: for a given X*, it estimates the mean of Y in the neighborhood of X* (hence Y is assumed to be constant in expectation in the neighborhood of X*). We also discussed Lowess regression as an example of a local regression estimator. We use a simpler form of local linear regression to estimate the change in Y_i as the running variable crosses c.

We first choose the bandwidth, h, that will determine the regression sample on either side of the threshold point c. We then fit a linear regression on either side of the threshold point for the samples with X_i in (c − h, c) and X_i in [c, c + h). Formally, we compute:

$$\min_{\alpha_l, \beta_l} \sum_{i:\, c-h < X_i < c} \left(Y_i - \alpha_l - \beta_l (X_i - c)\right)^2$$

and

$$\min_{\alpha_r, \beta_r} \sum_{i:\, c \leq X_i < c+h} \left(Y_i - \alpha_r - \beta_r (X_i - c)\right)^2$$

The limits lim_{x↑c} E[Y_i | X_i = x] and lim_{x↓c} E[Y_i | X_i = x] are then estimated as:

$$\hat{\mu}_l(c) = \hat{\alpha}_l + \hat{\beta}_l (c - c) = \hat{\alpha}_l \quad \text{and} \quad \hat{\mu}_r(c) = \hat{\alpha}_r + \hat{\beta}_r (c - c) = \hat{\alpha}_r$$

Finally, estimate τ̂_SRD = α̂_r − α̂_l.

In practice, it is often easier to implement this estimator using a single regression. Specifically, in the SRD design, run the following regression:

$$Y_i = \alpha + \tau D_i + \beta (X_i - c) + \gamma (X_i - c) \cdot D_i + u_i \quad \text{for the sample with } c - h < X_i < c + h$$

The coefficient τ̂ will be numerically identical to τ̂_SRD above. The advantage to the single regression, besides being simpler, is that we can use the least squares (robust) standard errors for statistical inference. As with kernel regression, the two factors that the researcher must choose are the kernel function and the bandwidth, h. We implicitly dealt with the kernel choice by using the uniform kernel, and we know that kernel choice is not too important anyway; the important choice is the bandwidth. Imbens suggests a bandwidth proportional to N^{−δ}, where 1/5 < δ < 2/5, but as always it's best to check that the results are not sensitive to doubling or halving the bandwidth.

Covariates could also be added to the regression above to improve precision. Unlike with the running variable, it is not necessary to interact the covariates with the treatment indicator (which is effectively an indicator for whether the unit lies above or below c, given that this is the SRD design).
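As an illustration of the single-regression implementation, here is a minimal Python sketch using statsmodels (the array names, the bandwidth argument, and the use of HC1 robust standard errors are my own choices, not something specified in the notes):

    import numpy as np
    import statsmodels.api as sm

    def srd_local_linear(y, x, c, h):
        """Sharp RD: interacted local linear regression within bandwidth h of c."""
        sample = (x > c - h) & (x < c + h)
        ys, xs = y[sample], x[sample]
        d = (xs >= c).astype(float)              # treatment indicator (SRD)
        xc = xs - c                              # centered running variable
        X = sm.add_constant(np.column_stack([d, xc, xc * d]))
        fit = sm.OLS(ys, X).fit(cov_type="HC1")  # robust standard errors
        return fit.params[1], fit.bse[1]         # tau_hat and its standard error

Re-running this with h/2 and 2h is the bandwidth sensitivity check recommended above.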
3.2 Estimation in the Fuzzy RD

In the FRD design, we have two effects to estimate: the effect of crossing the threshold on the treatment (the first stage) and the effect of crossing the threshold on the outcome (the reduced form). We again use local linear regression. As you might expect, we apply the same methodology as in Section 3.1 to estimate the effect of crossing the threshold on Y_i and the effect of crossing the threshold on D_i. Specifically, we run the regressions:

$$Y_i = \alpha_0 + \alpha_1 Z_i + \alpha_2 (X_i - c) + \alpha_3 (X_i - c) \cdot Z_i + u_i \quad \text{for the sample with } c - h < X_i < c + h$$

and

$$D_i = \beta_0 + \beta_1 Z_i + \beta_2 (X_i - c) + \beta_3 (X_i - c) \cdot Z_i + v_i \quad \text{for the sample with } c - h < X_i < c + h$$

where Z_i = 1(X_i ≥ c). In other words, we regress Y_i and D_i on an indicator for whether an observation falls above or below the discontinuity threshold (also controlling for the running variable and an interaction of the running variable and the above/below indicator), using the sample with c − h < X_i < c + h. The fuzzy RD estimator is:

$$\hat{\tau}_{FRD} = \frac{\hat{\alpha}_1}{\hat{\beta}_1}$$

In other words, the FRD estimator is simply the ratio of the reduced form and first stage estimates, i.e. the effect of crossing the discontinuity threshold on the outcome divided by the effect of crossing the discontinuity threshold on the treatment. Again, it is trivial to add covariates to the regressions above, and the formula τ̂_FRD = α̂_1/β̂_1 will still apply.

Given the discussion above, it should be obvious that we can estimate τ_FRD using a TSLS regression in which Y_i is the outcome, D_i is the treatment, Z_i is the instrument, and (X_i − c) and (X_i − c)·Z_i are covariates (this regression would of course be limited to the sample with c − h < X_i < c + h). The advantage of this approach, besides ease of implementation, is that we can use the 2SLS (robust) standard errors for statistical inference.
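A sketch of the corresponding fuzzy RD point estimate as the ratio of the reduced form to the first stage (for standard errors one would in practice run the 2SLS version just described or use the delta method, both of which are omitted here; all names are placeholders of mine):

    import numpy as np
    import statsmodels.api as sm

    def frd_ratio(y, d, x, c, h):
        """Fuzzy RD point estimate: reduced form slope / first stage slope."""
        sample = (x > c - h) & (x < c + h)
        z = (x[sample] >= c).astype(float)       # above-threshold indicator Z_i
        xc = x[sample] - c
        W = sm.add_constant(np.column_stack([z, xc, xc * z]))
        reduced_form = sm.OLS(y[sample], W).fit().params[1]   # alpha_1 hat
        first_stage = sm.OLS(d[sample], W).fit().params[1]    # beta_1 hat
        return reduced_form / first_stage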
3.3 Optimal Bandwidth

As with kernel density estimation, choosing the bandwidth (h) is more an art than a science. If you really love doing math, check out the section in Imbens and Lemieux (2008) on bandwidth selection. They discuss a cross-validation method in the context of the SRD. Recall that the optimal bandwidth is generally a function of the regression function that we are trying to estimate (which itself depends on the bandwidth). The cross-validation method minimizes the mean squared error between Y_i and predicted Y_i with respect to h. The setup here is actually simpler than it was with kernel density estimators in that we only care about minimizing the MSE near a single point (c) rather than over the entire support of X. We do this by minimizing

$$CV_Y(h) = \frac{1}{N} \sum_{i:\, q_{X,\delta,l} \leq X_i \leq q_{X,1-\delta,r}} \left(Y_i - \hat{\mu}(X_i)\right)^2$$

where q_{X,δ,l} and q_{X,1−δ,r} are the δ and 1−δ quantiles (respectively) of the empirical distributions of the samples with X_i < c and X_i ≥ c (respectively). We limit the sample in this manner because we're really interested in the optimal bandwidth in some region around c. Literally what the CV procedure does is find the bandwidth h that minimizes the mean squared error (i.e., the difference between the actual Y_i and the predicted Y_i) in the trimmed data set.

In the CV criterion formula above, μ̂(X_i) is the intercept from a local linear regression using observations falling in (X_i − h, X_i) if X_i < c or observations falling in [X_i, X_i + h) if X_i ≥ c. These regressions are estimated as described in the beginning of Section 3.1, but now X_i replaces c. Why do we proceed in this manner? Recall that the goal of our left-side local linear regression is to estimate the conditional expectation of Y_i(0) at X_i = c. Because E[Y_i | X_i] jumps discontinuously at X_i = c (and because there are virtually no observations with X_i = c), we cannot test our regression's performance by comparing its prediction to actual observations of Y_i when X_i = c. Instead, in our data we observe pairs (Y_i, X_i) at many values of X_i other than c. For each observation we thus pretend that X_i is the threshold value c, estimate a local linear regression at X_i using other nearby observations, and measure how well the local linear regression does at predicting the observed value of Y_i for the observation (Y_i, X_i).

Footnote 9: Stepping back to see the forest rather than the trees, our ultimate goal is to choose the optimal bandwidth h* that produces the set of local linear regressions which best predict the observed values of Y_i in the trimmed data set.

Specifically, for each point X_i that is to the left of c (but right of q_{X,δ,l}), we run the regression:

$$Y_j = \alpha_l(X_i) + \beta_l(X_i) \cdot (X_j - X_i) + u_j \quad \text{for the sample with } X_i - h < X_j < X_i$$

We limit the sample to the left side of X_i to mimic the local linear regression that we estimate on the left side of c. Note that the parameters α_l(X_i) and β_l(X_i) are functions of X_i because we are running a different local linear regression for each point X_i in the left subsample. We set μ̂(X_i) = α̂_l(X_i) for each point X_i in the left subsample.

Footnote 10: In principle we could estimate separate bandwidths for either side of c, but in practice we use the same bandwidth on both sides of c.

Footnote 11: μ̂(X_i) is not a function of β̂_l(X_i) because the regressor X_j − X_i approaches zero as X_j approaches the simulated threshold X_i. If we specified the regressor as X_j instead of X_j − X_i, then we would need to set μ̂(X_i) = α̂_l(X_i) + β̂_l(X_i)·X_i in order to estimate the conditional expectation of Y_i at X_i.

For each point X_i that is to the right of c (but left of q_{X,1−δ,r}), we run the regression:

$$Y_j = \alpha_r(X_i) + \beta_r(X_i) \cdot (X_j - X_i) + u_j \quad \text{for the sample with } X_i < X_j < X_i + h$$

We limit the sample to the right side of X_i to mimic the local linear regression that we estimate on the right side of c. We set μ̂(X_i) = α̂_r(X_i) for each point X_i in the right subsample.

The procedure is computationally intensive: for each potential value of h, you are running (1 − δ)·N local linear regressions. In practice, you might start with δ = 0.50 and assess the sensitivity of h to using larger values of δ (e.g., 0.8, 0.9).

For the FRD design, one could in principle choose separate bandwidths for the reduced form and the first stage. In practice, however, we generally choose the same bandwidth for both regressions. One option is to use the optimal bandwidth for the reduced form for both regressions. Alternatively, we could choose the minimum of the two optimal bandwidths (the reduced form optimal bandwidth and the first stage optimal bandwidth).

Regardless of how you choose the bandwidth, it is always a good idea to test the sensitivity of your results to the choice of bandwidth by doubling/halving the bandwidth.
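A brute-force sketch of the cross-validation loop (the trimming value, the candidate bandwidth grid, and the helper function are illustrative choices of mine rather than anything prescribed above):

    import numpy as np

    def local_intercept(y, x, x0, h, side):
        """Intercept from a local linear regression of y on (x - x0), using
        observations just below x0 (side='l') or just above x0 (side='r')."""
        use = (x > x0 - h) & (x < x0) if side == "l" else (x > x0) & (x < x0 + h)
        if use.sum() < 2:
            return np.nan
        X = np.column_stack([np.ones(use.sum()), x[use] - x0])
        coef, *_ = np.linalg.lstsq(X, y[use], rcond=None)
        return coef[0]

    def cv_criterion(y, x, c, h, delta=0.5):
        """MSE of the leave-out local linear predictions in the delta-trimmed sample."""
        lo = np.quantile(x[x < c], delta)         # q_{X,delta,l}
        hi = np.quantile(x[x >= c], 1 - delta)    # q_{X,1-delta,r}
        keep = (x >= lo) & (x <= hi)
        preds = np.array([local_intercept(y, x, xi, h, "l" if xi < c else "r")
                          for xi in x[keep]])
        ok = ~np.isnan(preds)
        return np.mean((y[keep][ok] - preds[ok]) ** 2)

    # Choose h on a grid, then double/halve it as a sensitivity check:
    # h_grid = np.linspace(0.05, 1.0, 20)
    # h_star = h_grid[np.argmin([cv_criterion(y, x, c, h) for h in h_grid])]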
3.4 Alternative Estimators

An alternative to the SRD and FRD estimators above is to use all of the data when estimating the treatment effect but to control for the conditional expectation of the outcome as a function of the running variable. For example, in the SRD design, we could express the outcome as a function of the running variable and the treatment:

$$Y_i = \alpha + m(X_i) + \tau D_i + v_i$$

If D_i is a treatment indicator, then the coefficient τ should estimate the average treatment effect for units at the discontinuity. In practice, however, we do not know the function m(.). We therefore approximate m(.) using a low order polynomial of the running variable (e.g., a cubic or quartic), fully interacted with the treatment indicator. We can then estimate the treatment effect by simply regressing Y_i on D_i, a polynomial of X_i, and that polynomial interacted with D_i. The coefficient on D_i gives the estimate of the treatment effect. This is the type of estimator applied in Card and Lee (2008) (see Section 3.6).

In the FRD design, we can use the same methodology as above, but apply it to the 2SLS estimator instead of the OLS estimator. Specifically, replace D_i with Z_i = 1(X_i ≥ c), and run a 2SLS regression in which the endogenous regressor is D_i, the instrument is Z_i, and the covariates are a polynomial in X_i and the interaction between that polynomial and Z_i.
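A sketch of the global polynomial version for the sharp design (the cubic order and the variable names are illustrative choices of mine; centering the running variable at c keeps the coefficient on D_i interpretable as the jump at the threshold):

    import numpy as np
    import statsmodels.api as sm

    def srd_global_polynomial(y, x, c, order=3):
        """Sharp RD with a global polynomial in (X - c), fully interacted with D."""
        d = (x >= c).astype(float)
        xc = x - c
        poly = np.column_stack([xc ** p for p in range(1, order + 1)])
        X = sm.add_constant(np.column_stack([d, poly, poly * d[:, None]]))
        fit = sm.OLS(y, X).fit(cov_type="HC1")
        return fit.params[1], fit.bse[1]         # coefficient on D and its SE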
3.5 Specification Tests

As in the graphical analysis, we can run several sets of specification tests. One type of specification check tests for discontinuities in covariates at the threshold, c. To perform this test, simply replace Y_i in the regressions above with the covariate of interest, Z_i. Another specification check, presented in McCrary (2008), tests whether there is a discontinuous jump in the running variable density at the threshold. As argued above, such a jump would be evidence that individuals are manipulating their values of the running variable in order to select into/out of treatment. Implementing this test requires us to revisit kernel density estimation.

The first step in the test involves estimating a histogram using a similar methodology to that in Section 2.2. Note that the histogram bins are defined such that no bin crosses the threshold value, c. The goal here is to plot the frequency of observations, rather than the average outcome. For each bin we therefore plot F̂_k = N_k/(N·b), where b is the binwidth and

$$N_k = \sum_{i=1}^{N} 1(b_k < X_i \leq b_{k+1}).$$

We are therefore plotting the number of observations that fall in each histogram bin, normalized by N·b (we normalize by b so that the binwidth does not impact the height of the histogram bars). We then estimate local linear regressions using the histogram bin midpoints as data; this is effectively smoothing the histogram. As in previous sections, we estimate this regression separately for the left and right sides of the threshold value; we do not want to smooth across the threshold point. Let {X̃_1, ..., X̃_{K_0}} be the set of histogram bin midpoints that fall below c. At every point x < c, run the following weighted regression:

$$\hat{F}_k = \alpha_l(x) + \beta_l(x)(\tilde{X}_k - x) \quad \text{using weights } \sqrt{K((\tilde{X}_k - x)/h)}$$

where K(t) = max{0, 1 − |t|} is the triangle kernel and h is the bandwidth (keep in mind that the bandwidth h is no longer the same thing as the binwidth b!). Note that within each regression, x is a constant; the parameters are written as functions of x to reflect the fact that we are estimating a separate regression at each point x < c. The density estimate at point x is then f̂(x) = α̂_l(x). Repeat the same analysis using only the set of histogram bin midpoints that lie above c, {X̃_1, ..., X̃_{K_1}}. From this analysis, recover the density estimate at all points x ≥ c, f̂(x) = α̂_r(x).

The test statistic will be θ̂ = ln f̂(x)_+ − ln f̂(x)_−, where f̂(x)_+ and f̂(x)_− are estimated just to the right and left of c respectively. We normalize this statistic by its standard error,

$$\hat{\sigma}_{\theta} = \sqrt{\frac{4.8}{N h}\left(\frac{1}{\hat{f}_+} + \frac{1}{\hat{f}_-}\right)}.$$

We can then evaluate θ̂/σ̂_θ using a standard t-distribution.

In order to implement this estimator, you need to choose the binwidth, b, and the bandwidth, h. McCrary's simulations indicate that the choice of b is not too important, but the choice of h can be important. His recommendation is to choose b = 2σ̂/√N, where σ̂ is the standard deviation of the running variable. For the bandwidth, his recommendation is to choose it via visual inspection, experimenting with different bandwidths. If you prefer to use a plug-in bandwidth formula, however, or if you need an initial value for the bandwidth, McCrary offers the following plug-in bandwidth estimator:

1. Using the first-step histogram, estimate a global 4th order polynomial (in the running variable) on each side of the threshold, c.

2. On each side of c, compute 3.348·[σ̂²(b − a)/Σ f''(X_k)²]^{0.2}, where σ̂² is the mean squared error of the regression, b − a is X_K − c on the right-side regression and c − X_1 on the left-side regression, and f''(X_k) is the estimated second derivative from the global polynomial model. Set the estimated bandwidth, ĥ, to be the average of the two quantities that you have computed.

As always, no matter how you choose your bandwidth, you will want to check the sensitivity of your results to the choice of bandwidth.
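A simplified Python sketch of the density test's moving parts: the normalized first-step histogram, triangle-kernel-weighted local linear smoothing on each side of c, and the log-difference statistic with the standard error given above. The default binwidth follows the rule of thumb in the text; the default bandwidth and all function names are placeholders of mine, and a real application would pick h by inspection or the plug-in formula:

    import numpy as np

    def mccrary_test(x, c, b=None, h=None):
        """Density discontinuity test at threshold c (simplified sketch)."""
        N = len(x)
        if b is None:
            b = 2 * x.std() / np.sqrt(N)             # suggested binwidth
        if h is None:
            h = 10 * b                                # placeholder; choose by inspection
        # First step: histogram with no bin crossing c, normalized by N*b
        lo_edges = np.arange(c, x.min() - b, -b)[::-1]
        hi_edges = np.arange(c, x.max() + b, b)
        edges = np.concatenate([lo_edges[:-1], hi_edges])
        counts, _ = np.histogram(x, bins=edges)
        mids = (edges[:-1] + edges[1:]) / 2.0
        F = counts / (N * b)

        def smooth_at(point, side_mask):
            # triangle-kernel-weighted local linear regression of F on (mids - point)
            t = (mids[side_mask] - point) / h
            w = np.maximum(0.0, 1.0 - np.abs(t))
            use = w > 0
            X = np.column_stack([np.ones(use.sum()), mids[side_mask][use] - point])
            W = np.diag(w[use])
            coef = np.linalg.solve(X.T @ W @ X, X.T @ W @ F[side_mask][use])
            return coef[0]                            # intercept = density estimate

        f_minus = smooth_at(c, mids < c)
        f_plus = smooth_at(c, mids >= c)
        theta = np.log(f_plus) - np.log(f_minus)
        se = np.sqrt(4.8 * (1 / f_plus + 1 / f_minus) / (N * h))
        return theta, se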
McCrary (2008) applies the test to Lee's Congressional election data and to House roll call votes. In the former example, we expect no discontinuity in the running variable density at c, and we find none, as evidenced in Figure 3 and Table 2. In the roll call votes, however, there are relatively few voters (only the members of the House of Representatives), and the votes are public knowledge. Close votes thus may involve intense lobbying of swing members, and votes may be more likely to barely pass than they are to barely fail. Indeed, that is what appears in Table 2 and Figure 4.
[Figure 3: Democratic vote share relative to cutoff, popular elections to the House of Representatives, 1900-1990. Source: McCrary (2008). X-axis: Democratic margin (-1 to 1); left y-axis: frequency count from the first-step histogram; right y-axis: smoothed density estimate. The density is smooth through the cutoff.]

[Figure 4: Percent voting yes, roll call votes, US House of Representatives, 1857-2004. Source: McCrary (2008). X-axis: percent voting in favor of the proposed bill (0 to 1); left y-axis: frequency count from the first-step histogram; right y-axis: smoothed density estimate. The density jumps discontinuously at 50%; close votes are far more likely to barely pass than to barely fail.]

Table 2: Log discontinuity estimates (standard errors in parentheses). Source: McCrary (2008)
    Popular elections: 0.060 (0.108), N = 16,917
    Roll call votes: 0.521 (0.079), N = 35,052
3.6 Discrete Running Variables

In some cases, the running variable is discrete (technically, this is true in all cases, but in some cases the discreteness is non-trivial, e.g., if the running variable is age and you only have the data available in months or quarters). In these cases, it is impossible to estimate the conditional expectation function arbitrarily close to the discontinuity threshold, generating a form of model misspecification (Card and Lee 2008). This specification error can lead to group structure in the variance-covariance matrix: non-zero covariances between observations with the same value of the discretized running variable, X_i, may arise because there is a deviation between the true conditional expectation and the predicted conditional expectation (using the coarse running variable). Card and Lee suggest two ways to modify the standard errors in response to this problem. First, one can cluster the standard errors on the discrete values of the running variable (i.e., each value is a separate cluster). This adjustment will be valid if the specification errors in both potential outcomes (E[Y_1 | X = x_i] and E[Y_0 | X = x_i]) are the same. If you have reason to believe that the specification error differs for the two potential outcomes, then you can collapse the data to the cell level (each cell corresponds to a single value of the discrete running variable) and run a cell size-weighted regression using the cell level data (see Card and Lee for further details).

4 Additional References

Card, D. and D. Lee. "Regression Discontinuity Inference with Specification Error." Journal of Econometrics, 2008, 142, 655-674.

Lee, D. "Randomized Experiments from Non-random Selection in U.S. House Elections." Journal of Econometrics, 2008, 142, 675-697.
ARE 213 Applied Econometrics
Fall 2012 UC Berkeley Department of Agricultural and Resource Economics
Statistical Inference:
Part 1, Panel Data and Clustering

We now transition from talking about estimators to talking about how to perform statistical inference with these estimators. That is to say, we set aside the issue of consistency (we either assume our estimators are consistent or we assume that we are content with accepting the estimand for whatever it is) and instead focus on how to conduct statistical tests or construct confidence intervals for these estimators. We begin by discussing the issue of serial correlation (i.e., dependence between different observations in the same data set), particularly in the context of clustered data.

As recently as 2000, many applied micro papers were using conventional (or Eicker-White robust) standard errors in data sets with a high degree of dependence between observations. Failing to appropriately account for this dependence can easily understate the standard errors by a factor of two or three, and it is now viewed as unacceptable to treat dependent observations as if they were independent when calculating standard errors. The paper that really brought this issue to the attention of applied researchers is Bertrand, Duflo, and Mullainathan (2004). We will review this paper and then discuss the appropriate techniques for adjusting standard errors depending on the number of independent groups (i.e., clusters) and the number of units inside each of these groups.

1 Bertrand, Duflo, and Mullainathan (2004)

1.1 Literature Review

Bertrand, Duflo, and Mullainathan (2004) (henceforth BDM) examine the performance of conventional standard errors in the context of diff-in-diff (DD) estimators that are popular in many applied micro fields (labor, public, development, health, etc.). They begin by summarizing the state of the DD literature from 1990 to 2000. Using a survey of 92 DD papers drawn from six journals (AER, ILRR, JOLE, JPE, JPubE, and QJE), they find that 65 use more than two periods of data. Of these papers, the average number of periods used is 16.5, creating the potential for a large number of dependent observations within each cross-sectional unit. Only five of these papers make any correction for serial correlation across time within cross-sectional units. Four of the papers use parametric AR corrections (which turn out to be ineffective), and only one allows for arbitrary serial correlation within each cross-sectional unit (the recommended solution). To summarize, the state of the applied literature during this time period with respect to computing standard errors was nothing short of appalling.
1.2 Theory

Although it did not make it into the published version, the original BDM working paper contained a useful section on the bias of the OLS standard errors in the presence of AR(1) auto-correlation. Consider a simple bivariate regression of the form:

$$y_t = \alpha + \beta x_t + \varepsilon_t$$

Though we have only one cross-sectional unit in this case, we will in general assume that errors are independent across cross-sectional units (i.e., clusters) but dependent within cross-sectional units (i.e., over time within a given unit). Assume that x_t follows an AR(1) process with auto-correlation parameter δ and that ε_t follows an AR(1) process with auto-correlation parameter ρ. This process implies that the correlation between two observations of x_t that are t' periods apart is δ^{t'}. Likewise the correlation between two observations of ε_t that are t' periods apart is ρ^{t'}. It can be shown that

$$\mathrm{Var}(\hat{\beta}) = \frac{\sigma^2_{\varepsilon}}{T \sigma^2_x}\left(1 + 2\rho\delta \frac{\sum_{t=1}^{T-1} x_t x_{t+1}}{T \sigma^2_x} + 2\rho^2\delta^2 \frac{\sum_{t=1}^{T-2} x_t x_{t+2}}{T \sigma^2_x} + \ldots + 2\rho^{T-1}\delta^{T-1} \frac{x_1 x_T}{T \sigma^2_x}\right).$$

The OLS standard errors in contrast are estimated as

$$\widehat{\mathrm{Var}}(\hat{\beta}) = \frac{\sigma^2_{\varepsilon}}{T \sigma^2_x}$$

Furthermore, as T → ∞, the ratio of the estimated variance (i.e., V̂ar(β̂)) to the true variance (i.e., Var(β̂)) equals (1 − ρδ)/(1 + ρδ).
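To get a sense of magnitudes, consider an illustrative calculation (the parameter values are mine, chosen only for illustration): with fairly persistent processes such as ρ = δ = 0.8,

$$\frac{1 - \rho\delta}{1 + \rho\delta} = \frac{1 - 0.64}{1 + 0.64} = \frac{0.36}{1.64} \approx 0.22,$$

so the conventional formula recovers only about a fifth of the true variance, and the reported standard errors are less than half their correct size (since √0.22 ≈ 0.47).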
These formulas make clear several important points regarding serial correlation. First, if either the residual, ε_t, or the regressor, x_t, is independent across observations, then there is no serial correlation bias in the standard errors. This is because if either ρ = 0 or E[x_t x_s] = 0 (for t ≠ s), then Var(β̂) = σ²_ε/(T·σ²_x) (asymptotically, (1 − ρδ)/(1 + ρδ) = 1). This fact explains why we are rarely concerned about serial correlation in randomized trials. For example, consider implementing a randomized intervention in the Bay Area in which some unemployed workers receive job training and others do not; we wish to compare outcomes of the treated workers to those of the untreated workers. These workers' outcomes are surely correlated with each other, since all of them are affected by similar macroeconomic shocks. Nevertheless, there is no serial correlation problem because the treatment is randomly assigned at the individual level, so by definition there can be no serial correlation in treatment assignments across different individuals.

Footnote 1: Technically, these points are specific to the simple AR(1) model, but they will often hold in more general forms of serial correlation too.

Second, the presence of positive serial correlation (the most common type) in both ε_t and x_t will lead the estimated variance of β̂ to be too small relative to the true variance of β̂. This should be obvious as T → ∞ ((1 − ρδ)/(1 + ρδ) < 1), but it also holds for finite T: the ρδ terms and the expectations of the x_t x_s terms in Var(β̂) will be positive. As ρ and δ increase (i.e., the serial correlation gets worse), the bias in the estimated variance gets worse. This fact should be somewhat intuitive; if there is positive dependence between observations on both the treatment and outcome sides, then you effectively have less data than it seems. In an extreme case, if you had perfect serial correlation in both the outcome and the treatment, then you would just be repeating the same observation again and again as you added additional time periods, and no new information would actually enter the data set.

Footnote 2: The same thing is true if both ε_t and x_t are negatively serially correlated, but negative serial correlation in both variables is rare in practice.
Third, the degree of bias is affected by T. Note that the ratio of Var(β̂) to V̂ar(β̂) is:

$$1 + 2\rho\delta \frac{\sum_{t=1}^{T-1} x_t x_{t+1}}{T \sigma^2_x} + 2\rho^2\delta^2 \frac{\sum_{t=1}^{T-2} x_t x_{t+2}}{T \sigma^2_x} + \ldots + 2\rho^{T-1}\delta^{T-1} \frac{x_1 x_T}{T \sigma^2_x}$$

Although each denominator contains T, the numerator in each term should increase at rate T/(T − 1) or faster, while the denominators increase at only (T + 1)/T. Furthermore, increasing T always adds more terms to the expression. Thus, the larger that T gets, the worse the bias gets, all other things being equal. This should be fairly intuitive: as T increases, the ratio of dependent observations to truly independent observations increases, and the downward bias in the standard errors becomes worse.

Finally, it is theoretically possible for the true variance of β̂ to be less than the estimated variance of β̂ (even ignoring sampling error). Specifically, if there were negative serial correlation in x_t and positive serial correlation in ε_t (or vice versa), then the estimated standard errors would overstate the true variance of β̂. This might occur if the outcome has been first differenced (which tends to generate negative serial correlation due to mean reversion) or in a randomized experiment that implements a paired research design.
1.3 Simulations

BDM run a set of simulations using CPS data from 1979 to 1999 in order to ascertain how severe the bias in conventional standard errors is in practice. Specifically, they measure female wages in 50 states over 21 years (1,050 state-by-year cells) and then randomly generate laws that affect some states and not others. They randomly draw a year from the uniform distribution between 1985 and 1995 to determine when the simulated law takes effect. They then randomly draw 25 states that will be treated, leaving the other 25 as controls. The treatment dummy is defined as unity for women living in a treated state during a treated year, and zero otherwise. Note that this simulation procedure is very different from randomly assigning a treatment dummy in each state-by-year cell. Under pure random assignment, there would be no serial correlation in the treatment dummy, and the conventional standard errors should be correct (BDM confirm this fact in their simulations). But in the real world, laws don't randomly turn on and off from year to year in the same state; they turn on in a given state and then persist (i.e., there is serial correlation). Hence BDM design their simulations to replicate how laws are actually distributed in the real world.

By design these simulated laws, though serially correlated over time, are uncorrelated with any real outcome. Thus we know that on average the regression coefficient on the simulated treatment variable, β̂, will equal zero. The question of interest, however, is whether the standard errors are of the correct size, i.e. do we reject the null hypothesis of zero only 5% of the time at the α = 0.05 level? The answer is no. Using the results from several hundred simulations, BDM find that they reject the null hypothesis of no effect an incredible 67% of the time when using micro level data. Aggregating the data to the state-by-year level improves matters a bit, but they still reject the null hypothesis 40-50% of the time.

Footnote 3: The regressions also include state and year fixed effects.

BDM also experiment with changing the sample size along the two relevant dimensions (G and T). They find that reducing G while keeping T fixed at T = 21 has a minimal effect on the rejection rate (it still remains around 40% or higher). Reducing T to 5, however, lowers the rejection rate to 0.08 (G remains 50). Reducing T to 3 brings the rejection rate down to 0.05. So, as predicted by theory, the bias from serial correlation is less severe for relatively low values of T (i.e., small clusters).
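The flavor of the BDM exercise is easy to reproduce on any state-by-year panel. Below is a minimal sketch assuming a pandas DataFrame df with columns state, year, and outcome (the column names and the use of pandas/statsmodels are my own choices; BDM's actual implementation is not reproduced in these notes). It assigns a placebo law to 25 random states starting in a random year between 1985 and 1995, runs the two-way fixed effects regression, and records how often conventional standard errors reject the (true) null of no effect:

    import numpy as np
    import statsmodels.formula.api as smf

    def placebo_rejects(df, n_sims=200, seed=0):
        """Share of simulations with |t| > 1.96 for a randomly generated placebo law."""
        rng = np.random.default_rng(seed)
        states = df["state"].unique()
        rejections = 0
        for _ in range(n_sims):
            treated = rng.choice(states, size=25, replace=False)
            start = rng.integers(1985, 1996)      # law turns on and then persists
            df = df.assign(law=((df["state"].isin(treated)) &
                                (df["year"] >= start)).astype(int))
            fit = smf.ols("outcome ~ law + C(state) + C(year)", data=df).fit()
            rejections += abs(fit.tvalues["law"]) > 1.96
        return rejections / n_sims

Replacing .fit() with .fit(cov_type='cluster', cov_kwds={'groups': df['state']}) applies the clustered correction discussed in Section 1.4.3 and should bring the rejection rate back toward 5%.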
1.4 Solutions

1.4.1 Parametric AR(1) Corrections

One possible correction entails assuming that the serial correlation takes the form of an AR(1) process (i.e., ε_{t+1} = u_{t+1} + ρε_t) and using a parametric correction based on this model (e.g., transform the data using the formula x̃_t = x_t − ρ̂x_{t−1}). BDM find that parametric corrections are ineffective, presumably because the assumptions they impose on the exact form of the serial correlation are too restrictive. After implementing the parametric correction, BDM still find rejection rates from 18-24%. In summary, don't rely on parametric corrections that assume an AR(1) error process (or some similarly restrictive error process).
1.4.2 Collapse the Data

Another correction entails collapsing the data until the dependence issue disappears. Specifically, we solve the clustering problem by collapsing the clusters down until they only contain one or two observations each. Then the resulting data set has no dependence problem because the observations are independent of each other by virtue of the fact that the clusters are independent of each other. This method should almost always solve the dependence issue, albeit at the expense of lower precision (and possibly introducing some form of aggregation bias).

In the panel data/DD context, collapsing the data generally entails collapsing each cross-sectional unit into two time periods: pre-treatment and post-treatment. You can then estimate a regression of the outcome on a treatment indicator using the collapsed data; conventional standard errors should generate tests of the correct size (or at least close enough). This method is somewhat problematic, however, if the treatment activates at different times for different states; in that case, it's unclear what the counterfactual post-treatment period for the untreated states is. In this context, BDM suggest a variant of the following procedure (a code sketch follows below):

1. Regress Y_st on state fixed effects, year dummies, and relevant covariates (if you have individual level data, Y_st corresponds to the state-by-year cell mean). Note that we are not including the treatment indicator in this regression. Collect the residuals from this regression; call them Ỹ_st. Regress D_st, the treatment indicator, on state fixed effects, year dummies, and the same covariates that you use for Y_st. Collect the residuals from this regression; call them D̃_st.

2. For the treatment states only, divide the observations into two groups: observations from before the law and observations from after the law. Collapse the observations (which now consist of Ỹ_st and D̃_st) down to the state-by-treatment-status level (i.e., you will have two observations for each treated state: pre-treatment and post-treatment).

3. Using this collapsed data set with only treated states, regress Ỹ_st on D̃_st. The standard errors in this regression should be the correct size (or close enough).

Footnote 4: You might think that the control states are adding nothing here since they have been discarded. However, they implicitly provided the counterfactual trajectories for the treated states when we estimated the year dummies in the first step.
This is the Mike Anderson Approved™ method of collapsing the data when the policy change occurs at different periods for different cross-sectional units. It is slightly different than the methodology suggested in BDM; they suggest the same procedure except that they do not make any mention of residualizing the treatment indicator. Failing to residualize the treatment indicator results in an estimator that does not reproduce the standard collapsed DD estimator when the policy change occurs simultaneously for all treated states.

Footnote 5: Anyone that can disprove me on this point gets an extra 20 percentage points added to their course grade.

When collapsing the data, BDM find that the rejection rate falls to 5-6%, using either the simple aggregation method or the residual aggregation method. When the number of clusters (states) falls to 10, they find that the simple aggregation method rejects 5% of the time while the residual aggregation method rejects 9% of the time. When the number of clusters falls to 6, the simple aggregation method rejects 7% of the time while the residual aggregation method rejects 10% of the time.
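Here is a sketch of the residual-aggregation steps in pandas/statsmodels, assuming a state-by-year DataFrame with columns state, year, y, and d, where d is the treatment indicator (the column names and helper structure are mine, not BDM's):

    import pandas as pd
    import statsmodels.formula.api as smf

    def residual_aggregation_dd(df):
        """Steps 1-3 above: residualize y and d on state and year effects, collapse
        treated states to pre/post cells, then regress the collapsed residuals."""
        df = df.sort_values(["state", "year"]).copy()
        # Step 1: residualize outcome and treatment on state and year fixed effects
        df["y_res"] = smf.ols("y ~ C(state) + C(year)", data=df).fit().resid
        df["d_res"] = smf.ols("d ~ C(state) + C(year)", data=df).fit().resid
        # Step 2: keep ever-treated states and split their observations into pre/post
        treated = df.loc[df["d"] == 1, "state"].unique()
        sub = df[df["state"].isin(treated)].copy()
        sub["post"] = sub.groupby("state")["d"].cummax()   # 1 once the law is on
        cells = sub.groupby(["state", "post"], as_index=False)[["y_res", "d_res"]].mean()
        # Step 3: regress collapsed residualized outcome on collapsed residualized treatment
        return smf.ols("y_res ~ d_res", data=cells).fit()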
1.4.3 Arbitrary Variance-Covariance Matrix (Clustered Standard Errors)

A final correction (the generally recommended one) uses an empirical estimator that can accommodate an arbitrary variance-covariance matrix. This is the same clustered standard errors estimator that we covered in the panel data lectures. Assume that we have a panel data model of the form y_gt = x_gt β + ε_gt. Formally, the estimator is:

$$\widehat{\mathrm{Var}}(\hat{\beta}) = \left(\sum_{g=1}^{G} X_g' X_g\right)^{-1} \left(\sum_{g=1}^{G} X_g' \hat{\varepsilon}_g \hat{\varepsilon}_g' X_g\right) \left(\sum_{g=1}^{G} X_g' X_g\right)^{-1},$$

where X_g is a T × K matrix with the tth row equal to x_gt, and ε̂_g is the T × 1 column vector with tth element equal to ε̂_gt. We calculate ε̂_gt using the estimated regression coefficients (remember, serial correlation does not make the coefficient estimates inconsistent, it only affects the standard errors).

Footnote 6: In practice we also apply a correction for the number of clusters: (G/(G − 1))·((N − 1)/(N − K)) ≈ G/(G − 1), where N is the total sample size. Stata automatically applies this correction.

This formula makes it clear why more clusters are better (from the perspective of computing standard errors). The middle term will provide a precise estimate of E[X_g' ε_g ε_g' X_g] only if G is of a reasonable size, i.e. we have a sufficient number of clusters. Otherwise, our estimate of E[X_g' ε_g ε_g' X_g] will be relatively unstable.

We explore the derivation of this estimator in the next section. For the moment, note that in practice the arbitrary variance-covariance estimator is implemented using Stata's ", cluster()" option. This option can be applied to a number of estimators, not just linear regression. The key underlying assumption is that although residuals may be correlated within a given cluster, they are independent across different clusters.

BDM explore the performance of the arbitrary variance-covariance matrix in their simulations. The rejection rate is 6% for 50 or 20 clusters, 8% for 10 clusters, and 12% for 6 clusters.
2 The Clustered Variance Estimator

2.1 Derivation

Clustered standard errors have replaced conventional standard errors in virtually all panel data applications, and in many other contexts as well (e.g., sampling groups of students within classrooms, sampling groups of individuals within villages, etc.). Given their widespread adoption, you should have some understanding of the underlying algebra.
How do we derive the clustered variance estimator (i.e., the estimator that accommodates an arbitrary variance-covariance matrix)? First note that the conditional variance of β̂ is:

$$E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)' \mid X] = E[(X'X)^{-1} X' \varepsilon \varepsilon' X (X'X)^{-1} \mid X]$$

Under homoskedasticity and independence between observations, E[εε'|X] = σ²I, and the formula above collapses to σ²(X'X)^{−1}. Suppose that we relax these assumptions, however. Then we have:

$$E[(X'X)^{-1} X' \varepsilon \varepsilon' X (X'X)^{-1} \mid X] = (X'X)^{-1} X' E[\varepsilon \varepsilon' \mid X] X (X'X)^{-1}$$

Focus on the interior K × K matrix, X'E[εε'|X]X. The two outer K × K matrices are basically irrelevant since they are unaffected by assumptions about the residuals.
$$X'\varepsilon\varepsilon'X = \begin{pmatrix} x_1' & x_2' & \cdots & x_N' \end{pmatrix}
\begin{pmatrix} \varepsilon_1^2 & \varepsilon_1\varepsilon_2 & \cdots & \varepsilon_1\varepsilon_N \\ \varepsilon_1\varepsilon_2 & \varepsilon_2^2 & \cdots & \varepsilon_2\varepsilon_N \\ \vdots & & \ddots & \vdots \\ \varepsilon_1\varepsilon_N & \varepsilon_2\varepsilon_N & \cdots & \varepsilon_N^2 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}
= \begin{pmatrix} x_1' & \cdots & x_N' \end{pmatrix}
\begin{pmatrix} \sum_{j=1}^{N} x_j \varepsilon_j \varepsilon_1 \\ \sum_{j=1}^{N} x_j \varepsilon_j \varepsilon_2 \\ \vdots \\ \sum_{j=1}^{N} x_j \varepsilon_j \varepsilon_N \end{pmatrix}
= \sum_{i=1}^{N}\sum_{j=1}^{N} x_i' x_j \varepsilon_j \varepsilon_i.$$
Reinserting the conditional expectation gives us:

$$\sum_{i=1}^{N}\sum_{j=1}^{N} x_i' x_j E[\varepsilon_j \varepsilon_i \mid X].$$

The independence assumption allows us to ignore all terms with i ≠ j in the sum Σ_i Σ_j x_i' x_j E[ε_j ε_i | X], yielding Σ_i x_i' x_i E[ε_i² | X]. This in turn gives us the Huber-Eicker-White heteroskedasticity robust standard errors:

$$\left(\sum_{i=1}^{N} x_i' x_i\right)^{-1} \left(\sum_{i=1}^{N} x_i' x_i \hat{\varepsilon}_i^2\right) \left(\sum_{i=1}^{N} x_i' x_i\right)^{-1}.$$
These standard errors go to zero as N → ∞ because the two inverted matrices grow at a combined rate of N² while the interior matrix grows at a rate of only N.

Suppose, however, that we drop the independence assumption. The conditional variance of β̂ is:

$$\mathrm{Var}(\hat{\beta}) = \left(\sum_{i=1}^{N} x_i' x_i\right)^{-1} \left(\sum_{i=1}^{N}\sum_{j=1}^{N} x_i' x_j E[\varepsilon_j \varepsilon_i \mid X]\right) \left(\sum_{i=1}^{N} x_i' x_i\right)^{-1}$$

This expression presents two problems. First, this quantity need not converge to zero as N → ∞ because the interior matrix can now grow at up to N², potentially matching the growth rate of the two inverted matrices. Thus β̂ can converge very slowly (or not at all, in cases of extreme dependence). Second, the empirical analog of this quantity is fatally flawed.

Footnote 7: In other words, the estimated variance of √N·β̂ could go to infinity; we do not achieve root-N convergence rates.

Consider the empirical estimator for the interior matrix:

$$\sum_{i=1}^{N}\sum_{j=1}^{N} x_i' x_j \hat{\varepsilon}_j \hat{\varepsilon}_i = X'\hat{\varepsilon}\hat{\varepsilon}'X$$

The OLS residuals, ε̂, are constructed to be orthogonal to the regressors, X. Thus ε̂'X = 0 by construction, and an estimator of Var(β̂) based on Σ_i Σ_j x_i' x_j ε̂_j ε̂_i is guaranteed to equal zero.
Now consider a case in which we have G clusters. Within each cluster, we want to allow for dependence between observations of an arbitrary form, but we assume that observations in different clusters are independent. This assumption allows us to ignore all terms with i ≠ j in the sum Σ_i Σ_j x_i' x_j E[ε_j ε_i | X], as long as those terms are in different clusters. This gives us the following estimator for the expectation of Σ_i Σ_j x_i' x_j E[ε_j ε_i | X]. In this estimator, we first sum up all of the cross terms within a given cluster and then sum up over all of the clusters:

$$\sum_{g=1}^{G} \left(\sum_{s=1}^{T}\sum_{t=1}^{T} x_s' x_t \hat{\varepsilon}_s \hat{\varepsilon}_t\right) = \sum_{g=1}^{G} X_g' \hat{\varepsilon}_g \hat{\varepsilon}_g' X_g$$
We define $X_g$ and $\hat{\epsilon}_g$ as in Section 1.4.3. Because the regression coefficients are estimated for the entire sample, rather than for each cluster individually, we do not run into the problem that $\hat{\epsilon}_g' X_g = 0$ by construction. The equality in the expression above should be clear when you consider that we already showed $X'\hat{\epsilon}\hat{\epsilon}'X = \sum_{i=1}^{N}\sum_{j=1}^{N} x_i' x_j \hat{\epsilon}_j \hat{\epsilon}_i$ (replace N with T and i, j with s, t). Note also that we could easily accommodate clusters of varying sizes by indexing T as $T_g$.
The clustered variance estimator is thus:

$$\widehat{\mathrm{Var}}(\hat{\beta}) = \left(\sum_{g=1}^{G} X_g' X_g\right)^{-1}\left(\sum_{g=1}^{G} X_g' \hat{\epsilon}_g \hat{\epsilon}_g' X_g\right)\left(\sum_{g=1}^{G} X_g' X_g\right)^{-1}$$
Note that this estimator (and the variance of $\hat{\beta}$) goes to zero as $G \rightarrow \infty$. We should be careful, however, if we have very few clusters, because we may not get a very precise estimate of $E[X_g' \epsilon_g \epsilon_g' X_g]$. We discuss this issue in the next section.
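As a rough illustration of the estimator above, the sketch below loops over clusters and accumulates the $X_g'\hat{\epsilon}_g\hat{\epsilon}_g'X_g$ terms. The group labels and data are simulated for illustration only, and the finite-sample corrections that packaged routines typically apply are omitted.

```python
import numpy as np

def cluster_vcov(X, ehat, groups):
    """(X'X)^{-1} [sum_g X_g' e_g e_g' X_g] (X'X)^{-1}, with no finite-sample correction."""
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        Xg = X[groups == g]
        eg = ehat[groups == g]
        sg = Xg.T @ eg                 # K-vector "score" for cluster g
        meat += np.outer(sg, sg)       # X_g' e_g e_g' X_g
    return bread @ meat @ bread

# Illustrative data with a cluster-level error component (within-cluster dependence)
rng = np.random.default_rng(1)
G, T = 50, 20
groups = np.repeat(np.arange(G), T)
X = np.column_stack([np.ones(G * T), rng.normal(size=G * T)])
eps = rng.normal(size=G * T) + np.repeat(rng.normal(size=G), T)
y = X @ np.array([1.0, 0.5]) + eps

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
V_cl = cluster_vcov(X, y - X @ beta_hat, groups)
print(np.sqrt(np.diag(V_cl)))          # cluster-robust standard errors
```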
2.2 Rules for Clustering
Wooldridge sets out several rules of thumb for clustering as a function of the number of
clusters (G) and the number of observations per cluster (T).
1. Large G and Small $T_g$. This is basically the case set out in BDM (G is greater than 20, and $T_g$ is less than G). As demonstrated in BDM's simulations, the clustered standard errors perform well in this scenario.
2. Large G and Large $T_g$. You might think that problems would arise here because there are so many intra-cluster cross terms being estimated. However, for moderately large G (20? 50?), clustered standard errors appear to perform well even with large $T_g$.
3. Small G and Large $T_g$. If G is small (e.g., certainly if G < 10) and $T_g$ is relatively large, then clustered standard errors are unlikely to perform well. Donald and Lang (2007) suggest a method that is similar to the collapsing method discussed in Section 1.4.2: simply collapse all the data down to the cluster level and run a regression on the G observations at the cluster level (a minimal sketch of this collapsing step appears after this list). In a DD scenario, it may be necessary to collapse the data to the cluster-by-treatment level. Note that t-statistics for the collapsed data will have only $G - K$ degrees of freedom, highlighting the inference problem with small numbers of clusters (e.g., consider the Card and Krueger (1994) paper: the data there would be collapsed to only 4 cells, essentially making inference impossible).
4. Small G and Small $T_g$. This is a less challenging version of the Small G/Large $T_g$ scenario, so anything that works there should work here.
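Here is a minimal sketch of the Donald and Lang style collapse referenced in point 3: average the data within each cluster and run OLS on the G collapsed observations. The treatment assignment, effect size, and cluster counts are invented for illustration and are not from the notes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative micro data: G clusters, T observations each, treatment varies at the cluster level
G, T = 8, 200
groups = np.repeat(np.arange(G), T)
D = np.repeat((np.arange(G) < G // 2).astype(float), T)         # cluster-level treatment
y = 0.2 * D + np.repeat(rng.normal(size=G), T) + rng.normal(size=G * T)

# Collapse to G cluster means and run OLS on the collapsed data
y_bar = np.array([y[groups == g].mean() for g in range(G)])
D_bar = np.array([D[groups == g].mean() for g in range(G)])
Xc = np.column_stack([np.ones(G), D_bar])

beta = np.linalg.solve(Xc.T @ Xc, Xc.T @ y_bar)
resid = y_bar - Xc @ beta
s2 = resid @ resid / (G - Xc.shape[1])                          # G - K degrees of freedom
se = np.sqrt(np.diag(s2 * np.linalg.inv(Xc.T @ Xc)))
print(beta[1], se[1], beta[1] / se[1])                          # compare the t-stat to a t(G - K)
```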
2.3 Multi-way Clustering
We have so far assumed that there is only one unit of clustering; e.g., individuals (or time
periods) are correlated within states but independent across states. But what if there are
multiple levels of clustering? One possibility involves multiple levels of clustering that are
nested within each other. For example, a panel data set might contain individuals living in
cities nested within states with treatments that vary at the state level. In this scenario, the
solution is to just cluster at the highest level (the state, in the example just given). Since the
clustered variance estimator accommodates an arbitrary variance-covariance matrix within
each cluster, it is robust to the presence of sub-clusters within each cluster. Or, to put it
another way, we only need independence across clusters, and we get that by clustering at
the highest level.
If there are two levels of clustering that are not nested, however, then you may need to
adjust for multi-way clustering. For example, consider a state-by-year panel data set. Our
typical concern is that observations within a given state are correlated over time. However,
it is also possible that there could be dependence between states within a given year; e.g.,
a hurricane might affect multiple southeastern states in one year. If the treatment also has
a geographic correlation, then we will want to cluster by year as well as clustering by state.
Constructing a cluster that contains all years and states, however, will result in having only
one cluster in the data set, and inference will be impossible.
Cameron, Gelbach, and Miller (2006) propose a simple way to accommodate multi-way clustering with non-nested clusters. Suppose that there are two dimensions of non-nested clustering, state (s) and year (t). The Cameron, et al. procedure requires the computation of three variance-covariance matrices for the estimator, $\hat{\beta}$. The first, $\mathrm{Var}_s(\hat{\beta})$, is clustered at the state level. The second, $\mathrm{Var}_t(\hat{\beta})$, is clustered at the year level. The third, $\mathrm{Var}_{st}(\hat{\beta})$, is clustered at the state-by-year level, i.e., the intersection of the two one-way levels. The multi-way clustered standard errors are then calculated as the sum of the first two variance-covariance matrices minus the third variance-covariance matrix:

$$\mathrm{Var}(\hat{\beta}) = \mathrm{Var}_s(\hat{\beta}) + \mathrm{Var}_t(\hat{\beta}) - \mathrm{Var}_{st}(\hat{\beta})$$
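A sketch of this three-matrix combination appears below, reusing a simple one-way clustered variance helper. The state-by-year panel is simulated for illustration and the usual finite-sample adjustments are ignored; with one observation per state-year cell, the intersection term reduces to the heteroskedasticity-robust matrix.

```python
import numpy as np

def cluster_vcov(X, ehat, groups):
    """One-way cluster-robust vcov: (X'X)^{-1} [sum_g X_g'e_g e_g'X_g] (X'X)^{-1}."""
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        sg = X[groups == g].T @ ehat[groups == g]
        meat += np.outer(sg, sg)
    return bread @ meat @ bread

# Illustrative state-by-year panel
rng = np.random.default_rng(3)
S, Y = 40, 20
state = np.repeat(np.arange(S), Y)
year = np.tile(np.arange(Y), S)
X = np.column_stack([np.ones(S * Y), rng.normal(size=S * Y)])
y = X @ np.array([1.0, 0.3]) + rng.normal(size=S * Y)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
ehat = y - X @ beta_hat

V_s = cluster_vcov(X, ehat, state)                  # clustered by state
V_t = cluster_vcov(X, ehat, year)                   # clustered by year
V_st = cluster_vcov(X, ehat, state * Y + year)      # clustered by state-by-year cell
V_two_way = V_s + V_t - V_st
print(np.sqrt(np.diag(V_two_way)))                  # two-way clustered standard errors
```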
The key assumption in multi-way clustering is that observations that differ in both s and t are independent of each other; this is the analog of the assumption in one-way clustering that observations with different g (i.e., different s in the state panel data case) are independent of each other.[8] Observations that share either s or t, however, may be arbitrarily correlated with each other.
One theoretical issue that Cameron, et al. claim rarely occurs in practice is that the formula above could give negative estimates for one or more diagonal entries. If this occurs, they recommend simply choosing the maximum standard errors obtained from one-way clustering along each of the cluster dimensions. The procedure also generalizes to three-way clustering using an analogous formula: three one-way clustered matrices enter positively, three two-way clustered matrices enter negatively, and one three-way clustered matrix enters positively.[9]
Software to implement the multi-way clustered estimator in Stata is available on Doug Miller's website.

[8] Whether this assumption will hold in practice depends on the context. It would appear to rule out, for example, common shocks to multiple states that persist over several years.
[9] You can easily figure out this formula by drawing a Venn diagram.
2.4 Bootstrap Based Improvements
Cameron, Gelbach, and Miller (2007) propose bootstrap based improvements to the clustered
standard errors that should improve performance when using a small number of clusters (i.e.,
small G). We will discuss this technique when we cover bootstrapping.
3 Additional References
Donald, S. and K. Lang. Inference with Difference-in-Differences and Other Panel Data.
Review of Economics and Statistics, 2007, 89, 221-233.
M. Anderson, Lecture Notes 14, ARE 213 1
ARE 213 Applied Econometrics
Fall 2012 UC Berkeley Department of Agricultural and Resource Economics
Statistical Inference:
Part 2, Randomization Inference
All of the statistical tests that we have discussed so far have relied upon precise distributional assumptions (for small sample inference) or asymptotic theory and large sample approximations. Suppose, however, that we are unwilling to make distributional assumptions and that our sample is small, so that we are unsure whether it is safe to apply asymptotic approximations.[1] In that case we may want to apply a nonparametric test that relies neither on distributional assumptions nor on asymptotic theory. In these notes, we consider one such class of nonparametric tests: randomization inference, or permutation tests.
1 The Lady Tasting Tea and Fisher's Exact Test
Fisher (1935) presents a statistical test of a lady's claim that she can discriminate whether milk was added prior to the tea or after the tea simply by tasting a cup of tea.[2] The
experiment consists of mixing eight cups of tea, four of which have had the milk added
before the tea, and four of which have had the milk added after the tea. The lady is told
of the experimental design and knows that there are exactly four cups of each type. She is
instructed to taste each cup of tea, which are presented to her in random order, and to place
them into two groups of four (milk before and milk after).
[1] In practice, the central limit theorem takes hold shockingly quickly for most distributions; e.g., by the time you have a dozen observations, the distribution of the mean strongly resembles a normal distribution. In principle, however, you might have a really bizarre distribution. More importantly, randomization inference can be useful in cases with clustering.
[2] This episode is apparently based on an historical event involving R.A. Fisher. The lady in question reportedly passed the test with flying colors; the explanation is that pouring hot tea into cold milk causes the milk to curdle, but pouring cold milk into hot tea does not.

To conduct the statistical test, note that there are 70 ways to choose 4 objects out of 8, assuming that order does not matter (which, in this case, it does not). Formally,

$$\binom{8}{4} = \frac{8!}{4!\,4!} = \frac{8 \cdot 7 \cdot 6 \cdot 5}{4 \cdot 3 \cdot 2} = 70.$$

Thus, there are 70 ways in which the lady could potentially divide the 8 cups of tea. If the lady has no ability to discriminate between early-milk and late-milk cups, which constitutes our null hypothesis, then the probability of dividing them such that all four early-milk cups end up in the early-milk group is 1 in 70, or 0.014.[3]
This result would be statistically significant at conventional levels. Now consider the probability that three or more early-milk cups end up in the early-milk group. There are $\binom{4}{3}$ ways to place 3 early-milk cups in the early-milk group, and $\binom{4}{1}$ ways to place one early-milk cup in the late-milk group. Hence there are 16 ways for three early-milk cups to end up in the early-milk group, plus the one way that four early-milk cups can end up in the early-milk group. The probability that three or more early-milk cups end up in the early-milk group is thus 17/70 = 0.24.[4] This result is not statistically significant at conventional levels, so the lady must place all four early-milk cups in the early-milk group in order to prove to us beyond a reasonable doubt that she can discriminate between early-milk and late-milk cups.[5]
Note that this test makes absolutely no assumptions about the parametric distribution of residuals (we didn't even define a random variable $\epsilon_i$) or the independence of various cups of tea. In fact, it may well be the case that cups are correlated with each other in that some came from one pot of tea while the rest came from another pot. But this is irrelevant from our perspective: all that matters is that the cups were sorted randomly before the decision to add the milk before or after the tea was made. Nor does it matter what strange form the distribution of tea taste may take. The key insight is that we take the outcomes for what they are under the null hypothesis and then map out the distribution of the test statistic, the number of cups correctly classified, that arises due to the randomization procedure. That is
to say, the variability in the test statistic under the null hypothesis comes from the random assignment mechanism itself rather than sampling variability that arises because we do not observe the entire population of cups of tea.

[3] Formally, there are 4 choose 4 (1) ways to place 4 early-milk cups in the early-milk group and 4 choose 0 (1) ways to place 0 early-milk cups in the late-milk group.
[4] We could likewise calculate the probability of getting two cups right and two cups wrong as 4 choose 2 ways to place two early-milk cups in the early-milk group and 4 choose 2 ways to place two early-milk cups in the late-milk group.
[5] Alternatively, if she fails but shows some promise, e.g., places 3 out of 4 correctly, then we may rerun the experiment with a larger number of cups, increasing the power of the experiment.
This test generalizes to a permutation test known as Fisher's Exact Test. This nonparametric test can be applied in any scenario in which there is a binary outcome and a binary treatment.[6] Let N be the number of observations. In general the test looks like:

                 Treated   Control   Row Total
  High            N_TH      N_CH       N_H
  Low             N_TL      N_CL       N_L
  Column Total    N_T       N_C
The probability of observing any realization of this table is

$$p = \frac{\binom{N_H}{N_{TH}}\binom{N_L}{N_{TL}}}{\binom{N}{N_T}}.$$

Thus, if we want to compute whether the realization we observe is too improbable to be due to chance, we simply calculate p for the realization we observed and all realizations that are more extreme (generally, less probabilistic) than the one we observed.[7] We then sum these probabilities to get a p-value.
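The counting argument above is easy to check numerically. The short sketch below uses Python's math.comb to compute the table probability and reproduce the 1/70 and 17/70 figures from the tea experiment; the function name and the margins passed to it are just for illustration.

```python
from math import comb

def table_prob(n_th, n_h, n_l, n_t):
    """P(N_TH = n_th) with all margins fixed: C(N_H, N_TH) C(N_L, N_TL) / C(N, N_T)."""
    n = n_h + n_l
    n_tl = n_t - n_th
    return comb(n_h, n_th) * comb(n_l, n_tl) / comb(n, n_t)

# Lady-tasting-tea margins: 4 early-milk cups, 4 late-milk cups, 4 guessed "early-milk"
probs = {k: table_prob(k, 4, 4, 4) for k in range(5)}
print(probs[4])               # 1/70 ~ 0.014: all four classified correctly
print(probs[4] + probs[3])    # 17/70 ~ 0.24: three or more classified correctly
# A one-sided exact p-value sums the probabilities of the observed table and anything more extreme
```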
For a variety of reasons, Fisher's Exact Test is not often applied in economics.[8] Nevertheless, it is very useful in demonstrating the advantages of randomization tests. Because randomization forms the basis for inference, no distributional or independence assumptions are necessary, nor do we need to use asymptotic approximations. The only thing we need to do is model the assignment procedure correctly.
[6] In the tea case the treatment might be defined as "early-milk" while the outcome is defined as "classified as early-milk."
[7] If we expect that the treatment increases the probability of a high outcome, then higher values of $N_{TH}$ correspond to more extreme realizations.
[8] Among the reasons: the outcomes must be binary, and many times they are not. Furthermore, if the outcomes are binary, then the CLT will take hold very quickly, so an exact test is unnecessary unless the samples are much smaller than is typical in economics. Finally, the exact test does not accommodate covariates.
2 Randomization Tests
Anderson (2008) applies a randomization test in the context of randomized trials of preschool
programs. The samples in this study are as small as a dozen individuals (the control group
for the smallest study); thus there is some concern that asymptotic theory may not apply.
Because we know the assignment procedure (children were randomly assigned to treatment
and control), it is easy to simulate the distribution of the test statistic (the difference in
means divided by its standard error) under the null hypothesis.
The intuition behind the test is as follows. The preschool children, who were recruited
from at-risk families in Ypsilanti, MI during the 1960s, cannot reasonably be thought of as a
random sample from any larger population. In fact, we observe the entire population of at-
risk children recruited for the Perry Preschool Program in Ypsilanti from 1962 to 1967. But
that does not mean that any observed difference between the treatment and control groups must represent a treatment effect. Even with random assignment, the treatment and control groups will not be perfectly balanced. The question is whether the observed differences could reasonably be due to chance (randomness in the assignment procedure) or whether they are large enough to represent a true treatment effect.
What we would like to do is run the experiment thousands of times under the null hypothesis of no treatment effect and record the distribution of our estimator through all of these runs. To achieve this, we impose the null hypothesis, i.e., we assume that $Y_i(1) = Y_i(0)$ for all individuals. Under this hypothesis, we can simulate the experiment as many times as we would like by randomly assigning placebo treatment indicators and recording the difference in means between "treated" and "control" groups. Using these results, we can map out the null distribution of our test statistic. We can then compare the observed test statistic from the real data to this null distribution. Note that we make no distributional assumptions at all; the variation in the null distribution of the test statistic arises from the randomization procedure which we, the experimenters, designed (or, in this case, we know how it was implemented).
For a given sample size N, the procedure is implemented as follows:

1. Draw binary treatment assignments $Z_i^*$ from the empirical distribution of the original treatment assignments without replacement.

2. Calculate the t-statistic for the difference in means between treated and untreated groups.

3. Repeat the procedure 100,000 times and compute the frequency with which the simulated t-statistics (which have expectation zero by design) exceed the observed t-statistic.
If only a small fraction of the simulated t-statistics exceed the observed t-statistic, reject the null hypothesis of no treatment effect. This procedure tests the sharp null hypothesis of no treatment effect, so rejection implies that the treatment has some distributional effect.
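A compact sketch of the three-step procedure just described is given below, using simulated data and fewer draws than the 100,000 mentioned above; the effect size, sample size, and seed are arbitrary. Many implementations also add one to the numerator and denominator of the p-value so that the observed assignment counts as one permutation.

```python
import numpy as np

rng = np.random.default_rng(4)

def t_stat(y, z):
    """Difference in means divided by its (unequal-variance) standard error."""
    y1, y0 = y[z == 1], y[z == 0]
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return (y1.mean() - y0.mean()) / se

# Illustrative small experiment: 40 children, half randomly assigned to treatment
z = np.array([1] * 20 + [0] * 20)
y = rng.normal(size=40) + 0.8 * z

t_obs = t_stat(y, z)

# Permute the treatment labels (draws without replacement) under the sharp null
B = 10000
t_null = np.array([t_stat(y, rng.permutation(z)) for _ in range(B)])
p_value = np.mean(np.abs(t_null) >= np.abs(t_obs))
print(t_obs, p_value)
```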
Formally, only two assumptions are required:
1. Random Assignment: Let $Y_i(0)$ be the outcome for individual i when untreated and $Y_i(1)$ be the outcome for individual i when treated (we only observe either $Y_i(0)$ or $Y_i(1)$). Random assignment implies $\{Y_i(0), Y_i(1)\} \perp Z_i$.

2. No Treatment Effect: $Y_i(0) = Y_i(1) \;\forall\; i$
Note that no assumptions regarding the distributions or independence of potential out-
comes are needed. This is because the randomized design itself is the basis for inference
(Fisher 1935), and pre-existing clusters cannot be positively correlated with the treatment
assignments in any systematic way. Even if the potential outcomes are fixed, the test statistic will still have a null distribution induced by the random assignment. Since the researcher knows the design of the assignment, it is always possible to reconstruct this distribution under the null hypothesis of no treatment effect, at least by simulation if not analytically.
Thus, this test always controls Type I error at the desired level (Rosenbaum 2007).
For binary $Y_i$, this test generally converges to Fisher's Exact Test. However, it differs slightly from Fisher's Exact Test in that Fisher's test rejects for small p-values while this test rejects for large t-statistics. This test is also similar to bootstrapping under the assumption of no treatment effect (Simon 1997); the only difference is that the resampling is done without replacement rather than with replacement. This highlights the fact that the variance in the test statistic's null distribution arises from the randomization procedure itself rather than from unknown variability in the potential outcomes.

In the paper, the randomization test produced p-values that were relatively close to those from standard t-statistics; the CLT takes hold quite quickly. Nevertheless, randomization tests can be useful in addressing doubts from people who find it hard to believe that you can definitely disprove a null hypothesis with only 30 or 40 observations. They can also be useful in challenging situations involving clustering. It is important to keep in mind, however, that they only apply to testing the null hypothesis. You should not use them to test alternative hypotheses or create confidence intervals because the null distribution is just that, the null distribution (it's not the alternative distribution).
3 Randomization Tests with Clustering
The nice thing about randomization tests is that they are immune to clustering issues as
long as the randomization correctly forms the basis for inference. In other words, as long
as you model the randomization process correctly, you are guaranteed to have a valid test
of the null hypothesis. It doesn't matter if the potential outcomes are dependent or even if they are fixed numbers, because the null distribution of the test statistic is assumed to
arise from the randomization in the assignment procedure. Thus, as long as you can model
the randomization in the assignment procedure correctly, you can conduct a test of the null
hypothesis that will have the correct size.
Consider first our case with the preschool projects. Obviously the potential outcomes are correlated across different students within the study. For example, some students will attend
Elementary School A, while others attend Elementary School B. The students attending
School A will experience common shocks that the students in School B do not, and vice versa.
None of this, however, affects the validity of our randomization test. One way to see this is to note that the treatment is randomly assigned, i.e., there is no serial correlation in $D_i$. We know from our clustering lectures that serial correlation in outcomes is not a problem unless it is accompanied by serial correlation in the treatment variable. Alternatively, however, you could simply note that we know exactly how the treatment was assigned (i.e., randomly by individual), and once we know that we don't have to worry about the distributions of the potential outcomes. Under the null hypothesis, there is no missing data problem ($Y_i(0) = Y_i(1)$ for all units), and we can construct all the counterfactual estimates of $\hat{\beta}$ under different distributions of the treatment assignment that could have arisen in alternative universes.
Now consider a more challenging case, such as a diffs-in-diffs study in which one state is affected by a treatment and a group of comparison states is not. The test that Abadie, et al. (2007) propose in this setting is a randomization test. We know how the assignment mechanism works in this case: it is turned on indefinitely for one state at some point in time and never turns on for the other states. Under the null hypothesis we observe all of the untreated outcomes for every state. We can test whether the effect we observe for
the treated state (e.g., California in the Abadie, et al. example) is large or small compared
to other realizations of the test statistic under other potential treatment assignments. The
important thing is that we model the treatment assignment correctly, which in this case
means that we turn it on at a given point for the treated state and then leave it on. If we
randomly switched the treatment on and off for a given state, we would not be reproducing
the original assignment procedure. In this case, we would tend to over-reject because we
are assuming that the treatment was randomly assigned within a state (i.e., the treatment
was not serially correlated) when in fact it was randomly assigned across states but serially
correlated within states.
Finally, consider a case in which it is impossible to cluster the standard errors using the
standard panel techniques. Aker (2008) tests the impact of the introduction of cell phones
on grain markets in Niger. She uses monthly price data on 31 markets over several years;
from these data she constructs 433 market pairs. She then examines the effect of a cell phone dummy, equal to unity if both market pairs have cell phone coverage and zero otherwise, on price dispersion between markets using a diffs-in-diffs type strategy. In a normal panel setting we might cluster at the market-pair level, but in fact it is impossible to construct two or more independent clusters using the market-pair data.[9] One solution is to try to implement the multi-way clustering technique, clustering at both the year level and the market-pair level; however, the assumption that any common shock to market-pair 1-2 and market-pair 1-3 does not persist over time seems unlikely.[10]

Alternatively, we can implement a randomization test of the null hypothesis in which we assign placebo cellular towers to markets in random order and simulate the treatment effect hundreds of times. We build cellular towers in the same way that they are built in the data (i.e., there are no cellular towers initially, then one is built in an initial market and persists indefinitely, then a second is built in another market and persists indefinitely, etc.) and estimate the test statistic of interest (e.g., a t-statistic for a regression coefficient) for each simulation. This should give us an accurate test of the null hypothesis without requiring unknown and/or unrealistic assumptions on the structure of the variance-covariance matrix. Inference is based on the structure of the treatment rollout, combined with the null hypothesis of no treatment effect.

[9] For example, consider clustering at the market level. The first cluster contains all market pairs with market 1. But by definition it also contains pairs with markets from every other market cluster, so we can't construct non-overlapping clusters.
[10] Remember, multi-way clustering assumes that if two observations are different along both of the cluster dimensions, then there is zero correlation. In this case, that implies that if you are looking at two different market pairs in two different time periods, they should have zero correlation.
4 Additional References
Aker, J. Does Digital Divide or Provide? The Impact of Cell Phones on Grain Markets in
Niger. Mimeo, UC Berkeley ARE, 2008.
Fisher, R.A. The Design of Experiments, 1935, Oliver and Boyd: Edinburgh and London.
M. Anderson, Lecture Notes 15, ARE 213 1
ARE 213 Applied Econometrics
Fall 2012 UC Berkeley Department of Agricultural and Resource Economics
Statistical Inference:
Part 3, The Bootstrap
Randomization tests are useful for testing a null hypothesis without making parametric
assumptions. We may take the nonparametric route because we want to be robust to distributional assumptions, or we may take the nonparametric route because we cannot (or do not want to) calculate the finite sample (or even asymptotic) properties of our estimator.[1] But what if we want a resampling based procedure that can produce confidence intervals rather than just testing the null hypothesis? Then we turn to bootstrapping.

[1] Computing time is cheap and getting cheaper; human time is expensive and not getting cheaper.
1 Bootstrapping the Mean
We begin by considering a simple example in which we want to estimate the variance of
the sample mean, $\hat{\theta} = \bar{y}$. We view each observation, $y_i$, as a random draw from a larger population. One way to estimate the variance of $\bar{y}$ is to assume that $y_i \sim N(\mu, \sigma^2)$, in which case we can show that $\bar{y} \sim N(\mu, \sigma^2/N)$. Alternatively, we can relax the normality assumption and apply the Central Limit Theorem and the Law of Large Numbers to show that $\sqrt{N}(\bar{y} - \mu) \rightarrow N(0, \sigma^2)$ asymptotically. Suppose, however, that we did not know these things (which can be the case with more exotic estimators). How might we estimate the variance of $\hat{\theta} = \bar{y}$?
One way would be to use the sample analog of the variance. No, not the sample analog $\widehat{\mathrm{Var}}(\hat{\theta}) = \widehat{\mathrm{Var}}(\bar{y}) = (1/N)\sum_i (y_i - \bar{y})^2 / N$, which still relies on the formula $\mathrm{Var}(A + B) = \mathrm{Var}(A) + \mathrm{Var}(B) + 2\,\mathrm{Cov}(A, B)$. Rather, we could randomly draw S samples of size N from the population and compute S estimates of $\hat{\theta}_s = \bar{y}_s$ for $s = 1, ..., S$. Then estimate

$$\widehat{\mathrm{Var}}(\hat{\theta}) = \frac{1}{S - 1}\sum_{s=1}^{S}(\hat{\theta}_s - \bar{\theta})^2$$

where $\bar{\theta} = \frac{1}{S}\sum_{s=1}^{S}\hat{\theta}_s$. Note that this is the conventional estimator we use for the standard deviation of a random variable; the difference is that the unit of observation is now the sample rather than the individual observation.
In practice, of course, we only have one sample at our disposal, not S samples. Nevertheless, we can estimate the population distribution of $y_i$ using the empirical distribution of $y_i$ in our sample. We do this by randomly drawing N observations from our data set with replacement.[2] Each of these draws of N observations constitutes a single bootstrap sample, b. For a given bootstrap sample, we calculate the mean of the bootstrapped observations, $\bar{y}_b$. Repeating this procedure B times produces B estimates of the statistic $\hat{\theta}_b = \bar{y}_b$, $b = 1, ..., B$. We then estimate the variance of $\hat{\theta} = \bar{y}$ as

$$\widehat{\mathrm{Var}}(\hat{\theta}) = \frac{1}{B - 1}\sum_{b=1}^{B}(\hat{\theta}_b - \bar{\theta})^2$$

where $\bar{\theta} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}_b$. This is an example of a nonparametric bootstrap; it is nonparametric in that it makes no distributional assumptions, though it still relies on independence between observations.
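A minimal sketch of this nonparametric bootstrap of the mean appears below; the exponential population, sample size, and number of bootstrap draws are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative sample of N observations from some (possibly non-normal) population
y = rng.exponential(scale=2.0, size=50)
N = len(y)

B = 2000
theta_b = np.empty(B)
for b in range(B):
    yb = rng.choice(y, size=N, replace=True)   # one bootstrap sample, drawn with replacement
    theta_b[b] = yb.mean()

var_boot = theta_b.var(ddof=1)                 # (1/(B-1)) sum_b (theta_b - theta_bar)^2
print(var_boot, y.var(ddof=1) / N)             # compare to the usual analytic estimate
```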
We have motivated the bootstrap as a method of estimating the variance of an estimator when no analytic estimate is available. In practice, that is what it is generally used for. Nevertheless, for asymptotically pivotal test statistics, the bootstrap can also be used to improve upon the first-order asymptotic approximation of $\mathrm{Var}(\hat{\theta})$.[3] Applied researchers rarely use it for that purpose, however, and, with the exception of the clustered case, we will not devote much discussion to it here.

[2] After reading through the procedure, it should be obvious why we sample with replacement. If we sampled without replacement, $\hat{\theta}_b$ would be identical for every bootstrap sample.
[3] An asymptotically pivotal test statistic is one whose asymptotic distribution does not depend on any unknown parameters. For example, $\hat{\theta} = \bar{y}$ is not asymptotically pivotal because, even assuming the null hypothesis $E[y_i] = \theta_0$, it still depends on the unknown parameter $\sigma$. The t-statistic, $t = (\hat{\theta} - \theta_0)/s_{\hat{\theta}}$, however, is asymptotically pivotal.
2 General Bootstrapping Procedure
For a general sample containing observations $w_1, w_2, ..., w_N$, suppose that we are interested in estimating the distribution of some statistic $\hat{\theta}(w_1, w_2, ..., w_N)$. A general bootstrapping procedure is implemented as follows:
1. Using the original sample $w_1, w_2, ..., w_N$, draw a bootstrap sample with replacement using one of the methods discussed in Sections 2.1, 2.2, or 3. Call this sample $w_1^*, w_2^*, ..., w_N^*$.

2. Compute the statistic of interest, $\hat{\theta}$, using the bootstrap sample; call the resulting estimate $\hat{\theta}^*$. Note that $\hat{\theta}$ could be a coefficient, a standard error, or a test statistic.

3. Repeat the first two steps B times, collecting B iterations of the statistic, $\hat{\theta}_1^*, ..., \hat{\theta}_B^*$.
Once you have generated the B bootstrapped values of the statistic, there are a variety of things you can do with them. The most likely candidates are to compute the sample variance of the statistic and/or to construct confidence intervals for the statistic's estimand. To compute the bootstrapped sample variance of the statistic, use the formula that we saw in Section 1:

$$\widehat{\mathrm{Var}}(\hat{\theta}) = \frac{1}{B - 1}\sum_{b=1}^{B}(\hat{\theta}_b^* - \bar{\theta}^*)^2$$
To compute a bootstrapped 95% confidence interval for $\theta$, find the 2.5 and 97.5 percentiles of $\hat{\theta}_1^*, ..., \hat{\theta}_B^*$ and use those two values as the lower and upper bounds of the confidence interval. To estimate the distribution of the t-statistic for some estimator $\hat{\theta}$, you could let $t^* = (\hat{\theta}^* - \hat{\theta})/s_{\hat{\theta}}^*$.[4] You could then define a rejection region for the t-statistic $t_{\hat{\theta}}$ from the original data using the 95th percentile of $|t_1^*|, ..., |t_B^*|$ (i.e., reject if $|t_{\hat{\theta}}|$ is greater than the 95th percentile of $|t_1^*|, ..., |t_B^*|$). Note that $\hat{\theta}^*$ is centered at its expectation in the empirical distribution, so we are effectively bootstrapping the t-statistic's distribution under the null hypothesis. The advantage of bootstrapping the coefficient's t-statistic rather than the coefficient itself ($\hat{\theta}^*$) is that we may gain the asymptotic refinements mentioned in Section 1.

[4] $s_{\hat{\theta}}^*$ is calculated the standard way, but using the bootstrap sample data instead of the original data.
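Here is a sketch of bootstrapping the t-statistic for a sample mean along the lines just described; the null value, the data, and the number of draws are hypothetical, and the bootstrap statistics are centered at $\hat{\theta}$ as in the text.

```python
import numpy as np

rng = np.random.default_rng(6)

y = rng.normal(loc=0.4, scale=1.0, size=30)    # illustrative data
N = len(y)
theta_hat = y.mean()
s_hat = y.std(ddof=1) / np.sqrt(N)

theta_0 = 0.0                                  # hypothetical null value
t_obs = (theta_hat - theta_0) / s_hat

B = 5000
t_star = np.empty(B)
for b in range(B):
    yb = rng.choice(y, size=N, replace=True)
    s_b = yb.std(ddof=1) / np.sqrt(N)
    t_star[b] = (yb.mean() - theta_hat) / s_b  # centered at theta_hat, not theta_0

crit = np.percentile(np.abs(t_star), 95)       # bootstrap critical value
print(t_obs, crit, np.abs(t_obs) > crit)       # reject if |t_obs| exceeds it
```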
One issue is how large to set B. In most cases, several hundred iterations is sufficient,
but with cheap computing time you might as well do several thousand unless your estimator
is very computationally intensive.
2.1 The Paired Bootstrap
We defined a general bootstrapping procedure in Section 2, but we did not specify how to actually generate the bootstrap samples. The first, and most common, method that we consider is the nonparametric bootstrap, also known as the paired bootstrap.

Suppose that the data $w_1, w_2, ..., w_N$ consist of N pairs of observations, $w_i = (y_i, x_i)$, where $y_i$ is the dependent variable and $x_i$ contains the explanatory variables. The paired bootstrap draws pairs $(y_i^*, x_i^*)$ from the empirical distribution of $w_i$ with replacement. In other words, for any given draw, both $y_i^*$ and $x_i^*$ come from the same observation. Thus the relationship between $x^*$ and $y^*$ is determined by the data rather than by any parametric assumptions (hence the name "nonparametric bootstrap"). Note that this method can easily be applied to nonlinear estimators as well as linear estimators.[5]
Although the paired bootstrap is nonparametric, it still requires the assumption of independence between observations. It randomly samples from the empirical distribution; if this random sampling assumption is unjustified, then the bootstrap confidence intervals may be too narrow.

[5] Bootstrapping wouldn't be that useful if we couldn't apply it to nonlinear estimators; we already know how to calculate the standard errors for linear estimators.
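A sketch of the paired bootstrap for an OLS slope, with a percentile confidence interval, is given below; the data-generating process (with heteroskedastic errors) and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative (y_i, x_i) pairs
N = 200
x = rng.normal(size=N)
y = 1.0 + 0.5 * x + rng.normal(size=N) * (1 + np.abs(x))   # heteroskedastic errors

def ols_slope(y, x):
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.solve(X.T @ X, X.T @ y)[1]

B = 2000
slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, N, size=N)       # resample whole (y_i, x_i) pairs with replacement
    slopes[b] = ols_slope(y[idx], x[idx])

ci = np.percentile(slopes, [2.5, 97.5])    # percentile confidence interval for the slope
print(ols_slope(y, x), ci)
```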
2.2 The Residual and Parametric Bootstraps
If we are willing to make additional assumptions about the data generating process, we can improve the approximation that the bootstrap provides. For example, consider a regression model of the form $y_i = g(x_i, \beta) + \epsilon_i$. After estimating $\hat{\beta}$, we can use this estimate to form the residuals, $\hat{\epsilon}_i = y_i - g(x_i, \hat{\beta})$. Note that $g(.)$ could be linear (e.g., ordinary least squares) or nonlinear (e.g., nonlinear least squares). We can then perform a residual bootstrap by constructing a sample consisting of $(y_1^*, x_1), ..., (y_N^*, x_N)$, where $y_i^* = g(x_i, \hat{\beta}) + \hat{\epsilon}_i^*$, and $\hat{\epsilon}_i^*$ is resampled with replacement from $\hat{\epsilon}_1, ..., \hat{\epsilon}_N$. We call this the residual bootstrap because it resamples residuals, which are randomly assigned to some $x_i$ and used to construct $y_i^*$.[6]
The benefit of residual bootstrapping is a potentially improved approximation. The drawback, however, is a loss of robustness. For example, suppose that there is heteroskedasticity. Because the residual bootstrap randomly assigns $\hat{\epsilon}_i^*$ to $x_i$, there is no heteroskedasticity in the bootstrap samples ($x_i$ cannot predict the variance of $\hat{\epsilon}_i^*$). Hence residual bootstrapping is not robust to heteroskedasticity.
The parametric bootstrap goes even further than the residual bootstrap in incorporating a priori information. Suppose we assume that the conditional distribution of y, $y_i \sim f(x_i, \theta)$, is known up to the parameter $\theta$. We estimate $\hat{\theta}$ via maximum likelihood estimation (or some other method). We perform a parametric bootstrap by constructing a sample consisting of $(y_1^*, x_1), ..., (y_N^*, x_N)$, where $y_i^*$ is randomly generated by the distribution $f(x_i, \hat{\theta})$.[7]
A simple example of the parametric bootstrap may make things clearer. Consider the ordinary linear regression model combined with the assumption that $\epsilon_i \sim N(0, \sigma^2)$. In this case, $y_i \sim N(x_i\beta, \sigma^2)$. Using the OLS estimates of $\beta$ and $\sigma^2$, we randomly draw a $y_i^*$ from the $N(x_i\hat{\beta}, \hat{\sigma}^2)$ distribution for each $x_i$. Our bootstrap sample is then $(y_1^*, x_1), ..., (y_N^*, x_N)$. We repeat this B times to construct B bootstrap samples.
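The following is a sketch of exactly this normal-errors parametric bootstrap; the coefficients, error scale, and sample size are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)

# Illustrative data for y_i = x_i * beta + eps_i with normal errors
N = 150
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.7, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - X.shape[1]))

B = 2000
betas = np.empty((B, 2))
for b in range(B):
    # Draw y* from the estimated N(x_i beta_hat, sigma_hat^2) distribution for each x_i
    y_star = X @ beta_hat + rng.normal(scale=sigma_hat, size=N)
    betas[b] = np.linalg.solve(X.T @ X, X.T @ y_star)

print(betas.std(axis=0, ddof=1))   # parametric bootstrap standard errors for beta_hat
```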
The advantages and disadvantages of the parametric bootstrap are similar to those of the residual bootstrap (better approximation vs. less robustness), only amplified. Since we are generally more concerned about robustness than about (generally marginal) efficiency improvements, the residual and parametric bootstraps are not often used by applied researchers.

[6] Sometimes $x_i$ is also resampled before $\hat{\epsilon}_i^*$ is resampled; it shouldn't make a big difference either way.
[7] Alternatively, we could also resample $x_i$ before generating the $y_i^*$.
3 The Bootstrap and Clustering
All of the bootstrap procedures discussed above assume independence between observations.
Often we would like to apply the bootstrap in the context of clustered data, however, because
for non-standard estimators it may be inconvenient or infeasible to compute asymptotic
standard errors that are cluster robust. One possibility is to use the cluster bootstrap, also
known as the block bootstrap. The essential idea here is that we resample at the cluster
level rather than the observation level.
Suppose that we have G clusters, each containing T observations, and that G is relatively large. We represent the data as $w_1, w_2, ..., w_G$, where $w_g = [(y_{g1}, x_{g1}), ..., (y_{gT}, x_{gT})]$. The cluster bootstrap draws G clusters from the empirical distribution of $w_g$ with replacement, where each cluster consists of the elements $w_g^* = [(y_{g1}, x_{g1}), ..., (y_{gT}, x_{gT})]$. Thus the procedure is similar to the paired bootstrap except that we are resampling clusters instead of resampling individual observations. We repeat this procedure B times to obtain B bootstrap samples.
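A sketch of the block (cluster) bootstrap for a regression slope follows: clusters are drawn with replacement and their observations stacked before re-estimating. The cluster-level error component and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)

# Illustrative clustered data: G clusters of T observations
G, T = 30, 25
groups = np.repeat(np.arange(G), T)
x = rng.normal(size=G * T)
y = 1.0 + 0.5 * x + np.repeat(rng.normal(size=G), T) + rng.normal(size=G * T)

def ols_slope(y, x):
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.solve(X.T @ X, X.T @ y)[1]

B = 1000
slopes = np.empty(B)
for b in range(B):
    drawn = rng.integers(0, G, size=G)                 # resample whole clusters with replacement
    idx = np.concatenate([np.flatnonzero(groups == g) for g in drawn])
    slopes[b] = ols_slope(y[idx], x[idx])

print(ols_slope(y, x), slopes.std(ddof=1))             # block-bootstrap standard error
```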
We saw in previous lectures that cluster robust standard errors can be inaccurate when G is small. Can the cluster bootstrap provide a better approximation of the variance than cluster robust standard errors when G is small? Cameron, Gelbach, and Miller (2007) argue yes. They suggest a cluster bootstrap using the t-statistic, $t^* = (\hat{\beta}^* - \hat{\beta})/s_{\hat{\beta}}^*$, rather than the coefficient itself; the t-statistic is attractive because it is asymptotically pivotal, enabling the possibility of asymptotic refinements. Note that $s_{\hat{\beta}}^*$ now corresponds to the cluster robust standard error for $\hat{\beta}^*$ rather than the conventional OLS standard error for $\hat{\beta}^*$. The basic procedure is to construct B bootstrap samples using the cluster bootstrap, collect $t_1^*, ..., t_B^*$, and reject $H_0$ if the absolute value of the t-statistic from the original data is greater than the 95th percentile of $|t_1^*|, ..., |t_B^*|$.
CGM (2007) suggest a further refinement to the procedure above, however, based on the wild bootstrap. The wild bootstrap is similar to the residual bootstrap in that it begins with the calculation of $\hat{\epsilon}_i = y_i - g(x_i, \hat{\beta})$. We then construct a sample consisting of $(y_1^*, x_1), ..., (y_N^*, x_N)$. Instead of resampling from $\hat{\epsilon}_1, ..., \hat{\epsilon}_N$ to get $\hat{\epsilon}_i^*$, however, we set $\hat{\epsilon}_i^*$ equal to $\hat{\epsilon}_i$ with 50% probability or $-\hat{\epsilon}_i$ with 50% probability.[8] That is to say, each $x_i$ is assigned the residual from observation i with 50% probability or the negative of the residual from observation i with 50% probability. We then construct $y_i^*$ as $y_i^* = g(x_i, \hat{\beta}) + \hat{\epsilon}_i^*$. You might think that the wild bootstrap would perform poorly because each bootstrapped observation is drawn from a distribution with only two points of support, but in fact it performs quite well. It is also robust to heteroskedasticity (unlike the residual bootstrap) because the original relationship between $x_i$ and the variance of the residual is maintained.
In the context of clustering, we modify the wild bootstrap to apply to clusters rather than individual observations. The wild cluster bootstrap for the t-statistic, $t^* = (\hat{\beta}^* - \hat{\beta})/s_{\hat{\beta}}^*$, is implemented as follows.
1. Estimate the OLS estimator $\hat{\beta}$ and use it to construct the residuals $\hat{\epsilon}_g$ ($g = 1, ..., G$). Note that $\hat{\epsilon}_g$ is a $T \times 1$ vector containing all the residuals for cluster g.

2. For each cluster g, generate $\hat{\epsilon}_g^* = \hat{\epsilon}_g$ with 50% probability or $\hat{\epsilon}_g^* = -\hat{\epsilon}_g$ with 50% probability. Set $y_g^* = x_g\hat{\beta} + \hat{\epsilon}_g^*$. Doing this for each cluster produces a bootstrap sample consisting of $(y_1^*, x_1), ..., (y_G^*, x_G)$. Using this sample, compute $t^* = (\hat{\beta}^* - \hat{\beta})/s_{\hat{\beta}}^*$. Note again that $s_{\hat{\beta}}^*$ now corresponds to the cluster robust standard error for $\hat{\beta}^*$ rather than the conventional OLS standard error for $\hat{\beta}^*$.

3. Repeat the second step B times to construct B bootstrap samples. Reject $H_0$ if the absolute value of the t-statistic from the original data is greater than the 95th percentile of $|t_1^*|, ..., |t_B^*|$.

[8] In Cameron and Trivedi they suggest assigning $1.618\,\hat{\epsilon}_i$ with 27.64% probability or $-0.618\,\hat{\epsilon}_i$ with 72.36% probability, but CGM (2007) suggest the simpler 50/50 probabilities.
Using both simulated and real data, CGM find that the wild cluster bootstrap outperforms other cluster bootstraps, particularly when G is 10 or less.
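A compact sketch of the three-step wild cluster bootstrap-t above, testing $H_0: \beta_1 = 0$ with a small number of clusters, is given below. The data are simulated, the cluster-robust standard error omits finite-sample corrections, and this is the unrestricted version described in the notes (the residuals are not re-estimated with the null imposed).

```python
import numpy as np

rng = np.random.default_rng(10)

def cluster_se(X, ehat, groups, k):
    """Cluster-robust standard error of coefficient k (no finite-sample correction)."""
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        sg = X[groups == g].T @ ehat[groups == g]
        meat += np.outer(sg, sg)
    return np.sqrt((bread @ meat @ bread)[k, k])

# Illustrative data with few clusters (where the refinement matters most)
G, T = 8, 50
groups = np.repeat(np.arange(G), T)
X = np.column_stack([np.ones(G * T), rng.normal(size=G * T)])
y = X @ np.array([1.0, 0.0]) + np.repeat(rng.normal(size=G), T) + rng.normal(size=G * T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
ehat = y - X @ beta_hat
t_obs = beta_hat[1] / cluster_se(X, ehat, groups, 1)   # test H_0: beta_1 = 0

B = 999
t_star = np.empty(B)
for b in range(B):
    signs = rng.choice([-1.0, 1.0], size=G)            # one 50/50 sign draw per cluster
    y_star = X @ beta_hat + ehat * signs[groups]       # flip each cluster's whole residual vector
    b_star = np.linalg.solve(X.T @ X, X.T @ y_star)
    e_star = y_star - X @ b_star
    t_star[b] = (b_star[1] - beta_hat[1]) / cluster_se(X, e_star, groups, 1)

print(t_obs, np.percentile(np.abs(t_star), 95))        # reject if |t_obs| exceeds the 95th percentile
```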
4 Bootstrapping Vs. Randomization Tests
Since both methods are based on resampling, it is tempting to conclude that bootstrapping and randomization tests are the same thing. There are, however, fundamental differences between the two procedures. Philosophically, randomization tests are based off of the insight that the distribution of the test statistic arises from the random assignment procedure rather than from resampling within a set population. As a result, randomization tests can only be used to test the null hypothesis, while bootstrapping can be used to construct confidence intervals. Randomization tests are also, in general, less parametric than bootstraps. The parametric bootstrap depends upon distributional assumptions, while the residual bootstrap depends upon homoskedasticity. Even the paired bootstrap depends on a zero serial correlation assumption.[9] Randomization tests depend upon modeling the distribution of the treatment correctly (e.g., is it or is it not serially correlated), but this is often much clearer to the researcher than assumptions about unobserved residuals. And while the bootstrap can be modified to accommodate clustering, as in the wild cluster bootstrap, it cannot easily be modified to accommodate a challenging scenario such as the one discussed in Aker (2008) (see lecture notes on randomization tests).
[9] The bootstrap also generally depends on the assumption that the estimator is smooth.