Difference-in-Differences Estimation
These notes provide an overview of standard difference-in-differences methods that have
been used to study numerous policy questions. We consider some recent advances in Hansen
(2007a,b) on issues of inference, focusing on what can be learned with various group/time
period dimensions and serial independence in group-level shocks. Both the repeated cross
sections and panel data cases are considered. We discuss recent work by Athey and Imbens
(2006) on nonparametric approaches to difference-in-differences, and Abadie, Diamond, and
Hainmueller (2007) on constructing synthetic control groups.
The simplest setup compares two groups across two time periods:

y = \beta_0 + \beta_1 dB + \delta_0 d2 + \delta_1 d2 \cdot dB + u,   (1.1)
where y is the outcome of interest and d2 is a dummy variable for the second time period. The
dummy variable dB captures possible differences between the treatment and control groups
prior to the policy change. The time period dummy, d2, captures aggregate factors that would
cause changes in y even in the absence of a policy change. The coefficient of interest, \delta_1,
multiplies the interaction term, d2 \cdot dB, which is the same as a dummy variable equal to one
for those observations in the treatment group in the second period. The
difference-in-differences estimate is

\hat{\delta}_1 = (\bar{y}_{B,2} - \bar{y}_{B,1}) - (\bar{y}_{A,2} - \bar{y}_{A,1}),   (1.2)

where B denotes the treatment group and A the control group.
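As a minimal sketch of how (1.1)-(1.2) might be computed, the following Python code obtains the DD estimate both from the four group means and from the interaction regression. The variable names and the simulated data are illustrative assumptions, not part of the notes.

```python
# A minimal sketch of (1.1)-(1.2): the DD estimate from four group means and from the
# interaction regression. Variable names and the simulated data are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 4000
d2 = rng.integers(0, 2, n)                 # second-period dummy
dB = rng.integers(0, 2, n)                 # treatment-group dummy
y = 1.0 + 0.5 * dB + 0.3 * d2 + 2.0 * d2 * dB + rng.normal(size=n)
df = pd.DataFrame({"y": y, "d2": d2, "dB": dB})

# (1.2): difference of the before/after changes in group means
m = df.groupby(["dB", "d2"])["y"].mean()
dd = (m.loc[(1, 1)] - m.loc[(1, 0)]) - (m.loc[(0, 1)] - m.loc[(0, 0)])

# Equivalent OLS regression (1.1); the interaction coefficient equals the DD estimate
ols = smf.ols("y ~ d2 + dB + d2:dB", data=df).fit(cov_type="HC1")
print(dd, ols.params["d2:dB"])
```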
A third dimension can be added when the policy applies only to a subgroup within the treated
state, such as the elderly. With dE a dummy variable equal to one for the affected subgroup,
the equation becomes

y = \beta_0 + \beta_1 dB + \beta_2 dE + \beta_3 dB \cdot dE + \delta_0 d2 + \delta_1 d2 \cdot dB + \delta_2 d2 \cdot dE + \delta_3 d2 \cdot dB \cdot dE + u.   (1.3)
The coefficient of interest is now \delta_3, the coefficient on the triple interaction term, d2 \cdot dB \cdot dE.
The OLS estimate \hat{\delta}_3 can be expressed as follows:

\hat{\delta}_3 = (\bar{y}_{B,E,2} - \bar{y}_{B,E,1}) - (\bar{y}_{A,E,2} - \bar{y}_{A,E,1}) - (\bar{y}_{B,N,2} - \bar{y}_{B,N,1}),   (1.4)
where the A subscript means the state not implementing the policy and the N subscript
represents the non-elderly. For obvious reasons, the estimator in (1.4) is called the
difference-in-difference-in-differences (DDD) estimate. [The population analog of (1.4) is
easily established from (1.3) by finding the expected values of the six groups appearing in
(1.4).] If we drop either the middle term or the last term, we obtain one of the DD estimates
described in the previous paragraph. The DDD estimate starts with the time change in averages
for the elderly in the treatment state and then nets out the change in means for elderly in the
control state and the change in means for the non-elderly in the treatment state. The hope is
that this controls for two kinds of potentially confounding trends: changes in the health status
of the elderly that are common across states, and changes in the health of all people living in
the treatment state.
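A compact sketch of the arithmetic in (1.4), computed from the six cell means, is given below; the state/group/period labels and the simulated data are illustrative assumptions.

```python
# Sketch of the DDD estimate (1.4) from cell means; data simulated only to illustrate the arithmetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 6000
df = pd.DataFrame({
    "state": rng.choice(["A", "B"], n),   # B implements the policy
    "group": rng.choice(["N", "E"], n),   # E = elderly (affected group), N = non-elderly
    "period": rng.integers(1, 3, n),      # 1 = before, 2 = after
})
treated = (df.state == "B") & (df.group == "E") & (df.period == 2)
df["y"] = rng.normal(size=n) + 1.5 * treated

m = df.groupby(["state", "group", "period"])["y"].mean()
ddd = ((m.loc[("B", "E", 2)] - m.loc[("B", "E", 1)])
       - (m.loc[("A", "E", 2)] - m.loc[("A", "E", 1)])
       - (m.loc[("B", "N", 2)] - m.loc[("B", "N", 1)]))
print(ddd)
```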
A general setup with many groups and time periods is

y_{igt} = \lambda_t + \alpha_g + x_{gt} \beta + z_{igt} \gamma_{gt} + v_{gt} + u_{igt},   (3.1)
where i indexes individuals, g indexes groups, and t indexes time periods. This model has a full set of
time effects, \lambda_t, a full set of group effects, \alpha_g, group/time period covariates, x_{gt} (these are the
policy variables), individual-specific covariates, z_{igt}, unobserved group/time effects, v_{gt}, and
individual-specific errors, u_{igt}. We are interested in estimating \beta. Equation (3.1) is an example
of a multilevel model.
One way to write (3.1) that is useful is

y_{igt} = \delta_{gt} + z_{igt} \gamma_{gt} + u_{igt},   i = 1, ..., M_{gt},   (3.2)

where

\delta_{gt} = \lambda_t + \alpha_g + x_{gt} \beta + v_{gt},   t = 1, ..., T,  g = 1, ..., G.   (3.3)
Equation (3.3) is very useful, as we can think of it as a regression model at the group/time
period level.
As discussed by Bertrand, Duflo, and Mullainathan (2004) (BDM), a common way to estimate and
perform inference in (3.1) is to ignore v_{gt}, in which case the observations at the individual level
are treated as independent. When v_{gt} is present, the resulting inference can be very misleading.
BDM and Hansen (2007b) allow serial correlation in {v_{gt} : t = 1, 2, ..., T} and assume
independence across groups, g.
A simple way to proceed is to view (3.3) as ultimately of interest. We observe x_{gt}, \lambda_t is
handled with year dummies, and \alpha_g just represents group dummies. The problem, then, is that
we do not observe \delta_{gt}. But we can use the individual-level data to estimate the \delta_{gt}, provided
the group/time period cell sizes, M_{gt}, are reasonably large. With random sampling within each
(g, t), the natural estimate of \delta_{gt} is obtained from OLS on (3.2) for each (g, t) pair, assuming
that E(z_{igt}' u_{igt}) = 0. (In most DD applications, this assumption almost holds by definition, as
the individual-specific controls are included to improve estimation of \delta_{gt}.) If a particular model
of heteroskedasticity suggests itself, and E(u_{igt} | z_{igt}) = 0 is assumed, then a weighted least
squares procedure can be used. Sometimes one wishes to impose some homogeneity in the
slopes, say \gamma_{gt} = \gamma_g or even \gamma_{gt} = \gamma, in which case pooling can be used to impose such
restrictions. In any case, we proceed as if the M_{gt} are large enough to ignore the estimation
error in the \hat{\delta}_{gt}; instead, the uncertainty comes through v_{gt} in (3.3). Hansen (2007b) considers
adjustments to inference that account for sampling error in the \hat{\delta}_{gt}, but the methods are more
complicated. The minimum distance approach we discussed in the cluster sampling notes,
applied in the current context, effectively drops v_{gt} from (3.3) and views \delta_{gt} = \lambda_t + \alpha_g + x_{gt} \beta
as a set of deterministic restrictions to be imposed on \delta_{gt}. Inference using the efficient
minimum distance estimator uses only sampling variation in the \hat{\delta}_{gt}, which will be independent
across all (g, t) if they are separately estimated, or which will be correlated if pooled methods
are used.
Because we are ignoring the estimation error in the \hat{\delta}_{gt}, we proceed simply by analyzing the
panel data equation

\hat{\delta}_{gt} = \lambda_t + \alpha_g + x_{gt} \beta + v_{gt},   t = 1, ..., T,  g = 1, ..., G.   (3.4)
(3.5)
(3.6)
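To make the two-step idea in (3.2)-(3.4) concrete, here is a sketch that estimates each \delta_{gt} from a within-cell OLS regression and then runs the group/time-level regression with cluster-robust standard errors by group. The design, cell sizes, and variable names are illustrative assumptions, not taken from the notes.

```python
# Sketch of the two-step procedure around (3.2)-(3.4): estimate delta_gt by OLS within each
# (g, t) cell, then run the group/time-level regression (3.4) clustering on group.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
G, T, M = 20, 6, 300
rows = []
for g in range(G):
    for t in range(T):
        x_gt = float(g < G // 2 and t >= T // 2)      # policy variable at the (g, t) level
        v_gt = rng.normal(scale=0.3)                  # group/time shock
        z = rng.normal(size=M)                        # individual-level covariate
        y = 1.0 + 0.1 * t + 0.05 * g + 0.8 * x_gt + v_gt + 0.5 * z + rng.normal(size=M)
        rows.append(pd.DataFrame({"y": y, "z": z, "g": g, "t": t, "x": x_gt}))
df = pd.concat(rows, ignore_index=True)

# Step 1: within each (g, t) cell, regress y on z; the intercept estimates delta_gt
cells = []
for (g, t), d in df.groupby(["g", "t"]):
    delta_hat = smf.ols("y ~ z", data=d).fit().params["Intercept"]
    cells.append({"g": g, "t": t, "x": d["x"].iloc[0], "delta_hat": delta_hat})
cells = pd.DataFrame(cells)

# Step 2: the group/time-level regression (3.4), clustering on group to allow serial
# correlation in v_gt within groups
step2 = smf.ols("delta_hat ~ C(t) + C(g) + x", data=cells).fit(
    cov_type="cluster", cov_kwds={"groups": cells["g"]})
print(step2.params["x"], step2.bse["x"])
```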
For the two-period panel data case with treatment indicator w_{it}, write

y_{it} = \eta + \theta d2_t + \tau w_{it} + c_i + u_{it},   t = 1, 2,   (4.1)
where d2_t = 1 if t = 2 and zero otherwise, c_i is an unobserved effect, and the u_{it} are the
idiosyncratic errors. The coefficient \tau is the treatment effect. A simple estimation procedure is
to first difference to remove c_i:

y_{i2} - y_{i1} = \theta + \tau (w_{i2} - w_{i1}) + (u_{i2} - u_{i1})   (4.2)

or

\Delta y_i = \theta + \tau \Delta w_i + \Delta u_i.   (4.3)
If E(\Delta w_i \Delta u_i) = 0, that is, the change in treatment status is uncorrelated with changes in the
idiosyncratic errors, then OLS applied to (4.3) is consistent. The leading case is when w_{i1} = 0
for all i, so that no units were exposed to the program in the initial time period. Then the OLS
estimator is

\hat{\tau} = \overline{\Delta y}_{treat} - \overline{\Delta y}_{control},   (4.4)

which is a difference-in-differences estimate, except that we difference the means of the same
units over time. This same estimate can be derived without introducing heterogeneity by simply
writing the equation for y_{it} with a full set of group-time effects. Also, (4.4) is not the same
estimate obtained from the regression of y_{i2} on 1, y_{i1}, w_{i2}, that is, using y_{i1} as a control in a
cross-section regression. The estimates can be similar, but their consistency is based on different
assumptions.
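The following sketch contrasts the first-difference estimate in (4.3)-(4.4) with the regression of y_{i2} on (1, y_{i1}, w_{i2}) discussed above; the data-generating process is an assumption chosen only to show that the two estimates are computed differently.

```python
# Sketch: two-period first-difference (DD) estimate versus the lagged-dependent-variable
# regression. Simulated data for illustration only; w_i1 = 0 for all units.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 5000
c = rng.normal(size=n)                        # unobserved effect
w2 = rng.binomial(1, 1 / (1 + np.exp(-c)))    # treatment in period 2 related to c
y1 = 1.0 + c + rng.normal(size=n)
y2 = 1.3 + 2.0 * w2 + c + rng.normal(size=n)
df = pd.DataFrame({"dy": y2 - y1, "w2": w2, "y1": y1, "y2": y2})

fd = smf.ols("dy ~ w2", data=df).fit()        # (4.4): difference of mean changes
lag = smf.ols("y2 ~ y1 + w2", data=df).fit()  # y_i2 on 1, y_i1, w_i2
print(fd.params["w2"], lag.params["w2"])
```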
More generally, with many time periods and arbitrary treatment patterns, we can use

y_{it} = \xi_t + \tau w_{it} + x_{it} \beta + c_i + u_{it},   t = 1, ..., T,   (4.5)
which accounts for aggregate time effects and allows for controls, x_{it}. Estimation by FE or FD
to remove c_i is standard, provided the policy indicator, w_{it}, is strictly exogenous: correlation
between w_{it} and u_{ir} for any t and r causes inconsistency in both estimators, although the FE
estimator typically has smaller bias when we can assume contemporaneous exogeneity,
Cov(w_{it}, u_{it}) = 0. Strict exogeneity can be violated if policy assignment changes in reaction to
past outcomes on y_{it}. In cases where w_{it} = 1 whenever w_{ir} = 1 for r < t, strict exogeneity is
usually a reasonable assumption.
Equation (4.5) allows policy designation to depend on a level effect, c_i, but w_{it} might be
correlated with unit-specific trends as well. An extension that allows for this is

y_{it} = c_i + g_i t + \xi_t + \tau w_{it} + x_{it} \beta + u_{it},   (4.6)
where g_i is the trend for unit i. A general analysis allows arbitrary correlation between (c_i, g_i)
and w_{it}, which requires at least T >= 3. If we first difference, we get

\Delta y_{it} = g_i + \eta_t + \tau \Delta w_{it} + \Delta x_{it} \beta + \Delta u_{it},   t = 2, ..., T,   (4.7)

where \eta_t = \xi_t - \xi_{t-1} is a new set of time effects. We can estimate (4.7) by differencing again,
or by using FE. The choice depends on the serial correlation properties of the \Delta u_{it} (assuming
strict exogeneity of treatment and covariates). If \Delta u_{it} is roughly uncorrelated, applying FE to
(4.7) is preferred. If the original errors u_{it} are essentially uncorrelated, applying FE to (4.6), in
the general sense of sweeping out the linear trends from the response, treatment, and covariates,
is preferred. Fully robust inference using cluster-robust variance estimators is straightforward.
Of course, one might want to allow the effect of the policy to change over time, which is easy
to do by interacting time dummies with the policy indicator.
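A minimal sketch of estimating (4.5) by fixed effects follows, implemented with unit and year dummies and standard errors clustered by unit; the staggered policy pattern and all variable names are illustrative assumptions.

```python
# Sketch: estimating (4.5) by fixed effects via unit and year dummies, with unit-clustered
# standard errors. The data-generating design is illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
N, T = 400, 6
unit = np.repeat(np.arange(N), T)
year = np.tile(np.arange(T), N)
c = np.repeat(rng.normal(size=N), T)                                 # unit effect
policy_group = np.repeat(rng.binomial(1, 0.5, N), T)                 # half the units ever treated
w = ((year >= 3) & (policy_group == 1)).astype(float)                # absorbing policy
x = rng.normal(size=N * T)
y = 0.2 * year + 1.0 * w + 0.5 * x + c + rng.normal(size=N * T)
df = pd.DataFrame({"y": y, "w": w, "x": x, "unit": unit, "year": year})

fe = smf.ols("y ~ w + x + C(year) + C(unit)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]})
print(fe.params["w"], fe.bse["w"])
```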
We can derive standard panel data approaches using the counterfactual framework from
the treatment effects literature. For each (i, t), let y_{it}(1) and y_{it}(0) denote the counterfactual
outcomes, and assume there are no covariates. One way to state the assumption of
unconfoundedness of treatment is that, for time-constant heterogeneity (c_{i0}, c_{i1}),

E[y_{it}(0) | w_i, c_{i0}, c_{i1}] = E[y_{it}(0) | c_{i0}]   (4.8)

and

E[y_{it}(1) | w_i, c_{i0}, c_{i1}] = E[y_{it}(1) | c_{i1}],   (4.9)

where w_i = (w_{i1}, ..., w_{iT}) is the time sequence of all treatments. We saw this kind of strict
exogeneity assumption conditional on latent variables several times before. It allows treatment
to be correlated with time-constant heterogeneity, but does not allow treatment in any time
period to be correlated with idiosyncratic changes in the counterfactuals. Next, assume that the
expected gain from treatment depends at most on time:

E[y_{it}(1) | c_{i1}] - E[y_{it}(0) | c_{i0}] = \tau_t,   t = 1, ..., T.   (4.10)
Writing the observed outcome as y_{it} = (1 - w_{it}) y_{it}(0) + w_{it} y_{it}(1), assumptions (4.8)-(4.10) give

E[y_{it} | w_i, c_{i0}, c_{i1}] = E[y_{it}(0) | c_{i0}] + \tau_t w_{it}.   (4.11)

If we add the standard additive structure

E[y_{it}(0) | c_{i0}] = \eta_{t0} + c_{i0},   t = 1, ..., T,   (4.12)

then we arrive at

E[y_{it} | w_i, c_{i0}, c_{i1}] = \eta_{t0} + c_{i0} + \tau_t w_{it}.   (4.13)

If the ignorability assumptions are modified to hold conditional on observed covariates, say

E[y_{it}(0) | w_i, x_i, c_{i0}, c_{i1}] = E[y_{it}(0) | x_{it}, c_{i0}] = \eta_{t0} + c_{i0} + x_{it} \beta_0,   (4.14)

and similarly for (4.9), then the estimating equation simply adds x_{it} \beta_0 to (4.13). More
interesting models are obtained by allowing the gain from treatment to depend on
heterogeneity. Suppose we assume, in addition to the ignorability assumption in (4.14) (and the
equivalent condition for y_{it}(1)), that

E[y_{it}(1) - y_{it}(0) | x_{it}, c_{i0}, c_{i1}] = \tau_t + a_i + x_{it} \theta_t.   (4.15)

Then

E(y_{it} | w_i, x_i, c_{i0}, a_i) = \eta_{t0} + x_{it} \beta_0 + \tau_t w_{it} + (w_{it} x_{it}) \theta_t + c_{i0} + a_i w_{it}.   (4.16)
This is a correlated random coefficient model because the coefficient on w_{it} is \tau_t + a_i, which
has expected value \tau_t. Generally, we want to allow w_{it} to be correlated with a_i and c_{i0}. With
small T and large N, we do not try to estimate the a_i (nor the c_{i0}). But an extension of the
within transformation effectively eliminates a_i w_{it}. Suppose we simplify a bit, assume
\tau_t = \tau, and drop all other covariates. Then a regression that appears to suffer from an
incidental parameters problem turns out to consistently estimate \tau: regress y_{it} on year
dummies, dummies for each cross-sectional observation, and the latter dummies interacted with
w_{it}. In other words, we estimate

\hat{y}_{it} = \hat{\eta}_{t0} + \hat{c}_{i0} + \hat{\tau}_i w_{it}.   (4.17)

While \hat{\tau}_i is usually a poor estimate of \tau_i = \tau + a_i, their average is a good estimator of \tau:

\hat{\tau} = N^{-1} \sum_{i=1}^{N} \hat{\tau}_i.   (4.18)
A standard error can be calculated using Wooldridge (2002, Section 11.2) or bootstrapping.
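The following sketch implements the interactive regression (4.17) and the averaging in (4.18): unit dummies are interacted with the treatment indicator and the resulting unit-specific coefficients are averaged. The staggered-adoption design is an assumption chosen so that every unit-specific coefficient is identified.

```python
# Sketch of (4.17)-(4.18): regress y on year dummies, unit dummies, and unit dummies
# interacted with w, then average the unit-specific interaction coefficients.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
N, T = 100, 5
unit = np.repeat(np.arange(N), T)
year = np.tile(np.arange(T), N)
adopt = np.repeat(rng.integers(1, T, N), T)      # adoption year in 1..T-1, so every unit switches
w = (year >= adopt).astype(float)
a = np.repeat(rng.normal(scale=0.5, size=N), T)  # heterogeneous part of the treatment effect
c = np.repeat(rng.normal(size=N), T)             # unit effects
y = 0.1 * year + (1.0 + a) * w + c + rng.normal(size=N * T)
df = pd.DataFrame({"y": y, "w": w, "unit": unit, "year": year})

fit = smf.ols("y ~ C(year) + C(unit) + C(unit):w", data=df).fit()
tau_i = fit.params[[name for name in fit.params.index if name.endswith(":w")]]
assert len(tau_i) == N                           # one interaction coefficient per unit
print(tau_i.mean())                              # (4.18): the averaged estimate of tau
```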
We can apply the results from the linear panel data notes to determine when the usual FE
estimator, that is, the one that ignores a_i w_{it}, is consistent for \tau. In addition to the
unconfoundedness assumption, sufficient is

E(a_i | \ddot{w}_{it}) = E(a_i),   t = 1, ..., T,   (4.19)

where \ddot{w}_{it} = w_{it} - \bar{w}_i. Essentially, the individual-specific treatment effect can be correlated
with the average propensity to receive treatment, \bar{w}_i, but not with the deviations for any particular
time period.
Assumption (4.19) is not completely general, and we might want a simple way to tell
whether the treatment effect is heterogeneous across individuals. Here, we can exploit
correlation between the \tau_i and treatment. Recalling that \tau_i = \tau + a_i, a useful assumption (that
need not hold for obtaining a test) is

E(a_i | w_{i1}, ..., w_{iT}) = E(a_i | \bar{w}_i) = \xi (\bar{w}_i - \mu_w),   (4.20)

where \mu_w = E(\bar{w}_i) and other covariates have been suppressed. Then we can estimate the equation
(with covariates)
y_{it} = \eta_{t0} + \tau w_{it} + x_{it} \beta_0 + w_{it} (x_{it} - \bar{x}_t) \delta + \xi w_{it} (\bar{w}_i - \mu_w) + c_{i0} + e_{it}   (4.21)
by standard fixed effects. Then we use a simple t test on \hat{\xi}, robust to heteroskedasticity and
serial correlation. If we reject, it does not mean the usual FE estimator is inconsistent, but it
could be.
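Here is a sketch of the heterogeneity check built around (4.20)-(4.21): form the interaction of w_{it} with the demeaned unit-average treatment, estimate by fixed effects, and examine the cluster-robust t statistic on that interaction. The specification and the simulated data are illustrative assumptions.

```python
# Sketch of the (4.20)-(4.21) heterogeneity test: FE with an added w_it * (wbar_i - mean)
# interaction and a cluster-robust t statistic on that term.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
N, T = 300, 5
unit = np.repeat(np.arange(N), T)
year = np.tile(np.arange(T), N)
adopt = np.repeat(rng.integers(1, T + 1, N), T)   # some units never treated in-sample
w = (year >= adopt).astype(float)
df = pd.DataFrame({"unit": unit, "year": year, "w": w})
df["wbar"] = df.groupby("unit")["w"].transform("mean")
df["inter"] = df["w"] * (df["wbar"] - df["wbar"].mean())
a = np.repeat(rng.normal(scale=0.5, size=N), T)
df["y"] = 0.1 * year + (1.0 + a) * w + np.repeat(rng.normal(size=N), T) + rng.normal(size=N * T)

fit = smf.ols("y ~ C(year) + C(unit) + w + inter", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]})
print(fit.params["inter"], fit.tvalues["inter"])   # robust t test on the interaction
```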
(5.1)
(5.2)
U_i \perp T_i \mid G_i,   (5.3)

where G_i denotes group membership and T_i the time period. This assumption implies that, within
each group, the population distribution of the unobservable U_i is stable over time.
The standard DD model can be expressed in this way, with

h_0(u, t) = u + \delta t   (5.4)

and

U_i = \alpha + \beta G_i + V_i,   V_i \perp (G_i, T_i),   (5.5)

although, because of the linearity, we can get by with the mean independence assumption
E(V_i | G_i, T_i) = 0. If the treatment effect is constant across individuals, \tau = Y_i(1) - Y_i(0), then
we can write

Y_i = \alpha + \delta T_i + \beta G_i + \tau G_i T_i + V_i.   (5.6)
(5.7)
(5.8)
The key identification result is that the counterfactual distribution of the untreated outcome for the
treated group in the second period satisfies

F_{11}^{0}(y) = F_{10}(F_{00}^{-1}(F_{01}(y))),   (5.9)

where F_{gt} denotes the cdf of the observed outcome for group g in period t and F_{00}^{-1} is the
inverse function of F_{00}, which exists under the strict monotonicity assumption. Notice that all of
the cdfs appearing on the right hand side of (5.9) are estimable from the data; they are simply the
cdfs for the observed outcomes conditional on different (g, t) pairs. Because F_{11}^{1}(y) = F_{11}(y),
we can estimate the entire distributions of both counterfactuals conditional on intervention,
G_i = T_i = 1.
The average treatment effect in the CIC framework is

\tau_{CIC} = E[Y(1) | G = 1, T = 1] - E[Y(0) | G = 1, T = 1]   (5.10)
          = E[Y_{11}(1)] - E[Y_{11}(0)],

where we drop the i subscript, Y_{gt}(1) is a random variable having distribution D[Y(1) | G = g, T = t],
and Y_{gt}(0) is a random variable having distribution D[Y(0) | G = g, T = t]. Under the same
assumptions listed above,

\tau_{CIC} = E(Y_{11}) - E[F_{01}^{-1}(F_{00}(Y_{10}))],   (5.11)
where Y_{gt} is a random variable with distribution D(Y | G = g, T = t). Given random samples from
each subgroup, a generally consistent estimator of \tau_{CIC} is

\hat{\tau}_{CIC} = N_{11}^{-1} \sum_{i=1}^{N_{11}} Y_{11,i} - N_{10}^{-1} \sum_{i=1}^{N_{10}} \hat{F}_{01}^{-1}(\hat{F}_{00}(Y_{10,i})),

for consistent estimators \hat{F}_{00} and \hat{F}_{01} of the cdfs for the control group in the initial and later
time periods, respectively. Now, Y_{11,i} denotes a random draw on the observed outcome for the
(g = 1, t = 1) group, and similarly for Y_{10,i}. Athey and Imbens establish weak conditions under
which \hat{\tau}_{CIC} is \sqrt{N}-asymptotically normal (where, naturally, observations must accumulate
within each of the four groups). In the case where the distributions of Y_{10} and Y_{00} are the same,
\hat{\tau}_{CIC} reduces to a simple difference in means for the treatment group over time.
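A minimal sketch of the CIC estimator displayed above follows, using empirical cdfs for the two control-group cells and the empirical quantile function for the later-period control cell; the normal samples are purely illustrative.

```python
# Sketch of the changes-in-changes estimator: empirical cdf of the (0,0) cell, empirical
# quantile function of the (0,1) cell, applied to the treated group's first-period draws.
import numpy as np

rng = np.random.default_rng(7)
y00 = rng.normal(0.0, 1.0, 2000)          # control group, first period
y01 = rng.normal(0.5, 1.0, 2000)          # control group, second period
y10 = rng.normal(0.2, 1.0, 2000)          # treatment group, first period
y11 = rng.normal(1.5, 1.0, 2000)          # treatment group, second period (treated)

def ecdf(sample, x):
    """Empirical cdf of `sample` evaluated at the points in `x`."""
    s = np.sort(sample)
    return np.searchsorted(s, x, side="right") / len(s)

def equantile(sample, q):
    """Empirical quantile function (inverse cdf) of `sample` at probabilities q."""
    return np.quantile(sample, np.clip(q, 0.0, 1.0))

# Counterfactual second-period outcomes for the treated group, then tau_CIC
counterfactual = equantile(y01, ecdf(y00, y10))
tau_cic = y11.mean() - counterfactual.mean()
print(tau_cic)
```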
The previous approach can be applied either with repeated cross sections or panel data.
Athey and Imbens discuss how the assumptions can be relaxed with panel data, and how
alternative estimation strategies are available. In particular, if U_{i0} and U_{i1} represent
unobservables for unit i in the initial and later time periods, respectively, then (5.3) can be
modified to
(5.12)
(5.13)
which allows for unobserved components structures U_{it} = C_i + V_{it}, where V_{it} has the same
distribution in each time period.
As discussed by AI, with panel data there are other estimation approaches. As discussed
earlier, Altonji and Matzkin (2005) use exchangeability assumptions to identify average partial
effects. To illustrate how their approach might be applied, suppose the counterfactuals satisfy the
ignorability assumption

E[Y_{it}(g) | W_{i1}, ..., W_{iT}, U_i] = h_{tg}(U_i),   t = 1, ..., T,  g = 0, 1.   (5.14)

The treatment effect for unit i in period t is h_{t1}(U_i) - h_{t0}(U_i), and the average treatment effect
is

\tau_t = E[h_{t1}(U_i) - h_{t0}(U_i)],   t = 1, ..., T.   (5.15)
Following Altonji and Matzkin, assume that the heterogeneity is related to the treatment sequence
only through the average treatment intensity, \bar{W}_i = T^{-1} \sum_{t=1}^{T} W_{it}:

D(U_i | W_{i1}, ..., W_{iT}) = D(U_i | \bar{W}_i),   (5.16)

which means that only the intensity of treatment is correlated with heterogeneity. Under (5.14)
and (5.16), it can be shown that

E(Y_{it} | W_i) = E[E(Y_{it} | W_i, U_i) | W_i] = E(Y_{it} | W_{it}, \bar{W}_i).   (5.17)
Letting \hat{Y}_t(w, \bar{W}_i) denote an estimate of E(Y_{it} | W_{it} = w, \bar{W}_i), the average treatment
effect in period t can be estimated as

\hat{\tau}_t = N^{-1} \sum_{i=1}^{N} [\hat{Y}_t(1, \bar{W}_i) - \hat{Y}_t(0, \bar{W}_i)].   (5.18)
(5.19)
(5.20)
Remember, in the current setup, no units are treated in the initial time period, so W = 1 means
treatment in the second time period.
As in Heckman, Ichimura, Smith, and Todd (1997), Abadie uses unconfoundedness
assumptions on changes over time to identify \tau_{ATT}, and straightforward extensions serve to
identify \tau_{ATE}. Given covariates X (that, if observed in the second time period, should not be
influenced by the treatment), Abadie assumes
E[Y_1(0) - Y_0(0) | X, W] = E[Y_1(0) - Y_0(0) | X],   (5.21)

so that, conditional on X, treatment status is not related to the gain over time in the absence of
treatment. In addition, the overlap assumption

0 < P(W = 1 | X) < 1   (5.22)
is critical. (Actually, for estimating \tau_{ATT}, we only need P(W = 1 | X) < 1.) Under (5.21) and
(5.22), it can be shown that

\tau_{ATT} = [P(W = 1)]^{-1} E\{ [W - p(X)] (Y_1 - Y_0) / [1 - p(X)] \},   (5.23)

where p(X) = P(W = 1 | X) is the propensity score. The sample analogue,

\hat{\tau}_{ATT} = \hat{\rho}^{-1} N^{-1} \sum_{i=1}^{N} [W_i - \hat{p}(X_i)] (Y_{i1} - Y_{i0}) / [1 - \hat{p}(X_i)],

where \hat{\rho} = N^{-1} \sum_{i=1}^{N} W_i and \hat{p}(\cdot) is an estimator of the propensity score,
is consistent and \sqrt{N}-asymptotically normal. HIR discuss variance estimation. Imbens and
Wooldridge (2007) provide a simple adjustment available in the case where \hat{p}(x) is treated as a
parametric model.
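A sketch of this weighting estimator for the ATT follows: a logit model (an assumption made here) is used for the propensity score p(X) = P(W = 1 | X), and the sample analogue of (5.23) is applied to the change in outcomes; the covariate and data-generating choices are illustrative.

```python
# Sketch of the propensity-score-weighted ATT estimator (5.23) applied to outcome changes,
# with a logit propensity score. Simulated data for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 4000
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x)))
w = rng.binomial(1, p)
y0 = 1.0 + 0.5 * x + rng.normal(size=n)                     # first-period outcome
y1 = 1.5 + 0.5 * x + 2.0 * w + rng.normal(size=n)           # second-period outcome
df = pd.DataFrame({"x": x, "w": w, "dy": y1 - y0})

phat = smf.logit("w ~ x", data=df).fit(disp=0).predict(df)  # estimated propensity score
rho = df["w"].mean()                                        # fraction treated
att = np.mean((df["w"] - phat) * df["dy"] / (1.0 - phat)) / rho
print(att)
```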
If we also add

E[Y_1(1) - Y_0(1) | X, W] = E[Y_1(0) - Y_0(0) | X],   (5.24)

so that treatment is mean independent of the gain in the treated state, then

\tau_{ATE} = E\{ [W - p(X)] (Y_1 - Y_0) / [p(X)(1 - p(X))] \},   (5.25)
which dates back to Horvitz and Thompson (1952); see HIR. Now, to estimate \tau_{ATE} over
the specified population, the full overlap assumption in (5.22) is needed, and

\hat{\tau}_{ATE} = N^{-1} \sum_{i=1}^{N} [W_i - \hat{p}(X_i)] (Y_{i1} - Y_{i0}) / \{ \hat{p}(X_i) [1 - \hat{p}(X_i)] \}.   (5.26)
Hirano, Imbens, and Ridder (2003) study this estimator in detail when \hat{p}(x) is a series logit
estimator. If instead we treat \hat{p}(x) as a parametric model, a simple adjustment makes valid
inference on \tau_{ATE} simple. Let \hat{K}_i be the summand in (5.26) less \hat{\tau}_{ATE}, and let
\hat{D}_i = h(X_i)[W_i - \Lambda(h(X_i)\hat{\gamma})] be the gradient (a row vector) from the logit
estimation, where h(X_i) denotes the logit regressors. Compute the residuals, \hat{R}_i, from the OLS
regression of \hat{K}_i on \hat{D}_i, i = 1, ..., N. Then a consistent estimator of
Avar[\sqrt{N}(\hat{\tau}_{ATE} - \tau_{ATE})] is just the sample variance of the \hat{R}_i. This is never
greater than what we get if we ignore the estimation of p(x) and just use the sample variance of the
\hat{K}_i themselves.
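The next sketch computes the weighting estimator (5.26) and the variance adjustment just described, residualizing the centered summands on the logit score; treating h(X) = (1, X) as the logit regressors is an assumption made for the example, as is the simulated design.

```python
# Sketch of the ATE weighting estimator (5.26) with the parametric-propensity-score variance
# adjustment: regress the centered summands on the logit score and use the residual variance.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 4000
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * x)))
w = rng.binomial(1, p_true)
dy = 0.5 + 2.0 * w + 0.4 * x + rng.normal(size=n)       # change in outcome, Y_1 - Y_0

logit = smf.logit("w ~ x", data=pd.DataFrame({"w": w, "x": x})).fit(disp=0)
phat = np.asarray(logit.predict())
k = (w - phat) * dy / (phat * (1.0 - phat))             # summands of (5.26)
ate = k.mean()

# Logit score (gradient) per observation, assuming h(X_i) = (1, x_i)
h = sm.add_constant(x)
d = h * (w - phat)[:, None]

# Residualize the centered summands on the score; the sample variance of the residuals
# estimates Avar[sqrt(N)(ate_hat - ate)], so the standard error divides by N
resid = sm.OLS(k - ate, d).fit().resid
se_adj = np.sqrt(resid.var() / n)
se_naive = np.sqrt((k - ate).var() / n)                 # ignores estimation of p(x)
print(ate, se_adj, se_naive)
```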
Under the unconfoundedness assumption, other strategies are available for estimating the
ATE and ATT. One possibility is to run the regression

\Delta Y_i on 1, W_i, \hat{p}(X_i),   i = 1, ..., N,

which was studied by Rosenbaum and Rubin (1983) in the cross section case. The coefficient
on W_i is the estimated ATE, although it requires some functional form restrictions for
consistency. This is much preferred to pooling across t and running the regression Y_{it} on 1, d1_t,
d1_t \cdot W_i, \hat{p}(X_i). This latter regression requires unconfoundedness in the levels, and is
dominated by the basic DD estimate from \Delta Y_i on 1, W_i: putting in any time-constant function
of X_i into the levels regression cannot account for time-constant unobservables, whereas
differencing removes them.
A regression-adjustment alternative uses the four conditional means directly. Under (5.21),

\tau_{ATT} = E\{ [E(Y_1 | X, W = 1) - E(Y_1 | X, W = 0)] - [E(Y_0 | X, W = 1) - E(Y_0 | X, W = 0)] \mid W = 1 \},   (5.27)
where, remember, Y_t denotes the observed outcome for t = 0, 1. Each of the four conditional
expectations on the right hand side is estimable using a random sample on the appropriate
subgroup. Call each of these \mu_{wt}(x) for w = 0, 1 and t = 0, 1. Then a consistent estimator of
\tau_{ATT} is

\hat{\tau}_{ATT} = N_1^{-1} \sum_{i=1}^{N} W_i \{ [\hat{\mu}_{11}(X_i) - \hat{\mu}_{01}(X_i)] - [\hat{\mu}_{10}(X_i) - \hat{\mu}_{00}(X_i)] \},   (5.28)

where N_1 = \sum_{i=1}^{N} W_i is the number of treated observations.
Computationally, this requires more effort than the weighted estimator proposed by Abadie.
Nevertheless, with flexible parametric functional forms that reflect the nature of Y,
implementing (5.28) is not difficult. If Y is binary, then the \hat{\mu}_{wt} should be obtained from
binary response models; if Y is nonnegative, perhaps a count variable, then
\mu_{wt}(x) = \exp(x \beta_{wt}) is attractive, with estimates obtained via Poisson regression
(quasi-MLE).
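A sketch of the regression-adjustment estimator (5.28) follows: the four conditional means \mu_{wt}(x) are estimated by separate linear regressions (an assumed functional form) and the double difference is averaged over the treated observations; the simulated data are illustrative.

```python
# Sketch of (5.28): estimate mu_wt(x) = E(Y_t | X = x, W = w) by four separate regressions and
# average the double difference over treated units. Linear specifications assumed for simplicity.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
n = 4000
x = rng.normal(size=n)
w = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
y0 = 1.0 + 0.5 * x + 0.3 * w + rng.normal(size=n)            # period-0 (pre-treatment) outcome
y1 = 1.4 + 0.5 * x + 0.3 * w + 2.0 * w + rng.normal(size=n)  # period-1 outcome
df = pd.DataFrame({"x": x, "w": w, "y0": y0, "y1": y1})

def mu(wval, yvar):
    """Fitted mean function from the (W = wval) subsample for outcome yvar, evaluated at all X."""
    sub = df[df["w"] == wval]
    return smf.ols(f"{yvar} ~ x", data=sub).fit().predict(df)

treated = df["w"] == 1
att = ((mu(1, "y1") - mu(0, "y1")) - (mu(1, "y0") - mu(0, "y0")))[treated].mean()
print(att)
```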
Finally, consider the synthetic control approach of Abadie, Diamond, and Hainmueller (2007)
(ADH) in a simple setting with a single treated unit (unit 1), J untreated units (j = 2, ..., J + 1),
and two time periods, with the intervention occurring between the periods. The intervention
effect for the treated unit can be estimated as

y_{12} - \sum_{j=2}^{J+1} w_j y_{j2},
where the w_j are nonnegative weights that add up to one. The question is: how can we choose the
weights, that is, the synthetic control, to obtain the best estimate of the intervention effect?
ADH propose choosing the weights so as to minimize the distance between, in this simple
setting, the treated unit's pre-intervention outcome, y_{11}, and the weighted average of the
untreated units' pre-intervention outcomes, \sum_{j=2}^{J+1} w_j y_{j1}.
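Here is a sketch of choosing synthetic-control weights in this simple two-period setting: nonnegative weights summing to one are chosen to match the treated unit's pre-intervention outcome, and the post-period gap gives the estimated effect. Matching on a single pre-period outcome is a simplification of ADH, and all numbers are illustrative.

```python
# Sketch: synthetic-control weights via constrained least squares on the pre-period outcome,
# then the post-period gap as the estimated intervention effect. Illustrative data only.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(11)
J = 10                                       # number of control units (j = 2, ..., J + 1)
y_pre_controls = rng.normal(1.0, 1.0, J)     # control outcomes in the pre-intervention period
y_post_controls = y_pre_controls + 0.3 + rng.normal(0.0, 0.1, J)
y_pre_treated, y_post_treated = 1.2, 2.0

def objective(wts):
    return (y_pre_treated - wts @ y_pre_controls) ** 2

cons = [{"type": "eq", "fun": lambda wts: wts.sum() - 1.0}]   # weights sum to one
res = minimize(objective, np.full(J, 1.0 / J), method="SLSQP",
               bounds=[(0.0, 1.0)] * J, constraints=cons)     # weights nonnegative
w_hat = res.x
effect = y_post_treated - w_hat @ y_post_controls             # estimated intervention effect
print(w_hat.round(3), effect)
```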
References
(To be added.)