Regression Discontinuity Designs Using Covariates
Abstract
∗ We thank Stephane Bonhomme, David Drukker, Kosuke Imai, Michael Jansson, Lutz Kilian, Pat Kline, Xinwei Ma, Andres Santos, and Gonzalo Vazquez-Bare for thoughtful comments and suggestions. Cattaneo gratefully acknowledges financial support from the National Science Foundation through grants SES-1357561 and SES-1459931, and Titiunik gratefully acknowledges financial support from the National Science Foundation through grant SES-1357561.
† Department of Economics, University of Miami.
‡ Department of Economics and Department of Statistics, University of Michigan.
§ Booth School of Business, University of Chicago.
¶ Department of Political Science, University of Michigan.
1 Introduction
The Regression Discontinuity (RD) design is widely used in Economics and other social, behavioral,
biomedical, and statistical sciences. Within the causal inference framework, this design is considered
among the most credible non-experimental strategies because it relies on relatively weak and easy-
to-interpret nonparametric identifying assumptions, which permit flexible and robust estimation
and inference for (local to the cutoff) treatment effects. The key feature of the design is the
existence of a score, index, or running variable for each unit in the sample, which determines
treatment assignment via hard-thresholding: all units whose score is above a known cutoff are
offered treatment, while all units whose score is below this cutoff are not. Identification, estimation,
and inference proceed by comparing the responses of units near the cutoff, taking those below
reviews see Imbens and Lemieux (2008), Lee and Lemieux (2010), Skovron and Titiunik (2016),
assumptions. Under this approach, estimation of average treatment effects at the cutoff typically
relies on nonparametric local polynomial methods, where the unknown (but assumed-smooth) re-
gression function of the outcome variable given the score is flexibly approximated above and below
the cutoff, and then these estimates are used to assess whether there is a discontinuity in lev-
els, derivatives, or ratios thereof, at the cutoff. This discontinuity, if present, is understood as
the average response to the treatment, intention-to-treat, treatment effect on the treated, or local
average treatment effect, at the cutoff, depending on the specific setting and assumptions under
consideration.
A natural estimation strategy is to fit separate local polynomial regressions of the outcome on
the score above and below the cutoff, but in practice researchers often augment their models with
“pre-intervention” covariates in addition to the running variable. The main motivation behind this
practice is to increase the precision of the RD treatment effect estimator. In addition, covariates
are sometimes added with the goal of improving the plausibility of the RD design, though this
second motivation is much harder to justify because it rests on additional strong assumptions
(see Remark 1 below). The practice of including covariates in RD estimation has its roots in the
common analogy between RD designs and randomized experiments. Since the RD design is often
(formally or informally) thought of as a randomized experiment near the cutoff (Lee, 2008; Cattaneo
et al., 2015), and the inclusion of covariates is often used in the analysis of experiments to increase
adjusted RD analysis being widespread in empirical practice, there is no existing justification for
using additional covariates for identification, estimation, or inference purposes, employing only
continuity/smoothness conditions at the cutoff. This has led to the proliferation of ad-hoc covariate-
adjustment practices that, at best, reduce the transparency of the estimation strategy and, at worst,
We provide a set of results that formalize and justify covariate adjustment in RD designs,
and offer valid estimation and inference procedures. Following empirical practice, we augment the
standard RD framework in order to codify the inclusion of covariates. In particular, we study local
polynomial methods allowing for the inclusion of additional covariates in an additive separable,
linear-in-parameters way, which permits continuous, discrete, and mixed regressors and does not
require additional smoothing methods (e.g., no need for choosing other bandwidths or kernels).
This procedure for covariate-adjusted RD estimation covers linear model adjustments, which are
popular in applied work, and allows us to characterize not only the conditions under which the
inclusion of covariates is appropriate, but also the ways in which adjusting by covariates may
lead to inconsistent RD estimators. Thus, our formal results offer concrete guidelines for applied
that imposes the same adjustment above and below the cutoff is consistent for the standard RD
treatment effect if a simple “zero RD treatment effect on covariates” condition holds. For example,
in the sharp RD design, the only requirement is that the covariates have equal conditional expecta-
tion from above and below at the cutoff, which is often conceived and presented as a falsification or
“placebo” test in RD empirical studies (see, e.g., Lee, 2008; Canay and Kamat, 2015, and references
therein). This requirement of “balanced” covariates at the cutoff, in the appropriate sense depend-
ing on the RD design considered, is the most natural and practically relevant sufficient condition
but, more generally, we are able to obtain necessary and sufficient conditions for consistency of the
covariates, and show that this estimator is generally inconsistent for the standard RD parameter of
interest. We also characterize the (necessary and) sufficient conditions required for this alternative
estimator to be consistent, which are strong and unlikely to hold in empirical settings.
We offer a complete asymptotic analysis for the covariate-adjusted RD estimator, including novel
mean squared error (MSE) expansions, several MSE-optimal bandwidths and consistent data-driven
implementations thereof, MSE-optimal point estimators, and valid asymptotic inference, covering
all empirically relevant RD designs (sharp RD, kink RD, fuzzy RD, and fuzzy kink RD), with both
heteroskedastic and clustered data. These results have immediate practical use in any RD analysis
and aid in interpreting prior results. In particular, we characterize precisely the source of efficiency
gains obtained when using the covariate-adjusted RD estimator (see Remarks 2 and 3). Last but
not least, we provide new general purpose Stata and R packages that implement all our results—see
Calonico, Cattaneo, Farrell and Titiunik (2016b) and references therein for more details.
We illustrate our methods with two empirical applications. First, we employ the data of Ludwig
and Miller (2007) to re-analyze the effect of Head Start on child mortality, where we find that
including nine pre-intervention 1960 census covariates leads to an average reduction of confidence
interval length of about 5% to 10% relative to the case without covariate-adjustment. Second, we
use the data of Chay, McEwan and Urquiola (2005) on the effect of school improvements on test
scores, where we see a 3% to 5% reduction in confidence interval length when six region-indicator
covariates are included. Finally, we also discuss the findings from an extensive simulation study
Our paper contributes to the large and still rapidly expanding methodological literature on RD
designs. Instead of giving a (likely incomplete) summary here, we defer to the review articles cited
in our opening paragraph for references and the references given throughout the manuscript. In
addition, our main results are also connected to the causal inference literature on covariate-adjusted
treatment effect estimation in randomized experiments. For a recent review see Imbens and Ru-
bin (2015) and references therein. In the specific context of RD designs, one recent strand of the
literature re-interprets the data as being “as good as randomized” within a small window around
the cutoff. This so-called “local randomization” RD approach requires conditions stronger than
tions from randomized experiments to be brought to bear (Cattaneo et al., 2015, 2016). Although
we maintain the more standard continuity framework throughout this article, our main results
show that the large sample implications of a “local randomization” approach to RD designs carry
over for the particular case of covariate adjustment—the only requirement is a weak and intuitive
The plan of presentation is as follows. The bulk of the paper is devoted to an in-depth treatment
of the sharp RD design, with a brief section discussing extensions to other popular RD settings.
Section 3 details important identification and interpretation issues. Section 4 gives a complete
analysis of nonparametric inference in sharp RD designs using covariates, including new MSE ex-
pansions, MSE-optimal estimators, valid inference based on robust bias-correction techniques, and
consistent standard errors. Section 5 discusses extensions to other RD designs. Section 6 presents
the results from the empirical illustrations and the simulation study, and Section 7 concludes. The
Appendix contains the main formulas concerning the sharp RD design, omitted from the main
text to ease the exposition. A supplemental appendix includes: (i) a thorough theoretical treat-
ment of all RD cases and extensions, including proofs of the results herein, (ii) a discussion on
implementation and other methodological details, and (iii) the complete set of Monte Carlo results.
The observed data is assumed to be a random sample (Yi , Ti , Xi , Z′i )′ , i = 1, 2, . . . , n, from a large
population. The key feature of any RD design is the presence of an observed continuous score or
running random variable Xi , which determines treatment assignment for each unit in the sample:
all units with Xi greater than a known threshold x̄ are assigned to the treatment group, while all
units with Xi < x̄ are assigned to the control group. In sharp RD designs, treatment compliance
is perfect and hence Ti = 1(Xi ≥ x̄) denotes treatment status. Using the potential outcomes
framework, the observed Yi is given by
\[
Y_i = Y_i(0)\cdot(1 - T_i) + Y_i(1)\cdot T_i =
\begin{cases}
Y_i(0) & \text{if } T_i = 0 \\
Y_i(1) & \text{if } T_i = 1,
\end{cases}
\]
where Yi (1) and Yi (0) denote the potential outcomes with and without treatment, respectively, for
each unit i in the sample. The parameter of interest is the average treatment effect at the cutoff:
\[
\tau = \tau(\bar{x}) = E[Y_i(1) - Y_i(0) \mid X_i = \bar{x}].
\]
Evaluation points of functions are dropped whenever possible throughout the paper. Hahn, Todd
and van der Klaauw (2001) gave precise, easy-to-interpret conditions for nonparametric identifi-
cation of the standard RD treatment effect τ , without additional covariates. The key substantive
identifying assumption is that E[Yi (t)|Xi = x], t ∈ {0, 1}, be continuous at the cutoff x = x̄.
The new feature studied in this paper is the presence of additional “pre-intervention”, “pre-
where Zi (1) and Zi (0) denote the (potential) covariates on either side of the threshold. In practice,
it is natural to assume that some features of the marginal distributions of Zi (1) and Zi (0) are
equal at the cutoff x̄ or, more extremely, that Zi (1) = Zi (0), which would match the definition of
A large portion of the literature on estimation and inference in RD designs focuses on nonpara-
metric local polynomial estimators. In practice, researchers first choose a neighborhood around the
cutoff, usually via a bandwidth choice, and then conduct local polynomial inference—that is, they
rely on linear regression fits using only units whose scores lie within that pre-selected neighborhood.
where β̂ Y −,p (h) and β̂ Y +,p (h) correspond to the weighted least squares coefficients
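\[
\begin{bmatrix} \hat{\beta}_{Y-,p}(h) \\ \hat{\beta}_{Y+,p}(h) \end{bmatrix}
= \arg\min_{\beta_-,\,\beta_+} \sum_{i=1}^{n}
\bigl(Y_i - r_{-,p}(X_i - \bar{x})'\beta_- - r_{+,p}(X_i - \bar{x})'\beta_+\bigr)^2 K_h(X_i - \bar{x}), \tag{1}
\]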
with β − , β + ∈ R^{p+1} , r−,p (x) = 1(x < 0)(1, x, · · · , x^p )′ , r+,p (x) = 1(x ≥ 0)(1, x, · · · , x^p )′ , e0 the
(p + 1)-vector with a one in the first position and zeros in the rest, and Kh (u) = K(u/h)/h for
a kernel function K(·) and a positive bandwidth sequence h. The kernel and bandwidth serve to
localize the regression fit near the cutoff. We assume the following standard regularity conditions
Assumption 1 (Kernel). k(·) : [0, 1] ↦ R is bounded and nonnegative, zero outside its support,
and positive and continuous on (0, 1). Set K(u) = 1(u < 0)k(−u) + 1(u ≥ 0)k(u).
The most popular choices of kernel are (i) the uniform kernel, giving equal weighting to obser-
vations Xi ∈ [x̄ − h, x̄ + h], and (ii) the triangular kernel that assigns linear down-weighting to the
same observations. The preferred choice of polynomial order is p = 1, which gives the standard
local-linear RD point estimator. The estimators β̂ Y −,p (h) and β̂ Y +,p (h) are, of course, numerically
equivalent to the coefficients that would be obtained from two separate weighted regressions, using
only observations on one side of the cutoff (with the same kernel and bandwidth). We set the
problem as a single joint least-squares linear regression fit to ease the upcoming comparisons with
While the standard estimator τ̂ (h) is popular in empirical work, and readily justified by local
smoothness assumptions, it is extremely common for empirical researchers to augment their speci-
fication with the additional covariates Zi . One way of introducing covariates leads to the following
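\[
\begin{bmatrix} \tilde{\beta}_{Y-,p}(h) \\ \tilde{\beta}_{Y+,p}(h) \\ \tilde{\gamma}_{Y,p}(h) \end{bmatrix}
= \arg\min_{\beta_-,\,\beta_+,\,\gamma} \sum_{i=1}^{n}
\bigl(Y_i - r_{-,p}(X_i - \bar{x})'\beta_- - r_{+,p}(X_i - \bar{x})'\beta_+ - Z_i'\gamma\bigr)^2 K_h(X_i - \bar{x}), \tag{2}
\]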
where β − , β + ∈ R^{p+1} and γ ∈ R^d . Throughout the paper and supplemental appendix we employ
the following notational convention whenever possible: a quantity with a tilde (θ̃, say) is estimated
with additional covariates, while a quantity with a hat (θ̂, say) is not; cf. (1) vs. (2).
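The two least-squares problems (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation in the companion Stata and R packages: the function names (`triangular_kernel`, `rd_design`, `rd_estimate`), the simulated data, and the fixed bandwidth h = 0.5 are hypothetical choices made for the example.

```python
import numpy as np

def triangular_kernel(u):
    """K(u) = 1 - |u| on [-1, 1] and zero elsewhere (one admissible kernel)."""
    return np.clip(1.0 - np.abs(u), 0.0, None)

def rd_design(x, cutoff, p):
    """Two-sided polynomial basis [r_{-,p}(x - cutoff), r_{+,p}(x - cutoff)]."""
    xc = x - cutoff
    powers = np.vander(xc, p + 1, increasing=True)  # columns 1, xc, ..., xc^p
    below = (xc < 0)[:, None] * powers
    above = (xc >= 0)[:, None] * powers
    return np.hstack([below, above])

def rd_estimate(y, x, cutoff=0.0, h=1.0, p=1, z=None, kernel=triangular_kernel):
    """Kernel-weighted least squares as in (1) when z is None, as in (2) otherwise.

    Returns the intercept jump e0'beta_plus - e0'beta_minus and all coefficients.
    """
    w = kernel((x - cutoff) / h) / h          # K_h(X_i - cutoff)
    keep = w > 0                              # only units inside the window matter
    R = rd_design(x[keep], cutoff, p)
    if z is not None:
        Z = np.asarray(z, dtype=float).reshape(len(x), -1)[keep]
        R = np.hstack([R, Z])                 # additive, linear-in-parameters covariates
    sw = np.sqrt(w[keep])
    coef, *_ = np.linalg.lstsq(R * sw[:, None], y[keep] * sw, rcond=None)
    return coef[p + 1] - coef[0], coef

# Hypothetical simulated data: true jump of 1 and a covariate balanced at the cutoff.
rng = np.random.default_rng(0)
n = 4000
x = rng.uniform(-1, 1, n)
z = rng.normal(size=n)
t = (x >= 0).astype(float)
y = 1.0 * t + 0.5 * x + 0.8 * z + rng.normal(scale=0.3, size=n)

tau_hat, _ = rd_estimate(y, x, h=0.5)         # estimator without covariates
tau_tilde, _ = rd_estimate(y, x, h=0.5, z=z)  # covariate-adjusted estimator
print(tau_hat, tau_tilde)
```

Because the two blocks of the basis never overlap, the joint fit without covariates returns the same intercepts as two separate one-sided weighted regressions; the covariate-adjusted fit differs only through the common coefficient on the covariates.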
The estimator τ̃ (h) broadly captures the common empirical practice of first choosing a neigh-
borhood around the cutoff, and then conducting local “flexible” linear least-squares estimation
and inference with covariates. But our approach formalizes two restrictions in the way that the
additional covariates Zi enter the least-squares fit locally to the cutoff: (i) additive separabil-
ity between the basis expansion of the running variable and the additional covariates, and (ii) a
linear-in-parameters specification for these covariates. We avoid full nonparametric estimation over
(Xi , Z′i )′ ∈ R^{1+d} , which would introduce d additional bandwidths and kernels, quickly leading to a
curse of dimensionality and hence rendering empirical application infeasible. Further, in practice,
Zi could include power expansions, interactions, and other “flexible” transformations of the origi-
nal set of covariates. This approach to RD covariate adjustment allows for any type of additional
The typical motivation for using the covariate-adjusted RD estimator τ̃ (h) is to improve preci-
sion in estimating the RD treatment effect, τ , which arguably stems from least squares analysis of
randomized controlled trials. We build on this intuition and make precise the conditions required
for consistency of the covariate-adjusted RD estimator τ̃ (h) for τ . We also show that much more
stringent conditions are required if treatment-covariate interactions are included in the estimation
model.
additional (pre-intervention) covariates can enhance the plausibility of the design. Specifically, it is
sometimes claimed that even if E[Yi (t)|Xi = x], t ∈ {0, 1}, is not continuous at the cutoff, adding
covariates could restore nonparametric identification of the RD average treatment effect at the
cutoff. However, within the continuity-based RD framework, if E[Yi (t)|Xi = x, Zi (t)], t ∈ {0, 1},
is indeed continuous, then E[Yi (t)|Xi = x] = E[E[Yi (t)|Xi , Zi (t)]|Xi = x] will be continuous under
most reasonable assumptions. For example, suppose that Zi (0) and Zi (1) are binary (e.g., gender),
then E[Yi (t)|Xi = x] = E[Yi (t)|Xi = x, Zi (t) = 0]P[Zi (t) = 0|Xi = x] + E[Yi (t)|Xi = x, Zi (t) =
1]P[Zi (t) = 1|Xi = x] is a linear combination of assumed-continuous functions and hence must
be continuous as well. Thus, covariate adjustment does not solve identification problems when
E[Yi (t)|Xi = x] is discontinuous. On the other hand, adding covariates for identification purposes
can be rationalized and be useful within a local randomization framework for RD designs (e.g.,
Angrist and Rokkanen, 2015; Keele et al., 2015). Below we discuss in more detail the distinctions
To make precise the difference in population parameters recovered with and without additional
useful to establish notation and state all the regularity conditions employed. This is done simul-
taneously in the assumption below, which is the only assumption imposed on the data generating
process.
Assumption 2 (Sharp RD Designs). For ϱ ≥ p + 2 and all x ∈ [xl , xu ], where xl , xu ∈ R such that
xl < x̄ < xu :
(a) The Lebesgue density of Xi , denoted f (x), is continuous and bounded away from zero.
(b) µY − (x) := E[Yi (0)|Xi = x], µY + (x) := E[Yi (1)|Xi = x], µZ− (x) := E[Zi (0)|Xi = x],
µZ+ (x) := E[Zi (1)|Xi = x], E[Zi (0)Yi (0)|Xi = x], and E[Zi (1)Yi (1)|Xi = x] are ϱ times
continuously differentiable.
(c) V[Si (t)|Xi = x], with Si (t) := (Yi (t), Zi (t)′ )′ , t ∈ {0, 1}, are continuously differentiable and
invertible.
(d) E[|Si (t)|^4 |Xi = x], t ∈ {0, 1}, are continuous, where | · | denotes the Euclidean norm.
metric analyses of RD designs, properly enlarged to allow for the inclusion of additional covariates.
Indeed, if one simply ignores all statements involving these covariates, the conditions are exactly
The assumptions are placed only on features such as the mean and variance of the conditional
distributions given the running variable Xi alone. Importantly, Assumption 2 does not restrict in
any way the “long” conditional expectation E[Yi (t)|Xi , Zi (t)], t ∈ {0, 1}, which implies that our
methods allow for discrete, continuous, and mixed additional covariates, and do not require any
semiparametric or parametric modeling of this regression function. That is, as we discuss in more
detail below, we allow for complete misspecification of E[Yi (t)|Xi , Zi (t)], t ∈ {0, 1}, for fixed n, and
hence give a “best linear approximation” interpretation to the regression coefficients obtained in
(2).
Assumption 2 is not intended to be minimal, but rather parsimonious and easily applicable to
nonparametric estimation and inference. Finally, all limits are taken as n → ∞, unless otherwise
noted.
variates
We now present the first main result of the paper, which connects and gives an interpretation to
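then
\[
\tilde{\tau}(h) \to_P \tau - \bigl[\mu_{Z+} - \mu_{Z-}\bigr]'\gamma_Y,
\]
where
\[
\gamma_Y = (\sigma^2_{Z-} + \sigma^2_{Z+})^{-1}
E\bigl[(Z_i(0) - \mu_{Z-}(X_i))\,Y_i(0) + (Z_i(1) - \mu_{Z+}(X_i))\,Y_i(1) \,\big|\, X_i = \bar{x}\bigr],
\]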
It is well known in the RD literature that, under the conditions of Lemma 1, τ̂ (h) →P τ . The
conclusion of this lemma gives a precise description of the probability limit of the covariate-adjusted
sharp RD estimator, when implemented according to (2). A similar result is discussed for all other
Lemma 1 shows that this covariate-adjusted RD estimation approach is consistent for the stan-
dard RD treatment effect at the cutoff, τ = µY + − µY − , plus an additional term that depends
on the RD treatment effect at the cutoff for the additional covariates, τ Z := µZ+ − µZ− . It
follows that, given the smoothness conditions imposed in Assumption 2, a sufficient condition for
τ̃ (h) →P τ is that µZ+ = µZ− . This is weaker than assuming that the marginal distributions of
Zi (0) and Zi (1) are equal at the cutoff. In other words, µZ+ = µZ− is implied by, but does not
require that, P[Zki (0) ≤ z|Xi = x̄] = P[Zki (1) ≤ z|Xi = x̄] for all z and k = 1, 2, · · · , d, where
Zi (t) = [Z1i (t), Z2i (t), · · · , Zdi (t)]0 with t ∈ {0, 1}.
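The role of the condition µZ+ = µZ− can be illustrated with a small simulation. The sketch below is hypothetical throughout: the helper `wls_rd`, the data generating process, the cutoff at zero, and the bandwidth are assumptions made for the example. The covariate is built to jump by 0.5 at the cutoff, so the standard RD parameter is 1.4 while the covariate-adjusted fit targets τ − τZ γY = 1.0. The last line checks the in-sample analogue of the limit in Lemma 1, namely that the adjusted estimate equals the unadjusted estimate minus the estimated covariate jump times the fitted covariate coefficient.

```python
import numpy as np

def wls_rd(outcome, x, h, extra=None):
    """Local-linear fit with triangular weights around a cutoff at 0.

    Returns (intercept jump, coefficients on the `extra` column if given).
    """
    w = np.clip(1.0 - np.abs(x / h), 0.0, None)
    keep = w > 0
    below = (x < 0).astype(float)
    above = 1.0 - below
    cols = [below, below * x, above, above * x]
    if extra is not None:
        cols.append(extra)                     # common coefficient on both sides
    R = np.column_stack(cols)[keep]
    sw = np.sqrt(w[keep])
    coef, *_ = np.linalg.lstsq(R * sw[:, None], outcome[keep] * sw, rcond=None)
    return coef[2] - coef[0], coef[4:]

rng = np.random.default_rng(1)
n, h = 20000, 0.5
x = rng.uniform(-1, 1, n)
t = (x >= 0).astype(float)

direct, tau_z, gamma = 1.0, 0.5, 0.8
z = tau_z * t + rng.normal(size=n)             # covariate is NOT balanced at the cutoff
y = direct * t + 0.5 * x + gamma * z + rng.normal(scale=0.3, size=n)
tau = direct + gamma * tau_z                   # standard RD parameter, here 1.4

tau_hat, _ = wls_rd(y, x, h)                   # targets tau
tau_z_hat, _ = wls_rd(z, x, h)                 # covariate used as the outcome
tau_tilde, gamma_tilde = wls_rd(y, x, h, extra=z)  # targets tau - tau_z * gamma

print(tau_hat, tau_tilde)
print(tau_tilde - (tau_hat - tau_z_hat * gamma_tilde[0]))  # numerically zero
```

Setting `tau_z = 0` in this sketch makes both fits target the same parameter, which is the balanced case in which comparing their precision is meaningful.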
The typical motivation for including covariates in RD analyses is to gain precision in estimating
the RD treatment effect of interest, τ , which has its roots in the analysis of randomized experiments.
Even if implicitly, researchers employing covariates in RD designs assume some form of “local
randomization”, where units are thought to be assigned to treatment or control at random near the
intuitively by Lee (2008) and Lee and Lemieux (2010); more recently, Cattaneo et al. (2015, 2016)
discuss the stronger conditions, beyond continuity, required for the interpretation and valid analysis
of RD designs as local randomized experiments. (See also de la Cuesta and Imai, 2016, for a recent
discussion of the distinction between continuity and local randomization.) From this perspective,
“pre-intervention” covariates would satisfy Zi (0) = Zi (1) conditional on Xi ∈ [x̄ − h, x̄ + h], that
is, their distributions would be equal among control and treatment units near the cutoff.
In contrast, in this paper we do not assume a local randomization condition of any form, but
tion. In this setting, Lemma 1 shows that continuity of the conditional expectations of the
additional covariates at the cutoff is the key condition required for consistency of the covariate-
adjusted RD estimator. In other words, whenever additional covariates satisfying µZ+ = µZ− are
included as in (2), the estimator τ̃ (h) will remain consistent for the standard RD estimand τ .
In addition to efficiency gains, another common motivation for examining covariates in randomized
is a frequently used method for doing so. A potentially interesting extension of our work, and
implemented as in (2) with interactions between rp (Xi − x̄) and Zi . This alternative estimation
method might be useful to assess treatment effect heterogeneity at the cutoff, as well as to provide
a more “flexible” approximation of the unknown conditional expectations in finite samples. While
such a general approach is beyond the scope of this paper, we do discuss a special case of this idea
to illustrate the potential pitfalls of allowing for interactions in the local polynomial fits.
where now θ̌ p (h) = [β̌ Y −,p (h)′ , β̌ Y +,p (h)′ , γ̌ Y −,p (h)′ , γ̌ Y +,p (h)′ ]′ is computed by
\[
\min_{\beta_-,\,\beta_+,\,\gamma_-,\,\gamma_+} \sum_{i=1}^{n}
\bigl(Y_i - r_{-,p}(X_i - \bar{x})'\beta_- - r_{+,p}(X_i - \bar{x})'\beta_+ - Z_{-,i}'\gamma_- - Z_{+,i}'\gamma_+\bigr)^2 K_h(X_i - \bar{x}),
\]
where Z−,i = 1(Xi < x̄)Zi and Z+,i = 1(Xi ≥ x̄)Zi , and β− , β+ ∈ R^{p+1} and γ − , γ + ∈ R^d .
In words, this alternative estimator fits a weighted least squares regression with full interactions
between treatment assignment and both the polynomial basis expansion of Xi and the additional
covariates Zi . Thus, θ̌ p (h) is numerically equivalent to fitting two separate weighted linear
regressions on each side of the cutoff, leading to [β̌ Y −,p (h)′ , γ̌ Y −,p (h)′ ] and [β̌ Y +,p (h)′ , γ̌ Y +,p (h)′ ]. We
present the estimation approach in a fully interacted version only for notational simplicity, so that
As shown in the next lemma, including the treatment-covariate interaction has important impli-
cations for interpretation. The difference follows from the fact that including this interaction allows
Lemma 2 (Sharp RD with Covariates and Treatment Interaction). Let Assumptions 1 and 2 hold.
If nh → ∞ and h → 0, then
where
The conclusion of this lemma defines a new RD parameter, η, recovered when additional covari-
ates interacted with treatment assignment are included linearly in the local polynomial estimation.
A similar result is established for all other RD designs in the supplement. This result gives a precise
and general interpretation to the probability limit of the interacted covariate-adjusted RD estima-
tor: η̌(h) is consistent for the standard RD average treatment effect at the cutoff, τ = µY + − µY − ,
plus an additional term which can be interpreted as the difference of the best linear approximations
at the cutoff of the unknown conditional expectations E[Yi (t)|Xi , Zi (0)], t ∈ {0, 1}, based on the
additional covariates included in the RD estimation. Alternatively, the “bias” due to the inclusion
of additional covariates interacted with treatment assignment, µ′Z+ γ Y + − µ′Z− γ Y − , can be
interpreted as the difference of the best linear predictions of Yi (t) on Zi (t), t ∈ {0, 1}, at the cutoff.
and sufficient condition for the resulting covariate-adjusted RD estimator to be consistent for the
standard RD treatment effect is that µ′Z+ γ Y + = µ′Z− γ Y − . This condition, however, is harder to
justify in practice than the condition required for the model without the interaction. In particular,
the previously sufficient condition µZ+ = µZ− (“covariate balance”) is now no longer sufficient
because one needs also to assume that γ Y + = γ Y − . The latter additional assumption can be
estimator. To make this clear, consider the following example. Suppose that E[Yi (0)|Xi , Zi (0)] =
ξ− (Xi ) + Zi (0)′ δ Y − and E[Yi (1)|Xi , Zi (1)] = ξ+ (Xi ) + Zi (1)′ δ Y + near the cutoff. Then γ Y − = δ Y −
and γ Y + = δ Y + and η̌(h) →P η = ξ+ (x̄) − ξ− (x̄) ≠ τ , and hence the interacted covariate-
adjusted RD estimator η̌(h) is consistent for a partial effect at the cutoff. In this example, the
E[Yi (t)|Xi , Zi (t)] = E[Yi (t)|Xi ] near the cutoff, though the latter is not required.
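The example above can be reproduced numerically. The following sketch is an assumption-laden illustration rather than part of our formal results: the helper `local_linear_jump`, the functions ξ−(x) = 0.5x and ξ+(x) = 1 + 0.5x, the slopes δY− = 0.2 and δY+ = 0.8, and a single balanced covariate with mean 2 are all hypothetical. Under this design τ = 1 + 2(0.8 − 0.2) = 2.2, whereas the interacted fit recovers η = ξ+(x̄) − ξ−(x̄) = 1.

```python
import numpy as np

def local_linear_jump(y, x, h, z=None, interact=False):
    """Intercept jump at a cutoff of 0 from a triangular-kernel local-linear fit.

    z=None: no covariates; interact=False: common coefficient on z;
    interact=True: separate coefficients on z below and above the cutoff.
    """
    w = np.clip(1.0 - np.abs(x / h), 0.0, None)
    keep = w > 0
    below = (x < 0).astype(float)
    above = 1.0 - below
    cols = [below, below * x, above, above * x]
    if z is not None:
        cols += [below * z, above * z] if interact else [z]
    R = np.column_stack(cols)[keep]
    sw = np.sqrt(w[keep])
    coef, *_ = np.linalg.lstsq(R * sw[:, None], y[keep] * sw, rcond=None)
    return coef[2] - coef[0]

rng = np.random.default_rng(2)
n, h = 20000, 0.5
x = rng.uniform(-1, 1, n)
t = (x >= 0).astype(float)
z = 2.0 + rng.normal(size=n)                   # balanced covariate with mean 2
delta_minus, delta_plus = 0.2, 0.8             # slopes differ across the cutoff
y0 = 0.5 * x + delta_minus * z                 # xi_minus(x) + z * delta_minus
y1 = 1.0 + 0.5 * x + delta_plus * z            # xi_plus(x) + z * delta_plus
y = np.where(t == 1, y1, y0) + rng.normal(scale=0.3, size=n)

tau_true = 1.0 + 2.0 * (delta_plus - delta_minus)   # standard RD effect, 2.2
tau_hat = local_linear_jump(y, x, h)                 # no covariates
tau_tilde = local_linear_jump(y, x, h, z=z)          # common gamma
eta_check = local_linear_jump(y, x, h, z=z, interact=True)  # interacted fit

print(tau_true, tau_hat, tau_tilde, eta_check)
```

In this design the first two estimates stay close to 2.2 because the covariate is balanced, while the interacted estimate stays close to 1: balance alone does not protect the interacted specification when the slopes differ.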
Lemmas 1 and 2 not only give general, precise, and intuitive characterizations of the probability
limits of two popular covariate-adjusted RD estimators, but also have interesting implications for
the analysis and interpretation of RD designs using covariates. Most notably, the lemmas above
show the conditions under which a covariate-adjusted RD estimator is consistent for the standard
(causal) RD treatment effect of interest, τ , and, by implication, establish when estimators with and
Since in most applications τ is the parameter of interest, comparing τ̂ (h) vs. τ̃ (h) requires the
assumption µZ+ = µZ− (Lemma 1), while comparing τ̂ (h) vs. η̌(h) requires µ′Z+ γ Y + = µ′Z− γ Y −
(Lemma 2). Adjusting for covariates in RD settings seems most useful when the estimand of interest
remains unchanged, in which case comparing precision becomes meaningful (as in Remarks 2 and
3). In applications, there is no a priori reason to blindly compare different estimators (τ̂ (h), τ̃ (h),
η̌(h)) without imposing (and, possibly, testing for) the underlying sufficient assumptions required
to retain the same target RD treatment effect of interest.
Estimation and inference in RD designs using local polynomial methods without covariates (i.e.
using only Yi and Xi ) has been studied in great detail in recent years—see, among others, Hahn
et al. (2001), Porter (2003), Imbens and Kalyanaraman (2012), Calonico et al. (2014), Gelman
and Imbens (2014), Armstrong and Kolesar (2015), Kamat (2015), Calonico et al. (2016a), and
references therein. These papers give asymptotic MSE expansions, MSE-optimal point estimators,
data-driven bandwidth selection methods, asymptotically valid inference procedures based on bias-
correction and non-standard distributional approximations, and even valid Edgeworth expansions
this literature. We assume that µZ+ = µZ− in order to maintain the same standard RD treat-
ment effect of interest (Lemma 1). We present new MSE expansions, several data-driven optimal
consistent standard errors for τ̃ (h). Analogous results for other RD designs are briefly discussed
in Section 5. A full treatment of all cases, including several other extensions, is given in the sup-
plemental appendix. All our methods are implemented in companion general purpose R and Stata
where τ̂ (h) and γ̃ Y,p (h) were defined above (see (1) and (2)), and τ̂ Z (h) is a d-dimensional vector
containing the standard RD treatment effect estimator for each covariate. In other words, each
element of τ̂ Z (h) is constructed using the corresponding covariate as outcome variable in (1). In the
appendix and supplemental appendix we give exact details. Using this partial-out representation,
it follows that
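\[
\tilde{\tau}(h) - \tau
= s(h)' \begin{bmatrix} \hat{\tau}(h) - \tau \\ \hat{\tau}_Z(h) \end{bmatrix}
= s' \begin{bmatrix} \hat{\tau}(h) - \tau \\ \hat{\tau}_Z(h) \end{bmatrix} \{1 + o_P(1)\}
\]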
where s(h) = (1, γ̃ Y,p (h)′ )′ and s = (1, γ ′Y )′ , and because s(h) →P s using the results underlying
Therefore, the asymptotic analysis proceeds by studying the (joint) large-sample properties of
the vector τ̂ S (h) := (τ̂ (h), τ̂ Z (h)′ )′ and then taking the linear combination s(h) or s, as appropriate.
Note that τ̂ S (h) →P τ S := (τ, τ ′Z )′ under the conditions in Lemma 1. In fact, most of the results
presented in this paper do not require the assumption τ Z = 0, though without this assumption
the parameter of interest changes, undermining the practical usefulness of the results. Finally, we
re-emphasize that our results do not impose any restrictions on the distribution of Yi (t)|Xi , Zi (t),
t ∈ {0, 1}, and impose instead minimal restrictions on the distributions of Yi (t), Zi (t)|Xi , t ∈ {0, 1}
(see Assumption 2). For example, we do not place any parametric or semiparametric assumption
We first establish a valid asymptotic MSE expansion for the covariate-adjusted RD estimator. This
expansion will aid in developing optimal bandwidth choices and MSE-optimal point estimators.
Furthermore, the bias expressions will be instrumental to develop valid inference procedures based
where Bias[τ̃ (h)] := E[s′ τ̂ S (h) − s′ τ S |X] and Var[τ̃ (h)] := V[s′ τ̂ S (h)|X].
\[
\mathrm{MSE}[\tilde{\tau}(h)] = h^{2(1+p)}\, B_{\tilde{\tau}}(h)^2 \{1 + o_P(1)\} + \frac{1}{nh}\, V_{\tilde{\tau}}(h),
\]
where the precise expressions for all bias and variance terms are given in the appendix.
The bias and variance expressions in Theorem 1 are different from those available in the litera-
ture (Imbens and Kalyanaraman, 2012; Calonico et al., 2014) due to the presence of the additional
covariates Zi . As a consequence, MSE-optimal bandwidth selection and MSE-optimal point es-
timators in RD designs using covariates are different from their counterparts without covariates.
Bias-correction techniques and standard error constructions will also be different, as discussed
below.
The leading bias and variance formulas in Theorem 1 are derived in pre-asymptotic form. For
the bias, the random term Bτ̃ (h) gives a pre-asymptotic stochastic approximation to the conditional
bias of the linearized estimator (hence the presence of the oP (1) term), whereas the variance term
Vτ̃ (h) is simply obtained by a conditional-on-X calculation for the linearized estimator. Calonico
et al. (2016a) prove, using valid Edgeworth expansions, that employing pre-asymptotic approxima-
tions when conducting asymptotic inference in nonparametrics can lead to superior performance.
Furthermore, fewer unknown features of the data generating process must be characterized and
estimated.
The main constants in Theorem 1 have a familiar form: the bias and variance are, respectively,
Bτ̃ (h) = Bτ̃ + (h) − Bτ̃ − (h) and Vτ̃ (h) = Vτ̃ − (h) + Vτ̃ + (h), where each component stems from
estimating the unknown regression function on one side of the cutoff. Here, the bias is entirely
due to estimating the unknown functions µY − (·) and µZ− (·) for the control group and µY + (·) and
µZ+ (·) for the treatment group. When the additional covariates are not included, these constants
reduce exactly to those already available in the literature. In the appendix, we also give the limiting
version of the bias and variance constants; that is, we characterize the fixed, real scalars Bτ̃ and Vτ̃
Assuming that Bτ̃ ≠ 0, the MSE-optimal bandwidth choice for the covariate-adjusted RD
This choice can be used to construct a consistent and MSE-optimal covariate-adjusted sharp RD
point estimator: τ̃ (hτ̃ ) →P τ , provided that τ Z = 0. Note that d = dim(Zi ) does not impact the
rate of decay because we do not employ any nonparametric smoothing methods on the additional
covariates.
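The rate of decay can be read off the expansion in Theorem 1. As a sketch only, suppose the bias and variance terms are replaced by fixed constants B and V (an assumption made here for illustration; in practice they must be estimated). Minimizing h^{2(1+p)} B^2 + V/(nh) over h then gives h = [V / (2(1 + p) B^2)]^{1/(3+2p)} n^{−1/(3+2p)}, which does not involve d. The function names and the numerical values below are hypothetical.

```python
import numpy as np

def amse(h, n, p, B, V):
    """Leading terms of the MSE expansion with constants B (bias) and V (variance)."""
    return h ** (2 * (1 + p)) * B ** 2 + V / (n * h)

def mse_optimal_bandwidth(n, p, B, V):
    """Closed-form minimizer of amse(h) over h > 0, assuming B != 0."""
    constant = (V / (2 * (1 + p) * B ** 2)) ** (1.0 / (3 + 2 * p))
    return constant * n ** (-1.0 / (3 + 2 * p))

# Hypothetical constants, chosen only to exercise the formula.
n, p, B, V = 1000, 1, 2.0, 3.0
h_star = mse_optimal_bandwidth(n, p, B, V)

# Cross-check against a brute-force grid search over h.
grid = np.linspace(0.01, 1.0, 100000)
h_grid = grid[np.argmin(amse(grid, n, p, B, V))]

# For p = 1 the bandwidth shrinks like n^(-1/5): 32 times more data halves it.
ratio = h_star / mse_optimal_bandwidth(32 * n, p, B, V)
print(h_star, h_grid, ratio)
```

The closed form and the grid search should agree up to the grid resolution, and the ratio illustrates the n^{−1/(3+2p)} rate for the local-linear case.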
We address the issue of data-driven implementations of the new optimal bandwidth choices
are well known from the literature on linear least-squares, we can give a precise characterization of
the asymptotic efficiency gains from introducing additional covariates in the RD estimation. Using
the explicit formulas given in the appendix, it is easy to show that the asymptotic variance of the
covariate-adjusted estimator τ̃ (denoted by Vτ̃ ) is equal to the asymptotic variance of the standard
Cov[Yi (t), Zi (t)|Xi = x̄] and V[Zi (t)|Xi = x̄]. Therefore, τ̃ can be asymptotically more efficient
than τ̂ when the term 2Cov[Yi (t), Zi (t)|Xi = x̄]′ γ Y is negative and larger in absolute value than
Remark 3 (MSE-optimal Point Estimation). The results above also show that τ̃ (hτ̃ ) can be a
better point estimator in an MSE sense than its MSE-optimal counterpart without covariates, τ̂ (hτ̂ ),
where hτ̂ denotes the MSE-optimal bandwidth choice for the standard RD estimator τ̂ (Imbens and
Kalyanaraman, 2012; Calonico et al., 2014). Using the explicit formulas given in the appendix, it
is easy to give conditions such that MSE[τ̃ (hτ̃ )] < MSE[τ̂ (hτ̂ )] (both have the same rate of decay),
although this is not the main goal of our paper. We still recommend that τ̂ (hτ̂ ) be the benchmark
RD point estimator, and thus that researchers incorporate covariates to increase precision relative
Remark 4 (Other Optimal Bandwidth Choices). In the supplemental appendix we discuss other
MSE-optimal bandwidth selectors based on the results underlying Theorem 1, which are specifically
tailored to one-sided and two-sided estimation problems in RD designs. Specifically, we present: (i)
separate MSE optimizations on either side of the cutoff, (ii) the MSE for the sum rather than the
difference of the one-sided estimators, and (iii) several regularized versions of the plug-in bandwidth
selectors. In all cases, the decay rate of these bandwidths matches the MSE-optimal choice, but
the exact leading constants differ, implying that any of these could be used to construct sharp
RD point estimators with an MSE-optimal convergence rate. These choices may be more stable in
finite samples or more robust to situations where the smoothing bias may be small.
robust nonparametric bias-correction. It is by now well understood that inference based on large-
sample distribution theory using MSE-optimal bandwidths will suffer from a first order bias, leading
to invalid hypothesis testing and confidence intervals because of misspecification errors near the
cutoff. This local smoothing bias involves the bias term in Theorem 1, Bτ̃ (h), which can be esti-
mated and removed. Following recent ideas and results in Calonico et al. (2014, 2016a), we propose
The bias terms of Theorem 1 are known up to a higher-order derivative of the unknown regres-
sion functions, µY − (·), µZ− (·), µY + (·), and µZ+ (·), all capturing the misspecification error intro-
duced by the local polynomial approximation. These objects can be estimated nonparametrically—
the complete details are available in the appendix (we replace s by s(h) for implementation). At
present, let us simply take as given the bias estimator B̃τ̃ (b) based on local polynomial techniques,
A particularly empirically useful choice is b = h, which is both allowed for by our asymptotic theory
and has some optimal properties (Calonico et al., 2016a). This bias correction approach is standard
in the literature (e.g., Fan and Gijbels, 1996, Section 4.4), and captures nicely “flexible” regression
adjustments to account for misspecification in finite samples (Calonico et al., 2014, Remark 7).
The key idea behind the robust bias-corrected distributional approximation is to employ an
estimator of the variability of τ̃ bc (h, b) for Studentization purposes, rather than an estimator of the
variability of τ̃ (h) only. Thus, the final missing ingredient before we can state our asymptotic Gaus-
RD estimator. Its fixed-n variability is easily characterized due to its approximate (conditional)
and the n(1 + d) × n(1 + d) matrices of variances and covariances ΣS− and ΣS+ are unknown.
Specifically, ΣS− = V[S(0)|X] and ΣS+ = V[S(1)|X], where S(0) = (Y(0)′ , vec(Z(0))′ )′ and S(1) =
(Y(1)′ , vec(Z(1))′ )′ , with Y(t) = [Y1 (t), Y2 (t), · · · , Yn (t)]′ and Z(t) = [Z1 (t), Z2 (t), · · · , Zn (t)]′ for
t ∈ {0, 1}. The appendix collects tedious details and specific formulas, including the exact form of
$P^{bc}_{-,p}(h, b)$ and $P^{bc}_{+,p}(h, b)$.
The (infeasible) variance formula Vτ̃bc (h, b) differs from that presented in Theorem 1, Vτ̃ (h),
because it also accounts for the leading additional variability injected by the bias estimation,
$h^{1+p}\tilde{B}_{\tilde\tau}(b)$. By virtue of the variance formula being computed both conditionally and pre-asymptotically,
up to the linear combination term s, it involves only one unknown feature, ΣS− and ΣS+ , which
thereof. The estimators need to account for the specific data structure at hand, such as
heteroskedasticity and/or clustering. In particular, we discuss two types of plug-in variance estimators,
one based on a nearest neighbor (NN) approach and the other based on a plug-in residuals (PR)
approach, covering both conditional heteroskedasticity and clustered data. We defer the notation-
ally cumbersome details to the supplemental appendix, and instead we provide here only a brief
summary of the main ideas and results. The unknown matrices ΣS− and ΣS+ contain, under
σY Zk −,i = Cov[Yi (0), Zki (0)|Xi ], σY Zk +,i = Cov[Yi (1), Zki (1)|Xi ]
for k = 1, 2, · · · , d. The feasible variance estimators are then constructed by replacing these un-
• NN Variance Estimation. Employing ideas in Muller and Stadtmuller (1987) and Abadie
and Imbens (2006, 2008), we replace σY Zk −,i and σY Zk +,i by, respectively,
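$$\hat\sigma_{YZ_k-,i}(J) = \mathbb{1}(X_i < \bar{x})\,\frac{J}{J+1}\left(Y_i - \frac{1}{J}\sum_{j=1}^{J} Y_{\ell_{-,j}(i)}\right)\left(Z_{ki} - \frac{1}{J}\sum_{j=1}^{J} Z_{k\ell_{-,j}(i)}\right),$$
$$\hat\sigma_{YZ_k+,i}(J) = \mathbb{1}(X_i \geq \bar{x})\,\frac{J}{J+1}\left(Y_i - \frac{1}{J}\sum_{j=1}^{J} Y_{\ell_{+,j}(i)}\right)\left(Z_{ki} - \frac{1}{J}\sum_{j=1}^{J} Z_{k\ell_{+,j}(i)}\right),$$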
for k = 1, 2, · · · , d, and where ℓ−,j (i) is the index of the j-th closest unit to unit i among
{Xi : Xi < x̄} and ℓ+,j (i) is the index of the j-th closest unit to unit i among {Xi : Xi ≥ x̄},
and J denotes the (fixed) number of neighbors chosen. Replacing the non-zero entries of
ΣS− and ΣS+ in this fashion (which depend on the sampling structure assumed), and s by
s(h), we obtain the NN variance estimator of Vτ̃bc (h, b), denoted by V̌τ̃bc (h, b).
• PR Variance Estimation. This method applies ideas from least-squares methods; see Long
and Ervin (2000), MacKinnon (2012), and Cameron and Miller (2015) for review on variance
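$$\hat\sigma_{YZ_k-,q,i}(h) = \mathbb{1}(X_i < \bar{x})\,\omega_{-,i}\,\big(Y_i - r_q(X_i - \bar{x})'\hat\beta_{Y-,q}(h)\big)\big(Z_{ki} - r_q(X_i - \bar{x})'\hat\beta_{Z_k-,q}(h)\big),$$
$$\hat\sigma_{YZ_k+,q,i}(h) = \mathbb{1}(X_i \geq \bar{x})\,\omega_{+,i}\,\big(Y_i - r_q(X_i - \bar{x})'\hat\beta_{Y+,q}(h)\big)\big(Z_{ki} - r_q(X_i - \bar{x})'\hat\beta_{Z_k+,q}(h)\big),$$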
for k = 1, 2, · · · , d, and where β̂ V −,q (h) and β̂ V +,q (h) denote the q-th order local polynomial
(1), and ω−,i and ω+,i denote finite-sample adjustments used to construct the HCk variance
estimators. See the supplement for details. Replacing the non-zero entries of ΣS− and ΣS+
in this fashion (which depend on the sampling structure assumed), and s by s(h), we obtain
the PR variance estimator of Vτ̃bc (h, b), denoted by V̂τ̃bc (h, b).
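As a rough illustration of the NN construction in the first bullet, the sketch below computes the unit-level terms for a single (scalar) covariate. It assumes that closeness is measured by distance in the score Xi, that unit i itself is excluded from its own neighbor set, and that ties are broken by input order; the function name `nn_covariance_terms` is hypothetical and is not part of the companion software.

```python
def nn_covariance_terms(x, y, z, cutoff, J=3):
    """Unit-level NN estimates of Cov[Y_i, Z_i | X_i] for one scalar covariate z.

    Neighbors are searched separately among units below and at-or-above the cutoff.
    """
    n = len(x)
    sigma_hat = [0.0] * n
    below = [i for i in range(n) if x[i] < cutoff]
    above = [i for i in range(n) if x[i] >= cutoff]
    for group in (below, above):
        if len(group) <= J:
            raise ValueError("need more than J units on each side of the cutoff")
        for i in group:
            # The J units on the same side that are closest to unit i in the score.
            neighbors = sorted((j for j in group if j != i),
                               key=lambda j: abs(x[j] - x[i]))[:J]
            y_bar = sum(y[j] for j in neighbors) / J
            z_bar = sum(z[j] for j in neighbors) / J
            # J/(J+1) is the finite-sample scaling that appears in the formula.
            sigma_hat[i] = J / (J + 1) * (y[i] - y_bar) * (z[i] - z_bar)
    return sigma_hat
```

Setting z equal to y in this sketch gives the analogous conditional variance terms, which are nonnegative by construction.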
Putting together all the pieces, we obtain the following distributional approximation result.
$$\tilde{T}_{\tilde\tau} = \frac{\tilde\tau^{bc}(h, b) - \tau}{\sqrt{\frac{1}{nh} V^{bc}_{\tilde\tau}(h, b)}} \to_d N(0, 1).$$
Furthermore, V̌τ̃bc (h, b)/Vτ̃bc (h, b) →P 1 and V̂τ̃bc (h, b)/Vτ̃bc (h, b) →P 1.
Theorem 2 provides valid inference in sharp RD designs using covariates. To our knowledge,
this is the first such result available in the literature for the covariate-adjusted RD estimation.
Extensions of this result to all other popular RD designs are discussed in Section 5 and the sup-
plemental appendix. Once bandwidths are chosen, asymptotically valid inference procedures are
confidence interval for the RD treatment effect τ , using h = b and NN variance estimation, is given
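by
$$\left[\ \tilde\tau^{bc}(h, h) - \frac{1.96}{\sqrt{nh}} \cdot \sqrt{\check{V}^{bc}_{\tilde\tau}(h, h)}\ ,\quad \tilde\tau^{bc}(h, h) + \frac{1.96}{\sqrt{nh}} \cdot \sqrt{\check{V}^{bc}_{\tilde\tau}(h, h)}\ \right].$$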
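The interval above is straightforward to compute once its ingredients are available. The following is a minimal sketch, not the companion software: it takes the point estimate, the bias estimate, and the variance estimate as given numbers, and it assumes the bias-corrected estimator is formed as the point estimate minus h^(1+p) times the bias estimate, in line with the bias term described in the text. Both function names are hypothetical.

```python
import math

def bias_corrected_estimate(tau_tilde, bias_hat, h, p=1):
    """Subtract the estimated leading bias h^(1+p) * B from the point estimate (assumed form)."""
    return tau_tilde - h ** (1 + p) * bias_hat

def robust_confidence_interval(tau_bc, v_bc, n, h, z=1.96):
    """Two-sided interval tau_bc -/+ z * sqrt(v_bc) / sqrt(n * h).

    v_bc is a variance estimate that already accounts for the bias-correction step.
    """
    half_width = z * math.sqrt(v_bc) / math.sqrt(n * h)
    return tau_bc - half_width, tau_bc + half_width
```

The critical value 1.96 corresponds to a nominal 95% level; other levels only change `z`.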
Remark 5 (Clustered Data). Theorem 2 can also be established under clustered sampling. All
derivations and results remain valid, but the variance formulas will depend on the particular form of
clustering. In this case, asymptotics are conducted under the standard assumptions: (i) each unit i
belongs to exactly one of G clusters, and (ii) G → ∞ and Gh → ∞. See Cameron and Miller (2015)
for a review of cluster-robust inference, and Bartalotti and Brummet (2016) for a discussion in the
context of MSE-optimal bandwidth selection for sharp RD designs. This extension is conceptually
straightforward but notationally cumbersome, and is deferred to the supplement. Our companion
software in R and Stata also includes optional cluster-robust (i) bandwidth selection, (ii) MSE-
We now discuss bandwidth selection briefly, leaving full details to the supplement (where we also
discuss several alternative bandwidth selectors as in Remark 4). Here we focus exclusively on two
main, distinct approaches: (i) the MSE-optimal choice derived previously (hη ), which can be used
to construct MSE-optimal RD point estimators, and (ii) a novel bandwidth selection approach
constructed to obtain the fastest decay of the coverage error rate (CER) of robust bias-corrected
confidence intervals (denoted hCER,η ), motivated by the valid Edgeworth expansions for RD inference.
One of the strengths of Theorem 2 is that the distributional approximation is valid under a
large set of tuning parameter choices (strictly more than would be possible without bias correction),
which in particular includes the MSE-optimal choice (which is not valid for standard procedures).
Assuming the bias is not zero, Theorem 1 can be used in the familiar way to construct feasible
$$\tilde{h}_{\tilde\tau} = \left[ \frac{1}{2(1+p)} \, \frac{\tilde{V}_{\tilde\tau}(v)/n}{\tilde{B}_{\tilde\tau}(b)^2} \right]^{\frac{1}{3+2p}},$$
where the exact form of the bias estimator, B̃τ̃ (b), and variance estimator, Ṽτ̃ (v), are given in the
supplemental appendix. Heuristically, these estimators are formed as plug-in versions of the pre-
asymptotic formulas obtained in Theorem 1, following the previous discussion leading to Theorem 2.
In the supplemental appendix, we also show that these feasible versions of the optimal bandwidths
erage error optimal bandwidth choices. Following the results in Calonico et al. (2016a), we also
$$\tilde{h}_{\mathrm{CER},\tilde\tau} = n^{-\frac{p}{(3+p)(3+2p)}} \cdot \tilde{h}_{\tilde\tau}.$$
This bandwidth choice minimizes the coverage error rate for confidence intervals based on Theorem
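The two bandwidth formulas in this section are simple functions of the estimated bias and variance constants. The sketch below evaluates them for given inputs; it does not estimate the constants, and the function names are hypothetical rather than taken from the companion software.

```python
def mse_optimal_bandwidth(bias_hat, var_hat, n, p=1):
    """Plug-in bandwidth [ (var_hat / n) / (2 (1 + p) bias_hat^2) ]^(1 / (3 + 2p))."""
    if bias_hat == 0:
        raise ValueError("the formula requires a non-zero bias constant")
    return ((var_hat / n) / (2 * (1 + p) * bias_hat ** 2)) ** (1 / (3 + 2 * p))

def cer_optimal_bandwidth(h_mse, n, p=1):
    """Rescale the MSE-optimal bandwidth by n^(-p / ((3 + p)(3 + 2p)))."""
    return n ** (-p / ((3 + p) * (3 + 2 * p))) * h_mse
```

For p = 1 the MSE-optimal bandwidth shrinks at rate n^(-1/5), so multiplying the sample size by 32 halves it, and the CER rescaling factor is n^(-1/20), which makes the CER-optimal bandwidth smaller than the MSE-optimal one for any n > 1.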
5 Other RD designs
We extend our main results to cover other popular RD designs, including fuzzy, kink, and fuzzy
kink RD. Here we give a short overview of the main ideas, deferring all details to the supple-
mental appendix. There are two wrinkles to the standard sharp RD design discussed so far that
must be accounted for: ratios of estimands/estimators for fuzzy designs and derivatives in esti-
The distinctive feature of fuzzy RD designs is that treatment compliance is imperfect. This implies
that Ti = Ti (0) · 1(Xi < x̄) + Ti (1) · 1(Xi ≥ x̄), that is, the treatment status Ti of each unit
still changes discontinuously at the RD threshold level x̄. Here, Ti (0) and Ti (1) denote the two
potential treatment status for each unit i when, respectively, Xi < x̄ (not offered treatment) and
Xi ≥ x̄ (offered treatment).
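To make the notation concrete, the short sketch below builds the observed treatment status from the two potential treatment statuses and the score, exactly as in the displayed identity. The function name and the toy inputs are purely illustrative.

```python
def observed_treatment(x, t0, t1, cutoff):
    """T_i = T_i(0) * 1(X_i < cutoff) + T_i(1) * 1(X_i >= cutoff)."""
    return [t1_i if x_i >= cutoff else t0_i
            for x_i, t0_i, t1_i in zip(x, t0, t1)]
```

In a sharp design every unit has T_i(0) = 0 and T_i(1) = 1, so the observed status reduces to the indicator 1(X_i >= cutoff); imperfect compliance corresponds to units for which this fails.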
To analyze the case of fuzzy RD designs, we first recycle notation for potential outcomes and
covariates as follows:
Zi (t) := Zi (0) · (1 − Ti (t)) + Zi (1) · Ti (t)
for t = 0, 1. That is, in this setting, potential outcomes and covariates are interpreted as their
adjusted instrumental variable type estimators is delicate; see e.g. Abadie (2003) for more discus-
sion. Nonetheless, the above re-definitions enable us to use the same notation, assumptions, and
results, already given for the sharp RD design, taking the population target estimands as simply
The following assumption complements Assumption 2, now concerning the (potential) treatment
variables.
Assumption 3 (Fuzzy RD Designs). For ϱ ≥ p + 2 and all x ∈ [xl , xu ], where xl , xu ∈ R such that
xl < x̄ < xu :
(a) µT − (x) := E[Ti (0)|Xi = x], µT + (x) := E[Ti (1)|Xi = x], E[Zi (0)Ti (0)|Xi = x], and E[Zi (1)Ti (1)|Xi =
(b) V[Fi (t)|Xi = x], with Fi (t) := (Yi (t), Ti (t), Zi (t)′ )′ , t ∈ {0, 1}, are continuously differentiable
and invertible.
$$\varsigma = \frac{\tau_Y}{\tau_T}, \qquad \tau_Y = \mu_{Y+} - \mu_{Y-}, \qquad \tau_T = \mu_{T+} - \mu_{T-},$$
where recall that we continue to omit the evaluation point x = x̄, and we have redefined the potential
now τ has a subindex highlighting the outcome variable being considered (Y or T ), and hence
τ = τY by definition. See Hahn, Todd and van der Klaauw (2001) and Imbens and Lemieux (2008)
$$\hat\varsigma(h) = \frac{\hat\tau_Y(h)}{\hat\tau_T(h)}, \qquad \hat\tau_V(h) = e_0'\hat\beta_{V+,p}(h) - e_0'\hat\beta_{V-,p}(h),$$
with V ∈ {Y, T }, according to (1). Similarly, the covariate-adjusted fuzzy RD estimator is
$$\tilde\varsigma(h) = \frac{\tilde\tau_Y(h)}{\tilde\tau_T(h)}, \qquad \tilde\tau_V(h) = e_0'\tilde\beta_{V+,p}(h) - e_0'\tilde\beta_{V-,p}(h),$$
with V ∈ {Y, T }, according to (2). Our notation makes clear that the fuzzy RD estimators, with
or without additional covariates, are simply the ratio of two sharp RD estimators, with or without
covariates.
The properties of the standard fuzzy RD estimator ςˆ(h) were studied in great detail before, while
the covariate-adjusted fuzzy RD estimator ς˜(h) has not been studied in the literature before. With
these preliminaries, we can give the analogue of Lemma 1 for fuzzy RD designs using covariates.
then
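$$\tilde\varsigma(h) \to_P \frac{\tau_Y - [\mu_{Z+} - \mu_{Z-}]'\gamma_Y}{\tau_T - [\mu_{Z+} - \mu_{Z-}]'\gamma_T},$$
where $\gamma_V = (\sigma^2_{Z-} + \sigma^2_{Z+})^{-1} E[(Z_i(0) - \mu_{Z-}(X_i))V_i(0) + (Z_i(1) - \mu_{Z+}(X_i))V_i(1) \mid X_i = \bar{x}]$ with
$V \in \{Y, T\}$.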
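The probability limit in this lemma can be evaluated directly for hypothetical population values, which makes the role of covariate balance at the cutoff easy to see. The sketch below is illustrative only: all inputs are assumed known population quantities, and the function name is hypothetical.

```python
def fuzzy_rd_plim(tau_y, tau_t, mu_z_plus, mu_z_minus, gamma_y, gamma_t):
    """Limit of the covariate-adjusted fuzzy RD estimator for given population values.

    mu_z_plus, mu_z_minus, gamma_y and gamma_t are equal-length sequences (one entry
    per additional covariate).
    """
    jump_z = [a - b for a, b in zip(mu_z_plus, mu_z_minus)]
    numerator = tau_y - sum(d * g for d, g in zip(jump_z, gamma_y))
    denominator = tau_t - sum(d * g for d, g in zip(jump_z, gamma_t))
    if denominator == 0:
        raise ZeroDivisionError("the adjusted first-stage limit is zero")
    return numerator / denominator
```

When the covariate means do not jump at the cutoff the function returns tau_y / tau_t, the standard fuzzy RD estimand; any jump in the covariates moves both the numerator and the denominator.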
Under the same conditions, when no additional covariates are included, it is well known that
ςˆ(h) →P ς. Thus, this lemma clearly shows that both probability limits will coincide under the same
sufficient condition as in the sharp RD design: µZ− = µZ+ . Therefore, at least asymptotically, a
(causal) interpretation for the probability limit of the covariate-adjusted fuzzy RD estimator can
be deduced from the corresponding (causal) interpretation for the probability limit of the standard
Since the fuzzy RD estimators are constructed as a ratio of two sharp RD estimators, their
asymptotic properties can be characterized by studying the asymptotic properties of the corre-
sponding sharp RD estimators, which have already been analyzed in previous sections. Specifically,
with
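$$f_{\tilde\varsigma} = \begin{bmatrix} \dfrac{1}{\tau_T} \\[6pt] -\dfrac{\tau_Y}{\tau_T^2} \end{bmatrix}, \qquad \tilde{\boldsymbol\tau}(h) = \begin{bmatrix} \tilde\tau_Y(h) \\ \tilde\tau_T(h) \end{bmatrix}, \qquad \boldsymbol\tau = \begin{bmatrix} \tau_Y \\ \tau_T \end{bmatrix},$$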
and where the term ς˜ is a quadratic (high-order) error. Therefore, it is sufficient to study the
asymptotic properties of the bivariate vector τ̃ (h) of covariate-adjusted sharp RD estimators, pro-
vided that ς˜ is asymptotically negligible relative to the linear approximation, which is proven in
the supplement. As before, while not necessary for most of our results, we continue to assume
that µZ− = µZ+ so the standard RD estimand is recovered by the covariate-adjusted fuzzy RD
estimator.
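The linearization argument is the usual delta method for a ratio. As a small numerical sketch under assumed values (all names are hypothetical), the code below evaluates the gradient vector used in the linear approximation, the linear term itself, and the implied variance of the ratio given a 2 x 2 covariance matrix for the two sharp RD estimators.

```python
def ratio_gradient(tau_y, tau_t):
    """Gradient of tau_y / tau_t with respect to (tau_y, tau_t)."""
    return 1.0 / tau_t, -tau_y / tau_t ** 2

def linear_term(est_y, est_t, tau_y, tau_t):
    """First-order approximation to est_y / est_t - tau_y / tau_t."""
    f_y, f_t = ratio_gradient(tau_y, tau_t)
    return f_y * (est_y - tau_y) + f_t * (est_t - tau_t)

def delta_method_variance(tau_y, tau_t, v_yy, v_yt, v_tt):
    """Quadratic form f' V f with V = [[v_yy, v_yt], [v_yt, v_tt]]."""
    f_y, f_t = ratio_gradient(tau_y, tau_t)
    return f_y ** 2 * v_yy + 2.0 * f_y * f_t * v_yt + f_t ** 2 * v_tt
```

The gap between the exact ratio error and `linear_term` shrinks quadratically with the estimation error, which is why it can be ignored relative to the linear part when the two sharp RD estimators are consistent.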
Employing the linear approximation and parallel results as those discussed above for the sharp
to conduct inference in fuzzy RD designs with covariates. All the same results outlined in the previ-
ous section are established for this case: in the supplemental appendix we present MSE expansions,
bias-corrected distribution theory and consistent standard errors under either heteroskedasticity
or clustering, for the covariate-robust fuzzy RD estimator ς˜(h). We do not attempt to present
these results here because they are notationally cumbersome, with little new conceptual insight.
Nevertheless, all these results are implemented in the general purpose software packages for R and
Our final extension concerns the so-called kink RD designs. See Card, Lee, Pei and Weber (2015)
for a discussion on identification and Calonico et al. (2014) for a discussion on estimation and
inference, both covering sharp and fuzzy settings without additional covariates. Dong and Lewbel
(2015) also study derivative estimation in RD designs, without additional covariates. We briefly
outline identification and consistency results when additional covariates are included in kink RD
estimation (i.e., derivative estimation at the cutoff), but relegate all other inference results to the
supplemental appendix.
To describe the estimands of interest in this context, let $g^{(s)}(x) = \partial^s g(x)/\partial x^s$ for any sufficiently
smooth function g(·). The standard sharp kink RD parameter is (proportional to)
$$\tau_{Y,1} = \mu^{(1)}_{Y+} - \mu^{(1)}_{Y-},$$
treatment effects are estimated by using the local polynomial plug-in estimators:
$$\hat\tau_{Y,1}(h) = e_1'\hat\beta_{Y+,p}(h) - e_1'\hat\beta_{Y-,p}(h) \quad \text{and} \quad \hat\varsigma_1(h) = \frac{\hat\tau_{Y,1}(h)}{\hat\tau_{T,1}(h)},$$
where e1 denotes the conformable 2nd unit vector (i.e., e1 = (0, 1, 0, 0, · · · , 0)′ ). Therefore, the
and
$$\tilde\varsigma_1(h) = \frac{\tilde\tau_{Y,1}(h)}{\tilde\tau_{T,1}(h)}, \qquad \tilde\tau_{V,1}(h) = e_1'\tilde\beta_{V+,p}(h) - e_1'\tilde\beta_{V-,p}(h), \qquad V \in \{Y, T\},$$
respectively. The following lemma gives our main identification and consistency results.
then
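$$\tilde\tau_{Y,1}(h) \to_P \tau_{Y,1} - [\mu^{(1)}_{Z+} - \mu^{(1)}_{Z-}]'\gamma_Y$$
and
$$\tilde\varsigma_1(h) \to_P \frac{\tau_{Y,1} - [\mu^{(1)}_{Z+} - \mu^{(1)}_{Z-}]'\gamma_Y}{\tau_{T,1} - [\mu^{(1)}_{Z+} - \mu^{(1)}_{Z-}]'\gamma_T},$$
where $\gamma_Y$ and $\gamma_T$ are defined in Lemma 3, and recall that $\mu^{(1)}_{Z-} = \mu^{(1)}_{Z-}(\bar{x})$ and $\mu^{(1)}_{Z+} = \mu^{(1)}_{Z+}(\bar{x})$ with
$\mu^{(1)}_{Z-}(x) = \partial\mu_{Z-}(x)/\partial x$ and $\mu^{(1)}_{Z+}(x) = \partial\mu_{Z+}(x)/\partial x$.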
As before, in this setting it is well known that τ̂Y,1 (h) →P τY,1 (sharp kink RD) and ςˆ1 (h) →P ς1
(fuzzy kink RD), formalizing once again that the estimand when covariates are included is in general
different from the standard kink RD estimand without covariates. In this case, a sufficient condition
for the estimands with and without covariates to agree is $\mu^{(1)}_{Z+} = \mu^{(1)}_{Z-}$ for both sharp and fuzzy
kink RD designs.
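For intuition on the kink estimators without covariates, the sketch below fits a weighted polynomial regression on each side of the cutoff and takes differences of the intercepts (the sharp RD jump) and of the first-order coefficients (the sharp kink). It is a simplified illustration, not the companion software: it assumes a triangular kernel, a user-supplied bandwidth, no additional covariates, and no bias correction or standard errors. All function names are hypothetical. Applying it to a pre-intervention covariate in place of the outcome gives a point estimate of the difference in covariate slopes at the cutoff.

```python
def solve_linear(a, b):
    """Solve the square system a @ beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design: too few observations in the window")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - tail) / m[r][r]
    return beta

def local_poly_fit(x, y, cutoff, h, above, p=1):
    """Kernel-weighted least squares of y on (1, x - c, ..., (x - c)^p) on one side."""
    gram = [[0.0] * (p + 1) for _ in range(p + 1)]
    rhs = [0.0] * (p + 1)
    for xi, yi in zip(x, y):
        if (xi >= cutoff) != above:
            continue
        w = max(0.0, 1.0 - abs(xi - cutoff) / h)  # triangular kernel weight
        if w == 0.0:
            continue
        r = [(xi - cutoff) ** k for k in range(p + 1)]
        for a in range(p + 1):
            rhs[a] += w * r[a] * yi
            for b in range(p + 1):
                gram[a][b] += w * r[a] * r[b]
    return solve_linear(gram, rhs)

def rd_jump_and_kink(x, y, cutoff, h, p=1):
    """Differences in intercepts and in first-order coefficients across the cutoff."""
    beta_plus = local_poly_fit(x, y, cutoff, h, above=True, p=p)
    beta_minus = local_poly_fit(x, y, cutoff, h, above=False, p=p)
    return beta_plus[0] - beta_minus[0], beta_plus[1] - beta_minus[1]
```

Because the regressors are powers of (x - c) rather than of (x - c)/h, the first-order coefficient estimates the first derivative at the cutoff directly.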
While the above results are in qualitative agreement with the sharp and fuzzy RD cases, and
therefore most conclusions transfer directly to kink RD designs, there is one interesting difference
concerning the sufficient conditions guaranteeing that both estimands coincide: a sufficient condition
now requires $\mu^{(1)}_{Z+} = \mu^{(1)}_{Z-}$. This requirement is not related to the typical falsification test
conducted in empirical work, that is, µZ+ = µZ− , but rather a different feature of the conditional
distributions of the additional covariates given the score—the first derivative of the regression func-
tion at the cutoff. Therefore, this finding suggests a new falsification test for empirical work in kink
RD designs: testing for a zero sharp kink RD treatment effect on “pre-intervention” covariates.
For example, this can be done using standard sharp kink RD treatment effect results, using each
As before, inference results follow the same logic already discussed. Complete details are given
in the supplement and fully implemented in the R and Stata software described by Calonico et al.
(2016b).
6 Numerical Results
We now illustrate our methods empirically and present an extensive simulation study conducted to
assess the finite sample properties of the covariate-adjusted RD estimator and the associated large
sample inference procedures developed in this paper. To conserve space, we only discuss the main
We first illustrate our methods in two empirical applications. First, we re-analyze the effect of
Head Start assistance on child mortality in the U.S., which was first studied by Ludwig and Miller
(2007). In this application, the unit of observation is the U.S. county, the treatment is receiving
technical assistance to apply for Head Start funds, and the running variable is the county-level
poverty index constructed in 1965. The RD design arises because the treatment was given only to
counties whose poverty index was x̄ = 59.1984 or above, a cutoff that was chosen to ensure that the
300 poorest counties received the treatment. The outcome of interest is the child mortality rate
(for children of age five to nine) due to causes affected by Head Start’s health services component.
Next, we revisit the effect of school improvements on student language test scores in Chile,
first studied by Chay, McEwan and Urquiola (2005). The unit of observation is the school, the
treatment is receiving the school improvement program P-900, which was assigned in 1990 based
on an index (the running variable) constructed from previous school-level test scores. The outcome
heteroskedasticity-robust nearest neighbor variance estimation for both applications. In the Head
Start application, the additional regressors Zi are nine county-level covariates from the pre-intervention
1960 U.S. Census: total population, percentage of black and urban population, and levels and per-
centages of population in three age groups (children age 3 to 5, children age 14 to 17, and adults
older than 25). In the education application, the additional covariates are seven binary variables in-
dicating the school’s region group (Chile’s 13 administrative regions were divided into seven groups,
Table 1 presents the main results. In each application’s panel in that table, the first row reports
(depending on the column). The next three rows report 95% robust bias-corrected confidence
intervals, the percentage length change of the covariate-adjusted confidence interval relative to the
unadjusted one, and the p-value associated with a hypothesis of zero RD treatment effect. These
three rows appear twice, first when h for the RD point estimator and b for the bias estimator are
chosen separately (in this case, ρ = h/b is unrestricted), and then when b = h (in this case, ρ = 1).
Finally, the last two rows in each application’s panel report, respectively, the two bandwidths and
the number of observations to the left and to the right of the cutoff with Xi ∈ [x̄ − h, x̄ + h].
The columns in Table 1 correspond to different RD approaches. The first two columns employ
the MSE-optimal bandwidth without covariates—the first column reports inference results without
covariates while the second column presents covariate-adjusted inference. Thus, the second column
is intended to mimic a common practice among practitioners, who sometimes estimate the MSE-
optimal bandwidth without covariates and then include covariates in the estimation and inference
using the same observations (i.e., keeping the bandwidth choice fixed). It follows that, in the second
column, the h and b bandwidths used are the MSE-optimal bandwidths without covariates, and
therefore the point estimator in this second column is no longer MSE-optimal, since the optimal
bandwidths are not used; however, the confidence intervals and p-values are still valid because
the optimal bandwidths with and without covariates have the same rate of decay. The third
the previous sections; in this column, both bandwidths are chosen according to the MSE-optimal
formulas with covariates. In all cases, we use triangular kernel weights and nearest neighbor residual
estimates. Employing other kernels or variance estimators gives very similar empirical results.
Our empirical findings are quite consistent across applications: employing covariate-adjusted
RD inference leads to precision improvements while the point estimators remain stable. In the
Head Start application, the point estimator ranges from −2.41 to −2.51, an effect that is statis-
tically different from zero at 5% significance level in all cases. As should be expected when the
additional covariates are truly pre-determined, including covariates does not substantially alter the
point estimates. (We also implemented “placebo tests” on the additional covariates and found,
this application can lead to sizable efficiency gains: for example, when both h and b are estimated
(ρ = h/b unrestricted), adding covariates within the MSE-optimal h without covariates (h = 6.81)
results in an 8.25% reduction in the length of the 95% confidence interval (column 2), as this confidence
interval shrinks from (−5.49, −0.10) to (−5.37, −0.45). The length of the confidence interval
is even shorter when both bandwidths are chosen optimally using covariates (column 3), one of the
In the case of the education data, we also find that including additional covariates does not
affect the point estimators, while providing some efficiency improvements. In this case, the point
estimates range from 3.45 to 3.49, and the confidence interval length shrinks approximately 3% to
Our empirical results suggest that including pre-intervention covariates can be empirically useful
in real RD applications, thereby illustrating the usefulness of the new methods developed in this
paper.
6.2 Simulation Evidence
We now illustrate our methods using simulated data. We consider four data generating processes
constructed using the data of Lee (2008): all parameters are obtained from the real data unless
explicitly noted otherwise. This model has been used extensively before, see Imbens and Kalya-
naraman (2012) and Calonico et al. (2014, 2016a), among many others. The additional covariate
included is previous democratic vote share, and the four models are distinguished by the importance
of this covariate: (i) in Model 1, the covariate is irrelevant; (ii) in Model 2 it enters the conditional
expectation of the potential outcomes E[Yi (t)|Xi = x, Zi (t)], t ∈ {0, 1}, according to the real data;
(iii) Model 3 takes Model 2 but sets the residual correlation between the outcome and covariate
to zero; (iv) Model 4 takes Model 2 but doubles the residual correlation between the outcome and
covariate equations. Note that Models 3 and 4 do not imply Cov[Yi (t), Zi (t)|Xi = x] = 0, t ∈ {0, 1}.
The constructions allowed E[Yi (t)|Xi = x, Zi (t)] to have different coefficients on each side of the
cutoff, while the conditional expectation of the potential covariates E[Zi (t)|Xi = x], t ∈ {0, 1},
were constructed assuming they are continuous at the cutoff (but still with different coefficients on
either side otherwise). Therefore, our covariate-adjusted RD estimator will be “misspecified” when
viewed as a local weighted least-squares fit. The exact details of our Monte Carlo experiment are
We use a sample of size n = 1,000 and consider 5,000 replications. We compare the standard
RD estimator (τ̂ ) and the covariate-adjusted RD estimator (τ̃ ), with both infeasible and data-driven
MSE-optimal and CER-optimal bandwidth choices. To analyze the performance of our inference
procedures, we report average bias of the point estimators and average coverage rate and interval
length of nominal 95% confidence intervals. In addition, we also explore the performance of our
data-driven bandwidth selectors by reporting some of their main statistical features, such as mean,
median, standard deviation, across the 5,000 replications. We report only one table that presents
estimates using triangular kernel and nearest neighbor (NN) heteroskedasticity-robust variance
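The performance measures described in this paragraph are simple averages over replications. A minimal sketch of how they could be computed from stored simulation output follows; the function name and the dictionary keys are hypothetical, and the inputs are assumed to be one point estimate and one confidence interval per replication.

```python
def summarize_replications(estimates, intervals, true_effect):
    """Average bias, empirical coverage, and average interval length across replications."""
    reps = len(estimates)
    if reps == 0 or reps != len(intervals):
        raise ValueError("need one interval per point estimate")
    bias = sum(estimates) / reps - true_effect
    coverage = sum(1 for lo, hi in intervals if lo <= true_effect <= hi) / reps
    length = sum(hi - lo for lo, hi in intervals) / reps
    return {"bias": bias, "coverage": coverage, "length": length}
```

With nominal 95% intervals, a coverage value close to 0.95 indicates that the distributional approximation is working well in the simulated design.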
The numerical results are given in Tables 2 and 3. All findings are highly consistent with our
large sample theory. Table 2 shows that including covariates can improve both MSE and interval
length, sometimes dramatically, and moreover, the gains are in line with our theory: the gains are
largest in Model 4 with the amplified residual correlation and least in Model 3 when that channel
is shut down. The results for Model 1 show that including an irrelevant covariate hardly changes
empirical results and conclusions. Finally, Table 3 shows that the data-driven bandwidth selectors
7 Conclusion
We provided a formal framework for identification, estimation, and inference in RD designs when
covariates are included in the estimation. We augmented the standard local polynomial estimator
consistent for the standard RD treatment effect if the covariate adjustment is restricted to be
equivalent above and below the cutoff. Furthermore, this estimator can achieve substantial efficiency
gains relative to the unadjusted RD estimator. We also showed that relaxing the latter restriction
with the inclusion of treatment-covariate interactions leads to a point estimator that is not generally
We also provided new MSE expansions, several optimal bandwidth choices and optimal point es-
consistent and cluster-robust standard errors. All these results were obtained for sharp, fuzzy, and
kink RD designs. Finally, we illustrated the practical implications of our results using two empiri-
cal applications and simulated data, and showed that including pre-intervention covariates in RD
designs can lead to useful improvements in precision. All the results presented in this paper are
We give a very succinct account of the main expressions for sharp RD designs, which were omitted
in the main paper to avoid overwhelming notation. A detailed treatment of this and all other RD
Let Rp (h) = [(rp ((X1 − x̄)/h), · · · , rp ((Xn − x̄)/h))′ ] be the n × (1 + p) design matrix, and
K− (h) = diag(1(Xi < x̄)Kh (Xi − x̄) : i = 1, 2, · · · , n) and K+ (h) = diag(1(Xi ≥ x̄)Kh (Xi − x̄) :
also define $\mu^{(a)}_{S-} := (\mu^{(a)}_{Y-}, \mu^{(a)\prime}_{Z-})'$, $\mu^{(a)}_{S+} := (\mu^{(a)}_{Y+}, \mu^{(a)\prime}_{Z+})'$, $a \in \mathbb{Z}_+$, and $\sigma^2_{S-} := V[S_i(0) \mid X_i = \bar{x}]$ and
$\sigma^2_{S+} := V[S_i(1) \mid X_i = \bar{x}]$, recall that $S_i(t) = (Y_i(t), Z_i(t))'$, $t \in \{0, 1\}$. Let $e_\nu$ denote a conformable
$(1 + \nu)$-th unit vector. Finally, recall that $s(h) = (1, -\tilde\gamma_Y(h))'$ and $s = (1, -\gamma_Y)'$.
The pre-asymptotic bias Bτ̃ (h) = Bτ̃ + (h) − Bτ̃ − (h) and its asymptotic counterpart Bτ̃ := Bτ̃ + −
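$$B_{\tilde\tau-}(h) := e_0'\Gamma^{-1}_{-,p}(h)\vartheta_{-,p}(h)\,\frac{s'\mu^{(p+1)}_{S-}}{(p+1)!} \to_P B_{\tilde\tau-} := e_0'\Delta_{p,-}\,\frac{s'\mu^{(p+1)}_{S-}}{(p+1)!},$$
$$B_{\tilde\tau+}(h) := e_0'\Gamma^{-1}_{+,p}(h)\vartheta_{+,p}(h)\,\frac{s'\mu^{(p+1)}_{S+}}{(p+1)!} \to_P B_{\tilde\tau+} := e_0'\Delta_{p,+}\,\frac{s'\mu^{(p+1)}_{S+}}{(p+1)!},$$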
where, with the (slightly abusive) notation $v^k = (v_1^k, v_2^k, \cdots, v_n^k)'$, $\iota_n = (1, \cdots, 1)' \in \mathbb{R}^n$, $\Gamma_{-,p}(h) =
R_p(h)'K_-(h)R_p(h)/n$ and $\vartheta_{-,p}(h) = R_p(h)'K_-(h)((X - \bar{x}\iota_n)/h)^{p+1}/n$, Γ+,p (h) and ϑ+,p (h) defined
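$$\Delta_{p,-} := \left( \int_{-\infty}^{0} r_p(u) r_p(u)' K(u)\,du \right)^{-1} \int_{-\infty}^{0} r_p(u)\, u^{1+p} K(u)\,du,$$
$$\Delta_{p,+} := \left( \int_{0}^{\infty} r_p(u) r_p(u)' K(u)\,du \right)^{-1} \int_{0}^{\infty} r_p(u)\, u^{1+p} K(u)\,du.$$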
The pre-asymptotic variance Vτ̃ (h) = Vτ̃ − (h) + Vτ̃ + (h) and its asymptotic counterpart Vτ̃ :=
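$$V_{\tilde\tau-}(h) := [s' \otimes e_0'P_{-,p}(h)]\,\Sigma_{S-}\,[s \otimes P_{-,p}(h)'e_0] \to_P V_{\tilde\tau-} := \frac{s'\sigma^2_{S-}s}{f}\, e_0'\Lambda_{p,-}e_0,$$
$$V_{\tilde\tau+}(h) := [s' \otimes e_0'P_{+,p}(h)]\,\Sigma_{S+}\,[s \otimes P_{+,p}(h)'e_0] \to_P V_{\tilde\tau+} := \frac{s'\sigma^2_{S+}s}{f}\, e_0'\Lambda_{p,+}e_0,$$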
where $P_{-,p}(h) = \sqrt{h}\,\Gamma^{-1}_{-,p}(h)R_p(h)'K_-(h)/\sqrt{n}$ and $P_{+,p}(h) = \sqrt{h}\,\Gamma^{-1}_{+,p}(h)R_p(h)'K_+(h)/\sqrt{n}$, and
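$$\Lambda_{p,-} := \left( \int_{-\infty}^{0} r_p(u) r_p(u)' K(u)\,du \right)^{-1} \left( \int_{-\infty}^{0} r_p(u) r_p(u)' K(u)^2\,du \right) \left( \int_{-\infty}^{0} r_p(u) r_p(u)' K(u)\,du \right)^{-1},$$
$$\Lambda_{p,+} := \left( \int_{0}^{\infty} r_p(u) r_p(u)' K(u)\,du \right)^{-1} \left( \int_{0}^{\infty} r_p(u) r_p(u)' K(u)^2\,du \right) \left( \int_{0}^{\infty} r_p(u) r_p(u)' K(u)\,du \right)^{-1}.$$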
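The kernel constants entering the asymptotic bias and variance can be computed in closed form for common choices. As a worked sketch under stated assumptions (triangular kernel K(u) = (1 − |u|)1(|u| ≤ 1), local linear fit p = 1, right side of the cutoff, one scalar covariate), the code below evaluates the first element of ∆p,+ and the (0, 0) element of Λp,+ exactly, and then combines the latter with the quadratic form s′σ²S+ s/f. The function names and the numerical values used for the variances, γ, and f are hypothetical.

```python
from fractions import Fraction

def triangular_moment(k, kernel_power=1):
    """Exact integral of u^k (1 - u)^kernel_power over [0, 1], for kernel_power 1 or 2."""
    if kernel_power == 1:
        return Fraction(1, (k + 1) * (k + 2))
    if kernel_power == 2:
        return Fraction(2, (k + 1) * (k + 2) * (k + 3))
    raise ValueError("kernel_power must be 1 or 2")

def boundary_constants_local_linear():
    """Return (e0' Delta_{1,+}, e0' Lambda_{1,+} e0) for the triangular kernel."""
    g00, g01, g11 = (triangular_moment(k) for k in (0, 1, 2))
    det = g00 * g11 - g01 ** 2
    row0 = (g11 / det, -g01 / det)  # first row of the inverse kernel moment matrix
    theta = (triangular_moment(2), triangular_moment(3))  # integral of r_1(u) u^2 K(u)
    delta0 = row0[0] * theta[0] + row0[1] * theta[1]
    psi = [[triangular_moment(a + b, 2) for b in (0, 1)] for a in (0, 1)]
    lambda00 = sum(row0[a] * psi[a][b] * row0[b] for a in (0, 1) for b in (0, 1))
    return delta0, lambda00

def one_sided_asymptotic_variance(sigma_yy, sigma_yz, sigma_zz, gamma, density, lambda00):
    """(s' sigma^2_S s / f) * e0' Lambda e0 with s = (1, -gamma)' and scalar Z."""
    quad = sigma_yy - 2 * gamma * sigma_yz + gamma ** 2 * sigma_zz
    return quad / density * lambda00
```

Setting `gamma` to zero recovers the constant for the estimator without covariates, so comparing the two calls shows how the covariate adjustment changes the asymptotic variance on one side of the cutoff for the chosen inputs.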
To construct pre-asymptotic estimates of the bias terms, we replace the only unknowns, $\mu^{(p+1)}_{S-}$
and $\mu^{(p+1)}_{S+}$, by q-th order (p < q) local polynomial estimates thereof, using the preliminary bandwidth
b. This leads to the pre-asymptotic feasible bias estimate B̃τ̃ (b) := B̃τ̃ + (b) − B̃τ̃ − (b) with
$$\tilde{B}_{\tilde\tau-}(b) := e_0'\Gamma^{-1}_{-,p}(h)\vartheta_{-,p}(h)\,\frac{s(h)'\tilde\mu^{(p+1)}_{S-,q}(b)}{(p+1)!} \quad \text{and} \quad \tilde{B}_{\tilde\tau+}(b) := e_0'\Gamma^{-1}_{+,p}(h)\vartheta_{+,p}(h)\,\frac{s(h)'\tilde\mu^{(p+1)}_{S+,q}(b)}{(p+1)!},$$
where $\tilde\mu^{(p+1)}_{S-,q}(b)$ and $\tilde\mu^{(p+1)}_{S+,q}(b)$ collect the q-th order local polynomial estimates of the (p + 1)-th
derivatives using as outcomes each of the variables in Si = (Yi , Z′i )′ for control and treatment units,
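$$\tilde\tau^{bc}(h) = \frac{1}{\sqrt{nh}}\,[s(h)' \otimes e_0'(P^{bc}_{+,p}(h, b) - P^{bc}_{-,p}(h, b))]\,S,$$
$$P^{bc}_{-,p}(h, b) = \sqrt{h}\,\Gamma^{-1}_{-,p}(h)\left[ R_p(h)'K_-(h) - \rho^{1+p}\vartheta_{-,p}(h)e_{p+1}'\Gamma^{-1}_{-,q}(b)R_q(b)'K_-(b) \right]/\sqrt{n},$$
$$P^{bc}_{+,p}(h, b) = \sqrt{h}\,\Gamma^{-1}_{+,p}(h)\left[ R_p(h)'K_+(h) - \rho^{1+p}\vartheta_{+,p}(h)e_{p+1}'\Gamma^{-1}_{+,q}(b)R_q(b)'K_+(b) \right]/\sqrt{n},$$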
where $P^{bc}_{-,p}(h, b)$ and $P^{bc}_{+,p}(h, b)$ are directly computable from observed data, given the choices of
bandwidth h and b, with ρ = h/b, and the choices of polynomial order p and q, with p < q.
estimator follows directly from the formulas above. All other details such as preliminary bandwidth
selection, plug-in data-driven MSE-optimal bandwidth estimation, and other extensions and results,
References
Abadie, A., and Imbens, G. W. (2006), “Large Sample Properties of Matching Estimators for
Abadie, A., and Imbens, G. W. (2008), “Estimation of the Conditional Variance in Paired Experi-
Angrist, J., and Rokkanen, M. (2015), “Wanna Get Away? Regression Discontinuity Estimation
of Exam School Effects Away from the Cutoff,” Journal of the American Statistical Association,
110, 1331–1344.
Armstrong, T. B., and Kolesar, M. (2015), “A Simple Adjustment for Bandwidth Snooping,”
arXiv:1412.0267.
Bartalotti, O., and Brummet, Q. (2016), “Regression Discontinuity Designs with Clustered Data:
Mean Square Error and Bandwidth Choice,” in Regression Discontinuity Designs: Theory and
Calonico, S., Cattaneo, M. D., and Farrell, M. H. (2016a), “On the Effect of Bias Estimation on
Calonico, S., Cattaneo, M. D., Farrell, M. H., and Titiunik, R. (2016b), “rdrobust: Software for
Calonico, S., Cattaneo, M. D., and Titiunik, R. (2014), “Robust Nonparametric Confidence Inter-
Canay, I. A., and Kamat, V. (2015), “Approximate Permutation Tests and Induced Order Statistics
Card, D., Lee, D. S., Pei, Z., and Weber, A. (2015), “Inference on Causal Effects in a Generalized
Cattaneo, M. D., Frandsen, B., and Titiunik, R. (2015), “Randomization Inference in the Regression
Discontinuity Design: An Application to Party Advantages in the U.S. Senate,” Journal of Causal
Inference, 3, 1–24.
Cattaneo, M. D., Titiunik, R., and Vazquez-Bare, G. (2016), “Comparing Inference Approaches for
RD Designs: A Reexamination of the Effect of Head Start on Child Mortality,” working paper,
University of Michigan.
Chay, K. Y., McEwan, P. J., and Urquiola, M. (2005), “The Central Role of Noise in Evaluating
Interventions That Use Test Scores to Rank Schools,” American Economic Review, 95, 1237–
1258.
de la Cuesta, B., and Imai, K. (2016), “Misunderstandings about the Regression Discontinuity
Design in the Study of Close Elections,” Annual Review of Political Science, forthcoming, 19.
Dong, Y., and Lewbel, A. (2015), “Identifying the Effect of Changing the Policy Threshold in
Fan, J., and Gijbels, I. (1996), Local Polynomial Modelling and Its Applications, New York: Chap-
Gelman, A., and Imbens, G. W. (2014), “Why High-Order Polynomials Should Not be Used in
Hahn, J., Todd, P., and van der Klaauw, W. (2001), “Identification and Estimation of Treatment
Imbens, G., and Lemieux, T. (2008), “Regression Discontinuity Designs: A Guide to Practice,”
Imbens, G. W., and Kalyanaraman, K. (2012), “Optimal Bandwidth Choice for the Regression
Imbens, G. W., and Rubin, D. B. (2015), Causal Inference in Statistics, Social, and Biomedical
arXiv:1505.06483.
Keele, L. J., Titiunik, R., and Zubizarreta, J. (2015), “Enhancing a Geographic Regression Discon-
tinuity Design Through Matching to Estimate the Effect of Ballot Initiatives on Voter Turnout,”
Lee, D. S. (2008), “Randomized Experiments from Non-random Selection in U.S. House Elections,”
Lee, D. S., and Lemieux, T. (2010), “Regression Discontinuity Designs in Economics,” Journal of
Long, J. S., and Ervin, L. H. (2000), “Using Heteroscedasticity Consistent Standard Errors in the
Ludwig, J., and Miller, D. L. (2007), “Does Head Start Improve Children’s Life Chances? Evidence
and Future Directions in Causality, Prediction, and Specification Analysis, eds. X. Chen and N. R.
Swanson, Springer.
Porter, J. (2003), “Estimation in the Regression Discontinuity Model,” working paper, University
of Wisconsin.
Skovron, C., and Titiunik, R. (2016), “A Practical Guide to Regression Discontinuity Designs in
Table 1: Empirical Illustrations
Notes:
(i) All estimates are computed using a triangular kernel and nearest neighbor heteroskedasticity-robust variance
estimators.
(ii) Columns under “Standard” and “Cov-adjusted” correspond to, respectively, standard and covariate-adjusted RD
estimation and inference methods, given a choice of bandwidths.
(iii) Bandwidths used (h and b) are data-driven MSE-optimal for either the standard RD estimator or the covariate-adjusted
RD estimator (depending on the group of columns). Specifically, in the first two columns the bandwidths are selected
to be MSE-optimal for τ̂ (standard RD estimation), while in the third column the bandwidths are selected to be
MSE-optimal for τ̃ (covariate-adjusted RD estimation).
Table 2: Simulation Results (MSE, Bias, Empirical Coverage and Interval Length)

                       τ̂                             τ̃                         Change (%)
            √MSE   Bias     EC     IL     √MSE   Bias     EC     IL     √MSE   Bias     EC     IL
Model 1
MSE-POP    0.045  0.012  0.938  0.199    0.045  0.012  0.934  0.198      0.2    0.1   −0.4   −0.6
MSE-EST    0.045  0.018  0.924  0.171    0.045  0.018  0.927  0.171      0.0   −1.0    0.3   −0.2
CER-POP    0.052  0.006  0.934  0.242    0.052  0.006  0.929  0.240      0.4    1.2   −0.5   −0.9
CER-EST    0.049  0.010  0.940  0.207    0.049  0.010  0.933  0.206      0.5   −1.5   −0.7   −0.5
Model 2
MSE-POP    0.047  0.013  0.935  0.213    0.041  0.008  0.941  0.185    −13.5  −33.6    0.6  −13.4
MSE-EST    0.048  0.017  0.929  0.188    0.041  0.011  0.932  0.163    −15.1  −34.8    0.3  −13.4
CER-POP    0.054  0.006  0.933  0.258    0.048  0.004  0.931  0.223    −11.7  −34.1   −0.2  −13.6
CER-EST    0.053  0.009  0.941  0.227    0.046  0.006  0.940  0.196    −13.1  −34.1   −0.2  −13.5
Model 3
MSE-POP    0.044  0.013  0.935  0.200    0.043  0.010  0.938  0.193     −3.3  −19.6    0.3   −3.5
MSE-EST    0.046  0.017  0.926  0.177    0.043  0.014  0.929  0.170     −5.5  −17.2    0.3   −4.0
CER-POP    0.051  0.006  0.933  0.243    0.050  0.005  0.930  0.234     −1.8  −20.9   −0.3   −3.8
CER-EST    0.050  0.009  0.939  0.213    0.048  0.008  0.939  0.205     −4.0  −16.8    0.0   −4.2
Model 4
MSE-POP    0.050  0.013  0.938  0.225    0.035  0.007  0.938  0.160    −29.3  −46.6    0.1  −28.8
MSE-EST    0.051  0.017  0.931  0.199    0.035  0.008  0.938  0.142    −30.5  −52.1    0.8  −28.4
CER-POP    0.058  0.006  0.934  0.273    0.042  0.003  0.926  0.194    −27.6  −43.6   −0.9  −29.0
CER-EST    0.056  0.009  0.942  0.240    0.040  0.005  0.934  0.171    −27.9  −51.2   −0.8  −28.5
Notes:
(i) All estimators are computed using the triangular kernel, NN variance estimation, and two bandwidths (h and b).
(ii) Columns τ̂ and τ̃ correspond to, respectively, standard RD estimation and covariate-adjusted RD estimation;
columns “√MSE” report the square root of the mean square error of the point estimator; columns “Bias” report average
bias relative to the target population parameter; and columns “EC” and “IL” report, respectively, empirical coverage
and interval length of robust bias-corrected 95% confidence intervals.
(iii) Rows correspond to bandwidth method used to construct the estimator and inference procedures. Rows “MSE-
POP” and “MSE-EST” correspond to, respectively, procedures using infeasible population and feasible data-driven
MSE-optimal bandwidths (without or with covariate adjustment depending on the column). Rows “CER-POP” and
“CER-EST” correspond to, respectively, procedures using infeasible population and feasible data-driven CER-optimal
bandwidths (without or with covariate adjustment depending on the column).
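The summary measures described in the notes can be computed from simulation output along the following lines. This is a minimal sketch under stated assumptions, not the authors' simulation code: the function names `summarize` and `pct_change` are hypothetical, bias is taken as the signed mean error, interval length is averaged across replications, and the "Change (%)" columns are assumed to be percent changes from the standard to the covariate-adjusted procedure. Because the table reports rounded values, recomputing changes from the printed entries will generally not match the reported percentages exactly.

```python
import numpy as np

def summarize(estimates, ci_lower, ci_upper, target):
    """Summarize Monte Carlo output for one estimator and one bandwidth method.

    estimates, ci_lower, ci_upper: one entry per simulation replication.
    target: the population parameter the estimator is aiming at."""
    est = np.asarray(estimates, dtype=float)
    lo = np.asarray(ci_lower, dtype=float)
    hi = np.asarray(ci_upper, dtype=float)
    err = est - target
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),             # square root of MSE
        "bias": float(np.mean(err)),                           # average (signed) bias
        "ec": float(np.mean((lo <= target) & (target <= hi))), # empirical coverage
        "il": float(np.mean(hi - lo)),                         # average interval length
    }

def pct_change(adjusted, standard):
    """Percent change of the covariate-adjusted value relative to the standard one."""
    return 100.0 * (adjusted - standard) / standard
```

A typical use would call `summarize` once per estimator and then apply `pct_change` to each pair of matching entries.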
Table 3: Simulation Results (Data-Driven Bandwidth Selectors)

            Pop.     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.  Std. Dev.
Model 1
ĥτ̂        0.144    0.097    0.168    0.191    0.195    0.217    0.338      0.041
h̃τ̃        0.144    0.094    0.166    0.189    0.194    0.217    0.338      0.040
Model 2
ĥτ̂        0.156    0.092    0.170    0.193    0.198    0.222    0.335      0.041
h̃τ̃        0.158    0.095    0.171    0.197    0.201    0.227    0.336      0.042
Model 3
ĥτ̂        0.156    0.091    0.169    0.193    0.197    0.221    0.335      0.040
h̃τ̃        0.154    0.095    0.170    0.194    0.198    0.223    0.334      0.041
Model 4
ĥτ̂        0.156    0.093    0.170    0.194    0.198    0.223    0.334      0.041
h̃τ̃        0.161    0.088    0.172    0.199    0.203    0.231    0.336      0.043
Notes:
(i) All estimators are computed using the triangular kernel and NN variance estimation.
(ii) Column “Pop.” reports target population bandwidth, while the other columns report summary statistics of the
distribution of feasible data-driven estimated bandwidths.
(iii) Rows ĥτ̂ and h̃τ̃ correspond to feasible data-driven MSE-optimal bandwidth selectors without and with covariate
adjustment, respectively.