Why experimenters might not always want to randomize, and what they could do instead
Author(s): Maximilian Kasy
Source: Political Analysis, Vol. 24, No. 3 (Summer 2016), pp. 324-338
Published by: Cambridge University Press on behalf of the Society for Political
Methodology
Maximilian Kasy
Department of Economics, Harvard University, 1805 Cambridge Street,
Cambridge, MA 02138, USA
e-mail: [email protected] (corresponding author)
Suppose that an experimenter has collected a sample as well as baseline information about the units in the sample. How should she allocate treatments to the units in this sample? We argue that the answer does not involve randomization if we think of experimental design as a statistical decision problem. If, for instance, the experimenter is interested in estimating the average treatment effect and evaluates an estimate in terms of the squared error, then she should minimize the expected mean squared error (MSE) through choice of a treatment assignment. We provide explicit expressions for the expected MSE that lead to easily implementable procedures for experimental design.
1 Introduction
Author's note: I thank Alberto Abadie, Ivan Canay, Gary Chamberlain, Raj Chetty, Nathaniel Hendren, Larry Katz, Gary King, Michael Kremer, and Don Rubin, as well as seminar participants at the Harvard retreat; the Harvard Labor Economics Workshop; the Harvard Quantitative Issues in Cancer Research seminar; the Harvard Applied Statistics Seminar; UT Austin, Princeton, Columbia, and Northwestern econometrics seminars; RAND; and the 2013 CEME Conference at Stanford for helpful discussions. Replication data are available at the Harvard Dataverse at http://dx.doi.org/10.7910/DVN/I5KCWI. See Kasy (2016). Supplementary materials for this article are available on the Political Analysis Web site.
© The Author 2016. Published by Oxford University Press on behalf of the Society for Political Methodology. All rights reserved. For permissions, please email: [email protected]
2.1 Setup
Before we present our general results and our proposed procedure, let us discuss a simple
motivating example. The example is stylized to allow calculations "by hand," but the intuitions
¹If experimenters have a preference for randomization for reasons outside the decision problem considered in the present
article, a reasonable variant of the procedure suggested here would be to randomize among a set of assignments that are
"near-minimizers" of risk. If we are worried about manipulation of covariates, in particular, a final coin flip that
possibly switches treatment and control groups might be helpful. I thank Michael Kremer for this suggestion.
Table 1. [Table body not recoverable from the scan: bias, variance, and MSE of β̂ for each of the sixteen possible treatment assignments under models 1 and 2, indicator columns for designs 1-5, and the Bayesian expected MSE.]
Notes: Each row of this table corresponds to one possible treatment assignment (d₁, ..., d₄). The columns for "model 1" correspond to the model Yᵢ = Xᵢ + dᵢ + εᵢ, and the columns for "model 2" to the model Yᵢ = −Xᵢ² + dᵢ + εᵢ. Each row shows the bias, variance, and MSE of β̂ for the given assignment and model. The designs 1-5 correspond to uniform random draws from the assignments with rows marked by an entry of 1. Design 1 randomizes over all rows, design 2 over rows one through fourteen, etc. The last column of the table shows the Bayesian expected MSE for each assignment for the squared exponential prior discussed below. For details, see Section 2.
from this example generalize. Suppose an experimenter has a sample of four experimental units, i = 1, ..., 4, and she observes a continuous covariate Xᵢ for each of them, where it so happens that (X₁, ..., X₄) = (x₁, ..., x₄) = (0, 1, 2, 3). She assigns every unit to one of two binary treatments, dᵢ ∈ {0, 1}.²
Our experimenter wants to estimate the (conditional) average treatment effect of treatment D across these four units,

β = (1/4) Σᵢ₌₁⁴ E[Yᵢ¹ − Yᵢ⁰ | Xᵢ].  (1)
The experimenter plans to estimate this treatment effect by calculating the difference in means across treatment and comparison groups in her experiment, that is,

β̂ = (1/n₁) Σ_{i: dᵢ=1} Yᵢ − (1/n₀) Σ_{i: dᵢ=0} Yᵢ.  (2)

Since there are four experimental units, there are 2⁴ = 16 possible treatment assignments. The sixteen rows of Table 1 correspond to these assignments.³ In the first row, (d₁, ..., d₄) = (0, 1, 1, 0); in the second row, (d₁, ..., d₄) = (1, 0, 0, 1); etc.
²Consider for instance the setting of Nyhan and Reifler (2014), where legislators i received letters about fact-checking (Dᵢ = 1) or did not (Dᵢ = 0). The outcome of interest Yᵢ in this case is the future fact-checking rating of a legislator; an important covariate Xᵢ might be their past rating.
³The code producing Table 1 is available online at Kasy (2016). At this address, we also provide code implementing our proposed approach in practice.
Suppose first that the true data-generating process is given by model 1 of Table 1,

Yᵢ = Xᵢ + dᵢ + εᵢ,  (3)

where the εᵢ are independent given X, and have mean 0 and variance σ² = 1. The average treatment effect in this model is equal to 1. For each treatment assignment (d₁, ..., dₙ), we can calculate the corresponding bias, variance, and MSE of β̂, where

Var(β̂) = σ²/n₁ + σ²/n₀,  (4)

with n₁ being the number of units i receiving treatment dᵢ = 1, and n₀ being the number of units receiving dᵢ = 0. The bias is given by

Bias(β̂) = X̄₁ − X̄₀,  (5)

where X̄₁ and X̄₀ are the means of the covariate Xᵢ in the treatment and control groups.
Suppose now for a moment that the information about covariates got lost: somebody deleted the column of Xᵢ in your spreadsheet. Then every i looks the same before treatment is assigned. The variance of potential outcomes Yᵢᵈ now includes both the part due to εᵢ and the part due to Xᵢ, and is equal to Var(Yᵢᵈ) = Var(Xᵢ) + Var(εᵢ) = Var(Xᵢ) + 1. Since units i are indistinguishable in this case, treatment assignments are effectively only distinguished by the number of units treated. When we observe no covariates and have random sampling, there is no bias (even ex post), and the MSE of any assignment with n₁ treated units is equal to

MSE = (Var(Xᵢ) + 1) · (1/n₁ + 1/n₀).
Let us now, and for the rest of our discussion, assume again that we observe the covariates Xᵢ. The calculations we have done so far for this case are for a fixed (deterministic) assignment of treatments. What if we use randomization? Any randomization procedure in our setting can be described by the probabilities p(d₁, d₂, d₃, d₄) it assigns to the different rows of Table 1. The MSE of such a procedure is given by the corresponding weighted average of MSEs for each row:

MSE(p) = Σ_{(d₁,...,d₄)} p(d₁, ..., d₄) · MSE(d₁, ..., d₄).
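This weighted-average calculation is straightforward to carry out. The following sketch (an illustration in Python, not the paper's Matlab replication code) computes the MSE of the difference in means for every assignment under model 1 with σ² = 1, and then averages across rows to obtain the MSE of a randomized design:

```python
import itertools

x = [0.0, 1.0, 2.0, 3.0]   # observed covariates
sigma2 = 1.0               # residual variance, as in model 1

def mse(d):
    """Bias^2 + variance of the difference-in-means estimator under
    model 1 (Y_i = X_i + d_i + eps_i), for a fixed assignment d."""
    treated = [xi for xi, di in zip(x, d) if di == 1]
    control = [xi for xi, di in zip(x, d) if di == 0]
    bias = sum(treated) / len(treated) - sum(control) / len(control)
    var = sigma2 / len(treated) + sigma2 / len(control)
    return bias ** 2 + var

# All assignments with at least one unit in each arm (the difference in
# means is undefined when one arm is empty).
rows = [d for d in itertools.product([0, 1], repeat=4) if 0 < sum(d) < 4]

# A randomized design is a probability vector over rows; its MSE is the
# weighted average of per-row MSEs.  Uniform randomization over all rows:
uniform_mse = sum(mse(d) for d in rows) / len(rows)

# Randomizing only over the two zero-bias rows (0,1,1,0) and (1,0,0,1):
best_two_mse = (mse((0, 1, 1, 0)) + mse((1, 0, 0, 1))) / 2
```

Uniform randomization over all rows yields a strictly higher MSE than randomizing over just the two zero-bias assignments, consistent with the comparison of designs in Table 1.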
"This is the ex post bias, given covariates and treatment assignment. This is the relevant notion of bias for us. It is
different from the ex ante bias, which is the expected bias when we do not know the treatment assignment yet.
Yᵢ = −Xᵢ² + dᵢ + εᵢ.  (9)
Since we do not know what the true state of the world is, we cannot evaluate a procedure based on its actual loss. We can, however, consider various notions of average loss. If we average loss over the randomness of a sampling scheme, and possibly over the randomness of a treatment allocation scheme, holding the state of the world θ fixed, we obtain the frequentist risk function:

R(δ, θ) = E[L(δ(X, U), θ) | θ].
In terms of Table 1, Bayes risk averages the MSE for a given assignment δᵘ and data-generating process θ, R(δᵘ, θ), both across the rows, using the randomization device U, and across different models for the data-generating process, using the prior distribution π. If, alternatively, we evaluate each action based on the worst-case scenario for that action, we obtain minimax risk:

Rᵐᵐ(δ) = Σᵤ (sup_θ R(δᵘ, θ)) · P(U = u).

In terms of Table 1, minimax risk evaluates each row (realized assignment δᵘ) in terms of the worst-case model for this assignment, sup_θ R(δᵘ, θ), and evaluates randomized designs in terms of the weighted average across the worst-case risk for each assignment.
Letting R*(δ) denote either Bayes or minimax risk, we have

δ* ∈ argmin_a R*(a),  (14)

where the argmin is taken over non-randomized decision functions a = δ(X). It follows that

R*(δ*) ≤ min_u R*(δᵘ) ≤ R*(δ).  (15)

The second inequality is strict unless δ only puts positive probability on the assignments u that minimize R*(δᵘ).
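Inequality (15) can be illustrated numerically in the example of Section 2. The sketch below (Python, for illustration; model 2 is taken to be Yᵢ = −Xᵢ² + dᵢ + εᵢ, as read off the notes to Table 1) computes each assignment's worst-case MSE over the two models and compares the best deterministic assignment with uniform randomization:

```python
import itertools

x = [0.0, 1.0, 2.0, 3.0]
models = [lambda xi: xi, lambda xi: -xi ** 2]   # model 1 and model 2
sigma2 = 1.0

def mse(d, f):
    """MSE of the difference in means when the true outcome model is
    Y_i = f(X_i) + d_i + eps_i with Var(eps_i) = sigma2."""
    treated = [f(xi) for xi, di in zip(x, d) if di == 1]
    control = [f(xi) for xi, di in zip(x, d) if di == 0]
    bias = sum(treated) / len(treated) - sum(control) / len(control)
    return bias ** 2 + sigma2 / len(treated) + sigma2 / len(control)

rows = [d for d in itertools.product([0, 1], repeat=4) if 0 < sum(d) < 4]

# Worst-case (maximum over the two models) MSE of each fixed assignment:
worst = {d: max(mse(d, f) for f in models) for d in rows}

# Best deterministic assignment vs. uniform randomization over all rows:
det_minimax = min(worst.values())
randomized_minimax = sum(worst.values()) / len(worst)
```

Averaging worst-case risks across rows can never beat the best single row, so the deterministic minimax value is weakly (here strictly) smaller than that of the uniformly randomized design.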
⁶Squared error loss is the canonical loss function in the literature on estimation. It has a lot of convenient properties allowing for tractable analytic results, and in keeping with the literature we will focus on squared error loss. Other objective functions are conceivable, such as for instance expected welfare of treatment assignments based on experimental evidence; see for instance Kasy (2014).
β = (1/n) Σᵢ₌₁ⁿ [f(Xᵢ, 1) − f(Xᵢ, 0)].  (21)

The conditional average treatment effect is the object of interest if we want to learn about treatment effects for units, in the population from which our sample was drawn, which look similar in terms of covariates. We might be interested in this effect both for scientific reasons and for policy reasons (deciding about future treatment allocations). One more question has to be settled before we can give an expression for the Bayes risk of any given treatment assignment: How is β going to be estimated? We will consider two possibilities. The first possibility is that we use an estimator that is optimal in the Bayesian sense, namely the posterior best linear predictor of β. The second possibility is that we estimate β using a simple difference in means, as in the example of Section 2.
μᵢ = μ(Xᵢ, dᵢ),
Cᵢⱼ = C((Xᵢ, dᵢ), (Xⱼ, dⱼ)),
μ_β = (1/n) Σᵢ [μ(Xᵢ, 1) − μ(Xᵢ, 0)],  (22)
C_{β,i} = (1/n) Σⱼ [C((Xᵢ, dᵢ), (Xⱼ, 1)) − C((Xᵢ, dᵢ), (Xⱼ, 0))].

Let μ and C be the corresponding vector and matrix with entries μᵢ and Cᵢⱼ, and let C_β be the vector with entries C_{β,i}. Note that μ, C, and C_β depend on the treatment assignment (d₁, ..., dₙ). Using this notation, the prior mean and variance of Y are equal to μ and C + σ²I, and the prior mean of β equals μ_β.
Let us now consider the posterior best linear predictor, which is the best estimator (in the Bayesian risk sense) among all estimators linear in Y.⁷ The posterior best linear predictor is equal to the posterior expectation if both the prior for f and the distribution of the residuals Y − f are multivariate normal.
Proposition 2 (Posterior best linear predictor and expected loss). The posterior best linear predictor for the conditional average treatment effect is given by

β̂ = μ_β + C_β′ · (C + σ²I)⁻¹ · (Y − μ),  (23)

and its expected loss (MSE) equals

E[(β̂ − β)² | X, D] = Var(β | X) − C_β′ · (C + σ²I)⁻¹ · C_β.  (24)

The proof of this proposition follows from standard characterizations of best linear predictors and can be found in Appendix A. The expression for the MSE provided by equation (24) is easily evaluated for any choice of (d₁, ..., dₙ). Since our goal is to minimize the MSE, we can in fact ignore the Var(β | X) term, which does not depend on (d₁, ..., dₙ).
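Equation (24) is easy to evaluate numerically. The sketch below (Python with NumPy; the squared exponential kernel with the treatment indicator entering as an extra coordinate is an illustrative choice, not necessarily the exact prior behind Table 1) computes the Bayes risk of every assignment in the four-unit example and picks the minimizer by brute force:

```python
import itertools
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
n, sigma2 = len(x), 1.0

def k(x1, d1, x2, d2):
    """Squared exponential prior covariance; treating d as an extra
    coordinate is an illustrative (hypothetical) kernel choice."""
    return np.exp(-0.5 * ((x1 - x2) ** 2 + (d1 - d2) ** 2))

# Prior variance of beta = (1/n) sum_i [f(X_i,1) - f(X_i,0)];
# this term does not depend on the assignment d.
var_beta = sum(
    k(x[i], 1, x[j], 1) - k(x[i], 1, x[j], 0)
    - k(x[i], 0, x[j], 1) + k(x[i], 0, x[j], 0)
    for i in range(n) for j in range(n)
) / n ** 2

def bayes_risk(d):
    """Expected MSE of the posterior BLP:
    Var(beta|X) - C_beta' (C + sigma^2 I)^-1 C_beta, equation (24)."""
    C = np.array([[k(x[i], d[i], x[j], d[j]) for j in range(n)]
                  for i in range(n)])
    C_b = np.array([
        sum(k(x[i], d[i], x[j], 1) - k(x[i], d[i], x[j], 0)
            for j in range(n)) / n
        for i in range(n)
    ])
    return var_beta - C_b @ np.linalg.solve(C + sigma2 * np.eye(n), C_b)

# Minimize the prior risk over all 2^n assignments (feasible for small n)
best = min(itertools.product([0, 1], repeat=n), key=bayes_risk)
```

For n = 4 brute force is instant; Section 5 of this chunk discusses why it does not scale and what to do instead.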
Let us next consider the alternative case where the experimenter uses the simple difference in means estimator. We will need the following notation:

μᵢᵈ = μ(Xᵢ, d),
C_{ij}^{d¹d²} = C((Xᵢ, d¹), (Xⱼ, d²)),  (25)

for d, d¹, d² ∈ {0, 1}. We collect these terms in the vectors μᵈ and matrices C^{d¹d²}, which are in turn collected as

μ = (μ⁰, μ¹),    C = ( C⁰⁰  C⁰¹ ; C¹⁰  C¹¹ ).  (26)
⁷This class of estimators includes all standard estimators of β under unconfoundedness, such as those based on matching, inverse probability weighting, regression with controls, kernel regression, series regression, splines, etc. Linearity of the estimator is unrelated to any assumption of linearity in X; we are considering the posterior BLP of β in Y rather than the BLP of Yᵢ in Xᵢ.
MSE(d₁, ..., dₙ) = σ² · (1/n₁ + 1/n₀) + (w′ · μ)² + w′ · C · w,  (28)

where w = (w⁰, w¹) with

wᵢ¹ = dᵢ/n₁ − 1/n,
wᵢ⁰ = 1/n − (1 − dᵢ)/n₀.  (29)
The expression for the MSE in this proposition has three terms. The first term is the variance of
the estimator. The second and third terms together are the expected squared bias. This splits in turn
into the square of the prior expected bias, and the prior variance of the bias.
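As a check on this decomposition, under model 1 of Section 2, where f(x, d) = x + d, the prior expected bias w′ · μ (with the prior centered at the truth, μ = f) reduces to the covariate imbalance X̄₁ − X̄₀; a minimal sketch:

```python
x = [0.0, 1.0, 2.0, 3.0]
n = len(x)

def f(xi, d):
    """Model 1 of Section 2: f(x, d) = x + d."""
    return xi + d

def bias_via_weights(d):
    """w' . mu with the weights of equation (29), mu evaluated at model 1."""
    n1 = sum(d)
    n0 = n - n1
    w1 = [di / n1 - 1 / n for di in d]
    w0 = [1 / n - (1 - di) / n0 for di in d]
    return sum(w1[i] * f(x[i], 1) + w0[i] * f(x[i], 0) for i in range(n))

def bias_direct(d):
    """Difference in covariate means, Xbar_1 - Xbar_0."""
    treated = [x[i] for i in range(n) if d[i] == 1]
    control = [x[i] for i in range(n) if d[i] == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)
```

The two computations agree for every assignment with nonempty arms, which is exactly the balance logic discussed next.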
It is interesting to note that we recover the standard notion of balance if, in addition to the assumptions of Proposition 3, we impose a linear, separable model for f, that is, f(x, d) = x′ · γ + d · β.  (30) Under this model, the bias of the difference in means estimator equals

(X̄¹ − X̄⁰)′ · γ,  (31)

where X̄ᵈ is the sample mean of covariates in experimental arm d, so that the prior expected squared bias is a quadratic form in the difference in covariate means. Risk is thus minimized by choosing treatment and control arms of equal size, and by optimizing balance as measured by the difference in covariate means (X̄¹ − X̄⁰).
We now have almost all the ingredients for a procedure that can be used by practitioners. One important element is missing: How do we find the optimal assignment d*? Or how do we at least find a set of assignments that are close to optimal in terms of expected risk? This matters, since solving the problem by brute force is generally not feasible. We could do so for the example in Section 2, since in this example there were only 2⁴ = 16 possible treatment assignments. In general, there are 2ⁿ assignments for a sample of size n, a number that very quickly becomes prohibitive to search by brute force.
More sophisticated alternative optimization methods are discussed in the Online Appendix.
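A simple heuristic alternative to brute force is greedy local search with random restarts: start from a random assignment and repeatedly flip the single treatment indicator that most reduces the risk. The sketch below (an illustration, not the method of the Online Appendix) uses the linear-model risk (X̄₁ − X̄₀)² + σ²(1/n₁ + 1/n₀) as its objective; any of the Bayes risk expressions above could be substituted:

```python
import random

x = [0.0, 1.0, 2.0, 3.0]
n, sigma2 = len(x), 1.0

def risk(d):
    """Expected MSE under a linear separable model: squared covariate
    imbalance plus the sampling variance of the difference in means."""
    n1 = sum(d)
    if n1 == 0 or n1 == n:
        return float("inf")      # difference in means undefined
    n0 = n - n1
    xbar1 = sum(xi for xi, di in zip(x, d) if di) / n1
    xbar0 = sum(xi for xi, di in zip(x, d) if not di) / n0
    return (xbar1 - xbar0) ** 2 + sigma2 * (1 / n1 + 1 / n0)

def greedy(seed=0, restarts=10):
    """Greedy local search: flip the indicator that most reduces risk."""
    rng = random.Random(seed)
    best_d, best_r = None, float("inf")
    for _ in range(restarts):
        d = [rng.randint(0, 1) for _ in range(n)]
        while True:
            flips = []
            for i in range(n):
                d2 = d.copy()
                d2[i] = 1 - d2[i]
                flips.append((risk(d2), d2))
            r2, d2 = min(flips, key=lambda t: t[0])
            if r2 < risk(d):
                d = d2
            else:
                break
        if risk(d) < best_r:
            best_d, best_r = d, risk(d)
    return best_d, best_r
```

Each improvement step evaluates only n candidate flips, so the search visits far fewer than 2ⁿ assignments; for the four-unit example it recovers the brute-force optimum.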
There are a number of arguments that can be made for randomization and against the framework considered in this article. We shall discuss some of these, and to what extent they appear to apply in our setting.
That is correct. How does this relate to our argument? The arguments of Section 3 show that optimal procedures in a decision theoretic sense do not rely on randomization. Randomization inference cannot be rationalized as an optimal procedure under the conceptual framework of decision theory, however. As a consequence, the arguments of Section 3 do not apply to it.
One could take the fact that randomization inference is not justified by decision theory as an argument against randomization inference. But one could also consider a compromise approach that is based on randomization among assignments that have a low expected MSE. Such partially randomized assignments are still close to optimal and allow the construction of randomization-based tests.
Yes, it is. Selection on observables holds, as the name suggests, for any treatment assignment that is a function of observables. Put differently, conditional independence is guaranteed by any controlled trial, whether randomized or not, as stated in the following proposition.
6 Conclusion
In this article, we discuss the question of how information from baseline covariates should be used when assigning treatment in an experiment. In order to give a well-grounded answer to this question, we adopt a decision theoretic and nonparametric framework. The nonparametric perspective and the consideration of continuous covariates distinguish this article from much of the previous experimental design literature.

A number of conclusions emerge from our analysis. First, randomization is in general not optimal. Rather, we should pick a risk-minimizing treatment assignment, which is generically unique in the presence of continuous covariates. Second, we can consider nonparametric priors that yield tractable estimators and expressions for Bayesian risk (MSE). The general form of the expected MSE for such priors is Var(β | X) − C_β′ · (C + Σ)⁻¹ · C_β, where C_β and C are the appropriate covariance vector and matrix from the prior distribution, cf. Section 4. We suggest picking the treatment assignment that minimizes this prior risk. Finally, conditional independence between potential outcomes and treatments given covariates does not require random assignment. It is ensured by conducting controlled trials, and does not rely on randomized controlled trials. Matlab code to implement our proposed approach is available online at http://dx.doi.org/10.7910/DVN/I5KCWI.
Appendix A: Proofs
Proof of Proposition 2. By the general characterization of best linear predictors,

β̂ = E[β | X, D] + Cov(β, Y | X, D) · Var(Y | X, D)⁻¹ · (Y − E[Y | X, D]),

where E[β | X, D] = μ_β, Cov(β, Y | X, D) = C_β′, and Var(Y | X, D) = C + σ²I, which yields equation (23). We furthermore have, by the general properties of best linear predictors, that

E[(β̂ − β)² | X, D] = Var(β | X) − C_β′ · (C + σ²I)⁻¹ · C_β,

which is equation (24). □

Proof of Proposition 3. Let Δ = β̂ − β. Conditional on f, the variance and expectation of Δ are

Var(Δ | f) = σ² · (1/n₁ + 1/n₀),
E[Δ | f] = w′ · f,

so that, by the law of iterated expectations,

E[Δ] = E[E[Δ | f]] = E[w′ · f] = w′ · E[f] = w′ · μ.
References
Berger, J. 1985. Statistical decision theory and Bayesian analysis. New York: Springer.
Blattman, C., A. C. Hartman, and R. A. Blair. 2014. How to promote order and property rights under weak rule of law?
An experiment in changing dispute resolution behavior through community education. American Political Science
Review 108:100-120.
Findley, M. G., D. L. Nielson, and J. Sharman. 2015. Causes of noncompliance with international law: A field experiment on anonymous incorporation. American Journal of Political Science 59(1):146-61.
Kalla, J. L., and D. E. Broockman. 2015. Campaign contributions facilitate access to congressional officials: A randomized field experiment. American Journal of Political Science. doi:10.1111/ajps.12180.
Kasy, M. 2014. Using data to inform policy. Working Paper.
Kasy, M. 2016. Matlab implementation for: Why experimenters might not always want to randomize, and what they could do instead. Harvard Dataverse. http://dx.doi.org/10.7910/DVN/I5KCWI.
Keele, L. 2015. The statistics of causal inference: A view from political methodology. Political Analysis. doi:10.1093/pan/mpv007.
Moore, R. T. 2012. Multivariate continuous blocking to improve political science experiments. Political Analysis
20(4):460-79.
Morgan, K. L., and D. B. Rubin. 2012. Rerandomization to improve covariate balance in experiments. Annals of Statistics 40(2):1263-82.
Nyhan, B., and J. Reifler. 2014. The effect of fact-checking on elites: A field experiment on U.S. state legislators.
American Journal of Political Science.