
Why Experimenters Might Not Always Want to Randomize, and What They Could Do Instead
Author(s): Maximilian Kasy
Source: Political Analysis , Summer 2016, Vol. 24, No. 3 (Summer 2016), pp. 324-338
Published by: Cambridge University Press on behalf of the Society for Political
Methodology

Stable URL: https://www.jstor.org/stable/26349740


Advance Access publication June 6, 2016 Political Analysis (2016) 24:324-338
doi:10.1093/pan/mpw012

Why Experimenters Might Not Always Want to Randomize, and What They Could Do Instead

Maximilian Kasy
Department of Economics, Harvard University, 1805 Cambridge Street,
Cambridge, MA 02138, USA
e-mail: [email protected] (corresponding author)

Edited by R. Michael Alvarez

Suppose that an experimenter has collected a sample as well as baseline information about the units in the sample. How should she allocate treatments to the units in this sample? We argue that the answer does not involve randomization if we think of experimental design as a statistical decision problem. If, for instance, the experimenter is interested in estimating the average treatment effect and evaluates an estimate in terms of the squared error, then she should minimize the expected mean squared error (MSE) through choice of a treatment assignment. We provide explicit expressions for the expected MSE that lead to easily implementable procedures for experimental design.

1 Introduction

Experiments, and in particular randomized experiments, are the conceptual reference point that gives empirical content to the notion of causality. In recent years, actual randomized experiments have become increasingly popular elements of the methodological toolbox in a wide range of social science disciplines. Examples from the recent political science literature abound. Blattman, Hartman, and Blair (2014), for instance, provided training in "alternative dispute resolution practices" to residents of a random set of towns in Liberia. Kalla and Broockman (2016) study whether the possibility of scheduling a meeting with a congressional office changes depending on whether it is revealed that the person seeking a meeting is a political donor. Findley, Nielson, and Sharman (2015) study the effect of varying messages to incorporation services in different countries on the possibility of establishing (illegal) anonymous shell corporations.

Researchers conducting such field experiments in political science (as well as in fields such as development economics, medicine, and other social and biomedical sciences) are often confronted with a version of the following situation (cf. Morgan and Rubin 2012). They have selected a random sample from some population and have conducted a baseline survey for the individuals in this sample. Then a discrete treatment is assigned to the individuals in this sample, usually based on some randomization scheme. Finally, outcomes are realized, and the data are used to perform inference on the average treatment effect.

A key question for experimenters is how to use covariates from the baseline survey in the assignment of treatments. Intuition and the literature suggest to use stratified randomization conditional on covariates, also known as blocking. Moore (2012), for instance, makes a compelling argument that blocking on continuous as well as discrete covariates is better than full randomization or blocking only on a small number of discrete covariates. We analyze this question

Author's note: I thank Alberto Abadie, Ivan Canay, Gary Chamberlain, Raj Chetty, Nathaniel Hendren, Larry Katz, Gary King, Michael Kremer, and Don Rubin, as well as seminar participants at the Harvard retreat; the Harvard Labor Economics Workshop; the Harvard Quantitative Issues in Cancer Research seminar; the Harvard Applied Statistics Seminar; the UT Austin, Princeton, Columbia, and Northwestern econometrics seminars; RAND; and the 2013 CEME Conference at Stanford for helpful discussions. Replication data are available at the Harvard Dataverse at http://dx.doi.org/10.7910/DVN/I5KCWI. See Kasy (2016). Supplementary materials for this article are available on the Political Analysis Web site.

© The Author 2016. Published by Oxford University Press on behalf of the Society for Political Methodology.
All rights reserved. For permissions, please email: [email protected]


—how to use baseline covariates—as a decision problem. The experimenter's problem is to choose a treatment assignment and an estimator, given knowledge of the covariates. The objective is to minimize risk based on a loss function such as the squared error of the estimator. The decision criteria considered are Bayesian and minimax risk.
We show, first, that experimenters might not want to randomize in general. While surprising at
first, the basic intuition for this result is simple and holds for any statistical decision problem. The
conditional expected loss of an estimator is a function of the covariates and the treatment assignment. The treatment assignment that minimizes conditional expected loss is in general unique if
there are continuous covariates, so that a deterministic assignment strictly dominates all
randomized assignments.1
We next discuss how to implement optimal and near-optimal designs in practice, where near-optimal designs might involve some randomization. The key problem is to derive tractable expressions for the expected MSE of estimators for the average treatment effect, given a treatment assignment. Once we have such expressions, we can numerically search for the best assignment, or for
specify a prior distribution for the conditional expectation of potential outcomes given covariates.
We provide simple formulas for the expected MSE for a general class of nonparametric priors.
Our recommendation not to randomize raises the question of identification, cf. the review in
Keele (2015). We show that conditional independence of treatment and potential outcomes given
covariates still holds for the deterministic assignments considered, under the usual assumption of
independent sampling. Conditional independence only requires a controlled trial (CT), not a
randomized controlled trial (RCT).
To gain some intuition for our non-randomization result, note that in the absence of covariates
the purpose of randomization is to pick treatment and control groups that are similar before they
are exposed to different treatments. Formally, we would like to pick groups that have the same
(sample) distribution of potential outcomes. Even with covariates observed prior to treatment
assignment, it is not possible to make these groups identical in terms of potential outcomes. We
can, however, make them as similar as possible in terms of covariates. Allowing for randomness in
the treatment assignment to generate imbalanced distributions of covariates can only hurt the
balance of the distribution of potential outcomes. Whatever the conditional distribution of unobservables given observables is, having differences in observables implies greater differences in the distribution of unobservables relative to an assignment with no differences in observables. The analogy to estimation might also be useful to understand our non-randomization result. Adding random (mean 0) noise to an estimator does not introduce any bias. But it is never going to reduce the MSE of the estimator.
The purpose of discussing tractable nonparametric priors—and one of the main contributions of
this article—is to operationalize the notion of "balance." In general, it will not be possible to obtain
exactly identical distributions of covariates in the treatment and control groups. When picking an
assignment, we have to trade off balance across various dimensions of the joint distribution of
covariates. Picking a prior distribution for the conditional expectation of potential outcomes, as
well as a loss function, allows one to calculate an objective function (Bayesian risk) that performs
this trade-off in a coherent and principled way.

2 A Motivating Example and Some Intuitions

2.1 Setup

Before we present our general results and our proposed procedure, let us discuss a simple
motivating example. The example is stylized to allow calculations "by hand," but the intuitions

1 If experimenters have a preference for randomization for reasons outside the decision problem considered in the present article, a reasonable variant of the procedure suggested here would be to randomize among a set of assignments that are "near-minimizers" of risk. If we are worried about manipulation of covariates, in particular, a final coin flip that possibly switches treatment and control groups might be helpful. I thank Michael Kremer for this suggestion.


Table 1. Comparing bias, variance, and MSE across treatment assignments and designs

Assignment       Designs           Model 1                Model 2
d1 d2 d3 d4     1 2 3 4 5    bias   var   MSE      bias   var   MSE     EMSE
 0  1  1  0     1 1 1 1 1     0.0   1.0   1.0       2.0   1.0   5.0     1.77
 1  0  0  1     1 1 1 1 1     0.0   1.0   1.0      -2.0   1.0   5.0     1.77
 1  0  1  0     1 1 1 1 0    -1.0   1.0   2.0       3.0   1.0  10.0     2.05
 0  1  0  1     1 1 1 1 0     1.0   1.0   2.0      -3.0   1.0  10.0     2.05
 1  1  0  0     1 1 1 0 0    -2.0   1.0   5.0       6.0   1.0  37.0     6.51
 0  0  1  1     1 1 1 0 0     2.0   1.0   5.0      -6.0   1.0  37.0     6.51
 0  1  0  0     1 1 0 0 0    -0.7   1.3   1.8       3.3   1.3  12.4     2.49
 0  0  1  0     1 1 0 0 0     0.7   1.3   1.8      -0.7   1.3   1.8     2.49
 1  1  0  1     1 1 0 0 0    -0.7   1.3   1.8       0.7   1.3   1.8     2.49
 1  0  1  1     1 1 0 0 0     0.7   1.3   1.8      -3.3   1.3  12.4     2.49
 1  0  0  0     1 1 0 0 0    -2.0   1.3   5.3       4.7   1.3  23.1     6.77
 1  1  1  0     1 1 0 0 0    -2.0   1.3   5.3       7.3   1.3  55.1     6.77
 0  0  0  1     1 1 0 0 0     2.0   1.3   5.3      -7.3   1.3  55.1     6.77
 0  1  1  1     1 1 0 0 0     2.0   1.3   5.3      -4.7   1.3  23.1     6.77
 0  0  0  0     1 0 0 0 0      -     -     ∞         -     -     ∞       -
 1  1  1  1     1 0 0 0 0      -     -     ∞         -     -     ∞       -

MSE model 1 (designs 1-5):   ∞    3.2   2.7   1.5   1.0
MSE model 2 (designs 1-5):   ∞   20.6  17.3   7.5   5.0

Notes: Each row of this table corresponds to one possible treatment assignment (d_1, ..., d_4). The columns for "model 1" correspond to the model Y_i^d = x_i + d + ε_i, and the columns for "model 2" to the model Y_i^d = -x_i² + d + ε_i, where the ε_i have mean 0 and variance 1. Each row shows the bias, variance, and MSE of β̂ for the given assignment and model. The designs 1-5 correspond to uniform random draws from the assignments whose rows are marked by an entry of 1. Design 1 randomizes over all rows, design 2 over rows one through fourteen, etc. The last column (EMSE) shows the Bayesian expected MSE of each assignment for the squared exponential prior discussed below. For details, see Section 2.

from this example generalize. Suppose an experimenter has a sample of four experimental units i = 1, ..., 4, and she observes a continuous covariate X_i for each of them, where it so happens that (X_1, ..., X_4) = (x_1, ..., x_4) = (0, 1, 2, 3). She assigns every unit to one of two binary treatments, d_i ∈ {0, 1}.2

Our experimenter wants to estimate the (conditional) average treatment effect of treatment D across these four units,

β = (1/4) Σ_i E[Y_i^1 - Y_i^0 | X_i].    (1)

The experimenter plans to estimate this treatment effect by calculating the difference in means across treatment and comparison groups in her experiment, that is,

β̂ = (1/n_1) Σ_i D_i Y_i - (1/n_0) Σ_i (1 - D_i) Y_i.    (2)

Since there are four experimental units, there are 2^4 = 16 possible treatment assignments. The sixteen rows of Table 1 correspond to these assignments.3 In the first row, (d_1, ..., d_4) = (0, 1, 1, 0); in the second row, (d_1, ..., d_4) = (1, 0, 0, 1); etc.

2 Consider for instance the setting of Nyhan and Reifler (2014), where legislators i received letters (D_i = 1) or did not (D_i = 0). The outcome of interest Y_i in this case is the future fact-checking rating of a legislator, and an important covariate X_i might be their past rating.
3 The code producing Table 1 is available online at Kasy (2016). At this address, we also provide code implementing our proposed approach in practice.


Assume now that potential outcomes are determined by the following model ("model 1"):

Y_i^d = x_i + d + ε_i,    (3)

where the ε_i are independent given X and have mean 0 and variance 1 (compare the notes to Table 1). The average treatment effect in this model is equal to β = 1. For every treatment assignment (d_1, ..., d_n), we can calculate the corresponding bias, variance, and MSE of β̂, where

Var(β̂) = 1/n_1 + 1/n_0,    (4)

with n_1 being the number of units i receiving treatment d_i = 1, and n_0 being the number of units receiving d_i = 0. The bias is given by

Bias = E[β̂] - β = (1/n_1) Σ_i d_i x_i - (1/n_0) Σ_i (1 - d_i) x_i    (5)

for our model.

For the assignment in the first row of Table 1, we get Var(β̂) = 1/2 + 1/2 = 1; for the assignment in row 7, Var(β̂) = 1 + 1/3 ≈ 1.33; and similarly for the other rows. The bias for the first row is given by Bias = E[β̂] - β = (1/2)·(x_2 + x_3 - x_1 - x_4) = 0, for the third row by Bias = (1/2)·(x_1 + x_3 - x_2 - x_4) = -1, etc. The MSE for each row is given by

MSE(d_1, ..., d_n) = E[(β̂ - β)²] = Bias² + Var.    (6)
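The calculations behind Table 1 are mechanical enough to script. The following is a small sketch in Python (not the article's replication code, which is available at the Dataverse address in the author's note) that recomputes the bias, variance, and MSE columns for both models:

```python
# Reproduce the bias, variance, and MSE columns of Table 1 for the
# difference-in-means estimator under models 1 and 2.
from itertools import product

x = [0.0, 1.0, 2.0, 3.0]  # observed covariates x_1, ..., x_4
n = len(x)

def bias_var_mse(d, f):
    """Bias, variance, and MSE of the difference-in-means estimator for
    assignment d, under Y_i^t = f(x_i) + t + eps_i with Var(eps_i) = 1."""
    n1, n0 = sum(d), n - sum(d)
    if n1 == 0 or n0 == 0:
        return None, None, float("inf")  # no comparison group: infinite MSE
    # Bias = E[beta_hat] - beta, where beta = 1 is the constant treatment effect
    bias = (sum(f(x[i]) for i in range(n) if d[i]) / n1
            - sum(f(x[i]) for i in range(n) if not d[i]) / n0)
    var = 1.0 / n1 + 1.0 / n0  # sigma^2 = 1
    return bias, var, bias ** 2 + var

model1 = lambda xi: xi         # f(x) = x
model2 = lambda xi: -xi ** 2   # f(x) = -x^2

for d in product([0, 1], repeat=4):
    b1, v1, m1 = bias_var_mse(d, model1)
    b2, v2, m2 = bias_var_mse(d, model2)
    print(d, (b1, v1, m1), (b2, v2, m2))
```

The first row of the table, for example, corresponds to d = (0, 1, 1, 0), which is unbiased under both models.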

2.1.1 Forgetting the covariates

Suppose now for a moment that the information about covariates got lost—somebody deleted the column of X_i in your spreadsheet. Then every i looks the same before treatment is assigned. The variance of potential outcomes Y_i^d now includes both the part due to ε_i and the part due to X_i, and it is equal to Var(Y_i^d) = Var(X_i) + Var(ε_i) = Var(X_i) + 1. Since units i are indistinguishable in this case, treatment assignments are effectively only distinguished by the number of units treated. When we observe no covariates and have random sampling, there is no bias (even ex post),4 and the MSE of any assignment with n_1 treated units is equal to

MSE(d_1, ..., d_n) = Var(β̂) = (1/n_1 + 1/n_0)·(Var(X_i) + 1).    (7)

Any procedure that randomly assigns n_1 = n/2 = 2 units to treatment 1 is optimal in this case. This is, of course, the standard recommendation. A similar argument applies when we observe a discrete covariate, with several observations i for each value of the covariate. In this latter case, randomization conditional on X_i is optimal; this is what is known as a blocked design.

2.2 Randomized designs

Let us now, and for the rest of our discussion, assume again that we observe the covariates X_i. The calculations we have done so far for this case are for a fixed (deterministic) assignment of treatments. What if we use randomization? Any randomization procedure in our setting can be described by the probabilities p(d_1, d_2, d_3, d_4) it assigns to the different rows of Table 1. The MSE of such a procedure is given by the corresponding weighted average of MSEs for each row:

MSE = Σ_{d_1, ..., d_n} p(d_1, ..., d_n) · MSE(d_1, ..., d_n).    (8)

4 This is the ex post bias, given covariates and treatment assignment. This is the relevant notion of bias for us. It is different from the ex ante bias, which is the expected bias when we do not know the treatment assignment yet.


We can now compare various randomization procedures in terms of their average (expected) MSE. Let us first consider the completely randomized design, which assigns each unit to treatment or control by an independent coin flip. Such a procedure puts positive probability on all possible treatment assignments. Since there is a chance of assigning all units to treatment, or all units to control—the last two assignments in Table 1—the variance of an estimated treatment effect under this procedure is infinite. This case corresponds to the first design in Table 1.

Any reasonable experimenter would restrict randomization to eliminate the last two rows of Table 1. There remains, however, a sizable chance of drawing an assignment with a large MSE, such as for the assignments in rows 11 through 14. The resulting randomization procedure, labeled "design 2," is equivalent to a uniform random draw from rows 1 through 14.

Now, again, most experimenters would not accept such assignments, and would instead reduce randomness further by only allowing for assignments which have treatment and control groups of equal size. This eliminates rows 7 through 14 of the assignments considered. The advantage of doing so is that it eliminates the variability of the estimator due to unequal group sizes. Correspondingly, the average MSE falls to 2.7 for model 1 (design 3).

So far we have ignored the information contained in the covariates X_i. We can use such information by blocking, that is, by grouping observations with similar covariate values and randomizing within these groups. A special case is pairwise randomization, with blocks of size 2. In our case, we could group observations 1 and 2, as well as observations 3 and 4, and randomly assign one unit within each group to treatment and one to control. This gives the fourth design and leaves us with random assignment over rows 1 through 4, resulting in a further reduction of average MSE to 1.5.

This procedure, which already has more than halved the expected MSE relative to randomization over all possible assignments, can still be improved upon. The assignment 0, 1, 0, 1 has systematically higher values of X among the treated relative to the control, resulting in a bias of 1 in our setting; the assignment 1, 0, 1, 0 has systematically lower values of X among the treated, resulting in a bias of -1. Eliminating these two leaves us with the last design, which only allows for the first two treatment assignments. Since these two have the same MSE, we might as well randomize between them. This is what the present article ultimately recommends. This yields an average MSE of 1 in the present setting.
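This sequence of design improvements can be checked directly from equation (8): a randomized design's MSE is the average of the deterministic MSEs in its support. A short sketch (hypothetical code, with the rows ordered as in Table 1):

```python
# Average MSE of designs 1-5, each a uniform draw over a set of rows of Table 1.
x = [0.0, 1.0, 2.0, 3.0]

def mse(d, f):
    """MSE of the difference-in-means estimator for assignment d,
    under Y_i^t = f(x_i) + t + eps_i with Var(eps_i) = 1."""
    n1, n0 = sum(d), 4 - sum(d)
    if n1 == 0 or n0 == 0:
        return float("inf")
    bias = (sum(f(xi) for xi, di in zip(x, d) if di) / n1
            - sum(f(xi) for xi, di in zip(x, d) if not di) / n0)
    return bias ** 2 + 1.0 / n1 + 1.0 / n0

rows = [  # the sixteen assignments, in the order of Table 1
    (0, 1, 1, 0), (1, 0, 0, 1), (1, 0, 1, 0), (0, 1, 0, 1),
    (1, 1, 0, 0), (0, 0, 1, 1), (0, 1, 0, 0), (0, 0, 1, 0),
    (1, 1, 0, 1), (1, 0, 1, 1), (1, 0, 0, 0), (1, 1, 1, 0),
    (0, 0, 0, 1), (0, 1, 1, 1), (0, 0, 0, 0), (1, 1, 1, 1),
]
designs = {1: rows, 2: rows[:14], 3: rows[:6], 4: rows[:4], 5: rows[:2]}

for k, support in designs.items():  # uniform randomization over the support
    avg = sum(mse(d, lambda xi: xi) for d in support) / len(support)
    print(f"design {k}: average MSE under model 1 = {avg:.2f}")
```

The computed averages for model 1 are infinity, 3.17, 2.67, 1.5, and 1.0, which round to the values ∞, 3.2, 2.7, 1.5, and 1.0 reported in Table 1.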

2.3 Other data-generating processes

Let us now take a step back and consider what we just did. We first noted that the MSE of any randomized procedure is given by the weighted average of the MSEs of the deterministic assignments it is averaging over. By eliminating assignments with larger MSE from the support of the randomization procedure, we could systematically reduce the MSE. In the limit, we would only randomize between the two assignments with the lowest MSE.

The careful reader will have noted that all our calculations of the MSE depend on the assumed model determining potential outcomes. So let us consider a different model ("model 2") where the covariate affects outcomes in a different way,

Y_i^d = -x_i² + d + ε_i.    (9)

Relative to model 1, we have flipped the sign of the effect of x_i and made the relationship nonlinear. Table 1 shows the corresponding bias, variance, and MSE for this model. It turns out that the ranking over the alternative randomized designs remains the same (see the row "MSE model 2"), even though the values of the MSE change. This suggests a comforting robustness of the optimal procedure.

The careful reader will now again note that we have only considered two particular models, but one might be able to construct models for which the ranking of the designs is partially reversed. That is indeed true, and in general we do not know features of the underlying data-generating process in advance. What we need, then, is a way to evaluate


performance across alternative data-generating processes. This can be done explicitly using a (nonparametric) Bayesian prior, which allows us to calculate the expected MSE, where the expectation averages over data-generating processes. That is the approach we will propose in Section 4, and it yields an expected MSE for each assignment (d_1, ..., d_n). The last column of Table 1 shows this expected MSE for each deterministic treatment assignment; randomized procedures are again evaluated in terms of weighted averages of the deterministic MSEs.5 We could then either pick the assignment minimizing the expected MSE, or randomize over a set of assignments with low MSE. Before developing this approach, let us take a step back and briefly review general statistical decision theory. As we shall see, the arguments against randomization, or for randomization over near-optimal assignments only, which follow the intuition of our little example, hold under very general conditions for any statistical decision problem.

3 Decision Theory Tells Us We Should Not Randomize

3.1 Brief review of decision theory

Our argument requires that we think of experimental design as a decision problem. Statistical decision theory provides one of the conceptual foundations of statistics. Statistical decision theory was pioneered by Abraham Wald; a classic treatment can be found in Berger (1985). We shall provide a brief review of its basic concepts.

The basic idea of decision theory is to think of statistics as the problem of choosing some action a (picking an estimate, deciding whether to reject a hypothesis, an experimental treatment allocation d, ...). The choice is made based on the observed data X, so that a = δ(X). The action will ultimately be evaluated by a loss function L(a, θ) that also depends on the unknown state of the world θ. An estimate might, for instance, be evaluated by how much it deviates from the true parameter value. Loss can be thought of as the negative of a utility function, as familiar from economic models.

Since we do not know what the true state of the world is, we cannot evaluate a procedure based on its actual loss. We can, however, consider various notions of average loss. If we average loss over the randomness of a sampling scheme, and possibly over the randomness of a treatment allocation scheme, holding the state of the world θ fixed, we obtain the frequentist risk function:

R(δ, θ) = E[L(δ(X), θ) | θ].    (10)

We can calculate this frequentist risk function. If loss is the squared deviation of an estimate from the true parameter β, then the frequentist risk function is equal to the MSE we calculated in the example of Section 2 for each row of Table 1. The models 1 and 2 of that example correspond to two values of θ.
The risk function by itself does not yet provide an answer to the question of which decision procedure to pick, unfortunately. The reason is that one procedure δ might be better in terms of R(δ, θ) for one state of the world θ, while another procedure δ' is better for another state θ' of the world. Returning again to Table 1, we have for instance that the assignment in row 10 is better than row 3 for model 1, but worse for model 2.

We thus have to face a trade-off across states θ. There are different ways to do this. One way is to focus on the worst-case scenario, that is, on the state of the world with the worst value of the risk

5 The prior used to calculate this EMSE assumes that

C((x_1, d_1), (x_2, d_2)) = 10·exp(-(||x_1 - x_2||² + (d_1 - d_2)²)/10);

for details, see Section 4.


function. This yields the minimax decision criterion. Alternatively, we can assign weights (prior probabilities) to different states of the world and average risk across them; this yields Bayesian procedures.

3.2 Randomization in general decision problems

Let us now return to the question of randomization. Might it ever be strictly optimal to randomize? Suppose we allow our decision to additionally depend on a randomization device U, so that a = δ(U, X). U does not tell us anything about the state of the world θ, and it does not enter the loss function L. As a consequence, the risk function of such a randomized procedure is simply a weighted average of the risk functions of the deterministic procedures that our random procedure might pick. An example is the fully randomized design, which puts positive probability on all rows of Table 1, including the last two rows.
Formally, let δ^u(x) = δ(u, x) be the decision made when U = u and X = x, and let R(δ^u, θ) be the risk function of the procedure that picks δ^u(x) whenever X = x. Then we have

R(δ, θ) = Σ_u R(δ^u, θ) · P(U = u),    (11)

where we have assumed for simplicity of notation that U is discrete.


To get to a global assessment of δ, we need to trade off risk across values of θ. If we do so using the prior distribution π, we get the Bayes risk

R^π(δ) = ∫ R(δ, θ) dπ(θ)
       = Σ_u ∫ R(δ^u, θ) dπ(θ) · P(U = u)    (12)
       = Σ_u R^π(δ^u) · P(U = u).

In terms of Table 1, Bayes risk averages the MSE for a given assignment δ^u and data-generating process θ, R(δ^u, θ), both across the rows, using the randomization device U, and across different models for the data-generating process, using the prior distribution π. If, alternatively, we evaluate each action based on the worst-case scenario for that action, we obtain minimax risk

R^mm(δ) = Σ_u (sup_θ R(δ^u, θ)) · P(U = u).

In terms of Table 1, minimax risk evaluates each row (realized assignment δ^u) in terms of the worst-case model for this assignment, sup_θ R(δ^u, θ), and evaluates randomized designs in terms of the weighted average across the worst-case risk for each assignment.
Letting R*(δ) denote either Bayes or minimax risk, we have

R*(δ) = Σ_u R*(δ^u) · P(U = u).    (13)

The risk of any randomized procedure is given by the weighted average of risk across the deterministic procedures it averages over. Now consider an alternative decision procedure such that

δ* ∈ argmin_δ R*(δ),    (14)

where the argmin is taken over non-randomized decision functions a = δ(X). It follows that

R*(δ*) ≤ min_u R*(δ^u) ≤ R*(δ).    (15)

The second inequality is strict, unless δ only puts positive probability on the minimizers of

R*(δ^u). If the optimal deterministic δ* is unique, R*(δ*) < R*(δ) unless δ = δ* with probability 1, that is, unless δ does not involve randomization. We have thus shown:

Proposition 1 (Optimality of non-randomized decisions). Consider a general decision problem as discussed in Section 3.1. Let R* be either Bayes risk or minimax risk. Then:

1. The optimal risk R*(δ*) when considering only deterministic procedures is no larger than the optimal risk when allowing for randomized procedures.

2. If the optimal deterministic procedure δ* is unique, then it performs strictly better than any non-trivial randomized procedure.

3.3 Experimental design

We can think of an experiment as a decision problem that proceeds in several stages:

1. Sample observations i from the population of interest, and observe their baseline covariates X_i.

2. Using the baseline covariates X and possibly a randomization device U, assign the treatments D_i.

3. After running the experiment, collect data on outcomes Y_i, and use them to calculate estimators β̂ of the objects β that we are interested in, such as the (conditional) average treatment effect.

4. Finally, loss is realized, L = L(β̂, β). A common choice is the squared error, (β̂ - β)².6

Since this article is about experimental design, we are interested in stage 2. Taking everything else (sampling and the matrix of baseline covariates X, the object of interest β, the loss function L) as given, we can ask what is the optimal way to assign treatments, using the information in the baseline covariates X, and possibly a randomization device U.

In order to answer this question, we first calculate the risk function for a deterministic treatment allocation δ = (d_1, ..., d_n), which only depends on the covariates X. If loss is the squared estimation error (β̂ - β)², this gives the MSE

R(δ, θ) = MSE(d_1, ..., d_n) = E[(β̂ - β)² | X, θ].    (16)


This is the calculation we did in Section 2 to get the MSE columns for models 1 and 2. The two models can be thought of as two values of θ at which we evaluate the risk function.

As in the general decision problem, we get that the risk function of a randomized procedure is given by the weighted average of the risk functions for deterministic procedures, R(δ, θ) = Σ_u R(δ^u, θ) · P(U = u). Consequently, the same holds for Bayes risk and minimax risk. This implies that optimal deterministic procedures perform at least as well as any randomized procedures.

But do they perform better? Do we lose anything by randomizing? As stated in Proposition 1, the answer to this question hinges on whether the optimal deterministic assignment is unique. Any assignment procedure, random or not, is optimal if and only if it assigns positive probability only to assignments (d_1, ..., d_n) that minimize risk among deterministic assignments.
If we were to observe no covariates, then permutations of treatment assignments would not change risk, since everything "looks the same" after the permutation. In that case, randomization is perfectly fine, as long as the sizes of the treatment and control groups are fixed. Suppose next that we
observe a covariate with small support, so that we get several observations for each value of the
covariate. In this case, we can switch treatment around within each block defined by a covariate

6 Squared error loss is the canonical loss function in the literature on estimation. It has a lot of convenient properties allowing for tractable analytic results, and in keeping with the literature we will focus on squared error loss. Other objective functions are conceivable, such as for instance expected welfare of treatment assignments based on experimental evidence; see for instance Kasy (2014).


value without changing risk—again, everything looks the same after the switch. In this case, stratified (blocked) randomization is optimal. Suppose now, however, that we observe continuous covariates, or covariates with large support. In that case, in general, no two units have the same covariate values, so that we cannot just switch treatment around without changing risk. The optimal deterministic assignment will in general be unique (possibly up to a "flip" of the treatment and control groups). This implies that randomizing comes with a cost in terms of risk relative to the optimal deterministic assignment.

4 Our Proposed Procedure

There is one key question we need to answer before we can operationalize this approach: how to aggregate the risk function R(δ, θ) into a global objective. We will consider a Bayesian approach, using a nonparametric prior. The key object that we need a prior for is the conditional expectation of potential outcomes given covariates, since it is this function that determines the expected MSE of estimators for the average treatment effect. Denote this function by

f(x, d) = E[Y_i^d | X_i = x, θ].    (17)

In model 1 of Section 2, we had f(x, d) = x + d; in model 2, we had f(x, d) = -x² + d. We will assume that the prior for f is such that expectations and variances exist. We denote

E[f(x, d)] = μ(x, d), and

Cov(f(x_1, d_1), f(x_2, d_2)) = C((x_1, d_1), (x_2, d_2))    (18)

for a covariance function C. Assume further that we have a prior over the conditional variance of Y, which satisfies

E[Var(Y_i | X_i, D_i, θ) | X_i, D_i] = σ²(X_i, D_i) = σ².    (19)


Assuming a prior centered on homoskedasticity is not without loss of generality, but it is a natural baseline.
How should the prior moments μ and C be chosen? We provide an extensive discussion in the Online Appendix; a popular choice in the machine-learning literature is so-called squared exponential priors, where μ = 0 and

C((x_1, d_1), (x_2, d_2)) = exp(-(||x_1 - x_2||² + (d_1 - d_2)²)/l).    (20)

The length scale parameter l determines the assumed smoothness of f for such priors. This is the prior we used to calculate the expected MSE in Table 1, choosing a length scale of l = 10. Another popular set of priors builds on linear models. For such priors, we show below that expected MSE corresponds to balance as measured by differences in covariate means.
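To make the role of the length scale concrete, here is a minimal sketch of the squared exponential covariance of equation (20); the specific evaluation points are illustrative, not from the article:

```python
# Prior correlation of f at pairs of (x, d) points under the squared
# exponential covariance function of equation (20). A large length scale
# implies a smooth f: nearby covariate values remain highly correlated.
import math

def sqexp_cov(x1, d1, x2, d2, length_scale):
    return math.exp(-((x1 - x2) ** 2 + (d1 - d2) ** 2) / length_scale)

for l in (1.0, 10.0):
    corrs = [sqexp_cov(0.0, 1, dx, 1, l) for dx in (0.0, 1.0, 2.0, 3.0)]
    print(f"l = {l}: prior correlations of f(0, 1) with f(x, 1):",
          [round(c, 3) for c in corrs])
```

With l = 1 the prior correlation decays to essentially zero within the covariate range of the example; with l = 10 the prior treats f as nearly linear over that range.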
Recall that we are ultimately interested in estimating the conditional average treatment effect,

β = (1/n) Σ_i [f(X_i, 1) - f(X_i, 0)].    (21)

The conditional average treatment effect is the object of interest if we want to learn about treatment effects for units, in the population from which our sample was drawn, which look similar in terms of covariates. We might be interested in this effect both for scientific reasons and for policy reasons (deciding about future treatment allocations). One more question has to be settled before we can give an expression for the Bayes risk of any given treatment assignment: How is β going to be estimated? We will consider two possibilities. The first possibility is that we use an estimator that is optimal in the Bayesian sense, namely the posterior best linear predictor of β. The second possibility is that we estimate β using a simple difference in means, as in the example of Section 2.


4.1 Bayes optimal estimation

It is useful to introduce the following notation for the prior moments of the observed outcomes and of β:

μ_i = μ(X_i, d_i)
C_ij = C((X_i, d_i), (X_j, d_j))
μ_β = (1/n) Σ_i [μ(X_i, 1) - μ(X_i, 0)]    (22)
C_i = (1/n) Σ_j [C((X_i, d_i), (X_j, 1)) - C((X_i, d_i), (X_j, 0))].

Let μ and C be the corresponding vector and matrix with entries μ_i and C_ij. Note that both μ and C depend on the treatment assignment (d_1, ..., d_n). Using this notation, the prior mean and variance of Y are equal to μ and C + σ²I, and the prior mean of β equals μ_β.

Let us now consider the posterior best linear predictor, which is the best estimator (in the Bayesian risk sense) among all estimators linear in Y.7 The posterior best linear predictor is equal to the posterior expectation if both the prior for f and the distribution of the residuals Y - f are multivariate normal.

Proposition 2 (Posterior best linear predictor and expected loss). The posterior best linear predictor for the conditional average treatment effect is given by

β̂ = μ_β + C̄' · (C + σ²I)⁻¹ · (Y − μ), (23)

and the corresponding MSE (Bayes risk) equals

MSE(d_1, …, d_n) = Var(β|X) − C̄' · (C + σ²I)⁻¹ · C̄, (24)

where Var(β|X) is the prior variance of β.

The proof of this proposition follows from standard characterizations of best linear predictors and can be found in Appendix A. The expression for the MSE provided by equation (24) is easily evaluated for any choice of (d_1, …, d_n). Since our goal is to minimize the MSE, we can in fact ignore the Var(β|X) term, which does not depend on (d_1, …, d_n).
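The d-dependent part of equation (24) can be sketched in a few lines of Python; the function and argument names below are ours, and the prior kernel `cov(x1, t1, x2, t2)` is supplied by the user:

```python
import numpy as np

def bayes_mse_part(X, d, cov, sigma2):
    """d-dependent part of equation (24): -Cbar' (C + sigma^2 I)^{-1} Cbar.
    The Var(beta|X) term is omitted since it does not depend on d."""
    n = len(d)
    # C_ij = C((X_i, d_i), (X_j, d_j)), as in equation (22)
    C = np.array([[cov(X[i], d[i], X[j], d[j]) for j in range(n)]
                  for i in range(n)])
    # Cbar_i = (1/n) sum_j [C((X_i,d_i),(X_j,1)) - C((X_i,d_i),(X_j,0))]
    Cbar = np.array([np.mean([cov(X[i], d[i], X[j], 1)
                              - cov(X[i], d[i], X[j], 0)
                              for j in range(n)]) for i in range(n)])
    return -Cbar @ np.linalg.solve(C + sigma2 * np.eye(n), Cbar)
```

Comparing this quantity across candidate assignments ranks them by Bayes risk; since the omitted Var(β|X) term is a constant, the ranking is unaffected.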

4.2 Difference in means estimation

Let us next consider the alternative case where the experimenter uses the simple difference in means estimator. We will need the following notation:

μ_i^d = μ(X_i, d)

C_ij^{d1,d2} = C((X_i, d1), (X_j, d2)), (25)

for d, d1, d2 ∈ {0, 1}. We collect these terms in the vectors μ^d and matrices C^{d1,d2}, which are in turn collected as

μ = (μ⁰, μ¹)

C = ( C⁰⁰  C⁰¹
      C¹⁰  C¹¹ ). (26)

7This class of estimators includes all standard estimators of β under unconfoundedness, such as those based on matching, inverse probability weighting, regression with controls, kernel regression, series regression, splines, etc. Linearity of the estimator is unrelated to any assumption of linearity in X; we are considering the posterior BLP of β in Y rather than the BLP of Y_i in X_i.

334 Maximilian Kasy

In contrast to the objects defined in Section 4.1, μ and C do not depend on (d_1, …, d_n); the earlier vector and matrix are a subvector and submatrix of the ones defined here, selected according to the realized assignment. Stacking the values f(X_i, d) analogously to μ, we get E[f] = μ and Var(f) = C.

Proposition 3 (Expected MSE for designs, where β is estimated using the difference in means). Assume β is estimated using the difference in means estimator

β̂ = (1/n_1) Σ_i d_i Y_i − (1/n_0) Σ_i (1 − d_i) Y_i, (27)

where n_1 = Σ_i d_i and n_0 = n − n_1. Then the expected MSE of the treatment assignment (d_1, …, d_n) equals

MSE(d_1, …, d_n) = σ² (1/n_1 + 1/n_0) + (w' · μ)² + w' · C · w, (28)

where w = (−w⁰, w¹), with

w_i¹ = d_i/n_1 − 1/n,
(29)
w_i⁰ = (1 − d_i)/n_0 − 1/n.

The expression for the MSE in this proposition has three terms. The first term is the variance of
the estimator. The second and third terms together are the expected squared bias. This splits in turn
into the square of the prior expected bias, and the prior variance of the bias.
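Equation (28) can be evaluated directly for any candidate assignment. The following sketch (function names are ours) takes user-supplied prior mean and covariance functions `mu(x, t)` and `cov(x1, t1, x2, t2)`, and stacks f as (f⁰, f¹), matching equation (26):

```python
import numpy as np

def dim_expected_mse(X, d, mu, cov, sigma2):
    """Expected MSE of the difference in means estimator, equation (28):
    sigma^2 (1/n1 + 1/n0) + (w' mu)^2 + w' C w, weights as in eq. (29)."""
    d = np.asarray(d)
    n = len(d)
    n1 = int(d.sum())
    n0 = n - n1
    w1 = d / n1 - 1.0 / n            # eq. (29): weights on f(X_i, 1)
    w0 = (1 - d) / n0 - 1.0 / n      # eq. (29): weights on f(X_i, 0)
    w = np.concatenate([-w0, w1])    # w = (-w^0, w^1)
    mu_vec = np.concatenate([[mu(x, 0) for x in X], [mu(x, 1) for x in X]])
    def block(t1, t2):
        return np.array([[cov(x1, t1, x2, t2) for x2 in X] for x1 in X])
    C = np.block([[block(0, 0), block(0, 1)], [block(1, 0), block(1, 1)]])
    return sigma2 * (1.0 / n1 + 1.0 / n0) + (w @ mu_vec) ** 2 + w @ C @ w
```

As a check, plugging in the linear-model kernel cov(x1, t1, x2, t2) = x1'·Σ·x2 with μ = 0 reproduces the balance formula of equation (33) below.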
It is interesting to note that we recover the standard notion of balance if, in addition to the assumptions of Proposition 3, we impose a linear, separable model for f, that is,

f(x, d) = x' · γ + d · β, (30)

where γ has prior mean 0 and variance Σ. For this model, the bias of the difference in means estimator equals

E[Δ|f] = (X̄¹ − X̄⁰)' · γ, (31)

where X̄^d is the sample mean of covariates in experimental arm d, so that the expected squared bias is equal to

(X̄¹ − X̄⁰)' · Σ · (X̄¹ − X̄⁰), (32)

and the MSE equals

MSE(d_1, …, d_n) = σ² (1/n_1 + 1/n_0) + (X̄¹ − X̄⁰)' · Σ · (X̄¹ − X̄⁰). (33)

Risk is thus minimized by choosing treatment and control arms of equal size, and optimizing balance as measured by the difference in covariate means (X̄¹ − X̄⁰).
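Equation (33) is short enough to state in code form; the sketch below (names are ours) makes explicit that under the linear separable prior the design matters only through arm sizes and the difference in covariate means:

```python
import numpy as np

def linear_model_mse(X, d, Sigma, sigma2):
    """Risk under the linear separable model of equation (30), eq. (33):
    sigma^2 (1/n1 + 1/n0) + (Xbar1 - Xbar0)' Sigma (Xbar1 - Xbar0)."""
    X = np.asarray(X, dtype=float)
    d = np.asarray(d)
    n1 = int(d.sum())
    n0 = len(d) - n1
    # difference in covariate means between treatment and control arms
    diff = X[d == 1].mean(axis=0) - X[d == 0].mean(axis=0)
    return sigma2 * (1.0 / n1 + 1.0 / n0) + diff @ Sigma @ diff
```

Minimizing this criterion over assignments reproduces the familiar advice: equal arm sizes and covariate-mean balance.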

4.3 Discrete optimization

We now have almost all the ingredients for a procedure that can be used by practitioners. One important element is missing: How do we find the optimal assignment d? Or how, at least, do we find a set of assignments that are close to optimal in terms of expected risk? This question matters, since solving the problem by brute force is generally not feasible. We could do so for our example in Section 2, since in this example there were only 2⁴ = 16 possible treatment assignments. In general, there are 2ⁿ assignments for a sample of size n, a number that very quickly becomes prohibitive to search by brute force.


So what to do? One option, which is close to a common practice in field experiments, is to rerandomize a feasible number of times, and to pick the best among the assignments so picked. A quick calculation shows that this procedure performs well: if we rerandomize k times and pick d as the best of the k assignments drawn, the probability that d is better than 99% of all assignments, say, is equal to 1 − 0.99^k. For k ≥ 459 this probability is itself larger than 99%, and significantly larger k are feasible.

Implementation of our design procedure using rerandomization can then be summarized as follows: First, pick a prior. For our example in Section 2, we used μ = 0 and C((x_1, d_1), (x_2, d_2)) = 10 · exp(−(‖x_1 − x_2‖² + (d_1 − d_2)²)/10). Calculate the corresponding prior moment vectors and matrices, and then iterate the following:
iterate the following:

1. Draw a random treatment assignment.

2. Calculate the objective function of equation (28).

3. Compare it to the best MSE obtained thus far. If the new MSE is better than the best obtained so far, store the new treatment assignment and MSE.

4. Iterate for some prespecified number of times k.
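The four steps above can be sketched as follows (a minimal illustration; the function names are ours, and `objective` stands for any expected-MSE criterion such as equation (28)):

```python
import numpy as np

def rerandomization_search(objective, n, k, seed=None):
    """Draw k random assignments and keep the one with the smallest
    expected MSE, as in steps 1-4 of the text."""
    rng = np.random.default_rng(seed)
    best_d, best_mse = None, np.inf
    for _ in range(k):
        d = rng.integers(0, 2, size=n)       # step 1: random assignment
        if d.sum() in (0, n):                # skip degenerate one-arm draws
            continue
        mse = objective(d)                   # step 2: evaluate the objective
        if mse < best_mse:                   # step 3: keep it if it is better
            best_d, best_mse = d.copy(), mse
    return best_d, best_mse                  # step 4: best of k draws

# The 'quick calculation' in the text: the best of k draws beats 99% of all
# assignments with probability 1 - 0.99^k, which exceeds 99% once k >= 459.
assert 1 - 0.99 ** 459 > 0.99
```

Any of the risk criteria derived above can be plugged in as `objective`; with a cheap objective, k in the tens of thousands is unproblematic on current hardware.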

More sophisticated alternative optimization methods are discussed in the Online Appendix.

5 Arguments for Randomization

There are a number of arguments that can be made for randomization and against the framework considered in this article. We shall discuss some of these, and to what extent they appear to be justified.

5.1 Randomization inference requires randomization

That is correct. How does this relate to our argument? The arguments of Section 3 show that procedures that are optimal in a decision theoretic sense do not rely on randomization. Randomization inference cannot be rationalized as an optimal procedure under the conceptual framework of decision theory, however. As a consequence, the arguments of Section 3 do not apply to it.

One could take the fact that randomization inference is not justified by decision theory as an argument against randomization inference. But one could also consider a compromise approach that is based on randomization among assignments that have a low expected MSE. Such partially randomized assignments are still close to optimal and allow the construction of randomization-based tests.

5.2 Is identification assuming conditional independence still valid without randomization?

Yes, it is. Selection on observables holds, as the name suggests, for any treatment assignment that is a function of observables. Put differently, conditional independence is guaranteed by any controlled trial, whether randomized or not, as stated in the following proposition.

Proposition 4 (Conditional independence). Suppose that (X_i, Y_i⁰, Y_i¹) are i.i.d. draws from the population of interest, which are independent of the randomization device U. Then, any treatment assignment of the form D_i = d_i(X_1, …, X_n, U) satisfies conditional independence:

(Y_i⁰, Y_i¹) ⊥ D_i | X_i. (34)

This is true, in particular, for deterministic treatment assignments, which do not depend on U.
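A small simulation makes the content of Proposition 4 concrete. The data-generating process and numbers below are ours, purely for exposition: the assignment is fully deterministic (treat every other unit along the ranks of X, a function of X_1, …, X_n alone), yet the difference in means recovers the treatment effect because overlap in X is preserved by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
X = rng.uniform(size=n)
order = np.argsort(X)
d = np.zeros(n, dtype=int)
d[order[::2]] = 1            # deterministic assignment d_i(X_1, ..., X_n)
# illustrative outcome model: f(x, t) = x + 0.5 t, i.i.d. noise
Y = X + 0.5 * d + rng.normal(scale=1.0, size=n)
ate_hat = Y[d == 1].mean() - Y[d == 0].mean()
```

Because treated and control units alternate along the covariate, the two arms have nearly identical covariate distributions, and ate_hat is close to the true effect of 0.5 despite the complete absence of randomization.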

5.3 The Bayesian approach to experimental design requires a prior

That is correct, as far as the ranking of assignments in terms of expected risk is concerned. A given assignment might be optimal for a particular prior and objective (such as the mean squared error (MSE) of an estimator for the average treatment effect (ATE)), but not for others.


This does not imply, however, that the subsequent analysis of the experimental data has to rely on the same prior as the one used to choose the design. From the perspective of a researcher analyzing the data, it does not matter that the treatment was assigned based on observables, nor how exactly the treatment was assigned, as long as conditional independence holds, which is guaranteed by Proposition 4. The design does not preclude the use of standard estimators, such as the difference in means, which do not require the specification of a prior.

5.4 Randomized assignments are perceived as fair

A possible objection to the practical feasibility of the designs proposed here is the perception that randomized assignments are fair, while assignments based on covariates are not fair, in a similar way that "tagging" in taxation is perceived as unfair. Such perceptions might impose a constraint on feasible experiments.

Note, however, that optimal designs balance the covariate distribution across treatments, leading to a more equal treatment of demographic or other groups, relative to randomized assignments.

5.5 Does the proposed approach rely on parametric assumptions?

No, it does not. Nowhere have we imposed parametric restrictions on the conditional expectation of potential outcomes or on the distribution of covariates, beyond the assumption of i.i.d. sampling. The approach of Section 4, in particular, is nonparametric.9

6 Conclusion

In this article, we discuss the question of how information from baseline covariates should be used when assigning treatment in an experiment. In order to give a well-grounded answer to this question, we adopt a decision theoretic and nonparametric framework. The nonparametric perspective and the consideration of continuous covariates distinguish this article from much of the previous experimental design literature.

A number of conclusions emerge from our analysis. First, randomization is in general not optimal. Rather, we should pick a risk-minimizing treatment assignment, which is generically unique in the presence of continuous covariates. Second, we can consider nonparametric priors that yield tractable estimators and expressions for Bayesian risk (MSE). The general form of the expected MSE for such priors is Var(β|X) − C̄' · (C + σ²I)⁻¹ · C̄, where C̄ and C are the appropriate covariance vector and matrix from the prior distribution, cf. Section 4. We suggest picking the treatment assignment that minimizes this prior risk. Finally, conditional independence between potential outcomes and treatments given covariates does not require random assignment. It only requires conducting controlled trials, and does not rely on randomized controlled trials. Matlab code to implement our proposed approach is available online at http://dx.doi.org/10.7910/DVN/I5KCWI.

Conflict of interest statement. None declared.

I thank Larry Katz for pointing this out.


9To be precise, the support of these priors is the closure of the corresponding Reproducing Kernel Hilbert Space. For further discussion, see the Online Appendix.


Appendix A: Proofs
Proof of Proposition 2. By the general characterization of best linear predictors,

β̂ = E[β|X, D] + Cov(β, Y|X, D) · Var(Y|X, D)⁻¹ · (Y − E[Y|X, D]). (A1)

Using the notation introduced before the proposition,

E[β|X, D] = μ_β
Cov(β, Y|X, D) = C̄'
Var(Y|X, D) = C + σ²I (A2)
E[Y|X, D] = μ,

which yields equation (23). We furthermore have, by the general properties of best linear predictors, that

MSE(d_1, …, d_n) = E[(β̂ − β)²|X, D] = Var(β̂ − β|X, D),

and Cov(β̂, β − β̂|X, D) = 0, so that

Var(β|X, D) = Var(β̂|X, D) + Var(β − β̂|X, D).

This immediately implies equation (24). □

Proof of Proposition 3. Let ε_i = Y_i − f(X_i, d_i). We can write

Δ := β̂ − β = Σ_i [d_i/n_1 − (1 − d_i)/n_0] · (f(X_i, d_i) + ε_i) − (1/n) Σ_i [f(X_i, 1) − f(X_i, 0)]

and

MSE(d_1, …, d_n) = E[Δ²] = Var(Δ) + E[Δ]²

= E[Var(Δ|f)] + Var(E[Δ|f]) + E[Δ]².

The first term is equal to

Var(Δ|f) = σ² (1/n_1 + 1/n_0).

The second term is equal to the variance of

E[Δ|f] = w' · f.

The third term is equal to the square of

E[Δ] = E[E[Δ|f]]

= E[w' · f]
= w' · E[f].

The claim follows once we recall E[f] = μ and Var(f) = C. □

Proof of Proposition 4. The assumption of independent sampling implies that

(X_i, Y_i⁰, Y_i¹) ⊥ (X_1, …, X_{i−1}, X_{i+1}, …, X_n, U), (A10)

thus

(Y_i⁰, Y_i¹) ⊥ (X_1, …, X_{i−1}, X_{i+1}, …, X_n, U) | X_i, (A11)

and therefore

(Y_i⁰, Y_i¹) ⊥ d_i(X_1, …, X_n, U) | X_i. (A12)

References

Berger, J. 1985. Statistical decision theory and Bayesian inference. New York: Springer.
Blattman, C., A. C. Hartman, and R. A. Blair. 2014. How to promote order and property rights under weak rule of law? An experiment in changing dispute resolution behavior through community education. American Political Science Review 108:100–120.
Findley, M. G., D. L. Nielson, and J. Sharman. 2015. Causes of noncompliance with international law: A field experiment on anonymous incorporation. American Journal of Political Science 59(1):146–61.
Kalla, J. L., and D. E. Broockman. 2015. Campaign contributions facilitate access to congressional officials: A randomized field experiment. American Journal of Political Science. doi:10.1111/ajps.12180.
Kasy, M. 2014. Using data to inform policy. Working Paper.
———. 2016. Matlab implementation for: Why experimenters might not always want to randomize, and what they could do instead. Harvard Dataverse. http://dx.doi.org/10.7910/DVN/I5KCWI.
Keele, L. 2015. The statistics of causal inference: A view from political methodology. Political Analysis. doi:10.1093/pan/mpv007.
Moore, R. T. 2012. Multivariate continuous blocking to improve political science experiments. Political Analysis 20(4):460–79.
Morgan, K. L., and D. B. Rubin. 2012. Rerandomization to improve covariate balance in experiments. Annals of Statistics 40(2):1263–82.
Nyhan, B., and J. Reifler. 2014. The effect of fact-checking on elites: A field experiment on U.S. state legislators. American Journal of Political Science.

