
Hierarchical Models for Causal Effects¹


AVI FELLER and ANDREW GELMAN

Abstract
Hierarchical models play three important roles in modeling causal effects: (i)
accounting for data collection, such as in stratified and split-plot experimental
designs; (ii) adjusting for unmeasured covariates, such as in panel studies; and
(iii) capturing treatment effect variation, such as in subgroup analyses. Across all
three areas, hierarchical models, especially Bayesian hierarchical modeling, offer
substantial benefits over classical, non-hierarchical approaches. After discussing
each of these topics, we explore some recent developments in the use of hierarchical
models for causal inference and conclude with some thoughts on new directions for
this research area.

BACKGROUND ON HIERARCHICAL MODELING AND CAUSAL INFERENCE
Historically, social scientists have estimated causal effects via classical, linear
regression. When the data come from a simple randomized experiment, this
can be a very sensible approach. However, when the data deviate from this
ideal even slightly, continuing to use this approach can be problematic. For
example, linear regression will yield misleading results if the researcher
collects test scores at the student level but randomly assigns the intervention
at the classroom level. The goal of hierarchical modeling is to use statistical
methods that properly account for non-standard data structures and other features of
the data collection process. Moreover, these methods can be extended to data
not obtained from randomized experiments.
The statistical term hierarchical modeling has two related meanings.
First, it can refer to modeling of hierarchical data structures: for example,
students within schools or, for a non-nested example, panel or time-series
cross-sectional data, in which observations can be indexed by item or by
time. Second, it can refer to modeling of the parameters within a model. We
will use the two formulations interchangeably; indeed, models of hierarchical
data structures can generally be viewed as add-ons to non-multilevel
models, by starting with a regression model with group indicators and then
assigning a second-level model to the coefficients for the group indicators.

1. For Emerging Trends in the Social and Behavioral Sciences, ed. Robert Scott and Stephen Kosslyn. We
thank Jennifer Hill and Shira Mitchell for helpful comments and the National Science Foundation and the
Institute of Education Sciences for partial support of this work.

Emerging Trends in the Social and Behavioral Sciences. Edited by Robert Scott and Stephen Kosslyn.
© 2015 John Wiley & Sons, Inc. ISBN 978-1-118-90077-2.

Causal inference can be formulated statistically as a missing-data or prediction
problem, with the effect of a specified treatment on a specified item being
the difference between the predicted outcomes under treatment and under control.
In the standard notation (Neyman, 1923; Rubin, 1974), for item $i$ there is a
treatment $Z_i$ that can equal 0 or 1, a set of pre-treatment predictors $X_i$, and
potential outcomes $Y_{i0}$ and $Y_{i1}$ corresponding to what would be observed
under one treatment or the other. The causal effect of $Z_i = 1$ compared to
$Z_i = 0$ is then conventionally defined as $Y_{i1} - Y_{i0}$. Unless otherwise stated, we
assume the stable unit treatment value assumption and an ignorable assignment
mechanism throughout.
In this essay, we consider three ways in which hierarchical modeling can
aid in causal inference:

• Accounting for Data Collection. In any data analysis, it is appropriate to
account for any individual or group characteristics that are predictive of
treatment assignment and inclusion in the dataset. When there are many
such variables (or if these design variables include categorical predictors
with many levels), multilevel modeling is a stable way to model and
adjust for these, in the same way that it is appropriate to condition on all
variables that affect the probability of inclusion in a sample survey (e.g.,
Gelman, 2007).
• Adjusting for Unmeasured Covariates. In observational studies it is typ-
ically necessary to adjust for differences between treated and control
items. If the observations are structured (for example, with longitudi-
nal, panel, or time-series cross-sectional data), multilevel modeling can
yield more efficient estimates than classical no-pooling estimates.
• Modeling Variation in Treatment Effects. Often there is interest not just in
an average treatment effect but also in how the effect varies across the
population. In the above notation, we can model the expected treatment
effect as a function of pre-treatment covariates x, and we can also model
the unexplained variance in the treatment effect.

We devote most of our space to the third of these issues because we
feel it is otherwise under-emphasized both in formal statistics and in
applications. In standard presentations of causal inference there is a quick
jump from treatment effects defined very generally to them being assumed
constant or estimated only as averages, and we believe there is the potential
to learn much more from data.
There is a long history of the use of hierarchical models for estimating
causal effects, especially in education statistics (Bryk & Raudenbush, 2002).
One reason for a revival of this topic now is that statisticians are increasingly
able and willing to fit complex regression models using regularization to
handle large numbers of predictors and arbitrary nonparametric curves (e.g.,
Rasmussen & Williams, 2006; Tibshirani, 1996). We as a field are moving
away from inferences for single parameters in linear models and toward
an acceptance of complex structures that can only be estimated with large
uncertainties.

HIERARCHICAL MODELING TO ACCOUNT FOR DATA COLLECTION


There is a general principle in statistics that the information used in the
design of data collection—the variables that are predictive of an item being
included in the study, and the variables that are predictive of treatment
assignment—should be included in the analysis. This principle is supposed
to be followed in classical design-based analyses as well as in model-based
or Bayesian analyses (Gelman et al., 2013).
Design information can be included in various ways, including survey
weighting, regression modeling, and poststratification. Moreover, specific
tools can be used and interpreted using different statistical paradigms. For
example, propensity scores (a form of model-based estimate of the prob-
abilities of treatment assignment; Rosenbaum & Rubin, 1983) can be used
to construct weights to estimate average treatment effects in randomized
experiments in which the probability of assignment to treatment varies.
Alternatively, the analysis of data from a randomized block design can
incorporate block indicators. In either case, the information that goes into
the design is being used in some way in the analysis.
This section discusses the use of hierarchical models to incorporate
information from several types of experimental designs: stratified designs,
cluster-randomized designs, split-plot designs, and longitudinal designs.
For a more complete discussion of hierarchical models to account for data
collection, see Hill (2013).

COMPLETELY RANDOMIZED EXPERIMENT


In a completely randomized experiment, the simplest estimate of the average
treatment effect is via non-multilevel regression:

$$y_i \sim N(\alpha + \beta x_i + \tau z_i,\ \sigma_y^2)$$

where $\hat{\tau}$ is an estimate of the treatment effect under certain assumptions.²
The coefficients in this model can be allowed to vary by group, whether or
not the grouping is part of the design of the study.
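As an illustration of the non-multilevel regression above, the following is a minimal sketch that simulates a completely randomized experiment and recovers the treatment coefficient by least squares. The sample size, coefficient values, and the function name `simulate_and_estimate` are assumptions chosen for the example, not quantities from the chapter.

```python
import numpy as np

def simulate_and_estimate(n=2000, tau=2.0, seed=0):
    """Simulate y_i = alpha + beta*x_i + tau*z_i + noise and return tau-hat.

    All numeric values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                      # one pre-treatment covariate
    z = np.zeros(n)
    z[rng.permutation(n)[: n // 2]] = 1.0       # complete randomization: exactly half treated
    y = 1.0 + 0.5 * x + tau * z + rng.normal(size=n)

    # Design matrix with intercept, covariate, and treatment indicator.
    X = np.column_stack([np.ones(n), x, z])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[2]                              # coefficient on z

tau_hat = simulate_and_estimate()
print(f"estimated treatment effect: {tau_hat:.3f}")
```

Because assignment is randomized, the coefficient on the treatment indicator should be close to the simulated effect of 2.0, up to sampling error.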

STRATIFIED EXPERIMENTS
In a stratified or blocked experiment, random assignment depends on one
or more observed covariates; for example, treatment is randomly assigned
to half of the men and half of the women in a study population. Failing to
account for this stratification in subsequent analysis leads to bias, since, in
Rubin’s terminology, treatment assignment is not unconditionally ignorable.
That is, the outcomes and treatment assignment are not marginally indepen-
dent. We can account for this in a multilevel setting by allowing the intercept,
𝛼, to vary by group:
$$y_i \sim N(\alpha_{g[i]} + \beta x_i + \tau z_i,\ \sigma_y^2)$$

$$\alpha_g \sim N(\mu_\alpha,\ \sigma_\alpha^2)$$

where $x_i$ are individual-level covariates; $z_i$ indicates treatment assignment;
and $\alpha_{g[i]}$ is the group-level intercept corresponding to individual $i$'s group, $g$.
In practice, this approach differs most starkly from classical non-hierarchical
regression when the number of groups is large, though this approach is still
sensible for only a few groups.
As $\sigma_\alpha \to \infty$, the model reduces to the classical “fixed effects” or no-pooling
estimate, with separate intercept estimates for each group. As $\sigma_\alpha \to 0$, the
model reduces to complete pooling and fixes all $\alpha_g$ at a common $\alpha$. Our preferred
partial-pooling approach is a compromise between these two estimates,
partially pooling the group-level parameters, $\alpha_g$, toward the group-level
mean, $\mu_\alpha$, by an amount that depends on $\sigma_\alpha$ and the sample size of each
group.
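The dependence of partial pooling on the group-level standard deviation and the group sample size can be sketched numerically. The example below assumes the simplest case: no covariates and known values of the variance parameters and the group-level mean, in which case the partially pooled intercept is a precision-weighted average of the group mean and the overall mean. The function name `partial_pool` and all numbers are hypothetical.

```python
import numpy as np

def partial_pool(ybar, n, mu_alpha, sigma_y, sigma_alpha):
    """Precision-weighted compromise between group means and the overall mean.

    Assumes no covariates and known sigma_y, sigma_alpha, mu_alpha
    (a simplification; in practice these are estimated from the data).
    """
    ybar = np.asarray(ybar, dtype=float)
    n = np.asarray(n, dtype=float)
    # Weight on the group's own mean grows with n and with sigma_alpha.
    weight = n / (n + sigma_y**2 / sigma_alpha**2)
    return weight * ybar + (1.0 - weight) * mu_alpha

ybar = np.array([10.0, 0.0, 6.0])   # observed group means (made up)
n = np.array([100, 4, 25])          # group sample sizes (made up)

est = partial_pool(ybar, n, mu_alpha=5.0, sigma_y=4.0, sigma_alpha=2.0)
no_pooling = partial_pool(ybar, n, mu_alpha=5.0, sigma_y=4.0, sigma_alpha=np.inf)
near_complete = partial_pool(ybar, n, mu_alpha=5.0, sigma_y=4.0, sigma_alpha=1e-9)
print(est, no_pooling, near_complete)
```

In this sketch the group with only four observations is pulled halfway toward the overall mean, while the group with 100 observations barely moves; an infinite group-level standard deviation returns the raw group means, and a value near zero returns the common mean.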
The variation of the intercepts $\alpha_g$ is dictated by the experimental design,
but in general the coefficients $\beta$ and $\tau$ can be allowed to vary as well. As
we discuss later in this essay, variation in the treatment effect $\tau$ can be of
substantive interest. The resulting hierarchical model includes a covariance
matrix for the distribution of $(\alpha, \tau)$ or $(\alpha, \beta, \tau)$ that must be estimated from
data. In addition, such a model should include group-level predictors where
appropriate to model predictable variation among the groups beyond what
is explained by the individual-level predictors.
A final technical note is that these models assume that the probability of
treatment assignment is constant across groups, that is, $\bar{z}_g = \bar{z}$ for all $g$,
where $\bar{z}_g$ is the proportion of units treated in group $g$. If this
does not hold, including $\bar{z}_g$ as a group-level predictor is often suitable to
correct for the relevant modeling issues. See Bryk and Raudenbush (2002),
Gelman and Hill (2006), and Hill (2013) for discussions.

2. For a discussion of issues surrounding regression in the context of analyzing randomized experiments,
see Imbens and Rubin (2015).


CLUSTER-RANDOMIZED EXPERIMENTS
In stratified experiments, randomization depends on the group but is still
applied at the individual level. In cluster-randomized experiments, every
unit in the cluster receives the same treatment; in other words, randomization
occurs at the group level.³ This design is common in the social sciences.
Examples include public health interventions rolled out by city (Hill & Scott,
2009; Imai, King, & Nall, 2009), political advertising applied at the media
market level (Green & Vavreck, 2007), and educational interventions at the
classroom or whole-school level (Raudenbush, Martinez, & Spybrook, 2007).
Extending hierarchical models to such experiments is simple: we use the
same model as for a stratified experiment but include treatment assignment
as a group-level rather than individual-level predictor:⁴

$$y_i \sim N(\alpha_{g[i]} + \beta x_i,\ \sigma_y^2)$$

$$\alpha_g \sim N(\mu_\alpha + \tau z_g,\ \sigma_\alpha^2)$$

where $z_g$ is the group-level treatment indicator. Moreover, adding group-level
covariates beyond treatment assignment is straightforward.⁵
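One way to read the two-level model for cluster-randomized experiments is to substitute the group-level equation into the individual-level one. The derivation below is a sketch; the error symbols $\eta_g$ and $\varepsilon_i$ are notation introduced here for the group-level and individual-level deviations and do not appear in the original.

```latex
\begin{align*}
\alpha_g &= \mu_\alpha + \tau z_g + \eta_g, & \eta_g &\sim N(0, \sigma_\alpha^2), \\
y_i &= \alpha_{g[i]} + \beta x_i + \varepsilon_i, & \varepsilon_i &\sim N(0, \sigma_y^2), \\
\Rightarrow \quad y_i &= \mu_\alpha + \beta x_i + \tau z_{g[i]} + \eta_{g[i]} + \varepsilon_i .
\end{align*}
```

Written this way, two units in the same cluster share the term $\eta_{g[i]}$, so their outcomes have covariance $\sigma_\alpha^2$ and intraclass correlation $\sigma_\alpha^2 / (\sigma_\alpha^2 + \sigma_y^2)$ after conditioning on $x_i$ and $z_g$. Treating the observations as independent ignores this shared component.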

SPLIT-PLOT DESIGN
One increasingly common extension of cluster-randomized experiments is
a split-plot design, in which different treatments are applied at different
levels. For example, Sinclair, McConnell, and Green (2012) conducted a
randomized voter mobilization experiment in which everyone in the treated
zip code received a voter mobilization mailing, but only some households
received a follow-up phone call, a natural multilevel setting. Another
example is Hong and Raudenbush (2006), who assessed the effect of holding
back low-achieving kindergarten students on classrooms and schools—a
setting of students within classrooms within schools.
In these examples, the “group-level treatment” is simply the count of how
many individuals in that group receive the treatment (in this case, whether
10% or 20% of the class is held back). The setup applies more generally, however:
treatments at different levels can be completely different. Extending the
above models yields

$$y_i \sim N(\alpha_{g[i]} + \beta x_i + \tau_1 z_{1,i},\ \sigma_y^2)$$

$$\alpha_g \sim N(\mu_\alpha + \tau_2 z_{2,g},\ \sigma_\alpha^2)$$

where $z_{1,i}$ indicates treatment at the first level (that is, individuals), and $z_{2,g}$
indicates treatment at the second level (groups).

3. Some texts have slightly different definitions of cluster-randomized experiments, instead treating
the term as an umbrella for all randomizations that depend on the group. See Hill (2013).
4. In this simple case, the model is algebraically equivalent to the hierarchical model in the previous
section, with the treatment effect at the individual level, and the group-level random effect. Nonetheless,
we find this formulation useful for motivating more complex settings.
5. See Imai et al. (2009) and Hill and Scott (2009) for discussion of design- versus model-based inference
for cluster-randomized designs.

LONGITUDINAL AND REPEATED MEASUREMENTS


Longitudinal designs often mimic cluster-randomized experiments: there are
multiple observations for the same individual, who is either assigned to
treatment or control. In practice, longitudinal data analysis requires consid-
eration of complex correlation and missing-data issues (Diggle, Heagerty,
Liang, & Zeger, 2002; Van der Laan & Robins, 2003). For the sake of illus-
tration, however, we note the connection between the previous models and a
simple hierarchical model that includes information about collecting obser-
vations within waves:
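$$y_i \sim N\left(\alpha^{\text{indiv}}_{j[i]} + \alpha^{\text{time}}_{k[i]},\ \sigma_y^2\right)$$

$$\alpha^{\text{indiv}}_j \sim N\left(\mu_{\text{indiv}} + \beta x_j + \tau z_j,\ \sigma_{\text{indiv}}^2\right)$$

$$\alpha^{\text{time}}_k \sim N\left(\mu_{\text{time}},\ \sigma_{\text{time}}^2\right)$$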

Extensions of this setup allow for regimes that vary over time. Examples of
multilevel modeling in this setting include Zajonc (2012), who investigates
student tracking, and Hong and Raudenbush (2007), who analyze the effects
of time-varying instructional treatments.

HIERARCHICAL MODELING TO ADJUST FOR UNMEASURED COVARIATES
In observational studies, researchers must take extra steps when com-
paring treated and control units. In general, matching and propensity
score-based approaches can help to restrict the data to a regime under
which regression-based approaches can be used to control for systematic
differences between treatment and control units (Imbens & Rubin, 2015).
Just as with more complex experimental designs, propensity score models
can often be improved by a hierarchical structure (Arpino & Mealli, 2011).
The econometrics literature has focused on an alternative approach for
observational studies, especially when estimating causal effects for panel
data. In the simplest case, observations are assumed to come from a simple
linear model, in which each individual has a single, unobserved trait, such as
ability, that is assumed to be constant over time (see Wooldridge, 2010, for a
discussion of the relevant assumptions). Consider the example of estimating
the effect of union membership on (log) wages using individual-level panel
data (e.g., Angrist & Pischke, 2008):
$$y_i = \alpha^{\text{indiv}}_{j[i]} + \alpha^{\text{time}}_{t[i]} + \tau z_i + \varepsilon_i$$

where $z_i$ denotes union membership and $\alpha^{\text{indiv}}_{j[i]}$ and $\alpha^{\text{time}}_{t[i]}$ are intercept estimates
for individual $j$ and time $t$, respectively. As before, the key inferential
question is the choice of the appropriate model for the intercepts $\alpha^{\text{indiv}}_j$, including whether
the relevant assumptions are applicable. There is a long literature in econo-
metrics comparing the no-pooling and partial-pooling estimates in this case,
which correspond to the so-called panel fixed effects and panel random
effects estimators. As with randomized experiments, the partially pooled
estimates are often more efficient than the no pooling estimates (Hausman,
1978; Wooldridge, 2010).
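To make the no-pooling ("panel fixed effects") estimate concrete, here is a minimal sketch of the within estimator on simulated panel data. It is a simplification of the model above: time intercepts are omitted, the individual trait is constructed to be correlated with treatment, and every name and number (`within_estimator`, `ability`, the effect size of 0.15) is an assumption made for illustration.

```python
import numpy as np

def within_estimator(y, z, unit):
    """No-pooling ('fixed effects') estimate of tau.

    Subtracts each unit's own mean from y and z, then regresses the
    demeaned outcome on the demeaned treatment (no time effects here).
    """
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    _, idx = np.unique(unit, return_inverse=True)
    counts = np.bincount(idx)
    y_dm = y - (np.bincount(idx, weights=y) / counts)[idx]
    z_dm = z - (np.bincount(idx, weights=z) / counts)[idx]
    return float(z_dm @ y_dm / (z_dm @ z_dm))

rng = np.random.default_rng(1)
J, T, tau = 500, 6, 0.15                     # individuals, periods, true effect (made up)
unit = np.repeat(np.arange(J), T)
ability = rng.normal(size=J)                 # unobserved, time-constant trait

# Treatment is more likely for high-ability individuals, so a pooled
# regression that ignores the individual intercepts is confounded.
p_treat = 1.0 / (1.0 + np.exp(-ability[unit]))
z = (rng.random(J * T) < p_treat).astype(float)
y = ability[unit] + tau * z + rng.normal(scale=0.5, size=J * T)

tau_fe = within_estimator(y, z, unit)
tau_pooled = np.polyfit(z, y, 1)[0]          # slope from complete-pooling regression
print(f"within estimate: {tau_fe:.3f}, pooled estimate: {tau_pooled:.3f}")
```

In this simulated setting the pooled slope is far from the true effect because it absorbs the unobserved trait, while the within estimate is close to it. A partial-pooling (random effects) fit would additionally model the individual intercepts with a common distribution, which requires the assumptions discussed in the text.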

HIERARCHICAL MODELING TO ACCOUNT FOR TREATMENT EFFECT VARIATION
The above models all center on the estimation of constant additive treatment
effects; that is, $Y_{i1} - Y_{i0} = \tau_i = \tau$ for units $i = 1, \ldots, n$. In practice, however, we
know that everything varies, especially in the social sciences, making this an
overly restrictive modeling assumption.
The key point is not just that treatment effects vary, but that we can both
predict this variation and use it to better understand the intervention of inter-
est (Ding, Feller, & Miratrix, 2014). For example, practitioners implementing
a program face budget constraints and must target resources to a subset of
the population (see, e.g., Dehejia, 2005; Imai & Strauss, 2011). Or policymak-
ers are interested in the effects of a given intervention on the distribution
of resources in society (Bitler, Gelbach, & Hoynes, 2003). In other settings,
variation in a causal effect is itself of interest (Gelman & Huang, 2008).
Even when not formally in the model, treatment effect variation is implic-
itly recognized. Once we start talking about average treatment effects (ATE),
local average treatment effects (LATE), and the like, we are implicitly talking
about interactions and varying treatment effects. Were the treatment effect
truly constant across all units, we could just speak of “the treatment effect”
without having to specify which cases we are averaging over. Any discussion
of particular average treatment effects is relevant because treatment effects
vary; that is, the treatment effect interacts with pretreatment variables.
Before turning to estimation, it is important to note that there are some
fundamental limitations to inference about varying treatment effects. First,
as with all causal inference problems, we can never observe the joint distribution
of $Y_0$ and $Y_1$, only their marginal distributions. Therefore, we never
actually observe $\tau_i$ directly and, in general, can only make inference about its
full distribution via additional modeling assumptions, though we can almost
always identify the expectation of $\tau_i$ given a set of covariates (see Ding et al.,
2014, for a discussion). Second, inference about systematic variation in $\tau_i$ (i.e.,
the variation that can be explicitly modeled) does not necessarily imply inference
about underlying causal mechanisms, in the same way that correlation
does not necessarily imply causation. See Gerber and Green (2012) for illustrative
political examples.
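The first limitation can be illustrated with a small constructed example. The sketch below builds two hypothetical populations whose potential outcomes under control and under treatment have identical marginal distributions, but which pair those outcomes differently across units. The data and variable names are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
y0 = np.sort(rng.normal(size=1000))      # potential outcomes under control

# Scenario A: every unit gains exactly 1 (constant additive effect).
y1_constant = y0 + 1.0
# Scenario B: the same set of treated outcomes, assigned in reverse order,
# so units with low y0 gain the most and units with high y0 lose.
y1_reversed = y0[::-1] + 1.0

tau_constant = y1_constant - y0
tau_reversed = y1_reversed - y0

# Both scenarios imply the same marginal distribution of Y1 ...
same_marginals = np.allclose(np.sort(y1_constant), np.sort(y1_reversed))
print("same marginals:", same_marginals)
# ... and the same average effect, but very different effect variation.
print("mean effects:", tau_constant.mean(), tau_reversed.mean())
print("sd of effects:", tau_constant.std(), tau_reversed.std())
```

Since an experiment reveals only the two marginal distributions, data alone cannot distinguish these scenarios; the average effect is the same in both, while the spread of individual effects depends entirely on the unobservable pairing.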
With those caveats, we now turn to modeling treatment effects that vary.
Ideally, such modeling should be grounded in theory. But, as we discuss
below, even when guided by strong substantive considerations, researchers
still face a broad range of non-trivial modeling choices. We therefore explore
a broad range of models for treatment effect variation, beginning with the
simplest—and by far most widely used—model of treatment effect variation:
variation in the mean treatment effect conditional on covariates. We then turn
to models of treatment effect variation in second and higher-order moments,
and to treatment effect variation on the overall outcome distribution. Finally,
we discuss nonparametric methods for estimating treatment effect variation.
Throughout we focus on the advantages of a Bayesian hierarchical approach
over non-multilevel regression modeling.

VARYING TREATMENT EFFECTS ON THE MEAN OUTCOME


We now return to the hierarchical model for stratified experiments. In that
setting, variation in the treatment effect, $\tau$, was not reflected in the model. We
can relax that assumption in the following model:

$$y_i \sim N(\alpha_{g[i]} + \beta x_i + \tau_{g[i]} z_i,\ \sigma_y^2)$$

$$\begin{pmatrix} \alpha_g \\ \tau_g \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_\alpha \\ \mu_\tau \end{pmatrix}, \begin{pmatrix} \sigma_\alpha^2 & \rho\sigma_\alpha\sigma_\tau \\ \rho\sigma_\alpha\sigma_\tau & \sigma_\tau^2 \end{pmatrix} \right)$$

which partially pools all the group-specific treatment effect estimates toward
the overall mean, $\mu_\tau$, while allowing the group-specific intercept and treatment
effect to covary (Bryk & Raudenbush, 2002; Gelman & Hill, 2006).
As in the previous sections, the classical approach is mathematically equivalent
to the hierarchical model in which $\sigma_\alpha \to \infty$ and $\sigma_\tau \to \infty$, which implies
no pooling both for the varying intercepts, $\alpha_g$, and the varying slopes, $\tau_g$.
This latter choice is especially problematic in the context of treatment effect
interactions, implying that an interaction effect of zero is as likely as an
arbitrarily large interaction, an obviously absurd statement (Dixon & Simon,
1991; Simon, 2002).
Moreover, this no-pooling approach can prove especially problematic in
the context of trying to estimate multiple weak signals. First, consider the
issue of statistical power: imagine that a researcher is interested in treatment
effect variation across two equally-sized groups and that the true interaction
effect is half as large as the overall effect. Since there are half as many people
in each subgroup, however, the precision decreases by a factor of four rela-
tive to the precision for the main effect (Simon, 2007). If the study is powered
so that the main effect is barely statistically significant, as is typical in social
science applications, then detecting the interaction effect is substantially
less likely. A practical consequence is that applied researchers often look for
interactions across many covariates. In a non-hierarchical setting, this creates
a classic multiple testing problem, as well as a strong incentive for “spec-
ification searches” (Fink, McConnell, & Vollmer, 2011; Pocock, Assmann,
Enos, & Kasten, 2002). Pre-analysis plans that specify which subgroups will
be analyzed before running the experiment mitigate this issue, but do not
completely resolve the multiple comparisons problem in the no pooling
model.
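The factor of four in the power argument can be checked with a short calculation. The sketch below assumes $n$ units in total, a common outcome variance $\sigma^2$, two equally sized subgroups, and half of each subgroup assigned to treatment; $\hat{\tau}_{\text{main}}$ and $\hat{\tau}_g$ are simple differences in means, and this notation is introduced here for illustration.

```latex
\begin{align*}
\operatorname{Var}(\hat{\tau}_{\text{main}})
  &= \frac{\sigma^2}{n/2} + \frac{\sigma^2}{n/2} = \frac{4\sigma^2}{n}, \\
\operatorname{Var}(\hat{\tau}_g)
  &= \frac{\sigma^2}{n/4} + \frac{\sigma^2}{n/4} = \frac{8\sigma^2}{n}, \qquad g = 1, 2, \\
\operatorname{Var}(\hat{\tau}_1 - \hat{\tau}_2)
  &= \frac{8\sigma^2}{n} + \frac{8\sigma^2}{n} = \frac{16\sigma^2}{n}
   = 4 \operatorname{Var}(\hat{\tau}_{\text{main}}).
\end{align*}
```

Under these assumptions the standard error of the estimated interaction is twice that of the main effect. If the true interaction is half the size of the main effect, its ratio of effect to standard error is one quarter of the main effect's; a main effect sitting at about two standard errors would correspond to an interaction at about half a standard error.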
A Bayesian hierarchical approach is not a panacea for weak signals and
multiple comparisons, but it does avoid some of the worst pitfalls that these
create. In short, multiple comparisons can be thought of as multiple analyses
in parallel, a natural opportunity for hierarchical modeling or meta-analysis.
The resulting shrinkage from hierarchical Bayesian inference automatically
reduces the “false positives” problem inherent in multiple classical infer-
ences. While this is true, pre-specifying interaction effects is still important in
a Bayesian setting: if a researcher believes that an interaction effect is impor-
tant enough to be pre-specified, the researcher’s prior distribution for that
effect is more diffuse (i.e., there is more mass far from zero) than for an inter-
action effect that is chosen “post-hoc” (Simon, 2007).
To date, we are aware of very few analyses in the social sciences that par-
tially pool treatment effect estimates across groups. One promising exception
is in the analysis of so-called multi-site trials, which are common in social pol-
icy and education evaluations (see, e.g., Bloom, Raudenbush, & Weiss, 2013).
Another is recent work by Imai and Ratkovic (2013), who use Lasso—rather
than a Bayesian hierarchical model—to regularize these treatment effect
estimates.

VARIANCE COMPONENTS MODELS OF TREATMENT EFFECTS


As in the previous section, varying treatment effects can be modeled as inter-
actions. But such models can also be rewritten in terms of variance compo-
nents (Bryk & Raudenbush, 2002; Gelman and Hill, 2006):

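$$y_i = \alpha + \beta x_i + \tau z_i + e_i$$

$$e_i = v_{g[i]} + \xi_{g[i]} z_i + \varepsilon_i$$

where $v_{g[i]} \sim N(0, \sigma_v^2)$; $\xi_{g[i]} \sim N(0, \sigma_\xi^2)$; and $\varepsilon_i \sim N(0, \sigma_\varepsilon^2)$. If $v_{g[i]}$ and $\xi_{g[i]}$ were
modeled as a bivariate Normal, this would be exactly equivalent to the previous
model.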
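The equivalence between the variance-components form and the varying-intercept, varying-slope form can be sketched by substitution. The block below writes $\sigma_\varepsilon^2$ for the variance of $\varepsilon_i$ and uses the fact that $z_i$ is binary; the covariance term is notation added here.

```latex
\begin{align*}
y_i &= \alpha + \beta x_i + \tau z_i + v_{g[i]} + \xi_{g[i]} z_i + \varepsilon_i \\
    &= \underbrace{(\alpha + v_{g[i]})}_{\alpha_{g[i]}} + \beta x_i
       + \underbrace{(\tau + \xi_{g[i]})}_{\tau_{g[i]}} z_i + \varepsilon_i , \\
\operatorname{Var}(e_i \mid z_i)
    &= \sigma_v^2 + z_i \left( \sigma_\xi^2 + 2 \operatorname{Cov}(v_g, \xi_g) \right) + \sigma_\varepsilon^2 .
\end{align*}
```

Matching terms gives $\mu_\alpha = \alpha$, $\mu_\tau = \tau$, $\sigma_\alpha = \sigma_v$, and $\sigma_\tau = \sigma_\xi$. The second line also shows that, when $v_g$ and $\xi_g$ are independent, the treated units have error variance larger than the controls by $\sigma_\xi^2$.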
Written in this way, it is clear that a much richer class of models is possible
for the error term, $e_i$. The first such models were proposed in a seminal paper
by Bryk and Raudenbush (1988), who model the error term, $e_i$, as a function
of continuous (standardized) covariates:⁶

$$e_i = \xi_i x_i + \varepsilon_i$$

Therefore, positive values of $x_i$ are associated with larger variance; for
example, high-ability students will show more variability in test scores than
low-ability students.
Applying this model directly to the case of a binary treatment indicator
implies that the treatment group will have larger variance than the control
group. Gelman (2004) calls this an additive treatment error model, which
would naturally arise in a situation with a new, active intervention that has
a variable impact. However, the opposite situation—subtractive treatment
error—is also possible; for example, in before-after data, if an active interven-
tion is applied in the “before” period and the treatment consists of dropping
this intervention in the “after” period (i.e., for incumbency; see Gelman &
Huang, 2008).⁷ Of course, a richer set of variance components models is
possible, such as allowing the variance to depend on a complex function
of pre-treatment covariates. For example, in the context of a multilevel
experimental setting, Kim and Seltzer (2011) fit hierarchical regressions on
the variance components themselves.
Such models can prove difficult to fit in practice. As Bryk and Raudenbush
(1988) observe, fat tails and other departures from the assumed parametric
model can create significant complications. We believe that, in these cases, it
is especially important to check that the model is a good fit to the observed
data (Gelman et al., 2013, Chapter 6).
6. The authors extend this to multiple covariates. There is a large literature on comparing variances
in experimental settings. See, for example, Cox (1984).
7. Gelman (2004) also discusses a “replacement treatment error” model, in which the treatment
replaces a random error component from the “before” period.

VARIATION IN THE EFFECT OF TREATMENT ON QUANTILES


Thus far, our models have focused on the first and second moments of the
outcome distributions. An alternative modeling strategy, especially popular
in economics, is instead to compare percentiles of the marginal distributions,
$Q_{y_0}$ and $Q_{y_1}$ (e.g., Angrist & Pischke, 2008). For example, Bitler et al.
(2003) investigate the effects of welfare reform on the entire wage distribution,
not just the average for a subset of the population. Similarly, Dominici,
Zeger, Parmigiani, Katz, and Christian (2006) estimate the effect of nutrients on
low-birth-weight infants, rather than on average weights.
Modeling quantiles is often more challenging than modeling means. A
growing literature has focused on this estimation challenge in a Bayesian
context (Chamberlain & Imbens, 2003; Lancaster & Jun, 2009; Taddy &
Kottas, 2010). Moreover, many of the hierarchical modeling approaches
described above can be extended to quantile regression (Reich, Bondell, &
Wang, 2010), simply replacing the mean by the relevant quantile.
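As a simple non-hierarchical illustration of comparing marginal quantiles, the sketch below computes the difference between treated and control sample quantiles on simulated data. The helper name `quantile_treatment_effects` and the simulated outcome distributions are assumptions made for the example; the hierarchical and Bayesian methods cited above are considerably richer.

```python
import numpy as np

def quantile_treatment_effects(y, z, probs):
    """Difference between treated and control sample quantiles at each prob."""
    y = np.asarray(y, dtype=float)
    z = np.asarray(z).astype(bool)
    return np.quantile(y[z], probs) - np.quantile(y[~z], probs)

rng = np.random.default_rng(0)
n = 20000
z = rng.random(n) < 0.5
# Hypothetical treatment that leaves the mean unchanged but doubles the spread.
y = np.where(z, rng.normal(0.0, 2.0, n), rng.normal(0.0, 1.0, n))

probs = np.array([0.1, 0.5, 0.9])
qte = quantile_treatment_effects(y, z, probs)
ate = y[z].mean() - y[~z].mean()
print("quantile effects at 10%, 50%, 90%:", np.round(qte, 2))
print("average effect:", round(ate, 2))
```

Here the average effect is near zero while the lower and upper quantiles move in opposite directions, which is the kind of pattern a mean-only analysis would miss. These are differences between marginal quantiles, not quantiles of unit-level effects, consistent with the identification limits noted earlier.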

ADVANCES IN MODELS FOR TREATMENT EFFECT VARIATION


Flexible Parametric Methods for Treatment Effect Variation. As in other areas of
Bayesian modeling, new developments in models for treatment effect varia-
tion across groups focus on richer parameterizations of the simpler models
discussed above. For example, there is a growing literature on specifying the
prior variance for interaction effects (Hodges, Cui, Sargent, & Carlin, 2007;
Sargent & Hodges, 1997). One promising approach is the use of nonparametric
prior distributions (Sivaganesan, Laud, & Müller, 2010) or highly
flexible hierarchical array priors (Volfovsky & Hoff, 2012) for these interaction
terms.
Another extension is to allow for treatment effects to vary across contin-
uous covariates. Researchers rarely—if ever—have a substantive reason to
make a particular assumption of the parametric form of the interaction. For
example, we might have a strong substantive reason why a given treatment
will become increasingly effective for higher income voters. However,
theory is unlikely to tell us that this relationship is linear, quadratic, or
exponential—there is no strong reason ahead of time to believe that treat-
ment effect is more likely to increase with income rather than log-income
(Beck & Jackman, 1998). Some researchers seek to avoid this problem by
discretizing their continuous variable, but this simply pushes the problem
back to a specification search of a different kind, in which researchers
find cutpoints that lead to the best results (Assmann, Pocock, Enos, &
Kasten, 2000). Flexible models for continuous covariates, such as splines
and Gaussian processes, offer a promising solution to this issue (Feller &
Holmes, 2009).

Finally, we would also like to see a class of models in which treatments with
larger main effects naturally have larger variation. As Cox (1984) observed,
“large component main effects are more likely to lead to appreciable interac-
tions than small components. Also, the interactions corresponding to larger
main effects may be in some sense of more practical importance.” See also
Gelman (2004). Bien, Taylor, and Tibshirani (2012) implement a model that
respects this hierarchy restriction in the context of the Lasso. To our knowl-
edge, however, there are no such models in a Bayesian setting.

Nonparametric Response Surface Modeling. An alternative approach is based
on directly modeling the distributions of $Y_0$ and $Y_1$, also known as response
surface modeling. For example, Hill (2011) used Bayesian additive regression
trees (BART) to model treatment effect interactions (see also Green and Kern
(2012) and Imai and Strauss (2011)). More broadly, cutting-edge Bayesian
nonparametric methods, such as Gaussian processes, can be used to flexibly
model the response surface (e.g., Tokdar, 2013).

Variation across Latent Subgroups. Finally, there is an increasing focus on modeling
treatment effect variation across latent or partially observed subgroups.
This is especially promising for principal strata, subgroups defined by the
joint distribution of intermediate outcomes, such as treatment take-up, under
treatment and control (Frangakis & Rubin, 2002). Since many researchers
already fit such models in a Bayesian framework (Hirano, Imbens, Rubin,
& Zhou, 2000; Imbens & Rubin, 2015), it is straightforward to extend these
models to a multi-level setting.

CONCLUSION
When doing causal inference for experiments and observational studies, the
ubiquitous statistical challenge is to control for systematic pretreatment dif-
ferences between treatment and control groups. Multilevel modeling arises
here for three reasons: in fitting the statistical model to multilevel data struc-
tures, as a tool for regularizing in matching or regression models with large
numbers of predictors, and for modeling variation in treatment effects.

REFERENCES

Angrist, J. D., & Pischke, J. S. (2008). Mostly harmless econometrics: An empiricist’s com-
panion. Princeton, NJ: Princeton University Press.
Arpino, B., & Mealli, F. (2011). The specification of the propensity score in multilevel
observational studies. Computational Statistics and Data Analysis, 55(4), 1770–1780.

Assmann, S. F., Pocock, S. J., Enos, L. E., & Kasten, L. E. (2000). Subgroup analysis
and other (mis)uses of baseline data in clinical trials. Lancet, 355(9209), 1064–1069.
Beck, N., & Jackman, S. (1998). Beyond linearity by default: Generalized additive
models. American Journal of Political Science, 42, 596–627.
Bien, J., Taylor, J., & Tibshirani, R. (2012). A lasso for hierarchical interactions. arXiv
preprint arXiv:1205.5050.
Bitler, M., Gelbach, J., & Hoynes, H. (2003). What mean impacts miss: Distributional
effects of welfare reform experiments. American Economic Review, 96(4), 988–1012.
Bloom, H. S., Raudenbush, S. W., & Weiss, M. (2013). Estimating variation in program
impacts: Theory, practice and applications. MDRC Working Paper.
Bryk, A. S., & Raudenbush, S. W. (1988). Heterogeneity of variance in experimental
studies: A challenge to conventional interpretations. Psychological Bulletin, 104(3),
396–404.
Bryk, A. S., & Raudenbush, S. W. (2002). Hierarchical linear models: Applications and
data analysis methods (2nd ed.). Thousand Oaks, CA: Sage Publications.
Chamberlain, G., & Imbens, G. W. (2003). Nonparametric applications of Bayesian
inference. Journal of Business and Economic Statistics, 21(1), 12–18. doi:10.1198/
073500102288618711
Cox, D. R. (1984). Interaction. International Statistical Review, 52(1), 1–31. doi:10.2307/
1403235
Dehejia, R. H. (2005). Program evaluation as a decision problem. Journal of Economet-
rics, 125(1–2), 141–173. doi:10.1016/j.jeconom.2004.04.006
Diggle, P., Heagerty, P., Liang, K. Y., & Zeger, S. (2002). Analysis of longitudinal data.
Oxford, England: Oxford University Press.
Ding, P., Feller, A., & Miratrix, L. (2014). Randomization inference for treatment
effect variation. Working paper available at http://scholar.harvard.edu/files/
feller/files/ding_feller_miratrix_submission.pdf.
Dixon, D. O., & Simon, R. (1991). Bayesian subset analysis. Biometrics, 47, 871–881.
Dominici, F., Zeger, S. L., Parmigiani, G., Katz, J., & Christian, P. (2006). Estimat-
ing percentile-specific treatment effects in counterfactual models: a case-study
of micronutrient supplementation, birth weight and infant mortality. Journal of
the Royal Statistical Society. Series C. Applied Statistics, 55(2), 261–280. doi:10.1111/
j.1467-9876.2006.00533.x
Feller, A., & Holmes, C. (2009). Beyond Toplines: Heterogeneous Treatment Effects in
Randomized Experiments. Working paper available at http://www.stat.columbia.
edu/~gelman/stuff_for_blog/feller.pdf.
Fink, G., McConnell, M., & Vollmer, S. (2011). Testing for heterogeneous treatment
effects in experimental data: False discovery risks and correction procedures. Jour-
nal of Development Effectiveness, 6(1), 44–57.
Frangakis, C. E., & Rubin, D. B. (2002). Principal stratification in causal inference.
Biometrics, 58(1), 21–29.
Gelman, A. (2004). Treatment effects in before-after data. In Applied Bayesian model-
ing and causal inference from incomplete-data perspectives (pp. 195–202). Chichester,
England: John Wiley & Sons, Ltd. doi:10.1002/0470090456.ch18
Gelman, A. (2007). Struggles with survey weighting and regression modeling (with
discussion). Statistical Science, 22, 153–188.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013).
Bayesian data analysis. Boca Raton, FL: CRC Press.
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical
models. Cambridge: Cambridge University Press.
Gelman, A., & Huang, Z. (2008). Estimating incumbency advantage and its variation,
as an example of a before–after study. Journal of the American Statistical Association,
103(482), 437–446. doi:10.1198/016214507000000626
Gerber, A. S., & Green, D. P. (2012). Field experiments: Design, analysis, and interpreta-
tion. New York, NY: W. W. Norton and Company.
Green, D. P., & Kern, H. L. (2012). Modeling heterogeneous treatment effects in sur-
vey experiments with Bayesian additive regression trees. Public Opinion Quarterly,
76(3), 491–511.
Green, D. P., & Vavreck, L. (2007). Analysis of cluster-randomized experiments: A
comparison of alternative estimation approaches. Political Analysis, 16(2), 138–152.
doi:10.1093/pan/mpm025
Hausman, J. A. (1978). Specification tests in econometrics. Econometrica, 46,
1251–1271.
Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of
Computational and Graphical Statistics, 20(1), 217–240. doi:10.1198/jcgs.2010.08162
Hill, J. L. (2013). Multilevel models and causal inference. In M. Scott, J. Simonoff &
B. Marx (Eds.), The SAGE handbook of multilevel modeling. Los Angeles, CA: Sage.
Hill, J., & Scott, M. (2009). Comment: The essential role of pair matching. Statistical
Science, 24(1), 54–58. doi:10.1214/09-STS274A
Hirano, K., Imbens, G. W., Rubin, D. B., & Zhou, X.-H. (2000). Assessing the effect of
an influenza vaccine in an encouragement design. Biostatistics, 1(1), 69–88.
Hodges, J. S., Cui, Y., Sargent, D. J., & Carlin, B. P. (2007). Smoothing balanced
single-error-term analysis of variance. Technometrics, 49(1), 12–25. doi:10.1198/
004017006000000408
Hong, G., & Raudenbush, S. W. (2006). Evaluating kindergarten retention pol-
icy. Journal of the American Statistical Association, 101(475), 901–910. doi:10.1198/
016214506000000447
Hong, G., & Raudenbush, S. W. (2007). Causal inference for time-varying instruc-
tional treatments. Journal of Educational and Behavioral Statistics, 33(3), 333–362.
doi:10.3102/1076998607307355
Imai, K., & Ratkovic, M. (2013). Estimating treatment effect heterogeneity in ran-
domized program evaluation. The Annals of Applied Statistics, 7(1), 443–470.
doi:10.1214/12-AOAS593
Imai, K., & Strauss, A. (2011). Estimation of heterogeneous treatment effects from
randomized experiments, with application to the optimal planning of the get-out-
the-vote campaign. Political Analysis, 19(1), 1–19. doi:10.1093/pan/mpq035
Imai, K., King, G., & Nall, C. (2009). The essential role of pair matching in cluster-
randomized experiments, with application to the Mexican Universal Health Insur-
ance Evaluation. Statistical Science, 24(1), 29–53. doi:10.1214/08-STS274
Imbens, G., & Rubin, D. (2015). Causal inference in statistics, social, and biomedical sci-
ences: An introduction. Cambridge: Cambridge University Press.
Kim, J., & Seltzer, M. (2011). Examining heterogeneity in residual variance to
detect differential response to treatments. Psychological Methods, 16(2), 192–208.
doi:10.1037/a0022656
Lancaster, T., & Jun, S. J. (2009). Bayesian quantile regression methods. Journal of
Applied Econometrics, 25(2), 287–307. doi:10.1002/jae.1069
Neyman, J. (1923). On the application of probability theory to agricultural
experiments. Essay on principles. Section 9. Translated and edited by D. M. Dabrowska
and T. P. Speed. Statistical Science, 5, 463–480 (1990).
Pocock, S. J., Assmann, S. E., Enos, L. E., & Kasten, L. E. (2002). Subgroup analysis,
covariate adjustment and baseline comparisons in clinical trial reporting: Cur-
rent practice and problems. Statistics in Medicine, 21(19), 2917–2930. doi:10.1002/
sim.1296
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning.
Cambridge, MA: MIT Press.
Raudenbush, S. W., Martinez, A., & Spybrook, J. (2007). Strategies for improving pre-
cision in group-randomized experiments. Educational Evaluation and Policy Analy-
sis, 29(1), 5–29. doi:10.3102/0162373707299460
Reich, B. J., Bondell, H. D., & Wang, H. J. (2010). Flexible Bayesian quantile regres-
sion for independent and clustered data. Biostatistics, 11(2), 337–352. doi:10.1093/
biostatistics/kxp049
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in
observational studies for causal effects. Biometrika, 70(1), 41–55.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and non-
randomized studies. Journal of Educational Psychology, 66(5), 688–701.
Sargent, D. J., & Hodges, J. S. (1997). Smoothed ANOVA with application to
subgroup analysis. Research Report rr2005-018, Department of Biostatistics, University of
Minnesota.
Simon, R. (2002). Bayesian subset analysis: Application to studying treatment-
by-gender interactions. Statistics in Medicine, 21(19), 2909–2916. doi:10.1002/
sim.1295
Simon, R. M. (2007). Subgroup analysis. In Wiley Encyclopedia of clinical trials. New
York, NY: John Wiley & Sons, Inc.
Sinclair, B., McConnell, M., & Green, D. P. (2012). Detecting spillover effects: Design
and analysis of multilevel experiments. American Journal of Political Science, 56(4),
1055–1069. doi:10.1111/j.1540-5907.2012.00592.x
Sivaganesan, S., Laud, P. W., & Müller, P. (2010). A Bayesian subgroup analysis
with a zero-enriched Polya Urn scheme. Statistics in Medicine, 30(4), 312–323.
doi:10.1002/sim.4108
Taddy, M. A., & Kottas, A. (2010). A Bayesian nonparametric approach to inference
for quantile regression. Journal of Business and Economic Statistics, 28(3), 357–369.
doi:10.1198/jbes.2009.07331
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society B, 58, 267–288.
Tokdar, S. T. (2013). Causal analysis of observational data with Gaussian process
potential outcome models. Presentation at the 2013 Joint Statistical Meetings.
Van der Laan, M. J., & Robins, J. M. (2003). Unified methods for censored longitudinal
data and causality. New York, NY: Springer.
Volfovsky, A., & Hoff, P. D. (2012). Hierarchical array priors for ANOVA
decompositions. arXiv.org. Retrieved from http://arxiv.org/pdf/1208.1726v1.pdf
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data (2nd ed.).
Cambridge, MA: MIT Press.
Zajonc, T. (2012). Bayesian inference for dynamic treatment regimes: Mobility, equity,
and efficiency in student tracking. Journal of the American Statistical Association,
107(497), 80–92. doi:10.1080/01621459.2011.643747

FURTHER READING

Cox, D. R. (1958). The interpretation of the effects of non-additivity in the Latin
square. Biometrika, 45, 69–73.

AVI FELLER SHORT BIOGRAPHY


Avi Feller is a PhD student in the Harvard Statistics Department, where he
applies statistical methods to public policy. Prior to Harvard, Avi served as
Special Assistant to Office of Management and Budget (OMB) Directors Peter
Orszag and Jack Lew and was a research associate at the Center on Budget
and Policy Priorities. Avi earned an MS in Applied Statistics as a Rhodes
Scholar at the University of Oxford and holds a BA in Political Science and
Applied Mathematics from Yale University.

ANDREW GELMAN SHORT BIOGRAPHY


Andrew Gelman is a professor of statistics and political science and director
of the Applied Statistics Center at Columbia University. His books include
Bayesian Data Analysis (with John Carlin, Hal Stern, David Dunson, Aki
Vehtari, and Don Rubin), Teaching Statistics: A Bag of Tricks (with Deb
Nolan), Data Analysis Using Regression and Multilevel/Hierarchical Models
(with Jennifer Hill), Red State, Blue State, Rich State, Poor State: Why
Americans Vote the Way They Do (with David Park, Boris Shor, and Jeronimo
Cortina), and A Quantitative Tour of the Social Sciences (coedited with
Jeronimo Cortina).

RELATED ESSAYS
Statistical Power Analysis (Psychology), Christopher L. Aberson
Text Analysis (Methods), Carl W. Roberts
Person-Centered Analysis (Methods), Alexander von Eye and Wolfgang
Wiedermann
