Hierarchical Models For Causal Effects1
Abstract
Hierarchical models play three important roles in modeling causal effects: (i)
accounting for data collection, such as in stratified and split-plot experimental
designs; (ii) adjusting for unmeasured covariates, such as in panel studies; and
(iii) capturing treatment effect variation, such as in subgroup analyses. Across all
three areas, hierarchical models, especially Bayesian hierarchical modeling, offer
substantial benefits over classical, non-hierarchical approaches. After discussing
each of these topics, we explore some recent developments in the use of hierarchical
models for causal inference and conclude with some thoughts on new directions for
this research area.
1. For Emerging Trends in the Social and Behavioral Sciences, ed. Robert Scott and Stephen Kosslyn. We
thank Jennifer Hill and Shira Mitchell for helpful comments and the National Science Foundation and the
Institute of Education Sciences for partial support of this work.
Emerging Trends in the Social and Behavioral Sciences. Edited by Robert Scott and Stephen Kosslyn.
© 2015 John Wiley & Sons, Inc. ISBN 978-1-118-90077-2.
STRATIFIED EXPERIMENTS
In a stratified or blocked experiment, random assignment depends on one
or more observed covariates; for example, treatment is randomly assigned
to half of the men and half of the women in a study population. Failing to
account for this stratification in subsequent analysis leads to bias, since, in
Rubin’s terminology, treatment assignment is not unconditionally ignorable.
That is, the outcomes and treatment assignment are not marginally indepen-
dent. We can account for this in a multilevel setting by allowing the intercept,
𝛼, to vary by group:
y_i ∼ N(α_{g[i]} + β x_i + τ z_i, σ²_y)
α_g ∼ N(μ_α, σ²_α)
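As a concrete illustration, the danger of ignoring stratification can be simulated in a few lines. This is a hypothetical numerical sketch, not from the original chapter: the covariate term βx_i is omitted, the stratum intercepts and treated shares are invented, and a simple within-stratum difference in means stands in for a fitted multilevel model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stratified experiment: intercepts vary by stratum and the
# treated fraction differs across strata, so treatment assignment is
# only conditionally ignorable (given the stratum).
n_strata, n_per, tau = 4, 500, 2.0
alpha = np.array([-3.0, -1.0, 1.0, 3.0])   # stratum intercepts
p = np.array([0.2, 0.4, 0.6, 0.8])         # treated share per stratum

g = np.repeat(np.arange(n_strata), n_per)  # stratum label for each unit
z = np.zeros(g.size)
for s in range(n_strata):                  # randomize within each stratum
    idx = np.flatnonzero(g == s)
    n_t = int(p[s] * n_per + 0.5)
    z[rng.permutation(idx)[:n_t]] = 1.0
y = alpha[g] + tau * z + rng.normal(0.0, 1.0, size=g.size)

# Ignoring the strata confounds tau with the stratum intercepts ...
naive = y[z == 1].mean() - y[z == 0].mean()
# ... while averaging within-stratum differences recovers tau.
stratified = np.mean([y[(g == s) & (z == 1)].mean() -
                      y[(g == s) & (z == 0)].mean()
                      for s in range(n_strata)])
```

With these invented numbers the naive estimate is biased upward by roughly 2, while the stratified estimate is close to the true τ = 2, which is the point of conditioning on the assignment mechanism.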
2. For a discussion of issues surrounding regression in the context of analyzing randomized experi-
ments, see Imbens and Rubin (2015).
CLUSTER-RANDOMIZED EXPERIMENTS
In stratified experiments, randomization depends on the group but is still
applied at the individual level. In cluster-randomized experiments, every
unit in the cluster receives the same treatment; in other words, randomiza-
tion occurs at the group level.3 This design is common in the social sciences.
Examples include public health interventions rolled out by city (Hill & Scott,
2009; Imai, King, & Nall, 2009), political advertising applied at the media
market level (Green & Vavreck, 2007), and educational interventions at the
classroom or whole-school level (Raudenbush, Martinez, & Spybrook, 2007).
Extending hierarchical models to such experiments is simple—we use the
same model as for a stratified experiment but include treatment assignment
as a group-level rather than individual-level predictor:4

y_i ∼ N(α_{g[i]} + β x_i, σ²_y)
α_g ∼ N(μ_α + τ z_g, σ²_α)
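A minimal simulation sketch of this design (hypothetical parameter values, not from the original chapter; a difference of cluster means stands in for the fitted hierarchical model) makes the key feature visible: the effective sample size is the number of clusters, not the number of individuals.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical cluster-randomized experiment: z_g is assigned at the
# cluster level, so every unit in a cluster shares one treatment value.
n_clusters, n_per, tau = 200, 10, 1.5
z_g = np.zeros(n_clusters)
z_g[rng.permutation(n_clusters)[: n_clusters // 2]] = 1.0
alpha_g = rng.normal(0.0, 1.0, size=n_clusters) + tau * z_g

g = np.repeat(np.arange(n_clusters), n_per)   # cluster label per unit
y = alpha_g[g] + rng.normal(0.0, 1.0, size=g.size)

# Analyze at the level of randomization: compare cluster means, so the
# group-level variance enters the standard error in full.
cm = np.array([y[g == c].mean() for c in range(n_clusters)])
est = cm[z_g == 1].mean() - cm[z_g == 0].mean()
```

Treating the 2,000 individuals as independent would badly overstate precision here; the cluster-mean comparison respects the design.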
SPLIT-PLOT DESIGN
One increasingly common extension of cluster-randomized experiments is
a split-plot design, in which different treatments are applied at different
levels. For example, Sinclair, McConnell, and Green (2012) conducted a
randomized voter mobilization experiment in which everyone in the treated
zip code received a voter mobilization mailing, but only some households
received a follow-up phone call, a natural multilevel setting. Another
example is Hong and Raudenbush (2006), who assessed the effect of holding
back low-achieving kindergarten students on classrooms and schools—a
setting of students within classrooms within schools.
In these examples, the “group-level treatment” is simply the count of how
many individuals in that group receive the treatment (in this case, whether
3. Some texts have slightly different definitions of cluster-randomized experiments, instead treating
the term as an umbrella for all randomizations that depend on the group. See Hill (2013).
4. In this simple case, the model is algebraically equivalent to the hierarchical model in the previous
section, with the treatment effect at the individual level, and the group-level random effect. Nonetheless,
we find this formulation useful for motivating more complex settings.
5. See Imai et al. (2009) and Hill and Scott (2009) for discussion of design- versus model-based infer-
ence for cluster-randomized designs.
10% or 20% of the class is held back). The setup applies more generally, how-
ever: treatments at different levels can be completely different. Extending the
above models yields

y_i ∼ N(α_{g[i]} + β x_i + τ₁ z_{1,i}, σ²_y)
α_g ∼ N(μ_α + τ₂ z_{2,g}, σ²_α)

where z_{1,i} indicates treatment at the first level (that is, individuals) and z_{2,g} indicates treatment at the second level (groups).
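A split-plot assignment can be sketched in simulation (hypothetical effects and parameter values, not from the original chapter; each effect is estimated by a simple comparison at its own level of randomization rather than by a fitted multilevel model).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical split-plot design: z2 randomized across groups (say, a
# zip-code-level mailing), z1 randomized across individuals within each
# group (say, a follow-up phone call), with true effects tau1 and tau2.
n_groups, n_per, tau1, tau2 = 100, 20, 1.0, 0.5
z2 = np.zeros(n_groups)
z2[rng.permutation(n_groups)[: n_groups // 2]] = 1.0
alpha_g = rng.normal(0.0, 0.5, size=n_groups) + tau2 * z2

g = np.repeat(np.arange(n_groups), n_per)
z1 = np.zeros(g.size)
for c in range(n_groups):                  # randomize within each group
    idx = np.flatnonzero(g == c)
    z1[rng.permutation(idx)[: n_per // 2]] = 1.0
y = alpha_g[g] + tau1 * z1 + rng.normal(0.0, 1.0, size=g.size)

# Individual-level effect: within-group differences, averaged.
est1 = np.mean([y[(g == c) & (z1 == 1)].mean() -
                y[(g == c) & (z1 == 0)].mean() for c in range(n_groups)])
# Group-level effect: difference of cluster means across z2 arms.
cm = np.array([y[g == c].mean() for c in range(n_groups)])
est2 = cm[z2 == 1].mean() - cm[z2 == 0].mean()
```

Note the asymmetry in precision: est1 is estimated from thousands of within-group contrasts, while est2 rests on only 100 group means.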
α_j^{indiv} ∼ N(μ^{indiv} + β x_j + τ z_j, σ²_{indiv})
α_k^{time} ∼ N(μ^{time}, σ²_{time})
Extensions of this setup allow for regimes that vary over time. Examples of
multilevel modeling in this setting include Zajonc (2012), who investigates
student tracking, and Hong and Raudenbush (2007), who analyze the effects
of time-varying instructional treatments.
where z_i denotes union membership and α^{indiv}_{j[i]} and α^{time}_{t[i]} are intercept estimates
for individual j and time t, respectively. As before, the key inferential
question is the choice of the appropriate model for the intercepts, α, including whether
the relevant assumptions are applicable. There is a long literature in econo-
metrics comparing the no-pooling and partial-pooling estimates in this case,
which correspond to the so-called panel fixed effects and panel random
effects estimators. As with randomized experiments, the partially pooled
estimates are often more efficient than the no-pooling estimates (Hausman,
1978; Wooldridge, 2010).
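The efficiency comparison can be sketched numerically (hypothetical parameter values, not from the original chapter; an empirical-Bayes shrinkage formula with known variance components stands in for a full random-effects fit).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setting: group intercepts alpha_g ~ N(0, sigma_a^2),
# each observed through a noisy group mean ybar_g.
n_groups, n_per = 200, 5
sigma_a, sigma_y = 1.0, 2.0
alpha = rng.normal(0.0, sigma_a, size=n_groups)
ybar = alpha + rng.normal(0.0, sigma_y / np.sqrt(n_per), size=n_groups)

# No pooling ("fixed effects" flavor): each raw group mean as is.
no_pool = ybar

# Partial pooling ("random effects" flavor): shrink toward the grand
# mean by the classic factor sigma_a^2 / (sigma_a^2 + sigma_y^2 / n),
# with the variance components treated as known for simplicity.
w = sigma_a**2 / (sigma_a**2 + sigma_y**2 / n_per)
partial = ybar.mean() + w * (ybar - ybar.mean())

mse_no_pool = np.mean((no_pool - alpha) ** 2)
mse_partial = np.mean((partial - alpha) ** 2)
```

Partial pooling trades a small bias toward the grand mean for a large variance reduction, which is the efficiency gain the panel literature refers to.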
(α_g, τ_g)′ ∼ N((μ_α, μ_τ)′, Σ),  Σ = [ σ²_α  ρ σ_α σ_τ ; ρ σ_α σ_τ  σ²_τ ]
which partially pools all the group-specific treatment effect estimates toward
the overall mean, μ_τ, while allowing the group-specific intercept and treatment
effect to co-vary (Bryk & Raudenbush, 2002; Gelman & Hill, 2006).
As in the previous sections, the classical approach is mathematically equivalent
to the hierarchical model in which σ_α → ∞ and σ_τ → ∞, which implies
no pooling both for the varying intercepts, α_g, and the varying slopes, τ_g.
This latter choice is especially problematic in the context of treatment effect
interactions, implying that an interaction effect of zero is as likely as an arbi-
trarily large interaction—an obviously absurd statement (Dixon & Simon,
1991; Simon, 2002).
Moreover, this no-pooling approach can prove especially problematic in
the context of trying to estimate multiple weak signals. First, consider the
issue of statistical power: imagine that a researcher is interested in treatment
effect variation across two equally-sized groups and that the true interaction
effect is half as large as the overall effect. Since there are half as many people
in each subgroup, however, the precision decreases by a factor of four rela-
tive to the precision for the main effect (Simon, 2007). If the study is powered
so that the main effect is barely statistically significant, as is typical in social
science applications, then detecting the interaction effect is substantially
less likely. A practical consequence is that applied researchers often look for
interactions across many covariates. In a non-hierarchical setting, this creates
a classic multiple testing problem, as well as a strong incentive for “specification
searches” (Fink, McConnell, & Vollmer, 2011; Assmann, Pocock,
Enos, & Kasten, 2000). Pre-analysis plans that specify which subgroups will
be analyzed before running the experiment mitigate this issue, but do not
completely resolve the multiple comparisons problem in the no-pooling
model.
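The factor-of-four precision loss is simple arithmetic, sketched below under the usual equal-variance, equal-split assumptions (the sample size N = 1000 is only illustrative).

```python
# N units, half treated: the main effect is a difference of two means
# of N/2 units each, so Var = sigma^2/(N/2) + sigma^2/(N/2) = 4*sigma^2/N.
sigma2, N = 1.0, 1000
var_main = sigma2 / (N / 2) + sigma2 / (N / 2)

# Two equal subgroups: each subgroup effect uses N/4 treated and N/4
# control units, and the interaction is the difference of the two
# subgroup effects, so its variance is 16*sigma^2/N.
var_subgroup = sigma2 / (N / 4) + sigma2 / (N / 4)
var_interaction = 2 * var_subgroup

ratio = var_interaction / var_main   # precision is four times worse
```

So the interaction's standard error is twice that of the main effect, and its precision (inverse variance) is one quarter.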
A Bayesian hierarchical approach is not a panacea for weak signals and
multiple comparisons, but it does avoid some of the worst pitfalls that these
create. In short, multiple comparisons can be thought of as multiple analyses
in parallel, a natural opportunity for hierarchical modeling or meta-analysis.
The resulting shrinkage from hierarchical Bayesian inference automatically
reduces the “false positives” problem inherent in multiple classical infer-
ences. While this is true, pre-specifying interaction effects is still important in
a Bayesian setting: if a researcher believes that an interaction effect is impor-
tant enough to be pre-specified, the researcher’s prior distribution for that
effect is more diffuse (i.e., there is more mass far from zero) than for an inter-
action effect that is chosen “post-hoc” (Simon, 2007).
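The shrinkage argument can be made concrete with a small simulation (hypothetical numbers, not from the original chapter; a method-of-moments empirical-Bayes fit of a normal-normal model stands in for a full Bayesian analysis).

```python
import numpy as np

rng = np.random.default_rng(4)

# Fifty subgroup interaction estimates, all with true effect zero and
# standard error 1: pure noise, the worst case for multiple testing.
K, se = 50, 1.0
est = rng.normal(0.0, se, size=K)

# Classical no-pooling analysis tests each estimate separately, so
# about 5% of the null effects appear "significant" by construction.
classical_hits = int(np.sum(np.abs(est) > 1.96 * se))

# Empirical-Bayes shrinkage: estimate the between-subgroup variance
# from the spread of the estimates themselves (method of moments),
# then pull every estimate toward the common mean.
tau2 = max(est.var() - se**2, 0.0)
w = tau2 / (tau2 + se**2)
shrunk = est.mean() + w * (est - est.mean())

mse_classical = np.mean(est**2)   # truth is zero for every subgroup
mse_shrunk = np.mean(shrunk**2)
```

Because the estimates are pure noise, the estimated between-subgroup variance is near zero, the shrinkage factor w is small, and the shrunken estimates sit near zero, automatically damping the false positives that the one-at-a-time analysis produces.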
To date, we are aware of very few analyses in the social sciences that par-
tially pool treatment effect estimates across groups. One promising exception
is in the analysis of so-called multi-site trials, which are common in social pol-
icy and education evaluations (see, e.g., Bloom, Raudenbush, & Weiss, 2013).
Another is recent work by Imai and Ratkovic (2013), who use Lasso—rather
than a Bayesian hierarchical model—to regularize these treatment effect
estimates.
e_i = v_{g[i]} + ξ_{g[i]} z_i + ε_i

where v_g ∼ N(0, σ²_v); ξ_g ∼ N(0, σ²_ξ); and ε_i ∼ N(0, σ²_ε). If v_{g[i]} and ξ_{g[i]} were
modeled as a bivariate Normal, this would be exactly equivalent to the previous
model.
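One observable implication of this decomposition is that the group-level treatment effect variation ξ contributes variance only in the treated arm, which a quick simulation makes visible (hypothetical parameter values, with the components drawn independently).

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulate e_i = v_g[i] + xi_g[i]*z_i + eps_i with independent parts.
n_groups, n_per = 500, 20
sigma_v, sigma_xi, sigma_e = 1.0, 1.0, 1.0
v = rng.normal(0.0, sigma_v, size=n_groups)
xi = rng.normal(0.0, sigma_xi, size=n_groups)

g = np.repeat(np.arange(n_groups), n_per)
z = (np.arange(g.size) % 2).astype(float)   # half of each group treated
e = v[g] + xi[g] * z + rng.normal(0.0, sigma_e, size=g.size)

# The xi term inflates the treated-arm variance:
# roughly sigma_v^2 + sigma_xi^2 + sigma_e^2 = 3 in the treated arm
# versus sigma_v^2 + sigma_e^2 = 2 in the control arm.
var_treated = e[z == 1].var()
var_control = e[z == 0].var()
```

Unequal outcome variances across arms are thus a diagnostic signal of treatment effect variation, even before any group-level model is fit.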
Written in this way, it is clear that a much richer class of models is possible
for the error term, e_i. The first such models were proposed in a seminal paper
by Bryk and Raudenbush (1988), who model the error term, e_i, as a function
of continuous (standardized) covariates:6
e_i = ξ_i x_i + ε_i
Finally, we would also like to see a class of models in which treatments with
larger main effects naturally have larger variation. As Cox (1984) observed,
“large component main effects are more likely to lead to appreciable interac-
tions than small components. Also, the interactions corresponding to larger
main effects may be in some sense of more practical importance.” See also
Gelman (2004). Bien, Taylor, and Tibshirani (2012) implement a model that
respects this hierarchy restriction in the context of the Lasso. To our knowl-
edge, however, there are no such models in a Bayesian setting.
CONCLUSION
When doing causal inference for experiments and observational studies, the
ubiquitous statistical challenge is to control for systematic pretreatment dif-
ferences between treatment and control groups. Multilevel modeling arises
here for three reasons: in fitting the statistical model to multilevel data struc-
tures, as a tool for regularizing in matching or regression models with large
numbers of predictors, and for modeling variation in treatment effects.
REFERENCES
Angrist, J. D., & Pischke, J. S. (2008). Mostly harmless econometrics: An empiricist’s com-
panion. Princeton, NJ: Princeton University Press.
Arpino, B., & Mealli, F. (2011). The specification of the propensity score in multilevel
observational studies. Computational Statistics and Data Analysis, 55(4), 1770–1780.
Assmann, S. F., Pocock, S. J., Enos, L. E., & Kasten, L. E. (2000). Subgroup analysis
and other (mis) uses of baseline data in clinical trials. Lancet, 355(9209), 1064–1069.
Beck, N., & Jackman, S. (1998). Beyond linearity by default: Generalized additive
models. American Journal of Political Science, 42, 596–627.
Bien, J., Taylor, J., & Tibshirani, R. (2012). A lasso for hierarchical interactions. arXiv
preprint arXiv:1205.5050.
Bitler, M., Gelbach, J., & Hoynes, H. (2006). What mean impacts miss: Distributional
effects of welfare reform experiments. American Economic Review, 96(4), 988–1012.
Bloom, H. S., Raudenbush, S. W., & Weiss, M. (2013). Estimating variation in program
impacts: Theory, practice and applications. MDRC Working Paper.
Bryk, A. S., & Raudenbush, S. W. (1988). Heterogeneity of variance in experimental
studies: A challenge to conventional interpretations. Psychological Bulletin, 104(3),
396–404.
Bryk, A. S., & Raudenbush, S. W. (2002). Hierarchical linear models: Applications and
data analysis methods (2nd ed.). Thousand Oaks, CA: Sage Publications.
Chamberlain, G., & Imbens, G. W. (2003). Nonparametric applications of Bayesian
inference. Journal of Business and Economic Statistics, 21(1), 12–18. doi:10.1198/
073500102288618711
Cox, D. R. (1984). Interaction. International Statistical Review, 52(1), 1–31. doi:10.2307/
1403235
Dehejia, R. H. (2005). Program evaluation as a decision problem. Journal of Economet-
rics, 125(1–2), 141–173. doi:10.1016/j.jeconom.2004.04.006
Diggle, P., Heagerty, P., Liang, K. Y., & Zeger, S. (2002). Analysis of longitudinal data.
Oxford, England: Oxford University Press.
Ding, P., Feller, A., & Miratrix, L. (2014). Randomization inference for treatment
effect variation. Working paper available at http://scholar.harvard.edu/files/
feller/files/ding_feller_miratrix_submission.pdf.
Dixon, D. O., & Simon, R. (1991). Bayesian subset analysis. Biometrics, 47, 871–881.
Dominici, F., Zeger, S. L., Parmigiani, G., Katz, J., & Christian, P. (2006). Estimat-
ing percentile-specific treatment effects in counterfactual models: a case-study
of micronutrient supplementation, birth weight and infant mortality. Journal of
the Royal Statistical Society. Series C. Applied Statistics, 55(2), 261–280. doi:10.1111/
j.1467-9876.2006.00533.x
Feller, A., & Holmes, C. (2009). Beyond Toplines: Heterogeneous Treatment Effects in
Randomized Experiments. Working paper available at http://www.stat.columbia.
edu/∼gelman/stuff_for_blog/feller.pdf.
Fink, G., McConnell, M., & Vollmer, S. (2011). Testing for heterogeneous treatment
effects in experimental data: False discovery risks and correction procedures. Jour-
nal of Development Effectiveness, 6(1), 44–57.
Frangakis, C. E., & Rubin, D. B. (2002). Principal stratification in causal inference.
Biometrics, 58(1), 21–29.
Gelman, A. (2004). Treatment effects in before-after data. In Applied Bayesian model-
ing and causal inference from incomplete-data perspectives (pp. 195–202). Chichester,
England: John Wiley & Sons, Ltd. doi:10.1002/0470090456.ch18
Gelman, A. (2007). Struggles with survey weighting and regression modeling (with
discussion). Statistical Science, 22, 153–188.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013).
Bayesian data analysis. Boca Raton, FL: CRC Press.
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical
models. Cambridge: Cambridge University Press.
Gelman, A., & Huang, Z. (2008). Estimating incumbency advantage and its variation,
as an example of a before–after study. Journal of the American Statistical Association,
103(482), 437–446. doi:10.1198/016214507000000626
Gerber, A. S., & Green, D. P. (2012). Field experiments: Design, analysis, and interpreta-
tion. New York, NY: W. W. Norton and Company.
Green, D. P., & Kern, H. L. (2012). Modeling heterogeneous treatment effects in sur-
vey experiments with Bayesian additive regression trees. Public Opinion Quarterly,
76(3), 491–511.
Green, D. P., & Vavreck, L. (2007). Analysis of cluster-randomized experiments: A
comparison of alternative estimation approaches. Political Analysis, 16(2), 138–152.
doi:10.1093/pan/mpm025
Hausman, J. A. (1978). Specification tests in econometrics. Econometrica, 46,
1251–1271.
Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of
Computational and Graphical Statistics, 20(1), 217–240. doi:10.1198/jcgs.2010.08162
Hill, J. L. (2013). Multilevel models and causal inference. In M. Scott, J. Simonoff &
B. Marx (Eds.), The SAGE handbook of multilevel modeling. Los Angeles, CA: Sage.
Hill, J., & Scott, M. (2009). Comment: The essential role of pair matching. Statistical
Science, 24(1), 54–58. doi:10.1214/09-STS274A
Hirano, K., Imbens, G. W., Rubin, D. B., & Zhou, X.-H. (2000). Assessing the effect of
an influenza vaccine in an encouragement design. Biostatistics, 1(1), 69–88.
Hodges, J. S., Cui, Y., Sargent, D. J., & Carlin, B. P. (2007). Smoothing balanced
single-error-term analysis of variance. Technometrics, 49(1), 12–25. doi:10.1198/
004017006000000408
Hong, G., & Raudenbush, S. W. (2006). Evaluating kindergarten retention pol-
icy. Journal of the American Statistical Association, 101(475), 901–910. doi:10.1198/
016214506000000447
Hong, G., & Raudenbush, S. W. (2007). Causal inference for time-varying instruc-
tional treatments. Journal of Educational and Behavioral Statistics, 33(3), 333–362.
doi:10.3102/1076998607307355
Imai, K., & Ratkovic, M. (2013). Estimating treatment effect heterogeneity in ran-
domized program evaluation. The Annals of Applied Statistics, 7(1), 443–470.
doi:10.1214/12-AOAS593
Imai, K., & Strauss, A. (2011). Estimation of heterogeneous treatment effects from
randomized experiments, with application to the optimal planning of the get-out-
the-vote campaign. Political Analysis, 19(1), 1–19. doi:10.1093/pan/mpq035
Imai, K., King, G., & Nall, C. (2009). The essential role of pair matching in cluster-
randomized experiments, with application to the Mexican Universal Health Insur-
ance Evaluation. Statistical Science, 24(1), 29–53. doi:10.1214/08-STS274
Imbens, G., & Rubin, D. (2015). Causal inference in statistics, social, and biomedical sci-
ences: An introduction. Cambridge: Cambridge University Press.
Van der Laan, M. J., & Robins, J. M. (2003). Unified methods for censored longitudinal
data and causality. New York, NY: Springer.
Volfovsky, A., & Hoff, P. D. (2012). Hierarchical array priors for ANOVA decompo-
sitions. arXiv.org. Retrieved from http://arxiv.org/pdf/1208.1726v1.pdf
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data (2nd ed.).
Cambridge, MA: MIT Press.
Zajonc, T. (2012). Bayesian inference for dynamic treatment regimes: Mobility, equity,
and efficiency in student tracking. Journal of the American Statistical Association,
107(497), 80–92. doi:10.1080/01621459.2011.643747
FURTHER READING
RELATED ESSAYS
Statistical Power Analysis (Psychology), Christopher L. Aberson
Text Analysis (Methods), Carl W. Roberts
Person-Centered Analysis (Methods), Alexander von Eye and Wolfgang
Wiedermann