
Journal of the American Statistical Association
ISSN: (Print) (Online) Journal homepage: www.tandfonline.com/journals/uasa20

An Automated Approach to Causal Inference in Discrete Settings

Guilherme Duarte, Noam Finkelstein, Dean Knox, Jonathan Mummolo & Ilya Shpitser

To cite this article: Guilherme Duarte, Noam Finkelstein, Dean Knox, Jonathan Mummolo & Ilya Shpitser (21 Aug 2023): An Automated Approach to Causal Inference in Discrete Settings, Journal of the American Statistical Association, DOI: 10.1080/01621459.2023.2216909

To link to this article: https://doi.org/10.1080/01621459.2023.2216909

© 2023 The Author(s). Published with license by Taylor & Francis Group, LLC.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2023, VOL. 00, NO. 0, 1–16
Theory and Methods
https://doi.org/10.1080/01621459.2023.2216909

An Automated Approach to Causal Inference in Discrete Settings

Guilherme Duarte (a), Noam Finkelstein (b), Dean Knox (a), Jonathan Mummolo (c), and Ilya Shpitser (b)

(a) Operations, Information and Decisions Department, The Wharton School of the University of Pennsylvania, Philadelphia, PA; (b) Department of Computer Science, Whiting School of Engineering at the Johns Hopkins University, Baltimore, MD; (c) Department of Politics and School of Public and International Affairs, Princeton University, Princeton, NJ

ABSTRACT
Applied research conditions often make it impossible to point-identify causal estimands without untenable assumptions. Partial identification—bounds on the range of possible solutions—is a principled alternative, but the difficulty of deriving bounds in idiosyncratic settings has restricted its application. We present a general, automated numerical approach to causal inference in discrete settings. We show causal questions with discrete data reduce to polynomial programming problems, then present an algorithm to automatically bound causal effects using efficient dual relaxation and spatial branch-and-bound techniques. The user declares an estimand, states assumptions, and provides data—however incomplete or mismeasured. The algorithm then searches over admissible data-generating processes and outputs the most precise possible range consistent with available information—that is, sharp bounds—including a point-identified solution if one exists. Because this search can be computationally intensive, our procedure reports and continually refines non-sharp ranges guaranteed to contain the truth at all times, even when the algorithm is not run to completion. Moreover, it offers an ε-sharpness guarantee, characterizing the worst-case looseness of the incomplete bounds. These techniques are implemented in our Python package, autobounds. Analytically validated simulations show the method accommodates classic obstacles—including confounding, selection, measurement error, noncompliance, and nonresponse. Supplementary materials for this article are available online.

ARTICLE HISTORY
Received March 2022; Accepted April 2023

KEYWORDS
Causal inference; Constrained optimization; Partial identification; Linear programming; Polynomial programming

1. Introduction

When causal quantities cannot be point identified, researchers often pursue partial identification to quantify the range of possible answers. These solutions are tailored to specific settings (e.g., Lee 2009; Sjölander et al. 2014; Kennedy, Harris, and Keele 2019; Knox, Lowe, and Mummolo 2020; Gabriel, Sachs, and Sjölander 2022), but the idiosyncrasies of applied research can render prior results unusable if even slightly differing scenarios are encountered. This piecemeal approach to deriving causal bounds presents a major obstacle to scientific progress. To increase the pace of discovery, researchers need a more general solution.

In this article, we present an automated approach to causal inference in discrete settings which applies to all causal graphs, as well as all standard observed quantities and domain assumptions. Users declare an estimand, state assumptions, and provide available data—however incomplete or mismeasured. The algorithm numerically computes sharp bounds—the most precise possible answer to the causal query given these inputs, including a unique point estimate if one exists. Our approach accommodates any classic threat to inference—including missing data, selection, measurement error, and noncompliance. It can fuse information from numerous sources—including observational and experimental data, datasets that are unlinkable due to anonymization, or even summary statistics from other studies. The method allows for sensitivity analyses on any assumption by relaxing or removing it entirely. Moreover, it alerts users when assumptions conflict with observed data, indicating faulty causal theory. We also develop techniques for drawing statistical inferences about estimated bounds. We implement these methods in a Python package, autobounds, and demonstrate them with a host of analytically validated simulations.

Our work advances a rich literature on partial identification in causal inference (Manski 1990; Zhang and Rubin 2003; Cai et al. 2008; Swanson et al. 2018; Molinari 2020; Gabriel, Sachs, and Sjölander 2022), outlined in Section 2, which has sometimes cast partial identification as a constrained optimization problem. In pioneering work, Balke and Pearl (1997) provided an automatic sharp bounding method for causal queries that can be expressed as linear programming problems. However, numerous estimands and empirical obstacles do not fit this description, and a complete and feasible computational solution has remained elusive.

When feasible, sharp bounding represents a principled and transparent method that makes maximum use of available data while acknowledging its limitations. Claims outside the bounds can be immediately rejected, and claims inside the bounds must be explicitly justified by additional assumptions or new data. But several obstacles still preclude widespread use. For one, analytic

bounds—which can be derived once and then applied repeatedly, unlike our numerical bounds which must be recomputed each time—remain intractable for many problems. Within the subclass of linear problems, Balke and Pearl's (1997) simplex method offers an efficient analytic approach, but analytic nonlinear solutions are still derived case by case (e.g., Kennedy, Harris, and Keele 2019; Knox, Lowe, and Mummolo 2020; Gabriel, Sachs, and Sjölander 2022). Moreover, though general sharp bounds can in theory be obtained by standard nonlinear optimization techniques (Geiger and Meek 1999; Zhang and Bareinboim 2021), in practice, such approaches are often computationally infeasible. This is because without exhaustively exploring a vast model space to avoid local optima, they can inadvertently report invalid bounds that may fail to contain the truth.

To address these limitations, we first show in Sections 3–4 that—using a generalization of principal strata (Frangakis and Rubin 2002)—causal estimands, modeling assumptions, and observed information can be rewritten as polynomial objective functions and polynomial constraints with no loss of information. We extend results from Geiger and Meek (1999) and Wolfe, Spekkens, and Fritz (2019) to show that essentially all discrete partial identification problems reduce to polynomial programs—a well-studied class of optimization tasks that nest linear programming as a special case.[1] However, it is well known that solving polynomial programs to global optimality is in general NP-hard, highlighting the need for efficient bounding techniques that remain valid even under time constraints (Belotti et al. 2009; Vigerske and Gleixner 2018).

To ameliorate these computational difficulties, Section 4.2 shows how causal graphs can be restated as equivalent canonical models, simplifying the polynomial program. Next, Section 5 develops an efficient optimization procedure, based on dual relaxation and spatial branch-and-bound relaxation techniques, that provides bounds of arbitrary sharpness. We show this procedure is guaranteed to achieve complete sharpness with sufficient computation time; in the problems we examine here, this occurs in a matter of seconds. However, in cases where the time needed is prohibitive, our algorithm is anytime (Dean and Boddy 1988), meaning it can be interrupted to obtain nonsharp bounds that are nonetheless guaranteed to be valid. Crucially, our technique offers an additional guarantee we term "ε-sharpness"—a worst-case looseness factor that quantifies how much the current nonsharp bounds could potentially be improved with additional computation. In Section 6, we provide two approaches for characterizing uncertainty in the estimated bounds. We demonstrate our technique in a series of analytically validated simulations in Section 7, showing the flexibility of our approach and the ease with which assumptions can be modularly imposed or relaxed. Moreover, we demonstrate how it can improve over widely used bounds (Manski 1990) and recover a counterintuitive point-identification result in the literature on nonrandom missingness (Miao et al. 2016).

In short, our approach offers a complete and computationally feasible approach to causal inference in discrete settings. Given a well-defined causal query, valid assumptions, and data, researchers now have a general and automated process to draw causal inferences that are guaranteed to be valid and, with sufficient computation time, provably optimal.

2. Related Literature

Researchers have long sought to automate partial identification by recasting causal bounding problems as constrained optimization problems that can be solved computationally. Our work is most closely related to Balke and Pearl (1997), which showed that certain bounding problems in discrete settings—generally, when interventions and outcomes are fully observed—could be formulated as the minimization and maximization of a linear objective function subject to linear equality and inequality constraints. Such programming problems admit both symbolic solutions and highly efficient numerical solutions. Subsequent studies have proven that the bounds produced by this technique are sharp (Bonet 2001; Ramsahai 2012; Sachs et al. 2022). These results were extended by Geiger and Meek (1999), who showed that a much broader class of discrete problems can be formulated in terms of polynomial relations when analysts have precise information about the kinds of disturbances or confounders that may exist.[2] In addition to the well-known conditional independence constraints implied by d-separation, these problems imply generalized equality constraints (Verma and Pearl 1990; Tian and Pearl 2002) and generalizations of the instrumental inequality constraints (Pearl 1995; Bonet 2001).

Geiger and Meek (1999) notes that in theory quantifier elimination algorithms can provide symbolic bounds. However, the time required for quantifier elimination grows as a doubly exponential function of the number of parameters, rendering it infeasible for all but the simplest cases. At the core of this issue is that symbolic methods provide a general solution, meaning that they must explore the space of all possible inputs. In contrast, numerical methods such as ours can accelerate computation by eliminating irrelevant portions of the model space.

Even so, computation can be time-consuming.[3] In practice, many optimizers can rapidly find reasonably good values but cannot guarantee optimality without exhaustively searching the model space. This approach poses a challenge for obtaining causal bounds, which are global minimum and maximum values of the estimand across all models that are admissible, or consistent with observed data and modeling assumptions. If a local optimizer operates on the original problem (the primal),

[1] Specifically, our results apply to elementary arithmetic functionals or monotonic transformations thereof—a broad set that essentially includes all causal assumptions, observed quantities, and estimands in standard use. For example, the average treatment effect and the log odds ratio can be sharply bounded with our approach, but nonanalytic functionals (which are rarely if ever encountered) cannot. Functionals that do not meet these conditions can be approximated to arbitrary precision, if they have convergent power series.

[2] A subtle point in nonlinear settings is that the region of possible values for the estimand—that is, estimand values associated with models in the model subspace that are consistent with available data and assumptions—may be disconnected. That is, while the sharp lower and upper bounds correspond to minimum and maximum possible values of the estimand, not all estimand values between these extremes are necessarily possible.

[3] Sharp bounds can always be obtained by exhaustively searching the model space. But the computation time required to do so—that is, to solve the polynomial programming problem—can explode with the number of variables (principal strata sizes).

Figure 1. Canonicalization of a mediation graph. Noncanonical and canonicalized forms are given in panels (a) and (b), respectively. Both are equivalent with respect to their data law. Canonicalization proceeds as follows: (i) the dependent disturbance U3 is absorbed into its parent U23; (ii) the superfluous U2 is eliminated, as it influences a subset of U23's children; and (iii) the irrelevant U13 is absorbed into the V1 → V3 arrow, as it is neither observed nor of interest. A complete guide to canonicalization is given in Appendix B.1.
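For intuition, the first two reduction rules in the caption can be sketched in a few lines of Python. The snippet below is a simplified illustration written for this summary, not the paper's Appendix B.1 algorithm: each disturbance is represented by the set of main variables it influences, and step (iii), the absorption of U13 into the V1 → V3 arrow, is omitted.

def canonicalize(dist_children, dist_edges):
    """Sketch of canonicalization rules (i) and (ii) from the Figure 1 caption.

    dist_children: {disturbance: set of main-variable children}
    dist_edges: set of (parent_disturbance, child_disturbance) pairs
    """
    children = {k: set(v) for k, v in dist_children.items()}
    # Rule (i): absorb a disturbance that has a disturbance parent into that
    # parent, transferring its children (e.g., U3 is absorbed into U23).
    for parent, child in sorted(dist_edges):
        if child in children:
            children[parent] |= children.pop(child)
    # Rule (ii): eliminate a disturbance whose children are a subset of
    # another disturbance's children (ties are broken by name order).
    for k in sorted(children):
        if any(other != k and children[k] <= children[other]
               for other in children):
            del children[k]
    return children

# Figure 1(a), with U13 omitted: U3 depends on U23; U2 duplicates part of U23.
print(canonicalize(
    {"U1": {"V1"}, "U2": {"V2"}, "U3": {"V3"}, "U23": {"V2", "V3"}},
    {("U23", "U3")}))
# -> {'U1': {'V1'}, 'U23': {'V2', 'V3'}}, matching panel (b)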

proceeding from the interior and widening bounds as more extreme models are discovered, then failing to reach global optimality will result in invalid bounds—ranges narrower than the true sharp bounds, failing to contain all possible solutions.

In the following sections, we detail our approach to addressing each of these outstanding obstacles to automating the discovery of sharp bounds for discrete causal problems.

3. Preliminaries

We now define notation and introduce key concepts. A technical glossary is given in Appendix A. We first review how any causal model represented by a directed acyclic graph (DAG) can be "canonicalized," or reduced into simpler form, without loss of generality (w.l.o.g.; Evans 2016). We describe how these graphs give rise to potential outcomes and a generalization of principal strata (Frangakis and Rubin 2002), two key building blocks in our analytic strategy.

We follow the convention that bold letters denote collections of variables; uppercase and lowercase letters denote random variables and their realizations, respectively. Consider a structured system in which random vectors V = {V1, ..., VJ} represent observable main variables and U = {U1, ..., UK} represent unobserved disturbances. We will assume each observed variable Vj is discrete and its space S(Vj) has finite cardinality; the spaces of unobserved variables are unrestricted. Observed data for each unit i ∈ {1, ..., N} is an iid draw from V. Further suppose that causal relationships between all variables in V and U are represented by a nonparametric structural equation model with independent errors (NPSEM-IE; Pearl 2000).[4] Here, we concentrate on deriving results for the NPSEM-IE model, but our approach is also applicable to the model of Robins (1986) and Richardson and Robins (2013) without change.[5]

Figure 1 presents a DAG G representing relationships between V ∪ U. Note that fully observing these variables would be sufficient to identify every quantity we consider in this article. However, since disturbances U are unobserved, and since information about main variables V may be incomplete, partial identification techniques are needed.

3.1. Canonical DAGs

We now discuss how canonicalizing DAGs—reformulating them w.l.o.g. into a simpler form—simplifies the bounding task. A DAG is said to be in canonical form if (i) no disturbance Uk has a parent in G; and (ii) there exists no pair of disturbances, Uk and Uk′, such that Uk influences a subset of the variables influenced by Uk′. Evans (2016) showed that any noncanonical DAG G′ has a canonical form G with an identical distribution governing all variables in V; an algorithm for obtaining this canonical form is given in Appendix B.1. In short, canonicalization distills the data-generating process (DGP) to its simplest form by eliminating potentially complex networks of irrelevant disturbances. Figure 1 shows a noncanonical DAG in panel (a); panel (b) gives the canonicalized version. Note that disturbances affecting only a single variable, such as U1, are often left implicit; here, we depict them explicitly for clarity.

3.2. Potential Outcomes

The notation of potential outcome functions allows us to compactly express the effects of manipulating variable Vj's main parents, paV(Vj), or other ancestors that are also main variables. Similarly, paU(Vj) denotes parents of Vj that are disturbances. Let A ⊂ V be intervention variables that will be fixed to A = a. When A = ∅, so no intervention occurs, then define Vj(a) = Vj, the natural value. When A ⊆ paV(Vj), so only immediate parents are manipulated, then the potential outcome function is given by its structural equation, Vj(a) = fj(A = a, paV(Vj) \ A, paU(Vj)). For example, in Figure 1(b), the effect of intervention V2 = v2 on outcome V3 is defined in terms of V3(V2 = v2) = f3(V2 = v2, V1, U23). Here, the intervention set is A = V2, and the remaining parents of V3—the non-intervened main parent, paV(V3) \ A = V1, and the disturbance parent, paU(V3) = U23—are allowed to follow their natural distributions. We now define more general potential outcome functions by recursive substitution (Richardson and Robins 2013; Shpitser 2018). For arbitrary interventions on A ⊂ V, let

$$V_j(a) = V_j\big(\{a_\ell : A_\ell \in pa_V(V_j)\} \cup \{V_\ell(a) : V_\ell \in pa_V(V_j) \setminus A\}\big);$$

here, ℓ is a generic index that sweeps over main variables in the graph. That is, if a parent of Vj is in A, it is set to the corresponding value in a. Otherwise, the parent takes its potential value after intervention on causally prior variables, or its natural value otherwise. To obtain the parent's potential value, apply the same definition recursively.[6]

[4] The NPSEM-IE model states that each Vj ∈ V and each Uk ∈ U is a deterministic function of (i) variables in V ∪ U corresponding to its parents in G and (ii) an additional disturbance term, ε_Vj or ε_Uk. The crucial assumption in the NPSEM-IE is that these ε terms are mutually independent. Note that throughout this article, we keep the presence of ε variables implicit; we will prove that each Vj can equivalently be viewed as a deterministic function of its parents in V ∪ U, absorbing the variation induced by ε terms into U.

[5] See Appendix F.3 for further discussion of the finest fully randomized causally interpretable structured tree graph (FFRCISTG; described in Richardson and Robins 2013).

For example, in Figure 1(b), potential outcomes for V3 include (i) V3(∅) = V3(V1, V2), the observed distribution; (ii) V3(v1) = V3[v1, V2(v1)], relating to total effects; and (iii) V3(v1, v2), relating to controlled effects.

3.3. Generalized Principal Stratification

In this section, we show how any DAG and any causal quantity can be represented w.l.o.g. using a generalization of principal strata. Roughly speaking, principal strata on a variable Vj are groups of units that would respond to counterfactual interventions in the same way (Greenland and Robins 1986; Frangakis and Rubin 2002). Formally, let A = paV(Vj) be an intervention set for which all main parents of Vj are jointly set to some a, and consider unit i's collection of potential outcomes {Vi,j(A = a) : a ∈ S(A)}. Each principal stratum of Vj then represents a subset of units in which this collection is identical.

The NPSEM of a graph is closely related to its principal stratification. This is because each potential outcome in the collection above is given by Vi,j(A = a) = fj[A = a, pai,U(Vj)], in which the only source of random variation is unit i's realization of the relevant disturbances. After fixing these disturbances, all structural equations become deterministic, meaning that a realization of Ui must fix every potential outcome for every variable under every intervention. For example, consider the simple DAG U1 → V1 → V2 ← U2, in which V1 and V2 are binary. This relationship is governed by the structural equations V1 = f1(U1) and V2 = f2(V1, U2), where the functions f1 : S(U1) → S(V1) and f2 : S(V1) × S(U2) → S(V2) are deterministic and shared across all units. Thus, the only source of randomness is in U = {U1, U2}.

Analysts generally do not have direct information about these disturbances. For example, U1 could potentially take on any value in (−∞, ∞). However, as Proposition 1 will state in greater generality, this variation is irrelevant because V1 has only two possible values: 0 and 1. The space of U1 can therefore be divided into two canonical partitions (Balke and Pearl 1997)—those that deterministically lead to V1 = 0 and those that lead to V1 = 1—and thus treating U1 as if it were binary is w.l.o.g.

Strata for V2 are similar but more involved. After U2 is realized, it induces the partially applied response function V2 = f2(V1, U2 = u2) = f2^(u2)(V1), which deterministically governs how V2 counterfactually responds to V1. Regardless of how many values are in S(U2), this response function must fall into one of only four possible strata, each a mapping of the form f2^(u2) : S(V1) → S(V2) (Angrist, Imbens, and Rubin 1996). These groups are (i) V2 = 1 regardless of V1, "always takers" or "always recover"; (ii) V2 = 0 regardless of V1, "never takers" or "never recover"; (iii) V2 = V1, "compliers," or those "helped" by V1; and (iv) V2 = 1 − V1, "defiers," or those "hurt" by V1. Thus, from the perspective of V2, any finer-grained variation in S(U2) beyond the canonical partitions is irrelevant. These partitions are in one-to-one correspondence with principal strata, which in turn allow causal quantities to be expressed in simple algebraic expressions. For example, the average treatment effect (ATE) is equal to the proportion of compliers minus that of defiers.[7] As Proposition 2 will show, by writing down all information in terms of these strata, essentially any causal inference problem can be converted into an equivalent optimization problem involving polynomials of variables that represent strata sizes.

Finally, consider the more complex mediation DAG of Figure 2(a). Response functions for V1 and V2 remain as above. In contrast, V3 is caused by paV(V3) = {V1, V2} via the structural equation V3 = f3(V1, V2, U23). Substituting in disturbance U23 = u23 produces one of 16 response functions of the form f3^(u23) : S(V1) × S(V2) → S(V3), yielding 16 strata.[8]

In turn, the number of principal strata determines the minimum complexity of a reduced but nonrestrictive alternative model in which the full data law, or joint distribution over every potential outcome, is preserved. This means the reduction is w.l.o.g. for every possible factual or counterfactual quantity involving V. Specifically, the number of principal strata in the graph determines the minimum cardinalities of each Uk ∈ U that are needed to represent the original model w.l.o.g., if we were to redefine Uk in terms of a categorical distribution over principal strata. For example, to capture the joint response patterns that a unit may have on V2 and V3, a reduced version of U23 can express any full data law if it has a cardinality of |S(U23)| = 4 × 16, because V2 has four possible response functions and V3 has 16.

Below, Proposition 1 states that a generalization of this approach can produce nonrestrictive models w.l.o.g. for any discrete-variable DAG and any full data law. Crucially, this also holds for (i) graphs where a variable Vj is influenced by multiple disturbances Uk and Uk′, as in Figure 2(b); and (ii) the challenging case of nongeared graphs (Evans 2018) such as Figure 2(c)—roughly speaking, when disturbances Uk, Uk′, and Uk″ touch overlapping combinations of main variables to create cycles of confounding. Formalization is provided later.

Proposition 1. Suppose G is a canonical DAG over discrete main variables V and disturbances U with infinite cardinality. The model over the full data law implied by G is unchanged by assuming that the disturbances have sufficiently large finite cardinalities.

A proof can be found in Appendix F.1, along with details on how to obtain a lower bound on nonrestrictive cardinalities for the disturbances. Briefly, Proposition 1 extends a result from Finkelstein, Wolfe, and Shpitser (2021), which showed there are reductions of S(U) that do not restrict the model over the factual V. We build on this result to show that there are reductions that do not restrict the full data law—that is, the model over all factual and counterfactual versions of V.

[6] When defining potential outcomes for Vj, intervention on Vj itself is ignored.

[7] To see this, note that the ATE is given by E[V2(V1 = 1) − V2(V1 = 0)] = Σ_strata E[V2(V1 = 1) − V2(V1 = 0) | strata] · Pr(strata) = 0 · Pr(always taker) + 0 · Pr(never taker) + 1 · Pr(complier) − 1 · Pr(defier).

[8] More generally, the number of unique response functions grows with (i) the cardinality of the variable, (ii) the number of causal parents it has, and (iii) the parents' cardinalities. Specifically, Vj has |S(Vj)|^|S(paV(Vj))| possible mappings: given a particular input from Vj's parents, the number of possible outputs for Vj is |S(Vj)|; the number of possible inputs from Vj's parents is |S(paV(Vj))| = Π_{Vℓ ∈ paV(Vj)} |S(Vℓ)|, the product of the parents' cardinalities.
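To make the strata machinery concrete, the following snippet (an illustration written for this summary, with made-up strata proportions) enumerates the four response functions of a binary V2 with binary parent V1 and recovers the ATE as the complier share minus the defier share, per footnote 7; footnote 8's counting rule is checked the same way.

from itertools import product

# Enumerate the response functions f2^(u2): S(V1) -> S(V2) for binary V1, V2.
# Each function is the pair (f(0), f(1)); there are |S(V2)|^|S(V1)| = 4 of
# them, one per principal stratum.
label = {(0, 0): "never taker", (1, 1): "always taker",
         (0, 1): "complier", (1, 0): "defier"}
strata = {f: label[f] for f in product((0, 1), repeat=2)}

# Illustrative (made-up) strata proportions Pr(U2 = u2).
pr = {"never taker": 0.3, "always taker": 0.4, "complier": 0.25, "defier": 0.05}

# ATE = sum over strata of [f(1) - f(0)] * Pr(stratum), per footnote 7.
ate = sum((f[1] - f[0]) * pr[name] for f, name in strata.items())
print(round(ate, 3))  # 0.25 - 0.05 = 0.2: complier share minus defier share

# Footnote 8's count for V3 in Figure 2(a): two binary parents give
# |S(V3)|^(|S(V1)| * |S(V2)|) = 2**4 = 16 response functions.
print(2 ** (2 * 2))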

Figure 2. Any discrete-variable DAG can be represented in terms of generalized principal strata. Panels (a) and (b) depict geared graphs. In (a), each main variable is influenced by only one disturbance. In (b), V2 is influenced by both U12 and U23. In (c), a nongeared graph with cyclical confounding by U12, U23, and U13 is shown. For each case, the functional parameterizations—representations of each graph in terms of generalized principal strata—are illustrated.

Though the theory of principal stratification is well understood when each main variable Vj is influenced by only one disturbance Uk, complications arise when Vj is influenced by multiple disturbances Uk and Uk′. For each such main variable, any one of the associated disturbances can be allocated to take primary responsibility—that is, to be the input for which the response function is partially applied. For the purposes of defining this response function, all remaining disturbances are

treated as if they were main variables.[9] For example, in Figure 2(b), V2 is influenced by both U12 and U23; we will allocate V2 to U23 for illustration, but allocating it to U12 would produce identical bounds. Next, we compute the cardinality of remaining disturbances as usual. Here, U12 is left only to determine V1, meaning that it has a cardinality of two. Finally, we return to the primary disturbance and determine its cardinality based on main variables and remaining disturbances. In this example, after fixing U23, the variable V2 is a function of V1 and U12, both binary, meaning that U23 has a cardinality of sixteen.

Finally, Proposition 1 extends Evans (2018) by allowing us to develop generalized principal strata for graphs that are nongeared, meaning that disturbances do not satisfy the running intersection property.[10] These cases differ only in that they contain cycles of confounding; after breaking the cycle at any point, they can be dealt with in the same manner as geared graphs. An example of a nongeared graph is given in Figure 2(c). Finkelstein, Wolfe, and Shpitser (2021) presents an algorithm for constructing a generalized principal stratification for nongeared graphs. In brief, the algorithm breaks the confounding cycle by selecting an arbitrary disturbance—for example, U13—and fixing its cardinality at a value that is guaranteed to be nonrestrictive of the model over factual random variables, by Carathéodory's theorem. In this case, based on U13's district,[11] {V1, V2, V3}, U13 can be analyzed w.l.o.g. as if it had a cardinality of |S(V1) × S(V2) × S(V3)| − 2. In all subsequent analysis of Figure 2(c), U13 would then be treated as a main variable, allowing the graph to be analyzed as if it were geared. As in Figure 2(b), U12 then determines the response of V1 to U13. Finally, U23 jointly determines (i) the responses of V2 to V1 and U12 as well as (ii) V3 to V1, V2, and U13. We note that the number of parameters involved in nongeared graphs can quickly become intractable. In these cases, valid but possibly nonsharp bounds can always be obtained by solving a relaxed problem in which a single disturbance is connected to each main variable in a district, absorbing multiple disturbances that influence only a subset of those variables (for example, adding a U123 that absorbs U12, U13, and U23).

In sum, all classes of discrete-variable DAGs can be parameterized in terms of generalized principal strata. In what follows, we show how this representation allows us to reformulate causal bounding problems in terms of polynomial programs that can be optimized over the sizes of these strata, subject to constraints implied by assumptions and available data.

4. Formulating the Polynomial Program

We now turn to the central problem of this article: sharply bounding causal quantities with incomplete information. Our approach is to transform the task into a constrained optimization problem that can be solved computationally by (i) rewriting the causal estimand into a polynomial expression, and (ii) rewriting modeling assumptions and empirical information into polynomial constraints. Appendix C.1 provides a detailed walkthrough of this process with a concrete instrumental variable problem, along with example code that illustrates how the above steps are automated by our software in merely eight lines of code.

Our goal is to obtain sharp bounds on the estimand, or the narrowest range that contains all admissible values consistent with available information: structural causal knowledge in the form of a canonical DAG, G; empirical evidence, E; and modeling assumptions, A, formalized below. Importantly, our definition of "empirical evidence" flexibly accommodates essentially any data about the joint, marginal, or conditional distributions of the main variables.

We will suppose the main variables take on values in a known, discrete set, S = S(V). In this section, we will demonstrate (i) that {G, E, A, S} restricts the admissible values of the target quantity, and (ii) that this range of observationally indistinguishable values can be recovered by polynomial programming. The causal graph and variable space, G and S, together imply an infinite set of possible structural equation models, each capable of producing the same full data laws. By Proposition 1, w.l.o.g., we can consider a simple model in which (i) each counterfactual main variable is a deterministic function of exogenous, discrete disturbances; (ii) there are a relatively small number of such disturbances; and (iii) disturbances take on a finite number of possible values, corresponding to principal strata of the main variables. When repeatedly sampling units (along with each unit's random disturbances, U), the kth disturbance thus follows the categorical distribution with parameters PUk = {Pr(Uk = uk) : uk ∈ S(Uk)}. By the properties of canonical DAGs, these disturbances are independent. It follows that the parameters PU of the joint disturbance distribution Pr(U = u) = Π_k Pr(Uk = uk) not only fully determine the distribution of each factual main variable under no intervention, Vj(∅)—they also determine the counterfactual distribution of Vj(a) under any intervention a, as well as its joint distribution with other counterfactual variables Vj′(a′) under possibly different interventions a′. This leads to the following proposition, proven in Appendix F.2.

Proposition 2. Suppose G is a canonical DAG and {Cℓ : ℓ} is a set of counterfactual statements, indexed by ℓ, in which Cℓ = {Vℓ(aℓ) = vℓ} states that variable Vℓ will take on value vℓ under manipulation aℓ. Let 1[U ⇒ {Cℓ : ℓ}] be an indicator function that evaluates to 1 if and only if disturbance realizations U = {u1, ..., uK} deterministically lead to Cℓ being satisfied for every ℓ. Then under the structural equation model for G,

$$\Pr\Big(\bigwedge_{\ell} C_\ell\Big) = \sum_{U \in S(\mathbf{U})} \mathbb{1}\big[U \Rightarrow \{C_\ell : \ell\}\big] \prod_{u_k \in U} \Pr(U_k = u_k),$$

which is a polynomial equation in PU, the probabilities Pr(Uk = uk).

[9] Note that if any main variable V has multiple parents in U, there may be multiple valid parameterizations—that is, methods for constructing generalized principal strata—depending on which disturbance is assigned primary responsibility for determining which main variable. If each main variable has only a single parent in U, there is only a single functional parameterization.

[10] Here, the running intersection property requires that there exists a total ordering of disturbances such that any set of main variables that are children of a disturbance Uk, as well as of disturbances earlier in the ordering than Uk, must all be children of a specific disturbance earlier in the ordering than Uk. For example, in Figure 2(c), if the ordering is U13 < U12 < U23, then V2 and V3 are both children of U23 and earlier disturbances; thus, both must be simultaneously influenced by at least one of the earlier disturbances. This is not the case, and furthermore, there exists no other ordering that satisfies the requirement, so Figure 2(c) is nongeared. For further discussion, see Finkelstein, Wolfe, and Shpitser (2021).

[11] Districts of a canonical graph are components that remain connected after removing arrows among V.
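The construction in Proposition 2 is mechanical enough to run symbolically. The sketch below (using sympy on the simple chain U1 → V1 → V2 ← U2 from Section 3.3; an illustration written for this summary, not the paper's implementation) enumerates disturbance realizations and sums the strata probabilities of those whose response functions deterministically produce the queried event.

import sympy as sp
from itertools import product

# Proposition 2 on the chain U1 -> V1 -> V2 <- U2 with binary V1, V2:
# Pr(V1 = v1, V2 = v2) becomes a polynomial in the strata probabilities.
u1_vals = (0, 1)                            # U1 strata: V1 = u1 directly
u2_vals = tuple(product((0, 1), repeat=2))  # U2 strata: (f(0), f(1)) for V2

q1 = {u: sp.Symbol(f"q1_{u}") for u in u1_vals}           # Pr(U1 = u1)
q2 = {u: sp.Symbol(f"q2_{u[0]}{u[1]}") for u in u2_vals}  # Pr(U2 = u2)

def pr_event(v1, v2):
    """Polynomial for Pr(V1 = v1, V2 = v2): sum the indicator-selected
    products Pr(U1 = u1) * Pr(U2 = u2) over disturbance realizations."""
    total = sp.Integer(0)
    for u1, u2 in product(u1_vals, u2_vals):
        if u1 == v1 and u2[v1] == v2:   # 1[U => {V1 = v1, V2 = v2}]
            total += q1[u1] * q2[u2]
    return sp.expand(total)

print(pr_event(1, 1))  # q1_1*q2_01 + q1_1*q2_11: compliers + always takers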

For example, in the mediation setting of Figure 1(b), Proposition 2 implies that the joint distribution of the factual variables—V1(∅), V2(∅), and V3(∅)—is given by

$$\Pr\big(V_1(\emptyset) = v_1, V_2(\emptyset) = v_2, V_3(\emptyset) = v_3\big) = \sum_{\{u_1, u_{23}\} \in \mathcal{U}^*} \Pr(U_1 = u_1)\Pr(U_{23} = u_{23}),$$

where U* = {U : U ⇒ v} is the set of disturbance realizations that are consistent with the particular realization V(∅) = v. In other words,

$$\mathcal{U}^* = \Big\{\{u_1, u_{23}\} : f_1^{(u_1)}(\emptyset) = v_1,\; f_2^{(u_{23})}(v_1) = v_2,\; f_3^{(u_{23})}(v_1, v_2) = v_3\Big\}.$$

Alternatively, analysts may be interested in the probability that a randomly drawn unit i has a positive controlled direct effect when fixing the mediator to V2 = 0. This is given by Pr(V3(V1 = 0, V2 = 0) = 0, V3(V1 = 1, V2 = 0) = 1) and is similarly expressed in terms of the disturbances as Σ_{{u1, u23} ∈ U**} Pr(U1 = u1) Pr(U23 = u23), summing over a different subset of the disturbance space, U** = {{u1, u23} : f3^(u23)(V1 = 0, V2 = 0) = 0, f3^(u23)(V1 = 1, V2 = 0) = 1}.

We now expand this result to include a large class of functionals of marginal probabilities and logical statements about these functionals.

Corollary 1. Suppose G is a canonical DAG. Let PV denote the full data law and g1(PV) denote a functional of PV involving elementary arithmetic operations on constants and marginal probabilities of PV. Then g1(PV) can be re-expressed as a polynomial fraction in the parameters of PU, g2(PU), by replacing each marginal probability with its Proposition 2 polynomialization.

We denote this replacement process with the operation polynomial-fractionalize[g1(PV)]. The corollary has a number of implications, which we discuss briefly. First, it shows that a wide array of single-world and cross-world functionals can be expressed as polynomial fractions. These include traditional quantities such as the ATE, as well as more complex ones such as the pure direct effect and the probability of causal sufficiency. It also suggests any nonelementary functional of PV can be approximated to arbitrary precision by a polynomial fraction, if the functional has a convergent power series. We note that nonelementary functionals rarely arise in practice, apart from logarithmic- or exponential-scale estimands.[12] An example that our approach cannot handle is the nonanalytic functional 1{ATE is rational}.

A nonobvious implication of Corollary 1 is that when (i) g1(PV) is an elementary arithmetical functional; (ii) ⊙ ∈ {<, ≤, =, ≥, >} is a binary comparison operator; and (iii) α is a finite constant, then any statement of the form g1(PV) ⊙ α can be transformed into a set of equivalent non-fractional relations, {hℓ(PU, s) ⊙ℓ 0 : ℓ}. Here, each hℓ(·) denotes a non-fractional polynomial in the parameters indicated; ⊙ℓ is a possibly different binary comparison from ⊙; and s are newly created auxiliary variables that are sometimes necessary. The transformation proceeds as follows. First, g1(PV) ⊙ α can be rewritten as g2(PU) ⊙ α, by Proposition 1. Then, note that any fractional g2(PU) can be rewritten as some g3(PU)/h(PU) in which g3(PU) has fewer fractions than g2(PU). Regardless of whether h(PU) is positive, negative, or of indeterminate sign, it can be shown that h(PU) can be cleared to obtain an equivalent relation. The exact procedure differs for each case and, when h(PU) is indeterminate, requires a set of auxiliary variables, s, to be created.[13] If all fractions have been cleared from g3(PU), then the rewritten statement is also of the promised form and we are done; otherwise, recurse. We denote this transformation of the original statement—that is, polynomial-fractionalizing its components and then clearing all resulting fractions—as polynomialize[g1(PV) ⊙ α].

By the same token, any estimand g1(PV) that is a polynomial fraction g2(PU) in the parameters of PU can be re-expressed as a polynomial in the expanded parameter space, h(PU, s), along with a set of additional polynomial relations. To see this, first define a new estimand, s, which is a monomial (and hence a polynomial). This new estimand can be made equivalent to the original one by imposing a new polynomial-fractional constraint, s − g2(PU) = 0. Any remaining fractions in the new constraint are cleared as above. We will make extensive use of these properties to convert causal queries to polynomial programs.

Algorithm 2 in the supplementary materials provides a step-by-step procedure for formulating a polynomial programming problem. Solving this program via Algorithm 3 in the supplementary materials then produces sharp bounds. Both algorithms, given in Appendix B, mirror the discussion here with more formality. We begin by transforming a factual or counterfactual target of inference T into polynomial form, possibly creating additional auxiliary variables to eliminate fractions. To accomplish this task, the procedure uses the possibly noncanonical DAG G and the variable space S(V) to re-express T in terms of functional parameters that correspond to principal strata proportions. The result is the objective function of the polynomial program. The procedure then polynomializes the sets of constraints implied by empirical evidence and by modeling assumptions, respectively denoted E and A. In Figure 2, if only observational data is available, then E consists of eight pieces of evidence, each represented as a statement corresponding to a cell of the factual distribution Pr(V1(∅) = v1, V2(∅) = v2, V3(∅) = v3) = Pr(V1 = v1, V2 = v2, V3 = v3) for observable values in {0, 1}^3. Modeling assumptions include all other information, such as monotonicity or dose-response assumptions; these can be expressed in terms of principal strata. For example, the assumed unit-level monotonicity of the V1 → V2 relationship (e.g., the "no defiers" assumption of Angrist, Imbens, and Rubin 1996) can be written as the statement that Pr(V2(V1 = 0) = 1, V2(V1 = 1) = 0) = 0.[14]

[12] Bounds on a monotonic transform of x can be obtained by bounding x, then applying the transform.

[13] First, consider strictly positive h(PU); here, g3(PU) − α·h(PU) ⊙ 0 is equivalent to the original statement. Second, consider strictly negative h(PU): clearing the fraction yields g3(PU) − α·h(PU) ⊙′ 0, where ⊙′ reverses an inequality ⊙. Finally, in the case when h(PU) can take on both positive and negative values, let an auxiliary variable s ∈ s be defined such that s·h(PU) − 1 = 0, which is a polynomial relation of the promised form. It can now be seen that the original statement is equivalent to s·g3(PU) − α ⊙ 0. For a concrete example of how auxiliary variables can be used to clear fractions, see Appendix C.1.3.
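The fraction-clearing step in footnote 13 can be stated directly in code. In the sketch below, the polynomials g and h are toy stand-ins in two strata probabilities (chosen for this summary so that the denominator's sign is indeterminate and the auxiliary-variable case applies).

import sympy as sp

# Footnote 13, indeterminate-sign case: re-express g/h <= alpha without
# fractions by introducing an auxiliary variable s that plays the role of 1/h.
q0, q1, s, alpha = sp.symbols("q0 q1 s alpha")
g = q0 * q1      # toy numerator polynomial g3(P_U)
h = q0 - q1      # toy denominator h(P_U); its sign varies over [0, 1]^2

fractional = sp.Le(g / h, alpha)     # original statement: g/h <= alpha
aux = sp.Eq(s * h - 1, 0)            # auxiliary polynomial relation, s = 1/h
cleared = sp.Le(s * g - alpha, 0)    # fraction-free equivalent, given aux

print(fractional)
print(aux, cleared, sep="\n")  # both relations are polynomial in (q0, q1, s)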
Finally, the statement that each disturbance Uk follows a categorical probability distribution is re-expressed as the polynomial relations Pr(Uk = uk) ≥ 0 for all uk, and Σ_{uk} Pr(Uk = uk) = 1.

Algorithm 2 produces an optimization problem with a polynomial objective subject to polynomial constraints. This polynomial programming problem is equivalent to the original causal bounding problem. This leads directly to the following theorem.

Theorem 1. Minimization (maximization) of the polynomial program produced by Algorithm 2 produces sharp lower (upper) bounds on T under the sample space S(V), structural equation model G, additional modeling assumptions A, and empirical evidence E.

4.1. Example Program for Outcome-based Selection

For intuition, consider the simple example in Figure 3, motivated by a hypothetical study of discrimination in traffic law enforcement using (a) police data on vehicle stops and (b) traffic-sensor data on overall vehicle volume. For illustrative purposes, suppose all drivers behave identically. Here, X ∈ {0, 1} indicates whether a motorist is a racial minority and Y ∈ {0, 1} whether the motorist is stopped by police. X and Y are assumed to be unconfounded. However, there exists outcome-based selection: we only learn driver race (X) from police records if a stop occurs (Y = 1), precluding point identification. Panels (a)–(d) in Figure 3 depict the inputs to the algorithm: (a) the causal graph, G; (b) the observed evidence, E, consisting of the marginal Pr(Y = y) and the conditional Pr(X = x | Y = 1); (c) additional assumptions, A, the monotonicity assumption that white drivers are not discriminatorily stopped; and (d) the sample space S(X, Y). The target T is the ATE, E[Y(X = 1) − Y(X = 0)], the amount of anti-minority bias in stopping. Next, Figure 3(e) depicts functional parameterization in terms of six disturbance partitions, following Section 3.3. Applying simplifications from Section 4.2 results in elimination of Pr(Y-defier) by assumption, then elimination of redundant strata that complete the sum to unity, Pr(X-control) and Pr(Y-never). The problem can thus be reduced to three dimensions. Next, the ATE is rewritten as the probability of an anti-minority stop, minus that of an anti-white stop (which is zero by assumption). Finally, Figure 3(f)–(i) depict how each constraint narrows the space of potential solutions, leaving the admissible region shown in Figure 3(i)—the only part of the model space simultaneously satisfying all constraints.

Once formulated in this way, optimization proceeds by locating the highest and lowest values of T within this region, which respectively represent the upper and lower bounds on the ATE. A variety of computational solvers can in principle be used to minimize and maximize it.[15] However, in practice, the resulting polynomial programming problem can be much more complex than the simple case shown in Figure 3. For example, even seemingly simple causal problems can result in nonconvex objective functions or constraints; moreover, both the admissible region of the model space and the region of possible objective values can be disconnected.[16] Local solvers thus cannot guarantee valid bounds without exhaustively searching the space; when time is finite, these can fail to discover global extrema for the causal estimand, resulting in invalid intervals that are not guaranteed to contain the quantity of interest.

4.2. Simplifying the Polynomial Program

The time needed to solve polynomial programs can grow exponentially with the number of variables. To address this, in Appendix D, we employ various techniques that draw on graph theory and probability theory to simplify polynomial programs into forms with fewer variables that generally have identical solutions but are usually faster to solve. At a high level, these simplifications fall into four categories. First, Appendix D.1 proposes a simplification that reduces the degree of polynomial expressions. Using the graph's structure, we show how to detect when a disturbance Uk is guaranteed to be irrelevant, meaning its parameters only occur in contexts where Σ_{uk ∈ S(Uk)} Pr(Uk = uk) can be factored out and replaced with unity. Second, Appendix D.2 introduces a simplification that reduces the degree of polynomial expressions by exploiting equality constraints like the simple Pr(X-control) + Pr(X-treated) = 1 example above. We note some practical considerations when using symbolic algebra systems such as SageMath (Stein et al. 2019), specifically about the computational efficiency of factoring out complex polynomial expressions and replacing them with constants, as opposed to solving for one variable in terms of others. Third, Appendix D.3 discusses a broad class of simplifications that reduce the number of constraints in the program, but with important tradeoffs. We show that assumptions encoded in a DAG, such as the empty binary graph UX → X, Y ← UY (with no edge between X and Y), allow the empirical evidence to be expressed using fewer constraints—here, the reduction uses only two pieces of information, Pr(X = 1) and Pr(Y = 1), exploiting the previously mentioned equality constraints and the assumed independence of X and Y. This is a reduction from the three pieces of information needed to convey Pr(X = x, Y = y),[17] but comes at the cost that analysts can no longer falsify the independence assumption. Finally, Appendix D.4 provides a simplification for detecting when constraints and parameters no longer bind the objective function, meaning they can be safely eliminated from the program.

We caution that the practical application of these techniques remains an important area for future research: applying these techniques in different orders, or even with slightly different software implementations, can produce optimization programs that are mathematically equivalent but can vary substantially in runtime.

[14] Assumed population-level monotonicity is typically written E[V2(V1 = 1) − V2(V1 = 0)] ≥ 0, but can be rewritten in terms of strata as Pr(V2(V1 = 1) = 1, V2(V1 = 0) = 0) − Pr(V2(V1 = 0) = 1, V2(V1 = 1) = 0) ≥ 0.

[15] Throughout this article, we will neglect the distinction between minimum (maximum) and infimum (supremum), as is standard practice in numerical optimization.

[16] For example, the polynomial constraint x³ − x² ≤ −0.1 would produce a disconnected admissible region of x ∈ (−∞, −0.280] ∪ [0.413, 0.867]. Moreover, even connected admissible regions can produce disconnected sets of possible objective values; for example, with the objective 1/x (which can be transformed to a polynomial objective, as discussed above), the constraint {−1 ≤ x ≤ 1} leads to possible objective values of (−∞, −1] ∪ [1, ∞). Note that discontinuity is merely a computational challenge rather than a conceptual issue, as the definition of the bounds in this case would be [min_{x ∈ [−1,1]} 1/x, max_{x ∈ [−1,1]} 1/x] = (−∞, ∞).

[17] Any one of the four cells can be automatically eliminated, as it is redundant given the implied constraint that Σ_{x,y} Pr(X = x, Y = y) = 1 by construction of the principal strata.

Figure 3. Visualization of Algorithm 2: constructing the polynomial program for a simple bounding problem with outcome-dependent selection, motivated by a study of discrimination in traffic law enforcement. Panels (a)–(d) depict inputs to the algorithm. The graph, G, contains unconfounded treatment X and outcome Y. The evidence E contains (i) the marginal distribution of Y and (ii) the conditional distribution of X given Y = 1. A consists of a monotonicity assumption. S states that X and Y are binary. The target T is the ATE, E[Y(X = 1) − Y(X = 0)]. Panel (e) depicts functional parameterization with six disturbance partitions, following Section 3.3. Applying simplifications from Section 4.2 results in elimination of Pr(Y-defier) by assumption, then elimination of Pr(X-control) and Pr(Y-never) by the second axiom. Panels (f)–(i) show constraints in the simplified model space.

5. Computing ε-sharp Bounds in Polynomial Programs

We now turn to the practical optimization of the polynomial program defined by Algorithm 2, which we refer to as the primal program.

Figure 4. Visualization of Algorithm 3: computing ε-sharp bounds for the outcome-based selection problem of Figure 3. Panel (a) shows how the target ATE varies over the feasible region of the model space (reparameterized in terms of possible PU distributions) depicted in Figure 3(i). Panel (b) depicts the first step of our method: partitioning of the model space into branches within which computationally tractable, piecewise linear dual relaxations are obtained. Panel (c) shows how suboptimal values of the primal function, obtained with standard local optimizers, can be combined with the dual envelope to prune large regions of the model space that cannot possibly contain the global extrema. In panel (d), the procedure is applied recursively: the pruned model space is rebranched and heuristic primal optimization is repeated, potentially yielding narrower dual bounds and wider primal bounds, respectively. The looseness factor, ε, narrows until reaching zero (sharpness) or a specified threshold.

Per Theorem 1, minimization and maximization of the polynomialized target, T(p), is equivalent to the causal bounding problem. (Optimization is implicitly over the admissible region of the model space.) We denote the sharp lower and upper bounds as T̲ ≡ min_p T(p) and T̄ ≡ max_p T(p). As we note above, the challenge is that these problems are often nonconvex and high dimensional, meaning globally optimal solutions can be difficult to obtain. Conventional primal optimizers, which iteratively improve suboptimal values, can be trapped in local extrema, resulting in invalid bounds that fail to contain all possible values of the estimand (including global extrema).

To address this challenge and guarantee the validity of reported bounds, our approach incorporates dual techniques that do not directly optimize the original primal objective function, T(p). Instead, these techniques construct alternative objective functions that are easier to optimize; solutions to the easier dual problems can then be related back to the original primal problems. In particular, we will construct piecewise linear dual envelope functions D̲(p) and D̄(p) that satisfy D̲(p) ≤ T(p) ≤ D̄(p) for all p in the admissible region. An illustration is given in Figure 4(b). In statistics, related techniques have found use in variational inference—an approach that constructs an analytically tractable dual function that can be maximized in place of the likelihood function (Jordan et al. 1999; Blei, Kucukelbir, and McAuliffe 2017).[18]
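As a one-dimensional illustration of such an envelope (using the toy objective T(x) = x² that reappears in footnote 20 below; the branching scheme here is chosen for this sketch), tangent lines lower-bound the convex objective on each branch and secants upper-bound it:

import numpy as np

# Toy dual envelope for T(x) = x^2 on [-1, 1]: within each branch [a, b],
# a tangent at the midpoint is a valid lower dual (the objective is convex)
# and the secant through the endpoints is a valid upper dual.
T = lambda x: x ** 2
dT = lambda x: 2 * x

def branch_envelope(a, b):
    m = 0.5 * (a + b)
    lower = lambda x: T(m) + dT(m) * (x - m)     # tangent line
    slope = (T(b) - T(a)) / (b - a)
    upper = lambda x: T(a) + slope * (x - a)     # secant line
    return lower, upper

branches = [(-1.0, 0.0), (0.0, 1.0)]
D_lo, D_hi = np.inf, -np.inf
for a, b in branches:
    lo, hi = branch_envelope(a, b)
    # Linear pieces attain their extrema at the branch endpoints.
    D_lo = min(D_lo, lo(a), lo(b))
    D_hi = max(D_hi, hi(a), hi(b))

print(D_lo, D_hi)  # -0.25 1.0: dual outer bounds bracketing the sharp [0, 1]

Splitting each branch again tightens the lower envelope toward the true minimum of 0, mirroring the recursive refinement shown in Figure 4(d).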

A key property of this envelope is that the easier-to-compute dual bounds, D̲ ≡ min_p D̲(p) and D̄ ≡ max_p D̄(p), will always bracket the unknown true sharp bounds. This is because D̲(p) and D̄(p) are downward- and upward-shifted relaxations of the original objective function, which can only lead to a lower minimum and higher maximum, respectively. The dual bounds are thus guaranteed to be valid causal bounds. Viewed differently, the dual bounds [D̲, D̄] also represent outer bounds (where bounding addresses the computationally difficult task of computing the global extrema) on the unknown sharp causal bounds [T̲, T̄] (where bounding addresses the fundamental unknowability of the DGP). Here, a key consideration is that the choice of a dual envelope determines the looseness, or the duality gaps T̲ − D̲ and D̄ − T̄. Our task therefore reduces to the question of how to evaluate the looseness of the dual bounds and, if needed, to refine the envelope so that it leads to tighter dual bounds.

We now discuss our procedure for assessing the looseness of the dual bounds. To start, note that for any admissible point in the model space, p, the corresponding value of the target quantity, T(p), must satisfy T̲ ≤ T(p) ≤ T̄ by definition, even when the true sharp bounds are unknown. This immediately suggests that for any collection of points {p, p′, p″, ...} within the admissible region where we choose to evaluate T(·), the lowest and highest values discovered—which we denote P̲ and P̄—must also be contained within the sharp bounds. In other words, [P̲, P̄] represents an inner bound on the unknown sharp bounds [T̲, T̄]. Therefore, for any choice of dual envelope and any collection of evaluated points, we have D̲ ≤ T̲ ≤ P̲ ≤ P̄ ≤ T̄ ≤ D̄. We evaluate the looseness of the reported dual bounds by taking the ratio of the outer bounds' excess width to the width of the inner bounds, ε ≡ (D̄ − D̲)/(P̄ − P̲) − 1. It can be seen that when P̲ = D̲ and P̄ = D̄, then the reported dual bounds have provably attained sharpness and ε = 0. However, ε > 0 does not necessarily imply that the dual bounds are not sharp; for example, it may simply be that D̲ = T̲, so the lower bound is sharp, but the collection of points evaluated is insufficiently large, so that T̲ < P̲ and this sharpness cannot be proven. For this reason, we refer to ε as the worst-case looseness factor.

We are now ready to discuss how bounds are iteratively refined; a step-by-step procedure is given in Algorithm 3 in Appendix B. Note that at the outset of the procedure, the initial dual envelope may lie far from the true objective function, meaning ε will be large. We employ the spatial branch-and-bound approach to recursively subdivide the model space and efficiently search for regions in which the bounds may be improved. A variety of mature optimization frameworks can be used to implement the proposed methods, including Couenne and SCIP (Belotti et al. 2009; Vigerske and Gleixner 2018); the key to Algorithm 3 is that the upper- and lower-bounding optimization problems must be executed in parallel, so that the relative looseness ε can be tracked over time. In addition to the polynomial program produced by Algorithm 2, our procedure accepts two stopping parameters: ε_thresh, the desired level of provable sharpness; and θ_thresh, an acceptable width for the bounds width θ ≡ D̄ − D̲.[19]

Figure 4 illustrates the procedure for the outcome-based selection problem of Figure 3. The algorithm receives the primal objective function, T(p), shown in Figure 4(a), as input. It then partitions the parameter space into a series of branches, or connected subsets of the parameter space. Separate partitions, B̲ and B̄, are used for lower and upper bounding, respectively. Within each branch b, a linear function D̲_b(p) is constructed; easily computed properties such as derivatives and boundary values are used to ensure that this plane lies above or below T(p) for all admissible points in the branch.[20] We collect these branch-specific bounds in the piecewise functions D̲(p) ≡ {D̲_b(p) if p ∈ B̲_b : b} and D̄(p) ≡ {D̄_b(p) if p ∈ B̄_b : b}, which define the initial dual envelope shown with dashed blue lines in Figure 4(b). Because each piece is linear, it is straightforward to compute the extreme points of the dual envelope within each branch, D̲_b = min{D̲_b(p) : p ∈ B̲_b} and D̄_b = max{D̄_b(p) : p ∈ B̄_b}. The overall dual (outer) bounds are then D̲ = min_b D̲_b and D̄ = max_b D̄_b, depicted with hollow blue triangles.

Next, the algorithm seeks to expand the primal (inner) bounds. Recall that these bounds, [P̲, P̄], are the minimum and maximum values of the target function that have been encountered in any set of admissible DGPs—regardless of how that set was constructed. We can therefore use standard constrained optimization techniques to optimize the primal problem. Various heuristics—for example, initializing optimizers in regions that appear promising based on the duals—can also be used. The fact that these techniques are only guaranteed to produce local optima is not of concern, because primal bounds are used only for computational convenience. Examples of two admissible primal points are shown with red triangles in Figure 4(c). These primal bounds represent the narrowest possible causal bounds: the (unknown) sharp lower bound T̲ must satisfy T̲ ≤ P̲, and similarly the sharp upper bound must satisfy P̄ ≤ T̄. This means entire swaths of the parameter space can now be ignored, greatly accelerating the search. For example, in Figure 4(c), the upper dual function (upper dashed blue lines) indicates that the rightmost three-quarters of the parameter space cannot possibly produce a target value that is higher than P̄—the upper primal bound that has already been found (upper solid red triangle). Therefore, optimization of the upper dual bound can focus on

[18] Variational inference uses an analytic relaxation to obtain a single dual function that lower-bounds the likelihood function everywhere in the model space. Our approach diverges in that (i) we conduct two simultaneous dual relaxations to obtain an envelope—both lower and upper—for the original primal function; (ii) we computationally generate piecewise dual functions, rather than analytically deriving smooth duals; and (iii) instead of working

[19] We include θ_thresh to address the possibility of point identification, in which case P̄ − P̲ = 0, finite ε_thresh cannot be achieved, and algorithms based on this stopping criterion alone will not terminate.

[20] For example, consider the objective function T(x) = x². Any tangent line is a valid lower dual function. Moreover, within any interval [x_a, x_b], the secant line from (x_a, T(x_a)) to (x_b, T(x_b)) is a valid upper dual function. A piecewise linear envelope can thus be constructed by proceeding one branch at a time, computing derivatives (for example, at the branch midpoint) to
with a fixed dual function, we generate a sequence of dual envelopes that obtain a branch-specific lower dual function D b (x) and boundary values to
iteratively tighten the duals. obtain a branch-specific upper dual function D b (x).
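To make footnote 20 concrete, here is a minimal sketch of the envelope construction for the toy objective $T(x) = x^2$ on $[0, 1]$. The four-branch partition, the evaluation grid, and all variable names are illustrative assumptions on our part, not the autobounds implementation.

```python
# Toy illustration of footnote 20: a piecewise-linear dual envelope for
# T(x) = x^2 on [0, 1]. A sketch only; not the autobounds implementation.
import numpy as np

branches = [(i / 4, (i + 1) / 4) for i in range(4)]   # illustrative partition

def branch_duals(a, b):
    """Tangent (lower) and secant (upper) dual lines on one branch [a, b]."""
    m = (a + b) / 2.0
    lower = lambda x: m**2 + 2 * m * (x - m)    # tangent at midpoint; valid since x^2 is convex
    upper = lambda x: a**2 + (a + b) * (x - a)  # secant through the branch endpoints
    return lower, upper

# Dual (outer) bounds: each piece is linear, so extremes occur at branch endpoints.
D_lo = min(min(lo(a), lo(b)) for (a, b) in branches for lo, _ in [branch_duals(a, b)])
D_hi = max(max(up(a), up(b)) for (a, b) in branches for _, up in [branch_duals(a, b)])

# Primal (inner) bounds: extremes of T over any evaluated admissible points.
xs = np.linspace(0.0, 1.0, 21)
P_lo, P_hi = min(x**2 for x in xs), max(x**2 for x in xs)

# Worst-case looseness factor from the main text.
eps = (D_hi - D_lo) / (P_hi - P_lo) - 1
print(D_lo, D_hi, P_lo, P_hi, eps)   # D_lo <= P_lo <= P_hi <= D_hi
```

As the ordering in the final comment indicates, the dual extremes bracket the primal extremes by construction, which is what makes the reported bounds valid at every stage of refinement.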
A new, refined dual envelope can now be constructed by subdividing the remaining space and recomputing tighter dual functions, as shown in Figure 4(d). The procedure is then repeated recursively—the algorithm heuristically selects branches in the model space that appear promising, then refines primal and dual bounds in turn. If a more extreme admissible target value is found, it is stored as a new primal bound. Finally, the algorithm prunes branches of $\underline{\mathcal{B}}$ and $\overline{\mathcal{B}}$ that cannot improve dual bounds or that wholly violate constraints. Optimization terminates when either $\varepsilon$ reaches $\varepsilon_{\mathrm{thresh}}$ or $\theta$ reaches $\theta_{\mathrm{thresh}}$. For complex problems, the time to convergence may be prohibitive. But because the dual function is always guaranteed to contain the true objective function, the algorithm is anytime—the user can halt the program at any point and obtain valid (but potentially loose) bounds.
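The recursion just described can be rendered as a compact loop. Below is a runnable caricature on the toy objective $T(x) = x^2$ over $[-1, 1]$, again using footnote 20's tangent/secant envelope. It makes strong simplifying assumptions (a single partition instead of the separate partitions $\underline{\mathcal{B}}$ and $\overline{\mathcal{B}}$, midpoint evaluations in place of constrained local search, and no admissibility constraints), so it sketches the shape of Algorithm 3 rather than the procedure given in Appendix B.

```python
# Runnable caricature of the refine-and-prune loop for T(x) = x^2 on [-1, 1].
# A sketch under simplifying assumptions; not Algorithm 3 from Appendix B.
def T(x):
    return x * x

def branch_duals(a, b):
    """Dual extremes on [a, b]: tangent line (lower) and secant line (upper)."""
    m = (a + b) / 2.0
    lo = min(m * m + 2 * m * (x - m) for x in (a, b))  # tangent at midpoint
    hi = max(T(a), T(b))                               # secant peaks at an endpoint
    return lo, hi

def refine(eps_thresh=1e-3, theta_thresh=1e-9):
    branches = [(-1.0, 1.0)]
    P_lo = P_hi = T(0.5)   # any admissible evaluation seeds the primal bounds
    while True:
        duals = [branch_duals(a, b) for a, b in branches]
        D_lo, D_hi = min(d[0] for d in duals), max(d[1] for d in duals)
        for a, b in branches:                  # expand primal bounds
            v = T((a + b) / 2.0)
            P_lo, P_hi = min(P_lo, v), max(P_hi, v)
        theta = D_hi - D_lo                    # bounds width
        eps = theta / (P_hi - P_lo) - 1 if P_hi > P_lo else float("inf")
        if eps <= eps_thresh or theta <= theta_thresh:
            return (D_lo, D_hi), (P_lo, P_hi), eps
        # Prune branches that cannot improve either bound; halve the rest.
        # Halting here early would still return valid (anytime) dual bounds.
        keep = [(a, b) for (a, b), (lo, hi) in zip(branches, duals)
                if lo < P_lo or hi > P_hi]
        branches = [seg for a, b in keep
                    for seg in ((a, (a + b) / 2), ((a + b) / 2, b))]

print(refine())   # dual bounds tighten toward [0, 1] as primal bounds widen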
6. Statistical Inference
We now briefly discuss how to modify Algorithm 3 to account for sampling error in the empirical evidence used to construct bounds. A more rigorous formalization is provided in Appendix E.

Consider a simulated binary $X \to Y$ graph with confounding $X \leftarrow U_{XY} \to Y$. Up until now, when discussing how empirical evidence constrains the admissible DGPs, we have only considered population distributions of observable quantities—here, $E = \{\Pr(X=0, Y=0) = 0.121$, $\Pr(X=1, Y=0) = 0.346$, $\Pr(X=0, Y=1) = 0.349$, $\Pr(X=1, Y=1) = 0.184\}$. When these population constraints are input to the algorithm, we refer to the results as the population bounds. In practice, however, analysts only have access to noisily estimated versions; with $N = 1000$, the sample analogues might respectively be 0.113, 0.352, 0.357, and 0.178. By the plug-in principle, estimated bounds are obtained by supplying estimated constraints instead. In other words, we apply the algorithm as if $\Pr(X=x, Y=y) = \widehat{\Pr}(X=x, Y=y)$.
Next, we propose two easily polynomializable methods to account for uncertainty from sampling error. Our general approach is to relax empirical-evidence constraints: we say that $\Pr(X=x, Y=y)$ must be near $\widehat{\Pr}(X=x, Y=y)$, rather than equaling it. Our first method is based on the "Bernoulli-KL" approach of Malloy, Tripathy, and Nowak (2020), which constructs separate confidence regions for each observable $\Pr(X=x, Y=y)$. For example, rather than constraining Algorithm 3 to only consider DGPs exactly satisfying $\Pr(X=0, Y=0) = 0.121$, as in the population bounds, or $\Pr(X=0, Y=0) = 0.113$, as in the estimated bounds, we instead allow it to consider any DGP in which $0.073 \leq \Pr(X=0, Y=0) \leq 0.163$. Thus, each equality constraint in the original empirical evidence is replaced with two linear inequality constraints; this is equivalent to constraining $\Pr(X=x, Y=y)$ to lie within a hypercube.
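To illustrate the shape of this relaxation, the sketch below inverts the Bernoulli KL divergence by bisection to produce one interval per observed cell. The radius $\log(2/\delta)/N$ and the even splitting of $\alpha$ across cells are our own illustrative assumptions; the exact calibration follows Malloy, Tripathy, and Nowak (2020) and Appendix E, so the resulting intervals need not match the numbers quoted above.

```python
# Sketch of hypercube (interval) constraints per observed cell, via a
# KL-divergence ball around each estimated proportion. Calibration details
# (radius and alpha-splitting) are illustrative assumptions.
import math

def kl_bern(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_interval(p_hat, n, delta):
    """Endpoints of {q : kl_bern(p_hat, q) <= log(2/delta)/n}, by bisection."""
    radius = math.log(2 / delta) / n

    def boundary(lo, hi, inside_at_lo):
        for _ in range(80):   # bisect until the ball boundary is pinned down
            mid = (lo + hi) / 2
            if (kl_bern(p_hat, mid) <= radius) == inside_at_lo:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    return boundary(0.0, p_hat, False), boundary(p_hat, 1.0, True)

p_hats = {(0, 0): 0.113, (1, 0): 0.352, (0, 1): 0.357, (1, 1): 0.178}
for cell, p_hat in p_hats.items():
    lo, hi = kl_interval(p_hat, n=1000, delta=0.05 / len(p_hats))  # even split
    print(cell, round(lo, 3), round(hi, 3))
# Each interval becomes two linear inequalities, lo <= Pr(X=x, Y=y) <= hi,
# in the polynomial program.
```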
Our second method is based on the multivariate Gaussian limiting distribution of the multinomial proportion (Bienaymé 1838). This approach will essentially say that $\Pr(X=x, Y=y)$ is constrained to lie within an ellipsoid, rather than a hypercube. Let $\hat{E}$ be a vector collecting $\big(\widehat{\Pr}(X=0, Y=0), \widehat{\Pr}(X=1, Y=0), \widehat{\Pr}(X=0, Y=1)\big)^\top$.^21 We then compute a confidence region for the distribution $\mathcal{N}\big(\hat{E}, \tfrac{1}{N}\mathrm{diag}(\hat{E}) - \tfrac{1}{N}\hat{E}\hat{E}^\top\big)$. This replaces all of the original equality constraints with a single quadratic inequality constraint of the form $(\hat{E} - E)^\top \big[\tfrac{1}{N}\mathrm{diag}(\hat{E}) - \tfrac{1}{N}\hat{E}\hat{E}^\top\big]^{-1} (\hat{E} - E) \leq z$, where $E = \big(\Pr(X=0, Y=0), \Pr(X=1, Y=0), \Pr(X=0, Y=1)\big)^\top$ and $z$ is some critical value of the $\chi^2$ distribution.

Specifics of the calculations are given in Appendix E. These confidence regions for the empirical quantities aim to jointly cover $\Pr(X=x, Y=y)$, for every $x$ and $y$, in at least $1 - \alpha$ of repeated samples (the Bernoulli-KL method guarantees conservative coverage in finite samples, whereas the Gaussian method offers only asymptotic guarantees). When this holds, confidence bounds obtained by optimizing subject to the relaxed empirical constraints are guaranteed to have at least $1 - \alpha$ coverage of the population bounds. In Section 7.2, we show that empirically, confidence bounds obtained from both methods are conservative.

^21 To avoid degeneracy issues, one empirical quantity is excluded, as it must sum to unity.
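Numerically, the ellipsoid constraint is just a quadratic form compared against a $\chi^2$ critical value, as the minimal sketch below shows. In the actual program, $E$ is a vector of polynomial expressions in the DGP parameters, so the same inequality enters Algorithm 3 as a polynomial constraint; the 95% level and the candidate vector here are illustrative.

```python
# Sketch of the single quadratic (ellipsoid) constraint from the Gaussian
# method. Illustrative only: in Algorithm 3 the vector E holds symbolic
# polynomial expressions rather than numbers.
import numpy as np
from scipy.stats import chi2

N = 1000
E_hat = np.array([0.113, 0.352, 0.357])   # fourth cell dropped (footnote 21)
Sigma = (np.diag(E_hat) - np.outer(E_hat, E_hat)) / N
z = chi2.ppf(0.95, df=len(E_hat))         # chi-squared critical value

def satisfies_ellipsoid(E):
    """Check (E_hat - E)' Sigma^{-1} (E_hat - E) <= z for a candidate E."""
    d = E_hat - E
    return float(d @ np.linalg.solve(Sigma, d)) <= z

print(satisfies_ellipsoid(np.array([0.121, 0.346, 0.349])))  # population values: True
```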
7. Simulated Examples

7.1. Instrumental Variables

Noncompliance with randomized treatment assignment is a common obstacle to causal inference. Balke and Pearl (1997) showed that bounds on the ATE under noncompliance can be obtained via linear programming. However, that approach cannot be used to bound the local ATE (LATE) among "compliers" because this quantity is nonlinear. Angrist, Imbens, and Rubin (1996) show the LATE can be point identified if certain conditions hold—including, notably, (i) the absence of a direct effect of treatment assignment $Z$ on the outcome $Y$; and (ii) monotonicity, or the absence of "defiers" for whom actual treatment $X$ is the inverse of $Z$.^22 Because these may not be satisfied in practice, Figure 5 shows three possible sets of assumptions that analysts may make: (a) neither; (b) the former but not the latter; and (c) both. We simulate a true DGP corresponding to panel (b), in which no-direct-effect holds but monotonicity is violated. The true ATE is −0.25 and the true LATE is −0.36. We will suppose analysts have access to the population distribution $\Pr(Z=z, X=x, Y=y)$; inference is discussed in Section 7.2.

Figure 5. DGPs with noncompliance. Three possible scenarios involving encouragement $Z$, treatment $X$, and outcome $Y$. Panel (b) represents the true simulation DGP, where $Z \to X$ monotonicity is violated (indicated by absence of a +). Panel (a) depicts assumptions used by an "overcautious" analyst unwilling to assume away a direct $Z \to Y$ effect. Panel (c) corresponds to an "overconfident" analyst who incorrectly assumes monotonicity of $Z \to X$.

An overcautious analyst might be unwilling to rule out a direct $Z \to Y$ effect or defiers in $Z \to X$, making only the assumptions shown in panel (a). Applying our method yields bounds of [−0.63, 0.37] and [−1, 1] for the ATE and LATE, respectively—sharp, but uninformative in sign. With an additional no-direct-effect assumption, per panel (b), they would instead obtain ATE bounds of [−0.55, −0.15], revealing a negative effect and correctly containing the true ATE, −0.25. However, LATE bounds remain at [−1, 1]; as compliers cannot be identified experimentally, this quantity is difficult to learn about without strong assumptions. Finally, an overconfident analyst might mistakenly make an additional monotonicity assumption. Helpfully, when asked to produce bounds, Algorithm 3 reports the causal query is infeasible—meaning that it cannot locate any DGP consistent with data and assumptions. This clearly warns that the causal theory is deficient. If the analyst naïvely applied the traditional two-stage least-squares estimator, they would receive no such warning. Instead, they would obtain an erroneous point estimate of −0.74, roughly double the true LATE of −0.36.

^22 Other conditions include ignorability of $Z$ and a nonnull effect of $Z$ on $X$.
7.2. Coverage of Confidence Bounds

We now evaluate the performance of confidence bounds that characterize uncertainty due to sampling error, constructed according to Section 6 and Appendix E, using the instrumental variable model of Figure 5(b). Specifically, we draw samples of $N = 1000$, $N = 10{,}000$, or $N = 100{,}000$ observations from this DGP. For each sample, we then use the empirical proportions $\frac{1}{N}\sum_i 1\{Z_i = z, X_i = x, Y_i = y\}$ for all $x, y, z \in \{0, 1\}$. These eight quantities form the basis of estimated bounds, by the plug-in principle. To quantify uncertainty, we compute 95% confidence regions on the same observed quantities, then convert them to polynomial constraints for inclusion in Algorithm 3. Optimizing subject to these confidence constraints produces confidence bounds, depicted in Figure 6. For each combination of sample size and uncertainty method, we draw 1000 simulated datasets and run Algorithm 3 on each.

Table 1. Bias of estimated bounds. Average estimated upper and lower bounds, across 1000 simulated datasets, for varying sample sizes. Estimated bounds are centered on population bounds in all scenarios.

Quantity        N = 1000    N = 10,000    N = 100,000    Population
Lower bound     −0.5497     −0.5498       −0.5500        −0.5502
Upper bound     −0.1453     −0.1455       −0.1459        −0.1460

Table 1 reports average values of estimated confidence bounds obtained by Algorithm 3 over 1000 simulated datasets, for varying $N$. At all sample sizes, estimated bounds are centered on population bounds. Figure 13 in the supplementary materials shows confidence bounds obtained across methods and sample sizes. The Bernoulli-KL method produces wider confidence intervals at all $N$; at $N = 1000$, it is generally unable to reject zero, whereas the asymptotic method does so occasionally. Differences in interval width persist but shrink rapidly as sample size grows, and both methods collapse on population bounds. As discussed in Section 6, we find more conservative coverage for confidence bounds on the ATE (100% coverage of population bounds), compared to coverage of the underlying confidence regions on the observed quantities (95% joint coverage of observed population quantities for the asymptotic method).

Figure 6. Coverage of confidence bounds. Each of 1000 simulations is depicted with a horizontal line. For each simulation, a horizontal error bar represents estimated bounds (top panels) or 95% confidence bounds (middle and lower panels), obtained per Section 6. All confidence bounds fully contain the population bounds, indicating 100% coverage. The middle (lower) row of panels reflects confidence bounds obtained with the Bernoulli-KL (asymptotic) method. Columns of panels report confidence bounds obtained using samples of various sizes. Vertical dotted gray lines show true population lower and upper bounds, which contain the true ATE of −0.25; vertical dashed black lines indicate zero.
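For concreteness, the per-dataset sampling step in these simulations amounts to a single multinomial draw over the eight $(Z, X, Y)$ cells, as sketched below; the uniform cell probabilities are placeholders rather than the Figure 5(b) DGP.

```python
# Sketch of the sampling step in each simulated dataset: one multinomial
# draw over the eight (Z, X, Y) cells, then plug-in proportions. The uniform
# cell probabilities are placeholders, not the Figure 5(b) DGP.
import numpy as np

rng = np.random.default_rng(seed=0)
cells = [(z, x, y) for z in (0, 1) for x in (0, 1) for y in (0, 1)]
p_true = np.full(len(cells), 1 / len(cells))   # placeholder DGP probabilities

N = 1000
counts = rng.multinomial(N, p_true)
p_hat = {cell: count / N for cell, count in zip(cells, counts)}
# These eight proportions are the plug-in constraints; relaxing them with the
# interval or ellipsoid method of Section 6 yields confidence bounds.
print(p_hat)
```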
7.3. More Complex Bounding Problems

We now examine four hypothetical DGPs, shown in Figure 7, featuring various threats to inference. Throughout, we target the ATE of $X$ on $Y$. Panel (a) illustrates outcome-based selection: we observe unit $i$ only if $S_i = 1$, where $S_i$ may be affected by $Y$. Selection severity, $\Pr(S = 0)$, is known, but no information about $\Pr(X=x, Y=y \mid S=0)$ is available. $X$ and $Y$ are also confounded by unobserved $U$. Bounding in this setting is a nonlinear program, with an analytic solution recently derived in Gabriel, Sachs, and Sjölander (2022). Panel (b) illustrates measurement error: an unobserved confounder $U$ jointly causes $Y$ and its proxy $Y^*$, but only treatment and the proxy outcome are observed. Bounding in this setting is a linear problem. A number of results for linear measurement error were recently presented in Finkelstein et al. (2020); here, we examine the monotonic errors case, where $Y^*(Y=1) \geq Y^*(Y=0)$. Panel (c) depicts missingness in outcomes, that is, nonresponse or attrition. Here, $X$ affects both the partially observed $Y$ and the response indicator $R$; if $R = 1$, then $Y^* = Y$, but if $R = 0$, then $Y^*$ takes on the missing value indicator NA. Nonresponse on $Y$ is differentially affected by both $X$ and the value of $Y$ itself (i.e., "missingness not at random," MNAR); Manski (1990) provides analytic bounds. Finally, panel (d) depicts joint missingness in both treatment and outcome—sometimes a challenge in longitudinal studies with dropout—with MNAR on $Y$.

Figure 7. Various threats to inference. Panels depict (a) outcome-based selection, (b) measurement error, (c) nonresponse, and (d) joint missingness. In each graph, $X$ and $Y$ are treatment and outcome, respectively. Dotted red regions represent observed information. In (a), the box around $S$ indicates selection: other variables are only observed conditional on $S = 1$. In (b), $Y^*$ represents a mismeasured version of the unobserved true $Y$. In (c), $R_Y$ indicates reporting, so that $Y^* = Y$ if $R = 1$ and is missing otherwise. In (d), both treatment and outcome can be missing, and missingness on $X$ can affect missingness on $Y$.

Figure 8. Computation of ATE bounds. Progress of Algorithm 3 for simulation data from DGPs depicted in Figure 7(a)–(d). Black error bars are known analytic bounds, y-axes are ATE values, and x-axes are runtimes of Algorithm 3. Red regions are dual bounds, which always contain sharp bounds and the unknown true causal effect; these can only narrow over time, converging on optimality. Blue regions are primal bounds, which can only widen over time as more extreme models are found. Optimization stops when primal and dual bounds meet, indicating bounds are sharp. Prior analytic bounds are sharp for problems (a)–(c). In setting (d), Algorithm 3 achieves point identification, but Manski (1990) bounds do not.

Figure 8 illustrates how Algorithm 3 recovers sharp bounds. Each panel shows progress in time. Primal bounds (blue) can widen over time if more extreme, observationally equivalent models are found. Dual bounds (red) narrow as the outer envelope is tightened. Our method simultaneously searches for more extreme primal points and narrows the dual envelope. Analysts can terminate the process at any time, reporting guaranteed-valid dual bounds along with their worst-case suboptimality factor, $\varepsilon$—or await complete sharpness, $\varepsilon = 0$.

In Figure 8(a)–(c), the algorithm converges on known analytic results. Ultimately, in the selection simulation (a), Algorithm 3 achieves bounds of [−0.50, 0.64], correctly recovering Gabriel, Sachs, and Sjölander's (2022) analytic bounds; in (b), measurement error bounds are [−0.62, 1.00], matching Finkelstein et al. (2020); and in (c), outcome missingness bounds are [−0.25, 0.75], equaling Manski (1990) bounds. Somewhat counterintuitively, Figure 8(d) shows dual bounds collapsing to a point, eventually point-identifying the ATE at −0.25 despite severe missingness. This surprising result turns out to be a variant of an approach using "shadow variables" developed by Miao et al. (2016).^23 This example illustrates that the algorithm is general enough to recover results even when they are not widely known in a particular model; note the commonly used approach of Manski (1990) produces far looser bounds of [−0.72, 0.40], failing to exploit the causal structure given in Figure 7(d). This result suggests our approach enables empirical investigation of complex models where general identification results are not yet available. Situations where bounds converge suggest models where point identification via an explicit functional may be possible, potentially enabling new identification theory.

^23 Specifically, it can be shown the ATE is identified for the Figure 7(d) graph only among faithful distributions where $X \to Y$ is nonnull—that is, almost everywhere in the model space.
8. Potential Critiques of the Approach

Below, we briefly discuss several potential critiques of our method.

"The user must know the true causal model." This is false; users do not need to assert an incorrect "complete" model, but rather only what they know or believe. Our approach simply derives the conclusions that follow from data and those transparently stated assumptions.

"The bounds will be too wide to be informative." This is no tradeoff: faulty point estimates based on faulty assumptions are also uninformative. When sharp bounds incorporating all defensible assumptions are wide, it merely means progress will require more information.

"What about continuous variables?" Discrete approximations often suffice in applied work. If continuous treatments only affect discrete outcomes when exceeding a threshold, discretization is lossless. Future work may study discrete approximations when effects are smooth.

"The bounds will take too long to compute." Achieving $\varepsilon = 0$ may sometimes take prohibitive time, but our approach remains faster than manual derivation. Figure 8 shows that several recently published results were recovered in mere seconds. Moreover, our anytime guarantee ensures that premature termination will still produce valid bounds.

9. Conclusion

Causal inference is a central goal of science, and many established techniques can point-identify causal quantities under ideal conditions. But in many applications, these conditions are simply not satisfied, necessitating partial identification—yet few tools for obtaining these bounds exist. For knowledge accumulation to proceed in the messy world of applied statistics, a general solution is needed. We present a tool to automatically produce sharp bounds on causal quantities in settings involving discrete data. Our approach involves a reduction of all such causal queries to polynomial programming problems, enables efficient search over observationally indistinguishable DGPs, and produces sharp bounds on arbitrary causal estimands. This approach is sufficiently general to accommodate essentially every classic inferential obstacle.

Beyond providing a general tool for causal inference, our approach aligns closely with recent calls to improve research transparency by explicitly declaring estimands, identifying assumptions, and causal theory (Miguel et al. 2014; Lundberg, Johnson, and Stewart 2021). Only with a common understanding of goals and premises can scholars have meaningful debates over the credibility of research. When aspects of a theory are contested, our approach allows for a fully modular exploration of how assumptions affect empirical conclusions. Scholars can learn whether assumptions are empirically consequential, and if so, craft a targeted line of inquiry to probe their validity. Our approach can also act as a safeguard for analysts, flagging assumptions as infeasible when they conflict with observed information.

Key avenues for future research are uncertainty quantification and computation time for complex problems. While we develop conservative confidence bounds and anytime validity guarantees, future work should pursue confidence bounds with nominal coverage and computational improvements, perhaps by incorporating point-identified subquantities or semiparametric modeling. Causal inference scholars may also use this method to aid in the exploration of new identification theory. These lines of inquiry now represent the major open questions in discrete causal inference.

Supplementary Materials

Appendices include (a) a technical glossary; (b) detailed algorithms; (c) worked examples; (d) details on program simplifications; (e) details on statistical inference; (f) proofs; and (g) details of simulations. Our replication archive consists of a Docker container including code, a snapshot of our autobounds package, and all dependencies needed to reproduce reported results.

Acknowledgments

For helpful feedback, we thank Peter Aronow, Justin Grimmer, Kosuke Imai, Luke Keele, Gary King, Christopher Lucas, Fredrik Sävje, Brandon Stewart, Eric Tchetgen Tchetgen, and participants in the Harvard Applied Statistics Workshop, the New York University Data Science Seminar, the University of Pennsylvania Causal Inference Seminar, PolMeth 2021, and the Yale Quantitative Research Methods Workshop.

Disclosure Statement

The authors report there are no competing interests to declare.

Funding

We gratefully acknowledge financial support from AI for Business and the Analytics at Wharton Data Science and Business Analytics Fund, the Carnegie Corporation of New York, the Office of Naval Research under grant N00014-21-1-2820, the National Science Foundation under grants 2040804 and CAREER 1942239, and the National Institutes of Health under grant R01 AI127271-01A1. The statements made and views expressed are solely the responsibility of the authors.

ORCID

Dean Knox, http://orcid.org/0000-0002-1945-7938
Jonathan Mummolo, http://orcid.org/0000-0002-5639-3718
References

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), "Identification of Causal Effects Using Instrumental Variables," Journal of the American Statistical Association, 91, 444–455.

Balke, A., and Pearl, J. (1997), "Bounds on Treatment Effects from Studies with Imperfect Compliance," Journal of the American Statistical Association, 92, 1171–1176.

Belotti, P., Lee, J., Liberti, L., Margot, F., and Wächter, A. (2009), "Branching and Bounds Tightening Techniques for Non-convex MINLP," Optimization Methods and Software, 24, 597–634.

Bienaymé, I. J. (1838), Mémoire sur la probabilité des résultats moyens des observations: démonstration directe de la règle de Laplace, Imprimerie Royale.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017), "Variational Inference: A Review for Statisticians," Journal of the American Statistical Association, 112, 859–877.

Bonet, B. (2001), "Instrumentality Tests Revisited," in Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, eds. J. S. Breese and D. Koller, pp. 48–55.

Cai, Z., Kuroki, M., Pearl, J., and Tian, J. (2008), "Bounds on Direct Effects in the Presence of Confounded Intermediate Variables," Biometrics, 64, 695–701.

Dean, T. L., and Boddy, M. (1988), An Analysis of Time-Dependent Planning, pp. 49–54, Washington DC: American Association for Artificial Intelligence.

Evans, R. (2018), "Margins of Discrete Bayesian Networks," Annals of Statistics, 46, 2623–2656.

Evans, R. J. (2016), "Graphs for Margins of Bayesian Networks," Scandinavian Journal of Statistics, 43, 625–648.

Finkelstein, N., Adams, R., Saria, S., and Shpitser, I. (2020), "Partial Identifiability in Discrete Data with Measurement Error," arXiv preprint arXiv:2012.12449.

Finkelstein, N., Wolfe, E., and Shpitser, I. (2021), "Non-Restrictive Cardinalities and Functional Models for Discrete Latent Variable DAGs," working paper.

Frangakis, C. E., and Rubin, D. B. (2002), "Principal Stratification in Causal Inference," Biometrics, 58, 21–29.

Gabriel, E. E., Sachs, M. C., and Sjölander, A. (2022), "Causal Bounds for Outcome-Dependent Sampling in Observational Studies," Journal of the American Statistical Association, 117, 939–950.

Geiger, D., and Meek, C. (1999), "Quantifier Elimination for Statistical Problems," in Proceedings of Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 226–235.

Greenland, S., and Robins, J. (1986), "Identifiability, Exchangeability, and Epidemiological Confounding," International Journal of Epidemiology, 15, 413–419.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999), "An Introduction to Variational Methods for Graphical Models," Machine Learning, 37, 183–233.

Kennedy, E. H., Harris, S., and Keele, L. J. (2019), "Survivor-Complier Effects in the Presence of Selection on Treatment, with Application to a Study of Prompt ICU Admission," Journal of the American Statistical Association, 114, 93–104.

Knox, D., Lowe, W., and Mummolo, J. (2020), "Administrative Records Mask Racially Biased Policing," American Political Science Review, 114, 619–637.

Lee, D. (2009), "Training, Wages, and Sample Selection: Estimating Sharp Bounds on Treatment Effects," The Review of Economic Studies, 76, 1071–1102.

Lundberg, I., Johnson, R., and Stewart, B. M. (2021), "What is Your Estimand? Defining the Target Quantity Connects Statistical Evidence to Theory," American Sociological Review, 86, 532–565.

Malloy, M. L., Tripathy, A., and Nowak, R. D. (2020), "Optimal Confidence Regions for the Multinomial Parameter," arXiv preprint arXiv:2002.01044.

Manski, C. (1990), "Nonparametric Bounds on Treatment Effects," The American Economic Review, 80, 319–323.

Miao, W., Liu, L., Tchetgen, E. T., and Geng, Z. (2016), "Identification, Doubly Robust Estimation, and Semiparametric Efficiency Theory of Nonignorable Missing Data with a Shadow Variable," Biometrika, 103, 475–482.

Miguel, E., Camerer, C., Casey, K., Cohen, J., Esterling, K., Gerber, A., Glennerster, R., Green, D., Humphreys, M., Imbens, G., and Laitin, D. (2014), "Promoting Transparency in Social Science Research," Science, 343, 30–31.

Molinari, F. (2020), "Microeconometrics with Partial Identification," arXiv:2004.11751.

Pearl, J. (1995), "On the Testability of Causal Models with Latent and Instrumental Variables," in Uncertainty in Artificial Intelligence II, San Francisco, CA: Morgan Kaufmann Publishers.

Pearl, J. (2000), Causality, New York: Cambridge University Press.

Ramsahai, R. R. (2012), "Causal Bounds and Observable Constraints for Non-deterministic Models," Journal of Machine Learning Research, 13, 829–848.

Richardson, T. S., and Robins, J. M. (2013), "Single World Intervention Graphs (SWIGs): A Unification of the Counterfactual and Graphical Approaches to Causality," Working Paper 128, Center for Statistics and the Social Sciences, University of Washington.

Robins, J. (1986), "A New Approach to Causal Inference in Mortality Studies with a Sustained Exposure Period—Application to Control of the Healthy Worker Survivor Effect," Mathematical Modelling, 7, 1393–1512.

Sachs, M. C., Jonzon, G., Sjölander, A., and Gabriel, E. E. (2022), "A General Method for Deriving Tight Symbolic Bounds on Causal Effects," Journal of Computational and Graphical Statistics, 1–10.

Shpitser, I. (2018), "Identification in Graphical Causal Models," in Handbook of Graphical Models, eds. M. Maathuis, M. Drton, S. Lauritzen, and M. Wainwright, pp. 381–404, Boca Raton, FL: CRC Press.

Sjölander, A., Lee, W., Källberg, H., and Pawitan, Y. (2014), "Bounds on Causal Interactions for Binary Outcomes," Biometrics, 70, 500–505.

Stein, W., et al. (2019), Sage Mathematics Software (Version 9.0), The Sage Development Team. www.sagemath.org.

Swanson, S., Hernán, M., Miller, M., Robins, J., and Richardson, T. (2018), "Partial Identification of the Average Treatment Effect Using Instrumental Variables," Journal of the American Statistical Association, 113, 933–947.

Tian, J., and Pearl, J. (2002), "On the Testable Implications of Causal Models with Hidden Variables," in Proceedings of the 18th Conference in Uncertainty in Artificial Intelligence.

Verma, T., and Pearl, J. (1990), "Equivalence and Synthesis of Causal Models," in Proceedings of the Conference on Uncertainty in Artificial Intelligence, eds. P. P. Bonissone, M. Henrion, L. N. Kanal, and J. F. Lemmer, pp. 255–268, Morgan Kaufmann.

Vigerske, S., and Gleixner, A. (2018), "SCIP: Global Optimization of Mixed-Integer Nonlinear Programs in a Branch-and-Cut Framework," Optimization Methods and Software, 33, 563–593.

Wolfe, E., Spekkens, R. W., and Fritz, T. (2019), "The Inflation Technique for Causal Inference with Latent Variables," Journal of Causal Inference, 7.

Zhang, J., and Bareinboim, E. (2021), "Non-parametric Methods for Partial Identification of Causal Effects," Technical Report R-72, Causal AI Lab, Columbia University.

Zhang, J. L., and Rubin, D. B. (2003), "Estimation of Causal Effects via Principal Stratification When Some Outcomes are Truncated by 'Death'," Journal of Educational and Behavioral Statistics, 28, 353–368.