Introduction to Causal Inference from a Machine Learning Perspective
Brady Neal
Prerequisites There is one main prerequisite: basic probability. This course assumes
you’ve taken an introduction to probability course or have had equivalent experience.
Topics from statistics and machine learning will pop up in the course from time to
time, so some familiarity with those will be helpful but is not necessary. For example, if
cross-validation is a new concept to you, you can learn it relatively quickly at the point in
the book that it pops up. And we give a primer on some statistics terminology that we’ll
use in Section 2.4.
Active Reading Exercises Research shows that one of the best techniques to remember
material is to actively try to recall information that you recently learned. You will see
“active reading exercises” throughout the book to help you do this. They’ll be marked by
the Active reading exercise: heading.
Many Figures in This Book As you will see, there are a ridiculous number of figures in
this book. This is on purpose: it is to give you as much visual intuition as possible.
We will sometimes copy the same figures, equations, etc. that you might have seen in
preceding chapters so that we can make sure the figures are always right next to the text
that references them.
Sending Me Feedback This is a book draft, so I greatly appreciate any feedback you’re
willing to send my way. If you’re unsure whether I’ll be receptive to it or not, don’t be.
Please send any feedback to me at [email protected] with “[Causal Book]” in the
beginning of your email subject. Feedback can be at the word level, sentence level, section
level, chapter level, etc. Here’s a non-exhaustive list of useful kinds of feedback:
▶ Typoz.
▶ Some part is confusing.
▶ You notice your mind starts to wander, or you don’t feel motivated to read some part.
▶ Some part seems like it can be cut.
▶ You feel strongly that some part absolutely should not be cut.
▶ Some parts are not connected well. Moving from one part to the next, you notice that there isn’t a natural flow.
▶ A new active reading exercise you thought of.
Bibliographic Notes Although we do our best to cite relevant results, we don’t want to
disrupt the flow of the material by digging into exactly where each concept came from.
There will be complete sections of bibliographic notes in the final version of this book,
but they won’t come until after the course has finished.
Contents

Preface

1 Motivation: Why You Might Care
  1.1 Simpson’s Paradox
  1.2 Applications of Causal Inference
  1.3 Correlation Does Not Imply Causation
    Nicolas Cage and Pool Drownings
    Why is Association Not Causation?
  1.4 Main Themes

2 Potential Outcomes
  2.1 Potential Outcomes and Individual Treatment Effects
  2.2 The Fundamental Problem of Causal Inference
  2.3 Getting Around the Fundamental Problem
    2.3.1 Average Treatment Effects and Missing Data Interpretation
    2.3.2 Ignorability and Exchangeability
    2.3.3 Conditional Exchangeability and Unconfoundedness
    2.3.4 Positivity/Overlap and Extrapolation
    2.3.5 No interference, Consistency, and SUTVA
    2.3.6 Tying It All Together
  2.4 Fancy Statistics Terminology Defancified
  2.5 A Complete Example with Estimation

3 The Flow of Association and Causation in Graphs
  3.1 Graph Terminology
  3.2 Bayesian Networks
  3.3 Causal Graphs
  3.4 Two-Node Graphs and Graphical Building Blocks
  3.5 Chains and Forks
  3.6 Colliders and their Descendants
  3.7 d-separation
  3.8 Flow of Association and Causation

4 Causal Models
  4.1 The do-operator and Interventional Distributions
  4.2 The Main Assumption: Modularity
  4.3 Truncated Factorization
    4.3.1 Example Application and Revisiting “Association is Not Causation”
  4.4 The Backdoor Adjustment
    4.4.1 Relation to Potential Outcomes
  4.5 Structural Causal Models (SCMs)
    4.5.1 Structural Equations
    4.5.2 Interventions
    4.5.3 Collider Bias and Why to Not Condition on Descendants of Treatment
  4.6 Example Applications of the Backdoor Adjustment
    4.6.1 Association vs. Causation in a Toy Example
    4.6.2 A Complete Example with Estimation
  4.7 Assumptions Revisited

5 Randomized Experiments
  5.1 Comparability and Covariate Balance
  5.2 Exchangeability
  5.3 No Backdoor Paths

6 Nonparametric Identification
  6.1 Frontdoor Adjustment
  6.2 do-calculus
    6.2.1 Application: Frontdoor Adjustment
  6.3 Determining Identifiability from the Graph

7 Estimation
  7.1 Preliminaries
  7.2 Conditional Outcome Modeling (COM)
  7.3 Grouped Conditional Outcome Modeling (GCOM)
  7.4 Increasing Data Efficiency
    7.4.1 TARNet
    7.4.2 X-Learner
  7.5 Propensity Scores
  7.6 Inverse Probability Weighting (IPW)
  7.7 Doubly Robust Methods
  7.8 Other Methods
  7.9 Concluding Remarks
    7.9.1 Confidence Intervals
    7.9.2 Comparison to Randomized Experiments

9 Instrumental Variables
  9.1 What is an Instrument?
  9.2 No Nonparametric Identification of the ATE
  9.3 Warm-Up: Binary Linear Setting
  9.4 Continuous Linear Setting
  9.5 Nonparametric Identification of Local ATE
    9.5.1 New Potential Notation with Instruments
    9.5.2 Principal Stratification
    9.5.3 Local ATE
  9.6 More General Settings for ATE Identification

10 Difference in Differences
  10.1 Preliminaries
  10.2 Introducing Time
  10.3 Identification
    10.3.1 Assumptions
    10.3.2 Main Result and Proof
  10.4 Major Problems

Appendix

A Proofs
  A.1 Proof of Equation 6.1 from Section 6.1
  A.2 Proof of Propensity Score Theorem (7.1)
  A.3 Proof of IPW Estimand (7.18)

Bibliography

List of Tables

Listings
  4.1 Python code for estimating the ATE, without adjusting for the collider
1 Motivation: Why You Might Care

1.1 Simpson’s Paradox

Consider a purely hypothetical future where there is a new disease known
as COVID-27 that is prevalent in the human population. In this purely
hypothetical future, there are two treatments that have been developed:
treatment A and treatment B. Treatment B is more scarce than treatment
A, so the split of those currently receiving treatment A vs. treatment
B is roughly 73%/27%. You are in charge of choosing which treatment
your country will exclusively use, in a country that only cares about
minimizing loss of life.
You have data on the percentage of people who die from COVID-27,
given the treatment they were assigned and given their condition at the
time treatment was decided. Their condition is a binary variable: either
mild or severe. In this data, 16% of those who receive A die, whereas
19% of those who receive B die. However, when we examine the people
with mild condition separately from the people with severe condition,
the numbers reverse order. In the mild subpopulation, 15% of those who
receive A die, whereas 10% of those who receive B die. In the severe
subpopulation, 30% of those who receive A die, whereas 20% of those
who receive B die. We depict these percentages and the corresponding
counts in Table 1.1.
Table 1.1: Mortality rates (deaths/total) from COVID-27, by treatment and condition. Treatment A looks better in the Total column, but treatment B is better in all subpopulations.

            Mild              Severe            Total
A           15% (210/1400)    30% (30/100)      16% (240/1500)
B           10% (5/50)        20% (100/500)     19% (105/550)
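To check these numbers yourself, here is a quick computation over the counts in Table 1.1 (pandas is just one convenient way to do it):

import pandas as pd

# Deaths and totals from Table 1.1, per treatment and condition
counts = pd.DataFrame({
    'treatment': ['A', 'A', 'B', 'B'],
    'condition': ['Mild', 'Severe', 'Mild', 'Severe'],
    'deaths':    [210, 30, 5, 100],
    'total':     [1400, 100, 50, 500],
})

# Within each condition, B has the lower mortality rate
print(counts.assign(rate=counts.deaths / counts.total))

# But aggregated over conditions, A has the lower mortality rate
totals = counts.groupby('treatment')[['deaths', 'total']].sum()
print(totals.assign(rate=totals.deaths / totals.total))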
The apparent paradox stems from the fact that, in Table 1.1, the “Total”
column could be interpreted to mean that we should prefer treatment
A, whereas the “Mild” and “Severe” columns could both be interpreted
to mean that we should prefer treatment B.¹ In fact, the answer is that if
we know someone’s condition, we should give them treatment B, and if
we do not know their condition, we should give them treatment A. Just
kidding... that doesn’t make any sense. So really, what treatment should
you choose for your country?

Either treatment A or treatment B could be the right answer, depending
on the causal structure of the data. In other words, causality is essential to
solve Simpson’s paradox. For now, we will just give the intuition for when
you should prefer treatment A vs. when you should prefer treatment B,
but it will be made more formal in Chapter 4.

¹ A key ingredient necessary to find Simpson’s paradox is the non-uniformity of allocation of people to the groups. 1400 of the 1500 people who received treatment A had mild condition, whereas 500 of the 550 people who received treatment B had severe condition. Because people with mild condition are less likely to die, this means that the total mortality rate for those with treatment A is lower than what it would have been if mild and severe conditions were equally split among them. The opposite bias is true for treatment B.
Scenario 2 If the prescription² of treatment 𝑇 is a cause of the condition
𝐶 (Figure 1.2), treatment A is more effective. An example scenario is
where treatment B is so scarce that it requires patients to wait a long
time after they were prescribed the treatment before they can receive
the treatment. Treatment A does not have this problem. Because the
condition of a patient with COVID-27 worsens over time, the prescription
of treatment B actually causes patients with mild conditions to develop
severe conditions, causing a higher mortality rate. Therefore, even if
treatment B is more effective than treatment A once administered (positive
effect along 𝑇 → 𝑌 in Figure 1.2), because prescription of treatment B
causes worse conditions (negative effect along 𝑇 → 𝐶 → 𝑌 in Figure
1.2), treatment B is less effective in total. Note: Because treatment B is
more expensive, treatment B is prescribed with 0.27 probability, while
treatment A is prescribed with 0.73 probability; importantly, treatment
prescription is independent of condition in this scenario.

² 𝑇 refers to the prescription of the treatment, rather than the subsequent reception of the treatment.

Figure 1.2: Causal structure of scenario 2, where treatment 𝑇 is a cause of condition 𝐶 (paths 𝑇 → 𝑌 and 𝑇 → 𝐶 → 𝑌). Given this causal structure, treatment A is preferable.
In sum, the more effective treatment is completely dependent on the
causal structure of the problem. In Scenario 1, where 𝐶 was a cause of
𝑇 (Figure 1.1), treatment B was more effective. In Scenario 2, where 𝑇
was a cause of 𝐶 (Figure 1.2), treatment A was more effective. Without
causality, Simpson’s paradox cannot be resolved. With causality, it is not
a paradox at all.
1.3 Correlation Does Not Imply Causation

Many of you will have heard the mantra “correlation does not imply
causation.” In this section, we will quickly review that and provide you
with a bit more intuition about why this is the case.
It turns out that the yearly number of people who drown by falling into
swimming pools has a high degree of correlation with the yearly number
of films that Nicolas Cage appears in [1]. See Figure 1.3 for a graph of this
data. Does this mean that Nicolas Cage encourages bad swimmers to
hop in the pool in his films? Or does Nicolas Cage feel more motivated to
act in more films when he sees how many drownings are happening that
year, perhaps to try to prevent more drownings? Or is there some other
explanation? For example, maybe Nicolas Cage is interested in increasing
his popularity among causal inference practitioners, so he travels back in
time to convince his past self to do just the right number of movies for us
to see this correlation, but not too close of a match as that would arouse
suspicion and potentially cause someone to prevent him from rigging
the data this way. We may never know for sure.

[1]: Vigen (2015), Spurious correlations

Figure 1.3: The yearly number of movies Nicolas Cage appears in correlates with the yearly number of pool drownings [1] (years 1999–2009; swimming pool drownings range from roughly 80 to 140 per year, and films from 0 to 6 per year).
Shortly, we will work through an illustrative example that will help clarify how spurious
correlations can arise.
Before moving to the next example, let’s be a bit more precise about
terminology. “Correlation” is often colloquially used as a synonym
for statistical dependence. However, “correlation” is technically only a
measure of linear statistical dependence. We will largely be using the
term association to refer to statistical dependence from now on.
Causation is not all or none. For any given amount of association, it
does not need to be “all of the association is causal” or “none of the
association is causal.” Rather, it is possible to have a large amount of
association with only some of it being causal. The phrase “association
is not causation” simply means that the amount of association and the
amount of causation can be different. Some amount of association and
zero causation is a special case of “association is not causation.”
Say you happen upon some data that relates wearing shoes to bed and
waking up with a headache, as one does. It turns out that most times
that someone wears shoes to bed, that person wakes up with a headache.
And most times someone doesn’t wear shoes to bed, that person doesn’t
wake up with a headache. It is not uncommon for people to interpret
data like this (with associations) as meaning that wearing shoes to bed
causes people to wake up with headaches, especially if they are looking
for a reason to justify not wearing shoes to bed. A careful journalist might
make claims like “wearing shoes to bed is associated with headaches”
or “people who wear shoes to bed are at higher risk of waking up with
headaches.” However, the main reason to make claims like that is that
most people will internalize claims like that as “if I wear shoes to bed,
I’ll probably wake up with a headache.”
We can explain how wearing shoes to bed and headaches are associated
without either being a cause of the other. It turns out that they are
both caused by a common cause: drinking the night before. We depict
this in Figure 1.4. You might also hear this kind of variable referred
to as a “confounder” or a “lurking variable.” We will call this kind of
association confounding association since the association is facilitated by a
confounder.
The total association observed can be made up of both confounding
association and causal association. It could be the case that wearing shoes
to bed does have some small causal effect on waking up with a headache.
Then, the total association would not be solely confounding association
nor solely causal association. It would be a mixture of both. For example,
in Figure 1.4, causal association flows along the arrow from shoe-sleeping
to waking up with a headache. And confounding association flows along
the path from shoe-sleeping to drinking to headachening (waking up
with a headache). We will make the graphical interpretation of these
different kinds of association clear in Chapter 3.

Figure 1.4: Causal structure, where drinking the night before is a common cause of sleeping with shoes on and of waking up with a headache.
1.4 Main Themes

There are several overarching themes that will keep coming up throughout
this book. These themes will largely be comparisons of two different
categories. As you are reading, it is important that you understand which
categories different sections of the book fit into and which categories
they do not fit into.
Statistical vs. Causal Even with an infinite amount of data, we some-
times cannot compute some causal quantities. In contrast, much of
statistics is about addressing uncertainty in finite samples. When given
infinite data, there is no uncertainty. However, association, a statistical
concept, is not causation. There is more work to be done in causal infer-
ence, even after starting with infinite data. This is the main distinction
motivating causal inference. We have already made this distinction in
this chapter and will continue to make this distinction throughout the
book.
Identification vs. Estimation Identification of causal effects is unique
to causal inference. It is the problem that remains to be solved, even when we
have infinite data. However, causal inference also shares estimation with
traditional statistics and machine learning. We will largely begin with
identification of causal effects (in Chapters 2, 4 and 6) before moving to
estimation of causal effects (in Chapter 7). The exceptions are Section 2.5
and Section 4.6.2, where we carry out complete examples with estimation
to give you an idea of what the whole process looks like early on.
Interventional vs. Observational If we can intervene/experiment,
identification of causal effects is relatively easy. This is simply because
we can actually take the action that we want to measure the causal effect
of and simply measure the effect after we take that action. Observational
data is where it gets more complicated because confounding is almost
always introduced into the data.
Assumptions There will be a large focus on what assumptions we are
using to get the results that we get. Each assumption will have its own
box to make it hard to miss. Clear assumptions should make
it easy to see where critiques of a given causal analysis or causal model
will be. The hope is that presenting assumptions clearly will lead to more
lucid discussions about causality.
2 Potential Outcomes
In this chapter, we will ease into the world of causality. We will see that
new concepts and corresponding notations need to be introduced to
clearly describe causal concepts. These concepts are “new” in the sense
that they may not exist in traditional statistics or math, but they should
be familiar in that we use them in our thinking and describe them with
natural language all the time.

Familiar statistical notation We will use 𝑇 to denote the random variable
for treatment, 𝑌 to denote the random variable for the outcome of
interest and 𝑋 to denote covariates. In general, we will use uppercase
letters to denote random variables (except in maybe one case) and lowercase
letters to denote values that random variables take on. Much of what
we consider will be settings where 𝑇 is binary. Know that, in general, we
can extend things to work in settings where 𝑇 can take on more than two
values or where 𝑇 is continuous.

2.1 Potential Outcomes and Individual Treatment Effects
We will now introduce the first causal concept to appear in this book.
These concepts are sometimes characterized as being unique to the
Neyman-Rubin [2–4] causal model (or potential outcomes framework),
but they are not. For example, these same concepts are still present
(just under different notation) in the framework that uses causal graphs
(Chapters 3 and 4). It is important that you spend some time ensuring
that you understand these initial causal concepts. If you have not studied
causal inference before, they will be unfamiliar to see in mathematical
contexts, though they may be quite familiar intuitively because we
commonly think and communicate in causal language.

[2]: Splawa-Neyman (1923 [1990]), ‘On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9.’
[3]: Rubin (1974), ‘Estimating causal effects of treatments in randomized and nonrandomized studies.’
[4]: Sekhon (2008), ‘The Neyman-Rubin Model of Causal Inference and Estimation via Matching Methods’
Scenario 1 Consider the scenario where you are unhappy. And you are
considering whether or not to get a dog to help make you happy. If you
become happy after you get the dog, does this mean the dog caused you
to be happy? Well, what if you would have also become happy had you
not gotten the dog? In that case, the dog was not necessary to make you
happy, so its claim to a causal effect on your happiness is weak.
Scenario 2 Let’s switch things up a bit. Consider that you will still be
happy if you get a dog, but now, if you don’t get a dog, you will remain
unhappy. In this scenario, the dog has a pretty strong claim to a causal
effect on your happiness.
In both the above scenarios, we have used the causal concept known as
potential outcomes. Your outcome 𝑌 is happiness: 𝑌 = 1 corresponds to
happy while 𝑌 = 0 corresponds to unhappy. Your treatment 𝑇 is whether
or not you get a dog: 𝑇 = 1 corresponds to you getting a dog while 𝑇 = 0
corresponds to you not getting a dog. The potential outcome 𝑌(1) denotes what
your happiness would be if you got a dog, and the potential outcome 𝑌(0) denotes
what your happiness would be if you did not. Taking the difference of these two
potential outcomes gives us the individual treatment effect (ITE)² for individual 𝑖:

𝜏𝑖 ≜ 𝑌𝑖(1) − 𝑌𝑖(0)   (2.1)

² The ITE is also known as the individual causal effect, unit-level causal effect, or unit-level treatment effect.
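To make the notation concrete, here is a tiny, purely hypothetical example (the numbers are made up) of how the ITE in Equation 2.1 would be computed if, contrary to what the next section explains, we could see both potential outcomes for every individual:

import pandas as pd

# Hypothetical table of BOTH potential outcomes for four individuals (1 = happy, 0 = unhappy)
df = pd.DataFrame({
    'Y1': [1, 1, 0, 1],   # Y_i(1): happiness if individual i gets a dog
    'Y0': [1, 0, 0, 0],   # Y_i(0): happiness if individual i does not get a dog
})
df['ITE'] = df['Y1'] - df['Y0']   # Equation 2.1
print(df)
print('average of the ITEs:', df['ITE'].mean())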
2.2 The Fundamental Problem of Causal Inference

It is impossible to observe all potential outcomes for a given individual. For
example, you could observe 𝑌(1) by getting a dog and observing your happiness
afterward, or you could observe 𝑌(0) by not getting a dog and observing your happiness.
However, you cannot observe both 𝑌(1) and 𝑌(0), unless you have a time
machine that would allow you to go back in time and choose the version
of treatment that you didn’t take the first time. You cannot simply get
a dog, observe 𝑌(1), give the dog away, and then observe 𝑌(0) because
the second observation will be influenced by all the actions you took
between the two observations and anything else that changed since the
first observation.
This is known as the fundamental problem of causal inference [5]. It is
fundamental because if we cannot observe both 𝑌𝑖(1) and 𝑌𝑖(0), then we
cannot observe the causal effect 𝑌𝑖(1) − 𝑌𝑖(0). This problem is unique
to causal inference because, in causal inference, we care about making
causal claims, which are defined in terms of potential outcomes. For
contrast, consider machine learning. In machine learning, we often only
care about predicting the observed outcome 𝑌, so there is no need for
potential outcomes, which means machine learning does not have to
deal with this fundamental problem that we must deal with in causal
inference.

[5]: Holland (1986), ‘Statistics and Causal Inference’
The potential outcomes that you do not (and cannot) observe are known
as counterfactuals because they are counter to fact (reality). “Potential
outcomes” are sometimes referred to as “counterfactual outcomes,” but
we will never do that in this book because a potential outcome 𝑌(𝑡)
does not become counter to fact until another potential outcome 𝑌(𝑡′) is
observed. The potential outcome that is observed is sometimes referred
to as a factual. Note that there are no counterfactuals or factuals until the
outcome is observed. Before that, there are only potential outcomes.
I suspect this section is where this chapter might start to get a bit unclear.
If that is the case for you, don’t worry too much, and just continue to the
next chapter, as it will build up parallel concepts in a hopefully more
intuitive way.
The assumption that allows us to fill in those question marks (missing data) is known as ignorability. Assuming ignorability is
like ignoring how people ended up selecting the treatment they selected
and just assuming they were randomly assigned their treatment; we
depict this graphically in Figure 2.2 by the lack of a causal arrow from 𝑋
to 𝑇. We will now state this assumption formally.

Assumption 2.1 (Ignorability / Exchangeability)

(𝑌(1), 𝑌(0)) ⊥⊥ 𝑇

Figure 2.2: Causal structure when the treatment assignment mechanism is ignorable. Notably, this means there’s no arrow from 𝑋 to 𝑇, which means there is no confounding.

This assumption is key to causal inference because it allows us to reduce
the ATE to the associational difference:

𝔼[𝑌(1)] − 𝔼[𝑌(0)] = 𝔼[𝑌(1) | 𝑇 = 1] − 𝔼[𝑌(0) | 𝑇 = 0]   (2.3)

Conditioning on relevant covariates 𝑋 gives us the weaker, conditional version of this assumption:

Assumption 2.2 (Conditional Exchangeability / Unconfoundedness)

(𝑌(1), 𝑌(0)) ⊥⊥ 𝑇 | 𝑋
The idea is that although the treatment and potential outcomes may
be unconditionally associated (due to confounding), within levels of 𝑋 ,
they are not associated. In other words, there is no confounding within
levels of 𝑋 because controlling for 𝑋 has made the treatment groups
comparable. We’ll now give a bit of graphical intuition for the above. We
will not draw the rigorous connection between the graphical intuition
and Assumption 2.2 until Chapter 3; for now, it is just meant to aid
intuition.
This marks an important result for causal inference, so we’ll give it its
own proposition box. The proof we give above leaves out some details.
Read through to Section 2.3.6 (where we redo the proof with all details
specified) to get the rest of the details. We will call this result the adjustment
formula.
To see why positivity is important, let’s take a closer look at Equation 2.9, where 𝑃(𝑇 = 𝑡 | 𝑋 = 𝑥) appears in a denominator: if positivity is violated for some value of 𝑥, we end up dividing by zero.
Intuition That’s the math for why we need the positivity assumption,
but what’s the intuition? Well, if we have a positivity violation, that
means that within some subgroup of the data, everyone always receives
treatment or everyone always receives the control. It wouldn’t make
sense to be able to estimate a causal effect of treatment vs. control in that
subgroup since we see only treatment or only control. We never see the
alternative in that subgroup.
Another name for positivity is overlap. The intuition for this name is that
we want the covariate distribution of the treatment group to overlap
with the covariate distribution of the control group. More specifically,
we want 𝑃(𝑋 | 𝑇 = 1)⁹ to have the same support as 𝑃(𝑋 | 𝑇 = 0).¹⁰ This
is why another common alias for positivity is common support.

⁹ Whenever we use a random variable (denoted by a capital letter) as the argument for 𝑃, we are referring to the whole distribution, rather than just the scalar that something like 𝑃(𝑥 | 𝑇 = 1) refers to.

¹⁰ Active reading exercise: convince yourself that this formulation of overlap/positivity is equivalent to the formulation in Assumption 2.3.

The Positivity-Unconfoundedness Tradeoff Although conditioning
on more covariates could lead to a higher chance of satisfying unconfoundedness,
it can lead to a higher chance of violating positivity. As we
increase the dimension of the covariates, we make the subgroups for any
level 𝑥 of the covariates smaller.¹¹ As each subgroup gets smaller, there
is a higher and higher chance that either the whole subgroup will have
treatment or the whole subgroup will have control.

¹¹ This is related to the curse of dimensionality.
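A common way to check overlap in practice is to estimate the propensity score 𝑃(𝑇 = 1 | 𝑋) (covered later, in Section 7.5) and look at how close the estimates get to 0 and 1. A minimal sketch, where X (an array of covariates) and T (a binary treatment vector) are hypothetical:

import numpy as np
from sklearn.linear_model import LogisticRegression

propensity = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]  # estimated P(T=1 | X)

# Overlap/positivity diagnostic: estimated propensities should stay away from 0 and 1
print('min:', propensity.min(), 'max:', propensity.max())
print('share of extreme scores:', np.mean((propensity < 0.01) | (propensity > 0.99)))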
The consistency assumption states that observing the treatment value 𝑇 = 𝑡 means the observed outcome 𝑌 is the corresponding potential outcome:

𝑇 = 𝑡 ⟹ 𝑌 = 𝑌(𝑡)   (2.12)

or, written more compactly,

𝑌 = 𝑌(𝑇)   (2.13)

Together with unconfoundedness and positivity, these assumptions give us identification of the ATE via the adjustment formula (Theorem 2.1):

𝔼[𝑌(1) − 𝑌(0)] = 𝔼𝑋 [𝔼[𝑌 | 𝑇 = 1, 𝑋] − 𝔼[𝑌 | 𝑇 = 0, 𝑋]]
𝔼[𝑌(1) − 𝑌(0)] is the causal estimand that we are interested in. In order
to actually estimate this causal estimand, we must translate it into a
statistical estimand: 𝔼𝑋 [𝔼[𝑌 | 𝑇 = 1, 𝑋] − 𝔼[𝑌 | 𝑇 = 0, 𝑋]].¹⁵

¹⁵ Active reading exercise: Why can’t we directly estimate a causal estimand without first translating it to a statistical estimand?

When we say “identification” in this book, we are referring to the process
of moving from a causal estimand to an equivalent statistical estimand.
When we say “estimation,” we are referring to the process of moving from
a statistical estimand to an estimate. We illustrate this in the flowchart in
Figure 2.5.
Causal Estimand —(Identification)→ Statistical Estimand —(Estimation)→ Estimate

Figure 2.5: The Identification-Estimation Flowchart – a flowchart that illustrates the process of moving from a target causal estimand to a
corresponding estimate, through identification and estimation.
Theorem 2.1 and the corresponding recent copy in Equation 2.14 give
us identification. However, we haven’t discussed estimation at all. In
this section, we will give a short example complete with estimation. We
will cover the topic of estimation of causal effects more completely in
Chapter 7.
We use Luque-Fernandez et al. [8]’s example from epidemiology. The
outcome 𝑌 of interest is (systolic) blood pressure. This is an important
outcome because roughly 46% of Americans have high blood pressure,
and high blood pressure is associated with increased risk of mortality
[9]. The “treatment” 𝑇 of interest is sodium intake. Sodium intake is
a continuous variable; in order to easily apply Equation 2.14, which is
specified for binary treatment, we will binarize 𝑇 by letting 𝑇 = 1 denote
daily sodium intake above 3.5 grams and letting 𝑇 = 0 denote daily
sodium intake below 3.5 grams.¹⁶ We will be estimating the causal effect
of sodium intake on blood pressure. In our data, we also have the age
of the individuals and amount of protein in their urine as covariates 𝑋.

[8]: Luque-Fernandez et al. (2018), ‘Educational Note: Paradoxical collider effect in the analysis of non-communicable disease epidemiological data: a reproducible illustration and web application’
[9]: Virani et al. (2020), ‘Heart Disease and Stroke Statistics—2020 Update: A Report From the American Heart Association’

¹⁶ As we will see, this binarization is purely pedagogical and does not reflect any limitations of adjusting for confounders.
(1/𝑛) ∑𝑖 [𝔼[𝑌 | 𝑇 = 1, 𝑋 = 𝑥𝑖] − 𝔼[𝑌 | 𝑇 = 0, 𝑋 = 𝑥𝑖]]   (2.15)
All of the above is done using the adjustment formula with model-assisted
estimation, where we first fit a model for the conditional expectation
𝔼[𝑌 | 𝑡, 𝑥], and then we take an empirical mean over 𝑋 , using that model.
However, because we are using a linear model, this is equivalent to just
taking the coefficient in front of 𝑇 in the linear regression as the ATE
estimate. This is what we do in the following code (which gives the exact
same ATE estimate):
from sklearn.linear_model import LinearRegression
# Xt: array whose first column is the binary treatment T, remaining columns are the covariates X
model = LinearRegression()
model.fit(Xt, y)
ate_est = model.coef_[0]  # coefficient on T is the ATE estimate under this linear model
print('ATE estimate:', ate_est)
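For comparison, the adjustment-formula version (Equation 2.15) can also be written out directly. The following is a minimal sketch, assuming t is the binary treatment vector, x is an array of covariates, and y is the outcome vector (these names are ours, not from the book’s code):

import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a model for the conditional expectation E[Y | T, X]
model = LinearRegression()
model.fit(np.column_stack([t, x]), y)

# Average the model's predictions with everyone treated vs. everyone untreated (Equation 2.15)
pred_treated = model.predict(np.column_stack([np.ones(len(y)), x]))
pred_control = model.predict(np.column_stack([np.zeros(len(y)), x]))
ate_est = np.mean(pred_treated - pred_control)
print('ATE estimate:', ate_est)

With a linear model, this gives the same number as the coefficient-based shortcut above; the advantage of the two-step recipe is that it works unchanged if the linear model is swapped out for a more flexible one.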
However, this convenience comes at a cost: the linear parametric form we assumed. If this model were
misspecified,¹⁹ our ATE estimate would be biased. And because linear
models are so simple, they will likely be misspecified. For example, the
following assumption is implicit in assuming that a linear model is well-specified:
the treatment effect is the same for all individuals. See Morgan
and Winship [12, Sections 6.2 and 6.3] for a more complete critique of
using the coefficient in front of treatment as the ATE estimate.

¹⁹ By “misspecified,” we mean that the functional form of the model does not match the functional form of the data generating process.

[12]: Morgan and Winship (2014), Counterfactuals and Causal Inference: Methods and Principles for Social Research
3 The Flow of Association and Causation in Graphs
We’ve been using causal graphs in the previous chapters to aid intuition.
In this chapter, we will introduce the formalisms that underlie this
intuition. Hopefully, we have sufficiently motivated this chapter and
made the utility of graphical models clear with all of the graphical
interpretations of concepts in previous chapters.

3.1 Graph Terminology

In this section, we will use the terminology machine gun (see Figure 3.1). To
be able to use nice convenient graph language in the following sections,
rapid-firing a lot of graph terminology is a necessary evil, unfortunately.
The term “graph” is often used to describe a variety of visualizations.
For example, “graph” might refer to a visualization of a single variable
function 𝑓(𝑥), where 𝑥 is plotted on the 𝑥-axis and 𝑓(𝑥) is plotted
on the 𝑦-axis. Or “bar graph” might be used as a synonym for a bar
chart. However, in graph theory, the term “graph” refers to a specific
mathematical object.

Figure 3.1: Terminology machine gun
A graph is a collection of nodes (also called “vertices”) and edges that
connect the nodes. For example, in Figure 3.2, 𝐴, 𝐵, 𝐶, and 𝐷 are the nodes
of the graph, and the lines connecting them are the edges. Figure 3.2 is
called an undirected graph because the edges do not have any direction. In
contrast, Figure 3.3 is a directed graph. A directed graph’s edges go out
of a parent node and into a child node, with the arrows signifying which
direction the edges are going. We will denote the parents of a node 𝑋
with pa(𝑋). We’ll use an even simpler shorthand when the nodes are
ordered so that we can denote the 𝑖th node by 𝑋𝑖; in that case, we will
also denote the parents of 𝑋𝑖 by pa𝑖. Two nodes are said to be adjacent
if they are connected by an edge. For example, in both Figure 3.2 and
Figure 3.3, 𝐴 and 𝐶 are adjacent, but 𝐴 and 𝐷 are not.

Figure 3.2: Undirected graph
A path in a graph is any sequence of adjacent nodes, regardless of the
direction of the edges that join them. For example, 𝐴 — 𝐶 — 𝐵 is a path
in Figure 3.2, and 𝐴 → 𝐶 ← 𝐵 is a path in Figure 3.3. A directed path is
a path that consists of directed edges that are all directed in the same
direction (no two edges along the path both point into or both point
out of the same node). For example, 𝐴 → 𝐶 → 𝐷 is a directed path in
Figure 3.3, but 𝐴 → 𝐶 ← 𝐵 and 𝐶 ← 𝐴 → 𝐵 are not.

Figure 3.3: Directed graph

If there is a directed path that starts at node 𝑋 and ends at node 𝑌, then 𝑋
is an ancestor of 𝑌, and 𝑌 is a descendant of 𝑋. We will denote descendants
of 𝑋 by de(𝑋). For example, in Figure 3.3, 𝐴 is an ancestor of 𝐵 and
𝐷, and 𝐵 and 𝐷 are both descendants of 𝐴 (de(𝐴)). If 𝑋 is an ancestor
of itself, then some funky time travel has taken place. In seriousness, a
directed path from some node 𝑋 back to itself is known as a cycle (see
Figure 3.4). If there are no cycles in a directed graph, the graph is known
as a directed acyclic graph (DAG). The graphs we focus on in this book will
mostly be DAGs.

Figure 3.4: Directed graph with cycle
If two parents 𝑋 and 𝑌 share some child 𝑍, but there is no edge connecting
𝑋 and 𝑌, then 𝑋 → 𝑍 ← 𝑌 is known as an immorality. Seriously; that’s
a real term in graphical models. For example, if we remove the edge 𝐴 → 𝐵
from Figure 3.3 to get Figure 3.5, then 𝐴 → 𝐶 ← 𝐵 is an immorality.

Figure 3.5: Directed graph with immorality
3.2 Bayesian Networks
It turns out that much of the work for causal graphical models was done
in the field of probabilistic graphical models. Probabilistic graphical
models are statistical models while causal graphical models are causal
models. Bayesian networks are the main probabilistic graphical model
that causal graphical models (causal Bayesian networks) inherit most of
their properties from.
Imagine that we only cared about modeling association, without any
causal modeling. We would want to model the data distribution 𝑃(𝑥 1 , 𝑥 2 , . . . , 𝑥 𝑛 ).
In general, we can use the chain rule of probability to factorize any distri-
bution:
𝑃(𝑥1, 𝑥2, ..., 𝑥𝑛) = 𝑃(𝑥1) ∏𝑖 𝑃(𝑥𝑖 | 𝑥𝑖−1, ..., 𝑥1)   (3.1)
However, if we were to model these factors with tables, it would take an
exponential number of parameters. To see this, take each 𝑥𝑖 to be binary
and consider how we would model the factor 𝑃(𝑥𝑛 | 𝑥𝑛−1, ..., 𝑥1). Since
𝑥𝑛 is binary, we only need to model 𝑃(𝑋𝑛 = 1 | 𝑥𝑛−1, ..., 𝑥1) because
𝑃(𝑋𝑛 = 0 | 𝑥𝑛−1, ..., 𝑥1) is simply 1 − 𝑃(𝑋𝑛 = 1 | 𝑥𝑛−1, ..., 𝑥1). Well, we
would need 2ⁿ⁻¹ parameters to model this. As a specific example, let
𝑛 = 4. As we can see in Table 3.1, this would require 2⁴⁻¹ = 8 parameters:
𝛼1, ..., 𝛼8. This brute-force parametrization quickly becomes intractable
as 𝑛 increases.

Table 3.1: Table required to model the single factor 𝑃(𝑥𝑛 | 𝑥𝑛−1, ..., 𝑥1) where 𝑛 = 4 and the variables are binary. The number of parameters necessary is exponential in 𝑛.

𝑥1  𝑥2  𝑥3  𝑃(𝑥4 | 𝑥3, 𝑥2, 𝑥1)
0   0   0   𝛼1
0   0   1   𝛼2
0   1   0   𝛼3
0   1   1   𝛼4
1   0   0   𝛼5
1   0   1   𝛼6
1   1   0   𝛼7
1   1   1   𝛼8

An intuitive way to more efficiently model many variables together in
a joint distribution is to only model local dependencies. For example,
rather than modeling the 𝑋4 factor as 𝑃(𝑥4 | 𝑥3, 𝑥2, 𝑥1), we could model
it as 𝑃(𝑥4 | 𝑥3) if we have reason to believe that 𝑋4 only locally depends
on 𝑋3. In fact, in the corresponding graph in Figure 3.6, the only node
that feeds into 𝑋4 is 𝑋3. This is meant to signify that 𝑋4 only locally
depends on 𝑋3. Whenever we use a graph 𝐺 in relation to a probability
distribution 𝑃, there will always be a one-to-one mapping between the
nodes in 𝐺 and the random variables in 𝑃, so when we talk about nodes
being independent, we mean the corresponding random variables are
independent.

Figure 3.6: Four node DAG where 𝑋4 locally depends on only 𝑋3.

Given a probability distribution and a corresponding directed acyclic
graph (DAG), we can formalize the specification of independencies with
the local Markov assumption:

Assumption 3.1 (Local Markov Assumption) Given its parents in the
DAG, a node 𝑋 is independent of all its non-descendants.
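To make the earlier parameter-counting argument concrete, here is a small sketch (the function names are ours) comparing the number of parameters needed for binary variables under the full chain-rule factorization versus a factorization in which every node has at most k parents:

def chain_rule_params(n):
    # The factor P(x_i | x_{i-1}, ..., x_1) over binary variables needs 2**(i-1) parameters.
    return sum(2 ** (i - 1) for i in range(1, n + 1))

def local_params(n, k):
    # If each node has at most k parents, each factor needs at most 2**k parameters.
    return sum(2 ** min(i - 1, k) for i in range(1, n + 1))

print(chain_rule_params(4))    # 1 + 2 + 4 + 8 = 15
print(local_params(4, 1))      # 1 + 2 + 2 + 2 = 7
print(chain_rule_params(30))   # more than a billion parameters
print(local_params(30, 2))     # 115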
Hopefully you see the resemblance between the move from Equation 3.2
to Equation 3.3 or the move to Equation 3.4 and the generalization of this
that is presented in Definition 3.1.
The Bayesian network factorization is also known as the chain rule for
Bayesian networks or Markov compatibility. For example, if 𝑃 factorizes
according to 𝐺 , then 𝑃 and 𝐺 are Markov compatible.
We have given the intuition of how the local Markov assumption implies
the Bayesian network factorization, and it turns out that the two are
actually equivalent. In other words, we could have started with the
Bayesian network factorization as the main assumption (and labeled it as
an assumption) and shown that it implies the local Markov assumption.
See Koller and Friedman [13, Chapter 3] for these proofs and more
information on this topic.

[13]: Koller and Friedman (2009), Probabilistic Graphical Models: Principles and Techniques
As important as the local Markov assumption is, it only gives us infor-
mation about the independencies in 𝑃 that a DAG implies. It does not
even tell us that if 𝑋 and 𝑌 are adjacent in the DAG, then 𝑋 and 𝑌 are
dependent. And this additional information is very commonly assumed
in causal DAGs. To get this guaranteed dependence between adjacent
nodes, we will generally assume a slightly stronger assumption than the
local Markov assumption: minimality.
3.3 Causal Graphs

The previous section was all about statistical models and modeling
association. In this section, we will augment these models with causal
assumptions, turning them into causal models and allowing us to study
causation. In order to introduce causal assumptions, we must first have
an understanding of what it means for 𝑋 to be a cause of 𝑌 .
The strict causal edges assumption means that every edge is “active,” just like in DAGs that satisfy minimality. In
other words, because the definition of a cause (Definition 3.2) implies
that a cause and its effect are dependent and because we are assuming
all parents are causes of their children, we are assuming that parents
and their children are dependent. So the second part of minimality
(Assumption 3.2) is baked into the strict causal edges assumption.
In contrast, the non-strict causal edges assumption would allow for
some parents to not be causes of their children. It would just assume
that children are not causes of their parents. This allows us to draw
graphs with extra edges to make fewer assumptions, just like we would
in Bayesian networks, where more edges means fewer independence
assumptions. Causal graphs are sometimes drawn with this kind of
non-minimal meaning, but the vast majority of the time, when someone
draws a causal graph, they mean that parents are causes of their children.
Therefore, unless we specify otherwise, throughout this book, we will
use “causal graph” to refer to a DAG that satisfies the strict causal edges
assumption. And we will often omit the word “strict” when we refer to
this assumption.
When we add the causal edges assumption, directed paths in the DAG
take on a very special meaning; they correspond to causation. This is in
contrast to other paths in the graph, which association may flow along,
but causation certainly may not. This will become more clear when we
go into detail on these other kinds of paths in Sections 3.5 and 3.6.
Moving forward, we will now think of the edges of graphs as causal, in
order to describe concepts intuitively with causal language. However,
all of the associational claims about statistical independence will still
hold, even when the edges do not have causal meaning like in the vanilla
Bayesian networks of Section 3.2.
As we will see in the next few sections, the main assumptions that we
need for our causal graphical models to tell us how association and
causation flow between variables are the following two:
1. Local Markov Assumption (Assumption 3.1)
2. Causal Edges Assumption (Assumption 3.3)
We will discuss these assumptions throughout the next few sections and
come back to discuss them more fully again in Section 3.8 after we’ve
established the necessary preliminaries.
Now that we’ve gotten the basic assumptions and definitions out of the
way, we can get to the core of this chapter: the flow of association and
causation in DAGs. We can understand this flow in general DAGs by
understanding the flow in the minimal building blocks of graphs. These
minimal building blocks consist of chains (Figure 3.9a), forks (Figure 3.9b),
immoralities (Figure 3.9c), two unconnected nodes (Figure 3.10), and two
connected nodes (Figure 3.11).
Figure 3.12: Chain with flow of association drawn as a dashed red arc.

Chains (Figure 3.12) and forks (Figure 3.13) share the same set of dependencies.
In both structures, 𝑋1 and 𝑋2 are dependent, and 𝑋2 and 𝑋3
are dependent for the same reason that we discussed toward the end
of Section 3.4. Adjacent nodes are always dependent when we make
the causal edges assumption (Assumption 3.3). What about 𝑋1 and 𝑋3?
When we condition on 𝑋2, 𝑋1 and 𝑋3 become independent. For the chain in Figure 3.12, the Bayesian network factorization gives 𝑃(𝑥1, 𝑥2, 𝑥3) = 𝑃(𝑥1) 𝑃(𝑥2 | 𝑥1) 𝑃(𝑥3 | 𝑥2), so

𝑃(𝑥1, 𝑥3 | 𝑥2) = [𝑃(𝑥1, 𝑥2) / 𝑃(𝑥2)] 𝑃(𝑥3 | 𝑥2)   (3.8)
              = 𝑃(𝑥1 | 𝑥2) 𝑃(𝑥3 | 𝑥2)   (3.9)
Recall from Section 3.1 that we have an immorality when we have a child
whose two parents do not have an edge connecting them (Figure 3.16).
And in this graph structure, the child is known as a bastard. No, just
kidding; it’s called a collider.
In contrast to chains and forks, in an immorality, 𝑋1 ⊥⊥ 𝑋3. Look at
the graph structure and think about it a bit. Why would 𝑋1 and 𝑋3 be
associated? One isn’t the descendant of the other like in chains, and
they don’t share a common cause like in forks. Rather, we can think of
𝑋1 and 𝑋3 simply as unrelated events that happen, which happen to
both contribute to some common effect (𝑋2 ). To show this, we apply the
Bayesian network factorization and marginalize out 𝑥 2 :
𝑃(𝑥1, 𝑥3) = ∑𝑥2 𝑃(𝑥1, 𝑥2, 𝑥3)   (3.10)
          = ∑𝑥2 𝑃(𝑥1) 𝑃(𝑥3) 𝑃(𝑥2 | 𝑥1, 𝑥3)   (3.11)
          = 𝑃(𝑥1) 𝑃(𝑥3) ∑𝑥2 𝑃(𝑥2 | 𝑥1, 𝑥3)   (3.12)
          = 𝑃(𝑥1) 𝑃(𝑥3)   (3.13)

Figure 3.16: Immorality with association blocked by a collider.
Figure 3.18: Example data for the “good-looking men are jerks” example. Both looks and kindness are continuous values on a scale from 0
to 10. (a) Looks and kindness data for the whole population; looks and kindness are independent. (b) Looks and kindness data grouped by
whether the person is available or taken; within each group, there is a negative correlation. (c) Looks and kindness data for only the available
people; now, there is a negative correlation.
Numerical Example All of the above has been to give you intuition
about why conditioning on a collider induces association between its
parents, but we have yet to give a concrete numerical example of this.
We will give a simple one here, with a data generating process in which two independent causes share a common effect (a collider).
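The following is a minimal simulation sketch of such a process (the specific choice here, two independent standard normals whose sum is the collider, is ours for illustration): unconditionally the two parents are independent, but once we condition on the collider taking a high value, a clear negative association appears between them.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.normal(size=n)          # one parent
x3 = rng.normal(size=n)          # the other parent, independent of x1
x2 = x1 + x3                     # collider: common effect of x1 and x3

print(np.corrcoef(x1, x3)[0, 1])                       # roughly 0: unconditionally independent
selected = x2 > 1                                      # condition on (a range of) the collider
print(np.corrcoef(x1[selected], x3[selected])[0, 1])   # clearly negative: induced association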
3.7 d-separation

The previous two sections tell us when a path is blocked by a conditioning set (a conditioned-on non-collider blocks a path, and an unconditioned collider with unconditioned descendants blocks a path); a path that is not blocked is unblocked. The graphical intuition to have in mind is that
association flows along unblocked paths, and association does not flow
along blocked paths. If you don’t have this intuition in mind, then it is
probably worth it to reread the previous two sections, with the goal of
gaining this intuition. Now, we are ready to introduce a very important
concept: d-separation.
Definition 3.4 (d-separation) Two (sets of) nodes 𝑋 and 𝑌 are d-separated
by a set of nodes 𝑍 if all of the paths between (any node in) 𝑋 and (any node
in) 𝑌 are blocked by 𝑍 [16].

[16]: Pearl (1988), Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
If all the paths between two nodes 𝑋 and 𝑌 are blocked, then we say that
𝑋 and 𝑌 are d-separated. Similarly, if there exists at least one path between
𝑋 and 𝑌 that is unblocked, then we say that 𝑋 and 𝑌 are d-connected.
As we will see in Theorem 3.1, d-separation is such an important concept
because it implies conditional independence. We will use the notation
𝑋 ⊥⊥𝐺 𝑌 | 𝑍 to denote that 𝑋 and 𝑌 are d-separated in the graph 𝐺
when conditioning on 𝑍. Similarly, we will use the notation 𝑋 ⊥⊥𝑃 𝑌 | 𝑍
to denote that 𝑋 and 𝑌 are independent in the distribution 𝑃 when
conditioning on 𝑍 .
Theorem 3.1 Given that 𝑃 is Markov with respect to 𝐺 (satisfies the local
Markov assumption, Assumption 3.1), if 𝑋 and 𝑌 are d-separated in 𝐺
conditioned on 𝑍 , then 𝑋 and 𝑌 are independent in 𝑃 conditioned on 𝑍 . We
can write this succinctly as follows:
𝑋 ⊥⊥𝐺 𝑌 | 𝑍 ⟹ 𝑋 ⊥⊥𝑃 𝑌 | 𝑍   (3.20)
Because this is so important, we will give Equation 3.20 a name: the global
Markov assumption. Theorem 3.1 tells us that the local Markov assumption
implies the global Markov assumption.
Just as we built up the intuition that suggested that the local Markov
assumption (Assumption 3.1) implies the Bayesian network factorization
(Definition 3.1) and alerted you to the fact that the Bayesian network
factorization also implies the local Markov assumption (the two are equiv-
alent), it turns out that the global Markov assumption also implies the
local Markov assumption. In other words, the local Markov assumption,
global Markov assumption, and the Bayesian network factorization are
all equivalent [see, e.g., 13, Chapter 3]. Therefore, we will use the slightly
shortened phrase Markov assumption to refer to these concepts as a
group, or we will simply write “𝑃 is Markov with respect to 𝐺” to convey
the same meaning.
Active reading exercise: To get some practice with d-separation, here are
some questions about d-separation in Figure 3.19.
Questions about Figure 3.19a:
1. Are 𝑇 and 𝑌 d-separated by the empty set?
2. Are 𝑇 and 𝑌 d-separated by 𝑊2 ?
3. Are 𝑇 and 𝑌 d-separated by {𝑊2 , 𝑀1 }?
4. Are 𝑇 and 𝑌 d-separated by {𝑊1 , 𝑀2 }?
5. Are 𝑇 and 𝑌 d-separated by {𝑊1 , 𝑀2 , 𝑋2 }?
Figure 3.19: The two graphs, (a) and (b), referenced by the d-separation questions above. Graph (a) involves the nodes 𝑇, 𝑌, 𝑊1, 𝑊2, 𝑊3, 𝑀1, 𝑀2, 𝑋1, 𝑋2; graph (b) involves the nodes 𝑇, 𝑌, 𝑊, 𝑋1, 𝑋2, 𝑋3.
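If you would like to check answers like these programmatically on a graph of your choosing, here is a minimal sketch using networkx; the small DAG below is our own example, not Figure 3.19, and depending on your networkx version the function is called d_separated or is_d_separator:

import networkx as nx

# A small DAG of our own: a confounder W (T <- W -> Y) and a mediator M (T -> M -> Y)
G = nx.DiGraph([("W", "T"), ("W", "Y"), ("T", "M"), ("M", "Y")])

print(nx.d_separated(G, {"T"}, {"Y"}, set()))       # False: backdoor path T <- W -> Y is unblocked
print(nx.d_separated(G, {"T"}, {"Y"}, {"W"}))       # False: directed path T -> M -> Y is still unblocked
print(nx.d_separated(G, {"T"}, {"Y"}, {"W", "M"}))  # True: every path between T and Y is blocked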
Causal graphs are special in that we additionally assume that the edges
have causal meaning (causal edges assumption, Assumption 3.3). This
assumption is what introduces causality into our models, and it makes
one type of path take on a whole new meaning: directed paths. This
assumption endows directed paths with the unique role of carrying
causation along them. Additionally, this assumption is asymmetric; “𝑋
is a cause of 𝑌 ” is not the same as saying “𝑌 is a cause of 𝑋 .” This means
that there is an important difference between association and causation:
association is symmetric, whereas causation is asymmetric.
d-separation Implies Association is Causation Given that we have
tools to measure association, how can we isolate causation? In other
words, how can we ensure that the association we measure is causation,
say, for measuring the causal effect of 𝑋 on 𝑌 ? Well, we can do that by
ensuring that there is no non-causal association flowing between 𝑋 and
𝑌 . This is true if 𝑋 and 𝑌 are d-separated in the augmented graph where
we remove outgoing edges from 𝑋 . This is because all of 𝑋 ’s causal effect
on 𝑌 would flow through its outgoing edges, so once those are removed,
the only association that remains is purely non-causal association.
In Figure 3.21, we illustrate what each of the important assumptions
gives us in terms of interpreting this flow of association. First, we have
the (local/global) Markov assumption (Assumption 3.1). As we saw
in Section 3.7, this assumption allows us to know which nodes are
unassociated. In other words, the Markov assumption tells along which
paths the association does not flow. When we slightly strengthen the
Markov assumption to the minimality assumption (Assumption 3.2),
we get which paths association does flow along (except in intransitive
edges cases). When we further add in the causal edges assumption
(Assumption 3.3), we get that causation flows along directed paths.
Therefore, the following two⁸ assumptions are essential for graphical
causal models:

1. Markov Assumption (Assumption 3.1)
2. Causal Edges Assumption (Assumption 3.3)

⁸ Recall that the first part of the minimality assumption is just the local Markov assumption and that the second part is contained in the causal edges assumption.
Figure 3.21: A flowchart that illustrates what kind of claims we can make about our data as we add each additional important assumption.
4 Causal Models
Causal models are essential for identification of causal quantities. When
we presented the Identification-Estimation Flowchart (Figure 2.5) back
in Section 2.4, we described identification as the process of moving
from a causal estimand to a statistical estimand. However, to do that,
we must have a causal model. We depict this fuller version of the
Identification-Estimation Flowchart in Figure 4.1.

Figure 4.1: The Identification-Estimation Flowchart – a flowchart that illustrates the process of moving from a target causal estimand to a
corresponding estimate, through identification and estimation. In contrast to Figure 2.5, this version is augmented with a causal model and
data: a causal estimand plus a causal model yield a statistical estimand, and a statistical estimand plus data yield an estimate.

The previous chapter gives graphical intuition for causal models, but it
doesn’t explain how to identify causal quantities and formalize causal
models. We will do that in this chapter.
Note that we shorten do(𝑇 = 𝑡) to just do(𝑡) in the last option in Equation
4.1. We will use this shorthand throughout the book. We can similarly
write the ATE (average treatment effect) when the treatment is binary as
follows:
𝔼[𝑌 | do(𝑇 = 1)] − 𝔼[𝑌 | do(𝑇 = 0)] (4.2)
We will often work with full distributions like 𝑃(𝑌 | do(𝑡)), rather than
their means, as this is more general; if we characterize 𝑃(𝑌 | do(𝑡)), then
we’ve characterized 𝔼[𝑌 | do(𝑡)]. We will commonly refer to 𝑃(𝑌 | do(𝑇 =
𝑡)) and other expressions with the do-operator in them as interventional
distributions.
Interventional distributions such as 𝑃(𝑌 | do(𝑇 = 𝑡)) are conceptually
quite different from the observational distribution 𝑃(𝑌). Observational
distributions such as 𝑃(𝑌) or 𝑃(𝑌, 𝑇, 𝑋) do not have the do-operator in
them. Because they don’t have the do-operator, we can observe data from
them without needing to carry out any experiment. This is why we call
data from 𝑃(𝑌, 𝑇, 𝑋) observational data. If we can reduce an expression
𝑄 with do in it (an interventional expression) to one without do in it (an
observational expression), then 𝑄 is said to be identifiable. An expression
with a do in it is fundamentally different from an expression without a
do in it, despite the fact that in do-notation, do appears after a regular
conditioning bar. As we discussed in Section 2.4, we will refer to an
estimand as a causal estimand when it contains a do-operator, and we
refer to an estimand as a statistical estimand when it doesn’t contain a
do-operator.
Whenever do(𝑡) appears after the conditioning bar, it means that ev-
erything in that expression is in the post-intervention world where the
intervention do(𝑡) occurs. For example, 𝔼[𝑌 | do(𝑡), 𝑍 = 𝑧] refers to the
expected outcome in the subpopulation where 𝑍 = 𝑧 after the whole
subpopulation has taken treatment 𝑡 . In contrast, 𝔼[𝑌 | 𝑍 = 𝑧] simply
refers to the expected value in the (pre-intervention) population where
individuals take whatever treatment they would normally take (𝑇 ). This
distinction will become important when we get to counterfactuals in
Chapter 14.
Figure: (a) Causal graph for the observational distribution; (b) causal graph after intervention on 𝑇 (interventional distribution); (c) causal
graph after intervention on 𝑇2 (interventional distribution). Each graph involves the nodes 𝑇, 𝑇2, 𝑇3, and 𝑌.
The key thing that changed when we moved from the regular factorization
in Equation 4.3 to the truncated factorization in Equation 4.4 is that the
latter’s product is only over 𝑖 ∉ 𝑆 rather than all 𝑖 . In other words, the
factors for 𝑖 ∈ 𝑆 have been truncated.
To see the power that the truncated factorization gives us, let’s apply it
to identify the causal effect of treatment on outcome in a simple graph.
Specifically, we will identify the causal quantity 𝑃(𝑦 | do(𝑡)). In this
example, the distribution 𝑃 is Markov with respect to the graph in Figure 4.5. The Bayesian network factorization (from the Markov assumption) gives us the following:

𝑃(𝑦, 𝑡, 𝑥) = 𝑃(𝑥) 𝑃(𝑡 | 𝑥) 𝑃(𝑦 | 𝑡, 𝑥)     (4.5)

Figure 4.5: Simple causal structure where 𝑋 confounds the effect of 𝑇 on 𝑌 and where 𝑋 is the only confounder.

When we intervene on the treatment, the truncated factorization (from adding the modularity assumption) gives us the following:

𝑃(𝑦, 𝑥 | do(𝑡)) = 𝑃(𝑥) 𝑃(𝑦 | 𝑡, 𝑥)     (4.6)

Marginalizing out 𝑥 then identifies the interventional distribution we're after:

𝑃(𝑦 | do(𝑡)) = Σ𝑥 𝑃(𝑦 | 𝑡, 𝑥) 𝑃(𝑥)     (4.7)
If we then plug in Equation 4.7 for 𝑃(𝑦 | do(𝑇 = 1)) and 𝑃(𝑦 | do(𝑇 = 0)),
we have a fully identified ATE. Given the simple graph in Figure 4.5, we
have shown how we can use the truncated factorization to identify causal
effects in Equations 4.5 to 4.7. We will now generalize this identification
process to a more general formula.
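To make this identification step concrete, here is a minimal numeric sketch (not from the book's code; the joint distribution below is made up purely for illustration) that computes 𝑃(𝑦 | do(𝑡)) = Σ𝑥 𝑃(𝑥) 𝑃(𝑦 | 𝑡, 𝑥) for binary 𝑋, 𝑇, and 𝑌:

import numpy as np

# Hypothetical joint distribution P(x, t, y) for the graph X -> T, X -> Y, T -> Y,
# indexed as P[x, t, y]; the numbers are made up and just need to sum to 1.
P = np.array([[[0.20, 0.05], [0.05, 0.10]],
              [[0.05, 0.10], [0.10, 0.35]]])

P_x = P.sum(axis=(1, 2))                          # P(x)
P_y_given_tx = P / P.sum(axis=2, keepdims=True)   # P(y | x, t)

# Truncated factorization / backdoor adjustment: P(y | do(t)) = sum_x P(x) P(y | t, x)
def p_y_do_t(t, y):
    return sum(P_x[x] * P_y_given_tx[x, t, y] for x in (0, 1))

ate = p_y_do_t(1, 1) - p_y_do_t(0, 1)   # for binary Y, E[Y | do(t)] = P(Y=1 | do(t))
print('P(Y=1 | do(T=1)) =', p_y_do_t(1, 1))
print('ATE =', ate)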
2. 𝑊 does not contain any descendants of 𝑇.

Active reading exercise: Which set of nodes related to 𝑇 will always be a sufficient adjustment set? Which set of nodes related to 𝑌 will always be a sufficient adjustment set?
Given that 𝑊 satisfies the backdoor criterion, we can write the following:
Σ𝑤 𝑃(𝑦 | do(𝑡), 𝑤) 𝑃(𝑤 | do(𝑡)) = Σ𝑤 𝑃(𝑦 | 𝑡, 𝑤) 𝑃(𝑤 | do(𝑡))     (4.12)
Here’s a concise recap of the proof (Equations 4.11 to 4.13) without all of
the explanation/justification:
Proof.
𝑃(𝑦 | do(𝑡)) = Σ𝑤 𝑃(𝑦 | do(𝑡), 𝑤) 𝑃(𝑤 | do(𝑡))     (4.14)
= Σ𝑤 𝑃(𝑦 | 𝑡, 𝑤) 𝑃(𝑤 | do(𝑡))     (4.15)
= Σ𝑤 𝑃(𝑦 | 𝑡, 𝑤) 𝑃(𝑤)     (4.16)
We can derive this from the more general backdoor adjustment in a few
steps. First, we take an expectation over 𝑌 :
𝔼[𝑌 | do(𝑡)] = Σ𝑤 𝔼[𝑌 | 𝑡, 𝑤] 𝑃(𝑤)     (4.18)
Then, we notice that the sum over 𝑤 and 𝑃(𝑤) is an expectation (for
discrete 𝑤 , but just replace with an integral if not):
(𝑌(1), 𝑌(0)) ⊥⊥ 𝑇 | 𝑊     (4.21)
As Judea Pearl often says, the equals sign in mathematics does not convey
any causal information. Saying 𝐴 = 𝐵 is the same as saying 𝐵 = 𝐴.
Equality is symmetric. However, in order to talk about causation, we
must have something asymmetric. We need to be able to write that 𝐴
is a cause of 𝐵, meaning that changing 𝐴 results in changes in 𝐵, but
changing 𝐵 does not result in changes in 𝐴. This is what we get when we
write the following structural equation:
𝐵 := 𝑓 (𝐴) , (4.22)
where 𝑓 is some function that maps 𝐴 to 𝐵. While the usual “=” symbol
does not give us causal information, this new “:=” symbol does. This
is a major difference that we see when moving from statistical models
to causal models. Now, we have the asymmetry we need to describe
causal relations. However, the mapping between 𝐴 and 𝐵 is deterministic.
Ideally, we’d like to allow it to be probabilistic, which allows room for
some unknown causes of 𝐵 that factor into this mapping. Then, we can
write the following:
𝐵 := 𝑓 (𝐴, 𝑈) , (4.23)
where 𝑈 is some unobserved random variable. We depict this in Figure 4.7, where 𝑈 is drawn inside a dashed node to indicate that it is unobserved. The unobserved 𝑈 is analogous to the randomness that we would see by sampling units (individuals); it denotes all the relevant (noisy) background conditions that determine 𝐵. More concretely, there are analogs to every part of the potential outcome 𝑌𝑖(𝑡): 𝐵 is the analog of 𝑌, 𝐴 = 𝑎 is the analog of 𝑇 = 𝑡, and 𝑈 is the analog of 𝑖.

Figure 4.7: Graph for simple structural equation. The dashed node 𝑈 means that 𝑈 is unobserved.
The functional form of 𝑓 does not need to be specified, and when
left unspecified, we are in the nonparametric regime because we aren’t
making any assumptions about parametric form. Although the mapping
𝑀:  𝐵 := 𝑓𝐵(𝐴, 𝑈𝐵)
    𝐶 := 𝑓𝐶(𝐴, 𝐵, 𝑈𝐶)     (4.24)
    𝐷 := 𝑓𝐷(𝐴, 𝐶, 𝑈𝐷)

In causal graphs, the noise variables are often implicit, rather than explicitly drawn. The variables that we write structural equations for are known as endogenous variables. These are the variables whose causal mechanisms we are modeling – the variables that have parents in the causal graph. In contrast, exogenous variables are variables that do not have any parents in the causal graph; these variables are external to our causal model in the sense that we choose not to model their causes. For example, in the causal model described by Figure 4.8 and Equation 4.24, the endogenous variables are {𝐵, 𝐶, 𝐷}, and the exogenous variables are {𝐴, 𝑈𝐵, 𝑈𝐶, 𝑈𝐷}.

Figure 4.8: Graph for the structural equations in Equation 4.24.
4.5.2 Interventions
𝑀:  𝑇 := 𝑓𝑇(𝑋, 𝑈𝑇)
    𝑌 := 𝑓𝑌(𝑋, 𝑇, 𝑈𝑌)     (4.25)

Figure 4.9: Basic causal graph
If we then intervene on 𝑇 to set it to 𝑡 , we get the interventional SCM 𝑀𝑡
below and corresponding manipulated graph in Figure 4.10.
𝑀𝑡:  𝑇 := 𝑡
     𝑌 := 𝑓𝑌(𝑋, 𝑇, 𝑈𝑌)     (4.26)

Figure 4.10: Basic causal graph with the incoming edges to 𝑇 removed, due to the intervention do(𝑇 = 𝑡).

The fact that do(𝑇 = 𝑡) only changes the equation for 𝑇 and no other variables is a consequence of the modularity assumption; these causal mechanisms (structural equations) are modular. Assumption 4.1 states
the modularity assumption in the context of causal Bayesian networks,
but we need a slightly different translation of this assumption for SCMs.
for the 𝑈 in Equation 4.23 and the paragraph that followed it. This is
the notation that Pearl uses for SCMs as well [see, e.g., 17, Definition 4].
[17]: Pearl (2009), 'Causal inference in statistics: An overview'
So 𝑌𝑡(𝑢) denotes the outcome that unit 𝑢 would observe if they take treatment 𝑡, given that the SCM is 𝑀. Similarly, we define 𝑌𝑀𝑡(𝑢) as
the outcome that unit 𝑢 would observe if they take treatment 𝑡 , given
that the SCM is 𝑀𝑡 (remember that 𝑀𝑡 is the same SCM as 𝑀 but with
the structural equation for 𝑇 changed to 𝑇 := 𝑡 ). Now, we are ready to
present one of Pearl's two key principles from which all other causal results follow:⁸

⁸ Active reading exercise: Can you recall which was the other key principle/assumption?
Definition 4.3 (The Law of Counterfactuals (and Interventions))
Figure 4.16: Causal graph depicting M-Bias.

4.6 Example Applications of the Backdoor Adjustment

4.6.1 Association vs. Causation in a Toy Example

In this section, we posit a toy generative process and derive the bias of the associational quantity 𝔼[𝑌 | 𝑡]. We compare this to the causal quantity 𝔼[𝑌 | do(𝑡)], which gives us exactly what we want. Note that both of these quantities are actually functions of 𝑡. If the treatment were binary, then we would just look at the difference between the quantities with 𝑇 = 1 and with 𝑇 = 0. However, because our generative processes will be linear, d𝔼[𝑌 | 𝑡]/d𝑡 and d𝔼[𝑌 | do(𝑡)]/d𝑡 actually give us all the information about the treatment effect, regardless of whether the treatment is continuous, binary, or multi-valued. We will assume infinite data so that we can work with expectations. This means this section has nothing to do with estimation; for estimation, see the next section.
The generative process that we consider has the causal graph in Figure 4.17
𝑇 := 𝛼1 𝑋     (4.28)
𝑌 := 𝛽𝑇 + 𝛼2 𝑋.     (4.29)

Note that in the structural equation for 𝑌, 𝛽 is the coefficient in front of 𝑇.
𝔼𝑋 𝔼[𝑌 | 𝑡, 𝑋] = 𝔼𝑋 𝔼[𝛽𝑇 + 𝛼2 𝑋 | 𝑇 = 𝑡, 𝑋]     (4.30)
= 𝔼𝑋 [𝛽𝑡 + 𝛼2 𝑋]     (4.31)
= 𝛽𝑡 + 𝛼2 𝔼[𝑋]     (4.32)
Importantly, we made use of the equality that the structural equation for
𝑌 (Equation 4.29) gives us in Equation 4.30. Now, we just have to take
the derivative to get the causal effect:
d(𝔼𝑋 𝔼[𝑌 | 𝑡, 𝑋]) / d𝑡 = 𝛽.     (4.33)
We got exactly what we were looking for. Now, let’s move to the associa-
tional quantity:
In Equation 4.36, we made use of the equality that the structural equation
for 𝑇 (Equation 4.28) gives us. If we then take the derivative, we see that
there is confounding bias:
d𝔼[𝑌 | 𝑡] / d𝑡 = 𝛽 + 𝛼2/𝛼1.     (4.37)
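Here is a minimal simulation sketch (not from the book) of the generative process above. Because 𝑇 := 𝛼1𝑋 is deterministic in this example, a regression that adjusts for 𝑋 would be degenerate, so the sketch adds a small independent noise term to 𝑇 purely to make the regressions well-posed; with that caveat, the unadjusted slope comes out near 𝛽 + 𝛼2/𝛼1 and the adjusted slope near 𝛽.

import numpy as np

rng = np.random.default_rng(0)
n, alpha1, alpha2, beta = 100_000, 2.0, 3.0, 1.5

X = rng.normal(size=n)
T = alpha1 * X + 0.1 * rng.normal(size=n)   # small noise added only for this simulation
Y = beta * T + alpha2 * X

# Unadjusted slope of E[Y | t]: biased by the backdoor path through X
slope_unadjusted = np.polyfit(T, Y, 1)[0]

# Slope after adjusting for X (coefficient on T in a regression of Y on T and X)
coefs, *_ = np.linalg.lstsq(np.column_stack([T, X, np.ones(n)]), Y, rcond=None)
slope_adjusted = coefs[0]

print('beta + alpha2/alpha1 =', beta + alpha2 / alpha1)
print('unadjusted slope     =', slope_unadjusted)
print('adjusted slope       =', slope_adjusted)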
Recall that we estimated a concrete value for the causal effect of sodium
intake on blood pressure in Section 2.5. There, we used the potential
outcomes framework. Here, we will do the same thing, but using causal
graphs. The spoiler is that the 19% error that we saw in Section 2.5 was
due to conditioning on a collider.
[8]: Luque-Fernandez et al. (2018), 'Educational Note: Paradoxical collider effect in the analysis of non-communicable disease epidemiological data: a reproducible illustration and web application'

First, we need to write down our causal assumptions in terms of a causal graph. Remember that in Luque-Fernandez et al. [8]'s example from epidemiology, the treatment 𝑇 is sodium intake, and the outcome 𝑌 is
blood pressure. The covariates are age 𝑊 and amount of protein in urine (proteinuria) 𝑍. Age is a common cause of both blood pressure and the body's ability to self-regulate sodium levels. In contrast, high amounts of urinary protein are caused by high blood pressure and high sodium intake. This means that proteinuria is a collider. We depict this causal graph in Figure 4.18.

Figure 4.18: Causal graph for the blood pressure example. 𝑇 is sodium intake. 𝑌 is blood pressure. 𝑊 is age. And, importantly, the amount of protein excreted in urine 𝑍 is a collider.

Because 𝑍 is a collider, conditioning on it induces bias. Because 𝑊 and 𝑍 were grouped together as "covariates" 𝑋 in Section 2.5, we conditioned on all of them. This is why we saw that our estimate was 19% off from the true causal effect 1.05. Now that we've made the causal relationships clear with a causal graph, the backdoor criterion (Definition 4.1) tells us to only adjust for 𝑊 and to not adjust for 𝑍. More precisely, we were doing the following adjustment in Section 2.5:

𝔼𝑊,𝑍 𝔼[𝑌 | 𝑡, 𝑊, 𝑍]     (4.38)
And now, we will use the backdoor adjustment (Theorem 4.2) to change
our statistical estimand to the following:
𝔼𝑊 𝔼[𝑌 | 𝑡, 𝑊] (4.39)
We have simply removed the collider 𝑍 from the variables we adjust for.
For estimation, just as we did in Section 2.5, we use a model-assisted
estimator. We replace the outer expectation over 𝑊 with an empirical
mean over 𝑊 and replace the conditional expectation 𝔼[𝑌 | 𝑡, 𝑊] with a
machine learning model (in this case, linear regression).
Just as writing down the graph has led us to simply not condition on 𝑍
in Equation 4.39, the code for estimation also barely changes. We need to
change just a single line of code in our previous program (Listing 2.1).
We display the full program with the fixed line of code below:
Full code, complete with simulation, is available at https://github.com/bradyneal/causal-book-code/blob/master/sodium_example.py.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# df is the simulated dataset from the full code linked above.
# Adjust only for age (W); do not condition on the collider Z (proteinuria).
Xt = df[['sodium', 'age']]
y = df['blood_pressure']
model = LinearRegression()
model.fit(Xt, y)

# Predict outcomes with sodium set to 1 and to 0 for every unit, then average the difference
Xt1 = pd.DataFrame.copy(Xt)
Xt1['sodium'] = 1
Xt0 = pd.DataFrame.copy(Xt)
Xt0['sodium'] = 0
ate_est = np.mean(model.predict(Xt1) - model.predict(Xt0))
print('ATE estimate:', ate_est)
in Listing 2.1 to
Xt = df[['sodium', 'age']]
in Listing 4.1. When we run this revised code, we get an ATE estimate of
1.0502, which corresponds to 0.02% error (true value is 1.05) when using
bias (see Section 2.5). When we adjusted for all covariates in Section 2.5,
we reduced the percent error all the way down to 19%. In this section,
we saw this remaining error is due to collider bias. When we removed
the collider bias, by not conditioning on the collider 𝑍 , the error became
non-existent.
Potential Outcomes and M-Bias In fairness to the general culture around the potential outcomes framework, it is common to only condition on pretreatment covariates. This would prevent a practitioner who adheres to this rule from conditioning on the collider 𝑍 in Figure 4.18. However, there is no reason that there can't be pretreatment colliders that induce M-bias (Section 4.5.3). In Figure 4.19, we depict an example of M-bias that is created by conditioning on 𝑍2. We could fix this by additionally conditioning on 𝑍1 and/or 𝑍3, but in this example, they are unobserved (indicated by the dashed lines). This means that the only way to avoid M-bias in Figure 4.19 is to not condition on the covariate 𝑍2.

Figure 4.19: Causal graph depicting M-Bias that can only be avoided by not conditioning on the collider 𝑍2. This is due to the fact that the dashed nodes 𝑍1 and 𝑍3 are unobserved.

4.7 Assumptions Revisited
The first main set of assumptions is encoded by the causal graph that we
write down. Exactly what this causal graph means is determined by two
main assumptions, each of which can take on several different forms:
1. The Modularity Assumption
Different forms:
I Modularity Assumption for Causal Bayesian Networks (Assumption 4.1)
I Modularity Assumption for SCMs (Assumption 4.2)
I The Law of Counterfactuals (Definition 4.3)
2. The Markov Assumption
Different equivalent forms:
I Local Markov assumption (Assumption 3.1)
I Bayesian network factorization (Definition 3.1)
I Global Markov assumption (Theorem 3.1)
Given these two assumptions (and positivity), if the backdoor criterion (Definition 4.1) is satisfied in our assumed causal graph, then we have identification. Note that although the backdoor criterion is a sufficient condition for identification, it is not a necessary condition. We will see this more in Chapter 6.

Now that you're familiar with causal graphical models and SCMs, it may be worth going back and rereading Chapter 2 while trying to make connections to what you've learned about graphical causal models in these past two chapters.

Positivity (Assumption 2.3) is still a very important assumption that we must make, though it is sometimes neglected in the graphical models literature.
[18]: Pearl (2009), Causality
[23]: Pearl (2010), 'On the consistency rule in causal inference: axiom, definition, assumption, or theorem?'
Randomized Experiments 5
Randomized experiments are noticeably different from observational studies. In randomized experiments, the experimenter has complete control over the treatment assignment mechanism (how treatment is assigned). For example, in the most simple kind of randomized experiment, the experimenter randomly assigns (e.g. via coin toss) each participant to either the treatment group or the control group. This complete control over how treatment is chosen is what distinguishes randomized experiments from observational studies. In this simple experimental setup, the treatment isn't a function of covariates at all! In contrast, in observational studies, the treatment is almost always a function of some covariate(s).
As we will see, this difference is key to whether or not confounding is
present in our data.
In randomized experiments, association is causation. This is because ran-
domized experiments are special in that they guarantee that there is no
confounding. As a consequence, this allows us to measure the causal effect
𝔼[𝑌(1)]−𝔼[𝑌(0)] via the associational difference 𝔼[𝑌 | 𝑇 = 1] − 𝔼[𝑌 | 𝑇 = 0].
In the following sections, we explain why this is the case from a variety
of different perspectives. If any one of these explanations clicks with you,
that might be good enough. Definitely stick through to the most visually
appealing explanation in Section 5.3.
5.1 Comparability and Covariate Balance

Ideally, the treatment and control groups would be the same, in all
aspects, except for treatment. This would mean they only differ in the
treatment they receive (i.e. they are comparable). This would allow us to
attribute any difference in the outcomes of the treatment and control
groups to the treatment. Saying that these treatment groups are the same
in everything other than their treatment and outcomes is the same as
saying they have the same distribution of confounders. Because people
often check for this property on observed variables (often what people
mean by “covariates”), this concept is known as covariate balance.
𝑃(𝑋 | 𝑇 = 1) =ᵈ 𝑃(𝑋 | 𝑇 = 0)     (5.1)

The symbol =ᵈ means "equal in distribution."
Randomization implies covariate balance, across all covariates, even tion.”
unobserved ones. Intuitively, this is because the treatment is chosen at
random, regardless of 𝑋 , so the treatment and control groups should
look very similar. The proof is simple. Because 𝑇 is not at all determined
by 𝑋 (solely by a coin flip), 𝑇 is independent of 𝑋 . This means that
𝑃(𝑋 | 𝑇 = 1) =ᵈ 𝑃(𝑋). Similarly, it means 𝑃(𝑋 | 𝑇 = 0) =ᵈ 𝑃(𝑋). Therefore, we have 𝑃(𝑋 | 𝑇 = 1) =ᵈ 𝑃(𝑋 | 𝑇 = 0).
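As a quick illustration (a sketch, not from the book), we can simulate a randomized experiment and check covariate balance empirically: the distribution of a covariate 𝑋 looks the same in the treatment and control groups, even though 𝑋 is never used to assign treatment.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

X = rng.normal(loc=1.0, scale=2.0, size=n)   # some covariate (could even be unobserved)
T = rng.binomial(1, 0.5, size=n)             # randomized treatment: a coin flip that ignores X

# Covariate balance: P(X | T=1) should match P(X | T=0)
print('mean of X | T=1:', X[T == 1].mean(), '  mean of X | T=0:', X[T == 0].mean())
print('std  of X | T=1:', X[T == 1].std(),  '  std  of X | T=0:', X[T == 0].std())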
Although we have proven that randomization implies covariate balance, we have not proven that covariate balance implies that association is causation.¹ We'll now prove that by showing that 𝑃(𝑦 | do(𝑡)) = 𝑃(𝑦 | 𝑡). For the proof, the main property we utilize is that covariate balance implies 𝑋 and 𝑇 are independent.

¹ Recall that the intuition is that covariate balance means that everything is the same between the treatment groups, except for the treatment, so the treatment must be the explanation for the change in 𝑌.
Proof. First, let 𝑋 be a sufficient adjustment set that potentially contains
unobserved variables (randomization also balances unobserved covariates).
Such an adjustment set must exist because we allow it to contain any
variables, observed or unobserved. Then, we have the following from the
backdoor adjustment (Theorem 4.2):
𝑃(𝑦 | do(𝑡)) = Σ𝑥 𝑃(𝑦 | 𝑡, 𝑥) 𝑃(𝑥)     (5.2)

By multiplying by 𝑃(𝑡 | 𝑥)/𝑃(𝑡 | 𝑥), we get the joint distribution in the numerator (and because covariate balance gives us 𝑇 ⊥⊥ 𝑋, the 𝑃(𝑡 | 𝑥) in the denominator equals 𝑃(𝑡)):

= Σ𝑥 𝑃(𝑦, 𝑡, 𝑥) / 𝑃(𝑡)     (5.5)
5.2 Exchangeability

5.3 No Backdoor Paths
The final perspective that we'll look at to see why association is causation in randomized experiments is that of graphical causal models. In regular observational data, there is almost always confounding. For example, in Figure 5.1, we see that 𝑋 is a confounder of the effect of 𝑇 on 𝑌. Non-causal association flows along the backdoor path 𝑇 ← 𝑋 → 𝑌.

Figure 5.1: Causal structure of 𝑋 confounding the effect of 𝑇 on 𝑌.

However, if we randomize 𝑇, something magical happens: 𝑇 no longer has any causal parents, as we depict in Figure 5.2. This is because 𝑇 is purely random. It doesn't depend on anything other than the output of a coin toss (or a quantum random number generator, if you're into that kind of stuff). Because 𝑇 has no incoming edges, under randomization, there are no backdoor paths. So the empty set is a sufficient adjustment set. This means that all of the association that flows from 𝑇 to 𝑌 is causal. We can identify 𝑃(𝑌 | do(𝑇 = 𝑡)) by simply applying the backdoor adjustment (Theorem 4.2), adjusting for the empty set:

𝑃(𝑌 | do(𝑇 = 𝑡)) = 𝑃(𝑌 | 𝑇 = 𝑡)

Figure 5.2: Causal structure when we randomize treatment.
6.1 Frontdoor Adjustment

Figure 6.1: Causal graph where 𝑊 is unobserved, so we cannot block the backdoor path. We depict the flow of causal association and the flow of confounding association with dashed lines.

The high-level intuition for why we can identify the causal effect of 𝑇 on 𝑌 in the graph in Figure 6.1 (even when we can't adjust for the confounder
Step 3 Now that we know how changing 𝑇 changes 𝑀 (step 1) and how
changing 𝑀 changes 𝑌 (step 2), we can combine these two to get how
changing 𝑇 changes 𝑌 (through 𝑀 ):
𝑃(𝑦 | do(𝑡)) = Σ𝑚 𝑃(𝑚 | do(𝑡)) 𝑃(𝑦 | do(𝑚))     (6.3)
The causal graph we've been using (Figure 6.4) is an example of a simple graph that satisfies the frontdoor criterion. To get the full definition, we must first define complete/full mediation: a set of variables 𝑀 completely mediates the effect of 𝑇 on 𝑌 if all causal (directed) paths from 𝑇 to 𝑌 go through 𝑀. We now give the general definition of the frontdoor criterion:

Figure 6.4: Simple causal graph that satisfies the frontdoor criterion.

Definition 6.1 (Frontdoor Criterion) A set of variables 𝑀 satisfies the frontdoor criterion if the following are true:
1. 𝑀 completely mediates the effect of 𝑇 on 𝑌 (i.e. all causal paths from 𝑇 to 𝑌 go through 𝑀).
2. There is no unblocked backdoor path from 𝑇 to 𝑀.
3. All backdoor paths from 𝑀 to 𝑌 are blocked by 𝑇.²

² Active reading exercise: Think of a graph other than Figure 6.4 that satisfies the frontdoor criterion.
Although Equations 6.1 and 6.2 are straightforward applications of the backdoor adjustment, we hand-waved our way to Equation 6.3, which was key to the frontdoor adjustment (Theorem 6.1). We'll now walk through how to get Equation 6.3. Active reading exercise: Feel free to stop reading here and do this yourself.

We are about to enter Equationtown (Figure 6.5), so if you are satisfied with the intuition we gave for step 3 and prefer to not see a lot of equations, feel free to skip to the end of the proof (denoted by the end-of-proof symbol).

Figure 6.5: Equationtown
Even though we’ve removed all the do operators, recall that we are not
done because 𝑊 is unobserved. So we must also remove the 𝑤 from the
expression. This is where we have to get a bit creative.
We want to be able to combine 𝑃(𝑦 | 𝑤, 𝑚) and 𝑃(𝑤) into a joint factor
over both 𝑦 and 𝑤 so that we can marginalize out 𝑤 . To do this, we
need to get 𝑚 behind the conditioning bar of the 𝑃(𝑤) factor. This would be easy if we could just swap 𝑃(𝑤) out for 𝑃(𝑤 | 𝑚) in Equation 6.8.³ The key thing to notice is that we actually can include 𝑚 behind the conditioning bar if 𝑡 were also there, because 𝑇 d-separates 𝑊 from 𝑀 in Figure 6.6. In math, this means that the following equality holds:

³ Active reading exercise: Why would it be easy to marginalize out 𝑤 if it were the case that 𝑃(𝑤) = 𝑃(𝑤 | 𝑚)? And why does this equality not hold?
Great, so how do we get 𝑡 into this party? The usual trick of conditioning on it and marginalizing it out:

𝑃(𝑦 | do(𝑡)) = Σ𝑚 𝑃(𝑚 | 𝑡) Σ𝑤 𝑃(𝑦 | 𝑤, 𝑚) 𝑃(𝑤)     (6.8 revisited)
= Σ𝑚 𝑃(𝑚 | 𝑡) Σ𝑤 𝑃(𝑦 | 𝑤, 𝑚) Σ𝑡′ 𝑃(𝑤 | 𝑡′) 𝑃(𝑡′)     (6.10)
= Σ𝑚 𝑃(𝑚 | 𝑡) Σ𝑤 𝑃(𝑦 | 𝑤, 𝑚) Σ𝑡′ 𝑃(𝑤 | 𝑡′, 𝑚) 𝑃(𝑡′)     (6.11)
= Σ𝑚 𝑃(𝑚 | 𝑡) Σ𝑡′ 𝑃(𝑡′) Σ𝑤 𝑃(𝑦 | 𝑤, 𝑚) 𝑃(𝑤 | 𝑡′, 𝑚)     (6.12)

Figure 6.6: Simple causal graph that satisfies the frontdoor criterion.
This matches the result stated in Theorem 6.1, so we’ve completed the
derivation of the frontdoor adjustment without using the backdoor
adjustment. However, we still need to show that Equation 6.3 is correct
to justify step 3. To do that, all that’s left is to recognize that these parts
match Equations 6.1 and 6.2 and plug those in:

= Σ𝑚 𝑃(𝑚 | do(𝑡)) 𝑃(𝑦 | do(𝑚))     (6.17)

𝑃(𝑚 | do(𝑡)) = 𝑃(𝑚 | 𝑡)     (6.1)
𝑃(𝑦 | do(𝑚)) = Σ𝑡 𝑃(𝑦 | 𝑚, 𝑡) 𝑃(𝑡)     (6.2)
And we’re done! We just needed to be a bit clever with our uses of d-
separation and marginalization. Part of why we went through that proof
is because we will prove the frontdoor adjustment using do-calculus in
Section 6.2. This way you can easily compare a proof using the truncated
factorization to a proof using do-calculus to prove the same result.
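To make the frontdoor adjustment concrete, here is a minimal numeric sketch (not from the book's code) that evaluates Equation 6.4 for binary variables, given hypothetical observational distributions 𝑃(𝑚 | 𝑡), 𝑃(𝑦 | 𝑚, 𝑡), and 𝑃(𝑡); the numbers are made up for illustration.

import numpy as np

# Hypothetical observational distributions for binary T, M, Y (values made up)
P_t = np.array([0.6, 0.4])                            # P(T = t)
P_m_given_t = np.array([[0.8, 0.2],                   # P(M = m | T = t), rows indexed by t
                        [0.3, 0.7]])
P_y_given_mt = np.array([[[0.9, 0.1], [0.6, 0.4]],    # P(Y = y | M = m, T = t), indexed [t, m, y]
                         [[0.5, 0.5], [0.2, 0.8]]])

# Frontdoor adjustment: P(y | do(t)) = sum_m P(m | t) sum_t' P(y | m, t') P(t')
def p_y_do_t(t, y):
    return sum(P_m_given_t[t, m] *
               sum(P_y_given_mt[tp, m, y] * P_t[tp] for tp in (0, 1))
               for m in (0, 1))

print('P(Y=1 | do(T=1)) =', p_y_do_t(1, 1))
print('P(Y=1 | do(T=0)) =', p_y_do_t(0, 1))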
6.2 do-calculus
As we saw in the last section, it turns out that satisfying the backdoor
criterion (Definition 4.1) isn’t necessary to identify causal effects. For
example, if the frontdoor criterion (Definition 6.1) is satisfied, that also
gives us identifiability. This leads to the following questions: can we
identify causal estimands when the associated causal graph satisfies
neither the backdoor criterion nor the frontdoor criterion? If so, how?
Pearl’s do-calculus [24] gives us the answer to these questions. [24]: Pearl (1995), ‘Causal diagrams for
empirical research’
As we will see, the do-calculus gives us tools to identify causal effects
using the causal assumptions encoded in the causal graph. It will allow
us to identify any causal estimand that is identifiable. More concretely,
consider an arbitrary causal estimand 𝑃(𝑌 | do(𝑇 = 𝑡), 𝑋 = 𝑥), where 𝑌
is an arbitrary set of outcome variables, 𝑇 is an arbitrary set of treatment
variables, and 𝑋 is an arbitrary (potentially empty) set of covariates that
we want to choose how specific the causal effect we’re looking at is. Note
that this means we can use do-calculus to identify causal effects where
there are multiple treatments and/or multiple outcomes.
In order to present the rules of do-calculus, we must define a bit of notation for augmented versions of the causal graph 𝐺. Let 𝐺X̄ denote the graph that we get if we take 𝐺 and remove all of the incoming edges to nodes in the set 𝑋; recall from Section 4.2 that this is known as the manipulated graph. Let 𝐺X̲ denote the graph that we get if we take 𝐺 and remove all of the outgoing edges from nodes in the set 𝑋. The mnemonic to help you remember this is to think of parents as drawn above their children in the graph, so the bar above 𝑋 is cutting its incoming edges and the bar below 𝑋 is cutting its outgoing edges. Combining these two, we'll use 𝐺X̄Z̲ to denote the graph with the incoming edges to 𝑋 and the outgoing edges from 𝑍 removed. And recall from Section 3.7 that we use ⊥⊥𝐺 to denote d-separation in 𝐺. We're now ready; do-calculus consists of just three rules:
Now, rather than recreate the proofs for these rules from Pearl [24], we'll give intuition for each of them in terms of concepts we've already seen in this book.
[24]: Pearl (1995), 'Causal diagrams for empirical research'
Rule 1 Intuition If we take Rule 1 and simply remove the intervention
do(𝑡), we get the following (Active reading exercise: what familiar concept
is this?):
𝑃(𝑦 | 𝑧, 𝑤) = 𝑃(𝑦 | 𝑤) if 𝑌 ⊥⊥𝐺 𝑍 | 𝑊 (6.21)
This is just what d-separation gives us under the Markov assumption;
recall from Theorem 3.1 that d-separation in the graph implies conditional
independence in 𝑃 . This means that Rule 1 is simply a generalization of
Theorem 3.1 to interventional distributions.
Rule 2 Intuition Just as with Rule 1, we’ll remove the intervention do(𝑡)
from Rule 2 and see what this reminds us of (Active reading exercise:
what concept does this remind you of?):
To get the equality in this equation, it must be the case that removing the intervention do(𝑧) (which is like taking the manipulated graph and reintroducing the edges going into 𝑍) introduces no new association that can affect 𝑌. Because do(𝑧) removes the incoming edges to 𝑍 to give us 𝐺Z̄, the main association that we need to worry about is association flowing from 𝑍 to 𝑌 in 𝐺Z̄ (causal association). Therefore, you might expect that the condition that gives us the equality in Equation 6.23 is 𝑌 ⊥⊥𝐺Z̄ 𝑍 | 𝑊. However, we have to refine this a bit to prevent inducing association by conditioning on the descendants of colliders (recall from Section 3.6). Namely, 𝑍 could contain colliders in 𝐺, and 𝑊 could contain descendants of these colliders. Therefore, to not induce new association through colliders in 𝑍 when we reintroduce the incoming edges to 𝑍 to get 𝐺, we must limit the set of manipulated nodes to those that are not ancestors of nodes in the conditioning set 𝑊: 𝑍(𝑊).
Completeness of do-calculus Maybe there could exist causal estimands
that are identifiable but that can’t be identified using only the rules of
do-calculus in Theorem 6.2. Fortunately, Shpitser and Pearl [25] and Huang and Valtorta [26] independently proved that this is not the case. They proved that do-calculus is complete, which means that these three rules are sufficient to identify all identifiable causal estimands. Because these proofs are constructive, they also admit algorithms that identify any causal estimand in polynomial time.
[25]: Shpitser and Pearl (2006), 'Identification of Joint Interventional Distributions in Recursive Semi-Markovian Causal Models'
[26]: Huang and Valtorta (2006), 'Pearl's Calculus of Intervention is Complete'
Recall the simple graph we used that satisfies the frontdoor criterion
(Figure 6.7), and recall the frontdoor adjustment:
𝑃(𝑦 | do(𝑡)) = Σ𝑚 𝑃(𝑚 | 𝑡) Σ𝑡′ 𝑃(𝑦 | 𝑚, 𝑡′) 𝑃(𝑡′)     (6.4 revisited)
At the end of Section 6.1, we saw a proof for the frontdoor adjustment using just the truncated factorization. To get an idea for how do-calculus works and the intuition we use in proofs that use it, we'll now do the frontdoor adjustment proof using the rules of do-calculus.

Figure 6.7: Simple causal graph that satisfies the frontdoor criterion.

Proof. Our goal is to identify 𝑃(𝑦 | do(𝑡)). Because we have the intuition we described in Section 6.1 that the full mediator 𝑀 will help us out, the first thing we'll do is introduce 𝑀 into the equation via the marginalization trick:

𝑃(𝑦 | do(𝑡)) = Σ𝑚 𝑃(𝑦 | do(𝑡), 𝑚) 𝑃(𝑚 | do(𝑡))     (6.24)
take two steps of do-calculus. To remove do(𝑡), we’ll need to use Rule 3,
which requires that 𝑇 have no causal effect on 𝑌 in the relevant graph. We
can get to a graph like that by removing the edge from 𝑇 to 𝑀 (Figure 6.9); in do-calculus, we do this by using Rule 2 (in the opposite direction as before) to do(𝑚). We can do this because the existing do(𝑡) makes it so there are no backdoor paths from 𝑀 to 𝑌 in 𝐺T̄ (Figure 6.8).

= Σ𝑚 𝑃(𝑦 | do(𝑡), do(𝑚)) 𝑃(𝑚 | 𝑡)     (6.26)

Figure 6.8: 𝐺T̄
Now, as we planned, we can remove the do(𝑡) using Rule 3. We can use Rule 3 here because there is no causation flowing from 𝑇 to 𝑌 in 𝐺M̄ (Figure 6.9).

= Σ𝑚 𝑃(𝑦 | do(𝑚)) 𝑃(𝑚 | 𝑡)     (6.27)

Figure 6.9: 𝐺M̄

All that's left is to remove this last do-operator. As we discussed in
Section 6.1, 𝑇 blocks the only backdoor path from 𝑀 to 𝑌 in the graph (Figure 6.10). This means that if we can condition on 𝑇, we can get rid of this last do-operator. As usual, we do that by conditioning on and marginalizing out 𝑇. Rearranging a bit and using 𝑡′ for the marginalization since 𝑡 is already present:

= Σ𝑚 𝑃(𝑚 | 𝑡) Σ𝑡′ 𝑃(𝑦 | do(𝑚), 𝑡′) 𝑃(𝑡′ | do(𝑚))     (6.28)

Figure 6.10: 𝐺

Now, we can simply apply Rule 2, since 𝑇 blocks the backdoor path from 𝑀 to 𝑌:

= Σ𝑚 𝑃(𝑚 | 𝑡) Σ𝑡′ 𝑃(𝑦 | 𝑚, 𝑡′) 𝑃(𝑡′ | do(𝑚))     (6.29)
And finally, we can apply Rule 3 to remove the last do(𝑚) because there is no causal effect of 𝑀 on 𝑇 (i.e. there is no directed path from 𝑀 to 𝑇 in the graph in Figure 6.10).

= Σ𝑚 𝑃(𝑚 | 𝑡) Σ𝑡′ 𝑃(𝑦 | 𝑚, 𝑡′) 𝑃(𝑡′)     (6.30)
It’s nice to know that we can identify any causal estimand that is possible
to identify using do-calculus, but this isn’t as satisfying as knowing
whether a causal estimand is identifiable by simply looking at the causal
graph. For example, the backdoor criterion (Definition 4.1) and the
frontdoor criterion (Definition 6.1) gave us simple ways to know for
sure that a causal estimand is identifiable. However, there are plenty of
This criterion generalizes the backdoor criterion (Definition 4.1) and the
frontdoor criterion (Definition 6.1). Like them, it is a sufficient condition
for identifiability:
Figure 6.12: Example graph that satisfies the unconfounded children criterion. (a) Visualization of the flow of confounding association and causal association. (b) Visualization of the isolation of the causal association flowing from 𝑇 to its children, allowing the unconfounded children criterion to imply identifiability.
is blocked by the collider 𝑊2. And we can block the backdoor path 𝑇 ← 𝑊2 → 𝑌 by conditioning on 𝑊2. However, conditioning on 𝑊2
unblocks the other backdoor path where 𝑊2 is a collider. Being able to
block both paths individually does not mean we can block them both with
a single conditioning set. In sum, the unconfounded children criterion is
sufficient but not necessary, and this related condition is necessary but
not sufficient. Also, everything we’ve seen in this section so far is for a
single variable intervention.
Necessary and Sufficient Conditions for Multiple Variable Interventions Shpitser and Pearl [25] provide a necessary and sufficient criterion for identifiability of 𝑃(𝑌 = 𝑦 | do(𝑇 = 𝑡)) when 𝑌 and 𝑇 are arbitrary sets of variables: the hedge criterion. However, this is outside the scope of this book, as it requires more complex objects such as hedges, C-trees, and other leafy objects. Moving further along, Shpitser and Pearl [28] provide a necessary and sufficient criterion for the most general type of causal estimand: conditional causal effects, which take the form 𝑃(𝑌 = 𝑦 | do(𝑇 = 𝑡), 𝑋 = 𝑥), where 𝑌, 𝑇, and 𝑋 are all arbitrary sets of variables.
[25]: Shpitser and Pearl (2006), 'Identification of Joint Interventional Distributions in Recursive Semi-Markovian Causal Models'
[28]: Shpitser and Pearl (2006), 'Identification of Conditional Interventional Distributions'
Active reading exercises:
1. Is the unconfounded children criterion (Definition 6.2) satisfied in Figure 6.13a?
2. Is the unconfounded children criterion satisfied in Figure 6.13b?
3. Can we get identifiability in Figure 6.13b via any simpler criterion
that we’ve seen before?
Figure 6.13: Graphs for the questions about the unconfounded children criterion (panels (a) and (b)).
Estimation 7

In the previous chapter, we covered identification. Once we identify some causal estimand by reducing it to a statistical estimand, we still have more work to do. We need to get a corresponding estimate. In this chapter, we'll cover a variety of estimators that we can use to do this. This isn't meant to be anywhere near exhaustive as there are many different estimators of causal effects, but it is meant to give you a solid introduction to them.

All of the estimators that we include full sections on are model-assisted estimators (recall from Section 2.4). And they all work with arbitrary statistical models such as the ones you might get from scikit-learn [29].

7.1 Preliminaries

Recall from Chapter 2 that we denote the individual treatment effect (ITE) with 𝜏𝑖 and the average treatment effect (ATE) with 𝜏:
𝜏𝑖 ≜ 𝑌𝑖(1) − 𝑌𝑖(0)     (7.1)
𝜏 ≜ 𝔼[𝑌𝑖(1) − 𝑌𝑖(0)]     (7.2)

[29]: Pedregosa et al. (2011), 'Scikit-learn: Machine Learning in Python'
ITEs are the most specific kind of causal effects, but they are hard
to estimate without strong assumptions (on top of those discussed in
Chapters 2 and 4). However, we often want to estimate causal effects that
are a bit more individualized than the ATE.
For example, say we’ve observed an individual’s covariates 𝑥 ; we might
like to use those to estimate a more specific effect for that individual (and
anyone else with covariates 𝑥 ). This brings us to the conditional average
treatment effect (CATE) 𝜏(𝑥):
The 𝑋 that is conditioned on does not need to consist of all of the observed
covariates, but this is often the case when people refer to CATEs. We call
that individualized average treatment effects (IATEs).
ITEs and “CATEs” (what we call IATEs) are sometimes conflated, but
they are not the same. For example, two individuals could have the same
covariates, but their potential outcomes could be different because of
other unobserved differences between these individuals. If we encompass
everything about an individual that is relevant to their potential outcomes
in the vector 𝐼, then ITEs and "CATEs" are the same if 𝑋 = 𝐼. In a causal graph, 𝐼 corresponds to all of the exogenous variables in the magnified graph that have causal association flowing to 𝑌.¹

¹ This paragraph contains a lot of information. Active reading exercise: 1) Convince yourself that ITEs and "CATEs" (what we call IATEs) are the same if 𝑋 = 𝐼. 2) Convince yourself that 𝐼 corresponds to the exogenous variables in the magnified graph that have causal association flowing to 𝑌.
We are interested in estimating the ATE 𝜏. We’ll start with recalling the
adjustment formula (Theorem 2.1), which can be derived as a corollary
of the backdoor adjustment (Theorem 4.2), as we saw in Section 4.4.1:
Figure 7.1: The Identification-Estimation Flowchart – a flowchart that illustrates the process of moving from a target causal estimand to a corresponding estimate, through identification and estimation.
Then, we can fit a statistical model to 𝜇. We will denote that these fitted
models are approximations of 𝜇 with a hat: 𝜇ˆ . We will refer to a model 𝜇ˆ as
a conditional outcome model. Now, we can cleanly write the model-assisted
estimator (for the ATE) that we’ve described:
𝜏ˆ = (1/𝑛) Σ𝑖 (𝜇ˆ(1, 𝑤𝑖) − 𝜇ˆ(0, 𝑤𝑖))     (7.6)

Active reading exercise: What are the two different approximations we make in this estimator and what parts of the statistical estimand in Equation 7.4 do each of them replace?
We will refer to estimators that take this form as conditional outcome model
(COM) estimators. Because minimizing the mean-squared error (MSE) of
predicting 𝑌 from (𝑇, 𝑋) pairs is equivalent to modeling this conditional
expectation [see, e.g., 10, Section 2.4], there are many different models we can use for 𝜇ˆ in Equation 7.6 to get a COM estimator (see, e.g., scikit-learn [29]).
[10]: Hastie et al. (2001), The Elements of Statistical Learning
[29]: Pedregosa et al. (2011), 'Scikit-learn: Machine Learning in Python'
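For instance, here is a minimal COM estimator sketch, assuming a hypothetical pandas DataFrame df with a binary treatment column 't', an outcome column 'y', and confounder columns W_cols; the column names and the gradient-boosting model choice are illustrative, not the book's.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def com_ate(df, W_cols, t_col='t', y_col='y'):
    # Fit a single conditional outcome model mu_hat(t, w) approximating E[Y | T=t, W=w]
    TW = df[[t_col] + W_cols]
    mu_hat = GradientBoostingRegressor().fit(TW, df[y_col])

    # Predict with T set to 1 and to 0 for every unit, then average the difference (Equation 7.6)
    TW1, TW0 = TW.copy(), TW.copy()
    TW1[t_col], TW0[t_col] = 1, 0
    return np.mean(mu_hat.predict(TW1) - mu_hat.predict(TW0))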
For CATE estimation, because we assumed that 𝑊 ∪ 𝑋 is a sufficient adjustment set, rather than just 𝑊,³ we must additionally add 𝑋 as an input to our conditional outcome model. More precisely, for CATE estimation, we define 𝜇 as follows:

³ Active reading exercise: Why do we additionally add 𝑋 to the adjustment set when we are interested in CATEs?

Then, we train a statistical model 𝜇ˆ to predict 𝑌 from (𝑇, 𝑊, 𝑋). And this gives us the following COM estimator for the CATE 𝜏(𝑥):

𝜏ˆ(𝑥) = (1/𝑛𝑥) Σ𝑖:𝑥𝑖=𝑥 (𝜇ˆ(1, 𝑤𝑖, 𝑥) − 𝜇ˆ(0, 𝑤𝑖, 𝑥))     (7.8)

Active reading exercise: Write down the causal estimand and statistical estimand that lead us to the estimator in Equation 7.8, and prove that they're equal under unconfoundedness and positivity. In other words, identify the CATE.

where 𝑛𝑥 is the number of data points that have 𝑥𝑖 = 𝑥. When we are
interested in the IATE (CATE where 𝑋 is all of the observed covariates),
𝑛 𝑥 is often 1, which simplifies our estimator to a simple difference between
predictions:
𝜏ˆ(𝑥𝑖) = 𝜇ˆ(1, 𝑤𝑖, 𝑥𝑖) − 𝜇ˆ(0, 𝑤𝑖, 𝑥𝑖)     (7.9)
Even though IATEs are different from ITEs (𝜏(𝑥𝑖) ≠ 𝜏𝑖), if we really want to give estimates for ITEs, it is relatively common to take this estimator as our estimator of the ITE 𝜏𝑖 as well:

𝜏ˆ𝑖 = 𝜏ˆ(𝑥𝑖) = 𝜇ˆ(1, 𝑤𝑖, 𝑥𝑖) − 𝜇ˆ(0, 𝑤𝑖, 𝑥𝑖)     (7.10)
Though, this will likely be unreliable due to severe positivity violation.⁴

⁴ Active reading exercise: Why is there a severe positivity violation here? Does this only apply in Equation 7.10 or also in Equation 7.9? What if there were multiple units with 𝑥𝑖 = 𝑥?

The Many-Faced Estimator COM estimators have many different names in the literature. For example, they are often called G-computation estimators, parametric G-formula, or standardization in epidemiology and
In order to get the estimate in Equation 7.6, we must train a model that
predicts 𝑌 from (𝑇, 𝑊). However, 𝑇 is often one-dimensional, whereas
𝑊 can be high-dimensional. But the input to 𝜇ˆ for 𝑡 is the only thing that
changes between the two terms inside the sum: 𝜇ˆ(1, 𝑤𝑖) − 𝜇ˆ(0, 𝑤𝑖). Imagine
concatenating 𝑇 to a 100-dimensional vector 𝑊 and then feeding that
through a neural network that we’re using for 𝜇ˆ . It seems reasonable that
the network could ignore 𝑇 while focusing on the other 100 dimensions
of its input. This would result in an ATE estimate of zero. And, indeed,
there is some evidence of COM estimators being biased toward zero
[30].
[30]: Künzel et al. (2019), 'Metalearners for estimating heterogeneous treatment effects using machine learning'

So how can we ensure that the model 𝜇ˆ doesn't ignore 𝑇? Well, we can just train two different models 𝜇ˆ1(𝑤) and 𝜇ˆ0(𝑤) that model 𝜇1(𝑤) and
𝜏ˆ(𝑥) = (1/𝑛𝑥) Σ𝑖:𝑥𝑖=𝑥 (𝜇ˆ1(𝑤𝑖, 𝑥) − 𝜇ˆ0(𝑤𝑖, 𝑥))     (7.13)
While GCOM estimation seems to fix the problem that COM estimation
can have regarding bias toward zero treatment effect, it does have an
important downside. In COM estimation, we were able to make use of
all the data when we estimate the single model 𝜇ˆ . However, in grouped
conditional outcome model estimation, we only use the 𝑇 = 1 group to
estimate 𝜇ˆ 1 , and we only use the 𝑇 = 0 group to estimate 𝜇ˆ 0 . Importantly,
we are missing out on making the most of our data by not using all of
the data to estimate 𝜇ˆ 1 and all of the data to estimate 𝜇ˆ 0 .
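Here is a corresponding GCOM sketch (same hypothetical df, column names, and model class as the COM sketch above): the only change is that we fit one outcome model per treatment group.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def gcom_ate(df, W_cols, t_col='t', y_col='y'):
    treated = df[df[t_col] == 1]
    control = df[df[t_col] == 0]

    # mu_hat_1 uses only treatment group data; mu_hat_0 uses only control group data
    mu1 = GradientBoostingRegressor().fit(treated[W_cols], treated[y_col])
    mu0 = GradientBoostingRegressor().fit(control[W_cols], control[y_col])

    # Average the predicted difference over the whole sample (the GCOM analog of Equation 7.6)
    return np.mean(mu1.predict(df[W_cols]) - mu0.predict(df[W_cols]))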
7.4 Increasing Data Efficiency

In this section, we'll cover two ways to address the problem of data
efficiency that we mentioned is present in GCOM estimation at the end
of the last section: TARNet (Section 7.4.1) and X-Learner (Section 7.4.2).
7.4.1 TARNet
Consider that we’re using neural networks for our statistical models;
starting with that, we'll contrast vanilla COM estimation, GCOM estimation, and TARNet. In vanilla COM estimation, the neural network is used
to predict 𝑌 from (𝑇, 𝑊) (see Figure 7.2a). This has the problem of poten-
tially yielding ATE estimates that are biased toward zero, as the network
might ignore the scalar 𝑇 , especially when 𝑊 is high-dimensional. We
ensure that 𝑇 can’t be ignored in GCOM estimation by using two separate
neural networks for the two treatment groups (Figure 7.2b). However,
this is inefficient as we only use the treatment group data for training
one network and the control group data for training the other network.
We can achieve a middle ground between vanilla COM estimation and
GCOM estimation using Shalit et al. [31]'s TARNet. With TARNet, we use a single network that takes only 𝑊 as input but then branches off into
[31]: Shalit et al. (2017), 'Estimating individual treatment effect: generalization bounds and algorithms'
Figure 7.2: Coarse neural network architectures for vanilla COM estimation (left), GCOM estimation (middle), and TARNet (right). In this figure, we use each arrow to denote a sub-network that has an arbitrary number of layers. (a) A single neural network to model 𝜇(𝑡, 𝑤), used in vanilla COM estimation (Section 7.2). (b) Two neural networks: a network to model 𝜇1(𝑤) (top) and a network to model 𝜇0(𝑤) (bottom), used in GCOM estimation (Section 7.3). (c) TARNet [31]. A single neural network to model 𝜇(𝑡, 𝑤) that branches off into two heads: one for 𝑇 = 1 and one for 𝑇 = 0.
7.4.2 X-Learner
We just saw that one way to increase data efficiency relative to GCOM
estimation is to use TARNet, a COM estimator that shares some qualities
with GCOM estimators. However, TARNet still doesn’t use all of the
data for the full model (neural network). In this section, we will start
with GCOM estimation and build on it to create a class of estimators
that use all of the data for both models that are part of the estimators.
An estimator in this class is known as an X-learner [30]. Unlike TARNet, X-learners are neither COM estimators nor GCOM estimators.
[30]: Künzel et al. (2019), 'Metalearners for estimating heterogeneous treatment effects using machine learning'

There are three steps to X-learning, and the first step is the exact same as what's used in GCOM estimation: estimate 𝜇ˆ1(𝑥) using the treatment group data and estimate 𝜇ˆ0(𝑥) using the control group data.⁷ As before, this can be done with any models that minimize MSE. For simplicity, in this section, we'll be considering IATEs (𝑋 is all of the observed variables) where 𝑋 satisfies the backdoor criterion (𝑋 contains 𝑊 and no descendants of 𝑇).

⁷ Recall that 𝜇ˆ1(𝑤) and 𝜇ˆ0(𝑤) are approximations of 𝔼[𝑌 | 𝑇 = 1, 𝑊 = 𝑤] and 𝔼[𝑌 | 𝑇 = 0, 𝑊 = 𝑤], respectively.
The second step is the most important part as it is both where we end up
using all of the data for both models and where the “X” comes from. We
specify 𝜏ˆ 1,𝑖 for the treatment group ITE estimates and 𝜏ˆ 0,𝑖 for the control
Here, 𝜏ˆ 1,𝑖 is estimated using the treatment group outcomes and the
imputed counterfactual that we get from 𝜇ˆ 0 . Similarly, 𝜏ˆ 0,𝑖 is estimated
using the control group outcomes and the imputed counterfactual that we
get from 𝜇ˆ 1 . If you draw a line between the observed potential outcomes
and a line between the imputed potential outcomes, you can see the
“X” shape. Importantly, this “X” tells us that each treatment group ITE
estimate 𝜏ˆ 1,𝑖 uses both treatment group data (its observed potential
outcome under treatment), and control group data (in 𝜇ˆ 0 ). Similarly, 𝜏ˆ 0,𝑖
is estimated with data from both treatment groups.
However, each ITE estimate only uses a single data point from its
corresponding treatment group. We can fix this by fitting a model 𝜏ˆ 1 (𝑥)
to predict 𝜏ˆ 1,𝑖 from the corresponding treatment group 𝑥 𝑖 ’s. Finally, we
have a model 𝜏ˆ 1 (𝑥) that was fit using all of the data (treatment group
data just now and control group data when 𝜇0 was fit in step 1). Similarly,
we can fit a model 𝜏ˆ 0 (𝑥) to predict 𝜏ˆ 0,𝑖 from the corresponding control
group 𝑥 𝑖 ’s. The output of step 2 is two different estimators for the IATE:
𝜏ˆ 1 (𝑥) and 𝜏ˆ 0 (𝑥).
Finally, in step 3, we combine 𝜏ˆ 1 (𝑥) and 𝜏ˆ 0 (𝑥) together to get our IATE
estimator:
𝜏ˆ(𝑥) = 𝑔(𝑥) 𝜏ˆ0(𝑥) + (1 − 𝑔(𝑥)) 𝜏ˆ1(𝑥)     (7.16)
where 𝑔(𝑥) is some weighting function that produces values between 0 and 1. Künzel et al. [30] report that an estimate of the propensity score (introduced in the next section) works well, but that choosing the constant function 0 or 1 also makes sense if the treatment groups are very different sizes, or that choosing 𝑔(𝑥) to minimize the variance of 𝜏ˆ(𝑥) could also be attractive.
[30]: Künzel et al. (2019), 'Metalearners for estimating heterogeneous treatment effects using machine learning'

Active reading exercise: In this section, we covered the X-learner for IATE estimation. What would an X-learner for more general CATE estimation (𝑋 is arbitrary and doesn't necessarily contain all confounders 𝑊) look like?
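Putting the three steps together, here is a minimal X-learner sketch (hypothetical df and column names as in the earlier sketches; the model classes and the propensity-score choice for 𝑔 are illustrative choices following the description above).

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

def xlearner_iate(df, X_cols, t_col='t', y_col='y'):
    treated, control = df[df[t_col] == 1], df[df[t_col] == 0]

    # Step 1: group-specific outcome models (same as GCOM)
    mu1 = GradientBoostingRegressor().fit(treated[X_cols], treated[y_col])
    mu0 = GradientBoostingRegressor().fit(control[X_cols], control[y_col])

    # Step 2: imputed ITEs that "cross" the groups, then models fit to them
    tau1_i = treated[y_col].values - mu0.predict(treated[X_cols])   # observed Y(1) minus imputed Y(0)
    tau0_i = mu1.predict(control[X_cols]) - control[y_col].values   # imputed Y(1) minus observed Y(0)
    tau1 = GradientBoostingRegressor().fit(treated[X_cols], tau1_i)
    tau0 = GradientBoostingRegressor().fit(control[X_cols], tau0_i)

    # Step 3: combine with a weighting function g(x); here, an estimated propensity score (Equation 7.16)
    g = LogisticRegression(max_iter=1000).fit(df[X_cols], df[t_col]).predict_proba(df[X_cols])[:, 1]
    return g * tau0.predict(df[X_cols]) + (1 - g) * tau1.predict(df[X_cols])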
7.5 Propensity Scores

Given that the vector of variables 𝑊 satisfies the backdoor criterion (or, equivalently, that (𝑌(1), 𝑌(0)) ⊥⊥ 𝑇 | 𝑊), we might wonder if it is really
necessary to condition on that whole vector to isolate causal association,
especially when 𝑊 is high-dimensional. It turns out that it isn’t. If 𝑊
satisfies unconfoundedness and positivity, then we can actually get away
with only conditioning on the scalar 𝑃(𝑇 = 1 | 𝑊). We’ll let 𝑒(𝑤) denote
𝑃(𝑇 = 1 | 𝑊 = 𝑤), as we’ll refer to 𝑒(𝑤) as the propensity score since it is
the propensity for (probability of) receiving treatment given that 𝑊 is
𝑤 . The magic of being able to condition on the scalar 𝑒(𝑊) in the place
of the vector 𝑊 is due to Rosenbaum and Rubin [32]'s propensity score theorem:
[32]: Rosenbaum and Rubin (1983), 'The central role of the propensity score in observational studies for causal effects'

Equivalently,

(𝑌(1), 𝑌(0)) ⊥⊥ 𝑇 | 𝑊 ⟹ (𝑌(1), 𝑌(0)) ⊥⊥ 𝑇 | 𝑒(𝑊).     (7.17)
We provide a more traditional mathematical proof in Appendix A.2 and give a graphical proof here. Consider the graph in Figure 7.3. Because the edge from 𝑊 to 𝑇 is a symbol for the mechanism 𝑃(𝑇 | 𝑊) and because the propensity score completely describes that distribution (𝑃(𝑇 = 1 | 𝑊) = 𝑒(𝑊)), we can think of the propensity score as a full mediator of the effect of 𝑊 on 𝑇. This means that we can redraw this graph with 𝑒(𝑊) situated between 𝑊 and 𝑇. And in this redrawn graph in Figure 7.4, we can see that 𝑒(𝑊) blocks all backdoor paths that 𝑊 blocks, so 𝑒(𝑊) must be a sufficient adjustment set if 𝑊 is. Therefore, we have a graphical proof of the propensity score theorem using the backdoor adjustment (Theorem 4.2).

Figure 7.3: Simple graph where 𝑊 satisfies the backdoor criterion
Importantly, this theorem means that we can swap in 𝑒(𝑊) in place of 𝑊 wherever we are adjusting for 𝑊 in a given estimator in this chapter. For example, this seems very useful when 𝑊 is high-dimensional.

Figure 7.4: Graph illustrating that 𝑒(𝑊) blocks the backdoor path(s) that 𝑊 blocks.

Recall The Positivity-Unconfoundedness Tradeoff from Section 2.3.4. As we condition on more non-collider-bias-inducing variables, we decrease confounding. However, this comes at the cost of decreasing overlap because the 𝑊 in 𝑃(𝑇 = 1 | 𝑊) becomes higher and higher dimensional. The propensity score seems to allow us to magically fix that issue since 𝑒(𝑊) remains a scalar, even as 𝑊 grows in dimension. Fantastic, right?
Well, unfortunately, we usually don’t have access to 𝑒(𝑊). Rather, the
best we can do is model it. We do this by training a model to predict 𝑇
from 𝑊 . For example, logistic regression (logit model) is very commonly
used to do this. And because this model is fit to the high-dimensional 𝑊 ,
in some sense, we have just shifted the positivity problem to our model
for 𝑒(𝑊).
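A minimal sketch of that modeling step (hypothetical df and column names as in the earlier sketches; logistic regression as mentioned above):

from sklearn.linear_model import LogisticRegression

# df and W_cols are the same hypothetical DataFrame and confounder columns as in the sketches above
propensity_model = LogisticRegression(max_iter=1000).fit(df[W_cols], df['t'])
e_hat = propensity_model.predict_proba(df[W_cols])[:, 1]   # estimated e(W) = P(T = 1 | W)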
7.6 Inverse Probability Weighting (IPW)

Association is not causation in the graph in Figure 7.5 because 𝑊 is a common cause of 𝑇 and 𝑌. In other words, the mechanism that generates 𝑇 depends on 𝑊, and the mechanism that generates 𝑌 depends on 𝑊. Focusing on the mechanism that generates 𝑇, we can write this mathematically as 𝑃(𝑇 | 𝑊) ≠ 𝑃(𝑇). It turns out that we can reweight the data to get a pseudo-population where 𝑃(𝑇 | 𝑊) = 𝑃(𝑇) or 𝑃(𝑇 | 𝑊) equals some constant; the important part is that we make 𝑇 independent of 𝑊. The corresponding graph for such a pseudo-population has no edge from 𝑊 to 𝑇 because 𝑇 does not depend on 𝑊; we depict this in Figure 7.6.

Figure 7.5: Simple graph where 𝑊 confounds the effect of 𝑇 on 𝑌

Figure 7.6: Effective graph for the pseudo-population that we get by reweighting the data generated according to the graph in Figure 7.5 using inverse probability weighting.

It turns out that the propensity score is key to this reweighting. All we have to do is reweight each data point with treatment 𝑇 and confounders
𝔼[𝑌(𝑡)] = 𝔼[ 1(𝑇 = 𝑡) 𝑌 / 𝑃(𝑡 | 𝑊) ]     (7.18)
Weight Trimming As you can see in Equations 7.20 and 7.21, if the
propensity scores are very close to 0 or 1, the estimates will blow up. In
order to prevent this, it is not uncommon to trim the propensity scores
that are less than 𝜖 to 𝜖 and those that are greater than 1 − 𝜖 to 1 − 𝜖
(effectively trimming the weights to be no larger than 1/𝜖), though this introduces its own problems such as bias.
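Here is a minimal IPW ATE sketch based on Equation 7.18 (hypothetical df and column names as before; the propensity model and the trimming threshold are illustrative choices):

import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(df, W_cols, t_col='t', y_col='y', eps=0.01):
    T, Y = df[t_col].values, df[y_col].values

    # Estimated propensity scores e_hat(w) = P(T=1 | W=w), trimmed to [eps, 1 - eps]
    e_hat = LogisticRegression(max_iter=1000).fit(df[W_cols], T).predict_proba(df[W_cols])[:, 1]
    e_hat = np.clip(e_hat, eps, 1 - eps)

    # Sample analog of E[1(T=t) Y / P(t | W)] for t = 1 minus t = 0
    return np.mean(T * Y / e_hat) - np.mean((1 - T) * Y / (1 - e_hat))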
CATE Estimation We can extend the ATE estimator in Equation 7.20
to get an IPW estimator for the CATE 𝜏(𝑥) by just restricting to the data
points where 𝑥 𝑖 = 𝑥 :
However, there is some controversy over how well doubly robust methods work in practice if not at least one of 𝜇ˆ or 𝑒ˆ is well-specified [35]. Though, this might be contested as we get better at using doubly robust estimators with flexible machine learning models (see, e.g., [36]). Meanwhile, the estimators that currently seem to do the best all flexibly model 𝜇 (unlike pure IPW estimators) [37]. This is why we began this chapter with estimators that model 𝜇 and dedicated several sections to such estimators.

Doubly robust methods are largely outside the scope of this book, so we refer the reader to an introduction by Seaman and Vansteelandt [38], along with other seminal works on the topic: [39–41]. Additionally, there is a large body of doubly robust work on methods that have performed reasonably well in competitions [37]; this category is known as targeted maximum likelihood estimation (TMLE) [42–44].

[35]: Kang and Schafer (2007), 'Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data'
[36]: Zivich and Breskin (2020), Machine learning for causal inference: on the use of cross-fit estimators
[37]: Dorie et al. (2019), 'Automated versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition'
[38]: Seaman and Vansteelandt (2018), 'Introduction to Double Robust Methods for Incomplete Data'
[39]: Tsiatis (2007), Semiparametric theory and missing data
[40]: Robins et al. (1994), 'Estimation of Regression Coefficients When Some Regressors are not Always Observed'
[41]: Bang and Robins (2005), 'Doubly Robust Estimation in Missing Data and Causal Inference Models'
[42]: Van Der Laan and Rubin (2006), 'Targeted maximum likelihood learning'
[43]: Schuler and Rose (2017), 'Targeted Maximum Likelihood Estimation for Causal Inference in Observational Studies'
[44]: Van der Laan and Rose (2011), Targeted learning: causal inference for observational and experimental data

7.8 Other Methods

As this chapter is only an introduction to estimation in causal inference, there are some methods that we've entirely left out. We'll briefly describe some of the most popular ones in this section.

Matching In matching methods, we try to match units in the treatment group with units in the control group and throw away the non-matches to create comparable groups. We can match in raw covariate space,
coarsened covariate space, or propensity score space. There are different
distance functions for deciding how close two units are. Furthermore,
there are different criteria for deciding whether a given distance is close
enough to count as a match (one criterion requires an exact match), how
many matches each treatment group unit can have, how many matches
each control group unit can have, etc. See, for example, Stuart [45] for a review.
[45]: Stuart (2010), 'Matching Methods for Causal Inference: A Review and a Look Forward'
Double Machine Learning In double machine learning, we fit three
models in two stages: two in the first stage and a final model in the second
stage. First stage:
1. Fit a model to predict 𝑌 from 𝑊 to get the predicted 𝑌ˆ.¹¹
2. Fit a model to predict 𝑇 from 𝑊 to get the predicted 𝑇ˆ (see the sketch below).

¹¹ Active reading exercise: How is this model different from 𝜇ˆ?
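The second-stage details are not quoted here, so the following is only a minimal sketch of one common variant (an assumption on my part: the "partialling out" second stage that regresses the outcome residuals 𝑌 − 𝑌ˆ on the treatment residuals 𝑇 − 𝑇ˆ), with hypothetical df and column names as before:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

def dml_ate_sketch(df, W_cols, t_col='t', y_col='y'):
    W, T, Y = df[W_cols], df[t_col].values, df[y_col].values

    # First stage: predict Y from W and T from W (ideally with cross-fitting, omitted here)
    Y_hat = GradientBoostingRegressor().fit(W, Y).predict(W)
    T_hat = GradientBoostingRegressor().fit(W, T).predict(W)

    # Assumed second stage ("partialling out"): regress outcome residuals on treatment residuals
    second_stage = LinearRegression().fit((T - T_hat).reshape(-1, 1), Y - Y_hat)
    return second_stage.coef_[0]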
is known as causal forests [51], which are part of a more general class known as generalized random forests [52]. Importantly, these methods were developed with the goal in mind of yielding valid confidence intervals for the estimates.
[50]: Athey and Imbens (2016), 'Recursive partitioning for heterogeneous causal effects'
[51]: Wager and Athey (2018), 'Estimation and Inference of Heterogeneous Treatment Effects using Random Forests'
[52]: Athey et al. (2019), 'Generalized random forests'

7.9 Concluding Remarks
So far, in this chapter, we have only discussed point estimates for causal
effects. We haven’t discussed how we can gauge our uncertainty due
to data sampling. We haven’t discussed how to calculate confidence
intervals on these estimates. This is a machine learning perspective, after
all; who cares about confidence intervals... Jokes aside, because we are
allowing for arbitrary machine learning models in all of the estimators
we discuss, it is actually quite difficult to get valid confidence intervals.
Bootstrapping One way to get confidence intervals is to use bootstrap-
ping. With bootstrapping, we repeat the causal effect estimation process
many times, each time with a different sample (with replacement) from
our data. This allows us to build an empirical distribution for the estimate.
We can then compute whatever confidence interval we like from that em-
pirical distribution. Unfortunately, bootstrapped confidence intervals are
not always valid. For example, if we take a bootstrapped 95% confidence
interval, it might not contain the true value (estimand) 95% of the time.
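A minimal bootstrap sketch (illustrative; estimator stands for any of the causal effect estimators in this chapter, e.g. the COM or IPW sketches above):

import numpy as np

def bootstrap_ci(df, estimator, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample rows with replacement and re-run the full estimation process
        resampled = df.iloc[rng.integers(0, len(df), size=len(df))]
        estimates.append(estimator(resampled))
    # Percentile interval from the empirical distribution of the estimates
    return np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])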
Specialized Models Another way to get confidence intervals is to
analyze very specific models, rather than allowing for arbitrary models. Linear models are the simplest example of this; it is easy to get confidence
intervals in linear models. Similarly, if we use a linear model as the second
stage model in double machine learning, we can get confidence intervals.
Noticeably, causal trees and causal forests were developed with the goal
in mind of getting confidence intervals.
Figure 8.1: (a) No unobserved confounding. (b) Unobserved confounding (𝑈). On the left, we have the setting we have considered up till now, where we have unconfoundedness / the backdoor criterion. On the right, we have a simple graph where the unobserved confounder 𝑈 makes the causal effect of 𝑇 on 𝑌 not identifiable.
8.1 Bounds
Say all we know about the potential outcomes 𝑌(0) and 𝑌(1) is that they
are between 0 and 1. Then, the maximum value of an ITE 𝑌𝑖 (1) − 𝑌𝑖 (0) is
1 (1 - 0), and the minimum is -1 (0 - 1):
These are intervals of length (𝑏 − 𝑎)−(𝑎 −𝑏) = 2(𝑏 − 𝑎). And the bounds for
the ITEs cannot be made tighter without further assumptions. However,
seemingly magically, we can halve the length of the interval for the ATE.
To see this, we rewrite the ATE as follows:
The resulting ATE interval has length

(𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑏 − 𝜋 𝑎 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0])
− (𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑎 − 𝜋 𝑏 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0])
= (1 − 𝜋) 𝑏 + 𝜋 𝑏 − 𝜋 𝑎 − (1 − 𝜋) 𝑎   (8.11)
= 𝑏 − 𝑎   (8.12)
Running Example
and get smaller intervals by bounding the counterfactual parts using the
different assumptions we make.
The intervals we will see in the next couple of subsections will all contain
zero. We won’t see an interval that is purely positive or purely negative
until Section 8.1.4, so feel free to skip to that section if you only want to
see those intervals.
First, suppose the treatment can only help people, never hurt them. This is the nonnegative monotone treatment response (MTR) assumption:
This means that every ITE is nonnegative, so we can bring our lower
bound on the ITEs up from 𝑎 − 𝑏 (Equation 8.3) to 0. So, intuitively, this
should mean that our lower bound on the ATE should move up to 0. And
we will now see that this is the case.
Now, rather than lower bounding 𝔼[𝑌(1) | 𝑇 = 0] with 𝑎 and −𝔼[𝑌(0) |
𝑇 = 1] with −𝑏 , we can do better. Because the treatment only helps,
𝔼[𝑌(1) | 𝑇 = 0] ≥ 𝔼[𝑌(0) | 𝑇 = 0] = 𝔼[𝑌 | 𝑇 = 0], so we can lower
bound 𝔼[𝑌(1) | 𝑇 = 0] with 𝔼[𝑌 | 𝑇 = 0]. Similarly, −𝔼[𝑌(0) | 𝑇 = 1] ≥
−𝔼[𝑌(1) | 𝑇 = 1] = −𝔼[𝑌 | 𝑇 = 1] (since multiplying by a negative flips the
inequality), so we can lower bound −𝔼[𝑌(0) | 𝑇 = 1] with −𝔼[𝑌 | 𝑇 = 1].
Therefore, we can improve on the no-assumptions lower bound³ to get 0, as our intuition suggested:

𝔼[𝑌(1) − 𝑌(0)] = 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌(1) | 𝑇 = 0] − 𝜋 𝔼[𝑌(0) | 𝑇 = 1] − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.8 revisited)
≥ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0] − 𝜋 𝔼[𝑌 | 𝑇 = 1] − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.17)
= 0   (8.18)

³ Recall that by only assuming that outcomes are bounded between 𝑎 and 𝑏, we get the no-assumptions lower bound (Proposition 8.2):
𝔼[𝑌(1) − 𝑌(0)] ≥ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑎 − 𝜋 𝑏 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.10 revisited)
Running Example The no-assumptions upper bound⁴ still applies here, so in our running example from Section 8.1.1 where 𝜋 = .3, 𝔼[𝑌 | 𝑇 = 1] = .9, and 𝔼[𝑌 | 𝑇 = 0] = .2, our ATE interval improves from [−0.17, 0.83] (Equation 8.15) to [0, 0.83].

⁴ Recall the no-assumptions upper bound (Proposition 8.2):
𝔼[𝑌(1) − 𝑌(0)] ≤ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑏 − 𝜋 𝑎 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.9 revisited)

Alternatively, say the treatment can only hurt people; it can’t help them (e.g. a gunshot wound only hurts chances of staying alive). In those cases, we would have the nonpositive monotone treatment response assumption and the nonpositive MTR upper bound:
Proposition 8.4 (Nonpositive MTR Upper Bound) Under the nonpositive MTR assumption, the ATE is bounded from above by 0. Mathematically,
Active reading exercise: Prove Proposition 8.4.
Running Example And in this setting, the no-assumptions lower bound⁵ still applies. That means that the ATE interval in our example improves from [−0.17, 0.83] (Equation 8.15) to [−0.17, 0].

⁵ Recall the no-assumptions lower bound (Proposition 8.2):
𝔼[𝑌(1) − 𝑌(0)] ≥ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑎 − 𝜋 𝑏 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.10 revisited)

Active reading exercise: What is the ATE interval if we assume both nonnegative MTR and nonpositive MTR? Does this make sense, intuitively?
The next assumption that we’ll consider is the assumption that the people who selected treatment would have better outcomes than those who didn’t select treatment, under either treatment scenario. Manski and Pepper [59] introduced this as the monotone treatment selection (MTS) assumption.
[59]: Manski and Pepper (2000), ‘Monotone Instrumental Variables: With an Application to the Returns to Schooling’

As Morgan and Winship [12, Section 12.2.2] point out, you might think of this as positive self-selection. Those who generally get better outcomes self-select into the treatment group.
[12]: Morgan and Winship (2014), Counterfactuals and Causal Inference: Methods and Principles for Social Research

Again, we start with the observational-counterfactual decomposition, and we now obtain an upper bound using the MTS assumption (Assumption 8.4):
Proof.
where Equation 8.25 followed from the fact that (a) Equation 8.22 of the
MTS assumption allows us to upper bound 𝔼[𝑌(1) | 𝑇 = 0] by 𝔼[𝑌(1) |
𝑇 = 1] = 𝔼[𝑌 | 𝑇 = 1] and (b) Equation 8.23 of the MTS assumption
allows us to upper bound −𝔼[𝑌(0) | 𝑇 = 1] by −𝔼[𝑌 | 𝑇 = 0].
Running Example Recall our running example from Section 8.1.1 where 𝜋 = .3, 𝔼[𝑌 | 𝑇 = 1] = .9, and 𝔼[𝑌 | 𝑇 = 0] = .2. The MTS assumption gives us an upper bound, and we still have the no-assumptions lower bound.⁶ That means that the ATE interval in our example improves from [−0.17, 0.83] (Equation 8.15) to [−0.17, 0.7].

⁶ Recall the no-assumptions lower bound (Proposition 8.2):
𝔼[𝑌(1) − 𝑌(0)] ≥ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑎 − 𝜋 𝑏 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.10 revisited)

Both MTR and MTS Then, we can combine the nonnegative MTR assumption (Assumption 8.2) with the MTS assumption (Assumption 8.4) to get the lower bound in Proposition 8.3 and the upper bound in Proposition 8.5, respectively. In our running example, this yields the following interval for the ATE: [0, 0.7].
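To tie the running example together, here is a small illustrative function that computes the no-assumptions, nonnegative MTR, MTS, and combined intervals directly from the observational quantities 𝜋, 𝔼[𝑌 | 𝑇 = 1], 𝔼[𝑌 | 𝑇 = 0] and the outcome bounds 𝑎 and 𝑏 (the function and argument names are placeholders):

def ate_intervals(pi, ey1, ey0, a, b):
    # Inputs are purely observational: pi = P(T = 1), ey1 = E[Y | T = 1],
    # ey0 = E[Y | T = 0], and the outcome bounds a <= Y <= b.
    no_assumptions = (pi * ey1 + (1 - pi) * a - pi * b - (1 - pi) * ey0,
                      pi * ey1 + (1 - pi) * b - pi * a - (1 - pi) * ey0)
    nonneg_mtr = (0.0, no_assumptions[1])       # lower bound moves up to 0
    mts = (no_assumptions[0], ey1 - ey0)        # upper bound moves down
    mtr_and_mts = (0.0, ey1 - ey0)
    return no_assumptions, nonneg_mtr, mts, mtr_and_mts

# Running example: pi = .3, E[Y | T = 1] = .9, E[Y | T = 0] = .2, a = 0, b = 1
print(ate_intervals(0.3, 0.9, 0.2, 0, 1))
# -> approximately ((-0.17, 0.83), (0.0, 0.83), (-0.17, 0.7), (0.0, 0.7))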
Intervals Contain Zero Although bounds from the MTR and MTS
assumptions can be useful for ruling out very large or very small causal
effects, the corresponding intervals still contain zero. This means that
these assumptions are not enough to identify whether there is an effect
or not.
We now consider what we will call the optimal treatment selection (OTS) assumption from Manski [55]. This assumption means that the individuals always receive the treatment that is best for them (e.g. if an expert doctor decides which treatment each individual receives).
[55]: Manski (1990), ‘Nonparametric Bounds on Treatment Effects’
𝔼[𝑌(1) − 𝑌(0)] = 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌(1) | 𝑇 = 0] − 𝜋 𝔼[𝑌(0) | 𝑇 = 1] − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.8 revisited)
≤ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0] − 𝜋 𝑎 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.29)
= 𝜋 𝔼[𝑌 | 𝑇 = 1] − 𝜋 𝑎   (8.30)

Recall the no-assumptions upper bound (Proposition 8.2):
𝔼[𝑌(1) − 𝑌(0)] ≤ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑏 − 𝜋 𝑎 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.9 revisited)

𝔼[𝑌(1) − 𝑌(0)] = 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌(1) | 𝑇 = 0] − 𝜋 𝔼[𝑌(0) | 𝑇 = 1] − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.8 revisited)
≥ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑎 − 𝜋 𝔼[𝑌 | 𝑇 = 1] − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]
= (1 − 𝜋) (𝑎 − 𝔼[𝑌 | 𝑇 = 0])

Recall the no-assumptions lower bound (Proposition 8.2):
𝔼[𝑌(1) − 𝑌(0)] ≥ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑎 − 𝜋 𝑏 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.10 revisited)
Unfortunately, this interval also always contains zero!⁹ This means that Proposition 8.6 doesn’t tell us whether the causal effect is non-zero or not.
⁹ Active reading exercise: Show that this interval always contains zero.
Running Example Recall our running example from Section 8.1.1 where
𝑎 = 0, 𝑏 = 1, 𝜋 = .3, 𝔼[𝑌 | 𝑇 = 1] = .9, and 𝔼[𝑌 | 𝑇 = 0] = .2. Plugging
these into Proposition 8.6 gives us the following:
We’ll now give an interval that can be purely positive or purely negative,
potentially identifying the ATE as non-zero.
It turns out that, although we take the OTS assumption from Manski [55], the bound we gave in Proposition 8.6 is not actually the bound that Manski [55] derives with that assumption. For example, where we used
[55]: Manski (1990), ‘Nonparametric Bounds on Treatment Effects’

= 𝔼[𝑌(1) | 𝑇 = 1]   (8.47)
= 𝔼[𝑌 | 𝑇 = 1]   (8.48)

Equation 8.49:
This interval can also include zero, but it doesn’t have to. For example, in
our running example, it doesn’t.
Running Example Recall our running example from Section 8.1.1 where 𝑎 = 0, 𝑏 = 1, 𝜋 = .3, 𝔼[𝑌 | 𝑇 = 1] = .9, and 𝔼[𝑌 | 𝑇 = 0] = .2. Plugging these into Proposition 8.7 gives us the following for the OTS bound 2:

𝔼[𝑌(1) − 𝑌(0)] ≤ (.9) − (.3)(0) − (1 − .3)(.2)   (8.56)
𝔼[𝑌(1) − 𝑌(0)] ≥ (.3)(.9) + (1 − .3)(0) − (.2)   (8.57)
0.07 ≤ 𝔼[𝑌(1) − 𝑌(0)] ≤ 0.76   (8.58)
Interval Length = 0.69   (8.59)

Application of OTS bound 1 (Proposition 8.6) to our running example:
−0.14 ≤ 𝔼[𝑌(1) − 𝑌(0)] ≤ 0.27   (8.39 revisited)
Interval Length = 0.41   (8.40 revisited)
So while the OTS bound 2 from Manski [55] identifies the sign of the ATE in our running example, unlike the OTS bound 1, the OTS bound 2 gives us a 68% larger interval. You can see this by comparing Equation 8.40 (in the above margin) with Equation 8.59.
[55]: Manski (1990), ‘Nonparametric Bounds on Treatment Effects’
This illustrates some important takeaways:
1. Different bounds are better in different cases.¹²
2. Different bounds can be better in different ways (e.g., identifying the sign vs. getting a smaller interval).
¹² Active reading exercise: Using Equations 8.40 and 8.59, derive the conditions under which OTS bound 1 yields a smaller interval and the conditions under which OTS bound 2 yields a smaller interval.
Mixing Bounds Fortunately, because both the OTS bound 1 and OTS bound 2 come from the same assumption (Assumption 8.5), we can take the lower bound from OTS bound 2 and the upper bound from OTS bound 1 to get the following tighter interval that still identifies the sign:
[54]: Manski (1989), ‘Anatomy of the Selection Problem’
𝑇 := 𝛼 𝑤 𝑊 + 𝛼 𝑢 𝑈 (8.61)
𝑌 := 𝛽 𝑤 𝑊 + 𝛽 𝑢 𝑈 + 𝛿𝑇 (8.62)
Figure 8.3: Simple causal structure where 𝑊 represents the observed confounders and 𝑈 the unobserved confounders.

So the relevant quantity that describes causal effects of 𝑇 on 𝑌 is 𝛿, since it is the coefficient in front of 𝑇 in the structural equation for 𝑌. From the backdoor adjustment (Theorem 4.2) / adjustment formula (Theorem 2.1), we know that
𝔼[𝑌(1) − 𝑌(0)] = 𝔼𝑊,𝑈 [𝔼[𝑌 | 𝑇 = 1, 𝑊, 𝑈] − 𝔼[𝑌 | 𝑇 = 0, 𝑊, 𝑈]] = 𝛿   (8.63)

¹³ Active reading exercise: What assumption is violated when the data are generated by a noiseless process?
But because 𝑈 isn’t observed, the best we can do is adjust for only 𝑊. This leads to a confounding bias of 𝛽𝑢/𝛼𝑢. We’ll be focusing on identification, not estimation, here, so we’ll consider that we have infinite data. This means that we have access to 𝑃(𝑊, 𝑇, 𝑌). Then, we’ll write down and prove the following proposition about confounding bias:
Proposition 8.8 When 𝑇 and 𝑌 are generated by the noiseless linear process in Equations 8.61 and 8.62, the confounding bias of adjusting for just 𝑊 (and not 𝑈) is 𝛽𝑢/𝛼𝑢. Mathematically:
This is where we use the structural equation for 𝑇 (Equation 8.61): 𝑇 := 𝛼𝑤 𝑊 + 𝛼𝑢 𝑈 (8.61 revisited). Rearranging it gives us 𝑈 = (𝑇 − 𝛼𝑤 𝑊)/𝛼𝑢. We can then use that for the remaining conditional expectation:
= 𝔼𝑊 [𝛽𝑤 𝑊 + 𝛽𝑢 (𝑡 − 𝛼𝑤 𝑊)/𝛼𝑢 + 𝛿𝑡]   (8.67)
= 𝔼𝑊 [𝛽𝑤 𝑊 + (𝛽𝑢/𝛼𝑢) 𝑡 − (𝛽𝑢 𝛼𝑤/𝛼𝑢) 𝑊 + 𝛿𝑡]   (8.68)
= 𝛽𝑤 𝔼[𝑊] + (𝛽𝑢/𝛼𝑢) 𝑡 − (𝛽𝑢 𝛼𝑤/𝛼𝑢) 𝔼[𝑊] + 𝛿𝑡   (8.69)
= (𝛿 + 𝛽𝑢/𝛼𝑢) 𝑡 + (𝛽𝑤 − 𝛽𝑢 𝛼𝑤/𝛼𝑢) 𝔼[𝑊]   (8.70)
The only parts of this that matter are the parts that depend on 𝑡 because we want to know the effect of 𝑇 on 𝑌. For example, consider the expected ATE estimate we would get if we were to only adjust for 𝑊: reading off the coefficient on 𝑡 in Equation 8.70, it is 𝛿 + 𝛽𝑢/𝛼𝑢, i.e. the true ATE plus the confounding bias 𝛽𝑢/𝛼𝑢.
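As a quick numerical sanity check of Proposition 8.8, here is a small simulation of the linear process in Equations 8.61 and 8.62 (the parameter values are arbitrary illustrations); regressing 𝑌 on 𝑇 and 𝑊 while leaving 𝑈 out recovers 𝛿 + 𝛽𝑢/𝛼𝑢 rather than 𝛿.

import numpy as np

rng = np.random.default_rng(0)
n = 500_000
alpha_w, alpha_u = 1.0, 2.0
beta_w, beta_u, delta = 3.0, 1.5, 0.7

W = rng.normal(size=n)
U = rng.normal(size=n)
T = alpha_w * W + alpha_u * U              # Equation 8.61
Y = beta_w * W + beta_u * U + delta * T    # Equation 8.62

# Adjust for W only: regress Y on (T, W); the coefficient on T is the
# expected ATE estimate when U is unobserved.
X = np.column_stack([T, W, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(coef[0])                    # ~ 1.45 = delta + beta_u / alpha_u
print(delta + beta_u / alpha_u)   # bias predicted by Proposition 8.8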
Generalization to Arbitrary Graphs/Estimands Here, we’ve performed a sensitivity analysis for the ATE for the simple graph structure in Figure 8.4. For arbitrary estimands in arbitrary graphs, where the structural equations are linear, see Cinelli et al. [61].
Figure 8.4: Simple causal structure where 𝑊 represents the observed confounders and 𝑈 the unobserved confounders.

Sensitivity Contour Plots
If we rearrange Equation 8.73¹⁵ to solve for 𝛿, we get the following:
¹⁵ Recall Equation 8.73: 𝔼𝑊 [𝔼[𝑌 | 𝑇 = 1, 𝑊] − 𝔼[𝑌 | 𝑇 = 0, 𝑊]] = 𝛿 + 𝛽𝑢/𝛼𝑢
Figure 8.5: Sensitivity contour plots.
In the example we depict in Figure 8.5, the figure tells us that the green curve (third from the bottom/left) indicates how strong the confounding would need to be in order to completely explain the observed association. In other words, (1/𝛼𝑢, 𝛽𝑢) would need to be large enough to fall on the green curve or above in order for the true ATE 𝛿 to be zero or the opposite sign of 𝔼𝑊 [𝔼[𝑌 | 𝑇 = 1, 𝑊] − 𝔼[𝑌 | 𝑇 = 0, 𝑊]] = 25.
Figure 9.1: Graph where 𝑈 is an unobserved confounder of the effect of 𝑇 on 𝑌 and 𝑍 is an instrumental variable.
There are three main assumptions that must be satisfied for a variable 𝑍
to be considered an instrument. The first is that 𝑍 must be relevant in
the sense that it must influence 𝑇 .
As a warm-up, we’ll start in the setting where 𝑇 and 𝑍 are binary and
where we make the parametric assumption that 𝑌 is a linear function of
𝑇 and 𝑈 :
𝑌 := 𝛿𝑇 + 𝛼 𝑢 𝑈 (9.1)
Proposition 9.1

𝛿 = (𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0]) / (𝔼[𝑇 | 𝑍 = 1] − 𝔼[𝑇 | 𝑍 = 0])   (9.7)
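A minimal plug-in version of this estimand, with sample means standing in for the conditional expectations (purely illustrative; the names are placeholders), is:

import numpy as np

def wald_estimator(Z, T, Y):
    # Plug-in version of Equation 9.7 for a binary instrument Z.
    Z, T, Y = np.asarray(Z), np.asarray(T, float), np.asarray(Y, float)
    numerator = Y[Z == 1].mean() - Y[Z == 0].mean()
    denominator = T[Z == 1].mean() - T[Z == 0].mean()
    return numerator / denominator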
We’ll now consider the setting where 𝑇 and 𝑍 are continuous, rather
than binary. We’ll still assume the linear form for 𝑌 (Assumption 9.4),
which means that the causal effect of 𝑇 on 𝑌 is 𝛿. In the continuous
setting, we get the natural continuous analog of the Wald estimand:
Proposition 9.2

𝛿 = Cov(𝑌, 𝑍) / Cov(𝑇, 𝑍)   (9.9)
Now, we see that we can apply the same covariance identity again:
= 𝛿 Cov(𝑇, 𝑍)   (9.15)

Dividing both sides by Cov(𝑇, 𝑍) then gives Proposition 9.2, provided that the denominator is non-zero.
This leads us to the following natural estimator, similar to the Wald estimator:

𝛿ˆ = Ĉov(𝑌, 𝑍) / Ĉov(𝑇, 𝑍)   (9.17)

Figure 9.4: Graph where 𝑈 is an unobserved confounder of the effect of 𝑇 on 𝑌 and 𝑍 is an instrumental variable.
Another equivalent estimator is what’s known as the two-stage least squares estimator (2SLS). The two stages are as follows:
1. Linearly regress 𝑇 on 𝑍 to estimate 𝔼[𝑇 | 𝑍]. This gives us the projection of 𝑇 onto 𝑍: 𝑇ˆ.
2. Linearly regress 𝑌 on 𝑇ˆ to estimate 𝔼[𝑌 | 𝑇ˆ]. Obtain our estimate 𝛿ˆ as the fitted coefficient in front of 𝑇ˆ.
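Here is a bare-bones illustrative sketch of those two stages for a single instrument and no covariates; in practice you would use a dedicated econometrics implementation, which also handles covariates and standard errors.

import numpy as np
from sklearn.linear_model import LinearRegression

def two_stage_least_squares(Z, T, Y):
    Z = np.asarray(Z, float).reshape(-1, 1)
    # Stage 1: regress T on Z; the fitted values are T_hat, the projection
    # of T onto Z.
    t_hat = LinearRegression().fit(Z, T).predict(Z).reshape(-1, 1)
    # Stage 2: regress Y on T_hat; the coefficient on T_hat is delta_hat.
    return LinearRegression().fit(t_hat, Y).coef_[0]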
There is helpful intuition that comes with the 2SLS estimator. To see this, start with the canonical instrumental variable graph we’ve been using (Figure 9.4). In stage one, we are projecting 𝑇 onto 𝑍 to get 𝑇ˆ as a function of only 𝑍: 𝑇ˆ = 𝔼ˆ[𝑇 | 𝑍]. Then, imagine a graph where 𝑇 is replaced with 𝑇ˆ (Figure 9.5). Because 𝑇ˆ isn’t a function of 𝑈, we can think of removing the 𝑈 → 𝑇ˆ edge in this graph. Now, because there are no backdoor paths
Figure 9.5: Augmented version of Figure 9.4, where 𝑇 is replaced with 𝑇ˆ = 𝔼ˆ[𝑇 | 𝑍], which doesn’t depend on 𝑈, so it no longer has an incoming edge from 𝑈.
The problem with the previous two sections is that we’ve made the strong parametric assumption of linearity (Assumption 9.4). For example, this assumption requires homogeneity (that the treatment effect is the same for every unit). There are other variants that encode the homogeneity assumption (see, e.g., Hernán and Robins [7, Section 16.3]), and they are all strong assumptions. Ideally, we’d be able to use instrumental variables without making such strong parametric assumptions.
[7]: Hernán and Robins (2020), Causal Inference: What If
We will segment the population into four principal strata, based on the
relationship between the encouragement 𝑍 and the treatment taken 𝑇 .
There are four strata because there is one for each combination of the
values the binary variables 𝑍 and 𝑇 can take on.
1. 𝑍 = 0, 𝑇 = 0. Compatible strata: compliers or never-takers
2. 𝑍 = 0, 𝑇 = 1. Compatible strata: defiers or always-takers
3. 𝑍 = 1, 𝑇 = 0. Compatible strata: defiers or never-takers
4. 𝑍 = 1, 𝑇 = 1. Compatible strata: compliers or always-takers
This means that we can’t identify if a given unit is a complier, a defier, an
always-taker, or a never-taker.
To identify the LATE, although we will no longer need the linearity as-
sumption (Assumption 9.4), we will need to introduce a new assumption
known as monotonicity.
∀𝑖, 𝑇𝑖 (𝑍 = 1) ≥ 𝑇𝑖 (𝑍 = 0) (9.19)
𝔼[𝑌(1) − 𝑌(0) | 𝑇(1) = 1, 𝑇(0) = 0] = (𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0]) / (𝔼[𝑇 | 𝑍 = 1] − 𝔼[𝑇 | 𝑍 = 0])   (9.20)
The first term corresponds to the compliers, the second term corresponds to the defiers, the third term corresponds to the always-takers, and the last term corresponds to the never-takers. As we discussed in Section 9.5.2,
the causal effect of 𝑍 on 𝑌 among the always-takers and never-takers is
zero, so we can remove those terms.
the following:

= (𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0]) / 𝑃(𝑇(1) = 1, 𝑇(0) = 0)   (9.27)
= (𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0]) / (1 − 𝑃(𝑇 = 0 | 𝑍 = 1) − 𝑃(𝑇 = 1 | 𝑍 = 0))   (9.28)
= (𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0]) / (1 − (1 − 𝑃(𝑇 = 1 | 𝑍 = 1)) − 𝑃(𝑇 = 1 | 𝑍 = 0))   (9.29)
= (𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0]) / (𝑃(𝑇 = 1 | 𝑍 = 1) − 𝑃(𝑇 = 1 | 𝑍 = 0))   (9.30)
= (𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0]) / (𝔼[𝑇 | 𝑍 = 1] − 𝔼[𝑇 | 𝑍 = 0])   (9.31)
This is exactly the Wald estimand that we saw back in the linear setting
(Section 9.3) in Equation 9.7. However, this time, it is the corresponding
statistical estimand of the local ATE 𝔼[𝑌(𝑇 = 1) − 𝑌(𝑇 = 0) | 𝑇(1) =
1 , 𝑇(0) = 0], also known as the complier average causal effect (CACE). This
LATE/CACE causal estimand is in contrast to the ATE causal estimand
that we saw in Section 9.3: 𝔼[𝑌(𝑇 = 1) − 𝑌(𝑇 = 0)]. The difference
is that the complier average causal effect is the ATE specifically in the
subpopulation of compliers, rather than the total population. It’s local
(LATE) to that subpopulation, rather than being global over the whole
population like the ATE is. So we’ve seen two different assumptions that
get us to the Wald estimand with instrumental variables:
1. Linearity (or more generally homogeneity)
2. Monotonicity
Problems with LATE/CACE There are a few problems with the Wald
estimand for LATE, though. The first is that monotonicity might not be
satisfied in your setting of interest. The second is that, even if monotonicity
is satisfied, you might not be interested in the causal effect specifically
among the compliers, especially because you can’t even identify who the
compliers are (see Section 9.5.2). Rather, the regular ATE is often a more
useful quantity to know.
𝑌 := 𝑓 (𝑇, 𝑊) + 𝑈   (9.32)

See, for example, Hartford et al. [75] and Xu et al. [76] for using deep learning to model 𝑓. See references in those papers for using other kinds of models.
[75]: Hartford et al. (2017), ‘Deep IV: A Flexible Approach for Counterfactual Prediction’

𝑌 := 𝑓 (𝑇, 𝑈)   (9.33)
10 Difference in Differences

Note: the following chapter is much more rough than usual and currently does not contain as many figures or as much intuition as the corresponding lecture.

10.1 Preliminaries
We will now introduce the time dimension. Using information from the
time dimension will be key for us to get identification without assuming
the usual unconfoundedness. We’ll use 𝜏 for the variable for time.
Setting As usual, we have a treatment group (𝑇 = 1) and a control
group (𝑇 = 0). However, now there is also time, and the treatment group
only gets the treatment after a certain time. So we have some time 𝜏 = 1
that denotes a time after the treatment has been administered to the
treatment group and some time 𝜏 = 0 that denotes some time before
the treatment has been administered to the treatment group. Because
the control group never gets the treatment, the control group hasn’t
received treatment at either time 𝜏 = 0 or time 𝜏 = 1. We will denote
the random variable for potential outcome under treatment 𝑡 at time
𝜏 as 𝑌𝜏 (𝑡). Then, the causal estimand we’re interested in is the average
difference in potential outcomes after treatment has been administered
(in time period 𝜏 = 1) in the treatment group: 𝔼[𝑌1 (1) − 𝑌1 (0) | 𝑇 = 1].
In other words, we’re interested in the ATT after the treatment has been
administered.
10.3 Identification
10.3.1 Assumptions
You can just treat 𝑌1 and 𝑌0 as two different random variables. So even
though we have a time subscript now, we still have trivial identification
via consistency (recall Assumption 2.5) when the value inside of the
parenthesis for the potential outcome matches the conditioning value for
𝑇:
This assumption may seem like it’s obviously true, but that isn’t necessarily
the case. For example, if participants anticipate the treatment, then they
might be able to change their behavior before the treatment is actually administered.
Using the assumptions in the previous section, we can show that the
ATT is equal to the difference between the differences across time in
each treatment group. We state this mathematically in the following
proposition.
So we’ve identified the first term, but the second term remains to be identified. To do that, we’ll solve for this term in the parallel trends assumption:¹

𝔼[𝑌1 (0) | 𝑇 = 1] = 𝔼[𝑌0 (0) | 𝑇 = 1] + 𝔼[𝑌1 (0) | 𝑇 = 0] − 𝔼[𝑌0 (0) | 𝑇 = 0]   (10.19)

¹ Parallel trends assumption (Assumption 10.2):
𝔼[𝑌1 (0) | 𝑇 = 1] − 𝔼[𝑌0 (0) | 𝑇 = 1] = 𝔼[𝑌1 (0) | 𝑇 = 0] − 𝔼[𝑌0 (0) | 𝑇 = 0]   (10.13 revisited)
We can use consistency to identify the last two terms:
But the first term is counterfactual. This is where we need the no pretreatment effect assumption:²

= 𝔼[𝑌0 (1) | 𝑇 = 1] + 𝔼[𝑌1 | 𝑇 = 0] − 𝔼[𝑌0 | 𝑇 = 0]   (10.21)

² No pretreatment effect assumption (Assumption 10.3):
𝔼[𝑌0 (1) | 𝑇 = 1] − 𝔼[𝑌0 (0) | 𝑇 = 1] = 0   (10.15 revisited)
Now that we’ve identified 𝔼[𝑌1 (0) | 𝑇 = 1], we can plug Equation 10.22
back into Equation 10.18 to complete the proof:
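In code, the resulting estimator is just a difference of differences of group means. Here is a minimal illustrative sketch, assuming a long-format dataset with a group indicator, a period indicator, and the outcome (the names are placeholders):

import numpy as np

def did_att(group, period, Y):
    # group: 1 for the treatment group, 0 for the control group
    # period: 1 for the post-treatment period (tau = 1), 0 for pre (tau = 0)
    group, period, Y = map(np.asarray, (group, period, Y))
    treated_change = (Y[(group == 1) & (period == 1)].mean()
                      - Y[(group == 1) & (period == 0)].mean())
    control_change = (Y[(group == 0) & (period == 1)].mean()
                      - Y[(group == 0) & (period == 0)].mean())
    return treated_change - control_change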
The main assumption we’ve seen that relates the graph to the distribution
is the Markov assumption. The Markov assumption tells us if variables are
d-separated in the graph 𝐺 , then they are independent in the distribution
𝑃 (Theorem 3.1):
𝑋 ⊥⊥𝐺 𝑌 | 𝑍 =⇒ 𝑋 ⊥⊥𝑃 𝑌 | 𝑍   (3.20 revisited)
Maybe we can detect independencies in the data and then use that
to infer the causal graph. However, going from independencies in the
distribution 𝑃 to d-separations in the graph 𝐺 isn’t something that the
Markov assumption gives us (see Equation 3.20 above). Rather, we need
the converse of the Markov assumption. This is known as the faithfulness
assumption.
𝑋 ⊥⊥𝐺 𝑌 | 𝑍 ⇐= 𝑋 ⊥⊥𝑃 𝑌 | 𝑍   (11.1)
Figure 11.2: (a) Chain directed to the right. (b) Chain directed to the left. (c) Fork.
Although these are all distinct graphs, they correspond to the same set
of independence/dependence assumptions. Recall from Section 3.5 that
𝑋1 ⊥⊥ 𝑋3 | 𝑋2 in distributions that are Markov with respect to any of these
three graphs in Figure 11.2. We also saw that minimality told us that 𝑋1
and 𝑋2 are dependent and that 𝑋2 and 𝑋3 are dependent. And the stronger
faithfulness assumption additionally tells us that in any distributions
that are faithful with respect to any of these graphs, 𝑋1 and 𝑋3 are
dependent if we don’t condition on 𝑋2 . So using the presence/absence
of (conditional) independencies in the data isn’t enough to distinguish
these three graphs from each other; these graphs are Markov equivalent. We say that two graphs are Markov equivalent if they correspond to the same set of conditional independencies. Given a graph, we refer to its Markov equivalence class as the set of graphs that encode the same conditional independencies. Under faithfulness, we are able to identify a graph from conditional independencies in the data if it is the only graph in its Markov equivalence class. An example of a graph that is the only one in its Markov equivalence class is the basic immorality that we show in Figure 11.3. Recall from Section 3.6 that immoralities are distinct from the two other basic graphical building blocks (chains and forks) in that in Figure 11.3, 𝑋1 is (unconditionally) independent of 𝑋3, and 𝑋1 and 𝑋3 become dependent if we condition on 𝑋2. This means that while the basic chains and fork in Figure 11.2 are in the same Markov equivalence class, the basic immorality is by itself in its own Markov equivalence class.
Figure 11.3: Immoralities are in their own Markov equivalence class.
We’ve seen that we can identify the causal graph if it’s a basic immorality, but what else can we identify? We saw that chains and forks are all in the same Markov equivalence class, but that doesn’t mean that we can’t get any information from distributions that are Markov and faithful with respect to those graphs. What do all the chains and forks in Figure 11.2 have in common? They share the same skeleton. A graph’s skeleton is the structure we get if we replace all of its directed edges with undirected edges. We depict the skeleton of a basic chain and a basic fork in Figure 11.4.
Figure 11.4: Chain/fork skeleton.
A graph’s skeleton also gives us important conditional independence information that we can use to distinguish it from graphs with different skeletons. For example, if we add an 𝑋1 → 𝑋3 edge to the chain in Figure 11.2a, we get the complete¹ graph in Figure 11.5. In this graph, unlike in a chain or fork graph, 𝑋1 and 𝑋3 are not independent when we condition on 𝑋2. So this graph is not in the same Markov equivalence class as the chains and fork in Figure 11.2. And we can see that graphically by the fact that this graph has a different skeleton than those graphs (this graph has an additional edge between 𝑋1 and 𝑋3).
Figure 11.5: Complete graph.
¹ Recall that a complete graph is one where there is an edge connecting every pair of nodes.
To recap, we’ve pointed out two structural qualities that we can use to
distinguish graphs from each other:
1. Immoralities
2. Skeleton
And it turns out that we can determine whether graphs are in the same or different Markov equivalence classes using these two structural qualities, due to a result by Verma and Pearl [78] and Frydenberg [79]:
[78]: Verma and Pearl (1990), ‘Equivalence and Synthesis of Causal Models’
[79]: Frydenberg (1990), ‘The Chain Graph Markov Property’

Proposition 11.1 (Markov Equivalence via Immoral Skeletons) Two graphs are Markov equivalent if and only if they have the same skeleton and same immoralities.
PC [80] starts with a complete undirected graph and then trims it down and orients edges via three steps:
1. Identify the skeleton.
2. Identify immoralities and orient them.
3. Orient qualifying edges that are incident on colliders.
We’ll use the true graph in Figure 11.6 as a concrete example as we explain each of these steps.
Figure 11.6: True graph for PC example.
Figure 11.7: Illustration of the process of step 1 of PC, where we start with the complete graph (left) and remove edges until we’ve identified the skeleton of the graph (right), given that the true graph is the one in Figure 11.6. (a) Complete undirected graph that we start with. (b) Undirected graph that remains after removing 𝑋 − 𝑌 edges where 𝑋 ⊥⊥ 𝑌. (c) Undirected graph that remains after removing 𝑋 − 𝑌 edges where 𝑋 ⊥⊥ 𝑌 | 𝑍.
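To make step 1 slightly more concrete, here is a rough illustrative sketch of skeleton identification that uses Fisher-z (partial) correlation tests with conditioning sets of size at most one; this is a simplification of the real PC skeleton phase, which grows the conditioning-set size and only conditions on current neighbors, and it assumes roughly linear-Gaussian data so that partial correlation is a reasonable independence test.

import numpy as np
from itertools import combinations
from scipy.stats import norm

def ci_test(data, i, j, cond, alpha=0.05):
    # Fisher-z test of X_i independent of X_j given X_cond (|cond| <= 1),
    # based on (partial) correlation.
    n = data.shape[0]
    r = np.corrcoef(data, rowvar=False)
    if not cond:
        rho = r[i, j]
    else:
        k = cond[0]
        rho = (r[i, j] - r[i, k] * r[j, k]) / np.sqrt((1 - r[i, k] ** 2) * (1 - r[j, k] ** 2))
    z = 0.5 * np.log((1 + rho) / (1 - rho))
    stat = np.sqrt(n - len(cond) - 3) * abs(z)
    return stat < norm.ppf(1 - alpha / 2)  # True means "looks independent"

def skeleton(data, alpha=0.05):
    d = data.shape[1]
    edges = {frozenset(pair) for pair in combinations(range(d), 2)}  # complete graph
    for i, j in combinations(range(d), 2):
        others = [k for k in range(d) if k not in (i, j)]
        for cond in [[]] + [[k] for k in others]:  # conditioning sets of size 0 and 1
            if frozenset((i, j)) in edges and ci_test(data, i, j, cond, alpha):
                edges.discard(frozenset((i, j)))  # X_i indep. of X_j given cond -> drop edge
                break
    return edges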
algorithm [81] works without assuming acyclicity. And there is various work on SAT-based causal discovery that allows us to drop both of the
[81]: Richardson (1996), ‘Feedback Models: Interpretation and Discovery’
We’ve seen that assuming the Markov assumption and faithfulness can
only get us so far; with those assumptions, we can only identify a graph
up to its Markov equivalence class. If we make more assumptions, can
we identify the graph more precisely than just its Markov equivalence
class?
Well, if we are in the case where the distributions are multinomial, we cannot [85]. Or if we are in the common toy case where the SCMs are linear with Gaussian noise, we cannot [86]. So we have the following completeness result due to Geiger and Pearl [86] and Meek [85]:
[85]: Meek (1995), ‘Strong Completeness and Faithfulness in Bayesian Networks’
[86]: Geiger and Pearl (1988), ‘On the Logic of Causal Models’
𝑌 = 𝑓𝑌 (𝑋, 𝑈𝑌 ),   𝑋 ⊥⊥ 𝑈𝑌   (11.6)
𝑋 = 𝑓𝑋 (𝑌, 𝑈𝑋 ),   𝑌 ⊥⊥ 𝑈𝑋   (11.7)
See, e.g., Peters et al. [14, p. 44] for a short proof. Similarly, this non-identifiability result can be extended to more general graphs that have more than two variables [see, e.g., 14, p. 135].
[14]: Peters et al. (2017), Elements of Causal Inference: Foundations and Learning Algorithms
However, if we make assumptions about the parametric form of the
SCM, we can distinguish 𝑋 → 𝑌 from 𝑋 ← 𝑌 and identify graphs more
generally. That’s what we’ll see in the rest of this chapter.
𝑌 := 𝑓 (𝑋) + 𝑈   (11.8)

𝑌 := 𝑓 (𝑋) + 𝑈,   𝑋 ⊥⊥ 𝑈,   (11.9)
𝑋 := 𝑔(𝑌) + 𝑈̃,   𝑌 ⊥⊥ 𝑈̃,   (11.10)
Proof. We’ll first introduce an important result from Darmois [87] and Skitovich [88] that we’ll use to prove this theorem:
[87]: Darmois (1953), ‘Analyse générale des liaisons stochastiques: etude particulière de l’analyse factorielle linéaire’
[88]: Skitovich (1954), ‘Linear forms of independent random variables and the normal distribution law’

Theorem 11.5 (Darmois-Skitovich) Let 𝑋1, . . . , 𝑋𝑛 be independent, non-degenerate random variables. If there exist coefficients 𝛼1, . . . , 𝛼𝑛 and 𝛽1, . . . , 𝛽𝑛 that are all non-zero such that the two linear combinations
𝐴 = 𝛼1 𝑋1 + . . . + 𝛼𝑛 𝑋𝑛 and
𝐵 = 𝛽1 𝑋1 + . . . + 𝛽𝑛 𝑋𝑛
are independent, then each 𝑋𝑖 is normally distributed.
We will use the contrapositive of the special case of this theorem for 𝑛 = 2 to do almost all of the work for this proof: if 𝑋1 and 𝑋2 are independent, non-degenerate random variables, at least one of which is non-Gaussian, then the linear combinations
𝐴 = 𝛼1 𝑋1 + 𝛼2 𝑋2 and
𝐵 = 𝛽1 𝑋1 + 𝛽2 𝑋2
(with all coefficients non-zero) are dependent.
Proof Outline With the above corollary in mind, our proof strategy is to write 𝑌 and 𝑈̃ as linear combinations of 𝑋 and 𝑈. By doing this, we are effectively mapping our variables in Equations 11.9 and 11.10 onto the variables in the corollary as follows: 𝑌 onto 𝐴, 𝑈̃ onto 𝐵, 𝑋 onto 𝑋1, and 𝑈 onto 𝑋2. Then, we can apply the above corollary of the Darmois-Skitovich Theorem to have that 𝑌 and 𝑈̃ must be dependent, which violates the reverse direction SCM in Equation 11.10. We now proceed with the proof.
𝑈̃ = 𝑋 − 𝛿̃𝑌   (11.13)
= 𝑋 − 𝛿̃(𝛿𝑋 + 𝑈)   (11.14)
= (1 − 𝛿̃𝛿)𝑋 − 𝛿̃𝑈   (11.15)

𝑋 := 𝑔(𝑌) + 𝑈̃,   𝑌 ⊥⊥ 𝑈̃   (11.10 revisited)
We’ve given the proof here for just two variables, but it can be extended to the more general setting with multiple variables (see [89] and [14, Section 7.1.4]).
[89]: Shimizu et al. (2006), ‘A Linear Non-Gaussian Acyclic Model for Causal Discovery’
[14]: Peters et al. (2017), Elements of Causal Inference: Foundations and Learning Algorithms

Graphical Intuition
When we fit the data in the causal direction, we get residuals that are
independent of the input variable, but when we fit the data in the
anti-causal direction, we get residuals that are dependent on the input
variable. We depict the regression line 𝑓ˆ we get if we linearly regress 𝑌
on 𝑋 (causal direction) in Figure 11.10a, and we depict the regression
line 𝑔ˆ we get if we linearly regress 𝑋 on 𝑌 (anti-causal direction) in
Figure 11.10b. Just from these fits, you can see that the forward model (fit
in the causal direction) looks more pleasing than the backward model
(fit in the anti-causal direction).
To make this graphical intuition more clear, we plot the residuals of the forward model 𝑓ˆ (causal direction) and the backward model 𝑔ˆ (anti-causal direction) in Figure 11.11. The residuals in the forward direction correspond to the following: 𝑈ˆ = 𝑌 − 𝑓ˆ(𝑋). And the residuals in the backward direction correspond to 𝑋 − 𝑔ˆ(𝑌). As we see in Figure 11.11a, the residuals of the forward model look independent of the input variable 𝑋 (on the x-axis). However, in Figure 11.11b, the residuals of the backward model don’t look independent of the input variable 𝑌 (on the x-axis) at all. Clearly, the range of the residuals (on the vertical) changes as we move across values of 𝑌 (from left to right).

Forward model SCM:
𝑌 := 𝑓 (𝑋) + 𝑈,   𝑋 ⊥⊥ 𝑈   (11.9 revisited)
Backward model SCM:
𝑋 := 𝑔(𝑌) + 𝑈̃,   𝑌 ⊥⊥ 𝑈̃   (11.10 revisited)
Figure 11.10: Linear fits (in both directions) of the linear non-Gaussian data. (a) Causal direction fit: linear fit that results from regressing 𝑌 on 𝑋. (b) Anti-causal direction fit: linear fit that results from regressing 𝑋 on 𝑌.
Figure 11.11: Residuals of linear models (in both directions) fit to the linear non-Gaussian data. (a) Causal direction residuals: residuals that result from linearly regressing 𝑌 on 𝑋. (b) Anti-causal direction residuals: residuals that result from linearly regressing 𝑋 on 𝑌.
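To turn this visual intuition into a numerical check, here is a rough illustrative sketch that simulates linear non-Gaussian data, fits a linear model in both directions, and compares the dependence between the input and the residuals using a simple (biased) HSIC statistic with RBF kernels; the direction with the smaller dependence is taken as the causal one. The choice of HSIC and the kernel bandwidth are illustrative choices, not the only option.

import numpy as np

def rbf_gram(v, sigma=0.5):
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(u, v):
    # Biased HSIC estimate between two 1-D samples (larger = more dependent).
    n = len(u)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(rbf_gram(u) @ H @ rbf_gram(v) @ H) / (n - 1) ** 2

def linear_residuals(x, y):
    slope, intercept = np.polyfit(x, y, 1)  # linear fit of y on x
    return y - (slope * x + intercept)

rng = np.random.default_rng(0)
n = 300
X = rng.uniform(-1, 1, n)             # non-Gaussian cause
U = rng.uniform(-0.5, 0.5, n)         # non-Gaussian noise
Y = X + U                             # forward (causal) model

forward_dep = hsic(X, linear_residuals(X, Y))   # residuals ~ independent of X
backward_dep = hsic(Y, linear_residuals(Y, X))  # residuals dependent on Y
print("inferred direction:", "X -> Y" if forward_dep < backward_dep else "Y -> X")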
∀𝑖 , 𝑋𝑖 := 𝑓 (pa𝑖 ) + 𝑈 𝑖 (11.16)
Post-Nonlinear Setting What if you don’t believe that the noise realistically enters additively? This motivates post-nonlinear models, where there is another nonlinear transformation after adding the noise, as in Assumption 11.5 below. This setting can also yield identifiability (under another technical condition). See Zhang and Hyvärinen [92] for more details.
[92]: Zhang and Hyvärinen (2009), ‘On the Identifiability of the Post-Nonlinear Causal Model’

∀𝑖, 𝑋𝑖 := 𝑔( 𝑓 (pa𝑖 ) + 𝑈𝑖 )   (11.17)
And then if you want a whole book on this stuff, Peters et al. [14] wrote a popular one!
[14]: Peters et al. (2017), Elements of Causal Inference: Foundations and Learning Algorithms
12 Causal Discovery from Interventional Data

12.1 Structural Interventions

12.1.1 Single-Node Interventions
Coming Soon

12.1.2 Multi-Node Interventions
Coming Soon
Again, using the fact that 𝑇 is binary, we can reduce the inner expectation
to 𝑃(𝑇 = 1 | 𝑊) = 𝑒(𝑊), something that is already conditioned on:
Because this does not depend on 𝑌(𝑡), we’ve proven that 𝑇 is independent
of 𝑌(𝑡) given 𝑒(𝑊).
Proof. We will start with the statistical estimand that we get from the ad-
justment formula (Theorem 2.1). Given unconfoundedness and positivity,
the adjustment formula tells us
We’ll assume the variables are discrete to break these expectations into
sums (replace with integrals if continuous):
= ∑𝑤 ( ∑𝑦 𝑦 𝑃(𝑦 | 𝑡, 𝑤) ) 𝑃(𝑤)   (A.14)

To get 𝑃(𝑡 | 𝑤) in there, we multiply by 𝑃(𝑡 | 𝑤)/𝑃(𝑡 | 𝑤):

= ∑𝑤 ∑𝑦 𝑦 (𝑃(𝑡 | 𝑤)/𝑃(𝑡 | 𝑤)) 𝑃(𝑦 | 𝑡, 𝑤) 𝑃(𝑤)   (A.15)
= ∑𝑤 ∑𝑦 𝑦 (1/𝑃(𝑡 | 𝑤)) 𝑃(𝑦, 𝑡, 𝑤)   (A.16)
= ∑𝑤 (1/𝑃(𝑡 | 𝑤)) 𝔼[1(𝑇 = 𝑡, 𝑊 = 𝑤) 𝑌]   (A.17)
= 𝔼[ 1(𝑇 = 𝑡) 𝑌 / 𝑃(𝑡 | 𝑊) ]   (A.18)
Note: For some people, it might be more natural to skip straight from
Equation A.16 to Equation A.18.
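Equation A.18 suggests the usual plug-in inverse probability weighting estimator. Here is a minimal illustrative sketch, with a logistic regression standing in for the propensity score model 𝑃(𝑇 = 1 | 𝑊) (any model of the propensity score would do):

import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(W, T, Y):
    # Plug-in version of the IPW estimand: fit a propensity model for
    # P(T = 1 | W), then weight observed outcomes by the inverse of the
    # probability of the treatment actually received.
    e = LogisticRegression().fit(W, T).predict_proba(W)[:, 1]
    ey1 = np.mean(T * Y / e)              # sample analog of E[1(T = 1) Y / P(1 | W)]
    ey0 = np.mean((1 - T) * Y / (1 - e))  # sample analog of E[1(T = 0) Y / P(0 | W)]
    return ey1 - ey0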
Bibliography