

Course Lecture Notes

Introduction to Causal Inference


from a Machine Learning Perspective

Brady Neal

December 17, 2020


Preface

Prerequisites There is one main prerequisite: basic probability. This course assumes
you’ve taken an introduction to probability course or have had equivalent experience.
Topics from statistics and machine learning will pop up in the course from time to
time, so some familiarity with those will be helpful but is not necessary. For example, if
cross-validation is a new concept to you, you can learn it relatively quickly at the point in
the book that it pops up. And we give a primer on some statistics terminology that we’ll
use in Section 2.4.
Active Reading Exercises Research shows that one of the best techniques to remember
material is to actively try to recall information that you recently learned. You will see
“active reading exercises” throughout the book to help you do this. They’ll be marked by
the Active reading exercise: heading.
Many Figures in This Book As you will see, there is a ridiculous number of figures in
this book. This is on purpose, to give you as much visual intuition as possible.
We will sometimes copy the same figures, equations, etc. that you might have seen in
preceding chapters so that we can make sure the figures are always right next to the text
that references them.
Sending Me Feedback This is a book draft, so I greatly appreciate any feedback you’re
willing to send my way. If you’re unsure whether I’ll be receptive to it or not, don’t be.
Please send any feedback to me at [email protected] with “[Causal Book]” in the
beginning of your email subject. Feedback can be at the word level, sentence level, section
level, chapter level, etc. Here’s a non-exhaustive list of useful kinds of feedback:
▶ Typoz.
▶ Some part is confusing.
▶ You notice your mind starts to wander, or you don’t feel motivated to read some part.
▶ Some part seems like it can be cut.
▶ You feel strongly that some part absolutely should not be cut.
▶ Some parts are not connected well. Moving from one part to the next, you notice that there isn’t a natural flow.
▶ A new active reading exercise you thought of.

Bibliographic Notes Although we do our best to cite relevant results, we don’t want to
disrupt the flow of the material by digging into exactly where each concept came from.
There will be complete sections of bibliographic notes in the final version of this book,
but they won’t come until after the course has finished.
Contents

Preface ii

Contents iii

1 Motivation: Why You Might Care 1


1.1 Simpson’s Paradox . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Applications of Causal Inference . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Correlation Does Not Imply Causation . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Nicolas Cage and Pool Drownings . . . . . . . . . . . . . . . . . . . 3
1.3.2 Why is Association Not Causation? . . . . . . . . . . . . . . . . . . 4
1.4 Main Themes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Potential Outcomes 6
2.1 Potential Outcomes and Individual Treatment Effects . . . . . . . . . . . . 6
2.2 The Fundamental Problem of Causal Inference . . . . . . . . . . . . . . . . 7
2.3 Getting Around the Fundamental Problem . . . . . . . . . . . . . . . . . . 8
2.3.1 Average Treatment Effects and Missing Data Interpretation . . . . 8
2.3.2 Ignorability and Exchangeability . . . . . . . . . . . . . . . . . . . 9
2.3.3 Conditional Exchangeability and Unconfoundedness . . . . . . . . 10
2.3.4 Positivity/Overlap and Extrapolation . . . . . . . . . . . . . . . . . 12
2.3.5 No interference, Consistency, and SUTVA . . . . . . . . . . . . . . 13
2.3.6 Tying It All Together . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Fancy Statistics Terminology Defancified . . . . . . . . . . . . . . . . . . . 15
2.5 A Complete Example with Estimation . . . . . . . . . . . . . . . . . . . . . 16

3 The Flow of Association and Causation in Graphs 19


3.1 Graph Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Causal Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4 Two-Node Graphs and Graphical Building Blocks . . . . . . . . . . . . . . 23
3.5 Chains and Forks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6 Colliders and their Descendants . . . . . . . . . . . . . . . . . . . . . . . . 26
3.7 d-separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.8 Flow of Association and Causation . . . . . . . . . . . . . . . . . . . . . . 30

4 Causal Models 32
4.1 The do-operator and Interventional Distributions . . . . . . . . . . . . . . 32
4.2 The Main Assumption: Modularity . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Truncated Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.1 Example Application and Revisiting “Association is Not Causation” 36
4.4 The Backdoor Adjustment . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.1 Relation to Potential Outcomes . . . . . . . . . . . . . . . . . . . . . 39
4.5 Structural Causal Models (SCMs) . . . . . . . . . . . . . . . . . . . . . . . 40
4.5.1 Structural Equations . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.5.2 Interventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5.3 Collider Bias and Why to Not Condition on Descendants of Treatment 43
4.6 Example Applications of the Backdoor Adjustment . . . . . . . . . . . . . 44
4.6.1 Association vs. Causation in a Toy Example . . . . . . . . . . . . . 44
4.6.2 A Complete Example with Estimation . . . . . . . . . . . . . . . . 45
4.7 Assumptions Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5 Randomized Experiments 49
5.1 Comparability and Covariate Balance . . . . . . . . . . . . . . . . . . . . . 49
5.2 Exchangeability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3 No Backdoor Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

6 Nonparametric Identification 52
6.1 Frontdoor Adjustment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.2 do-calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.2.1 Application: Frontdoor Adjustment . . . . . . . . . . . . . . . . . . 57
6.3 Determining Identifiability from the Graph . . . . . . . . . . . . . . . . . . 58

7 Estimation 62
7.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.2 Conditional Outcome Modeling (COM) . . . . . . . . . . . . . . . . . . . . 63
7.3 Grouped Conditional Outcome Modeling (GCOM) . . . . . . . . . . . . . 64
7.4 Increasing Data Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.4.1 TARNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.4.2 X-Learner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.5 Propensity Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.6 Inverse Probability Weighting (IPW) . . . . . . . . . . . . . . . . . . . . . . 68
7.7 Doubly Robust Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.8 Other Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.9 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.9.1 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.9.2 Comparison to Randomized Experiments . . . . . . . . . . . . . . 72

8 Unobserved Confounding: Bounds and Sensitivity Analysis 73


8.1 Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.1.1 No-Assumptions Bound . . . . . . . . . . . . . . . . . . . . . . . . 74
8.1.2 Monotone Treatment Response . . . . . . . . . . . . . . . . . . . . 76
8.1.3 Monotone Treatment Selection . . . . . . . . . . . . . . . . . . . . . 78
8.1.4 Optimal Treatment Selection . . . . . . . . . . . . . . . . . . . . . . 79
8.2 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
8.2.1 Sensitivity Basics in Linear Setting . . . . . . . . . . . . . . . . . . . 82
8.2.2 More General Settings . . . . . . . . . . . . . . . . . . . . . . . . . 85

9 Instrumental Variables 86
9.1 What is an Instrument? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
9.2 No Nonparametric Identification of the ATE . . . . . . . . . . . . . . . . . 87
9.3 Warm-Up: Binary Linear Setting . . . . . . . . . . . . . . . . . . . . . . . . 87
9.4 Continuous Linear Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
9.5 Nonparametric Identification of Local ATE . . . . . . . . . . . . . . . . . . 90
9.5.1 New Potential Notation with Instruments . . . . . . . . . . . . . . 90
9.5.2 Principal Stratification . . . . . . . . . . . . . . . . . . . . . . . . . 90
9.5.3 Local ATE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
9.6 More General Settings for ATE Identification . . . . . . . . . . . . . . . . . 94

10 Difference in Differences 95
10.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
10.2 Introducing Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.3 Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.3.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.3.2 Main Result and Proof . . . . . . . . . . . . . . . . . . . . . . . . . 97
10.4 Major Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

11 Causal Discovery from Observational Data 100


11.1 Independence-Based Causal Discovery . . . . . . . . . . . . . . . . . . . . 100
11.1.1 Assumptions and Theorem . . . . . . . . . . . . . . . . . . . . . . . 100
11.1.2 The PC Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
11.1.3 Can We Get Any Better Identification? . . . . . . . . . . . . . . . . 104
11.2 Semi-Parametric Causal Discovery . . . . . . . . . . . . . . . . . . . . . . . 104
11.2.1 No Identifiability Without Parametric Assumptions . . . . . . . . . 105
11.2.2 Linear Non-Gaussian Noise . . . . . . . . . . . . . . . . . . . . . . 105
11.2.3 Nonlinear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
11.3 Further Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

12 Causal Discovery from Interventional Data 110


12.1 Structural Interventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
12.1.1 Single-Node Interventions . . . . . . . . . . . . . . . . . . . . . . . 110
12.1.2 Multi-Node Interventions . . . . . . . . . . . . . . . . . . . . . . . 110
12.2 Parametric Interventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
12.2.1 Coming Soon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
12.3 Interventional Markov Equivalence . . . . . . . . . . . . . . . . . . . . . . 110
12.3.1 Coming Soon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
12.4 Miscellaneous Other Settings . . . . . . . . . . . . . . . . . . . . . . . . . . 110
12.4.1 Coming Soon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

13 Transfer Learning and Transportability 111


13.1 Causal Insights for Transfer Learning . . . . . . . . . . . . . . . . . . . . . 111
13.1.1 Coming Soon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
13.2 Transportability of Causal Effects Across Populations . . . . . . . . . . . . 111
13.2.1 Coming Soon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

14 Counterfactuals and Mediation 112


14.1 Counterfactuals Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
14.1.1 Coming Soon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
14.2 Important Application: Mediation . . . . . . . . . . . . . . . . . . . . . . . 112
14.2.1 Coming Soon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

Appendix 113
A Proofs 114
A.1 Proof of Equation 6.1 from Section 6.1 . . . . . . . . . . . . . . . . . . . . . 114
A.2 Proof of Propensity Score Theorem (7.1) . . . . . . . . . . . . . . . . . . . . 114
A.3 Proof of IPW Estimand (7.18) . . . . . . . . . . . . . . . . . . . . . . . . . . 115

Bibliography 117

Alphabetical Index 123


List of Figures

1.1 Causal structure for when to prefer treatment B for COVID-27 . . . . . . . . . 2


1.2 Causal structure for when to prefer treatment A for COVID-27 . . . . . . . . . 2
1.3 Number of Nicolas Cage movies correlates with number of pool drownings . 3
1.4 Causal structure with getting lit as a confounder . . . . . . . . . . . . . . . . . 4

2.2 Causal structure for ignorable treatment assignment mechanism . . . . . . . . 9


2.1 Causal structure of 𝑋 confounding the effect of 𝑇 on 𝑌 . . . . . . . . . . . . . 9
2.3 Causal structure of confounding through 𝑋 . . . . . . . . . . . . . . . . . . . . 11
2.4 Causal structure for conditional exchangeability given 𝑋 . . . . . . . . . . . . 11
2.5 The Identification-Estimation Flowchart . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Directed graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.1 Terminology machine gun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19


3.2 Undirected graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Directed graph with cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 Directed graph with immorality . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.6 Four node DAG where 𝑋4 locally depends on only 𝑋3 . . . . . . . . . . . . . . 20
3.7 Four node DAG with many independencies . . . . . . . . . . . . . . . . . . . . 21
3.8 Two connected node DAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.9 Basic graph building blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.11 Two connected node DAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.12 Chain with association . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.10 Two unconnected node DAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.13 Fork with association . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.14 Chain with blocked association . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.15 Fork with blocked association . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.16 Immorality with association blocked by collider . . . . . . . . . . . . . . . . . 26
3.17 Immorality with association unblocked . . . . . . . . . . . . . . . . . . . . . . 26
3.18 Good-looking men are jerks example . . . . . . . . . . . . . . . . . . . . . . . . 27
3.19 Graphs for d-separation exercise . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.20 Causal association and confounding association . . . . . . . . . . . . . . . . . 30
3.21 Assumptions flowchart from statistical independencies to causal dependencies 31

4.1 The Identification-Estimation Flowchart (extended) . . . . . . . . . . . . . . . 32


4.2 Illustration of the difference between conditioning and intervening . . . . . . 33
4.3 Causal mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4 Intervention as edge deletion in causal graphs . . . . . . . . . . . . . . . . . . 35
4.5 Causal structure for application of truncated factorization . . . . . . . . . . . . 36
4.6 Manipulated graph for three nodes . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.7 Graph for structural equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.8 Causal graph for several structural equations . . . . . . . . . . . . . . . . . . . 41
4.9 Causal structure before simple intervention . . . . . . . . . . . . . . . . . . . . 42
4.10 Causal structure after simple intervention . . . . . . . . . . . . . . . . . . . . . 42
4.11 Causal graph for completely blocking causal flow . . . . . . . . . . . . . . . . 43
4.12 Causal graph for partially blocking causal flow . . . . . . . . . . . . . . . . . . 43
4.13 Causal graph where a conditioned collider induces bias . . . . . . . . . . . . . 43
4.14 Causal graph where child of a mediator is conditioned on . . . . . . . . . . . . 44
4.15 Magnified causal graph where child of a mediator is conditioned on . . . . . . 44
4.16 Causal graph for M-bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.17 Causal graph for toy example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.18 Causal graph for blood pressure example with collider . . . . . . . . . . . . . 46
4.19 Causal graph for M-bias with unobserved variables . . . . . . . . . . . . . . . 47

5.1 Causal structure of confounding through 𝑋 . . . . . . . . . . . . . . . . . . . . 51


5.2 Causal structure when we randomize treatment . . . . . . . . . . . . . . . . . 51

6.1 Causal graph for frontdoor criterion . . . . . . . . . . . . . . . . . . . . . . . . 52


6.2 Illustration of focusing analysis to a mediator . . . . . . . . . . . . . . . . . . . 52
6.3 Illustration of steps of frontdoor adjustment . . . . . . . . . . . . . . . . . . . . 52
6.5 Equationtown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.4 Causal graph for frontdoor criterion . . . . . . . . . . . . . . . . . . . . . . . . 53
6.6 Causal graph for frontdoor criterion . . . . . . . . . . . . . . . . . . . . . . . . 54
6.7 Causal graph for frontdoor criterion . . . . . . . . . . . . . . . . . . . . . . . . 57
6.10 Causal graph for frontdoor criterion . . . . . . . . . . . . . . . . . . . . . . . . 58
6.8 Causal graph for frontdoor with 𝑊 − 𝑇 edge removed . . . . . . . . . . . . . . 58
6.9 Causal graph for frontdoor with 𝑇 − 𝑀 edge removed . . . . . . . . . . . . . . 58
6.11 Graph where blocking one backdoor path unblocks another . . . . . . . . . . 59
6.12 Example graph that satisfies the unconfounded children criterion . . . . . . . 60
6.13 Graphs for the questions about the unconfounded children criterion . . . . . . 61

7.1 The Identification-Estimation Flowchart . . . . . . . . . . . . . . . . . . . . . . 63


7.2 Different neural networks for different kinds of estimators . . . . . . . . . . . . 66
7.3 Simple graph where 𝑊 satisfies the backdoor criterion . . . . . . . . . . . . . 68
7.5 Simple graph where 𝑊 confounds the effect of 𝑇 on 𝑌 . . . . . . . . . . . . . . 68
7.6 Effective graph for pseudo-population that we get by reweighting the data
generated according to the graph in Figure 7.5 using inverse probability weighting. 68
7.4 Graphical proof of propensity score theorem . . . . . . . . . . . . . . . . . . . 68

8.1 Unobserved confounding graph . . . . . . . . . . . . . . . . . . . . . . . . . . 73


8.2 Simple unobserved confounding graph . . . . . . . . . . . . . . . . . . . . . . 82
8.3 Simple unobserved confounding graph . . . . . . . . . . . . . . . . . . . . . . 82
8.4 Simple unobserved confounding graph . . . . . . . . . . . . . . . . . . . . . . 84
8.5 Unobserved confounding sensitivity contour plots . . . . . . . . . . . . . . . . 84

9.1 Instrumental variable graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86


9.2 Instrumental variable graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
9.3 Instrumental variable graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
9.4 Instrumental variable graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
9.5 Instrumental variable graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
9.6 Causal graph for the compliers and defiers . . . . . . . . . . . . . . . . . . . . 91
9.7 Causal graph for the always-takers and never-takers . . . . . . . . . . . . . 91

11.1 Faithfulness counterexample graph. . . . . . . . . . . . . . . . . . . . . . . . . 100

11.3 Immorality Markov equivalence class . . . . . . . . . . . . . . . . . . . . . . . 101


11.2 Three Markov equivalent graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 101
11.5 Complete graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
11.6 True graph for PC example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
11.4 Chain/fork skeleton. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
11.8 Graph from PC after we’ve oriented the immoralities. . . . . . . . . . . . . . . 103
11.9 Graph from PC after we’ve oriented edges that would form immoralities if
they were oriented in the other (incorrect) direction. . . . . . . . . . . . . . . . 103
11.7 Illustration of the process of step 1 of PC, where we start with the complete
graph (left) and remove edges until we’ve identified the skeleton of the graph
(right), given that the true graph is the one in Figure 11.6. . . . . . . . . . . . . 103
11.10 Linear fits of linear non-Gaussian data . . . . . . . . . . . . . . . . . . . . . 108
11.11 Residuals of linear models fit to linear non-Gaussian data . . . . . . . . . . . . 108

A.1 Causal graph for frontdoor criterion . . . . . . . . . . . . . . . . . . . . . . . . 114

List of Tables

1.1 Simpson’s paradox in COVID-27 data . . . . . . . . . . . . . . . . . . . . . . . 1

2.1 Causal Inference as Missing Data Problem . . . . . . . . . . . . . . . . . . . . . 9

3.1 Exponential number of parameters for modeling factors . . . . . . . . . . . . . 20

Listings

2.1 Python code for estimating the ATE . . . . . . . . . . . . . . . . . . . . . . 17


2.2 Python code for estimating the ATE using the coefficient of linear regression 17

4.1 Python code for estimating the ATE, without adjusting for the collider . . 46
1 Motivation: Why You Might Care

1.1 Simpson’s Paradox

Consider a purely hypothetical future where there is a new disease known as COVID-27 that is prevalent in the human population. In this purely hypothetical future, there are two treatments that have been developed: treatment A and treatment B. Treatment B is more scarce than treatment A, so the split of those currently receiving treatment A vs. treatment B is roughly 73%/27%. You are in charge of choosing which treatment your country will exclusively use, in a country that only cares about minimizing loss of life.
You have data on the percentage of people who die from COVID-27,
given the treatment they were assigned and given their condition at the
time treatment was decided. Their condition is a binary variable: either
mild or severe. In this data, 16% of those who receive A die, whereas
19% of those who receive B die. However, when we examine the people
with mild condition separately from the people with severe condition,
the numbers reverse order. In the mild subpopulation, 15% of those who
receive A die, whereas 10% of those who receive B die. In the severe
subpopulation, 30% of those who receive A die, whereas 20% of those
who receive B die. We depict these percentages and the corresponding
counts in Table 1.1.

Table 1.1: Simpson’s paradox in COVID-27 data. The percentages denote the mortality rates in each of the groups. Lower is better. The numbers in parentheses are the corresponding counts. This apparent paradox stems from the interpretation that treatment A looks better when examining the whole population, but treatment B looks better in all subpopulations.

                            Condition
               Mild               Severe            Total
Treatment A    15% (210/1400)     30% (30/100)      16% (240/1500)
Treatment B    10% (5/50)         20% (100/500)     19% (105/550)
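To make the arithmetic behind Table 1.1 concrete, here is a small Python sketch (not one of the book’s listings; it simply recomputes the table) that derives the mortality rates from the raw counts and shows the reversal:

# Counts from Table 1.1: (deaths, group size) per treatment and condition.
counts = {
    ("A", "Mild"): (210, 1400), ("A", "Severe"): (30, 100),
    ("B", "Mild"): (5, 50),     ("B", "Severe"): (100, 500),
}

for treatment in ("A", "B"):
    mild_d, mild_n = counts[(treatment, "Mild")]
    severe_d, severe_n = counts[(treatment, "Severe")]
    total_rate = (mild_d + severe_d) / (mild_n + severe_n)
    print(treatment,
          f"Mild: {mild_d / mild_n:.0%}",
          f"Severe: {severe_d / severe_n:.0%}",
          f"Total: {total_rate:.0%}")
# A Mild: 15% Severe: 30% Total: 16%
# B Mild: 10% Severe: 20% Total: 19%
# Treatment B is better within each condition, yet treatment A looks better in total.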

The apparent paradox stems from the fact that, in Table 1.1, the “Total” column could be interpreted to mean that we should prefer treatment A, whereas the “Mild” and “Severe” columns could both be interpreted to mean that we should prefer treatment B.¹ In fact, the answer is that if we know someone’s condition, we should give them treatment B, and if we do not know their condition, we should give them treatment A. Just kidding... that doesn’t make any sense. So really, what treatment should you choose for your country?

Either treatment A or treatment B could be the right answer, depending on the causal structure of the data. In other words, causality is essential to solve Simpson’s paradox. For now, we will just give the intuition for when you should prefer treatment A vs. when you should prefer treatment B, but it will be made more formal in Chapter 4.

¹ A key ingredient necessary to find Simpson’s paradox is the non-uniformity of allocation of people to the groups. 1400 of the 1500 people who received treatment A had mild condition, whereas 500 of the 550 people who received treatment B had severe condition. Because people with mild condition are less likely to die, this means that the total mortality rate for those with treatment A is lower than what it would have been if mild and severe conditions were equally split among them. The opposite bias is true for treatment B.

Scenario 1 If the condition 𝐶 is a cause of the treatment 𝑇 (Figure 1.1), treatment B is more effective at reducing mortality 𝑌. An example scenario is where doctors decide to give treatment A to most people who have mild conditions. And they save the more expensive and more limited treatment B for people with severe conditions. Because having severe condition causes one to be more likely to die (𝐶 → 𝑌 in Figure 1.1) and causes one to be more likely to receive treatment B (𝐶 → 𝑇 in Figure 1.1), treatment B will be associated with higher mortality in the total population. In other words, treatment B is associated with a higher mortality rate simply because condition is a common cause of both treatment and mortality. Here, condition confounds the effect of treatment on mortality. To correct for this confounding, we must examine the relationship of 𝑇 and 𝑌 among patients with the same conditions. This means that the better treatment is the one that yields lower mortality in each of the subpopulations (the “Mild” and “Severe” columns in Table 1.1): treatment B.

[Figure 1.1: Causal structure of scenario 1, where condition 𝐶 is a common cause of treatment 𝑇 and mortality 𝑌. Given this causal structure, treatment B is preferable.]

Scenario 2 If the prescription² of treatment 𝑇 is a cause of the condition 𝐶 (Figure 1.2), treatment A is more effective. An example scenario is where treatment B is so scarce that it requires patients to wait a long time after they were prescribed the treatment before they can receive the treatment. Treatment A does not have this problem. Because the condition of a patient with COVID-27 worsens over time, the prescription of treatment B actually causes patients with mild conditions to develop severe conditions, causing a higher mortality rate. Therefore, even if treatment B is more effective than treatment A once administered (positive effect along 𝑇 → 𝑌 in Figure 1.2), because prescription of treatment B causes worse conditions (negative effect along 𝑇 → 𝐶 → 𝑌 in Figure 1.2), treatment B is less effective in total. Note: Because treatment B is more expensive, treatment B is prescribed with 0.27 probability, while treatment A is prescribed with 0.73 probability; importantly, treatment prescription is independent of condition in this scenario.

² 𝑇 refers to the prescription of the treatment, rather than the subsequent reception of the treatment.

[Figure 1.2: Causal structure of scenario 2, where treatment 𝑇 is a cause of condition 𝐶. Given this causal structure, treatment A is preferable.]
In sum, the more effective treatment is completely dependent on the
causal structure of the problem. In Scenario 1, where 𝐶 was a cause of
𝑇 (Figure 1.1), treatment B was more effective. In Scenario 2, where 𝑇
was a cause of 𝐶 (Figure 1.2), treatment A was more effective. Without
causality, Simpson’s paradox cannot be resolved. With causality, it is not
a paradox at all.

1.2 Applications of Causal Inference

Causal inference is essential to science, as we often want to make causal claims, rather than merely associational claims. For example, if we
are choosing between treatments for a disease, we want to choose the
treatment that causes the most people to be cured, without causing too
many bad side effects. If we want a reinforcement learning algorithm to
maximize reward, we want it to take actions that cause it to achieve the
maximum reward. If we are studying the effect of social media on mental
health, we are trying to understand what the main causes of a given
mental health outcome are and order these causes by the percentage of
the outcome that can be attributed to each cause.

Causal inference is essential for rigorous decision-making. For example, say we are considering several different policies to implement to reduce
greenhouse gas emissions, and we must choose just one due to budget
constraints. If we want to be maximally effective, we should carry out
causal analysis to determine which policy will cause the largest reduc-
tion in emissions. As another example, say we are considering several
interventions to reduce global poverty. We want to know which policies
will cause the largest reductions in poverty.
Now that we’ve gone through the general example of Simpson’s paradox
and a few specific examples in science and decision-making, we’ll move
to how causal inference is so different from prediction.

1.3 Correlation Does Not Imply Causation

Many of you will have heard the mantra “correlation does not imply
causation.” In this section, we will quickly review that and provide you
with a bit more intuition about why this is the case.

1.3.1 Nicolas Cage and Pool Drownings

It turns out that the yearly number of people who drown by falling into swimming pools has a high degree of correlation with the yearly number of films that Nicolas Cage appears in [1]. See Figure 1.3 for a graph of this data. Does this mean that Nicolas Cage encourages bad swimmers to hop in the pool in his films? Or does Nicolas Cage feel more motivated to act in more films when he sees how many drownings are happening that year, perhaps to try to prevent more drownings? Or is there some other explanation? For example, maybe Nicolas Cage is interested in increasing his popularity among causal inference practitioners, so he travels back in time to convince his past self to do just the right number of movies for us to see this correlation, but not too close of a match as that would arouse suspicion and potentially cause someone to prevent him from rigging the data this way. We may never know for sure.

[1]: Vigen (2015), Spurious correlations
Figure 1.3: The yearly number of movies Nicolas Cage appears in correlates with the yearly number of pool drownings [1].

Of course, all of the possible explanations in the preceding paragraph seem quite unlikely. Rather, it is likely that this is a spurious correlation, where there is no causal relationship. We’ll soon move on to a more illustrative example that will help clarify how spurious correlations can arise.

1.3.2 Why is Association Not Causation?

Before moving to the next example, let’s be a bit more precise about
terminology. “Correlation” is often colloquially used as a synonym
for statistical dependence. However, “correlation” is technically only a
measure of linear statistical dependence. We will largely be using the
term association to refer to statistical dependence from now on.
Causation is not all or none. For any given amount of association, it
does not need to be “all of the association is causal” or “none of the
association is causal.” Rather, it is possible to have a large amount of
association with only some of it being causal. The phrase “association
is not causation” simply means that the amount of association and the
amount of causation can be different. Some amount of association and
zero causation is a special case of “association is not causation.”
Say you happen upon some data that relates wearing shoes to bed and
waking up with a headache, as one does. It turns out that most times
that someone wears shoes to bed, that person wakes up with a headache.
And most times someone doesn’t wear shoes to bed, that person doesn’t
wake up with a headache. It is not uncommon for people to interpret
data like this (with associations) as meaning that wearing shoes to bed
causes people to wake up with headaches, especially if they are looking
for a reason to justify not wearing shoes to bed. A careful journalist might
make claims like “wearing shoes to bed is associated with headaches”
or “people who wear shoes to bed are at higher risk of waking up with
headaches.” However, the main reason to make claims like that is that
most people will internalize claims like that as “if I wear shoes to bed,
I’ll probably wake up with a headache.”
We can explain how wearing shoes to bed and headaches are associated
without either being a cause of the other. It turns out that they are
both caused by a common cause: drinking the night before. We depict
this in Figure 1.4. You might also hear this kind of variable referred
to as a “confounder” or a “lurking variable.” We will call this kind of
association confounding association since the association is facilitated by a
confounder.
The total association observed can be made up of both confounding association and causal association. It could be the case that wearing shoes to bed does have some small causal effect on waking up with a headache. Then, the total association would not be solely confounding association nor solely causal association. It would be a mixture of both. For example, in Figure 1.4, causal association flows along the arrow from shoe-sleeping to waking up with a headache. And confounding association flows along the path from shoe-sleeping to drinking to headachening (waking up with a headache). We will make the graphical interpretation of these different kinds of association clear in Chapter 3.

[Figure 1.4: Causal structure, where drinking the night before is a common cause of sleeping with shoes on and of waking up with a headache.]
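To see confounding association arise in data, here is a small simulation sketch (not from the book; the probabilities below are made up purely for illustration) in which drinking causes both shoe-sleeping and headaches, while shoe-sleeping has no causal effect on headaches:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
drink = rng.binomial(1, 0.3, n)                 # common cause (confounder)
shoes = rng.binomial(1, 0.1 + 0.8 * drink)      # caused by drinking only
headache = rng.binomial(1, 0.1 + 0.6 * drink)   # caused by drinking only

# Marginally, shoe-sleepers have more headaches (confounding association) ...
print(headache[shoes == 1].mean() - headache[shoes == 0].mean())  # clearly > 0
# ... but within each level of the confounder, the difference is roughly zero.
for d in (0, 1):
    h, s = headache[drink == d], shoes[drink == d]
    print(d, h[s == 1].mean() - h[s == 0].mean())                 # ≈ 0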

The Main Problem The main problem motivating causal inference is that association is not causation.³ If the two were the same, then causal inference would be easy. Traditional statistics and machine learning would already have causal inference solved, as measuring causation would be as simple as just looking at measures such as correlation and predictive performance in data. A large portion of this book will be about better understanding and solving this problem.

³ As we’ll see in Chapter 5, if we randomly assign the treatment in a controlled experiment, association actually is causation.

1.4 Main Themes

There are several overarching themes that will keep coming up through-
out this book. These themes will largely be comparisons of two different
categories. As you are reading, it is important that you understand which
categories different sections of the book fit into and which categories
they do not fit into.
Statistical vs. Causal Even with an infinite amount of data, we some-
times cannot compute some causal quantities. In contrast, much of
statistics is about addressing uncertainty in finite samples. When given
infinite data, there is no uncertainty. However, association, a statistical
concept, is not causation. There is more work to be done in causal infer-
ence, even after starting with infinite data. This is the main distinction
motivating causal inference. We have already made this distinction in
this chapter and will continue to make this distinction throughout the
book.
Identification vs. Estimation Identification of causal effects is unique
to causal inference. It is the problem that remains to solve, even when we
have infinite data. However, causal inference also shares estimation with
traditional statistics and machine learning. We will largely begin with
identification of causal effects (in Chapters 2, 4 and 6) before moving to
estimation of causal effects (in Chapter 7). The exceptions are Section 2.5
and Section 4.6.2, where we carry out complete examples with estimation
to give you an idea of what the whole process looks like early on.
Interventional vs. Observational If we can intervene/experiment,
identification of causal effects is relatively easy. This is simply because
we can actually take the action that we want to measure the causal effect
of and simply measure the effect after we take that action. Observational
data is where it gets more complicated because confounding is almost
always introduced into the data.
Assumptions There will be a large focus on what assumptions we are
using to get the results that we get. Each assumption will have its own
box to help make it difficult to not notice. Clear assumptions should make
it easy to see where critiques of a given causal analysis or causal model
will be. The hope is that presenting assumptions clearly will lead to more
lucid discussions about causality.
2 Potential Outcomes

In this chapter, we will ease into the world of causality. We will see that new concepts and corresponding notations need to be introduced to clearly describe causal concepts. These concepts are “new” in the sense that they may not exist in traditional statistics or math, but they should be familiar in that we use them in our thinking and describe them with natural language all the time.

Familiar statistical notation We will use 𝑇 to denote the random variable for treatment, 𝑌 to denote the random variable for the outcome of interest, and 𝑋 to denote covariates. In general, we will use uppercase letters to denote random variables (except in maybe one case) and lowercase letters to denote values that random variables take on. Much of what we consider will be settings where 𝑇 is binary. Know that, in general, we can extend things to work in settings where 𝑇 can take on more than two values or where 𝑇 is continuous.

2.1 Potential Outcomes and Individual Treatment Effects
We will now introduce the first causal concept to appear in this book. These concepts are sometimes characterized as being unique to the Neyman-Rubin [2–4] causal model (or potential outcomes framework), but they are not. For example, these same concepts are still present (just under different notation) in the framework that uses causal graphs (Chapters 3 and 4). It is important that you spend some time ensuring that you understand these initial causal concepts. If you have not studied causal inference before, they will be unfamiliar to see in mathematical contexts, though they may be quite familiar intuitively because we commonly think and communicate in causal language.

[2]: Splawa-Neyman (1923 [1990]), ‘On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9.’
[3]: Rubin (1974), ‘Estimating causal effects of treatments in randomized and nonrandomized studies.’
[4]: Sekhon (2008), ‘The Neyman-Rubin Model of Causal Inference and Estimation via Matching Methods’

Scenario 1 Consider the scenario where you are unhappy. And you are
considering whether or not to get a dog to help make you happy. If you
become happy after you get the dog, does this mean the dog caused you
to be happy? Well, what if you would have also become happy had you
not gotten the dog? In that case, the dog was not necessary to make you
happy, so its claim to a causal effect on your happiness is weak.
Scenario 2 Let’s switch things up a bit. Consider that you will still be
happy if you get a dog, but now, if you don’t get a dog, you will remain
unhappy. In this scenario, the dog has a pretty strong claim to a causal
effect on your happiness.
In both the above scenarios, we have used the causal concept known as
potential outcomes. Your outcome 𝑌 is happiness: 𝑌 = 1 corresponds to
happy while 𝑌 = 0 corresponds to unhappy. Your treatment 𝑇 is whether
or not you get a dog: 𝑇 = 1 corresponds to you getting a dog while 𝑇 = 0

corresponds to you not getting a dog. We denote by 𝑌(1) the potential outcome of happiness you would observe if you were to get a dog (𝑇 = 1).
Similarly, we denote by 𝑌(0) the potential outcome of happiness you
would observe if you were to not get a dog (𝑇 = 0). In scenario 1, 𝑌(1) = 1
and 𝑌(0) = 1. In contrast, in scenario 2, 𝑌(1) = 1 and 𝑌(0) = 0.
More generally, the potential outcome 𝑌(𝑡) denotes what your outcome
would be, if you were to take treatment 𝑡 . A potential outcome 𝑌(𝑡) is
distinct from the observed outcome 𝑌 in that not all potential outcomes
are observed. Rather all potential outcomes can potentially be observed.
The one that is actually observed depends on the value that the treatment
𝑇 takes on.
In the previous scenarios, there was only a single individual in the whole population: you. However, generally, there are many individuals¹ in the population of interest. We will denote the treatment, covariates, and outcome of the 𝑖th individual using 𝑇𝑖, 𝑋𝑖, and 𝑌𝑖. Then, we can define the individual treatment effect (ITE)² for individual 𝑖:

𝜏𝑖 ≜ 𝑌𝑖(1) − 𝑌𝑖(0)    (2.1)

¹ “Unit” is often used in the place of “individual” as the units of the population are not always people.
² The ITE is also known as the individual causal effect, unit-level causal effect, or unit-level treatment effect.

Whenever there is more than one individual in a population, 𝑌(𝑡) is a random variable because different individuals will have different potential outcomes. In contrast, 𝑌𝑖(𝑡) is usually treated as non-random³ because the subscript 𝑖 means that we are conditioning on so much individualized (and context-specific) information, that we restrict our focus to a single individual (in a specific context) whose potential outcomes are deterministic.

³ Though, 𝑌𝑖(𝑡) can be treated as random.
ITEs are some of the main quantities that we care about in causal
inference. For example, in scenario 2 above, you would choose to get
a dog because the causal effect of getting a dog on your happiness is
positive: 𝑌(1) − 𝑌(0) = 1 − 0 = 1. In contrast, in scenario 1, you might
choose to not get a dog because there is no causal effect of getting a dog
on your happiness: 𝑌(1) − 𝑌(0) = 1 − 1 = 0.
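As a tiny illustration (not from the book), the two dog scenarios can be written directly as pairs of potential outcomes, with the ITE of Equation 2.1 computed as their difference:

# Potential outcomes (Y(1), Y(0)) for the single individual in each scenario.
scenarios = {
    "scenario 1": {"Y1": 1, "Y0": 1},  # happy with or without the dog
    "scenario 2": {"Y1": 1, "Y0": 0},  # happy only with the dog
}
for name, po in scenarios.items():
    ite = po["Y1"] - po["Y0"]          # Equation 2.1: tau_i = Y_i(1) - Y_i(0)
    print(name, "ITE =", ite)
# scenario 1 ITE = 0  (the dog has no causal effect on your happiness)
# scenario 2 ITE = 1  (getting the dog causes you to be happy)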
Now that we’ve introduced potential outcomes and ITEs, we can intro-
duce the main problems that pop up in causal inference that are not
present in fields where the main focus is on association or prediction.

2.2 The Fundamental Problem of Causal Inference

It is impossible to observe all potential outcomes for a given individual [3]. Consider the dog example. You could observe 𝑌(1) by getting a dog and observing your happiness after getting a dog. Alternatively, you could observe 𝑌(0) by not getting a dog and observing your happiness.

However, you cannot observe both 𝑌(1) and 𝑌(0), unless you have a time
machine that would allow you to go back in time and choose the version
of treatment that you didn’t take the first time. You cannot simply get
a dog, observe 𝑌(1), give the dog away, and then observe 𝑌(0) because
the second observation will be influenced by all the actions you took
between the two observations and anything else that changed since the
first observation.

This is known as the fundamental problem of causal inference [5] (Holland, 1986). It is fundamental because if we cannot observe both 𝑌𝑖(1) and 𝑌𝑖(0), then we cannot observe the causal effect 𝑌𝑖(1) − 𝑌𝑖(0). This problem is unique
to causal inference because, in causal inference, we care about making
causal claims, which are defined in terms of potential outcomes. For
contrast, consider machine learning. In machine learning, we often only
care about predicting the observed outcome 𝑌 , so there is no need for
potential outcomes, which means machine learning does not have to
deal with this fundamental problem that we must deal with in causal
inference.
The potential outcomes that you do not (and cannot) observe are known
as counterfactuals because they are counter to fact (reality). “Potential
outcomes” are sometimes referred to as “counterfactual outcomes,” but
we will never do that in this book because a potential outcome 𝑌(𝑡)
does not become counter to fact until another potential outcome 𝑌(𝑡′) is
observed. The potential outcome that is observed is sometimes referred
to as a factual. Note that there are no counterfactuals or factuals until the
outcome is observed. Before that, there are only potential outcomes.

2.3 Getting Around the Fundamental Problem

I suspect this section is where this chapter might start to get a bit unclear.
If that is the case for you, don’t worry too much, and just continue to the
next chapter, as it will build up parallel concepts in a hopefully more
intuitive way.

2.3.1 Average Treatment Effects and Missing Data Interpretation

We know that we can’t access individual treatment effects, but what about average treatment effects? We get the average treatment effect (ATE)⁴ by taking an average over the ITEs:

𝜏 ≜ 𝔼[𝑌𝑖(1) − 𝑌𝑖(0)] = 𝔼[𝑌(1) − 𝑌(0)],    (2.2)

where the average is over the individuals 𝑖 if 𝑌𝑖(𝑡) is deterministic. If 𝑌𝑖(𝑡) is random, the average is also over any other randomness.

⁴ The ATE is also known as the “average causal effect (ACE).”
Okay, but how would we actually compute the ATE? Let’s look at
some made-up data in Table 2.1 for this. If you like examples, feel free to
substitute in the COVID-27 example from Section 1.1 or the dog-happiness
example from Section 2.1. We will take this table as the whole population
of interest. Because of the fundamental problem of causal inference, this
is fundamentally a missing data problem. All of the question marks in
the table indicate that we do not observe that cell.
A natural quantity that comes to mind is the associational difference: 𝔼[𝑌|𝑇 = 1] − 𝔼[𝑌|𝑇 = 0]. By linearity of expectation, we have that the ATE 𝔼[𝑌(1) − 𝑌(0)] = 𝔼[𝑌(1)] − 𝔼[𝑌(0)]. Then, maybe 𝔼[𝑌(1)] − 𝔼[𝑌(0)] equals 𝔼[𝑌|𝑇 = 1] − 𝔼[𝑌|𝑇 = 0]. Unfortunately, this is not true in general. If it were, that would mean that causation is simply association. 𝔼[𝑌|𝑇 = 1] − 𝔼[𝑌|𝑇 = 0] is an associational quantity, whereas 𝔼[𝑌(1)] − 𝔼[𝑌(0)] is a causal quantity. They are not equal due to confounding, which we discussed in Section 1.3. The graphical interpretation of this, depicted in Figure 2.1, is that 𝑋 confounds the effect of 𝑇 on 𝑌 because there is this 𝑇 ← 𝑋 → 𝑌 path that non-causal association flows along.⁵

Table 2.1: Example data to illustrate that the fundamental problem of causal inference can be interpreted as a missing data problem.

𝑖    𝑇    𝑌    𝑌(1)    𝑌(0)    𝑌(1) − 𝑌(0)
1    0    0     ?       0        ?
2    1    1     1       ?        ?
3    1    0     0       ?        ?
4    0    0     ?       0        ?
5    0    1     ?       1        ?
6    1    1     1       ?        ?

[Figure 2.1: Causal structure of 𝑋 confounding the effect of 𝑇 on 𝑌.]

⁵ Keep reading to Chapter 3, where we will flesh out and formalize this graphical interpretation.

2.3.2 Ignorability and Exchangeability

Well, what assumption(s) would make it so that the ATE is simply the associational difference? This is equivalent to saying “what makes it valid to calculate the ATE by taking the average of the 𝑌(0) column, ignoring the question marks, and subtracting that from the average of the 𝑌(1) column, ignoring the question marks?”⁶ This ignoring of the question marks (missing data) is known as ignorability. Assuming ignorability is like ignoring how people ended up selecting the treatment they selected and just assuming they were randomly assigned their treatment; we depict this graphically in Figure 2.2 by the lack of a causal arrow from 𝑋 to 𝑇. We will now state this assumption formally.

⁶ Active reading exercise: verify that this procedure is equivalent to 𝔼[𝑌|𝑇 = 1] − 𝔼[𝑌|𝑇 = 0] in the data in Table 2.1.

Assumption 2.1 (Ignorability / Exchangeability)

(𝑌(1), 𝑌(0)) ⊥⊥ 𝑇

[Figure 2.2: Causal structure when the treatment assignment mechanism is ignorable. Notably, this means there’s no arrow from 𝑋 to 𝑇, which means there is no confounding.]

This assumption is key to causal inference because it allows us to reduce the ATE to the associational difference:

𝔼[𝑌(1)] − 𝔼[𝑌(0)] = 𝔼[𝑌(1) | 𝑇 = 1] − 𝔼[𝑌(0) | 𝑇 = 0]    (2.3)
                  = 𝔼[𝑌 | 𝑇 = 1] − 𝔼[𝑌 | 𝑇 = 0]    (2.4)

The ignorability assumption is used in Equation 2.3. We will talk more about Equation 2.4 when we get to Section 2.3.5.
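The following sketch (not one of the book’s listings) works through the active reading exercise noted above: on the data in Table 2.1, averaging the observed 𝑌(1) and 𝑌(0) columns while ignoring the question marks gives exactly the associational difference 𝔼[𝑌 | 𝑇 = 1] − 𝔼[𝑌 | 𝑇 = 0]:

import numpy as np

# Observed columns of Table 2.1 for individuals 1-6.
T = np.array([0, 1, 1, 0, 0, 1])
Y = np.array([0, 1, 0, 0, 1, 1])

# Associational difference E[Y | T = 1] - E[Y | T = 0].
assoc_diff = Y[T == 1].mean() - Y[T == 0].mean()

# "Ignoring the question marks": Y(1) is observed exactly where T = 1,
# and Y(0) exactly where T = 0, so the two computations coincide.
y1_mean = Y[T == 1].mean()   # average of the Y(1) column, skipping '?'
y0_mean = Y[T == 0].mean()   # average of the Y(0) column, skipping '?'
print(assoc_diff, y1_mean - y0_mean)   # both equal 2/3 - 1/3 ≈ 0.33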
Another perspective on this assumption is that of exchangeability. Exchangeability means that the treatment groups are exchangeable in the sense that if they were swapped, the new treatment group would observe the same outcomes as the old treatment group, and the new control group would observe the same outcomes as the old control group. Formally, this assumption means 𝔼[𝑌(1)|𝑇 = 0] = 𝔼[𝑌(1)|𝑇 = 1] and 𝔼[𝑌(0)|𝑇 = 1] = 𝔼[𝑌(0)|𝑇 = 0], respectively. Then, this implies 𝔼[𝑌(1)|𝑇 = 𝑡] = 𝔼[𝑌(1)] and 𝔼[𝑌(0)|𝑇 = 𝑡] = 𝔼[𝑌(0)], for all 𝑡, which is nearly equivalent⁷ to Assumption 2.1.

⁷ Technically, this is mean exchangeability, which is a weaker assumption than the full exchangeability that we describe in Assumption 2.1 because it only constrains the first moment of the distribution. Generally, we only need mean ignorability/exchangeability for average treatment effects, but it is common to assume complete independence, as in Assumption 2.1.

An important intuition to have about exchangeability is that it guarantees that the treatment groups are comparable. In other words, the treatment groups are the same in all relevant aspects other than the treatment. This intuition is what underlies the concept of “controlling for” or “adjusting
for” variables, which we will discuss shortly when we get to conditional exchangeability.
We have leveraged Assumption 2.1 to identify causal effects. To identify
a causal effect is to reduce a causal expression to a purely statistical
expression. In this chapter, that means to reduce an expression from
one that uses potential outcome notation to one that uses only statistical
notation such as 𝑇 , 𝑋 , 𝑌 , expectations, and conditioning. This means that
we can calculate the causal effect from just the observational distribution
𝑃(𝑋 , 𝑇, 𝑌).

Definition 2.1 (Identifiability) A causal quantity (e.g. 𝔼[𝑌(𝑡)]) is identifiable if we can compute it from a purely statistical quantity (e.g. 𝔼[𝑌 | 𝑡]).

We have seen that ignorability is extremely important (Equation 2.3), but how realistic of an assumption is it? In general, it is completely unrealistic
because there is likely to be confounding in most data we observe (causal
structure shown in Figure 2.1). However, we can make this assumption
realistic by running randomized experiments, which force the treatment
to not be caused by anything but a coin toss, so then we have the causal
structure shown in Figure 2.2. We cover randomized experiments in
greater depth in Chapter 5.
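As a quick illustration of why randomization buys us ignorability, here is a small simulation sketch (not from the book; the data-generating numbers are made up): when treatment depends on the confounder 𝑋, the associational difference is biased for the ATE, but when treatment is assigned by a coin flip it is not.

import numpy as np

rng = np.random.default_rng(0)
n, true_ate = 500_000, 1.0
X = rng.normal(size=n)                    # confounder
Y0 = 2.0 * X + rng.normal(size=n)         # potential outcome under T = 0
Y1 = Y0 + true_ate                        # potential outcome under T = 1

def assoc_diff(T):
    """E[Y | T = 1] - E[Y | T = 0] using only the observed outcomes."""
    Y = np.where(T == 1, Y1, Y0)
    return Y[T == 1].mean() - Y[T == 0].mean()

T_obs = rng.binomial(1, 1 / (1 + np.exp(-2.0 * X)))  # treatment caused by X
T_rct = rng.binomial(1, 0.5, n)                      # treatment from a coin toss
print(assoc_diff(T_obs))   # well above 1.0: confounding bias
print(assoc_diff(T_rct))   # close to the true ATE of 1.0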
We have covered two prominent perspectives on this main assumption
(2.1): ignorability and exchangeability. Mathematically, these mean the
same thing, but their names correspond to different ways of thinking
about the same assumption. Exchangeability and ignorability are only
two names for this assumption. We will see more aliases after we cover
the more realistic, conditional version of this assumption.

2.3.3 Conditional Exchangeability and Unconfoundedness

In observational data, it is unrealistic to assume that the treatment groups are exchangeable. In other words, there is no reason to expect that the
groups are the same in all relevant variables other than the treatment.
However, if we control for relevant variables by conditioning, then maybe
the subgroups will be exchangeable. We will clarify what the “relevant
variables” are in Chapter 3, but for now, let’s just say they are all of the
covariates 𝑋 . Then, we can state conditional exchangeability formally.

Assumption 2.2 (Conditional Exchangeability / Unconfoundedness)

(𝑌(1), 𝑌(0)) ⊥⊥ 𝑇 | 𝑋

The idea is that although the treatment and potential outcomes may
be unconditionally associated (due to confounding), within levels of 𝑋 ,
they are not associated. In other words, there is no confounding within
levels of 𝑋 because controlling for 𝑋 has made the treatment groups
comparable. We’ll now give a bit of graphical intuition for the above. We
will not draw the rigorous connection between the graphical intuition
and Assumption 2.2 until Chapter 3; for now, it is just meant to aid
intuition.

We do not have exchangeability in the data because 𝑋 is a common cause of 𝑇 and 𝑌. We illustrate this in Figure 2.3. Because 𝑋 is a common cause of 𝑇 and 𝑌, there is non-causal association between 𝑇 and 𝑌. This non-causal association flows along the 𝑇 ← 𝑋 → 𝑌 path; we depict this with a red dashed arc.

Figure 2.3: Causal structure of 𝑋 confounding the effect of 𝑇 on 𝑌. We depict the confounding with a red dashed line.

However, we do have conditional exchangeability in the data. This is because, when we condition on 𝑋, there is no longer any non-causal association between 𝑇 and 𝑌. The non-causal association is now “blocked” at 𝑋 by conditioning on 𝑋. We illustrate this blocking in Figure 2.4 by shading 𝑋 to indicate it is conditioned on and by showing the red dashed arc being blocked there.

Conditional exchangeability is the main assumption necessary for causal inference. Armed with this assumption, we can identify the causal effect within levels of 𝑋, just like we did with (unconditional) exchangeability:

𝔼[𝑌(1) − 𝑌(0) | 𝑋] = 𝔼[𝑌(1) | 𝑋] − 𝔼[𝑌(0) | 𝑋]   (2.5)
= 𝔼[𝑌(1) | 𝑇 = 1, 𝑋] − 𝔼[𝑌(0) | 𝑇 = 0, 𝑋]   (2.6)
= 𝔼[𝑌 | 𝑇 = 1, 𝑋] − 𝔼[𝑌 | 𝑇 = 0, 𝑋]   (2.7)

Figure 2.4: Illustration of conditioning on 𝑋 leading to no confounding.

In parallel to before, we get Equation 2.5 by linearity of expectation. And we now get Equation 2.6 by conditional exchangeability. If we want the marginal effect that we had before when assuming (unconditional) exchangeability, we can get that by simply marginalizing out 𝑋:

𝔼[𝑌(1) − 𝑌(0)] = 𝔼𝑋 𝔼[𝑌(1) − 𝑌(0) | 𝑋]   (2.8)
= 𝔼𝑋 [𝔼[𝑌 | 𝑇 = 1, 𝑋] − 𝔼[𝑌 | 𝑇 = 0, 𝑋]]   (2.9)

This marks an important result for causal inference, so we’ll give it its
own proposition box. The proof we give above leaves out some details.
Read through to Section 2.3.6 (where we redo the proof with all details
specified) to get the rest of the details. We will call this result the adjustment
formula.

Theorem 2.1 (Adjustment Formula) Given the assumptions of unconfoundedness, positivity, consistency, and no interference, we can identify the average treatment effect:

𝔼[𝑌(1) − 𝑌(0)] = 𝔼𝑋 [𝔼[𝑌 | 𝑇 = 1, 𝑋] − 𝔼[𝑌 | 𝑇 = 0, 𝑋]]

Conditional exchangeability (Assumption 2.2) is a core assumption for


causal inference and goes by many names. For example, the following
are reasonably commonly used to refer to the same assumption: un-
confoundedness, conditional ignorability, no unobserved confounding,
selection on observables, no omitted variable bias, etc. We will use the
name “unconfoundedness” a fair amount throughout this book.
The main reason for moving from exchangeability (Assumption 2.1) to
conditional exchangeability (Assumption 2.2) was that it seemed like a
more realistic assumption. However, we often cannot know for certain
if conditional exchangeability holds. There may be some unobserved
confounders that are not part of 𝑋 , meaning conditional exchangeability
is violated. Fortunately, that is not a problem in randomized experiments

(Chapter 5). Unfortunately, it is something that we must always be conscious of in observational data. Intuitively, the best thing we can do is to observe and fit as many covariates into 𝑋 as possible to try to ensure unconfoundedness.8
8: As we will see in Chapters 3 and 4, it is not necessarily true that conditioning on more covariates always helps our causal estimates be less biased.
2.3.4 Positivity/Overlap and Extrapolation

While conditioning on many covariates is attractive for achieving uncon-


foundedness, it can actually be detrimental for another reason that has
to do with another important assumption that we have yet to discuss:
positivity. We will get to why at the end of this section. Positivity is the
condition that all subgroups of the data with different covariates have
some probability of receiving any value of treatment. Formally, we define
positivity for binary treatment as follows.

Assumption 2.3 (Positivity / Overlap / Common Support) For all


values of covariates 𝑥 present in the population of interest (i.e. 𝑥 such that
𝑃(𝑋 = 𝑥) > 0),
0 < 𝑃(𝑇 = 1 | 𝑋 = 𝑥) < 1

To see why positivity is important, let’s take a closer look at Equation 2.9:

𝔼[𝑌(1) − 𝑌(0)] = 𝔼𝑋 [𝔼[𝑌 | 𝑇 = 1, 𝑋] − 𝔼[𝑌 | 𝑇 = 0, 𝑋]]


(2.9 revisited)
In short, if we have a positivity violation, then we will be conditioning
on a zero probability event. This is because there will be some value
of 𝑥 with non-zero probability for which 𝑃(𝑇 = 1 | 𝑋 = 𝑥) = 0 or
𝑃(𝑇 = 0 | 𝑋 = 𝑥) = 0. This means that for some value of 𝑥 that we
are marginalizing out in the above equation, 𝑃(𝑇 = 1 , 𝑋 = 𝑥) = 0 or
𝑃(𝑇 = 0 , 𝑋 = 𝑥) = 0, and these are the two events that we condition on
in Equation 2.9.
To clearly see how a positivity violation translates to division by zero,
let’s rewrite the right-hand side of Equation 2.9. For discrete covariates
and outcome, it can be rewritten as follows:
∑𝑥 𝑃(𝑋 = 𝑥) (∑𝑦 𝑦 𝑃(𝑌 = 𝑦 | 𝑇 = 1, 𝑋 = 𝑥) − ∑𝑦 𝑦 𝑃(𝑌 = 𝑦 | 𝑇 = 0, 𝑋 = 𝑥))   (2.10)

Then, applying Bayes’ rule, this can be further rewritten:

∑𝑥 𝑃(𝑋 = 𝑥) (∑𝑦 𝑦 𝑃(𝑌 = 𝑦, 𝑇 = 1, 𝑋 = 𝑥) / [𝑃(𝑇 = 1 | 𝑋 = 𝑥) 𝑃(𝑋 = 𝑥)] − ∑𝑦 𝑦 𝑃(𝑌 = 𝑦, 𝑇 = 0, 𝑋 = 𝑥) / [𝑃(𝑇 = 0 | 𝑋 = 𝑥) 𝑃(𝑋 = 𝑥)])   (2.11)
In Equation 2.11, we can clearly see why positivity is essential. If
𝑃(𝑇 = 1 | 𝑋 = 𝑥) = 0 for any level of covariates 𝑥 with non-zero prob-
ability, then there is division by zero in the first term in the equation,
so 𝔼𝑋 𝔼[𝑌 | 𝑇 = 1 , 𝑋] is undefined. Similarly, if 𝑃(𝑇 = 1 | 𝑋 = 𝑥) = 1
for any level of 𝑥 , then 𝑃(𝑇 = 0 | 𝑋 = 𝑥) = 0, so there is division by
zero in the second term and 𝔼𝑋 𝔼[𝑌 | 𝑇 = 0 , 𝑋] is undefined. With
either of these violations of the positivity assumption, the causal effect is
undefined.

Intuition That’s the math for why we need the positivity assumption,
but what’s the intuition? Well, if we have a positivity violation, that
means that within some subgroup of the data, everyone always receives
treatment or everyone always receives the control. It wouldn’t make
sense to be able to estimate a causal effect of treatment vs. control in that
subgroup since we see only treatment or only control. We never see the
alternative in that subgroup.
Another name for positivity is overlap. The intuition for this name is that we want the covariate distribution of the treatment group to overlap with the covariate distribution of the control group. More specifically, we want 𝑃(𝑋 | 𝑇 = 1)9 to have the same support as 𝑃(𝑋 | 𝑇 = 0).10 This is why another common alias for positivity is common support.
9: Whenever we use a random variable (denoted by a capital letter) as the argument for 𝑃, we are referring to the whole distribution, rather than just the scalar that something like 𝑃(𝑥 | 𝑇 = 1) refers to.
10: Active reading exercise: convince yourself that this formulation of overlap/positivity is equivalent to the formulation in Assumption 2.3.

The Positivity-Unconfoundedness Tradeoff Although conditioning on more covariates could lead to a higher chance of satisfying unconfoundedness, it can lead to a higher chance of violating positivity. As we increase the dimension of the covariates, we make the subgroups for any level 𝑥 of the covariates smaller.11 As each subgroup gets smaller, there is a higher and higher chance that either the whole subgroup will have treatment or the whole subgroup will have control. For example, once the size of any subgroup has decreased to one, positivity is guaranteed to not hold. See [6] for a rigorous argument of high-dimensional covariates leading to positivity violations.
11: This is related to the curse of dimensionality.
[6]: D’Amour et al. (2017), Overlap in Observational Studies with High-Dimensional Covariates
Extrapolation Violations of the positivity assumption can actually lead to demanding too much from models and getting very bad behavior in return. Many causal effect estimators12 fit a model to 𝔼[𝑌 | 𝑡, 𝑥] using the (𝑡, 𝑥, 𝑦) tuples as data. The inputs to these models are (𝑡, 𝑥) pairs and the outputs are the corresponding outcomes. These models will be forced to extrapolate in regions (using their parametric assumptions) where 𝑃(𝑇 = 1, 𝑋 = 𝑥) = 0 and regions where 𝑃(𝑇 = 0, 𝑋 = 𝑥) = 0 when they are used in the adjustment formula (Theorem 2.1) in place of the corresponding conditional expectations.
12: An “estimator” is a function that takes a dataset as input and outputs an estimate. We discuss this statistics terminology more in Section 2.4.
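To make the positivity/overlap condition concrete, here is a minimal sketch (not from the original text) of a common diagnostic: fit a model for the propensity score 𝑃(𝑇 = 1 | 𝑋) and check whether any estimated probabilities are pushed against 0 or 1. The DataFrame df and its column names are hypothetical and borrow from the sodium example introduced later in Section 2.5.

# Hedged sketch: diagnosing possible positivity/overlap violations by
# inspecting estimated propensity scores P(T = 1 | X). The DataFrame `df`
# and its columns are hypothetical (borrowed from the later sodium example).
from sklearn.linear_model import LogisticRegression

X = df[['age', 'proteinuria']]   # covariates
t = df['sodium']                 # binarized treatment (0/1)

propensity_model = LogisticRegression()
propensity_model.fit(X, t)
e_hat = propensity_model.predict_proba(X)[:, 1]  # estimated P(T = 1 | X)

# If estimated propensities pile up near 0 or 1, positivity is suspect:
# some covariate subgroups (almost) never receive one of the treatments.
print('min propensity:', e_hat.min())
print('max propensity:', e_hat.max())
print('fraction outside [0.05, 0.95]:', ((e_hat < 0.05) | (e_hat > 0.95)).mean())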

2.3.5 No interference, Consistency, and SUTVA

There are a few additional assumptions we’ve been smuggling in through-


out this chapter. We will specify all the rest of these assumptions in this
section. The first assumption in this section is that of no interference.
No interference means that my outcome is unaffected by anyone else’s
treatment. Rather, my outcome is only a function of my own treatment.
We’ve been using this assumption implicitly throughout this chapter.
We’ll now formalize it.

Assumption 2.4 (No Interference)

𝑌𝑖 (𝑡1 , . . . , 𝑡 𝑖−1 , 𝑡 𝑖 , 𝑡 𝑖+1 , . . . , 𝑡 𝑛 ) = 𝑌𝑖 (𝑡 𝑖 )

Of course, this assumption could be violated. For example, if the treatment


is “get a dog” and the outcome is my happiness, it could easily be that my
happiness is influenced by whether or not my friends get dogs because
we could end up hanging out more to have our dogs play together. As you

might expect, violations of the no interference assumption are rampant


in network data.
The last assumption is consistency. Consistency is the assumption that
the outcome we observe 𝑌 is actually the potential outcome under the
observed treatment 𝑇 .

Assumption 2.5 (Consistency) If the treatment is 𝑇 , then the observed


outcome 𝑌 is the potential outcome under treatment 𝑇 . Formally,

𝑇 = 𝑡 =⇒ 𝑌 = 𝑌(𝑡) (2.12)

We could write this equivalently as follows:

𝑌 = 𝑌(𝑇) (2.13)

Note that 𝑇 is different from 𝑡 , and 𝑌(𝑇) is different from 𝑌(𝑡). 𝑇 is a


random variable that corresponds to the observed treatment, whereas 𝑡
is a specific value of treatment. Similarly, 𝑌(𝑡) is the potential outcome for
some specific value of treatment, whereas 𝑌(𝑇) is the potential outcome
for the actual value of treatment that we observe.
When we were using exchangeability to prove identifiability, we actually
assumed consistency in Equation 2.4 to get the following equality:

𝔼[𝑌(1) | 𝑇 = 1] − 𝔼[𝑌(0) | 𝑇 = 0] = 𝔼[𝑌 | 𝑇 = 1] − 𝔼[𝑌 | 𝑇 = 0]

Similarly, when we were using conditional exchangeability to prove


identifiability, we assumed consistency in Equation 2.7.
It might seem like consistency is obviously true, but that is not always the
case. For example, if the treatment specification is simply “get a dog” or
“don’t get a dog,” this can be too coarse to yield consistency. It might be
that if I were to get a puppy, I would observe 𝑌 = 1 (happiness) because
I needed an energetic friend, but if I were to get an old, low-energy dog, I
would observe 𝑌 = 0 (unhappiness). However, both of these treatments
fall under the category of “get a dog,” so both correspond to 𝑇 = 1. This
means that 𝑌(1) is not well defined, since it will be 1 or 0, depending
on something that is not captured by the treatment specification. In
this sense, consistency encompasses the assumption that is sometimes
referred to as “no multiple versions of treatment.” See Sections 3.4 and
3.5 of Hernán and Robins [7] and references therein for more discussion [7]: Hernán and Robins (2020), Causal In-
ference: What If
on this topic.
SUTVA You will also commonly see the stable unit-treatment value
assumption (SUTVA) in the literature. SUTVA is satisfied if unit (individual)
𝑖 ’s outcome is simply a function of unit 𝑖 ’s treatment. Therefore, SUTVA is
a combination of consistency and no interference (and also deterministic
potential outcomes).13 13 Active reading exercise: convince your-
self that SUTVA is a combination of con-
sistency and no inference
2.3.6 Tying It All Together

We introduced unconfoundedness (conditional exchangeability) first


because it is the main causal assumption. However, all of the assumptions
are necessary:

1. Unconfoundedness (Assumption 2.2)


2. Positivity (Assumption 2.3)
3. No interference (Assumption 2.4)
4. Consistency (Assumption 2.5)
We’ll now review the proof of the adjustment formula (Theorem 2.1)
that was done in Equation 2.5 through Equation 2.9 and list which
assumptions are used for each step. Even before we get to these equations,
we use the no interference assumption to justify that the quantity we
should be looking at for causal inference is 𝔼[𝑌(1) − 𝑌(0)], rather than
something more complex like the left-hand side of Assumption 2.4. In
the proof below, the first two equalities follow from mathematical facts,
whereas the last two follow from these key assumptions.

Proof of Theorem 2.1.

𝔼[𝑌(1) − 𝑌(0)] = 𝔼[𝑌(1)] − 𝔼[𝑌(0)] (linearity of expectation)


= 𝔼𝑋 [𝔼[𝑌(1) | 𝑋] − 𝔼[𝑌(0) | 𝑋]]
(law of iterated expectations)
= 𝔼𝑋 [𝔼[𝑌(1) | 𝑇 = 1 , 𝑋] − 𝔼[𝑌(0) | 𝑇 = 0 , 𝑋]]
(unconfoundedness and positivity)
= 𝔼𝑋 [𝔼[𝑌 | 𝑇 = 1 , 𝑋] − 𝔼[𝑌 | 𝑇 = 0 , 𝑋]]
(consistency)

That’s how all of these assumptions tie together to give us identifiability


of the ATE. We’ll soon see how to use this result to get an actual estimated
number for the ATE.

2.4 Fancy Statistics Terminology Defancified

Before we start computing concrete numbers for the ATE, we must


quickly introduce some terminology from statistics that will help clarify
the discussion. An estimand is the quantity that we want to estimate. For
example, 𝔼𝑋 [𝔼[𝑌 | 𝑇 = 1 , 𝑋] − 𝔼[𝑌 | 𝑇 = 0 , 𝑋]] is the estimand we care
about for estimating the ATE. An estimate (noun) is an approximation of
some estimand, which we get using data. We will see concrete numbers
in the next section; these are estimates. Given some estimand 𝛼 , we write
an estimate of that estimand by simply putting a hat on it: 𝛼ˆ . And an
estimator is a function that maps a dataset to an estimate of the estimand.
The process that we will use to go from data + estimand to a concrete
number is known as estimation. To estimate (verb) is to feed data into an
estimator to get an estimate.
In this book, we will use even more specific language that allows us to make the distinction between causal quantities and statistical quantities. We will use the phrase causal estimand to refer to any estimand that contains a potential outcome in it. We will use the phrase statistical estimand to denote the complement: any estimand that does not contain a potential outcome.14
14: As we will see in Chapter 4, we will equivalently refer to a causal estimand as any estimand that contains a do-operator, and we will refer to a statistical estimand as any estimand that does not contain a do-operator.

For an example, recall the adjustment formula (Theorem 2.1):

𝔼[𝑌(1) − 𝑌(0)] = 𝔼𝑋 [𝔼[𝑌 | 𝑇 = 1, 𝑋] − 𝔼[𝑌 | 𝑇 = 0, 𝑋]] (2.14)

𝔼[𝑌(1) − 𝑌(0)] is the causal estimand that we are interested in. In order
to actually estimate this causal estimand, we must translate it into a
statistical estimand: 𝔼𝑋 [𝔼[𝑌 | 𝑇 = 1, 𝑋] − 𝔼[𝑌 | 𝑇 = 0, 𝑋]].15
15: Active reading exercise: Why can’t we directly estimate a causal estimand without first translating it to a statistical estimand?

When we say “identification” in this book, we are referring to the process of moving from a causal estimand to an equivalent statistical estimand. When we say “estimation,” we are referring to the process of moving from a statistical estimand to an estimate. We illustrate this in the flowchart in Figure 2.5.

Causal Estimand → (Identification) → Statistical Estimand → (Estimation) → Estimate

Figure 2.5: The Identification-Estimation Flowchart – a flowchart that illustrates the process of moving from a target causal estimand to a corresponding estimate, through identification and estimation.

What do we do when we go to actually estimate quantities such as


𝔼𝑋 [𝔼[𝑌 | 𝑇 = 1, 𝑋] − 𝔼[𝑌 | 𝑇 = 0, 𝑋]]? We will often use a model (e.g.
linear regression or some more fancy predictor from machine learning)
in place of the conditional expectations 𝔼[𝑌 | 𝑇 = 𝑡, 𝑋 = 𝑥]. We will
refer to estimators that use models like this as model-assisted estimators.
Now that we’ve gotten some of this terminology out of the way, we can
proceed to an example of estimating the ATE.

2.5 A Complete Example with Estimation

Theorem 2.1 and the corresponding recent copy in Equation 2.14 give
us identification. However, we haven’t discussed estimation at all. In
this section, we will give a short example complete with estimation. We
will cover the topic of estimation of causal effects more completely in
Chapter 7.
We use Luque-Fernandez et al. [8]’s example from epidemiology. The outcome 𝑌 of interest is (systolic) blood pressure. This is an important outcome because roughly 46% of Americans have high blood pressure, and high blood pressure is associated with increased risk of mortality [9]. The “treatment” 𝑇 of interest is sodium intake. Sodium intake is a continuous variable; in order to easily apply Equation 2.14, which is specified for binary treatment, we will binarize 𝑇 by letting 𝑇 = 1 denote daily sodium intake above 3.5 grams and letting 𝑇 = 0 denote daily sodium intake below 3.5 grams.16 We will be estimating the causal effect of sodium intake on blood pressure. In our data, we also have the age of the individuals and amount of protein in their urine as covariates 𝑋.
16: As we will see, this binarization is purely pedagogical and does not reflect any limitations of adjusting for confounders.
[8]: Luque-Fernandez et al. (2018), ‘Educational Note: Paradoxical collider effect in the analysis of non-communicable disease epidemiological data: a reproducible illustration and web application’
[9]: Virani et al. (2020), ‘Heart Disease and Stroke Statistics—2020 Update: A Report From the American Heart Association’

Luque-Fernandez et al. [8] run a simulation, taking care to be sure that


the range of values is “biologically plausible and as close to reality as
possible.”
Because we are using data from a simulation, we know that the true ATE
of sodium on blood pressure is 1.05. More concretely, the line of code
that generates blood pressure 𝑌 looks as follows:
blood_pressure = 1.05 * sodium + ...

Now, how do we actually estimate the ATE? First, we assume consistency,


positivity, and unconfoundedness given 𝑋 . As we recently recalled in
Equation 2.14, this means that we’ve identified the ATE as

𝔼𝑋 [𝔼[𝑌 | 𝑇 = 1, 𝑋] − 𝔼[𝑌 | 𝑇 = 0, 𝑋]] .

We then take that outer expectation over 𝑋 and replace it with an


empirical mean over the data, giving us the following:

(1/𝑛) ∑𝑖 [𝔼[𝑌 | 𝑇 = 1, 𝑋 = 𝑥𝑖] − 𝔼[𝑌 | 𝑇 = 0, 𝑋 = 𝑥𝑖]]   (2.15)

To complete our estimator, we then fit some machine learning model to


the conditional expectation 𝔼[𝑌 | 𝑡, 𝑥]. Minimizing the mean-squared
error (MSE) of predicting 𝑌 from (𝑇, 𝑋) pairs is equivalent to modeling
this conditional expectation [see, e.g., 10, Section 2.4]. Therefore, we can plug in any machine learning model for 𝔼[𝑌 | 𝑡, 𝑥], which gives us a model-assisted estimator. We’ll use linear regression here, which works out nicely since blood pressure is generated as a linear combination of other variables, in this simulation. We give Python code for this below, where our data are in a Pandas DataFrame called df. We fit the model for 𝔼[𝑌 | 𝑡, 𝑥] in line 8, and we take the empirical mean over 𝑋 in lines 10-14.
[10]: Hastie et al. (2001), The Elements of Statistical Learning

Listing 2.1: Python code for estimating the ATE

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

Xt = df[['sodium', 'age', 'proteinuria']]
y = df['blood_pressure']
model = LinearRegression()
model.fit(Xt, y)

Xt1 = pd.DataFrame.copy(Xt)
Xt1['sodium'] = 1
Xt0 = pd.DataFrame.copy(Xt)
Xt0['sodium'] = 0
ate_est = np.mean(model.predict(Xt1) - model.predict(Xt0))
print('ATE estimate:', ate_est)

Full code, complete with simulation, is available at https://github.com/bradyneal/causal-book-code/blob/master/sodium_example.py.


on only 𝑇 , which corresponds to replacing line 5 in Listing 2.1 with
Xt = df[['sodium']],17 we would get an ATE estimate of 5.33. That’s a 17 Active reading exercise: This
| 5.33−1.05 |
1.05 × 100% = 407% error! In contrast, when we control for 𝑋 (as in naive version is equivalent to just
|.85−1.05 | taking the associational difference:
Listing 2.1), our percent error is only 1.05 × 100% = 19%. 𝔼[𝑌 | 𝑇 = 1] − 𝔼[𝑌 | 𝑇 = 0]. Why?
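To make footnote 17 concrete, here is a small sketch (not part of the original text, and assuming the same hypothetical df with a binarized sodium column) of the purely associational difference, which is exactly what the naive regression of 𝑌 on 𝑇 alone recovers:

# Hedged sketch: the associational difference E[Y | T = 1] - E[Y | T = 0],
# computed directly from the (hypothetical) DataFrame `df`, assuming the
# sodium column has been binarized to 0/1 as described above.
naive_diff = (df.loc[df['sodium'] == 1, 'blood_pressure'].mean()
              - df.loc[df['sodium'] == 0, 'blood_pressure'].mean())
print('Associational difference:', naive_diff)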

All of the above is done using the adjustment formula with model-assisted
estimation, where we first fit a model for the conditional expectation
𝔼[𝑌 | 𝑡, 𝑥], and then we take an empirical mean over 𝑋 , using that model.
However, because we are using a linear model, this is equivalent to just
taking the coefficient in front of 𝑇 in the linear regression as the ATE
estimate. This is what we do in the following code (which gives the exact
same ATE estimate):

Listing 2.2: Python code for estimating the ATE using the coefficient of linear regression

Xt = df[['sodium', 'age', 'proteinuria']]
y = df['blood_pressure']
model = LinearRegression()
model.fit(Xt, y)
ate_est = model.coef_[0]
print('ATE estimate:', ate_est)

Continuous Treatment What if we allow the treatment, daily sodium intake, to remain continuous, instead of binarizing it? The cool thing about just taking the regression coefficient as the ATE estimate is that it doesn’t require taking a difference between two values of treatment (e.g. 𝑇 = 1 and 𝑇 = 0), so it trivially generalizes to when 𝑇 is continuous. When 𝑇 is continuous, we care about how 𝔼[𝑌(𝑡)] changes with 𝑡. Since we are assuming 𝔼[𝑌(𝑡)] is linear, this change is completely captured by (𝑑/𝑑𝑡) 𝔼[𝑌(𝑡)].18 When 𝔼[𝑌(𝑡)] is linear, it turns out that this quantity is exactly what taking the coefficient from linear regression estimates. Seemingly magically, we have compressed all of 𝔼[𝑌(𝑡)] = 𝔼[𝑌 | 𝑡], which is a function of 𝑡, into a single value.
18: Concisely summarizing nonlinear functions 𝔼[𝑌(𝑡)] is an open problem. See, e.g., Janzing et al. [11].
[11]: Janzing et al. (2013), ‘Quantifying causal influences’

comes as a cost: the linear parametric form we assumed. If this model were
misspecified,19 our ATE estimate would be biased. And because linear 19 By “misspecified,” we mean that the
models are so simple, they will likely be misspecified. For example, the functional form of the model does not
match the functional form of the data gen-
following assumption is implicit in assuming that a linear model is well-
erating process.
specified: the treatment effect is the same for all individuals. See Morgan
and Winship [12, Sections 6.2 and 6.3] for a more complete critique of [12]: Morgan and Winship (2014), Counter-
using the coefficient in front of treatment as the ATE estimate. factuals and Causal Inference: Methods and
Principles for Social Research
The Flow of Association and
Causation in Graphs 3
We’ve been using causal graphs in the previous chapters to aid intuition. In this chapter, we will introduce the formalisms that underlie this intuition. Hopefully, we have sufficiently motivated this chapter and made the utility of graphical models clear with all of the graphical interpretations of concepts in previous chapters.

3.1 Graph Terminology

In this section, we will use the terminology machine gun (see Figure 3.1). To be able to use nice convenient graph language in the following sections, rapid-firing a lot of graph terminology is a necessary evil, unfortunately.
The term “graph” is often used to describe a variety of visualizations. For example, “graph” might refer to a visualization of a single variable function 𝑓(𝑥), where 𝑥 is plotted on the 𝑥-axis and 𝑓(𝑥) is plotted on the 𝑦-axis. Or “bar graph” might be used as a synonym for a bar chart. However, in graph theory, the term “graph” refers to a specific mathematical object.

Figure 3.1: Terminology machine gun
A graph is a collection of nodes (also called “vertices”) and edges that connect the nodes. For example, in Figure 3.2, 𝐴, 𝐵, 𝐶, and 𝐷 are the nodes of the graph, and the lines connecting them are the edges. Figure 3.2 is called an undirected graph because the edges do not have any direction. In contrast, Figure 3.3 is a directed graph. A directed graph’s edges go out of a parent node and into a child node, with the arrows signifying which direction the edges are going. We will denote the parents of a node 𝑋 with pa(𝑋). We’ll use an even simpler shorthand when the nodes are ordered so that we can denote the 𝑖th node by 𝑋𝑖; in that case, we will also denote the parents of 𝑋𝑖 by pa𝑖. Two nodes are said to be adjacent if they are connected by an edge. For example, in both Figure 3.2 and Figure 3.3, 𝐴 and 𝐶 are adjacent, but 𝐴 and 𝐷 are not.

Figure 3.2: Undirected graph
A path in a graph is any sequence of adjacent nodes, regardless of the direction of the edges that join them. For example, 𝐴 — 𝐶 — 𝐵 is a path in Figure 3.2, and 𝐴 → 𝐶 ← 𝐵 is a path in Figure 3.3. A directed path is a path that consists of directed edges that are all directed in the same direction (no two edges along the path both point into or both point out of the same node). For example, 𝐴 → 𝐶 → 𝐷 is a directed path in Figure 3.3, but 𝐴 → 𝐶 ← 𝐵 and 𝐶 ← 𝐴 → 𝐵 are not.

Figure 3.3: Directed graph

If there is a directed path that starts at node 𝑋 and ends at node 𝑌, then 𝑋 is an ancestor of 𝑌, and 𝑌 is a descendant of 𝑋. We will denote descendants of 𝑋 by de(𝑋). For example, in Figure 3.3, 𝐴 is an ancestor of 𝐵 and 𝐷, and 𝐵 and 𝐷 are both descendants of 𝐴 (de(𝐴)). If 𝑋 is an ancestor of itself, then some funky time travel has taken place. In seriousness, a directed path from some node 𝑋 back to itself is known as a cycle (see Figure 3.4). If there are no cycles in a directed graph, the graph is known as a directed acyclic graph (DAG). The graphs we focus on in this book will mostly be DAGs.

Figure 3.4: Directed graph with cycle
If two parents 𝑋 and 𝑌 share some child 𝑍, but there is no edge connecting 𝑋 and 𝑌, then 𝑋 → 𝑍 ← 𝑌 is known as an immorality. Seriously; that’s a real term in graphical models. For example, if we remove the 𝐴 → 𝐵 edge from Figure 3.3 to get Figure 3.5, then 𝐴 → 𝐶 ← 𝐵 is an immorality.

Figure 3.5: Directed graph with immorality
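As a small illustration of this terminology (a sketch, not part of the original text), a DAG like the one in Figure 3.3 can be encoded with the networkx library; the edge list below is our reading of that figure, and the library provides helpers for parents (predecessors), descendants, ancestors, and acyclicity:

# Hedged sketch: encoding a DAG like Figure 3.3 (edges as we read the
# figure: A -> B, A -> C, B -> C, C -> D) and querying the terminology
# defined above.
import networkx as nx

G = nx.DiGraph([('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'D')])

print('parents pa(C):', sorted(G.predecessors('C')))              # ['A', 'B']
print('descendants de(A):', sorted(nx.descendants(G, 'A')))       # ['B', 'C', 'D']
print('ancestors of D:', sorted(nx.ancestors(G, 'D')))            # ['A', 'B', 'C']
print('is a DAG (no cycles):', nx.is_directed_acyclic_graph(G))   # True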
3.2 Bayesian Networks

It turns out that much of the work for causal graphical models was done
in the field of probabilistic graphical models. Probabilistic graphical
models are statistical models while causal graphical models are causal
models. Bayesian networks are the main probabilistic graphical model
that causal graphical models (causal Bayesian networks) inherit most of
their properties from.
Imagine that we only cared about modeling association, without any
causal modeling. We would want to model the data distribution 𝑃(𝑥 1 , 𝑥 2 , . . . , 𝑥 𝑛 ).
In general, we can use the chain rule of probability to factorize any distri-
bution:
𝑃(𝑥1, 𝑥2, . . . , 𝑥𝑛) = 𝑃(𝑥1) ∏𝑖 𝑃(𝑥𝑖 | 𝑥𝑖−1, . . . , 𝑥1)   (3.1)

However, if we were to model these factors with tables, it would take an exponential number of parameters. To see this, take each 𝑥𝑖 to be binary and consider how we would model the factor 𝑃(𝑥𝑛 | 𝑥𝑛−1, . . . , 𝑥1). Since 𝑥𝑛 is binary, we only need to model 𝑃(𝑋𝑛 = 1 | 𝑥𝑛−1, . . . , 𝑥1) because 𝑃(𝑋𝑛 = 0 | 𝑥𝑛−1, . . . , 𝑥1) is simply 1 − 𝑃(𝑋𝑛 = 1 | 𝑥𝑛−1, . . . , 𝑥1). Well, we would need 2^(𝑛−1) parameters to model this. As a specific example, let 𝑛 = 4. As we can see in Table 3.1, this would require 2^(4−1) = 8 parameters: 𝛼1, . . . , 𝛼8. This brute-force parametrization quickly becomes intractable as 𝑛 increases.

Table 3.1: Table required to model the single factor 𝑃(𝑥𝑛 | 𝑥𝑛−1, . . . , 𝑥1) where 𝑛 = 4 and the variables are binary. The number of parameters necessary is exponential in 𝑛.

𝑥1  𝑥2  𝑥3  𝑃(𝑥4 | 𝑥3, 𝑥2, 𝑥1)
0   0   0   𝛼1
0   0   1   𝛼2
0   1   0   𝛼3
0   1   1   𝛼4
1   0   0   𝛼5
1   0   1   𝛼6
1   1   0   𝛼7
1   1   1   𝛼8

An intuitive way to more efficiently model many variables together in a joint distribution is to only model local dependencies. For example, rather than modeling the 𝑋4 factor as 𝑃(𝑥4 | 𝑥3, 𝑥2, 𝑥1), we could model it as 𝑃(𝑥4 | 𝑥3) if we have reason to believe that 𝑋4 only locally depends on 𝑋3. In fact, in the corresponding graph in Figure 3.6, the only node
that feeds into 𝑋4 is 𝑋3. This is meant to signify that 𝑋4 only locally depends on 𝑋3. Whenever we use a graph 𝐺 in relation to a probability distribution 𝑃, there will always be a one-to-one mapping between the nodes in 𝐺 and the random variables in 𝑃, so when we talk about nodes being independent, we mean the corresponding random variables are independent.

Figure 3.6: Four node DAG where 𝑋4 locally depends on only 𝑋3.

Given a probability distribution and a corresponding directed acyclic graph (DAG), we can formalize the specification of independencies with the local Markov assumption:
Assumption 3.1 (Local Markov Assumption) Given its parents in the
DAG, a node 𝑋 is independent of all its non-descendants.

This assumption (along with specific DAGs) gives us a lot. We will


demonstrate this in the next few equations. In our four variable example,
the chain rule of probability tells us that we can factorize any 𝑃 such that

𝑃(𝑥1, 𝑥2, 𝑥3, 𝑥4) = 𝑃(𝑥1) 𝑃(𝑥2 | 𝑥1) 𝑃(𝑥3 | 𝑥2, 𝑥1) 𝑃(𝑥4 | 𝑥3, 𝑥2, 𝑥1).   (3.2)

If 𝑃 is Markov with respect to the graph1 in Figure 3.6, then we can simplify the last factor:

𝑃(𝑥1, 𝑥2, 𝑥3, 𝑥4) = 𝑃(𝑥1) 𝑃(𝑥2 | 𝑥1) 𝑃(𝑥3 | 𝑥2, 𝑥1) 𝑃(𝑥4 | 𝑥3).   (3.3)

1: A probability distribution is said to be (locally) Markov with respect to a DAG if they satisfy the local Markov assumption.

If we further remove edges, removing 𝑋1 → 𝑋2 and 𝑋2 → 𝑋3 as in Figure 3.7, we can further simplify the factorization of 𝑃:

𝑃(𝑥1, 𝑥2, 𝑥3, 𝑥4) = 𝑃(𝑥1) 𝑃(𝑥2) 𝑃(𝑥3 | 𝑥1) 𝑃(𝑥4 | 𝑥3).   (3.4)

Figure 3.7: Four node DAG with even more independencies.

With the understanding that we have hopefully built up from a few examples,2 we will now state one of the main consequences of the local Markov assumption:

2: Active reading exercise: ensure that you know how we get from Equation 3.2 to Equation 3.3 and to Equation 3.4 using the local Markov assumption.

Definition 3.1 (Bayesian Network Factorization) Given a probability distribution 𝑃 and a DAG 𝐺, 𝑃 factorizes according to 𝐺 if

𝑃(𝑥1, . . . , 𝑥𝑛) = ∏𝑖 𝑃(𝑥𝑖 | pa𝑖)

Hopefully you see the resemblance between the move from Equation 3.2
to Equation 3.3 or the move to Equation 3.4 and the generalization of this
that is presented in Definition 3.1.
The Bayesian network factorization is also known as the chain rule for
Bayesian networks or Markov compatibility. For example, if 𝑃 factorizes
according to 𝐺 , then 𝑃 and 𝐺 are Markov compatible.
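To make the factorization concrete, here is a minimal numerical sketch (not from the original text) for the four-node DAG in Figure 3.7, where the joint over binary variables is assembled from the factors 𝑃(𝑥1), 𝑃(𝑥2), 𝑃(𝑥3 | 𝑥1), and 𝑃(𝑥4 | 𝑥3); the particular probability values are made up purely for illustration.

# Hedged sketch: building the joint distribution for Figure 3.7 from its
# Bayesian network factors. All numbers are made up; the point is that
# 1 + 1 + 2 + 2 = 6 local parameters specify the joint, versus the
# 2^4 - 1 = 15 needed for a brute-force table.
import itertools

p_x1 = {0: 0.7, 1: 0.3}                      # P(x1)
p_x2 = {0: 0.6, 1: 0.4}                      # P(x2)
p_x3_given_x1 = {0: {0: 0.9, 1: 0.1},        # P(x3 | x1)
                 1: {0: 0.2, 1: 0.8}}
p_x4_given_x3 = {0: {0: 0.5, 1: 0.5},        # P(x4 | x3)
                 1: {0: 0.1, 1: 0.9}}

joint = {}
for x1, x2, x3, x4 in itertools.product([0, 1], repeat=4):
    joint[(x1, x2, x3, x4)] = (p_x1[x1] * p_x2[x2]
                               * p_x3_given_x1[x1][x3]
                               * p_x4_given_x3[x3][x4])

print('joint sums to 1:', abs(sum(joint.values()) - 1.0) < 1e-9)  # True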
We have given the intuition of how the local Markov assumption implies
the Bayesian network factorization, and it turns out that the two are
actually equivalent. In other words, we could have started with the
Bayesian network factorization as the main assumption (and labeled it as
an assumption) and shown that it implies the local Markov assumption.
See Koller and Friedman [13, Chapter 3] for these proofs and more information on this topic.
[13]: Koller and Friedman (2009), Probabilistic Graphical Models: Principles and Techniques
As important as the local Markov assumption is, it only gives us infor-
mation about the independencies in 𝑃 that a DAG implies. It does not
even tell us that if 𝑋 and 𝑌 are adjacent in the DAG, then 𝑋 and 𝑌 are
dependent. And this additional information is very commonly assumed
in causal DAGs. To get this guaranteed dependence between adjacent
nodes, we will generally assume a slightly stronger assumption than the
local Markov assumption: minimality.

Assumption 3.2 (Minimality Assumption)
1. Given its parents in the DAG, a node 𝑋 is independent of all its non-descendants (Assumption 3.1).
2. Adjacent nodes in the DAG are dependent.3

3: This is often equivalently stated in the following way: if we were to remove any edges from the DAG, 𝑃 would not be Markov with respect to the graph with the removed edges [see, e.g., 14, Section 6.5.3].
[14]: Peters et al. (2017), Elements of Causal Inference: Foundations and Learning Algorithms

To see why this assumption is named “minimality,” consider what we know when we know that 𝑃 is Markov with respect to a DAG 𝐺. We know that 𝑃 satisfies a set of independencies that are specific to the structure of 𝐺. If 𝑃 and 𝐺 also satisfy minimality, then this set of independencies is minimal in the sense that 𝑃 does not satisfy any additional independencies.
This is equivalent to saying that adjacent nodes are dependent.
For example, if the DAG were simply two connected nodes 𝑋 and 𝑌 as in Figure 3.8, the local Markov assumption would tell us that we can factorize 𝑃(𝑥, 𝑦) as 𝑃(𝑥)𝑃(𝑦|𝑥), but it would also allow us to factorize 𝑃(𝑥, 𝑦) as 𝑃(𝑥)𝑃(𝑦), meaning it allows distributions where 𝑋 and 𝑌 are independent. In contrast, the minimality assumption does not allow this additional independence. Minimality would tell us to factorize 𝑃(𝑥, 𝑦) as 𝑃(𝑥)𝑃(𝑦|𝑥), and it would tell us that no additional independencies (such as 𝑋 ⊥⊥ 𝑌) exist in distributions 𝑃 that are minimal with respect to Figure 3.8.

Figure 3.8: Two connected nodes
Because removing edges in a Bayesian network is equivalent to adding independencies,4 the minimality assumption is equivalent to saying that we can’t remove any more edges from the graph. In a sense, every edge is “active.” More concretely, consider that 𝑃 and 𝐺 are Markov compatible and that 𝐺′ is what we get when we remove some edge from 𝐺. If 𝑃 is also Markov with respect to 𝐺′, then 𝑃 is not minimal with respect to 𝐺.
4: Active reading exercise: why is removing edges in a Bayesian network equivalent to adding independencies?
Armed with the minimality assumption and what it implies about how
distributions factorize when they are Markov with respect to some DAG
(Definition 3.1), we are now ready to discuss the flow of association in
DAGs. However, because everything in this section is purely statistical,
we are not ready to discuss the flow of causation in DAGs. To do that, we
must make causal assumptions. Pedagogically, this will also allow us to
use intuitive causal language when we explain the flow of association.

3.3 Causal Graphs

The previous section was all about statistical models and modeling
association. In this section, we will augment these models with causal
assumptions, turning them into causal models and allowing us to study
causation. In order to introduce causal assumptions, we must first have
an understanding of what it means for 𝑋 to be a cause of 𝑌 .

Definition 3.2 (What is a cause?) A variable 𝑋 is said to be a cause of a variable 𝑌 if 𝑌 can change in response to changes in 𝑋.5
5: See Section 4.5.1 for a definition using mathematical notation.

Another phrase commonly used to describe this primitive is that 𝑌


“listens” to 𝑋 . With this, we can now specify the main causal assumption
that we will use throughout this book.

Assumption 3.3 ((Strict) Causal Edges Assumption) In a directed graph,


every parent is a direct cause of all its children.

Here, the set of direct causes of 𝑌 is everything that 𝑌 directly responds


to; if we fix all of the direct causes of 𝑌 , then changing any other cause of
𝑌 won’t induce any changes in 𝑌 . This assumption is “strict” in the sense

that every edge is “active,” just like in DAGs that satisfy minimality. In
other words, because the definition of a cause (Definition 3.2) implies
that a cause and its effect are dependent and because we are assuming
all parents are causes of their children, we are assuming that parents
and their children are dependent. So the second part of minimality
(Assumption 3.2) is baked into the strict causal edges assumption.
In contrast, the non-strict causal edges assumption would allow for
some parents to not be causes of their children. It would just assume
that children are not causes of their parents. This allows us to draw
graphs with extra edges to make fewer assumptions, just like we would
in Bayesian networks, where more edges means fewer independence
assumptions. Causal graphs are sometimes drawn with this kind of
non-minimal meaning, but the vast majority of the time, when someone
draws a causal graph, they mean that parents are causes of their children.
Therefore, unless we specify otherwise, throughout this book, we will
use “causal graph” to refer to a DAG that satisfies the strict causal edges
assumption. And we will often omit the word “strict” when we refer to
this assumption.
When we add the causal edges assumption, directed paths in the DAG
take on a very special meaning; they correspond to causation. This is in
contrast to other paths in the graph, which association may flow along,
but causation certainly may not. This will become more clear when we
go into detail on these other kinds of paths in Sections 3.5 and 3.6.
Moving forward, we will now think of the edges of graphs as causal, in
order to describe concepts intuitively with causal language. However,
all of the associational claims about statistical independence will still
hold, even when the edges do not have causal meaning like in the vanilla
Bayesian networks of Section 3.2.
As we will see in the next few sections, the main assumptions that we
need for our causal graphical models to tell us how association and
causation flow between variables are the following two:
1. Local Markov Assumption (Assumption 3.1)
2. Causal Edges Assumption (Assumption 3.3)
We will discuss these assumptions throughout the next few sections and
come back to discuss them more fully again in Section 3.8 after we’ve
established the necessary preliminaries.

3.4 Two-Node Graphs and Graphical Building Blocks

Now that we’ve gotten the basic assumptions and definitions out of the
way, we can get to the core of this chapter: the flow of association and
causation in DAGs. We can understand this flow in general DAGs by
understanding the flow in the minimal building blocks of graphs. These
minimal building blocks consist of chains (Figure 3.9a), forks (Figure 3.9b),
immoralities (Figure 3.9c), two unconnected nodes (Figure 3.10), and two
connected nodes (Figure 3.11).

Figure 3.9: Basic graph building blocks. (a) Chain. (b) Fork. (c) Immorality.

By “flow of association,” we mean whether any two nodes in a graph are


associated or not associated. Another way of saying this is whether two
nodes are (statistically) dependent or (statistically) independent. Addi-
tionally, we will study whether two nodes are conditionally independent
or not.
For each building block, we will give the intuition for why two nodes
are (conditionally) independent or not, and we will give a proof as well.
We can prove that two nodes 𝐴 and 𝐵 are conditionally independent
given some set of nodes 𝐶 by simply showing that 𝑃(𝑎, 𝑏|𝑐) factorizes
as 𝑃(𝑎|𝑐) 𝑃(𝑏|𝑐). We will now do this in the case of the simplest basic
building block: two unconnected nodes.
Given a graph that is just two unconnected nodes, as depicted in Figure 3.10, these nodes are not associated simply because there is no edge between them. To show this, consider the factorization of 𝑃(𝑥1, 𝑥2) that the Bayesian network factorization (Definition 3.1) gives us:

𝑃(𝑥1, 𝑥2) = 𝑃(𝑥1) 𝑃(𝑥2)   (3.5)

Figure 3.10: Two unconnected nodes

That’s it; applying the Bayesian network factorization immediately gives


us a proof that the two nodes 𝑋1 and 𝑋2 are unassociated (independent)
in this building block. And what is the assumption that allows us to
prove this? That 𝑃 is Markov with respect to the graph in Figure 3.10.
In contrast, if there is an edge between the two nodes (as in Figure 3.11), then the two nodes are associated. The assumption we leverage here is the causal edges assumption (Assumption 3.3), which means that 𝑋1 is a cause of 𝑋2. Since 𝑋1 is a cause of 𝑋2, 𝑋2 must be able to change in response to changes in 𝑋1, so 𝑋2 and 𝑋1 are associated. In general, any time two nodes are adjacent in a causal graph, they are associated.6 We will see this same concept several more times in Section 3.5 and Section 3.6.

Figure 3.11: Two connected nodes
6: Two adjacent nodes in a non-strict causal graph can be unassociated.
Now that we’ve covered the relevant two-node graphs, we’ll cover the
flow of association in the remaining graphical building blocks (three-node
graphs in Figure 3.9), starting with chain graphs.

3.5 Chains and Forks

Chains (Figure 3.12) and forks (Figure 3.13) share the same set of dependencies. In both structures, 𝑋1 and 𝑋2 are dependent, and 𝑋2 and 𝑋3 are dependent for the same reason that we discussed toward the end of Section 3.4. Adjacent nodes are always dependent when we make the causal edges assumption (Assumption 3.3).

Figure 3.12: Chain with flow of association drawn as a dashed red arc.

What about 𝑋1 and 𝑋3,

though? Does association flow from 𝑋1 to 𝑋3 through 𝑋2 in chains and


forks?
Usually, yes, 𝑋1 and 𝑋3 are associated in both chains and forks. In chain graphs, 𝑋1 and 𝑋3 are usually dependent simply because 𝑋1 causes changes in 𝑋2 which then causes changes in 𝑋3. In a fork graph, 𝑋1 and 𝑋3 are also usually dependent. This is because the same value that 𝑋2 takes on is used to determine both the value that 𝑋1 takes on and the value that 𝑋3 takes on. In other words, 𝑋1 and 𝑋3 are associated through their (shared) common cause. We use the word “usually” throughout this paragraph because there exist pathological cases where the conditional distributions 𝑃(𝑥2 | 𝑥1) and 𝑃(𝑥3 | 𝑥2) are misaligned in such a specific way that makes 𝑋1 and 𝑋3 not actually associated [see, e.g., 15, Section 2.2].

Figure 3.13: Fork with flow of association drawn as a dashed red arc.
[15]: Pearl et al. (2016), Causal inference in statistics: A primer
An intuitive graphical way of thinking about 𝑋1 and 𝑋3 being associated
in chains and forks is to visualize the flow of association. We visualize
this with a dashed red line in Figure 3.12 and Figure 3.13. In the chain
graph (Figure 3.12), association flows from 𝑋1 to 𝑋3 along the path
𝑋1 → 𝑋2 → 𝑋3 . Symmetrically, association flows from 𝑋3 to 𝑋1 along
that same path, just running opposite the arrows. In the fork graph
(Figure 3.13), association flows from 𝑋1 to 𝑋3 along the path 𝑋1 ← 𝑋2 →
𝑋3 . And similarly, we can think of association flowing from 𝑋3 to 𝑋1
along that same path, just as was the case with chains. In general, the
flow of association is symmetric.
Chains and forks also share the same set of independencies. When we
condition on 𝑋2 in both graphs, it blocks the flow of association from
𝑋1 to 𝑋3 . This is because of the local Markov assumption; each variable
can locally depend on only its parents. So when we condition on 𝑋2
(𝑋3 ’s parent in both graphs), 𝑋3 becomes independent of 𝑋1 (and vice
versa).
We will refer to this independence as an instance of a blocked path. We illustrate these blocked paths in Figure 3.14 and Figure 3.15. Conditioning blocks the flow of association in chains and forks. Without conditioning, association is free to flow in chains and forks; we will refer to this as an unblocked path. However, the situation is completely different with immoralities, as we will see in the next section.

Figure 3.14: Chain with association blocked by conditioning on 𝑋2.
That’s all nice intuition, but what about the proof? We can prove that 𝑋1 ⊥⊥ 𝑋3 | 𝑋2 using just the local Markov assumption. We will do this by showing that 𝑃(𝑥1, 𝑥3 | 𝑥2) = 𝑃(𝑥1 | 𝑥2) 𝑃(𝑥3 | 𝑥2). We’ll show the proof for chain graphs. It is usually useful to start with the Bayesian network factorization. For chains, we can factorize 𝑃(𝑥1, 𝑥2, 𝑥3) as follows:

𝑃(𝑥1, 𝑥2, 𝑥3) = 𝑃(𝑥1) 𝑃(𝑥2 | 𝑥1) 𝑃(𝑥3 | 𝑥2)   (3.6)

Figure 3.15: Fork with association blocked by conditioning on 𝑋2.

Bayes’ rule tells us that 𝑃(𝑥1, 𝑥3 | 𝑥2) = 𝑃(𝑥1, 𝑥2, 𝑥3) / 𝑃(𝑥2), so we have

𝑃(𝑥1, 𝑥3 | 𝑥2) = 𝑃(𝑥1) 𝑃(𝑥2 | 𝑥1) 𝑃(𝑥3 | 𝑥2) / 𝑃(𝑥2)   (3.7)

Since we’re looking to end up with 𝑃(𝑥1 | 𝑥2) 𝑃(𝑥3 | 𝑥2) and we already have 𝑃(𝑥3 | 𝑥2), we must turn the rest into 𝑃(𝑥1 | 𝑥2). We can do this by

another application of Bayes’ rule:

𝑃(𝑥1, 𝑥3 | 𝑥2) = [𝑃(𝑥1, 𝑥2) / 𝑃(𝑥2)] 𝑃(𝑥3 | 𝑥2)   (3.8)
= 𝑃(𝑥1 | 𝑥2) 𝑃(𝑥3 | 𝑥2)   (3.9)

With that, we’ve shown that 𝑋1 ⊥⊥ 𝑋3 | 𝑋2. Try it yourself; prove the analog in forks.7
7: Active reading exercise: prove that 𝑋1 ⊥⊥ 𝑋3 | 𝑋2 for forks (Figure 3.15).
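As an empirical sanity check on this result (a sketch, not from the original text), we can simulate a chain 𝑋1 → 𝑋2 → 𝑋3 with a made-up data generating process and observe that 𝑋1 and 𝑋3 are clearly associated marginally but approximately independent within each level of 𝑋2:

# Hedged sketch: simulating a chain X1 -> X2 -> X3 (binary variables,
# made-up DGP) and checking marginal vs. conditional association.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.binomial(1, 0.5, n)
x2 = rng.binomial(1, np.where(x1 == 1, 0.9, 0.1))   # X2 listens to X1
x3 = rng.binomial(1, np.where(x2 == 1, 0.8, 0.2))   # X3 listens to X2 only

print('corr(X1, X3):', np.corrcoef(x1, x3)[0, 1])   # clearly nonzero
for v in (0, 1):
    mask = x2 == v
    print(f'corr(X1, X3 | X2={v}):',
          np.corrcoef(x1[mask], x3[mask])[0, 1])    # approximately 0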
Flow of Causation The flow of association is symmetric, whereas the
flow of causation is not. Under the causal edges assumption (Assump-
tion 3.3), causation only flows in a single direction. Causation only flows
along directed paths. Association flows along any path that does not
contain an immorality.

3.6 Colliders and their Descendants

Recall from Section 3.1 that we have an immorality when we have a child
whose two parents do not have an edge connecting them (Figure 3.16).
And in this graph structure, the child is known as a bastard. No, just
kidding; it’s called a collider.
In contrast to chains and forks, in an immorality, 𝑋1 ⊥⊥ 𝑋3. Look at the graph structure and think about it a bit. Why would 𝑋1 and 𝑋3 be associated? One isn’t the descendant of the other like in chains, and they don’t share a common cause like in forks. Rather, we can think of 𝑋1 and 𝑋3 simply as unrelated events that happen, which happen to both contribute to some common effect (𝑋2). To show this, we apply the Bayesian network factorization and marginalize out 𝑥2:

𝑃(𝑥1, 𝑥3) = ∑𝑥2 𝑃(𝑥1, 𝑥2, 𝑥3)   (3.10)
= ∑𝑥2 𝑃(𝑥1) 𝑃(𝑥3) 𝑃(𝑥2 | 𝑥1, 𝑥3)   (3.11)
= 𝑃(𝑥1) 𝑃(𝑥3) ∑𝑥2 𝑃(𝑥2 | 𝑥1, 𝑥3)   (3.12)
= 𝑃(𝑥1) 𝑃(𝑥3)   (3.13)

Figure 3.16: Immorality with association blocked by a collider.

We illustrate the independence of 𝑋1 and 𝑋3 in Figure 3.16 by showing


that the association that we could have imagined as flowing along the
path 𝑋1 → 𝑋2 ← 𝑋3 is actually blocked at 𝑋2 . Because we have a collider
on the path connecting 𝑋1 and 𝑋3 , association does not flow through
that path. This is another example of a blocked path, but this time the path
is not blocked by conditioning; the path is blocked by a collider.
Good-Looking Men are Jerks Oddly enough, when we condition on the collider 𝑋2, its parents 𝑋1 and 𝑋3 become dependent (depicted in Figure 3.17). An example is the easiest way to see why this is the case.

Figure 3.17: Immorality with association unblocked by conditioning on the collider.

Imagine that you’re out dating men, and you notice that most of the nice men you meet are not very good-looking, and most of the good-looking men you meet are jerks. It seems that you have to choose between looks and kindness. In other words, it seems like kindness and looks are negatively associated. However, what if I also told you that
there is an important third variable here: availability (whether men are

already in a relationship or not)? And what if I told you that a man’s


availability is largely determined by their looks and kindness; if they are
both good-looking and kind, then they are in a relationship. The available
men are the remaining ones, the ones who are either not good-looking
or not kind. You see an association between looks and kindness because
you’ve conditioned on a collider (availability). You’re only looking at
men who are not in a relationship. You can see the causal structure of
this example by taking Figure 3.17 and replacing 𝑋1 with “looks,” 𝑋3
with “kindness,” and 𝑋2 with “availability.”
The above example naturally suggests that, when dating men, maybe you should consider not conditioning on 𝑋2 = “not in a relationship” and, instead, condition on 𝑋2 = “in a relationship.” However, you could run into other variables 𝑋4 that introduce new immoralities there. Such moral questions are outside the scope of this book.
Active reading exercise: Come up with your own example of an immorality and how conditioning on the collider induces association between its parents. Hint: think of rare events for 𝑋1 and 𝑋3 where, if either of them happens, some outcome 𝑋2 will happen.

Returning to inside the scope of this book, we have that conditioning on a collider can turn a blocked path into an unblocked path. The parents 𝑋1 and 𝑋3 are not associated in the general population, but when we condition on their shared child 𝑋2 taking on a specific value, they become associated. Conditioning on the collider 𝑋2 allows association to flow along the path 𝑋1 → 𝑋2 ← 𝑋3, despite the fact that it does not when we don’t condition on 𝑋2. We illustrate this in the move from Figure 3.16 to Figure 3.17.
We also illustrate this with a scatter plot in Figure 3.18. In Figure 3.18a,
we plot the whole population, with kindness on the x-axis and looks
on the y-axis. As you can see, the variables are not associated in the
general population. However, if we remove the ones who are already in
a relationship (the orange ones in Figure 3.18b), we are left with the clear
negative association that we see in Figure 3.18c. This phenomenon is
known as Berkson’s paradox. The fact that we see this negative association
simply because we are selecting a biased subset of the general population
to look at is why this is sometimes referred to as selection bias [see, e.g., 7, Chapter 8].
[7]: Hernán and Robins (2020), Causal Inference: What If

Figure 3.18: Example data for the “good-looking men are jerks” example. Both looks and kindness are continuous values on a scale from 0 to 10. (a) Looks and kindness data for the whole population. Looks and kindness are independent. (b) Looks and kindness data grouped by whether the person is available or not. Within each group, there is a negative correlation. (c) Looks and kindness data for only the available people. Now, there is a negative correlation.

Numerical Example All of the above has been to give you intuition
about why conditioning on a collider induces association between its
parents, but we have yet to give a concrete numerical example of this.
We will give a simple one here. Consider the following data generating

process (DGP), where 𝑋1 and 𝑋3 are drawn independently from standard


normal distributions and then used to compute 𝑋2 :

𝑋1 ∼ 𝑁(0 , 1) , 𝑋3 ∼ 𝑁(0, 1) (3.14)


𝑋2 = 𝑋1 + 𝑋3 (3.15)

We’ve already stated that 𝑋1 and 𝑋3 are independent, but to juxtapose


the two calculations, let’s compute their covariance:

Cov(𝑋1 , 𝑋3 ) = 𝔼[(𝑋1 − 𝔼[𝑋1 ])(𝑋3 − 𝔼[𝑋3 ])]


= 𝔼[𝑋1 𝑋3 ] (zero mean)
= 𝔼[𝑋1 ]𝔼[𝑋3 ] (independent)
=0

Now, let’s compute their covariance, conditional on 𝑋2 :

Cov(𝑋1 , 𝑋3 | 𝑋2 = 𝑥) = 𝔼[𝑋1 𝑋3 | 𝑋2 = 𝑥] (3.16)


= 𝔼[𝑋1 (𝑥 − 𝑋1 )] (3.17)
= 𝑥 𝔼[𝑋1] − 𝔼[𝑋1²]   (3.18)
= −1 (3.19)

Crucially, in Equation 3.17, we used Equation 3.15 to plug in for 𝑋3 in


terms of 𝑋1 and 𝑋2 (conditioned to 𝑥 ). This led to a second-order term,
which led to the calculation giving a nonzero number, which means 𝑋1
and 𝑋3 are associated, conditional on 𝑋2 .
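We can also verify this with a quick simulation (a sketch, not from the original text) of the data generating process in Equations 3.14 and 3.15: the marginal covariance of 𝑋1 and 𝑋3 is near zero, while within a thin slice of 𝑋2 the two become negatively associated.

# Hedged sketch: simulating the collider DGP X1, X3 ~ N(0, 1), X2 = X1 + X3
# and comparing the marginal covariance of X1 and X3 with their covariance
# inside a narrow slice of X2 (an approximation to conditioning on X2).
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x1 = rng.normal(0.0, 1.0, n)
x3 = rng.normal(0.0, 1.0, n)
x2 = x1 + x3

print('Cov(X1, X3):', np.cov(x1, x3)[0, 1])                          # ~ 0

mask = np.abs(x2) < 0.05                                              # X2 close to 0
print('Cov(X1, X3 | X2 near 0):', np.cov(x1[mask], x3[mask])[0, 1])  # negative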
Descendants of Colliders Conditioning on descendants of a collider also induces association between the parents of the collider. The intuition is that if we learn something about a collider’s descendant, we usually also learn something about the collider itself because there is a direct causal path from the collider to its descendants, and we know that nodes in a chain are usually associated (see Section 3.5), assuming minimality (Assumption 3.2). In other words, a descendant of a collider can be thought of as a proxy for that collider, so conditioning on the descendant is similar to conditioning on the collider itself.
Active reading exercise: We have provided several techniques for how to think about colliders: high-level examples, numerical examples, and abstract reasoning. Use at least one of them to convince yourself that conditioning on a descendant of a collider can induce association between the collider’s parents.

3.7 d-separation

Before we define d-separation, we’ll codify what we mean by the con-


cept of a “blocked path,” which we’ve been discussing in the previous
sections:

Definition 3.3 (blocked path) A path between nodes 𝑋 and 𝑌 is blocked


by a (potentially empty) conditioning set 𝑍 if either of the following is true:
1. Along the path, there is a chain · · · → 𝑊 → · · · or a fork
· · · ← 𝑊 → · · ·, where 𝑊 is conditioned on (𝑊 ∈ 𝑍 ).
2. There is a collider 𝑊 on the path that is not conditioned on (𝑊 ∉ 𝑍 )
and none of its descendants are conditioned on (de(𝑊) ⊈ 𝑍).

Then, an unblocked path is simply the complement; an unblocked path is a



path that is not blocked. The graphical intuition to have in mind is that
association flows along unblocked paths, and association does not flow
along blocked paths. If you don’t have this intuition in mind, then it is
probably worth it to reread the previous two sections, with the goal of
gaining this intuition. Now, we are ready to introduce a very important
concept: d-separation.

Definition 3.4 (d-separation) Two (sets of) nodes 𝑋 and 𝑌 are d-separated by a set of nodes 𝑍 if all of the paths between (any node in) 𝑋 and (any node in) 𝑌 are blocked by 𝑍 [16].
[16]: Pearl (1988), Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
If all the paths between two nodes 𝑋 and 𝑌 are blocked, then we say that
𝑋 and 𝑌 are d-separated. Similarly, if there exists at least one path between
𝑋 and 𝑌 that is unblocked, then we say that 𝑋 and 𝑌 are d-connected.
As we will see in Theorem 3.1, d-separation is such an important concept
because it implies conditional independence. We will use the notation
𝑋 ⊥⊥𝐺 𝑌 | 𝑍 to denote that 𝑋 and 𝑌 are d-separated in the graph 𝐺
when conditioning on 𝑍. Similarly, we will use the notation 𝑋 ⊥⊥𝑃 𝑌 | 𝑍 to denote that 𝑋 and 𝑌 are independent in the distribution 𝑃 when
conditioning on 𝑍 .

Theorem 3.1 Given that 𝑃 is Markov with respect to 𝐺 (satisfies the local
Markov assumption, Assumption 3.1), if 𝑋 and 𝑌 are d-separated in 𝐺
conditioned on 𝑍 , then 𝑋 and 𝑌 are independent in 𝑃 conditioned on 𝑍 . We
can write this succinctly as follows:

𝑋 ⊥⊥𝐺 𝑌 | 𝑍 =⇒ 𝑋 ⊥⊥𝑃 𝑌 | 𝑍   (3.20)

Because this is so important, we will give Equation 3.20 a name: the global
Markov assumption. Theorem 3.1 tells us that the local Markov assumption
implies the global Markov assumption.
Just as we built up the intuition that suggested that the local Markov
assumption (Assumption 3.1) implies the Bayesian network factorization
(Definition 3.1) and alerted you to the fact that the Bayesian network
factorization also implies the local Markov assumption (the two are equiv-
alent), it turns out that the global Markov assumption also implies the
local Markov assumption. In other words, the local Markov assumption,
global Markov assumption, and the Bayesian network factorization are
all equivalent [see, e.g., 13, Chapter 3]. Therefore, we will use the slightly shortened phrase Markov assumption to refer to these concepts as a group, or we will simply write “𝑃 is Markov with respect to 𝐺” to convey the same meaning.
[13]: Koller and Friedman (2009), Probabilistic Graphical Models: Principles and Techniques
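As a practical aside (a sketch, not from the original text), d-separation queries can also be checked programmatically. For example, the networkx library ships a d-separation routine (called d_separated in some releases and is_d_separator in newer ones), which we can run on the chain and immorality building blocks from earlier in this chapter:

# Hedged sketch: checking d-separation with networkx. The function is
# named `d_separated` in some networkx releases and `is_d_separator` in
# newer ones; adjust the call to match your installed version.
import networkx as nx

chain = nx.DiGraph([('X1', 'X2'), ('X2', 'X3')])      # X1 -> X2 -> X3
collider = nx.DiGraph([('X1', 'X2'), ('X3', 'X2')])   # X1 -> X2 <- X3

print(nx.d_separated(chain, {'X1'}, {'X3'}, set()))      # False: unblocked path
print(nx.d_separated(chain, {'X1'}, {'X3'}, {'X2'}))     # True: conditioning blocks the chain
print(nx.d_separated(collider, {'X1'}, {'X3'}, set()))   # True: the collider blocks the path
print(nx.d_separated(collider, {'X1'}, {'X3'}, {'X2'}))  # False: conditioning on the collider unblocks it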
Active reading exercise: To get some practice with d-separation, here are
some questions about d-separation in Figure 3.19.
Questions about Figure 3.19a:
1. Are 𝑇 and 𝑌 d-separated by the empty set?
2. Are 𝑇 and 𝑌 d-separated by 𝑊2 ?
3. Are 𝑇 and 𝑌 d-separated by {𝑊2 , 𝑀1 }?
4. Are 𝑇 and 𝑌 d-separated by {𝑊1 , 𝑀2 }?
5. Are 𝑇 and 𝑌 d-separated by {𝑊1 , 𝑀2 , 𝑋2 }?
3 The Flow of Association and Causation in Graphs 30

6. Are 𝑇 and 𝑌 d-separated by {𝑊1 , 𝑀2 , 𝑋2 , 𝑋3 }?


Questions about Figure 3.19b:
1. Are 𝑇 and 𝑌 d-separated by the empty set?
2. Are 𝑇 and 𝑌 d-separated by 𝑊 ?
3. Are 𝑇 and 𝑌 d-separated by {𝑊 , 𝑋2 }?

Figure 3.19: Graphs for d-separation exercise. (a) A graph on the nodes 𝑇, 𝑀1, 𝑀2, 𝑌, 𝑊1, 𝑊2, 𝑊3, 𝑋1, 𝑋2, 𝑋3. (b) A graph on the nodes 𝑇, 𝑌, 𝑊, 𝑋1, 𝑋2.
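If you want to check answers to questions like these programmatically, graph libraries can test d-separation directly. Below is a minimal sketch using NetworkX (assuming a version that provides d_separated; newer releases rename it is_d_separator). The graph here is a small hypothetical example rather than Figure 3.19, whose edges are not reproduced above, but the same pattern applies once you transcribe the figure's edges.

import networkx as nx

# A small hypothetical causal graph:
#   W -> T and W -> Y   (W is a fork / confounder)
#   T -> M -> Y         (M is on a chain / a mediator)
#   T -> Z <- Y         (Z is a collider)
G = nx.DiGraph([("W", "T"), ("W", "Y"), ("T", "M"), ("M", "Y"), ("T", "Z"), ("Y", "Z")])

print(nx.d_separated(G, {"T"}, {"Y"}, set()))            # False: the chain and the fork are unblocked
print(nx.d_separated(G, {"T"}, {"Y"}, {"W", "M"}))       # True: conditioning blocks the fork and the chain
print(nx.d_separated(G, {"T"}, {"Y"}, {"W", "M", "Z"}))  # False: conditioning on the collider Z unblocks a path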

3.8 Flow of Association and Causation

Now that we have covered the necessary preliminaries (chains, forks,


colliders, and d-separation), it is worth emphasizing how association and
causation flow in directed graphs. Association flows along all unblocked
paths. In causal graphs, causation flows along directed paths. Recall from
Section 1.3.2 that not only is association not causation, but causation is a
sub-category of association. That’s why association and causation both
flow along directed paths.

We refer to the flow of association along directed paths as causal association.
A common type of non-causal association that makes total association
not causation is confounding association. In the graph in Figure 3.20, we
depict the confounding association in red and the causal association in
blue.

Figure 3.20: Causal graph depicting an example of how confounding association and causal association flow.

Regular Bayesian networks are purely statistical models, so we can only
talk about the flow of association in Bayesian networks. Association still
flows in exactly the same way in Bayesian networks as it does in causal
graphs, though. In both, association flows along chains and forks, unless
a node is conditioned on. And in both, a collider blocks the flow of
association, unless it is conditioned on. Combining these building blocks,
we get how association flows in general DAGs. We can tell if two nodes
are not associated (no association flows between them) by whether or
not they are d-separated.

Causal graphs are special in that we additionally assume that the edges
have causal meaning (causal edges assumption, Assumption 3.3). This
assumption is what introduces causality into our models, and it makes
one type of path take on a whole new meaning: directed paths. This
assumption endows directed paths with the unique role of carrying
causation along them. Additionally, this assumption is asymmetric; “𝑋
is a cause of 𝑌 ” is not the same as saying “𝑌 is a cause of 𝑋 .” This means
that there is an important difference between association and causation:
association is symmetric, whereas causation is asymmetric.
d-separation Implies Association is Causation Given that we have
tools to measure association, how can we isolate causation? In other
words, how can we ensure that the association we measure is causation,
say, for measuring the causal effect of 𝑋 on 𝑌 ? Well, we can do that by
ensuring that there is no non-causal association flowing between 𝑋 and
𝑌 . This is true if 𝑋 and 𝑌 are d-separated in the augmented graph where
we remove outgoing edges from 𝑋 . This is because all of 𝑋 ’s causal effect
on 𝑌 would flow through its outgoing edges, so once those are removed,
the only association that remains is purely non-causal association.
In Figure 3.21, we illustrate what each of the important assumptions
gives us in terms of interpreting this flow of association. First, we have
the (local/global) Markov assumption (Assumption 3.1). As we saw
in Section 3.7, this assumption allows us to know which nodes are
unassociated. In other words, the Markov assumption tells along which
paths the association does not flow. When we slightly strengthen the
Markov assumption to the minimality assumption (Assumption 3.2),
we get which paths association does flow along (except in intransitive
edges cases). When we further add in the causal edges assumption
(Assumption 3.3), we get that causation flows along directed paths.
Therefore, the following two8 assumptions are essential for graphical
causal models:

1. Markov Assumption (Assumption 3.1)
2. Causal Edges Assumption (Assumption 3.3)

8 Recall that the first part of the minimality assumption is just the local Markov assumption and that the second part is contained in the causal edges assumption.

Figure 3.21: A flowchart that illustrates what kind of claims we can make about our data as we add each additional important assumption: the Markov assumption gives us statistical independencies, strengthening it to the minimality assumption gives us statistical dependencies, and adding the causal edges assumption gives us causal dependencies.
4 Causal Models

Causal models are essential for identification of causal quantities. When
we presented the Identification-Estimation Flowchart (Figure 2.5) back
in Section 2.4, we described identification as the process of moving
from a causal estimand to a statistical estimand. However, to do that,
we must have a causal model. We depict this fuller version of the
Identification-Estimation Flowchart in Figure 4.1.

Figure 4.1: The Identification-Estimation Flowchart – a flowchart that illustrates the process of moving from a target causal estimand to a corresponding estimate, through identification and estimation. In contrast to Figure 2.5, this version is augmented with a causal model and data: the causal estimand and the causal model yield the statistical estimand (identification), and the statistical estimand and the data yield the estimate (estimation).

The previous chapter gives graphical intuition for causal models, but it
doesn't explain how to identify causal quantities or how to formalize causal
models. We will do that in this chapter.
4.1 The do-operator and Interventional Distributions

The first thing that we will introduce is a mathematical operator for


intervention. In the regular notation for probability, we have conditioning,
but that isn’t the same as intervening. Conditioning on 𝑇 = 𝑡 just means
that we are restricting our focus to the subset of the population to those
who received treatment 𝑡 . In contrast, an intervention would be to take
the whole population and give everyone treatment 𝑡 . We illustrate this in
Figure 4.2. We will denote intervention with the do-operator: do(𝑇 = 𝑡).
This is the notation commonly used in graphical causal models, and it has
equivalents in potential outcomes notation. For example, we can write
the distribution of the potential outcome 𝑌(𝑡) that we saw in Chapter 2
as follows:

𝑃(𝑌(𝑡) = 𝑦) ≜ 𝑃(𝑌 = 𝑦 | do(𝑇 = 𝑡)) ≜ 𝑃(𝑦 | do(𝑡))    (4.1)

Note that we shorten do(𝑇 = 𝑡) to just do(𝑡) in the last option in Equation
4.1. We will use this shorthand throughout the book. We can similarly
write the ATE (average treatment effect) when the treatment is binary as
follows:
𝔼[𝑌 | do(𝑇 = 1)] − 𝔼[𝑌 | do(𝑇 = 0)] (4.2)

Figure 4.2: Illustration of the difference between conditioning and intervening. Conditioning on 𝑇 = 1 restricts attention to the subpopulation that happened to receive treatment, whereas intervening with do(𝑇 = 1) takes the whole population and gives everyone treatment.

We will often work with full distributions like 𝑃(𝑌 | do(𝑡)), rather than
their means, as this is more general; if we characterize 𝑃(𝑌 | do(𝑡)), then
we’ve characterized 𝔼[𝑌 | do(𝑡)]. We will commonly refer to 𝑃(𝑌 | do(𝑇 =
𝑡)) and other expressions with the do-operator in them as interventional
distributions.
Interventional distributions such as 𝑃(𝑌 | do(𝑇 = 𝑡)) are conceptually
quite different from the observational distribution 𝑃(𝑌). Observational
distributions such as 𝑃(𝑌) or 𝑃(𝑌, 𝑇, 𝑋) do not have the do-operator in
them. Because they don’t have the do-operator, we can observe data from
them without needing to carry out any experiment. This is why we call
data from 𝑃(𝑌, 𝑇, 𝑋) observational data. If we can reduce an expression
𝑄 with do in it (an interventional expression) to one without do in it (an
observational expression), then 𝑄 is said to be identifiable. An expression
with a do in it is fundamentally different from an expression without a
do in it, despite the fact that in do-notation, do appears after a regular
conditioning bar. As we discussed in Section 2.4, we will refer to an
estimand as a causal estimand when it contains a do-operator, and we
refer to an estimand as a statistical estimand when it doesn’t contain a
do-operator.
Whenever do(𝑡) appears after the conditioning bar, it means that ev-
erything in that expression is in the post-intervention world where the
intervention do(𝑡) occurs. For example, 𝔼[𝑌 | do(𝑡), 𝑍 = 𝑧] refers to the
expected outcome in the subpopulation where 𝑍 = 𝑧 after the whole
subpopulation has taken treatment 𝑡 . In contrast, 𝔼[𝑌 | 𝑍 = 𝑧] simply
refers to the expected value in the (pre-intervention) population where
individuals take whatever treatment they would normally take (𝑇 ). This
distinction will become important when we get to counterfactuals in
Chapter 14.
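To make the conditioning-versus-intervening distinction concrete, here is a minimal simulation sketch. The generative process (a binary confounder 𝑋 that affects both 𝑇 and 𝑌) is made up purely for illustration: conditioning corresponds to restricting attention to the rows that happened to have 𝑇 = 1, while intervening corresponds to forcing 𝑇 = 1 for the whole population and regenerating the outcome.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hypothetical generative process in which X confounds the effect of T on Y
x = rng.binomial(1, 0.5, n)
t = rng.binomial(1, 0.2 + 0.6 * x)         # treatment depends on X
y = 2 * t + 3 * x + rng.normal(0, 1, n)    # outcome depends on T and X

# Conditioning: look only at the subpopulation that happened to receive T = 1
print(y[t == 1].mean())   # ~ 4.4 here, because X is over-represented among the treated

# Intervening: give the *whole* population T = 1 and regenerate the outcome
y_do_1 = 2 * 1 + 3 * x + rng.normal(0, 1, n)
print(y_do_1.mean())      # ~ 3.5 = 2 + 3 * E[X], the interventional mean E[Y | do(T = 1)]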

4.2 The Main Assumption: Modularity

Before we can describe a very important assumption, we must specify


what a causal mechanism is. There are a few different ways to think about
causal mechanisms. In this section, we will refer to the causal mechanism
that generates 𝑋𝑖 as the conditional distribution of 𝑋𝑖 given all of its
causes: 𝑃(𝑥 𝑖 | pa𝑖 ). As we show graphically in Figure 4.3, the causal
mechanism that generates 𝑋𝑖 is all of 𝑋𝑖 ’s parents and their edges that go
into 𝑋𝑖 . We will give a slightly more specific description of what a causal
mechanism is in Section 4.5.1, but these suffice for now.
In order to get many causal identification results, the main assumption
we will make is that interventions are local. More specifically, we will
assume that intervening on a variable 𝑋𝑖 only changes the causal mechanism
for 𝑋𝑖; it does not change the causal mechanisms that generate any other
variables. In this sense, the causal mechanisms are modular. Other names
that are used for the modularity property are independent mechanisms,
autonomy, and invariance. We will now state this assumption more formally.

Figure 4.3: A causal graph with the causal mechanism that generates 𝑋𝑖 depicted inside an ellipse.
Assumption 4.1 (Modularity / Independent Mechanisms / Invariance)
If we intervene on a set of nodes 𝑆 ⊆ [𝑛],1 setting them to constants, then for
all 𝑖, we have the following:

1. If 𝑖 ∉ 𝑆, then 𝑃(𝑥𝑖 | pa𝑖) remains unchanged.
2. If 𝑖 ∈ 𝑆, then 𝑃(𝑥𝑖 | pa𝑖) = 1 if 𝑥𝑖 is the value that 𝑋𝑖 was set to by the intervention; otherwise, 𝑃(𝑥𝑖 | pa𝑖) = 0.

1 We use [𝑛] to refer to the set {1, 2, . . . , 𝑛}.

In the second part of the above assumption, we could have alternatively said
𝑃(𝑥𝑖 | pa𝑖) = 1 if 𝑥𝑖 is consistent with the intervention2 and 0 otherwise. More
explicitly, we will say (in the future) that if 𝑖 ∈ 𝑆, a value 𝑥𝑖 is consistent with
the intervention if 𝑥𝑖 equals the value that 𝑋𝑖 was set to in the intervention.

2 Yes, the word “consistent” is extremely overloaded.
The modularity assumption is what allows us to encode many differ-
ent interventional distributions all in a single graph. For example, it
could be the case that 𝑃(𝑌), 𝑃(𝑌 | do(𝑇 = 𝑡)), 𝑃(𝑌 | do(𝑇 = 𝑡 0)), and
𝑃(𝑌 | do(𝑇2 = 𝑡2 )) are all completely different distributions that share
almost nothing. If this were the case, then each of these distributions
would need their own graph. However, by assuming modularity, we can
encode them all with the same graph that we use to encode the joint
𝑃(𝑌, 𝑇, 𝑇2 , . . . ), and we can know that all of the factors (except ones that
are intervened on) are shared across these graphs.
The causal graph for interventional distributions is simply the same
graph that was used for the observational joint distribution, but with
all of the edges to the intervened node(s) removed. This is because the
probability for the intervened factor has been set to 1, so we can just
ignore that factor (this is the focus of the next section). Another way to
see that the intervened node has no causal parents is that the intervened
node is set to a constant value, so it no longer depends on any of the
variables it depends on in the observational setting (its parents). The
graph with edges removed is known as the manipulated graph.

For example, consider the causal graph for an observational distribution


in Figure 4.4a. Both 𝑃(𝑌 | do(𝑇 = 𝑡)) and 𝑃(𝑌 | do(𝑇 = 𝑡′)) correspond
to the causal graph in Figure 4.4b, where the incoming edge to 𝑇 has
been removed. Similarly, 𝑃(𝑌 | do(𝑇2 = 𝑡2 )) corresponds to the graph
in Figure 4.4c, where the incoming edges to 𝑇2 have been removed.
Although it is not expressed in the graphs (which only express conditional
independencies and causal relations), under the modularity assumption,
𝑃(𝑌), 𝑃(𝑌 | do(𝑇 = 𝑡′)), and 𝑃(𝑌 | do(𝑇2 = 𝑡2)) all share the exact same
factors (that are not intervened on).

Figure 4.4: Intervention as edge deletion in causal graphs. (a) Causal graph for the observational distribution. (b) Causal graph after intervention on 𝑇 (interventional distribution). (c) Causal graph after intervention on 𝑇2 (interventional distribution).

What would it mean for the modularity assumption to be violated?


Imagine that you intervene on 𝑋𝑖 , and this causes the mechanism that
generates a different node 𝑋 𝑗 to change; an intervention on 𝑋𝑖 changes
𝑃(𝑥 𝑗 | pa 𝑗 ), where 𝑗 ≠ 𝑖 . In other words, the intervention is not local to
the node you intervene on; causal mechanisms are not invariant to when
you change other causal mechanisms; the causal mechanisms are not
modular.
This assumption is so important that Judea Pearl refers to a closely
related version (which we will see in Section 4.5.2) as The Law of
Counterfactuals (and Interventions), one of two key principles from
which all other causal results follow.3 Incidentally, taking the modularity
assumption (Assumption 4.1) and the Markov assumption (the other key
principle) together gives us causal Bayesian networks. We'll now move to
one of the important results that follow from these assumptions.

3 The other key principle is the global Markov assumption (Theorem 3.1), which is the assumption that d-separation implies conditional independence.

4.3 Truncated Factorization

Recall the Bayesian network factorization (Definition 3.1), which tells us


that if 𝑃 is Markov with respect to a graph 𝐺 , then 𝑃 factorizes as follows:
𝑃(𝑥1, . . . , 𝑥𝑛) = ∏𝑖 𝑃(𝑥𝑖 | pa𝑖)    (4.3)

where pa𝑖 denotes the parents of 𝑋𝑖 in 𝐺 . Now, if we intervene on some


set of nodes 𝑆 and assume modularity (Assumption 4.1), then all of the
factors should remain the same except the factors for 𝑋𝑖 ∈ 𝑆 ; those factors
should change to 1 (for values consistent with the intervention) because


those variables have been intervened on. This is how we get the truncated
factorization.

Proposition 4.1 (Truncated Factorization) We assume that 𝑃 and 𝐺 satisfy
the Markov assumption and modularity. Given a set of intervention nodes 𝑆,
if 𝑥 is consistent with the intervention, then

𝑃(𝑥1, . . . , 𝑥𝑛 | do(𝑆 = 𝑠)) = ∏𝑖∉𝑆 𝑃(𝑥𝑖 | pa𝑖) .    (4.4)

Otherwise, 𝑃(𝑥1, . . . , 𝑥𝑛 | do(𝑆 = 𝑠)) = 0.

The key thing that changed when we moved from the regular factorization
in Equation 4.3 to the truncated factorization in Equation 4.4 is that the
latter’s product is only over 𝑖 ∉ 𝑆 rather than all 𝑖 . In other words, the
factors for 𝑖 ∈ 𝑆 have been truncated.

4.3.1 Example Application and Revisiting “Association is Not Causation”

To see the power that the truncated factorization gives us, let’s apply it
to identify the causal effect of treatment on outcome in a simple graph.
Specifically, we will identify the causal quantity 𝑃(𝑦 | do(𝑡)). In this
example, the distribution 𝑃 is Markov with respect to the graph in Figure
4.5. The Bayesian network factorization (from the Markov assumption)
gives us the following:

𝑃(𝑦, 𝑡, 𝑥) = 𝑃(𝑥) 𝑃(𝑡 | 𝑥) 𝑃(𝑦 | 𝑡, 𝑥)    (4.5)
Figure 4.5: Simple causal structure where 𝑋 confounds the effect of 𝑇 on 𝑌 and where 𝑋 is the only confounder.

When we intervene on the treatment, the truncated factorization (from
adding the modularity assumption) gives us the following:

𝑃(𝑦, 𝑥 | do(𝑡)) = 𝑃(𝑥) 𝑃(𝑦 | 𝑡, 𝑥) (4.6)

Then, we simply need to marginalize out 𝑥 to get what we want:


𝑃(𝑦 | do(𝑡)) = ∑𝑥 𝑃(𝑦 | 𝑡, 𝑥) 𝑃(𝑥)    (4.7)

We assumed 𝑋 is discrete when we summed over its values, but we can


simply replace the sum with an integral if 𝑋 is continuous. Throughout
this book, that will be the case, so we usually won’t point it out.
If we massage Equation 4.7 a bit, we can clearly see how association is not
causation. The purely associational counterpart of 𝑃(𝑦 | do(𝑡)) is 𝑃(𝑦 | 𝑡).
If the 𝑃(𝑥) in Equation 4.7 were 𝑃(𝑥 | 𝑡), then we would actually recover
𝑃(𝑦 | 𝑡). We briefly show this:
∑𝑥 𝑃(𝑦 | 𝑡, 𝑥) 𝑃(𝑥 | 𝑡) = ∑𝑥 𝑃(𝑦, 𝑥 | 𝑡)    (4.8)
                       = 𝑃(𝑦 | 𝑡)    (4.9)

This gives some concreteness to the difference between association


and causation. In this example (which is representative of a broader
phenomenon), the difference between 𝑃(𝑦 | do(𝑡)) and 𝑃(𝑦 | 𝑡) is the


difference between 𝑃(𝑥) and 𝑃(𝑥 | 𝑡).
To round this example out, say 𝑇 is a binary random variable, and we
want to compute the ATE. 𝑃(𝑦 | do(𝑇 = 1)) is the distribution for 𝑌(1), so
we can just take the expectation to get 𝔼[𝑌(1)]. Similarly, we can do the
same thing with 𝑌(0). Then, we can write the ATE as follows:
𝔼[𝑌(1) − 𝑌(0)] = ∑𝑦 𝑦 𝑃(𝑦 | do(𝑇 = 1)) − ∑𝑦 𝑦 𝑃(𝑦 | do(𝑇 = 0))    (4.10)

If we then plug in Equation 4.7 for 𝑃(𝑦 | do(𝑇 = 1)) and 𝑃(𝑦 | do(𝑇 = 0)),
we have a fully identified ATE. Given the simple graph in Figure 4.5, we
have shown how we can use the truncated factorization to identify causal
effects in Equations 4.5 to 4.7. We will now generalize this identification
process to a more general formula.
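Before doing so, here is a small numerical sketch of the identification steps above for binary 𝑋, 𝑇, and 𝑌. The conditional probability tables below are made up purely for illustration; the point is that 𝑃(𝑦 | do(𝑡)) reweights 𝑃(𝑦 | 𝑡, 𝑥) by 𝑃(𝑥) (Equation 4.7), whereas 𝑃(𝑦 | 𝑡) reweights it by 𝑃(𝑥 | 𝑡) (Equation 4.8).

import numpy as np

# Hypothetical distributions that are Markov w.r.t. the graph X -> T, X -> Y, T -> Y
p_x = np.array([0.5, 0.5])                               # P(x)
p_t_given_x = np.array([[0.8, 0.2],                      # P(t | x = 0)
                        [0.3, 0.7]])                     # P(t | x = 1)
p_y_given_xt = np.array([[[0.9, 0.1], [0.6, 0.4]],       # P(y | t, x = 0), indexed [x, t, y]
                         [[0.7, 0.3], [0.2, 0.8]]])      # P(y | t, x = 1)

def p_y_do_t(t):
    # Equation 4.7: P(y | do(t)) = sum_x P(y | t, x) P(x)
    return sum(p_x[x] * p_y_given_xt[x, t, :] for x in range(2))

def p_y_given_t(t):
    # Equations 4.8-4.9: P(y | t) = sum_x P(y | t, x) P(x | t)
    p_x_given_t = np.array([p_x[x] * p_t_given_x[x, t] for x in range(2)])
    p_x_given_t /= p_x_given_t.sum()
    return sum(p_x_given_t[x] * p_y_given_xt[x, t, :] for x in range(2))

print(p_y_do_t(1)[1], p_y_given_t(1)[1])   # P(Y=1 | do(T=1)) = 0.6 vs P(Y=1 | T=1) ~ 0.71
# Equation 4.10 for binary Y, where E[Y] = P(Y = 1)
print(p_y_do_t(1)[1] - p_y_do_t(0)[1])     # ATE = 0.6 - 0.2 = 0.4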

4.4 The Backdoor Adjustment

Recall from Chapter 3 that causal association flows from 𝑇 to 𝑌 along


directed paths and that non-causal association flows along any other
paths from 𝑇 to 𝑌 that aren’t blocked by either 1) a non-collider that
is conditioned on or 2) a collider that isn’t conditioned on. These non-
directed unblocked paths from 𝑇 to 𝑌 are known as backdoor paths because
they have an edge that goes in the “backdoor” of the 𝑇 node. And it turns
out that if we can block these paths by conditioning, we can identify
causal quantities like 𝑃(𝑌 | do(𝑡)).4

This is precisely what we did in the previous section. We blocked the
backdoor path 𝑇 ← 𝑋 → 𝑌 in Figure 4.5 simply by conditioning on 𝑋
and marginalizing it out (Equation 4.7). In this section, we will generalize
Equation 4.7 to arbitrary DAGs. But before we do that, let's graphically
consider why the quantity 𝑃(𝑦 | do(𝑡)) is purely causal.

4 As we mentioned in Section 3.8, blocking all backdoor paths is equivalent to having d-separation in the graph where edges going out of 𝑇 are removed. This is because these are the only edges that causation flows along, so once they are removed, all that remains is non-causal association.
As we discussed in Section 4.2, the graph for the interventional distribution
𝑃(𝑌 | do(𝑡)) is the same as the graph for the observational distribution
𝑃(𝑌, 𝑇, 𝑋), but with the incoming edges to 𝑇 removed. For example, if we
take the graph from Figure 4.5 and intervene on 𝑇, then we get the
manipulated graph in Figure 4.6. In this manipulated graph, there cannot
be any backdoor paths because no edges are going into the backdoor of 𝑇.
Therefore, all of the association that flows from 𝑇 to 𝑌 in the manipulated
graph is purely causal.

Figure 4.6: Manipulated graph that results from intervening on 𝑇, when the original graph is Figure 4.5.
With that digression aside, let’s prove that we can identify 𝑃(𝑦 | do(𝑡)).
We want to turn the causal estimand 𝑃(𝑦 | do(𝑡)) into a statistical estimand
(only relies on the observational distribution). We’ll start with assuming
we have a set of variables 𝑊 that satisfy the backdoor criterion:

Definition 4.1 (Backdoor Criterion) A set of variables 𝑊 satisfies the
backdoor criterion relative to 𝑇 and 𝑌 if the following are true:

1. 𝑊 blocks all backdoor paths from 𝑇 to 𝑌.
2. 𝑊 does not contain any descendants of 𝑇.

5 Active reading exercise: In a general DAG, which set of nodes related to 𝑇 will always be a sufficient adjustment set? Which set of nodes related to 𝑌 will always be a sufficient adjustment set?

Satisfying the backdoor criterion makes 𝑊 a sufficient adjustment set.5


We saw an example of 𝑋 as a sufficient adjustment set in Section 4.3.1.
Because there was only a single backdoor path in Section 4.3.1, a single
node (𝑋 ) was enough to block all backdoor paths, but, in general, there
can be multiple backdoor paths.
To introduce 𝑊 into the proof, we’ll use the usual trick of conditioning
on variables and marginalizing them out:
𝑃(𝑦 | do(𝑡)) = ∑𝑤 𝑃(𝑦 | do(𝑡), 𝑤) 𝑃(𝑤 | do(𝑡))    (4.11)

Given that 𝑊 satisfies the backdoor criterion, we can write the following:

∑𝑤 𝑃(𝑦 | do(𝑡), 𝑤) 𝑃(𝑤 | do(𝑡)) = ∑𝑤 𝑃(𝑦 | 𝑡, 𝑤) 𝑃(𝑤 | do(𝑡))    (4.12)

This follows from the modularity assumption (Assumption 4.1). If 𝑊 is all


of the parents for 𝑌 (other than 𝑇 ), it should be clear that the modularity
assumption immediately implies 𝑃(𝑦 | do(𝑡), 𝑤) = 𝑃(𝑦 | 𝑡, 𝑤). If 𝑊 isn’t
the parents of 𝑌 but still blocks all backdoor paths another way, then this
equality is still true but requires using the graphical knowledge we built
up in Chapter 3.
In the manipulated graph (for 𝑃(𝑦 | do(𝑡), 𝑤)), all of the 𝑇 -𝑌 association
flows along the directed path(s) from 𝑇 to 𝑌 , since there cannot be
any backdoor paths because 𝑇 has no incoming edges. Similarly, in the
regular graph (for 𝑃(𝑦 | 𝑡, 𝑤)), all of the 𝑇 -𝑌 association flows along
the directed path(s) from 𝑇 to 𝑌 . This is because, even though there
exist backdoor paths, the association that would flow along them is
blocked by 𝑊 , leaving association to only flow along directed paths. In
both cases, association flows along the exact same directed paths, which
correspond to the exact same conditional distributions (by the modularity
assumption).
Although we’ve justified Equation 4.12, there is still a do in the expression:
𝑃(𝑤 | do(𝑡)). However, 𝑃(𝑤 | do(𝑡)) = 𝑃(𝑤). To see this, consider how 𝑇
might influence 𝑊 in the manipulated graph. It can't be through
any path that has an edge into 𝑇 because 𝑇 doesn’t have any incoming
edges in the manipulated graph. It can’t be through any path that has an
edge going out of 𝑇 because such a path would have to have a collider,
that isn’t conditioned on, on the path. We know any such colliders are
not conditioned on because we have assumed that 𝑊 does not contain
descendants of 𝑇 (second part of the backdoor criterion).6 Therefore, we 6 We will come back to what goes wrong
can write the final step: if we condition on descendants of 𝑇 in Sec-
tion 4.5.3, after we cover some important
X X
𝑃(𝑦 | 𝑡, 𝑤) 𝑃(𝑤 | do(𝑡)) = 𝑃(𝑦 | 𝑡, 𝑤) 𝑃(𝑤)
concepts that we need before we can fully
(4.13)
explain that.
𝑤 𝑤

This is known as the backdoor adjustment.

Theorem 4.2 (Backdoor Adjustment) Given the modularity assumption
(Assumption 4.1), that 𝑊 satisfies the backdoor criterion (Definition 4.1), and
positivity (Assumption 2.3), we can identify the causal effect of 𝑇 on 𝑌:

𝑃(𝑦 | do(𝑡)) = ∑𝑤 𝑃(𝑦 | 𝑡, 𝑤) 𝑃(𝑤)

Here’s a concise recap of the proof (Equations 4.11 to 4.13) without all of
the explanation/justification:

Proof.
𝑃(𝑦 | do(𝑡)) = ∑𝑤 𝑃(𝑦 | do(𝑡), 𝑤) 𝑃(𝑤 | do(𝑡))    (4.14)
             = ∑𝑤 𝑃(𝑦 | 𝑡, 𝑤) 𝑃(𝑤 | do(𝑡))    (4.15)
             = ∑𝑤 𝑃(𝑦 | 𝑡, 𝑤) 𝑃(𝑤)    (4.16)

Relation to d-separation We can use the backdoor adjustment if 𝑊 d-


separates 𝑇 from 𝑌 in the manipulated graph. Recall from Section 3.8 that
we mentioned that we would be able to isolate the causal association if 𝑇
is d-separated from 𝑌 in the manipulated graph. “Isolation of the causal
association” is identification. We can also isolate the causal association if
𝑌 is d-separated from 𝑇 in the manipulated graph, conditional on 𝑊 . This
is what the first part of the backdoor criterion is about and what we’ve
codified in the backdoor adjustment.
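This graphical condition is straightforward to check in code: remove the edges going out of 𝑇 and test whether 𝑊 d-separates 𝑇 and 𝑌 in the resulting graph. Below is a minimal sketch on a small hypothetical graph, again assuming a NetworkX version that provides d_separated; note that it only checks the first part of the backdoor criterion (it does not check that 𝑊 avoids descendants of 𝑇).

import networkx as nx

# Hypothetical causal graph: W -> T, W -> Y, T -> Y (W is a confounder)
G = nx.DiGraph([("W", "T"), ("W", "Y"), ("T", "Y")])

def blocks_all_backdoor_paths(G, treatment, outcome, adjustment_set):
    # In the graph with the treatment's outgoing edges removed, only backdoor
    # paths remain, so d-separation there means all backdoor paths are blocked.
    G_backdoor = G.copy()
    G_backdoor.remove_edges_from(list(G.out_edges(treatment)))
    return nx.d_separated(G_backdoor, {treatment}, {outcome}, set(adjustment_set))

print(blocks_all_backdoor_paths(G, "T", "Y", set()))   # False: T <- W -> Y is open
print(blocks_all_backdoor_paths(G, "T", "Y", {"W"}))   # True: W blocks the backdoor path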

4.4.1 Relation to Potential Outcomes

Hmm, the backdoor adjustment (Theorem 4.2) looks quite similar to


the adjustment formula (Theorem 2.1) that we saw back in the potential
outcomes chapter:

𝔼[𝑌(1) − 𝑌(0)] = 𝔼𝑊 [𝔼[𝑌 | 𝑇 = 1, 𝑊] − 𝔼[𝑌 | 𝑇 = 0, 𝑊]] (4.17)

We can derive this from the more general backdoor adjustment in a few
steps. First, we take an expectation over 𝑌 :
𝔼[𝑌 | do(𝑡)] = ∑𝑤 𝔼[𝑌 | 𝑡, 𝑤] 𝑃(𝑤)    (4.18)

Then, we notice that the sum over 𝑤 and 𝑃(𝑤) is an expectation (for
discrete 𝑤 , but just replace with an integral if not):

𝔼[𝑌 | do(𝑡)] = 𝔼𝑊 𝔼[𝑌 | 𝑡, 𝑊] (4.19)

And finally, we look at the difference between 𝑇 = 1 and 𝑇 = 0:

𝔼[𝑌 | do(𝑇 = 1)] − 𝔼[𝑌 | do(𝑇 = 0)] = 𝔼𝑊 [𝔼[𝑌 | 𝑇 = 1, 𝑊] − 𝔼[𝑌 | 𝑇 = 0, 𝑊]]    (4.20)
Since the do-notation 𝔼[𝑌 | do(𝑡)] is just another notation for the potential
outcomes 𝔼[𝑌(𝑡)], we are done! If you remember, one of the main as-
sumptions we needed to get Equation 4.17 (Theorem 2.1) was conditional
exchangeability (Assumption 2.2), which we repeat below:

(𝑌(1), 𝑌(0)) ⊥⊥ 𝑇 | 𝑊    (4.21)

However, we had no way of knowing how to choose 𝑊 or knowing


that that 𝑊 actually gives us conditional exchangeability. Well, using
graphical causal models, we know how to choose a valid 𝑊 : we simply
choose 𝑊 so that it satisfies the backdoor criterion. Then, under the
assumptions encoded in the causal graph, conditional exchangeability
provably holds; the causal effect is provably identifiable.

4.5 Structural Causal Models (SCMs)

Graphical causal models such as causal Bayesian networks give us


powerful ways to encode statistical and causal assumptions, but we have
yet to explain exactly what an intervention is or exactly what a causal
mechanism is. Moving from causal Bayesian networks to full structural
causal models will give us this additional clarity along with the power to
compute counterfactuals.

4.5.1 Structural Equations

As Judea Pearl often says, the equals sign in mathematics does not convey
any causal information. Saying 𝐴 = 𝐵 is the same as saying 𝐵 = 𝐴.
Equality is symmetric. However, in order to talk about causation, we
must have something asymmetric. We need to be able to write that 𝐴
is a cause of 𝐵, meaning that changing 𝐴 results in changes in 𝐵, but
changing 𝐵 does not result in changes in 𝐴. This is what we get when we
write the following structural equation:

𝐵 := 𝑓 (𝐴) , (4.22)

where 𝑓 is some function that maps 𝐴 to 𝐵. While the usual “=” symbol
does not give us causal information, this new “:=” symbol does. This
is a major difference that we see when moving from statistical models
to causal models. Now, we have the asymmetry we need to describe
causal relations. However, the mapping between 𝐴 and 𝐵 is deterministic.
Ideally, we’d like to allow it to be probabilistic, which allows room for
some unknown causes of 𝐵 that factor into this mapping. Then, we can
write the following:
𝐵 := 𝑓 (𝐴, 𝑈) , (4.23)
where 𝑈 is some unobserved random variable. We depict this in Figure 4.7,
where 𝑈 is drawn inside a dashed node to indicate that it is unobserved.
The unobserved 𝑈 is analogous to the randomness that we would see by
sampling units (individuals); it denotes all the relevant (noisy) background
conditions that determine 𝐵. More concretely, there are analogs to every
part of the potential outcome 𝑌𝑖(𝑡): 𝐵 is the analog of 𝑌, 𝐴 = 𝑎 is the analog
of 𝑇 = 𝑡, and 𝑈 is the analog of 𝑖.

Figure 4.7: Graph for simple structural equation. The dashed node 𝑈 means that 𝑈 is unobserved.
The functional form of 𝑓 does not need to be specified, and when
left unspecified, we are in the nonparametric regime because we aren’t
making any assumptions about parametric form. Although the mapping
is deterministic, because it takes a random variable 𝑈 (a “noise” or


“background conditions” variable) as input, it can represent any stochastic
mapping, so structural equations generalize the probabilistic factors
𝑃(𝑥 𝑖 | pa𝑖 ) that we’ve been using throughout this chapter. Therefore, all
the results that we’ve seen such as the truncated factorization and the
backdoor adjustment still hold when we introduce structural equations.
Cause and Causal Mechanism Revisited We have now come to the
more precise definitions of what a cause is (Definition 3.2) and what a
causal mechanism is (introduced in Section 4.2). A causal mechanism
that generates a variable is the structural equation that corresponds to
that variable. For example, the causal mechanism for 𝐵 is Equation 4.23.
Similarly, 𝑋 is a direct cause of 𝑌 if 𝑋 appears on the right-hand side of
the structural equation for 𝑌 . We say that 𝑋 is a cause of 𝑌 if 𝑋 is a direct
cause of any of the causes of 𝑌7 or if 𝑋 is a direct cause of 𝑌.

7 Trust me; the recursion ends. The base case was specified.
We only showed a single structural equation in Equation 4.23, but there
can be a large collection of structural equations in a single model, which
we will commonly label 𝑀 . For example, we write structural equations
for Figure 4.8 below:

𝑀:  𝐵 := 𝑓𝐵(𝐴, 𝑈𝐵)
    𝐶 := 𝑓𝐶(𝐴, 𝐵, 𝑈𝐶)    (4.24)
    𝐷 := 𝑓𝐷(𝐴, 𝐶, 𝑈𝐷)

Figure 4.8: Graph for the structural equations in Equation 4.24.

In causal graphs, the noise variables are often implicit, rather than
explicitly drawn. The variables that we write structural equations for
are known as endogenous variables. These are the variables whose causal
mechanisms we are modeling – the variables that have parents in the
causal graph. In contrast, exogenous variables are variables that do not
have any parents in the causal graph; these variables are external to our
causal model in the sense that we choose not to model their causes. For
example, in the causal model described by Figure 4.8 and Equation 4.24,
the endogenous variables are {𝐵, 𝐶, 𝐷}. And the exogenous variables
are {𝐴, 𝑈𝐵, 𝑈𝐶, 𝑈𝐷}.
tions in Equation 4.24.

Definition 4.2 (Structural Causal Model (SCM)) A structural causal


model is a tuple of the following sets:
1. A set of endogenous variables 𝑉
2. A set of exogenous variables 𝑈
3. A set of functions 𝑓 , one to generate each endogenous variable as a
function of other variables

For example, 𝑀 , the set of three equations above in Equation 4.24


constitutes an SCM with corresponding causal graph in Figure 4.8. Every
SCM implies an associated causal graph: for each structural equation,
draw an edge from every variable on the right-hand side to the variable
on the left-hand side.
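As a concrete sketch, here is the SCM 𝑀 of Equation 4.24 written as code. The functional forms and noise distributions below are arbitrary assumptions made just for illustration (the text leaves 𝑓𝐵, 𝑓𝐶, and 𝑓𝐷 unspecified); what matters is that each endogenous variable is produced by its own function of its parents and its noise variable.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical choices for the structural equations in Equation 4.24
def f_B(A, U_B):
    return 2 * A + U_B

def f_C(A, B, U_C):
    return A - B + U_C

def f_D(A, C, U_D):
    return 0.5 * A + C + U_D

def sample_M(n):
    # Exogenous variables: A and the noise terms U_B, U_C, U_D
    A = rng.normal(size=n)
    U_B, U_C, U_D = rng.normal(size=(3, n))
    # Endogenous variables, each generated by its own structural equation
    B = f_B(A, U_B)
    C = f_C(A, B, U_C)
    D = f_D(A, C, U_D)
    return A, B, C, D

A, B, C, D = sample_M(1000)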
If the causal graph contains no cycles (is a DAG) and the noise variables
𝑈 are independent, then the causal model is Markovian; the distribution
𝑃 is Markov with respect to the causal graph. If the causal graph doesn’t
contain cycles but the noise terms are dependent, then the model is semi-
Markovian. For example, if there is unobserved confounding, the model
is semi-Markovian. Finally, the graphs of non-Markovian models contain


cycles. We will largely be considering Markovian and semi-Markovian
models in this book.

4.5.2 Interventions

Interventions in SCMs are remarkably simple. The intervention do(𝑇 = 𝑡)


simply corresponds to replacing the structural equation for 𝑇 with 𝑇 := 𝑡 .
For example, consider the following causal model 𝑀 with corresponding
causal graph in Figure 4.9:

𝑀:   𝑇 := 𝑓𝑇(𝑋, 𝑈𝑇)
     𝑌 := 𝑓𝑌(𝑋, 𝑇, 𝑈𝑌)    (4.25)

Figure 4.9: Basic causal graph

If we then intervene on 𝑇 to set it to 𝑡, we get the interventional SCM 𝑀𝑡
below and corresponding manipulated graph in Figure 4.10.

𝑀𝑡:  𝑇 := 𝑡
     𝑌 := 𝑓𝑌(𝑋, 𝑇, 𝑈𝑌)    (4.26)
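In code, this intervention is a one-line change: the function generating 𝑇 is swapped for the constant 𝑡, and every other structural equation is left alone. The sketch below uses made-up choices for 𝑓𝑇 and 𝑓𝑌, since Equation 4.25 leaves them unspecified.

import numpy as np

rng = np.random.default_rng(0)

def sample(n, t=None):
    # Exogenous variables
    X = rng.normal(size=n)
    U_T, U_Y = rng.normal(size=(2, n))
    # T := f_T(X, U_T) in M; replaced by T := t in the interventional SCM M_t
    T = (X + U_T > 0).astype(float) if t is None else np.full(n, float(t))
    # Y := f_Y(X, T, U_Y) is shared by M and M_t (modularity)
    Y = 2 * T + 3 * X + U_Y
    return T, Y

T_obs, Y_obs = sample(100_000)         # observational data from M
_, Y_do_1 = sample(100_000, t=1)       # data from M_t, i.e. under do(T = 1)
print(Y_obs[T_obs == 1].mean())        # E[Y | T = 1]: larger than 2 because X confounds
print(Y_do_1.mean())                   # E[Y | do(T = 1)]: close to 2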

The fact that do(𝑇 = 𝑡) only changes the equation for 𝑇 and no other
variables is a consequence of the modularity assumption; these causal
mechanisms (structural equations) are modular. Assumption 4.1 states
the modularity assumption in the context of causal Bayesian networks,
but we need a slightly different translation of this assumption for SCMs.

Figure 4.10: Basic causal graph with the incoming edges to 𝑇 removed, due to the intervention do(𝑇 = 𝑡).

Assumption 4.2 (Modularity Assumption for SCMs) Consider an SCM


𝑀 and an interventional SCM 𝑀𝑡 that we get by performing the intervention
do(𝑇 = 𝑡). The modularity assumption states that 𝑀 and 𝑀𝑡 share all of
their structural equations except the structural equation for 𝑇 , which is 𝑇 := 𝑡
in 𝑀𝑡 .

In other words, the intervention do(𝑇 = 𝑡) is localized to 𝑇 . None of the


other structural equations change because they are modular; the causal
mechanisms are independent. The modularity assumption for SCMs is
what gives us what Pearl calls the The Law of Counterfactuals, which
we briefly mentioned at the end of Section 4.2, after we defined the
modularity assumption for causal Bayesian networks. But before we can
get to that, we must first introduce a bit more notation.
In the causal inference literature, there are many different ways of writing
the unit-level potential outcome. In Chapter 2, we used 𝑌𝑖 (𝑡). However,
there are other ways such as 𝑌𝑖𝑡 or even 𝑌𝑡 (𝑢). For example, in his
prominent potential outcomes paper, Holland [5] uses the 𝑌𝑡(𝑢) notation.
In this notation, 𝑢 is the analog of 𝑖, just as we mentioned is the case
for the 𝑈 in Equation 4.23 and the paragraph that followed it. This is
the notation that Pearl uses for SCMs as well [see, e.g., 17, Definition 4].
So 𝑌𝑡(𝑢) denotes the outcome that unit 𝑢 would observe if they take
treatment 𝑡, given that the SCM is 𝑀. Similarly, we define 𝑌𝑀𝑡(𝑢) as
the outcome that unit 𝑢 would observe if they take treatment 𝑡, given
that the SCM is 𝑀𝑡 (remember that 𝑀𝑡 is the same SCM as 𝑀 but with
the structural equation for 𝑇 changed to 𝑇 := 𝑡).

[5]: Holland (1986), ‘Statistics and Causal Inference’
[17]: Pearl (2009), ‘Causal inference in statistics: An overview’

Now, we are ready to
present one of Pearl's two key principles from which all other causal
results follow:8

8 Active reading exercise: Can you recall which was the other key principle/assumption?
Definition 4.3 (The Law of Counterfactuals (and Interventions))

𝑌𝑡 (𝑢) = 𝑌𝑀𝑡 (𝑢) (4.27)

This is called “The Law of Counterfactuals” because it gives us informa-


tion about counterfactuals. Given an SCM with enough details about it
specified, we can actually compute counterfactuals. This is a big deal
because this is exactly what the fundamental problem of causal inference
(Section 2.2) told us we cannot do. We won’t say more about how to do this
until we get to the dedicated chapter for counterfactuals: Chapter 14.

Active reading exercise: Take what you now know about structural equations, and relate it to other parts of this chapter. For example, how do interventions in structural equations relate to the modularity assumption? How does the modularity assumption for SCMs (Assumption 4.2) relate to the modularity assumption in causal Bayesian networks (Assumption 4.1)? Does this modularity assumption for SCMs still give us the backdoor adjustment?

4.5.3 Collider Bias and Why to Not Condition on Descendants of Treatment

In defining the backdoor criterion (Definition 4.1) for the backdoor
adjustment (Theorem 4.2), not only did we specify that the adjustment
set 𝑊 blocks all backdoor paths, but we also specified that 𝑊 does not
contain any descendants of 𝑇. There are two categories of things
that could go wrong if we condition on descendants of 𝑇:

1. We block the flow of causation from 𝑇 to 𝑌.
2. We induce non-causal association between 𝑇 and 𝑌.
As we'll see, it is fairly intuitive why we want to avoid the first category.
The second category is a bit more complex, and we'll break it up into two
different parts, each with their own paragraph. This more complex part
is actually why we delayed this explanation to after we introduced SCMs,
rather than back when we introduced the backdoor criterion/adjustment
in Section 4.4.

If we condition on a node that is on a directed path from 𝑇 to 𝑌, then we
block the flow of causation along that causal path. We will refer to a node
on a directed path from 𝑇 to 𝑌 as a mediator, as it mediates the effect of
𝑇 on 𝑌. For example, in Figure 4.11, all of the causal flow is blocked by
𝑀. This means that we will measure zero association between 𝑇 and 𝑌
(given that 𝑊 blocks all backdoor paths). In Figure 4.12, only a portion of
the causal flow is blocked by 𝑀. This is because causation can still flow
along the 𝑇 → 𝑌 edge. In this case, we will get a non-zero estimate of
the causal effect, but it will still be biased, due to the causal flow that 𝑀
blocks.

Figure 4.11: Causal graph where all causation is blocked by conditioning on 𝑀.
Figure 4.12: Causal graph where part of the causation is blocked by conditioning on 𝑀.
If we condition on a descendant of 𝑇 that isn't a mediator, it could unblock
a path from 𝑇 to 𝑌 that was blocked by a collider. For example, this is
the case with conditioning on 𝑍 in Figure 4.13. This induces non-causal
association between 𝑇 and 𝑌, which biases the estimate of the causal
effect. Consider the following general kind of path, where → · · · →
denotes a directed path: 𝑇 → · · · → 𝑍 ← · · · ← 𝑌. Conditioning on 𝑍,
or any descendant of 𝑍 in a path like this, will induce collider bias. That
is, the causal effect estimate will be biased by the non-causal association
that we induce when we condition on 𝑍 or any of its descendants (see
Section 3.6).

Figure 4.13: Causal graph where conditioning on the collider 𝑍 induces bias.
What about conditioning on 𝑍 in Figure 4.14? Would that induce bias?
Recall that graphs are frequently drawn without explicitly drawing
the noise variables. If we magnify part of the graph, making 𝑀's noise
variable explicit, we get Figure 4.15. Now, we see that 𝑇 → 𝑀 ← 𝑈𝑀
forms an immorality. Therefore, conditioning on 𝑍 induces an association
between 𝑇 and 𝑈𝑀. This induced non-causal association is another form
of collider bias. You might find this unsatisfying because 𝑌 is not one
of the immoral parents here; rather 𝑇 and 𝑈𝑀 are the ones living the
immoral lifestyle. So why would this change the association between 𝑇
and 𝑌? One way to get the intuition for this is that there is now induced
association flowing between 𝑇 and 𝑈𝑀 through the edge 𝑇 → 𝑀, which
is also an edge that causal association is flowing along. You can think of
these two types of association getting tangled up along the 𝑇 → 𝑀 edge,
making the observed association between 𝑇 and 𝑌 not purely causal. See
Pearl [18, Section 11.3.1 and 11.3.3] for more information on this topic.

Figure 4.14: Causal graph where the child of a mediator is conditioned on.
Figure 4.15: Magnified causal graph where the child of a mediator is conditioned on.

Note that we actually can condition on some descendants of 𝑇 without
inducing non-causal associations between 𝑇 and 𝑌. For example, condi-
tioning on descendants of 𝑇 that aren't on any causal paths to 𝑌 won't
induce bias. However, as you can see from the above paragraph, this can
get a bit tricky, so it is safest to just not condition on any descendants of
𝑇, as the backdoor criterion prescribes. Even outside of graphical causal
models (e.g. in potential outcomes literature), this rule is often applied; it
is usually described as not conditioning on any post-treatment covariates.

M-Bias Unfortunately, even if we only condition on pretreatment co-
variates, we can still induce collider bias. Consider what would happen
if we condition on the collider 𝑍2 in Figure 4.16. Doing this opens up
a backdoor path, along which non-causal association can flow. This is
known as M-bias due to the M shape that this non-causal association
flows along when the graph is drawn with children below their parents.
For many examples of collider bias, see Elwert and Winship [19].

Figure 4.16: Causal graph depicting M-Bias.
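A short simulation makes M-bias tangible. The linear SCM below is invented for illustration and mirrors the structure of Figure 4.16: 𝑍1 and 𝑍3 are unobserved, 𝑍2 is a pretreatment collider, and 𝑇 has no causal effect on 𝑌. Regressing 𝑌 on 𝑇 alone correctly gives a coefficient near zero, while additionally adjusting for 𝑍2 induces a spurious nonzero coefficient.

import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Hypothetical M-structure: Z1 -> T, Z1 -> Z2 <- Z3, Z3 -> Y, and no T -> Y edge
z1 = rng.normal(size=n)
z3 = rng.normal(size=n)
z2 = z1 + z3 + rng.normal(size=n)   # pretreatment collider
t = z1 + rng.normal(size=n)
y = z3 + rng.normal(size=n)         # the true causal effect of T on Y is 0

# OLS coefficient on T without and with the collider Z2 in the regression
coef_unadjusted = np.linalg.lstsq(np.c_[np.ones(n), t], y, rcond=None)[0][1]
coef_with_collider = np.linalg.lstsq(np.c_[np.ones(n), t, z2], y, rcond=None)[0][1]

print(round(coef_unadjusted, 3))      # ~ 0.0 (correct)
print(round(coef_with_collider, 3))   # ~ -0.2 (M-bias from conditioning on Z2)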

4.6 Example Applications of the Backdoor Adjustment

4.6.1 Association vs. Causation in a Toy Example

In this section, we posit a toy generative process and derive the bias of the
associational quantity 𝔼[𝑌 | 𝑡]. We compare this to the causal quantity
𝔼[𝑌 | do(𝑡)], which gives us exactly what we want. Note that both of
these quantities are actually functions of 𝑡. If the treatment were binary,

then we would just look at the difference between the quantities with
𝑇 = 1 and with 𝑇 = 0. However, because our generative processes will be
linear, d𝔼[𝑌 | 𝑡]/d𝑡 and d𝔼[𝑌 | do(𝑡)]/d𝑡 actually give us all the information about
the treatment effect, regardless of if treatment is continuous, binary, or
multi-valued. We will assume infinite data so that we can work with
expectations. This means this section has nothing to do with estimation;
for estimation, see the next section.
The generative process that we consider has the causal graph in Figure 4.17
and the following structural equations:

𝑇 := 𝛼1 𝑋    (4.28)
𝑌 := 𝛽𝑇 + 𝛼2 𝑋 .    (4.29)

Note that in the structural equation for 𝑌, 𝛽 is the coefficient in front of 𝑇.
This means that the causal effect of 𝑇 on 𝑌 is 𝛽. Keep this in mind as we
go through these calculations.

Figure 4.17: Causal graph for toy example

From the causal graph in Figure 4.17, we can see that 𝑋 is a sufficient
adjustment set. Therefore, 𝔼[𝑌 | do(𝑡)] = 𝔼𝑋 𝔼[𝑌 | 𝑡, 𝑋]. Let’s calculate
the value of this quantity in our example.

𝔼𝑋 𝔼[𝑌 | 𝑡, 𝑋] = 𝔼𝑋 [𝔼[𝛽𝑇 + 𝛼2 𝑋 | 𝑇 = 𝑡, 𝑋]]    (4.30)
              = 𝔼𝑋 [𝛽𝑡 + 𝛼2 𝑋]    (4.31)
              = 𝛽𝑡 + 𝛼2 𝔼[𝑋]    (4.32)

Importantly, we made use of the equality that the structural equation for
𝑌 (Equation 4.29) gives us in Equation 4.30. Now, we just have to take
the derivative to get the causal effect:

d 𝔼𝑋 𝔼[𝑌 | 𝑡, 𝑋] / d𝑡 = 𝛽 .    (4.33)
We got exactly what we were looking for. Now, let’s move to the associa-
tional quantity:

𝔼[𝑌 | 𝑇 = 𝑡] = 𝔼[𝛽𝑇 + 𝛼2 𝑋 | 𝑇 = 𝑡]    (4.34)
             = 𝛽𝑡 + 𝛼2 𝔼[𝑋 | 𝑇 = 𝑡]    (4.35)
             = 𝛽𝑡 + (𝛼2/𝛼1) 𝑡    (4.36)

In Equation 4.36, we made use of the equality that the structural equation
for 𝑇 (Equation 4.28) gives us. If we then take the derivative, we see that
there is confounding bias:

d 𝔼[𝑌 | 𝑡] / d𝑡 = 𝛽 + 𝛼2/𝛼1 .    (4.37)
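As a quick sanity check of Equations 4.33 and 4.37, we can replay the structural equations in code. The values of 𝛼1, 𝛼2, and 𝛽 below are arbitrary choices for illustration; because 𝑌 = (𝛽 + 𝛼2/𝛼1) 𝑇 holds exactly in this generative process, the observed slope of 𝑌 on 𝑇 is 𝛽 + 𝛼2/𝛼1, while intervening on 𝑇 recovers the slope 𝛽.

import numpy as np

rng = np.random.default_rng(0)
alpha1, alpha2, beta = 2.0, 3.0, 1.5      # arbitrary values for illustration
x = rng.normal(size=1_000_000)

# Observational data from the SCM in Equations 4.28 and 4.29
t_obs = alpha1 * x
y_obs = beta * t_obs + alpha2 * x

# Associational slope (Equation 4.37)
print(np.polyfit(t_obs, y_obs, 1)[0])     # ~ beta + alpha2 / alpha1 = 3.0

# Interventional slope (Equation 4.33): set T := t for everyone and recompute Y
def mean_y_do(t):
    return np.mean(beta * t + alpha2 * x)
print(mean_y_do(1.0) - mean_y_do(0.0))    # ~ beta = 1.5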

To recap, 𝔼𝑋 𝔼[𝑌 | 𝑡, 𝑋] gave us the causal effect we were looking for


(Equation 4.33), whereas the associational quantity 𝔼[𝑌 | 𝑡] did not
(Equation 4.37). Now, let’s go through an example that also takes into
account estimation.

4.6.2 A Complete Example with Estimation

Recall that we estimated a concrete value for the causal effect of sodium
intake on blood pressure in Section 2.5. There, we used the potential
outcomes framework. Here, we will do the same thing, but using causal
graphs. The spoiler is that the 19% error that we saw in Section 2.5 was
due to conditioning on a collider.

[8]: Luque-Fernandez et al. (2018), ‘Educational Note: Paradoxical collider effect in the analysis of non-communicable disease epidemiological data: a reproducible illustration and web application’

First, we need to write down our causal assumptions in terms of a causal
graph. Remember that in Luque-Fernandez et al. [8]'s example from
epidemiology, the treatment 𝑇 is sodium intake, and the outcome 𝑌 is
blood pressure. The covariates are age 𝑊 and amount of protein in urine
(proteinuria) 𝑍 . Age is a common cause of both blood pressure and the
body’s ability to self-regulate sodium levels. In contrast, high amounts
of urinary protein are caused by high blood pressure and high sodium
intake. This means that proteinuria is a collider. We depict this causal
graph in Figure 4.18.

Figure 4.18: Causal graph for the blood pressure example. 𝑇 is sodium intake. 𝑌 is blood pressure. 𝑊 is age. And, importantly, the amount of protein excreted in urine 𝑍 is a collider.

Because 𝑍 is a collider, conditioning on it induces bias. Because 𝑊 and 𝑍
were grouped together as “covariates” 𝑋 in Section 2.5, we conditioned
on all of them. This is why we saw that our estimate was 19% off from
the true causal effect 1.05. Now that we've made the causal relationships
clear with a causal graph, the backdoor criterion (Definition 4.1) tells us
to only adjust for 𝑊 and to not adjust for 𝑍. More precisely, we were
doing the following adjustment in Section 2.5:

𝔼𝑊,𝑍 𝔼[𝑌 | 𝑡, 𝑊, 𝑍]    (4.38)
And now, we will use the backdoor adjustment (Theorem 4.2) to change
our statistical estimand to the following:

𝔼𝑊 𝔼[𝑌 | 𝑡, 𝑊] (4.39)

We have simply removed the collider 𝑍 from the variables we adjust for.
For estimation, just as we did in Section 2.5, we use a model-assisted
estimator. We replace the outer expectation over 𝑊 with an empirical
mean over 𝑊 and replace the conditional expectation 𝔼[𝑌 | 𝑡, 𝑊] with a
machine learning model (in this case, linear regression).
Just as writing down the graph has led us to simply not condition on 𝑍
in Equation 4.39, the code for estimation also barely changes. We need to
change just a single line of code in our previous program (Listing 2.1).
We display the full program with the fixed line of code below:

Listing 4.1: Python code for estimating the ATE, without adjusting for the collider

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

Xt = df[['sodium', 'age']]
y = df['blood_pressure']
model = LinearRegression()
model.fit(Xt, y)

Xt1 = pd.DataFrame.copy(Xt)
Xt1['sodium'] = 1
Xt0 = pd.DataFrame.copy(Xt)
Xt0['sodium'] = 0
ate_est = np.mean(model.predict(Xt1) - model.predict(Xt0))
print('ATE estimate:', ate_est)

Full code, complete with simulation, is available at https://github.com/bradyneal/causal-book-code/blob/master/sodium_example.py.

Namely, we’ve changed line 5 from


Xt = df[['sodium', 'age', 'proteinuria']]

in Listing 2.1 to
Xt = df[['sodium', 'age']]

in Listing 4.1. When we run this revised code, we get an ATE estimate of
1.0502, which corresponds to 0.02% error (true value is 1.05) when using
a fairly large sample.9

9 Active reading exercise: Given that 𝑌 is generated as a linear function of 𝑇 and 𝑊, could we have just used the coefficient in front of 𝑇 in the linear regression as an estimate for the causal effect?

Progression of Reducing Bias When looking at the total association
between 𝑇 and 𝑌 by simply regressing 𝑌 on 𝑇, we got an estimate that was
a staggering 407% off of the true causal effect, due largely to confounding
bias (see Section 2.5). When we adjusted for all covariates in Section 2.5,
we reduced the percent error all the way down to 19%. In this section,
we saw this remaining error is due to collider bias. When we removed
the collider bias, by not conditioning on the collider 𝑍, the error became
non-existent.

Potential Outcomes and M-Bias In fairness to the general culture
around the potential outcomes framework, it is common to only condition
on pretreatment covariates. This would prevent a practitioner who
adheres to this rule from conditioning on the collider 𝑍 in Figure 4.18.
However, there is no reason that there can't be pretreatment colliders
that induce M-bias (Section 4.5.3). In Figure 4.19, we depict an example
of M-bias that is created by conditioning on 𝑍2. We could fix this by
additionally conditioning on 𝑍1 and/or 𝑍3, but in this example, they are
unobserved (indicated by the dashed lines). This means that the only
way to avoid M-bias in Figure 4.19 is to not condition on the covariates 𝑍2.

Figure 4.19: Causal graph depicting M-Bias that can only be avoided by not conditioning on the collider 𝑍2. This is due to the fact that the dashed nodes 𝑍1 and 𝑍3 are unobserved.
4.7 Assumptions Revisited

The first main set of assumptions is encoded by the causal graph that we
write down. Exactly what this causal graph means is determined by two
main assumptions, each of which can take on several different forms:
1. The Modularity Assumption
   Different forms:
   ▶ Modularity Assumption for Causal Bayesian Networks (Assumption 4.1)
   ▶ Modularity Assumption for SCMs (Assumption 4.2)
   ▶ The Law of Counterfactuals (Definition 4.3)
2. The Markov Assumption
   Different equivalent forms:
   ▶ Local Markov assumption (Assumption 3.1)
   ▶ Bayesian network factorization (Definition 3.1)
   ▶ Global Markov assumption (Theorem 3.1)

Given these two assumptions (and positivity), if the backdoor criterion
(Definition 4.1) is satisfied in our assumed causal graph, then we have
identification. Note that although the backdoor criterion is a sufficient
condition for identification, it is not a necessary condition. We will see
this more in Chapter 6.

Now that you're familiar with causal graphical models and SCMs, it may be worth going back and rereading Chapter 2 while trying to make connections to what you've learned about graphical causal models in these past two chapters.

More Formal If you're really into fancy formalism, there are some
relevant sources to check out. You can see the fundamental axioms that
underlie The Law of Counterfactuals in [20, 21], or if you want a textbook,
you can find them in [18, Chapter 7.3]. To see proofs of the equivalence of
all three forms of the Markov assumption, see, for example, [13, Chapter 3].

[20]: Galles and Pearl (1998), ‘An Axiomatic Characterization of Causal Counterfactuals’
[21]: Halpern (1998), ‘Axiomatizing Causal Reasoning’
[18]: Pearl (2009), Causality
[13]: Koller and Friedman (2009), Probabilistic Graphical Models: Principles and Techniques
Connections to No Interference, Consistency, and Positivity The no
interference assumption (Assumption 2.4) is commonly implicit in causal
graphs, since the outcome 𝑌 (think 𝑌𝑖) usually only has a single node 𝑇
(think 𝑇𝑖) for treatment as a parent, rather than having multiple treatment
nodes 𝑇𝑖, 𝑇𝑖−1, 𝑇𝑖+1, etc. as parents. However, causal DAGs can be extended
to settings where there is interference [22]. Consistency (Assumption 2.5)
follows from the axioms of SCMs (see [18, Corollary 7.3.2] and [23]).
Positivity (Assumption 2.3) is still a very important assumption that we
must make, though it is sometimes neglected in the graphical models
literature.

[22]: Ogburn and VanderWeele (2014), ‘Causal Diagrams for Interference’
[18]: Pearl (2009), Causality
[23]: Pearl (2010), ‘On the consistency rule in causal inference: axiom, definition, assumption, or theorem?’
5 Randomized Experiments

Randomized experiments are noticeably different from observational
studies. In randomized experiments, the experimenter has complete control
over the treatment assignment mechanism (how treatment is assigned).
For example, in the most simple kind of randomized experiment, the
experimenter randomly assigns (e.g. via coin toss) each participant to
either the treatment group or the control group. This complete control
over how treatment is chosen is what distinguishes randomized experi-
ments from observational studies. In this simple experimental setup, the
treatment isn’t a function of covariates at all! In contrast, in observational
studies, the treatment is almost always a function of some covariate(s).
As we will see, this difference is key to whether or not confounding is
present in our data.
In randomized experiments, association is causation. This is because ran-
domized experiments are special in that they guarantee that there is no
confounding. As a consequence, this allows us to measure the causal effect
𝔼[𝑌(1)]−𝔼[𝑌(0)] via the associational difference 𝔼[𝑌 | 𝑇 = 1] − 𝔼[𝑌 | 𝑇 = 0].
In the following sections, we explain why this is the case from a variety
of different perspectives. If any one of these explanations clicks with you,
that might be good enough. Definitely stick through to the most visually
appealing explanation in Section 5.3.

5.1 Comparability and Covariate Balance

Ideally, the treatment and control groups would be the same, in all
aspects, except for treatment. This would mean they only differ in the
treatment they receive (i.e. they are comparable). This would allow us to
attribute any difference in the outcomes of the treatment and control
groups to the treatment. Saying that these treatment groups are the same
in everything other than their treatment and outcomes is the same as
saying they have the same distribution of confounders. Because people
often check for this property on observed variables (often what people
mean by “covariates”), this concept is known as covariate balance.

Definition 5.1 (Covariate Balance) We have covariate balance if the distri-


bution of covariates 𝑋 is the same across treatment groups. More formally,

𝑃(𝑋 | 𝑇 = 1) =𝑑 𝑃(𝑋 | 𝑇 = 0)    (5.1)

The symbol =𝑑 means “equal in distribution.”

Randomization implies covariate balance, across all covariates, even
unobserved ones. Intuitively, this is because the treatment is chosen at
random, regardless of 𝑋 , so the treatment and control groups should
look very similar. The proof is simple. Because 𝑇 is not at all determined
by 𝑋 (solely by a coin flip), 𝑇 is independent of 𝑋 . This means that
𝑃(𝑋 | 𝑇 = 1) =𝑑 𝑃(𝑋). Similarly, it means 𝑃(𝑋 | 𝑇 = 0) =𝑑 𝑃(𝑋). Therefore,
we have 𝑃(𝑋 | 𝑇 = 1) =𝑑 𝑃(𝑋 | 𝑇 = 0).
Although we have proven that randomization implies covariate balance,
we have not proven that covariate balance implies that association is
causation.1 We'll now prove that by showing that 𝑃(𝑦 | do(𝑡)) = 𝑃(𝑦 | 𝑡).
For the proof, the main property we utilize is that covariate balance
implies 𝑋 and 𝑇 are independent.

1 Recall that the intuition is that covariate balance means that everything is the same between the treatment groups, except for the treatment, so the treatment must be the explanation for the change in 𝑌.
Proof. First, let 𝑋 be a sufficient adjustment set that potentially contains
unobserved variables (randomization also balances unobserved covariates).
Such an adjustment set must exist because we allow it to contain any
variables, observed or unobserved. Then, we have the following from the
backdoor adjustment (Theorem 4.2):
𝑃(𝑦 | do(𝑡)) = ∑𝑥 𝑃(𝑦 | 𝑡, 𝑥) 𝑃(𝑥)    (5.2)

By multiplying by 𝑃(𝑡 | 𝑥)/𝑃(𝑡 | 𝑥), we get the joint distribution in the numerator:

= ∑𝑥 𝑃(𝑦 | 𝑡, 𝑥) 𝑃(𝑡 | 𝑥) 𝑃(𝑥) / 𝑃(𝑡 | 𝑥)    (5.3)
= ∑𝑥 𝑃(𝑦, 𝑡, 𝑥) / 𝑃(𝑡 | 𝑥)    (5.4)

Now, we use the important property that 𝑋 ⊥⊥ 𝑇:

= ∑𝑥 𝑃(𝑦, 𝑡, 𝑥) / 𝑃(𝑡)    (5.5)

An application of Bayes rule and marginalization gives us the rest:


= ∑𝑥 𝑃(𝑦, 𝑥 | 𝑡)    (5.6)
= 𝑃(𝑦 | 𝑡)    (5.7)
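A small simulation sketch illustrates both points at once: when treatment is a pure coin flip, the covariate distribution is (approximately) balanced across groups, and the simple difference in group means recovers the causal effect. The data-generating process below is made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(size=n)                   # a covariate (a would-be confounder if T depended on it)

# Randomized experiment: T is a coin flip, independent of X
t = rng.binomial(1, 0.5, n)
y = 2 * t + 3 * x + rng.normal(size=n)   # true causal effect of T on Y is 2

# Covariate balance: X has (approximately) the same distribution in both groups
print(x[t == 1].mean(), x[t == 0].mean())   # both ~ 0

# Association is causation here: the difference in means is ~ 2
print(y[t == 1].mean() - y[t == 0].mean())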

5.2 Exchangeability

Exchangeability (Assumption 2.1) gives us another perspective on why


randomization makes causation equal to association. To see why, consider
the following thought experiment. We decide an individual’s treatment
group using a random coin flip as follows: if the coin is heads, we assign
the individual to the treatment group (𝑇 = 1), and if the coin is tails,
we assign the individual to the control group (𝑇 = 0). If the groups are
exchangeable, we could exchange these groups, and the average outcomes
would remain the same. This is intuitively true if we chose the groups
with a coin flip. Imagine simply swapping the meaning of “heads” and
“tails” in this experiment. Would you expect that to change the results at
all? No. This is why randomized experiments give us exchangeability.
Recall from Section 2.3.2 that mean exchangeability is formally the


following:

𝔼[𝑌(1) | 𝑇 = 1] = 𝔼[𝑌(1) | 𝑇 = 0] (5.8)


𝔼[𝑌(0) | 𝑇 = 0] = 𝔼[𝑌(0) | 𝑇 = 1] (5.9)

The “exchange” is when we go from 𝑌(1) in the treatment group to 𝑌(1)


in the control group (Equation 5.8) and from 𝑌(0) in the control group to
𝑌(0) in the treatment group (Equation 5.9).
To see the proof of why association is causation in randomized ex-
periments through the lens of exchangeability, recall the proof from
Section 2.3.2. First, recall that Equation 5.8 means that both quantities in
it are equal to the marginal expected outcome 𝔼[𝑌(1)] and, similarly, that
Equation 5.9 means that both quantities in it are equal to the marginal
expected outcome 𝔼[𝑌(0)]. Then, we have the following proof:

𝔼[𝑌(1)] − 𝔼[𝑌(0)] = 𝔼[𝑌(1) | 𝑇 = 1] − 𝔼[𝑌(0) | 𝑇 = 0] (2.3 revisited)


= 𝔼[𝑌 | 𝑇 = 1] − 𝔼[𝑌 | 𝑇 = 0] (2.4 revisited)

5.3 No Backdoor Paths

The final perspective that we'll look at to see why association is causation
in randomized experiments is that of graphical causal models. In regular
observational data, there is almost always confounding. For example,
in Figure 5.1, we see that 𝑋 is a confounder of the effect of 𝑇 on 𝑌.

Figure 5.1: Causal structure of 𝑋 confounding the effect of 𝑇 on 𝑌.
Non-causal association flows along the backdoor path 𝑇 ← 𝑋 → 𝑌 .
However, if we randomize 𝑇 , something magical happens: 𝑇 no longer
has any causal parents, as we depict in Figure 5.2. This is because 𝑇 is
purely random. It doesn’t depend on anything other than the output of a
coin toss (or a quantum random number generator, if you’re into the kind
of stuff). Because 𝑇 has no incoming edges, under randomization, there 𝑋
are no backdoor paths. So the empty set is a sufficient adjustment set. This
means that all of the association that flows from 𝑇 to 𝑌 is causal. We can
identify 𝑃(𝑌 | do(𝑇 = 𝑡)) by simply applying the backdoor adjustment
𝑇 𝑌
(Theorem 4.2), adjusting for the empty set:
Figure 5.2: Causal structure when we ran-
𝑃(𝑌 | do(𝑇 = 𝑡)) = 𝑃(𝑌 | 𝑇 = 𝑡) domize treatment.

With that, we conclude our discussion of why association is causation in


randomized experiments. Hopefully, at least one of these three explana-
tions is intuitive to you and easy to store in long-term memory.
Nonparametric Identification 6
In Section 4.4, we saw that satisfying the backdoor criterion is sufficient 6.1 Frontdoor Adjustment . . . 52
to give us identifiability, but is the backdoor criterion also necessary? 6.2 do-calculus . . . . . . . . . . 55
In other words, is it possible to get identifiability without being able to Application: Frontdoor Ad-
block all backdoor paths? justment . . . . . . . . . . . 57
6.3 Determining Identifiability
As an example, consider that we have data generated according to the
from the Graph . . . . . . . 58
graph in Figure 6.1. We don’t observe 𝑊 in this data, so we can’t block the
backdoor path through 𝑊 and the confounding association that flows
along it. But we still need to identify the causal effect. It turns out that it
confounding association
is possible to identify the causal effect in this graph, using the frontdoor
criterion. We’ll see the frontdoor criterion and corresponding adjustment
in Section 6.1. Then, we’ll consider even more general identification in 𝑊
Section 6.2 when we introduce do-calculus. We’ll conclude with graphical
conditions for identifiability in Section 6.3.
𝑇 𝑀 𝑌

causal association
6.1 Frontdoor Adjustment Figure 6.1: Causal graph where 𝑊 is un-
observed, so we cannot block the back-
door path. We depict the flow of causal
The high-level intuition for why we can identify the causal effect of 𝑇 on association and the flow of confounding
𝑌 in the graph in Figure 6.1 (even when we can’t adjust for the confounder association with dashed lines.

𝑊 because it is unobserved) is as follows: a mediator like 𝑀 is very


helpful; we can isolate the association that flows through 𝑀 by focusing 𝑊
our statistical analysis on 𝑀 , and the only association that flows through
𝑀 is causal association (association flowing along directed paths from 𝑇 focus
to 𝑌 ). We illustrate this intuition in Figure 6.2, where we depict only the
causal association. In this section, we will focus our analysis on 𝑀 using a 𝑇 𝑀 𝑌
three step procedure (see Figure 6.3 for our corresponding illustration):
only causal association
1. Identify the causal effect of 𝑇 on 𝑀 .
Figure 6.2: In contrast to Figure 6.1, when
2. Identify the causal effect of 𝑀 on 𝑌 . we focus our analysis on 𝑀 , we are able
3. Combine the above steps to identify the causal effect of 𝑇 on 𝑌 . to isolate only the causal association.

Step 1 First, we will identify the effect of 𝑇 on 𝑀 : 𝑃(𝑚 | do(𝑡)). Because 𝑊


𝑌 is a collider on the 𝑇 − 𝑀 path through 𝑊 , it blocks that backdoor path.
So there are no unblocked backdoor paths from 𝑇 to 𝑀 . This means that
the only association that flows from 𝑇 to 𝑀 is the causal association that
flows along the edge connecting them. Therefore, we have the following 𝑇 𝑀 𝑌
Step 1 Step 2
identification via the backdoor adjustment (Theorem 4.2, using the empty
set as the adjustment set):1 Step 3
Figure 6.3: Illustration of steps to get to
𝑃(𝑚 | do(𝑡)) = 𝑃(𝑚 | 𝑡) (6.1) the frontdoor adjustment.

Step 2 Second, we will identify the effect of 𝑀 on 𝑌 : 𝑃(𝑦 | do(𝑚)).


1Active reading exercise: Write a proof for
Because 𝑇 blocks the backdoor path 𝑀 ← 𝑇 ← 𝑊 → 𝑌 , we can simply
Equation 6.1 without using the backdoor
adjustment. Instead, start from the trun-
cated factorization (Proposition 4.1) like
we did in Section 4.3.1. Hint: The proof
can be quite short. We provide a proof in
Appendix A.1, in case you get stuck.
6 Nonparametric Identification 53

adjust for 𝑇 . Therefore, using the backdoor adjustment again, we have


the following:
X
𝑃(𝑦 | do(𝑚)) = 𝑃(𝑦 | 𝑚, 𝑡) 𝑃(𝑡) (6.2)
𝑡

Step 3 Now that we know how changing 𝑇 changes 𝑀 (step 1) and how
changing 𝑀 changes 𝑌 (step 2), we can combine these two to get how
changing 𝑇 changes 𝑌 (through 𝑀 ):
X
𝑃(𝑦 | do(𝑡)) = 𝑃(𝑚 | do(𝑡)) 𝑃(𝑦 | do(𝑚)) (6.3)
𝑚

The first factor on the right-hand side corresponds to setting 𝑇 to 𝑡


and observing the resulting value of 𝑀 . The second factor corresponds
to setting 𝑀 to exactly the value 𝑚 that resulted from setting 𝑇 and
then observing what value of 𝑌 results. We must sum over 𝑚 because
𝑃(𝑚 | do(𝑡)) is probabilistic, so we must sum over its support. In other
words, we must sum over all possible realizations 𝑚 of the random
variables whose distribution is 𝑃(𝑀 | do(𝑡)).
If we then plug in Equations 6.1 and 6.2 into Equation 6.3, we get the
frontdoor adjustment (keep reading to see the definition of the frontdoor
criterion):

Theorem 6.1 (Frontdoor Adjustment) If (𝑇, 𝑀, 𝑌) satisfy the frontdoor


criterion and we have positivity, then
X X
𝑃(𝑦 | do(𝑡)) = 𝑃(𝑚 | 𝑡) 𝑃(𝑦 | 𝑚, 𝑡 0) 𝑃(𝑡 0) (6.4) 𝑊
𝑚 𝑡0

The causal graph we’ve been using (Figure 6.4) is an example of a simple
graph that satisfies the frontdoor criterion. To get the full definition, we 𝑇 𝑀 𝑌
must first define complete/full mediation: a set of variables 𝑀 completely
Figure 6.4: Simple causal graph that satis-
mediates the effect of 𝑇 on 𝑌 if all causal (directed) paths from 𝑇 to fies the frontdoor criterion
𝑌 go through 𝑀 . We now give the general definition of the frontdoor
criterion:
2 Active reading exercise: Think of a graph
Definition 6.1 (Frontdoor Criterion) A set of variables 𝑀 satisfies the other than Figure 6.4 that satisfies the

frontdoor criterion relative to 𝑇 and 𝑌 if the following are true:


frontdoor criterion. Also, for each condi-
tion, think of a graph that does not satisfy
1. 𝑀 completely mediates the effect of 𝑇 on 𝑌 (i.e. all causal paths from 𝑇 only that condition.

to 𝑌 go through 𝑀 ).
2. There is no unblocked backdoor path from 𝑇 to 𝑀 .
3. All backdoor paths from 𝑀 to 𝑌 are blocked by 𝑇 .2
𝑇
Although Equations 6.1 and 6.2 are straightforward applications of the
much
rigor 𝑌
𝑊
backdoor adjustment, we hand-waved our way to Equation 6.3, which
𝑀
was key to the frontdoor adjustment (Theorem 6.1). We’ll now walk equation very
through how to get Equation 6.3. Active reading exercise: Feel free to wow
stop reading here and do this yourself.
quick
We are about to enter Equationtown (Figure 6.5), so if you are satisfied with
maths
the intuition we gave for step 3 and prefer to not see a lot of equations,
feel free to skip to the end of the proof (denoted by the symbol).
Figure 6.5: Equationtown
6 Nonparametric Identification 54

Proof. As usual, we start with the truncated factorization, using the


causal graph in Figure 6.4. From the Bayesian network factorization
(Definition 3.1), we have the following:

𝑃(𝑤, 𝑡, 𝑚, 𝑦) = 𝑃(𝑤) 𝑃(𝑡 | 𝑤) 𝑃(𝑚 | 𝑡) 𝑃(𝑦 | 𝑤, 𝑚) (6.5)

Then, using the truncated factorization (Proposition 4.1), we remove the


factor for 𝑇 :

𝑃(𝑤, 𝑚, 𝑦 | do(𝑡)) = 𝑃(𝑤) 𝑃(𝑚 | 𝑡) 𝑃(𝑦 | 𝑤, 𝑚) (6.6)

Next, we marginalize out 𝑤 and 𝑚 :


XX XX
𝑃(𝑤, 𝑚, 𝑦 | do(𝑡)) = 𝑃(𝑤) 𝑃(𝑚 | 𝑡) 𝑃(𝑦 | 𝑤, 𝑚) (6.7)
𝑚 𝑤 𝑚 𝑤
X X
𝑃(𝑦 | do(𝑡)) = 𝑃(𝑚 | 𝑡) 𝑃(𝑦 | 𝑤, 𝑚) 𝑃(𝑤) (6.8)
𝑚 𝑤

Even though we’ve removed all the do operators, recall that we are not
done because 𝑊 is unobserved. So we must also remove the 𝑤 from the
expression. This is where we have to get a bit creative.
We want to be able to combine 𝑃(𝑦 | 𝑤, 𝑚) and 𝑃(𝑤) into a joint factor
over both 𝑦 and 𝑤 so that we can marginalize out 𝑤 . To do this, we
need to get 𝑚 behind the conditioning bar of the 𝑃(𝑤) factor. This would
be easy if we could just swap 𝑃(𝑤) out for 𝑃(𝑤 | 𝑚) in Equation 6.8.3 3 Active reading exercise: Why would it
The key thing to notice is that we actually can include 𝑚 behind the be easy to marginalize out 𝑤 if it were
the case that 𝑃(𝑤) = 𝑃(𝑤 | 𝑚)? And why
conditioning bar if 𝑡 were also there because 𝑇 d-separates 𝑊 from 𝑀 in
does this equality not hold?
Figure 6.6. In math, this means that the following equality holds:

𝑃(𝑤 | 𝑡) = 𝑃(𝑤 | 𝑡, 𝑚) (6.9)

Great, so how do we get 𝑡 into this party? The usual trick of conditioning
𝑊
on it and marginalizing it out:
X X
𝑃(𝑦 | do(𝑡)) = 𝑃(𝑚 | 𝑡) 𝑃(𝑦 | 𝑤, 𝑚) 𝑃(𝑤) (6.8 revisited)
𝑚 𝑤
X X X 𝑇 𝑀 𝑌
= 𝑃(𝑚 | 𝑡) 𝑃(𝑦 | 𝑤, 𝑚) 𝑃(𝑤 | 𝑡 0) 𝑃(𝑡 0) (6.10)
𝑚 𝑤 𝑡 0
Figure 6.6: Simple causal graph that satis-
X X X
= 𝑃(𝑚 | 𝑡) 𝑃(𝑦 | 𝑤, 𝑚) 𝑃(𝑤 | 𝑡 0 , 𝑚) 𝑃(𝑡 0) (6.11) fies the frontdoor criterion
𝑚 𝑤 𝑡0
X X 0
X
= 𝑃(𝑚 | 𝑡) 𝑃(𝑡 ) 𝑃(𝑦 | 𝑤, 𝑚) 𝑃(𝑤 | 𝑡 0 , 𝑚) (6.12)
𝑚 𝑡0 𝑤

Great, but now we can’t combine 𝑃(𝑦 | 𝑤, 𝑚) and 𝑃(𝑤 | 𝑡 0 , 𝑚) because


𝑃(𝑦 | 𝑤, 𝑚) is missing this newly introduced 𝑡 0 behind its conditioning
bar. Luckily, we can fix that4 and combine the two factors: 4 Active reading exercise: Why is
X X X 𝑃(𝑦 | 𝑤, 𝑚) equal to 𝑃(𝑦 | 𝑤, 𝑡 0 , 𝑚)?
= 𝑃(𝑚 | 𝑡) 𝑃(𝑡 0) 𝑃(𝑦 | 𝑤, 𝑚) 𝑃(𝑤 | 𝑡 0 , 𝑚) (6.13)
𝑚 𝑡0 𝑤
X X X
= 𝑃(𝑚 | 𝑡) 𝑃(𝑡 0) 𝑃(𝑦 | 𝑤, 𝑡 0 , 𝑚) 𝑃(𝑤 | 𝑡 0 , 𝑚)
𝑚 𝑡0 𝑤
(6.14)
X X 0
X 0
= 𝑃(𝑚 | 𝑡) 𝑃(𝑡 ) 𝑃(𝑦, 𝑤 | 𝑡 , 𝑚) (6.15)
𝑚 𝑡0 𝑤
X X
= 𝑃(𝑚 | 𝑡) 𝑃(𝑡 0)𝑃(𝑦 | 𝑡 0 , 𝑚) (6.16)
𝑚 𝑡0
6 Nonparametric Identification 55

This matches the result stated in Theorem 6.1, so we’ve completed the
derivation of the frontdoor adjustment without using the backdoor
adjustment. However, we still need to show that Equation 6.3 is correct
to justify step 3. To do that, all that’s left is to recognize that these parts
match Equations 6.1 and 6.2 and plug those in: 𝑃(𝑚 | do(𝑡)) = 𝑃(𝑚 | 𝑡) (6.1)
X
X 𝑃(𝑦 | do(𝑚)) = 𝑃(𝑦 | 𝑚, 𝑡) 𝑃(𝑡) (6.2)
= 𝑃(𝑚 | do(𝑡)) 𝑃(𝑦 | do(𝑚)) (6.17) 𝑡
𝑚

And we’re done! We just needed to be a bit clever with our uses of d-
separation and marginalization. Part of why we went through that proof
is because we will prove the frontdoor adjustment using do-calculus in
Section 6.2. This way you can easily compare a proof using the truncated
factorization to a proof using do-calculus to prove the same result.

6.2 do-calculus

As we saw in the last section, it turns out that satisfying the backdoor
criterion (Definition 4.1) isn’t necessary to identify causal effects. For
example, if the frontdoor criterion (Definition 6.1) is satisfied, that also
gives us identifiability. This leads to the following questions: can we
identify causal estimands when the associated causal graph satisfies
neither the backdoor criterion nor the frontdoor criterion? If so, how?
Pearl’s do-calculus [24] gives us the answer to these questions. [24]: Pearl (1995), ‘Causal diagrams for
empirical research’
As we will see, the do-calculus gives us tools to identify causal effects
using the causal assumptions encoded in the causal graph. It will allow
us to identify any causal estimand that is identifiable. More concretely,
consider an arbitrary causal estimand 𝑃(𝑌 | do(𝑇 = 𝑡), 𝑋 = 𝑥), where 𝑌
is an arbitrary set of outcome variables, 𝑇 is an arbitrary set of treatment
variables, and 𝑋 is an arbitrary (potentially empty) set of covariates that
we want to choose how specific the causal effect we’re looking at is. Note
that this means we can use do-calculus to identify causal effects where
there are multiple treatments and/or multiple outcomes.
In order to present the rules of do-calculus, we must define a bit of
notation for augmented versions of the causal graph 𝐺 . Let 𝐺 𝑋 denote
the graph that we get if we take 𝐺 and remove all of the incoming edges
to nodes in the set 𝑋 ; recall from Section 4.2 that this is known as the
manipulated graph. Let 𝐺 𝑋 denote the graph that we get if we take 𝐺 and
remove all of the outgoing edges from nodes in the set 𝑋 . The mnemonic
meaning to help you remember this is to think of parents as drawn above
their children in the graph, so the bar above 𝑋 is cutting its incoming
edges and the bar below 𝑋 is cutting its outgoing edges. Combining these
two, we’ll use 𝐺 𝑋𝑍 to denote the graph with the incoming edges to 𝑋
and the outgoing edges from 𝑍 removed. And recall from Section 3.7 that
we use ⊥ ⊥𝐺 to denote d-separation in 𝐺 . We’re now ready; do-calculus
consists of just three rules:

Theorem 6.2 (Rules of do-calculus) Given a causal graph 𝐺 , an associated


6 Nonparametric Identification 56

distribution 𝑃 , and disjoint sets of variables 𝑌 , 𝑇 , 𝑍 , and 𝑊 , the following


rules hold.
Rule 1:

𝑃(𝑦 | do(𝑡), 𝑧, 𝑤) = 𝑃(𝑦 | do(𝑡), 𝑤) ⊥𝐺𝑇 𝑍 | 𝑇, 𝑊


if 𝑌 ⊥ (6.18)

Rule 2:

𝑃(𝑦 | do(𝑡), do(𝑧), 𝑤) = 𝑃(𝑦 | do(𝑡), 𝑧, 𝑤) if 𝑌 ⊥


⊥𝐺𝑇,𝑍 𝑍 | 𝑇, 𝑊
(6.19)
Rule 3:

𝑃(𝑦 | do(𝑡), do(𝑧), 𝑤) = 𝑃(𝑦 | do(𝑡), 𝑤) if 𝑌 ⊥⊥𝐺𝑇,𝑍(𝑊) 𝑍 | 𝑇, 𝑊


(6.20)
where 𝑍(𝑊) denotes the set of nodes of 𝑍 that aren’t ancestors of any node of
𝑊 in 𝐺𝑇 .

Now, rather than recreate the proofs for these rules from Pearl [24], we’ll [24]: Pearl (1995), ‘Causal diagrams for
give intuition for each of them in terms of concepts we’ve already seen in empirical research’

this book.
Rule 1 Intuition If we take Rule 1 and simply remove the intervention
do(𝑡), we get the following (Active reading exercise: what familiar concept
is this?):
𝑃(𝑦 | 𝑧, 𝑤) = 𝑃(𝑦 | 𝑤) if 𝑌 ⊥⊥𝐺 𝑍 | 𝑊 (6.21)
This is just what d-separation gives us under the Markov assumption;
recall from Theorem 3.1 that d-separation in the graph implies conditional
independence in 𝑃 . This means that Rule 1 is simply a generalization of
Theorem 3.1 to interventional distributions.
Rule 2 Intuition Just as with Rule 1, we’ll remove the intervention do(𝑡)
from Rule 2 and see what this reminds us of (Active reading exercise:
what concept does this remind you of?):

𝑃(𝑦 | do(𝑧), 𝑤) = 𝑃(𝑦 | 𝑧, 𝑤) if 𝑌 ⊥


⊥𝐺 𝑍 𝑍 | 𝑊 (6.22)

This is exactly what we do when we justify the backdoor adjustment


(Theorem 4.2) using the backdoor criterion (Definition 4.1). As we saw
at the ends of Section 3.8 and Section 4.4. Association is causation if the
outcome 𝑌 and the treatment 𝑍 are d-separated by some set of variables
that are conditioned on 𝑊 . So rule 2 is a generalization of the backdoor
adjustment to interventional distributions.
Rule 3 Intuition This is the trickiest rule to understand. Just as with
the other two rules, we’ll first remove the intervention do(𝑡) to make
thinking about this simpler:

𝑃(𝑦 | do(𝑧), 𝑤) = 𝑃(𝑦 | 𝑤) if 𝑌 ⊥


⊥𝐺𝑍(𝑊) 𝑍 | 𝑊 (6.23)

To get the equality in this equation, it must be the case that removing
the intervention do(𝑧) (which is like taking the manipulated graph and
reintroducing the edges going into 𝑍 ) introduces no new association
that can affect 𝑌 . Because do(𝑧) removes the incoming edges to 𝑍 to give
us 𝐺 𝑍 , the main association that we need to worry about is association
flowing from 𝑍 to 𝑌 in 𝐺 𝑍 (causal association). Therefore, you might
6 Nonparametric Identification 57

expect that the condition that gives us the equality in Equation 6.23 is
𝑌⊥ ⊥𝐺𝑍 𝑍 | 𝑊 . However, we have to refine this a bit to prevent inducing
association by conditioning on the descendants of colliders (recall from
Section 3.6). Namely, 𝑍 could contain colliders in 𝐺 , and 𝑊 could contain
descendants of these colliders. Therefore, to not induce new association
through colliders in 𝑍 when we reintroduce the incoming edges to 𝑍 to
get 𝐺 , we must limit the set of manipulated nodes to those that are not
ancestors of nodes in the conditioning set 𝑊 : 𝑍(𝑊).
Completeness of do-calculus Maybe there could exist causal estimands
that are identifiable but that can’t be identified using only the rules of
do-calculus in Theorem 6.2. Fortunately, Shpitser and Pearl [25] and
Huang and Valtorta [26] independently proved that this is not the case. [25]: Shpitser and Pearl (2006), ‘Identifica-
They proved that do-calculus is complete, which means that these three tion of Joint Interventional Distributions
in Recursive Semi-Markovian Causal Mod-
rules are sufficient to identify all identifiable causal estimands. Because els’
these proofs are constructive, they also admit algorithms that identify [26]: Huang and Valtorta (2006), ‘Pearl’s
any causal estimand in polynomial time. Calculus of Intervention is Complete’

Nonparametric Identification Note that all of this is about nonparamet-


ric identification; in other words, do-calculus tells us if we can identify
a given causal estimand using only the causal assumptions encoded
in the causal graph. If we introduce more assumptions about the dis-
tribution (e.g. linearity), we can identify more causal estimands. That
would be known as parametric identification. We don’t discuss parametric
identification in this chapter, though we will in later chapters.

6.2.1 Application: Frontdoor Adjustment

Recall the simple graph we used that satisfies the frontdoor criterion
(Figure 6.7), and recall the frontdoor adjustment:
X X
𝑃(𝑦 | do(𝑡)) = 𝑃(𝑚 | 𝑡) 𝑃(𝑦 | 𝑚, 𝑡 0) 𝑃(𝑡 0) (6.4 revisited)
𝑚 𝑡0

At the end of Section 6.1, we saw a proof for the frontdoor adjustment
𝑊
using just the truncated factorization. To get an idea for how do-calculus
works and the intuition we use in proofs that use it, we’ll now do the
frontdoor adjustment proof using the rules of do-calculus.
𝑇 𝑀 𝑌
Proof. Our goal is to identify 𝑃(𝑦 | do(𝑡)). Because we have the intu-
Figure 6.7: Simple causal graph that satis-
ition we described in Section 6.1 that the full mediator 𝑀 will help us fies the frontdoor criterion
out, the first thing we’ll do is introduce 𝑀 into the equation via the
marginalization trick:
X
𝑃(𝑦 | do(𝑡)) = 𝑃(𝑦 | do(𝑡), 𝑚) 𝑃(𝑚 | do(𝑡)) (6.24)
𝑚

Because the backdoor path from 𝑇 to 𝑀 in Figure 6.7 is blocked by the


collider 𝑌 , all of the association that flows from 𝑇 to 𝑀 is causal, so we
can apply Rule 2 to get the following:
X
= 𝑃(𝑦 | do(𝑡), 𝑚) 𝑃(𝑚 | 𝑡) (6.25)
𝑚

Now, because 𝑀 is a full mediator of the causal effect of 𝑇 on 𝑌 , we


should be able to replace 𝑃(𝑦 | do(𝑡), 𝑚) with 𝑃(𝑦 | do(𝑚)), but this will
6 Nonparametric Identification 58

take two steps of do-calculus. To remove do(𝑡), we’ll need to use Rule 3,
which requires that 𝑇 have no causal effect on 𝑌 in the relevant graph. We
can get to a graph like that by removing the edge from 𝑇 to 𝑀 (Figure 6.9);
𝑊
in do-calculus, we do this by using Rule 2 (in the opposite direction as
before) to do(𝑚). We can do this because the existing do(𝑡) makes it so
there are no backdoor paths from 𝑀 to 𝑌 in 𝐺𝑇 (Figure 6.8).
X 𝑇 𝑀 𝑌
= 𝑃(𝑦 | do(𝑡), do(𝑚)) 𝑃(𝑚 | 𝑡) (6.26)
𝑚 Figure 6.8: 𝐺𝑇

Now, as we planned, we can remove the do(𝑡) using Rule 3. We can use
Rule 3 here because there is no causation flowing from 𝑇 to 𝑌 in 𝐺 𝑀 𝑊
(Figure 6.9).
X
= 𝑃(𝑦 | do(𝑚)) 𝑃(𝑚 | 𝑡) (6.27)
𝑚
𝑇 𝑀 𝑌
All that’s left is to remove this last do-operator. As we discussed in Figure 6.9: 𝐺 𝑀
Section 6.1, 𝑇 blocks the only backdoor path from 𝑀 to 𝑌 in the graph
(Figure 6.10). This means, that if we can condition on 𝑇 , we can get rid
of this last do-operator. As usual, we do that by conditioning on and
marginalizing out 𝑇 . Rearranging a bit and using 𝑡 0 for the marginalization
since 𝑡 is already present:
𝑊
X X 0 0
= 𝑃(𝑚 | 𝑡) 𝑃(𝑦 | do(𝑚), 𝑡 ) 𝑃(𝑡 | do(𝑚))
𝑚 𝑡0
(6.28)
Now, we can simply apply Rule 2, since 𝑇 blocks the backdoor path from 𝑇 𝑀 𝑌
𝑀 to 𝑌 : Figure 6.10: 𝐺
X X
= 𝑃(𝑚 | 𝑡) 𝑃(𝑦 | 𝑚, 𝑡 0) 𝑃(𝑡 0 | do(𝑚))
𝑚 𝑡0
(6.29)
And finally, we can apply Rule 3 to remove the last do(𝑚) because there
is no causal effect of 𝑀 on 𝑇 (i.e. there is no directed path from 𝑀 to 𝑇
in the graph in (Figure 6.10).
X X
= 𝑃(𝑚 | 𝑡) 𝑃(𝑦 | 𝑚, 𝑡 0) 𝑃(𝑡 0) (6.30)
𝑚 𝑡0

That concludes our proof of the frontdoor adjustment using do-calculus.


It follows a different path than the proof we gave at the end of Section 6.1,
where we used the truncated factorization, but both proofs rely heavily
on intuition we get from looking at the graph. Active reading exercise: Assuming the
backdoor criterion, prove the backdoor
adjustment using the rules of do-calculus.

6.3 Determining Identifiability from the Graph

It’s nice to know that we can identify any causal estimand that is possible
to identify using do-calculus, but this isn’t as satisfying as knowing
whether a causal estimand is identifiable by simply looking at the causal
graph. For example, the backdoor criterion (Definition 4.1) and the
frontdoor criterion (Definition 6.1) gave us simple ways to know for
sure that a causal estimand is identifiable. However, there are plenty of
6 Nonparametric Identification 59

causal estimands that are identifiable, even though the corresponding


causal graphs don’t satisfy the backdoor or frontdoor criterion. More
general graphical criteria exist that will tell us that these estimands are
identifiable. We will discuss these more general graphical criteria for
identifiability in this section.
Single Variable Intervention When we care about causal effects of
an intervention on a single variable, Tian and Pearl [27] provide a [27]: Tian and Pearl (2002), ‘A General
relatively simple graphical criterion that is sufficient for identifiability: Identification Condition for Causal Effects’

the unconfounded children criterion.

Definition 6.2 (Unconfounded Children Criterion) This criterion is


satisfied if it is possible to block all backdoor paths from the treatment variable
𝑇 to all of its children that are ancestors of 𝑌 with a single conditioning set.

This criterion generalizes the backdoor criterion (Definition 4.1) and the
frontdoor criterion (Definition 6.1). Like them, it is a sufficient condition
for identifiability:

Theorem 6.3 (Unconfounded Children Identifiability) Let 𝑌 be the set


of outcome variables and 𝑇 be a single variable. If the unconfounded children
criterion and positivity are satisfied, then 𝑃(𝑌 = 𝑦 | do(𝑇 = 𝑡)) is identifiable
[27].

The intuition for unconfounded children criterion implies identifiability


is similar to the intuition for the frontdoor criterion; if we can isolate all
of the causal association flowing out of treatment along directed paths
to 𝑌 , we have identifiability. To see this intuition, first, consider that all
of the causal association from 𝑇 must flow through its children. We can
isolate this causal association if there is no confounding between 𝑇 and
any of its children.5 This isolation of all of the causal association is what 5 This is analogous to what we saw
gives us identifiability of the causal effect of 𝑇 on any other node in with the frontdoor criterion in Section 6.1,
where we could isolate the causal associa-
the graph. This intuition might lead you to suspect that this criterion is
tion flowing through the full mediator 𝑀
necessary in the very specific case where the outcome set 𝑌 is all of the if the 𝑇 − 𝑀 relationship is unconfounded
other variables in the graph other than 𝑇 ; it turns out that this is true (no unblocked backdoor paths).
[27]. But this condition is not necessary if 𝑌 is a smaller set than that.
To give you a more visual grasp of the intuition for why the unconfounded
children criterion is sufficient for identification, we give an example graph
in Figure 6.12. In Figure 6.12a, we visualize the flow of confounding
association and causal association that flows in this graph. Then, we depict
the isolation of the causal association in that graph in Figure 6.12b.
Necessary Condition The unconfounded children criterion is not nec- 𝑊1 𝑊3
essary for identifiability, but it might aid your graphical intuition to have
a necessary condition in mind. Here is one: For each backdoor path from
𝑇 to any child 𝑀 of 𝑇 that is an ancestor of 𝑌 , it is possible to block 𝑊2
that path [18, p. 92]. The intuition for this is that because the causal
association that flows from 𝑇 to 𝑌 must go through children of 𝑇 that are
ancestors of 𝑌 , to be able to isolate this causal association, the effect of 𝑇
𝑇 𝑌
on these mediating children must be unconfounded. And a prerequisite
to these 𝑇 − 𝑀 (parent-child) relationships being unconfounded is that Figure 6.11: Graph where blocking one
any single backdoor path from 𝑇 to 𝑀 must be blockable (what we state backdoor path unblocks another

in this condition). Unfortunately, this condition is not sufficient. To see


why, consider Figure 6.11. The backdoor path 𝑇 ← 𝑊1 → 𝑊2 ← 𝑊3 → 𝑌 [18]: Pearl (2009), Causality
6 Nonparametric Identification 60

non-causal association

𝑊1 𝑊1

𝑊2 𝑊2

focus
𝑇 𝑀1 𝑌 𝑇 𝑀1 𝑌

𝑀2 𝑀2

causal association causal association

(a) Visualization of the flow of confound- (b) Visualization of the isolation of the
ing association and causal association. causal association flowing from 𝑇 to
its children, allowing the unconfounded
children criterion to imply identifiability. Figure 6.12: Example graph that satisfies
the unconfounded children criterion

is blocked by the collider 𝑊2 . And we can block the the backdoor path
𝑇 ← 𝑊2 → 𝑌 by conditioning on 𝑊2 . However, conditioning on 𝑊2
unblocks the other backdoor path where 𝑊2 is a collider. Being able to
block both paths individually does not mean we can block them both with
a single conditioning set. In sum, the unconfounded children criterion is
sufficient but not necessary, and this related condition is necessary but
not sufficient. Also, everything we’ve seen in this section so far is for a
single variable intervention.
Necessary and Sufficient Conditions for Multiple Variable Interven-
tions Shpitser and Pearl [25] provide a necessary and sufficient criterion [25]: Shpitser and Pearl (2006), ‘Identifica-
for identifiability of 𝑃(𝑌 = 𝑦 | do(𝑇 = 𝑡)) when 𝑌 and 𝑇 are arbitrary tion of Joint Interventional Distributions
in Recursive Semi-Markovian Causal Mod-
sets of variables: the hedge criterion. However, this is outside the scope els’
of this book, as it requires more complex objects such as hedges, C-
trees, and other leafy objects. Moving further along, Shpitser and Pearl
[28] provide a necessary and sufficient criterion for the most general [28]: Shpitser and Pearl (2006), ‘Identifica-
type of causal estimand: conditional causal effects, which take the form tion of Conditional Interventional Distri-

𝑃(𝑌 = 𝑦 | do(𝑇 = 𝑡), 𝑋 = 𝑥), where 𝑌 , 𝑇 , and 𝑋 are all arbitrary sets of
butions’

variables.
Active reading exercises:
1. Is the unconfounded criterion (Definition 6.2) satisfied in Fig-
ure 6.13a?
2. Is the unconfounded criterion satisfied in Figure 6.13b?
3. Can we get identifiability in Figure 6.13b via any simpler criterion
that we’ve seen before?
6 Nonparametric Identification 61

𝑊1 𝑊3
𝑊1 𝑊3

𝑊2
𝑊2

𝑇 𝑀 𝑌 𝑇 𝑀 𝑌
(a) (b)

Figure 6.13: Graphs for the questions about the unconfounded children criterion
Estimation 7
In the previous chapter, we covered identification. Once we identify some 7.1 Preliminaries . . . . . . . . . 62
causal estimand by reducing it to a statistical estimand, we still have more 7.2 Conditional Outcome Mod-
work to do. We need to get a corresponding estimate. In this chapter, we’ll eling (COM) . . . . . . . . . 63
cover a variety of estimators that we can use to do this. This isn’t meant 7.3 Grouped Conditional Out-
to be anywhere near exhaustive as there are many different estimators of come Modeling (GCOM) . 64
causal effects, but it is meant to give you a solid introduction to them.
7.4 Increasing Data Efficiency . 65
All of the estimators that we include full sections on are model-assisted TARNet . . . . . . . . . . . . . 65
estimators (recall from Section 2.4). And they all work with arbitrary X-Learner . . . . . . . . . . . 66
statistical models such as the ones you might get from scikit-learn [29]. 7.5 Propensity Scores . . . . . . 67
7.6 Inverse Probability Weight-
ing (IPW) . . . . . . . . . . . 68
7.7 Doubly Robust Methods . . 70
7.1 Preliminaries 7.8 Other Methods . . . . . . . . 70
7.9 Concluding Remarks . . . . 71
Recall from Chapter 2 that we denote the individual treatment effect Confidence Intervals . . . . 71
Comparison to Randomized
(ITE) with 𝜏𝑖 and average treatment effect (ATE) with 𝜏:
Experiments . . . . . . . . . 72
𝜏𝑖 , 𝑌𝑖 (1) − 𝑌𝑖 (0) (7.1) [29]: Pedregosa et al. (2011), ‘Scikit-learn:
Machine Learning in Python’
𝜏 , 𝔼[𝑌𝑖 (1) − 𝑌𝑖 (0)] (7.2)

ITEs are the most specific kind of causal effects, but they are hard
to estimate without strong assumptions (on top of those discussed in
Chapters 2 and 4). However, we often want to estimate causal effects that
are a bit more individualized than the ATE.
For example, say we’ve observed an individual’s covariates 𝑥 ; we might
like to use those to estimate a more specific effect for that individual (and
anyone else with covariates 𝑥 ). This brings us to the conditional average
treatment effect (CATE) 𝜏(𝑥):

𝜏(𝑥) , 𝔼[𝑌𝑖 (1) − 𝑌𝑖 (0) | 𝑋 = 𝑥] (7.3)

The 𝑋 that is conditioned on does not need to consist of all of the observed
covariates, but this is often the case when people refer to CATEs. We call
that individualized average treatment effects (IATEs).
ITEs and “CATEs” (what we call IATEs) are sometimes conflated, but
they are not the same. For example, two individuals could have the same
covariates, but their potential outcomes could be different because of
other unobserved differences between these individuals. If we encompass
everything about an individual that is relevant to their potential outcomes
in the vector 𝐼 , then ITEs and “CATEs” are the same if 𝑋 = 𝐼 . In a causal 1 This paragraph contains a lot of informa-
graph, 𝐼 corresponds to all of the exogenous variables in the magnified tion. Active reading exercise:
1) Convince yourself that ITEs and
graph that have causal association flowing to 𝑌 .1 “CATEs” (what we call IATEs) are the same
if 𝑋 = 𝐼 .
2) Convince yourself that 𝐼 corresponds to
the exogenous variables in the magnified
graph that have causal association flowing
to 𝑌 .
7 Estimation 63

Unconfoundedness Throughout this chapter, whenever we are esti-


mating an ATE, we will assume that 𝑊 is a sufficient adjustment set, and
whenever we are estimating a CATE, we will assume that 𝑊 ∪ 𝑋 is a
sufficient adjustment set. In other words, for ATE estimation, we assume
that 𝑊 satisfies the backdoor criterion (Definition 4.1); equivalently for
ATE estimation, we assume that we have conditional exchangeability
given 𝑊 (Assumption 2.2). And similarly for CATE estimation, assuming
𝑊 ∪ 𝑋 is a sufficient adjustment set means that we are assuming that
𝑊 ∪ 𝑋 satisfies the backdoor criterion / gives us unconfoundedness.
This unconfoundedness assumption gives us parametric identification2 2 By “parametric identification,” we mean
and allows us to focus on estimation in this chapter. identification under the parametric as-
sumptions of our statistical models. For
example, these assumptions are for extrap-
olation if we don’t have positivity.
7.2 Conditional Outcome Modeling (COM)

We are interested in estimating the ATE 𝜏. We’ll start with recalling the
adjustment formula (Theorem 2.1), which can be derived as a corollary
of the backdoor adjustment (Theorem 4.2), as we saw in Section 4.4.1:

𝜏 , 𝔼[𝑌(1) − 𝑌(0)] = 𝔼𝑊 [𝔼[𝑌 | 𝑇 = 1 , 𝑊] − 𝔼[𝑌 | 𝑇 = 0, 𝑊]] (7.4)

On the left-hand side of Equation 7.4, we have a causal estimand, and on


the right-hand side, we have a statistical estimand (i.e. we have identified
this causal quantity). Then, the next step in the Identification-Estimation
Flowchart (see Figure 7.1 reproduced from Section 2.4) is to get an estimate
of this (statistical) estimand.

Identification Estimation
Causal Estimand Statistical Estimand Estimate

Figure 7.1: The Identification-Estimation Flowchart – a flowchart that illustrates the process of moving from a target causal estimand to a
corresponding estimate, through identification and estimation.

The most straightforward thing to do is to just fit a statistical model


(machine learning model) to the conditional expectation 𝔼[𝑌 | 𝑇, 𝑊]
and then approximate 𝔼𝑊 with an empirical mean over the 𝑛 data
P
points ( 𝑛1 𝑖 ). And this is exactly what we did in the simple examples of
estimation in Sections 2.5 and 4.6.2. To make this more clear, we introduce
𝜇 in place of this conditional expectation:

𝜇(1 , 𝑤) − 𝜇(0 , 𝑤) , 𝔼[𝑌 | 𝑇 = 1 , 𝑊 = 𝑤] − 𝔼[𝑌 | 𝑇 = 0 , 𝑊 = 𝑤] (7.5)

Then, we can fit a statistical model to 𝜇. We will denote that these fitted
models are approximations of 𝜇 with a hat: 𝜇ˆ . We will refer to a model 𝜇ˆ as
a conditional outcome model. Now, we can cleanly write the model-assisted
estimator (for the ATE) that we’ve described:
Active reading exercise: What are the two
1X different approximations we make in this
𝜏ˆ = 𝜇(
ˆ 1, 𝑤 𝑖 ) − 𝜇(
ˆ 0, 𝑤 𝑖 )

(7.6)
𝑛 𝑖 estimator and what parts of the statistical
estimand in Equation 7.4 do each of them
replace?
We will refer to estimators that take this form as conditional outcome model
(COM) estimators. Because minimizing the mean-squared error (MSE) of
predicting 𝑌 from (𝑇, 𝑋) pairs is equivalent to modeling this conditional
expectation [see, e.g., 10, Section 2.4], there are many different models we [10]: Hastie et al. (2001), The Elements of
Statistical Learning
7 Estimation 64

can use for 𝜇ˆ in Equation 7.6 to get a COM estimator (see, e.g., scikit-learn
[29]). [29]: Pedregosa et al. (2011), ‘Scikit-learn:
Machine Learning in Python’
For CATE estimation, because we assumed that 𝑊 ∪ 𝑋 is a sufficient
adjustment set, rather than just 𝑊 ,3 we must additionally add 𝑋 as 3 Active reading exercise: Why do we addi-
an input to our conditional outcome model. More precisely, for CATE tionally add 𝑋 to the adjustment set when
estimation, we define 𝜇 as follows: we are interested in CATEs?

𝜇(𝑡, 𝑤, 𝑥) , 𝔼[𝑌 | 𝑇 = 𝑡, 𝑊 = 𝑤, 𝑋 = 𝑥] (7.7)

Then, we train a statistical model 𝜇ˆ to predict 𝑌 from (𝑇, 𝑊 , 𝑋). And this
gives us the following COM estimator for the CATE 𝜏(𝑥):
Active reading exercise: Write down the
1 X causal estimand and statistical estimand
𝜏(𝑥) 𝜇(
ˆ 1 , 𝑤 𝑖 , 𝑥) − 𝜇(
ˆ 0 , 𝑤 𝑖 , 𝑥)

ˆ = (7.8)
𝑛 𝑥 𝑖 :𝑥 𝑖 =𝑥 that lead us to the estimator in Equa-
tion 7.8, and proof that they’re equal under
unconfoundedness and positivity. In other
where 𝑛 𝑥 is the number of data points that have 𝑥 𝑖 = 𝑥 . When we are words, identify the CATE.
interested in the IATE (CATE where 𝑋 is all of the observed covariates),
𝑛 𝑥 is often 1, which simplifies our estimator to a simple difference between
predictions:
𝜏(𝑥
ˆ 𝑖 ) = 𝜇(
ˆ 1 , 𝑤 𝑖 , 𝑥 𝑖 ) − 𝜇(
ˆ 0, 𝑤 𝑖 , 𝑥 𝑖 ) (7.9)
Even, though IATEs are different from ITEs (𝜏(𝑥 𝑖 ) ≠ 𝜏𝑖 ), if we really want
to give estimates for ITEs, it is relatively common to take this estimator
as our estimator of the ITE 𝜏𝑖 as well:

𝜏ˆ 𝑖 = 𝜏(𝑥
ˆ 𝑖 ) = 𝜇(
ˆ 1 , 𝑤 𝑖 , 𝑥 𝑖 ) − 𝜇(
ˆ 0, 𝑤 𝑖 , 𝑥 𝑖 ) (7.10)

Though, this will likely be unreliable due to severe positivity violation.4 4 Active reading exercise: Why is there
a severe positivity violation here? Does
The Many-Faced Estimator COM estimators have many different names this only apply in Equation 7.10 or also in
in the literature. For example, they are often called G-computation esti- Equation 7.9? What if there were multiple
mators, parametric G-formula, or standardization in epidemiology and units with 𝑥 𝑖 = 𝑥 ?

biostatistics. Because we are fitting a single statistical model for 𝜇 here,


“COM estimator” is sometimes referred to as an “S-learner,” where the
“S” stands for “single.”

7.3 Grouped Conditional Outcome Modeling


(GCOM)

In order to get the estimate in Equation 7.6, we must train a model that
predicts 𝑌 from (𝑇, 𝑊). However, 𝑇 is often one-dimensional, whereas
𝑊 can be high-dimensional. But the input to 𝜇ˆ for 𝑡 is the only thing that
changes between the two terms inside the sum 𝜇(ˆ 1, 𝑤 𝑖 )− 𝜇(
ˆ 0 , 𝑤 𝑖 ). Imagine
concatenating 𝑇 to a 100-dimensional vector 𝑊 and then feeding that
through a neural network that we’re using for 𝜇ˆ . It seems reasonable that
the network could ignore 𝑇 while focusing on the other 100 dimensions
of its input. This would result in an ATE estimate of zero. And, indeed,
there is some evidence of COM estimators being biased toward zero
[30]. [30]: Künzel et al. (2019), ‘Metalearners
for estimating heterogeneous treatment
So how can we ensure that the model 𝜇ˆ doesn’t ignore 𝑇 ? Well, we can effects using machine learning’
just train two different models 𝜇ˆ 1 (𝑤) and 𝜇ˆ 0 (𝑤) that model 𝜇1 (𝑤) and
7 Estimation 65

𝜇0 (𝑤), respectively, where

𝜇1 (𝑤) , 𝔼[𝑌 | 𝑇 = 1 , 𝑊 = 𝑤] 𝜇0 (𝑤) , 𝔼[𝑌 | 𝑇 = 0 , 𝑊 = 𝑤] .


and
(7.11)
Using two separate models for the values of treatment ensures that 𝑇
cannot be ignored. To train these statistical models, we first group the
data into a group where 𝑇 = 1 and a group where 𝑇 = 0. Then, we train
𝜇ˆ 1 (𝑤) to predict 𝑌 from 𝑊 in the group where 𝑇 = 1. And, similarly, we
train 𝜇ˆ 0 (𝑤) to predict 𝑌 from 𝑊 in the group where 𝑇 = 0. This gives us
a natural derivative of COM estimators (Equation 7.6), grouped conditional
outcome model (GCOM) estimators:5 5 Künzel et al. [30] call a GCOM estimator
a “T-learner” where the “T” is for “two”
1X because it requires fitting two different
𝜏ˆ = 𝜇ˆ 1 (𝑤 𝑖 ) − 𝜇ˆ 0 (𝑤 𝑖 )

(7.12) models: 𝜇ˆ 1 and 𝜇ˆ 0 .
𝑛 𝑖

And just as we saw, in Equation 7.8, we can add 𝑋 as an input to 𝜇ˆ 1 and


𝜇ˆ 0 to get a GCOM estimator for the CATE 𝜏(𝑥):

1 X
𝜏(𝑥) 𝜇ˆ 1 (𝑤 𝑖 , 𝑥) − 𝜇ˆ 0 (𝑤 𝑖 , 𝑥)

ˆ = (7.13)
𝑛 𝑥 𝑖 :𝑥 𝑖 =𝑥

While GCOM estimation seems to fix the problem that COM estimation
can have regarding bias toward zero treatment effect, it does have an
important downside. In COM estimation, we were able to make use of
all the data when we estimate the single model 𝜇ˆ . However, in grouped
conditional outcome model estimation, we only use the 𝑇 = 1 group to
estimate 𝜇ˆ 1 , and we only use the 𝑇 = 0 group to estimate 𝜇ˆ 0 . Importantly,
we are missing out on making the most of our data by not using all of
the data to estimate 𝜇ˆ 1 and all of the data to estimate 𝜇ˆ 0 .

7.4 Increasing Data Efficiency

In this section, we’ll cover two ways to address the problem of data
efficiency that we mentioned is present in GCOM estimation at the end
of the last section: TARNet (Section 7.4.1) and X-Learner (Section 7.4.2).

7.4.1 TARNet

Consider that we’re using neural networks for our statistical models;
starting with that, we’ll contrast, vanilla COM estimation, GCOM estima-
tion, and TARNet. In vanilla COM estimation, the neural network is used
to predict 𝑌 from (𝑇, 𝑊) (see Figure 7.2a). This has the problem of poten-
tially yielding ATE estimates that are biased toward zero, as the network
might ignore the scalar 𝑇 , especially when 𝑊 is high-dimensional. We
ensure that 𝑇 can’t be ignored in GCOM estimation by using two separate
neural networks for the two treatment groups (Figure 7.2b). However,
this is inefficient as we only use the treatment group data for training
one network and the control group data for training the other network.
We can achieve a middle ground between vanilla COM estimation and
GCOM estimation using Shalit et al. [31]’s TARNet. With TARNet, we use [31]: Shalit et al. (2017), ‘Estimating in-
a single network that takes only 𝑊 as input but then branches off into dividual treatment effect: generalization
bounds and algorithms’
7 Estimation 66

two separate heads (sub-networks) for each treatment group. We then


use this model for 𝜇(𝑡, 𝑤) to get a COM estimator. This has the advantage
of learning a treatment-agnostic representation (TAR) of 𝑊 using all of
the data while still forcing the model to not ignore 𝑇 by branching into
two heads for the different values of 𝑇 . In other words, TARNet uses
the knowledge we have about 𝑇 (as a uniquely important variable) in
6 Active reading exercise: Which parts of
its architecture. Still, the sub-networks for each of these heads are only
trained with the data for the corresponding treatment group, rather than TARNet are like Figure 7.2a and which
parts are like Figure 7.2b? What ad-
all of the data.6 vantages/disadvantages do Figures 7.2a
to 7.2c have relative to each other?
𝑇 = 1 network

𝑊 𝑌 𝑌

1
=
𝑇
𝑇 = 0 network 𝑊

𝑇
=
0
𝑇 𝑊 𝑌
𝑌 𝑌
𝑊

(a) A single neural network to model (b) Two neural networks: a network to (c) TARNet [31]. A single neural network
𝜇(𝑡, 𝑤), used in vanilla COM estimation model 𝜇1 (𝑤) (top) and a network to model to model 𝜇(𝑡, 𝑤) that branches off into two
(Section 7.2). 𝜇0 (𝑤) (bottom), used in GCOM estimation heads: one for 𝑇 = 1 and one for 𝑇 = 0.
(Section 7.3).

Figure 7.2: Coarse neural networks architectures for vanilla COM estimation (left), GCOM estimation (middle), and TARNet (right). In this
figure, we use each arrow to denote a sub-network that has an arbitrary number of layers.

7.4.2 X-Learner

We just saw that one way to increase data efficiency relative to GCOM
estimation is to use TARNet, a COM estimator that shares some qualities
with GCOM estimators. However, TARNet still doesn’t use all of the
data for the full model (neural network). In this section, we will start
with GCOM estimation and build on it to create a class of estimators
that use all of the data for both models that are part of the estimators.
An estimator in this class is known as an X-learner [30]. Unlike TARNet, [30]: Künzel et al. (2019), ‘Metalearners
for estimating heterogeneous treatment
X-learners are neither COM estimators nor GCOM estimators.
effects using machine learning’
There are three steps to X-learning, and the first step is the exact same
as what’s used in GCOM estimation: estimate 𝜇ˆ 1 (𝑥) using the treatment
group data and estimate 𝜇ˆ 0 (𝑥) using the control group data.7 As before, 7Recall that 𝜇ˆ 1 (𝑤) and 𝜇ˆ 0 (𝑤) are approx-
this can be done with any models that minimize MSE. For simplicity, imations of 𝔼[𝑌 | 𝑇 = 1 , 𝑊 = 𝑤] and
𝔼[𝑌 | 𝑇 = 0, 𝑊 = 𝑤], respectively.
in this section, we’ll be considering IATEs (𝑋 is all of the observed
variables) where 𝑋 satisfies the backdoor criterion (𝑋 contains 𝑊 and
no descendants of 𝑇 ).
The second step is the most important part as it is both where we end up
using all of the data for both models and where the “X” comes from. We
specify 𝜏ˆ 1,𝑖 for the treatment group ITE estimates and 𝜏ˆ 0,𝑖 for the control
7 Estimation 67

group ITE estimates:

𝜏ˆ 1,𝑖 = 𝑌𝑖 (1) − 𝜇ˆ 0 (𝑥 𝑖 ) (7.14)


𝜏ˆ 0,𝑖 = 𝜇ˆ 1 (𝑥 𝑖 ) − 𝑌𝑖 (0) (7.15)

Here, 𝜏ˆ 1,𝑖 is estimated using the treatment group outcomes and the
imputed counterfactual that we get from 𝜇ˆ 0 . Similarly, 𝜏ˆ 0,𝑖 is estimated
using the control group outcomes and the imputed counterfactual that we
get from 𝜇ˆ 1 . If you draw a line between the observed potential outcomes
and a line between the imputed potential outcomes, you can see the
“X” shape. Importantly, this “X” tells us that each treatment group ITE
estimate 𝜏ˆ 1,𝑖 uses both treatment group data (its observed potential
outcome under treatment), and control group data (in 𝜇ˆ 0 ). Similarly, 𝜏ˆ 0,𝑖
is estimated with data from both treatment groups.
However, each ITE estimate only uses a single data point from its
corresponding treatment group. We can fix this by fitting a model 𝜏ˆ 1 (𝑥)
to predict 𝜏ˆ 1,𝑖 from the corresponding treatment group 𝑥 𝑖 ’s. Finally, we
have a model 𝜏ˆ 1 (𝑥) that was fit using all of the data (treatment group
data just now and control group data when 𝜇0 was fit in step 1). Similarly,
we can fit a model 𝜏ˆ 0 (𝑥) to predict 𝜏ˆ 0,𝑖 from the corresponding control
group 𝑥 𝑖 ’s. The output of step 2 is two different estimators for the IATE:
𝜏ˆ 1 (𝑥) and 𝜏ˆ 0 (𝑥).
Finally, in step 3, we combine 𝜏ˆ 1 (𝑥) and 𝜏ˆ 0 (𝑥) together to get our IATE
estimator:
𝜏(𝑥)
ˆ = 𝑔(𝑥) 𝜏ˆ 0 (𝑥) + (1 − 𝑔(𝑥)) 𝜏ˆ 1 (𝑥) (7.16)
where 𝑔(𝑥) is some weighting function that produces values between 0
and 1. Künzel et al. [30] report that an estimate of the propensity score [30]: Künzel et al. (2019), ‘Metalearners
(introduced in next section) works well, but that choosing the constant for estimating heterogeneous treatment
effects using machine learning’
function 0 or 1 also makes sense if the treatment groups are very different
sizes. Or that choosing 𝑔(𝑥) to minimize the variance of 𝜏(𝑥)
ˆ could also Active reading exercise: In this section,
be attractive. we covered the X-learner for IATE estima-
tion. What would an X-learner for more
general CATE estimation (𝑋 is arbitrary
and doesn’t necessarily contain all con-
founders 𝑊 ) look like?

7.5 Propensity Scores

Given that the vector of variables 𝑊 satisfies the backdoor criterion (or,
equivalently, that (𝑌(1), 𝑌(0)) ⊥
⊥ 𝑇 | 𝑊 ), we might wonder if it is really
necessary to condition on that whole vector to isolate causal association,
especially when 𝑊 is high-dimensional. It turns out that it isn’t. If 𝑊
satisfies unconfoundedness and positivity, then we can actually get away
with only conditioning on the scalar 𝑃(𝑇 = 1 | 𝑊). We’ll let 𝑒(𝑤) denote
𝑃(𝑇 = 1 | 𝑊 = 𝑤), as we’ll refer to 𝑒(𝑤) as the propensity score since it is
the propensity for (probability of) receiving treatment given that 𝑊 is
𝑤 . The magic of being able to condition on the scalar 𝑒(𝑊) in the place
of the vector 𝑊 is due to Rosenbaum and Rubin [32]’s propensity score [32]: Rosenbaum and Rubin (1983), ‘The
theorem: central role of the propensity score in ob-
servational studies for causal effects’

Theorem 7.1 (Propensity Score Theorem) Given positivity, unconfound-


edness given 𝑊 implies unconfoundedness given the propensity score 𝑒(𝑊).
7 Estimation 68

Equivalently,

(𝑌(1), 𝑌(0)) ⊥
⊥ 𝑇 | 𝑊 =⇒ (𝑌(1), 𝑌(0)) ⊥⊥ 𝑇 | 𝑒(𝑊) . (7.17)
We provide a more traditional mathematical proof in Appendix A.2 and 𝑊
give a graphical proof here. Consider the graph in Figure 7.3. Because
the edge from 𝑊 to 𝑇 is a symbol for the mechanism 𝑃(𝑇 | 𝑊) and
because the propensity score completely describes that distribution 𝑇 𝑌
(𝑃(𝑇 = 1 | 𝑊) = 𝑒(𝑊)), we can think of the propensity score as a full
Figure 7.3: Simple graph where 𝑊 satisfies
mediator of the effect of 𝑊 on 𝑇 . This means that we can redraw this
the backdoor criterion
graph with 𝑒(𝑊) situated between 𝑊 and 𝑇 . And in this redrawned
graph in Figure 7.4, we can see that 𝑒(𝑊) blocks all backdoor paths that
𝑊
𝑊 blocks, so 𝑒(𝑊) must be a sufficient adjustment set if 𝑊 is. Therefore,
we have a graphical proof of the propensity score theorem using the
backdoor adjustment (Theorem 4.2).
Importantly, this theorem means that we can swap in 𝑒(𝑊) in place of 𝑊 𝑒(𝑊)
wherever we are adjusting for 𝑊 in a given estimator in this chapter. For
example, this seems very useful when 𝑊 is high-dimensional.
Recall The Positivity-Unconfoundedness Tradeoff from Section 2.3.4. As 𝑇 𝑌
we condition on more non-collider-bias-inducing variables, we decrease
Figure 7.4: Graph illustrating that 𝑒(𝑊)
confounding. However, this comes at the cost of decreasing overlap
blocks the backdoor path(s) that 𝑊 blocks.
because the 𝑊 in 𝑃(𝑇 = 1 | 𝑊) becomes higher and higher dimensional.
The propensity score seems to allow us to magically fix that issue since
the 𝑒(𝑊) remains a scalar, even as 𝑊 grows in dimension. Fantastic,
right?
Well, unfortunately, we usually don’t have access to 𝑒(𝑊). Rather, the
best we can do is model it. We do this by training a model to predict 𝑇
from 𝑊 . For example, logistic regression (logit model) is very commonly
used to do this. And because this model is fit to the high-dimensional 𝑊 ,
in some sense, we have just shifted the positivity problem to our model
for 𝑒(𝑊).

7.6 Inverse Probability Weighting (IPW)

What if we could resample the data in a way to make it so that association


𝑊
is causation? This is the motivation behind creating “pseudo-populations”
that are made up of reweighted versions of the observed population. To
get to this, let’s recall why association is not causation in general. 𝑇 𝑌

Association is not causation in the graph in Figure 7.5 because 𝑊 is a Figure 7.5: Simple graph where 𝑊 con-
founds the effect of 𝑇 on 𝑌
common cause of 𝑇 and 𝑌 . In other words, the mechanism that generates
𝑇 depends on 𝑊 , and the mechanism that generates 𝑌 depends on
𝑊 . Focusing on the mechanism that generates 𝑇 , we can write this
mathematically as 𝑃(𝑇 | 𝑊) ≠ 𝑃(𝑇). It turns out that we can reweight 𝑊
the data to get a pseudo-population where 𝑃(𝑇 | 𝑊) = 𝑃(𝑇) or 𝑃(𝑇 | 𝑊)
equals some constant; the important part is that we make 𝑇 independent
of 𝑊 . The corresponding graph for such a pseudo-population has no 𝑇 𝑌
edge from 𝑊 to 𝑇 because 𝑇 does not depend on 𝑊 ; we depict this in
Figure 7.6: Effective graph for pseudo-
Figure 7.6. population that we get by reweighting
the data generated according to the graph
It turns out that the propensity score is key to this reweighting. All we in Figure 7.5 using inverse probability
have to do is reweight each data point with treatment 𝑇 and confounders weighting.
7 Estimation 69

𝑊 by its inverse probability of receiving its value of treatment given that


it has its value of 𝑊 . This is why this technique is called inverse probability
weighting (IPW). For individuals that received treatment 1, this weight is
1 1
𝑒(𝑊)
, and for individuals that received treatment 0, this weight is 1−𝑒(𝑊) .8 8 Active reading exercise: Why is the de-
If the treatment were continuous, the weight would be which 1
, nominator 1 − 𝑒(𝑊) when 𝑇 = 0. Hint:
𝑃(𝑇 |𝑊)
recall the precise definition of 𝑒(𝑊).
happens to also be the reciprocal of the generalization of the propensity
score to continuous treatment.
Why does what we described in the above paragraph work? Well, recall
that our goal is to undo confounding by “removing” the edge that goes
from 𝑊 to 𝑇 (i.e. move from Figure 7.5 to Figure 7.6). And the mechanism
that edge describes is 𝑃(𝑇 | 𝑊). By weighting the data points by 𝑃(𝑇1|𝑊) ,
we are effectively canceling it out. That’s the intuition. Formally, we have
the following identification equation:

1(𝑇 = 𝑡)𝑌
 
𝔼[𝑌(𝑡)] = 𝔼 (7.18)
𝑃(𝑡 | 𝑊)

where 1(𝑇 = 𝑡) is an indicator random variable that takes on the value 1


if 𝑇 = 𝑡 and 0 otherwise. We provide a proof of Equation 7.18 using the
familiar adjustment formula 𝔼[𝑌(𝑡)] = 𝔼[𝔼[𝑌 | 𝑡, 𝑊]] (Theorem 2.1) in
Appendix A.3.
Assuming binary treatment, the following identification equation for the
ATE follows from Equation 7.18:

1(𝑇 = 1)𝑌 1(𝑇 = 0)𝑌


   
𝜏 , 𝔼[𝑌(1) − 𝑌(0)] = 𝔼 −𝔼 (7.19)
𝑒(𝑊) 1 − 𝑒(𝑊)

Now that we have a statistical estimand in the form of IPW, we can


get an IPW estimator. Replacing expectations by empirical means and
𝑒(𝑊) by a propensity score model 𝑒ˆ (𝑊), we get the following equivalent
9
formulations of the basic IPW estimator9 for the ATE: This estimator is originally from Horvitz
and Thompson [33].
1(𝑡 𝑖 = 1)𝑦𝑖 1(𝑡 𝑖 = 0)𝑦𝑖
 
1X
𝜏ˆ = − (7.20)
𝑛 𝑖 𝑒ˆ (𝑤 𝑖 ) 1 − 𝑒ˆ (𝑤 𝑖 ) [33]: Horvitz and Thompson (1952), ‘A
Generalization of Sampling Without Re-
1 X 𝑦𝑖 1 X 𝑦𝑖 placement from a Finite Universe’
= − (7.21)
𝑛1 𝑖 :𝑡 𝑖 =1 𝑒ˆ (𝑤 𝑖 ) 𝑛0 𝑖 :𝑡 𝑖 =0 1 − 𝑒ˆ (𝑤 𝑖 )
Active reading exercise: What would be
where 𝑛1 and 𝑛0 are the number of treatment group units and control the corresponding formulations of the ba-
sic IPW estimator for 𝔼[𝑌(𝑡)]?
group units, respectively.

Weight Trimming As you can see in Equations 7.20 and 7.21, if the
propensity scores are very close to 0 or 1, the estimates will blow up. In
order to prevent this, it is not uncommon to trim the propensity scores
that are less than 𝜖 to 𝜖 and those that are greater than 1 − 𝜖 to 1 − 𝜖
(effectively trimming the weights to be no larger than 1𝜖 ), though this
introduces its own problems such as bias.
CATE Estimation We can extend the ATE estimator in Equation 7.20
to get an IPW estimator for the CATE 𝜏(𝑥) by just restricting to the data
points where 𝑥 𝑖 = 𝑥 :

1(𝑡 𝑖 = 1)𝑦𝑖 1(𝑡 𝑖 = 0)𝑦𝑖


 
1 X
𝜏(𝑥)
ˆ = − (7.22)
𝑛 𝑥 𝑖 :𝑥 𝑖 =𝑥 𝑒ˆ (𝑤 𝑖 ) 1 − 𝑒ˆ (𝑤 𝑖 )
7 Estimation 70

where 𝑛 𝑥 is the number of data points with 𝑥 𝑖 = 𝑥 . However, the estimator


in Equation 7.22 may quickly run into the problem of using very small
amounts of data, leading to high variance. More general CATE estimation
with IPW estimators is more complex and outside the scope of this book.
See, for example, Abrevaya et al. [34] and references therein. [34]: Abrevaya et al. (2015), ‘Estimating
Conditional Average Treatment Effects’

7.7 Doubly Robust Methods

We’ve seen that we can estimate causal effects by modeling 𝜇(𝑡, 𝑤) ,


𝔼[𝑌 | 𝑡, 𝑤] (Sections 7.2 to 7.4) or by modeling 𝑒(𝑤) , 𝑃(𝑇 = 1 | 𝑤)
(Section 7.6). What if we modeled both 𝜇(𝑡, 𝑤) and 𝑒(𝑤)? Well, we can
and estimators that do this are sometimes doubly robust. A doubly robust
estimator has the property that it is a consistent10 estimator of 𝜏 if either 10An estimator is consistent if it converges
𝜇ˆ is a consistent estimator of 𝜇 or 𝑒ˆ is a consistent estimate of 𝑒 . In other in probability to its estimand as the num-
ber of samples 𝑛 grows.
words, only one of 𝜇ˆ and 𝑒ˆ needs to be well-specified. Additionally, the
rate at which a doubly robust estimator converges to 𝜏 is the product of
the rate at which 𝜇ˆ converges to 𝜇 and the rate at which 𝑒ˆ converges to 𝑒 .
This makes double robustness is very useful when we are using flexible
machine learning models in high-dimensions because, in this setting,
each of our individual models (𝜇ˆ and 𝑒ˆ ) converge more slowly that the
ideal rate of 𝑛 − /2 .
1

However, there is some controversy over how well doubly robust meth-
ods work in practice if not at least one of 𝜇ˆ or 𝑒ˆ is well-specified [35]. [35]: Kang and Schafer (2007), ‘Demysti-
Though, this might be contested as we get better at using doubly ro- fying Double Robustness: A Comparison
of Alternative Strategies for Estimating a
bust estimators with flexible machine learning models (see, e.g., [36]). Population Mean from Incomplete Data’
Meanwhile, the estimators that currently seem to do the best all flexibly
[36]: Zivich and Breskin (2020), Machine
model 𝜇 (unlike pure IPW estimators) [37]. This is why we began this learning for causal inference: on the use of
chapter with estimators that model 𝜇 and dedicated several sections to cross-fit estimators
such estimators. [37]: Dorie et al. (2019), ‘Automated versus
Do-It-Yourself Methods for Causal Infer-
Doubly robust methods are largely outside the scope of this book, so ence: Lessons Learned from a Data Analy-
we refer the reader to an introduction by Seaman and Vansteelandt [38], sis Competition’
along with other seminal works on the topic: [39–41]. Additionally, there [38]: Seaman and Vansteelandt (2018), ‘In-
is a large body of doubly robust work on methods that have performed troduction to Double Robust Methods for
Incomplete Data’
reasonably well in competitions [37]; this category is known as targeted
[39]: Tsiatis (2007), Semiparametric theory
maximum likelihood estimation (TMLE). [42–44].
and missing data
[40]: Robins et al. (1994), ‘Estimation of
Regression Coefficients When Some Re-
7.8 Other Methods gressors are not Always Observed’
[41]: Bang and Robins (2005), ‘Doubly
Robust Estimation in Missing Data and
Causal Inference Models’
As this chapter is only an introduction to estimation in causal inference,
[42]: Van Der Laan and Rubin (2006), ‘Tar-
there are some methods that we’ve entirely left out. We’ll briefly describe
geted maximum likelihood learning’
some of the most popular ones in this section. [43]: Schuler and Rose (2017), ‘Targeted
Maximum Likelihood Estimation for
Matching In matching methods, we try to match units in the treatment Causal Inference in Observational Studies’
group with units in the control group and throw away the non-matches [44]: Van der Laan and Rose (2011), Targeted
to create comparable groups. We can match in raw covariate space, learning: causal inference for observational and
experimental data
coarsened covariate space, or propensity score space. There are different
distance functions for deciding how close two units are. Furthermore,
there are different criteria for deciding whether a given distance is close
enough to count as a match (one criterion requires an exact match), how
many matches each treatment group unit can have, how many matches
7 Estimation 71

each control group unit can have, etc. See, for example, Stuart [45] for a [45]: Stuart (2010), ‘Matching Methods for
review. Causal Inference: A Review and a Look
Forward’
Double Machine Learning In double machine learning, we fit three
models in two stages: two in the first stage and a final model in the second
stage. First stage:
1. Fit a model to predict 𝑌 from 𝑊 to get the predicted 𝑌ˆ .11 11
Active reading exercise: How is this
2. Fit a model to predict 𝑇 from 𝑊 to get the to get the predicted 𝑇ˆ . model different from 𝜇ˆ ?

Then, in the second stage, we “partial out” 𝑊 by looking at 𝑌 − 𝑌ˆ and


𝑇 − 𝑇ˆ . In a sense, we have deconfounded the effect of treatment on the
outcome with this partialling out. Then, we fit a model to predict 𝑌 − 𝑌ˆ
from 𝑇 − 𝑇ˆ . This gives us our causal effect estimates. For more on this
topic, see, for example [46–49]. [46]: Chernozhukov et al. (2018), ‘Dou-
ble/debiased machine learning for treat-
Causal Trees and Forests Another popular estimation method is to recursively partition the data into subsets that have the same treatment effects [50]. This forms a causal tree where the leaves are subsets of the population with similar causal effects. Since random forests generally perform better than decision trees, it would be great if this kind of strategy could be extended to random forests. And it can. This extension is known as causal forests [51], which are part of a more general class known as generalized random forests [52]. Importantly, these methods were developed with the goal in mind of yielding valid confidence intervals for the estimates.

[50]: Athey and Imbens (2016), 'Recursive partitioning for heterogeneous causal effects'
[51]: Wager and Athey (2018), 'Estimation and Inference of Heterogeneous Treatment Effects using Random Forests'
[52]: Athey et al. (2019), 'Generalized random forests'

7.9 Concluding Remarks

7.9.1 Confidence Intervals

So far, in this chapter, we have only discussed point estimates for causal
effects. We haven’t discussed how we can gauge our uncertainty due
to data sampling. We haven’t discussed how to calculate confidence
intervals on these estimates. This is a machine learning perspective, after
all; who cares about confidence intervals... Jokes aside, because we are
allowing for arbitrary machine learning models in all of the estimators
we discuss, it is actually quite difficult to get valid confidence intervals.
Bootstrapping One way to get confidence intervals is to use bootstrap-
ping. With bootstrapping, we repeat the causal effect estimation process
many times, each time with a different sample (with replacement) from
our data. This allows us to build an empirical distribution for the estimate.
We can then compute whatever confidence interval we like from that em-
pirical distribution. Unfortunately, bootstrapped confidence intervals are
not always valid. For example, if we take a bootstrapped 95% confidence
interval, it might not contain the true value (estimand) 95% of the time.
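For concreteness, here is a minimal sketch of a percentile bootstrap around a generic ATE estimator. The function `ate_estimator` is a placeholder for whichever estimator from this chapter you are using; everything else (resample count, confidence level) is an illustrative assumption.

```python
# Minimal sketch: percentile bootstrap confidence interval for an ATE estimate.
# `ate_estimator(W, T, Y)` is a placeholder for any estimator from this chapter.
import numpy as np

def bootstrap_ci(ate_estimator, W, T, Y, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(Y)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        estimates.append(ate_estimator(W[idx], T[idx], Y[idx]))
    lower = np.percentile(estimates, 100 * alpha / 2)
    upper = np.percentile(estimates, 100 * (1 - alpha / 2))
    return lower, upper
```

As noted above, the resulting interval is not guaranteed to have the nominal coverage.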
Specialized Models Another way to get confidence intervals is to analyze very specific models, rather than allowing for arbitrary models. Linear models are the simplest example of this; it is easy to get confidence intervals in linear models. Similarly, if we use a linear model as the second stage model in double machine learning, we can get confidence intervals. Notably, causal trees and causal forests were developed with the goal in mind of getting confidence intervals.

7.9.2 Comparison to Randomized Experiments

You might read somewhere that some of these adjustment techniques


ensure that we’ve addressed confounding and isolated a causal effect.
Of course, this is not true when there is unobserved confounding. These
methods only address observed confounding. If there are any unobserved
confounders, these methods don’t fix that like randomization does
(Chapter 5). These adjustment methods aren’t magic. And it’s hard to
know when it is reasonable to assume we’ve observed all confounders.
That’s why it is important to run a sensitivity analysis where we gauge
how robust our causal effect estimates are to unobserved confounding.
This is the topic of the next chapter.
Active reading exercise: What kind of estimator did we use back in the
estimation examples in Sections 2.5 and 4.6.2?
8 Unobserved Confounding: Bounds and Sensitivity Analysis
In this chapter: 8.1 Bounds (No-Assumptions Bound; Monotone Treatment Response; Monotone Treatment Selection; Optimal Treatment Selection) · 8.2 Sensitivity Analysis (Sensitivity Basics in Linear Setting; More General Settings)

All of the methods in Chapter 7 assume that we don't have any unobserved confounding. However, unconfoundedness is an untestable assumption. In observational studies, there could also be some unobserved confounder(s). Therefore, we'd like to know how robust our estimates are to unobserved confounding. The first way we can do this is by getting an upper and lower bound on the causal effect using credible assumptions (Section 8.1). Another way we can do this is by simulating how strong the confounder's effect on the treatment and the confounder's effect on the outcome need to be to make the true causal effect substantially different from our estimate (Section 8.2).

Figure 8.1: (a) No unobserved confounding; (b) Unobserved confounding (𝑈). On the left, we have the setting we have considered up till now, where we have unconfoundedness / the backdoor criterion. On the right, we have a simple graph where the unobserved confounder 𝑈 makes the causal effect of 𝑇 on 𝑌 not identifiable.

8.1 Bounds

There is a tradeoff between how realistic or credible our assumptions are and how precise of an identification result we can get. Manski [53] calls this "The Law of Decreasing Credibility: the credibility of inference decreases with the strength of the assumptions maintained."

Depending on what assumptions we are willing to make, we can derive various nonparametric bounds on causal effects. We have seen that if we are willing to assume unconfoundedness (or some causal graph in which the causal effect is identifiable) and positivity, we can identify a single point for the causal effect. However, this might be unrealistic. For example, there could always be unobserved confounding in observational studies.

This is what motivates Charles Manski's work on bounding causal effects [53–60]. This gives us an interval that the causal effect must be in, rather than telling us exactly what point in that interval the causal effect must be. In this section, we will give an introduction to these nonparametric bounds and how to derive them.

The assumptions that we consider are weaker than unconfoundedness, so they give us intervals that the causal effect must fall in (under these assumptions). If we assumed the stronger assumption of unconfoundedness, these intervals would collapse to a single point. This illustrates the law of decreasing credibility.

[53]: Manski (2003), Partial Identification of Probability Distributions: Springer Series in Statistics
[54]: Manski (1989), 'Anatomy of the Selection Problem'
[55]: Manski (1990), 'Nonparametric Bounds on Treatment Effects'
[56]: Manski (1993), 'Identification Problems in the Social Sciences'
[57]: Manski (1994), 'The selection problem'
[58]: Manski (1997), 'Monotone Treatment Response'
[59]: Manski and Pepper (2000), 'Monotone Instrumental Variables: With an Application to the Returns to Schooling'
[60]: Manski (2013), Public Policy in an Uncertain World

8.1.1 No-Assumptions Bound

Say all we know about the potential outcomes 𝑌(0) and 𝑌(1) is that they
are between 0 and 1. Then, the maximum value of an ITE 𝑌𝑖 (1) − 𝑌𝑖 (0) is
1 (1 - 0), and the minimum is -1 (0 - 1):

−1 ≤ 𝑌𝑖 (1) − 𝑌𝑖 (0) ≤ 1 if ∀𝑡, 0 ≤ 𝑌(𝑡) ≤ 1 (8.1)

So we know that all ITEs must be in an interval of length 2. Because


all the ITEs must fall inside this interval of length 2, the ATE must also
fall inside this interval of length 2. Interestingly, for ATEs, it turns out
that we can cut the length of this interval in half without making any
assumptions (beyond the min/max value of outcome); the interval that
the ATE must fall in is only of length 1.
We'll show this result from Manski [55] in the more general scenario where the outcome is bounded between 𝑎 and 𝑏:

Assumption 8.1 (Bounded Potential Outcomes)

∀𝑡, 𝑎 ≤ 𝑌(𝑡) ≤ 𝑏 (8.2)

By the same reasoning as above, this implies the following bounds on the ITEs and ATE:

Active reading exercise: Ensure you follow how we get to these bounds.

𝑎 − 𝑏 ≤ 𝑌𝑖 (1) − 𝑌𝑖 (0) ≤ 𝑏 − 𝑎 (8.3)


𝑎 − 𝑏 ≤ 𝔼[𝑌(1) − 𝑌(0)] ≤ 𝑏 − 𝑎 (8.4)

These are intervals of length (𝑏 − 𝑎)−(𝑎 −𝑏) = 2(𝑏 − 𝑎). And the bounds for
the ITEs cannot be made tighter without further assumptions. However,
seemingly magically, we can halve the length of the interval for the ATE.
To see this, we rewrite the ATE as follows:

𝔼[𝑌(1) − 𝑌(0)] = 𝔼[𝑌(1)] − 𝔼[𝑌(0)] (8.5)


= 𝑃(𝑇 = 1) 𝔼[𝑌(1) | 𝑇 = 1] + 𝑃(𝑇 = 0) 𝔼[𝑌(1) | 𝑇 = 0]
− 𝑃(𝑇 = 1) 𝔼[𝑌(0) | 𝑇 = 1] − 𝑃(𝑇 = 0) 𝔼[𝑌(0) | 𝑇 = 0]
(8.6)

We immediately recognize the first and last terms as friendly conditional expectations that we can estimate from observational data (active reading exercise: what assumption are we using here?):

= 𝑃(𝑇 = 1) 𝔼[𝑌 | 𝑇 = 1] + 𝑃(𝑇 = 0) 𝔼[𝑌(1) | 𝑇 = 0] − 𝑃(𝑇 = 1) 𝔼[𝑌(0) | 𝑇 = 1] − 𝑃(𝑇 = 0) 𝔼[𝑌 | 𝑇 = 0]   (8.7)

Because this is such an important decomposition, we’ll give it a name


and box before moving on with the bound derivation. We will call this
the observational-counterfactual decomposition (of the ATE). Also, to have a bit more concise notation, we'll use 𝜋 ≜ 𝑃(𝑇 = 1) moving forward.

Proposition 8.1 (Observational-Counterfactual Decomposition)

𝔼[𝑌(1) − 𝑌(0)] = 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌(1) | 𝑇 = 0]


− 𝜋 𝔼[𝑌(0) | 𝑇 = 1] − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0] (8.8)

Unfortunately, 𝔼[𝑌(1) | 𝑇 = 0] and 𝔼[𝑌(0) | 𝑇 = 1] are counterfactual.


However, we know that they’re bounded between 𝑎 and 𝑏 . Therefore, we
get an upper bound on the complete expression by letting the quantity
that’s being added (𝔼[𝑌(1) | 𝑇 = 0]) equal 𝑏 and letting the quantity
that’s being subtracted (𝔼[𝑌(0) | 𝑇 = 1]) equal 𝑎 . Similarly, we can get a
lower bound by letting the term that’s being added equal 𝑎 and the term
that’s being subtracted equal 𝑏 .

Proposition 8.2 (No-Assumptions Bound) Let 𝜋 denote 𝑃(𝑇 = 1), where


𝑇 is a binary random variable. Given that the outcome 𝑌 is bounded between
𝑎 and 𝑏 (Assumption 8.1), we have the following upper and lower bounds on
the ATE:

𝔼[𝑌(1) − 𝑌(0)] ≤ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑏 − 𝜋 𝑎 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]


(8.9)
𝔼[𝑌(1) − 𝑌(0)] ≥ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑎 − 𝜋 𝑏 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]
(8.10)

Importantly, the length of this interval is 𝑏 − 𝑎 , half the length of the


naive interval that we saw in Equation 8.4. We can see this by subtracting
the lower bound from the upper bound:

𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑏 − 𝜋 𝑎 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]
−(𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑎 − 𝜋 𝑏 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0])
= (1 − 𝜋) 𝑏 + 𝜋 𝑏 − 𝜋 𝑎 − (1 − 𝜋) 𝑎 (8.11)
=𝑏−𝑎 (8.12)

This is sometimes referred to as the “no-assumptions bound” because


we made no assumptions other than that the outcomes are bounded. If
the outcomes are not bounded, then the ATE and ITEs can be anywhere
between −∞ and ∞.

Running Example

Consider that we know that the outcomes are bounded between 0


and 1 (e.g., because we're in a binary outcomes setting). This means that the ITEs must be bounded between -1 (0 - 1) and 1 (1 - 0), which means that the ATE must also be bounded between -1 and 1. For this example, also consider that 𝜋 = 0.3, 𝔼[𝑌 | 𝑇 = 1] = .9, and 𝔼[𝑌 | 𝑇 = 0] = .2.¹ Then, by plugging these in to Equations 8.9 and 8.10, we get the following bounds on the ATE:

𝔼[𝑌(1) − 𝑌(0)] ≤ (.3)(.9) + (1 − .3)(1) − (.3)(0) − (1 − .3)(.2)   (8.13)
𝔼[𝑌(1) − 𝑌(0)] ≥ (.3)(.9) + (1 − .3)(0) − (.3)(1) − (1 − .3)(.2)   (8.14)
−0.17 ≤ 𝔼[𝑌(1) − 𝑌(0)] ≤ 0.83   (8.15)

¹ Active reading exercise: How would we estimate these conditional expectations?


Notice that this interval is of length 1 (𝑏 − 𝑎 = 1), half the length of
the naive interval −1 ≤ 𝔼[𝑌(1) − 𝑌(0)] ≤ 1 (Equation 8.4). We will
use this running example throughout Section 8.1.
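Here is a small sketch that plugs the running example's numbers into the no-assumptions bound (Proposition 8.2); the specific values are exactly the ones assumed above.

```python
# No-assumptions bound (Proposition 8.2) for the running example.
a, b = 0, 1          # outcome bounds
pi = 0.3             # P(T = 1)
ey1, ey0 = 0.9, 0.2  # E[Y | T = 1], E[Y | T = 0]

upper = pi * ey1 + (1 - pi) * b - pi * a - (1 - pi) * ey0
lower = pi * ey1 + (1 - pi) * a - pi * b - (1 - pi) * ey0
print(lower, upper)  # -0.17, 0.83 (up to floating point); an interval of length b - a = 1
```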

Active reading exercises:


1. What kind of bounds can we get for CATEs 𝔼[𝑌(1) − 𝑌(0) | 𝑋],
assuming we have positivity? What goes wrong if we don’t have
positivity?
2. Say the potential outcomes are bounded in different ways: 𝑎 1 ≤
𝑌(1) ≤ 𝑏 1 and 𝑎0 ≤ 𝑌(0) ≤ 𝑏0 . Derive the corresponding no-
assumptions bounds in this more general setting.
The bounds in Proposition 8.2 are as tight as we can get without further assumptions. Unfortunately, the corresponding interval always contains 0,² which means that we cannot use this bound to distinguish "no causal effect" from "causal effect." Can we get tighter bounds?

² To see why the no-assumptions bound always contains zero, consider what we would need for it to not contain zero: we would either need the upper bound to be less than zero or the lower bound to be greater than zero. However, this cannot be the case. To see why, note that the minimum upper bound is achieved when 𝔼[𝑌 | 𝑇 = 1] = 𝑎 and 𝔼[𝑌 | 𝑇 = 0] = 𝑏, which gives us an (inclusive) upper bound of zero. Same with the lower bound. Active reading exercise: Show that the maximum lower bound is 0.

In order to bound the ATE, we must have some information about the counterfactual part of this decomposition. We can easily estimate the observational part from data. In the no-assumptions bound (Proposition 8.2), all we assumed is that the outcomes are bounded by 𝑎 and 𝑏. If we make more assumptions, we can get smaller intervals. In the next few sections, we will cover some assumptions that are sometimes fairly reasonable, depending on the setting, and what tighter bounds these assumptions get us. The general strategy we will use for all of them is to start with the observational-counterfactual decomposition of the ATE (Proposition 8.1),
(Proposition 8.1),

𝔼[𝑌(1) − 𝑌(0)] = 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌(1) | 𝑇 = 0]


− 𝜋 𝔼[𝑌(0) | 𝑇 = 1] − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0] ,
(8.8 revisited)

and get smaller intervals by bounding the counterfactual parts using the
different assumptions we make.
The intervals we will see in the next couple of subsections will all contain
zero. We won’t see an interval that is purely positive or purely negative
until Section 8.1.4, so feel free to skip to that section if you only want to
see those intervals.

8.1.2 Monotone Treatment Response

For our first assumption beyond assuming bounded outcomes, consider that we find ourselves in a setting where it is feasible that the treatment can only help; it can't hurt. This is the setting that Manski [58] considers in context. In this setting, we can justify the monotone treatment response (MTR) assumption:

[58]: Manski (1997), 'Monotone Treatment Response'

Assumption 8.2 (Nonnegative Monotone Treatment Response)

∀𝑖 𝑌𝑖 (1) ≥ 𝑌𝑖 (0) (8.16)



This means that every ITE is nonnegative, so we can bring our lower bound on the ITEs up from 𝑎 − 𝑏 (Equation 8.3) to 0. So, intuitively, this should mean that our lower bound on the ATE should move up to 0. And we will now see that this is the case.

Now, rather than lower bounding 𝔼[𝑌(1) | 𝑇 = 0] with 𝑎 and −𝔼[𝑌(0) | 𝑇 = 1] with −𝑏, we can do better. Because the treatment only helps, 𝔼[𝑌(1) | 𝑇 = 0] ≥ 𝔼[𝑌(0) | 𝑇 = 0] = 𝔼[𝑌 | 𝑇 = 0], so we can lower bound 𝔼[𝑌(1) | 𝑇 = 0] with 𝔼[𝑌 | 𝑇 = 0]. Similarly, −𝔼[𝑌(0) | 𝑇 = 1] ≥ −𝔼[𝑌(1) | 𝑇 = 1] = −𝔼[𝑌 | 𝑇 = 1] (since multiplying by a negative flips the inequality), so we can lower bound −𝔼[𝑌(0) | 𝑇 = 1] with −𝔼[𝑌 | 𝑇 = 1]. Therefore, we can improve on the no-assumptions lower bound³ to get 0, as our intuition suggested:

𝔼[𝑌(1) − 𝑌(0)] = 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌(1) | 𝑇 = 0] − 𝜋 𝔼[𝑌(0) | 𝑇 = 1] − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.8 revisited)
≥ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0] − 𝜋 𝔼[𝑌 | 𝑇 = 1] − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.17)
= 0   (8.18)

³ Recall that by only assuming that outcomes are bounded between 𝑎 and 𝑏, we get the no-assumptions lower bound (Proposition 8.2): 𝔼[𝑌(1) − 𝑌(0)] ≥ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑎 − 𝜋 𝑏 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.10 revisited)

Proposition 8.3 (Nonnegative MTR Lower Bound) Under the nonnega-


tive MTR assumption, the ATE is bounded from below by 0. Mathematically,

𝔼[𝑌(1) − 𝑌(0)] ≥ 0 (8.19)

Running Example The no-assumptions upper bound⁴ still applies here, so in our running example from Section 8.1.1 where 𝜋 = .3, 𝔼[𝑌 | 𝑇 = 1] = .9, and 𝔼[𝑌 | 𝑇 = 0] = .2, our ATE interval improves from [−0.17, 0.83] (Equation 8.15) to [0, 0.83].

⁴ Recall the no-assumptions upper bound (Proposition 8.2): 𝔼[𝑌(1) − 𝑌(0)] ≤ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑏 − 𝜋 𝑎 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.9 revisited)

Alternatively, say the treatment can only hurt people; it can't help them (e.g., a gunshot wound only hurts chances of staying alive). In those cases, we would have the nonpositive monotone treatment response assumption and the nonpositive MTR upper bound:

Assumption 8.3 (Nonpositive Monotone Treatment Response)

∀𝑖 𝑌𝑖 (1) ≤ 𝑌𝑖 (0) (8.20)

Proposition 8.4 (Nonpositive MTR Upper Bound) Under the nonpositive MTR assumption, the ATE is bounded from above by 0. Mathematically,

𝔼[𝑌(1) − 𝑌(0)] ≤ 0   (8.21)

Active reading exercise: Prove Proposition 8.4.

Running Example And in this setting, the no-assumptions lower bound⁵ still applies. That means that the ATE interval in our example improves from [−0.17, 0.83] (Equation 8.15) to [−0.17, 0].

⁵ Recall the no-assumptions lower bound (Proposition 8.2): 𝔼[𝑌(1) − 𝑌(0)] ≥ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑎 − 𝜋 𝑏 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.10 revisited)

Active reading exercise: What is the ATE interval if we assume both nonnegative MTR and nonpositive MTR? Does this make sense, intuitively?

8.1.3 Monotone Treatment Selection

The next assumption that we’ll consider is the assumption that the people
who selected treatment would have better outcomes than those who
didn’t select treatment, under either treatment scenario. Manski and
Pepper [59] introduced this as the monotone treatment selection (MTS) assumption.

[59]: Manski and Pepper (2000), 'Monotone Instrumental Variables: With an Application to the Returns to Schooling'

Assumption 8.4 (Monotone Treatment Selection)

𝔼[𝑌(1) | 𝑇 = 1] ≥ 𝔼[𝑌(1) | 𝑇 = 0] (8.22)


𝔼[𝑌(0) | 𝑇 = 1] ≥ 𝔼[𝑌(0) | 𝑇 = 0] (8.23)

As Morgan and Winship [12, Section 12.2.2] point out, you might think of this as positive self-selection. Those who generally get better outcomes self-select into the treatment group. Again, we start with the observational-counterfactual decomposition, and we now obtain an upper bound using the MTS assumption (Assumption 8.4):

[12]: Morgan and Winship (2014), Counterfactuals and Causal Inference: Methods and Principles for Social Research

Proposition 8.5 (Monotone Treatment Selection Upper Bound) Under


the MTS assumption, the ATE is bounded from above by the associational
difference. Mathematically,

𝔼[𝑌(1) − 𝑌(0)] ≤ 𝔼[𝑌 | 𝑇 = 1] − 𝔼[𝑌 | 𝑇 = 0] (8.24)

Proof.

𝔼[𝑌(1) − 𝑌(0)] = 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌(1) | 𝑇 = 0]


− 𝜋 𝔼[𝑌(0) | 𝑇 = 1] − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]
(8.8 revisited)
≤ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 1]
− 𝜋 𝔼[𝑌 | 𝑇 = 0] − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0] (8.25)
= 𝔼[𝑌 | 𝑇 = 1] − 𝔼[𝑌 | 𝑇 = 0] (8.26)

where Equation 8.25 followed from the fact that (a) Equation 8.22 of the MTS assumption allows us to upper bound 𝔼[𝑌(1) | 𝑇 = 0] by 𝔼[𝑌(1) | 𝑇 = 1] = 𝔼[𝑌 | 𝑇 = 1] and (b) Equation 8.23 of the MTS assumption allows us to upper bound −𝔼[𝑌(0) | 𝑇 = 1] by −𝔼[𝑌 | 𝑇 = 0].

Running Example Recall our running example from Section 8.1.1 where 𝜋 = .3, 𝔼[𝑌 | 𝑇 = 1] = .9, and 𝔼[𝑌 | 𝑇 = 0] = .2. The MTS assumption gives us an upper bound, and we still have the no-assumptions lower bound.⁶ That means that the ATE interval in our example improves from [−0.17, 0.83] (Equation 8.15) to [−0.17, 0.7].

⁶ Recall the no-assumptions lower bound (Proposition 8.2): 𝔼[𝑌(1) − 𝑌(0)] ≥ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑎 − 𝜋 𝑏 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.10 revisited)

Both MTR and MTS Then, we can combine the nonnegative MTR assumption (Assumption 8.2) with the MTS assumption (Assumption 8.4) to get the lower bound in Proposition 8.3 and the upper bound in Proposition 8.5, respectively. In our running example, this yields the following interval for the ATE: [0, 0.7].

Intervals Contain Zero Although bounds from the MTR and MTS
assumptions can be useful for ruling out very large or very small causal
effects, the corresponding intervals still contain zero. This means that
these assumptions are not enough to identify whether there is an effect
or not.

8.1.4 Optimal Treatment Selection

We now consider what we will call the optimal treatment selection (OTS) assumption from Manski [55]. This assumption means that the individuals always receive the treatment that is best for them (e.g., if an expert doctor is deciding which treatment to give people). We write this mathematically as follows:

[55]: Manski (1990), 'Nonparametric Bounds on Treatment Effects'

Assumption 8.5 (Optimal Treatment Selection)

𝑇𝑖 = 1 =⇒ 𝑌𝑖 (1) ≥ 𝑌𝑖 (0) , 𝑇𝑖 = 0 =⇒ 𝑌𝑖 (0) > 𝑌𝑖 (1) (8.27)

From the OTS assumption, we know that

𝔼[𝑌(1) | 𝑇 = 0] ≤ 𝔼[𝑌(0) | 𝑇 = 0] = 𝔼[𝑌 | 𝑇 = 0] . (8.28)

Therefore, we can give an upper bound, by upper bounding 𝔼[𝑌(1) | 𝑇 = 0] with 𝔼[𝑌 | 𝑇 = 0] and upper bounding −𝔼[𝑌(0) | 𝑇 = 1] with −𝑎 (same as in the no-assumptions upper bound⁷):

𝔼[𝑌(1) − 𝑌(0)] = 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌(1) | 𝑇 = 0] − 𝜋 𝔼[𝑌(0) | 𝑇 = 1] − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.8 revisited)
≤ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0] − 𝜋 𝑎 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.29)
= 𝜋 𝔼[𝑌 | 𝑇 = 1] − 𝜋 𝑎   (8.30)

⁷ Recall the no-assumptions upper bound (Proposition 8.2): 𝔼[𝑌(1) − 𝑌(0)] ≤ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑏 − 𝜋 𝑎 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.9 revisited)

The OTS assumption also tells us that

𝔼[𝑌(0) | 𝑇 = 1] ≤ 𝔼[𝑌(1) | 𝑇 = 1] = 𝔼[𝑌 | 𝑇 = 1] , (8.31)

which is equivalent to saying −𝔼[𝑌(0) | 𝑇 = 1] ≥ −𝔼[𝑌 | 𝑇 = 1]. So we can lower bound −𝔼[𝑌(0) | 𝑇 = 1] with −𝔼[𝑌 | 𝑇 = 1], and we can lower bound 𝔼[𝑌(1) | 𝑇 = 0] with 𝑎 (just as we did in the no-assumptions lower bound⁸) to get the following lower bound:

𝔼[𝑌(1) − 𝑌(0)] = 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌(1) | 𝑇 = 0] − 𝜋 𝔼[𝑌(0) | 𝑇 = 1] − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.8 revisited)
≥ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑎 − 𝜋 𝔼[𝑌 | 𝑇 = 1] − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.32)
= (1 − 𝜋) 𝑎 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.33)

⁸ Recall the no-assumptions lower bound (Proposition 8.2): 𝔼[𝑌(1) − 𝑌(0)] ≥ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑎 − 𝜋 𝑏 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]   (8.10 revisited)

Proposition 8.6 (Optimal Treatment Selection Bound 1) Let 𝜋 denote


𝑃(𝑇 = 1), where 𝑇 is a binary random variable. Given that the outcome 𝑌 is
bounded from below by 𝑎 (Assumption 8.1) and that the optimal treatment is always selected (Assumption 8.5), we have the following upper and lower
bounds on the ATE:

𝔼[𝑌(1) − 𝑌(0)] < 𝜋 𝔼[𝑌 | 𝑇 = 1] − 𝜋 𝑎 (8.34)


𝔼[𝑌(1) − 𝑌(0)] ≥ (1 − 𝜋) 𝑎 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0] (8.35)
Interval Length = 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0] − 𝑎 (8.36)

Unfortunately, this interval also always contains zero!⁹ This means that Proposition 8.6 doesn't tell us whether the causal effect is non-zero or not.

⁹ Active reading exercise: Show that this interval always contains zero.
Running Example Recall our running example from Section 8.1.1 where
𝑎 = 0, 𝑏 = 1, 𝜋 = .3, 𝔼[𝑌 | 𝑇 = 1] = .9, and 𝔼[𝑌 | 𝑇 = 0] = .2. Plugging
these in to Proposition 8.6 gives us the following:

𝔼[𝑌(1) − 𝑌(0)] ≤ (.3) (.9) − (.3) (0) (8.37)


𝔼[𝑌(1) − 𝑌(0)] ≥ (1 − .3) (0) − (1 − .3) (.2) (8.38)
−0.14 ≤ 𝔼[𝑌(1) − 𝑌(0)] ≤ 0.27 (8.39)
Interval Length = 0.41 (8.40)

We’ll now give an interval that can be purely positive or purely negative,
potentially identifying the ATE as non-zero.

A Bound That Can Identify the Sign of the ATE

It turns out that, although we take the OTS assumption from Manski [55], the bound we gave in Proposition 8.6 is not actually the bound that Manski [55] derives with that assumption. For example, where we used 𝔼[𝑌(1) | 𝑇 = 0] ≤ 𝔼[𝑌 | 𝑇 = 0], Manski uses 𝔼[𝑌(1) | 𝑇 = 0] ≤ 𝔼[𝑌 | 𝑇 = 1]. We'll quickly prove this inequality that Manski uses from the OTS assumption:¹⁰ We start by applying Equation 8.42:

𝔼[𝑌(1) | 𝑇 = 0] = 𝔼[𝑌(1) | 𝑌(0) > 𝑌(1)]   (8.45)

Because the random variable we are taking the expectation of is 𝑌(1), if we flip 𝑌(0) > 𝑌(1) to 𝑌(0) ≤ 𝑌(1), then we get an upper bound:

≤ 𝔼[𝑌(1) | 𝑌(0) ≤ 𝑌(1)]   (8.46)

Finally, applying Equation 8.44, we have the result:

= 𝔼[𝑌(1) | 𝑇 = 1]   (8.47)
= 𝔼[𝑌 | 𝑇 = 1]   (8.48)

¹⁰ Recall the OTS assumption (Assumption 8.5):
𝑇𝑖 = 1 =⇒ 𝑌𝑖(1) ≥ 𝑌𝑖(0)   (8.41)
𝑇𝑖 = 0 =⇒ 𝑌𝑖(0) > 𝑌𝑖(1)   (8.42)
Because there are only two values that 𝑇 can take on, this is equivalent to the following (contrapositives):
𝑇𝑖 = 0 ⇐= 𝑌𝑖(1) < 𝑌𝑖(0)   (8.43)
𝑇𝑖 = 1 ⇐= 𝑌𝑖(0) ≤ 𝑌𝑖(1)   (8.44)

Now that we have that 𝔼[𝑌(1) | 𝑇 = 0] ≤ 𝔼[𝑌 | 𝑇 = 1], we can prove Manski [55]'s upper bound, where we use this key inequality in Equation 8.49:

𝔼[𝑌(1) − 𝑌(0)] = 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌(1) | 𝑇 = 0]


− 𝜋 𝔼[𝑌(0) | 𝑇 = 1] − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0]
(8.8 revisited)
≤ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌(1) | 𝑇 = 1]
− 𝜋 𝑎 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0] (8.49)
= 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 1]
− 𝜋 𝑎 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0] (8.50)
= 𝔼[𝑌 | 𝑇 = 1] − 𝜋 𝑎 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0] (8.51)

Similarly, we can perform an analogous derivation¹¹ to get the lower bound:

𝔼[𝑌(1) − 𝑌(0)] ≥ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑎 − 𝔼[𝑌 | 𝑇 = 0]   (8.52)

¹¹ Active reading exercise: Derive Equation 8.52 yourself.

Proposition 8.7 (Optimal Treatment Selection Bound 2) Let 𝜋 denote


𝑃(𝑇 = 1), where 𝑇 is a binary random variable. Given that the outcome 𝑌 is
bounded from below by 𝑎 (Assumption 8.1) and that the optimal treatment is
always selected (Assumption 8.5), we have the following upper and lower
bounds on the ATE:

𝔼[𝑌(1) − 𝑌(0)] ≤ 𝔼[𝑌 | 𝑇 = 1] − 𝜋 𝑎 − (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 0] (8.53)


𝔼[𝑌(1) − 𝑌(0)] ≥ 𝜋 𝔼[𝑌 | 𝑇 = 1] + (1 − 𝜋) 𝑎 − 𝔼[𝑌 | 𝑇 = 0] (8.54)
Interval Length = (1 − 𝜋) 𝔼[𝑌 | 𝑇 = 1] + 𝜋 𝔼[𝑌 | 𝑇 = 0] − 𝑎 (8.55)

This interval can also include zero, but it doesn’t have to. For example, in
our running example, it doesn’t.
Running Example Recall our running example from Section 8.1.1 where
𝑎 = 0, 𝑏 = 1, 𝜋 = .3, 𝔼[𝑌 | 𝑇 = 1] = .9, and 𝔼[𝑌 | 𝑇 = 0] = .2. Plugging
these in to Proposition 8.7 gives us the following for the OTS bound 2:

𝔼[𝑌(1) − 𝑌(0)] ≤ (.9) − (.3)(0) − (1 − .3)(.2)   (8.56)
𝔼[𝑌(1) − 𝑌(0)] ≥ (.3)(.9) + (1 − .3)(0) − (.2)   (8.57)
0.07 ≤ 𝔼[𝑌(1) − 𝑌(0)] ≤ 0.76   (8.58)
Interval Length = 0.69   (8.59)

For comparison, recall the application of OTS bound 1 (Proposition 8.6) to our running example:
−0.14 ≤ 𝔼[𝑌(1) − 𝑌(0)] ≤ 0.27   (8.39 revisited)
Interval Length = 0.41   (8.40 revisited)

So while the OTS bound 2 from Manski [55] identifies the sign of the ATE in our running example, unlike the OTS bound 1, the OTS bound 2 gives us a 68% larger interval. You can see this by comparing Equation 8.40 (revisited above) with Equation 8.59.
This illustrates some important takeaways:

1. Different bounds are better in different cases.¹²
2. Different bounds can be better in different ways (e.g., identifying the sign vs. getting a smaller interval).

¹² Active reading exercise: Using Equations 8.40 and 8.59, derive the conditions under which OTS bound 1 yields a smaller interval and the conditions under which OTS bound 2 yields a smaller interval.
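To make the comparison concrete, here is a small sketch that computes both OTS bounds for the running example (reproducing Equations 8.39/8.40 and 8.58/8.59); the numbers are the ones assumed in the running example.

```python
# OTS bound 1 (Proposition 8.6) and OTS bound 2 (Proposition 8.7) for the running example.
a, pi, ey1, ey0 = 0, 0.3, 0.9, 0.2  # outcome lower bound, P(T=1), E[Y|T=1], E[Y|T=0]

# OTS bound 1
upper1 = pi * ey1 - pi * a
lower1 = (1 - pi) * a - (1 - pi) * ey0
# OTS bound 2
upper2 = ey1 - pi * a - (1 - pi) * ey0
lower2 = pi * ey1 + (1 - pi) * a - ey0

print(lower1, upper1)  # -0.14, 0.27  (length 0.41, contains zero)
print(lower2, upper2)  #  0.07, 0.76  (length 0.69, identifies the sign)
```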
Mixing Bounds Fortunately, because both the OTS bound 1 and OTS bound 2 come from the same assumption (Assumption 8.5), we can take the lower bound from OTS bound 2 and the upper bound from OTS bound 1 to get the following tighter interval that still identifies the sign:

0.07 ≤ 𝔼[𝑌(1) − 𝑌(0)] ≤ 0.27   (8.60)

Similarly, we could have mixed the lower bound from OTS bound 1 and the upper bound from OTS bound 2, but that would have given the worst interval in this subsection for this specific example. It could be the best in a different example, though.

In this section we've given you a taste of what kind of results we can get from nonparametric bounds, but, of course, this is just an introduction. For more literature on this, see, e.g., [53–60].

8.2 Sensitivity Analysis

8.2.1 Sensitivity Basics in Linear Setting

Before this chapter, we have exclusively been working in the setting where causal effects are identifiable. We illustrate the common example of the confounders 𝑊 as common causes of 𝑇 and 𝑌 in Figure 8.2. In this example, the causal effect of 𝑇 on 𝑌 is identifiable. However, what if there is a single unobserved confounder 𝑈, as we illustrate in Figure 8.3? Then, the causal effect is not identifiable.

Figure 8.2: Simple causal structure where 𝑊 confounds the effect of 𝑇 on 𝑌 and where 𝑊 is the only confounder.
What would be the bias we'd observe if we only adjusted for the observed confounders 𝑊? To illustrate this simply, we'll start with a noiseless¹³ linear data generating process. So consider data that are generated by the following structural equations:

𝑇 := 𝛼𝑤 𝑊 + 𝛼𝑢 𝑈   (8.61)
𝑌 := 𝛽𝑤 𝑊 + 𝛽𝑢 𝑈 + 𝛿𝑇   (8.62)

Figure 8.3: Simple causal structure where 𝑊 is the observed confounders and 𝑈 is the unobserved confounders.

¹³ Active reading exercise: What assumption is violated when the data are generated by a noiseless process?

So the relevant quantity that describes causal effects of 𝑇 on 𝑌 is 𝛿 since it is the coefficient in front of 𝑇 in the structural equation for 𝑌. From the backdoor adjustment (Theorem 4.2) / adjustment formula (Theorem 2.1), we know that

𝔼[𝑌(1) − 𝑌(0)] = 𝔼𝑊,𝑈 [𝔼[𝑌 | 𝑇 = 1, 𝑊, 𝑈] − 𝔼[𝑌 | 𝑇 = 0, 𝑊, 𝑈]] = 𝛿   (8.63)

But because 𝑈 isn't observed, the best we can do is adjust for only 𝑊. This leads to a confounding bias of 𝛽𝑢/𝛼𝑢. We'll be focusing on identification, not estimation, here, so we'll consider that we have infinite data. This means that we have access to 𝑃(𝑊, 𝑇, 𝑌). Then, we'll write down and prove the following proposition about confounding bias:

Proposition 8.8 When 𝑇 and 𝑌 are generated by the noiseless linear process in Equations 8.61 and 8.62, the confounding bias of adjusting for just 𝑊 (and not 𝑈) is 𝛽𝑢/𝛼𝑢. Mathematically:

𝔼𝑊 [𝔼[𝑌 | 𝑇 = 1, 𝑊] − 𝔼[𝑌 | 𝑇 = 0, 𝑊]] − 𝔼𝑊,𝑈 [𝔼[𝑌 | 𝑇 = 1, 𝑊, 𝑈] − 𝔼[𝑌 | 𝑇 = 0, 𝑊, 𝑈]] = 𝛽𝑢/𝛼𝑢   (8.64)

Proof. We’ll prove Proposition 8.8 in 3 steps:


1. Get a closed-form expression for 𝔼𝑊 [𝔼[𝑌 | 𝑇 = 𝑡, 𝑊]] in terms of
𝛼 𝑤 , 𝛼 𝑢 , 𝛽 𝑤 , and 𝛽 𝑢 .
2. Use step 1 to get a closed-form expression for the difference
𝔼𝑊 [𝔼[𝑌 | 𝑇 = 1, 𝑊] − 𝔼[𝑌 | 𝑇 = 0, 𝑊]].
3. Subtract off 𝔼𝑊,𝑈 [𝔼[𝑌 | 𝑇 = 1, 𝑊, 𝑈] − 𝔼[𝑌 | 𝑇 = 0, 𝑊, 𝑈]] = 𝛿.¹⁴

¹⁴ Active reading exercise: Show that 𝔼𝑊,𝑈 [𝔼[𝑌 | 𝑇 = 1, 𝑊, 𝑈] − 𝔼[𝑌 | 𝑇 = 0, 𝑊, 𝑈]] equals 𝛿.

First, we use the structural equation for 𝑌 (Equation 8.62), 𝑌 := 𝛽𝑤 𝑊 + 𝛽𝑢 𝑈 + 𝛿𝑇:

𝔼𝑊 [𝔼[𝑌 | 𝑇 = 𝑡, 𝑊]] = 𝔼𝑊 [𝔼[𝛽𝑤 𝑊 + 𝛽𝑢 𝑈 + 𝛿𝑇 | 𝑇 = 𝑡, 𝑊]]   (8.65)
= 𝔼𝑊 [𝛽𝑤 𝑊 + 𝛽𝑢 𝔼[𝑈 | 𝑇 = 𝑡, 𝑊] + 𝛿𝑡]   (8.66)

This is where we use the structural equation for 𝑇 (Equation 8.61), 𝑇 := 𝛼𝑤 𝑊 + 𝛼𝑢 𝑈. Rearranging it gives us 𝑈 = (𝑇 − 𝛼𝑤 𝑊)/𝛼𝑢. We can then use that for the remaining conditional expectation:

= 𝔼𝑊 [𝛽𝑤 𝑊 + 𝛽𝑢 (𝑡 − 𝛼𝑤 𝑊)/𝛼𝑢 + 𝛿𝑡]   (8.67)
= 𝔼𝑊 [𝛽𝑤 𝑊 + (𝛽𝑢/𝛼𝑢) 𝑡 − (𝛽𝑢 𝛼𝑤/𝛼𝑢) 𝑊 + 𝛿𝑡]   (8.68)
= 𝛽𝑤 𝔼[𝑊] + (𝛽𝑢/𝛼𝑢) 𝑡 − (𝛽𝑢 𝛼𝑤/𝛼𝑢) 𝔼[𝑊] + 𝛿𝑡   (8.69)

Then, rearranging a bit, we have the following:

= (𝛿 + 𝛽𝑢/𝛼𝑢) 𝑡 + (𝛽𝑤 − 𝛽𝑢 𝛼𝑤/𝛼𝑢) 𝔼[𝑊]   (8.70)

The only parts of this that matter are the parts that depend on 𝑡 because
we want to know the effect of 𝑇 on 𝑌 . For example, consider the expected
ATE estimate we would get if we were to only adjust for 𝑊 :

𝔼𝑊 [𝔼[𝑌 | 𝑇 = 1, 𝑊] − 𝔼[𝑌 | 𝑇 = 0, 𝑊]]   (8.71)
= ((𝛿 + 𝛽𝑢/𝛼𝑢)(1) + (𝛽𝑤 − 𝛽𝑢 𝛼𝑤/𝛼𝑢) 𝔼[𝑊]) − ((𝛿 + 𝛽𝑢/𝛼𝑢)(0) + (𝛽𝑤 − 𝛽𝑢 𝛼𝑤/𝛼𝑢) 𝔼[𝑊])   (8.72)
= 𝛿 + 𝛽𝑢/𝛼𝑢   (8.73)

Finally, subtracting off 𝔼𝑊 ,𝑈 [𝔼[𝑌 | 𝑇 = 1 , 𝑊 , 𝑈] − 𝔼[𝑌 | 𝑇 = 0 , 𝑊 , 𝑈]]:

Bias = 𝔼𝑊 [𝔼[𝑌 | 𝑇 = 1, 𝑊] − 𝔼[𝑌 | 𝑇 = 0, 𝑊]] − 𝔼𝑊,𝑈 [𝔼[𝑌 | 𝑇 = 1, 𝑊, 𝑈] − 𝔼[𝑌 | 𝑇 = 0, 𝑊, 𝑈]]   (8.74)
= 𝛿 + 𝛽𝑢/𝛼𝑢 − 𝛿   (8.75)
= 𝛽𝑢/𝛼𝑢   (8.76)
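As a quick sanity check on Proposition 8.8, here is a small simulation sketch. The particular coefficient values and the use of ordinary least squares are just assumptions made up for this example; because the data generating process is an exact linear function of 𝑇 and 𝑊, regressing 𝑌 on (𝑇, 𝑊) recovers exactly the 𝑊-adjusted quantity from Equation 8.73.

```python
# Simulation sketch: confounding bias from adjusting for W but not U (Proposition 8.8).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100_000
alpha_w, alpha_u, beta_w, beta_u, delta = 1.0, 2.0, 3.0, 4.0, 5.0  # assumed values

W = rng.normal(size=n)
U = rng.normal(size=n)                    # unobserved confounder
T = alpha_w * W + alpha_u * U             # Equation 8.61
Y = beta_w * W + beta_u * U + delta * T   # Equation 8.62

# Adjusting for W only: the coefficient on T recovers delta + beta_u / alpha_u,
# not the true delta (Proposition 8.8).
coef_T = LinearRegression().fit(np.column_stack([T, W]), Y).coef_[0]
print(coef_T - delta)  # approximately beta_u / alpha_u = 2.0
```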

Generalization to Arbitrary Graphs/Estimands Here, we've performed a sensitivity analysis for the ATE for the simple graph structure in Figure 8.4. For arbitrary estimands in arbitrary graphs, where the structural equations are linear, see Cinelli et al. [61].

Figure 8.4: Simple causal structure where 𝑊 is the observed confounders and 𝑈 is the unobserved confounders.

Sensitivity Contour Plots

Because Proposition 8.8 gives us a closed-form expression for the bias in terms of the unobserved confounder parameters 𝛼𝑢 and 𝛽𝑢, we can plot the levels of bias in contour plots. We show this in Figure 8.5a, where we have 1/𝛼𝑢 on the x-axis and 𝛽𝑢 on the y-axis.

[61]: Cinelli et al. (2019), 'Sensitivity Analysis of Linear Structural Causal Models'

If we rearrange Equation 8.73¹⁵ to solve for 𝛿, we get the following:

𝛿 = 𝔼𝑊 [𝔼[𝑌 | 𝑇 = 1, 𝑊] − 𝔼[𝑌 | 𝑇 = 0, 𝑊]] − 𝛽𝑢/𝛼𝑢   (8.77)

¹⁵ Recall Equation 8.73: 𝔼𝑊 [𝔼[𝑌 | 𝑇 = 1, 𝑊] − 𝔼[𝑌 | 𝑇 = 0, 𝑊]] = 𝛿 + 𝛽𝑢/𝛼𝑢   (8.73 revisited)

So for given values of 𝛼𝑢 and 𝛽𝑢, we can compute the true ATE 𝛿 from the observational quantity 𝔼𝑊 [𝔼[𝑌 | 𝑇 = 1, 𝑊] − 𝔼[𝑌 | 𝑇 = 0, 𝑊]]. This allows us to get sensitivity curves that allow us to know how robust conclusions like "𝔼𝑊 [𝔼[𝑌 | 𝑇 = 1, 𝑊] − 𝔼[𝑌 | 𝑇 = 0, 𝑊]] = 25 is positive, so 𝛿 is likely positive" are to unobserved confounding. We plot such relevant contours of 𝛿 in Figure 8.5b.

Figure 8.5: Contour plots for sensitivity, where the x-axis for both is 1/𝛼𝑢 and the y-axis is 𝛽𝑢. (a) Contours of the confounding bias 𝛽𝑢/𝛼𝑢. (b) Contours of the true ATE 𝛿, given that 𝔼𝑊 [𝔼[𝑌 | 𝑇 = 1, 𝑊] − 𝔼[𝑌 | 𝑇 = 0, 𝑊]] = 25. There is a color-coded correspondence between the curves in the upper right of Figure 8.5b and the curves in Figure 8.5a.

In the example we depict in Figure 8.5, the figure tells us that the green curve (third from the bottom/left) indicates how strong the confounding would need to be in order to completely explain the observed association. In other words, (1/𝛼𝑢, 𝛽𝑢) would need to be large enough to fall on the green curve or above in order for the true ATE 𝛿 to be zero or the opposite sign of 𝔼𝑊 [𝔼[𝑌 | 𝑇 = 1, 𝑊] − 𝔼[𝑌 | 𝑇 = 0, 𝑊]] = 25.
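A sketch of how contour plots in this style can be produced follows; the grid ranges and contour levels are illustrative choices (the observed value of 25 matches the example above), and matplotlib is assumed to be available.

```python
# Sketch: sensitivity contours in the style of Figure 8.5.
import numpy as np
import matplotlib.pyplot as plt

inv_alpha_u = np.linspace(0.01, 5, 200)   # x-axis: 1 / alpha_u
beta_u = np.linspace(-10, 10, 200)        # y-axis: beta_u
X, B = np.meshgrid(inv_alpha_u, beta_u)

observed = 25.0              # E_W[E[Y | T=1, W] - E[Y | T=0, W]]
bias = B * X                 # beta_u / alpha_u (Proposition 8.8)
delta = observed - bias      # true ATE implied by Equation 8.77

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].contour(X, B, bias, levels=[1, 5, 10, 25, 50])   # contours of confounding bias
axes[1].contour(X, B, delta, levels=[-25, 0, 24])        # contours of the true ATE
plt.show()
```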

8.2.2 More General Settings

We consider a simple linear setting in Section 8.2.1 in order to easily


convey the important concepts in sensitivity analysis. However, there is existing work that allows us to do sensitivity analysis in more general settings.
Say we are in the common setting where 𝑇 is binary. This is not the case in the previous section (see Equation 8.61, 𝑇 := 𝛼𝑤 𝑊 + 𝛼𝑢 𝑈). Rosenbaum and Rubin [62] and Imbens [63]¹⁶ consider a simple binary treatment setting with binary 𝑈 by just putting a logistic sigmoid function around the right-hand side of Equation 8.61 and using that for the probability of treatment instead of the actual value of treatment:

𝑃(𝑇 = 1 | 𝑊, 𝑈) := 1 / (1 + exp(−(𝛼𝑤 𝑊 + 𝛼𝑢 𝑈)))   (8.78)

¹⁶ Imbens [63] is the first to introduce contour plots like the ones in our Figure 8.5.

No Assumptions on 𝑇 or 𝑈 Fortunately, we can drop a lot of the assumptions that we've seen so far. Unlike the linear form that we assumed for 𝑇 in Section 8.2.1 and the linearish form that Rosenbaum and Rubin [62] and Imbens [63] assume, Cinelli and Hazlett [64] develop a method for sensitivity analysis that is agnostic to the functional form of 𝑇. Their method also allows for 𝑈 to be non-binary and for 𝑈 to be a vector, rather than just a single unobserved confounder.

Arbitrary Machine Learning Models for Parametrization of 𝑇 and 𝑌 Recall that all of the estimators that we considered in Chapter 7 allowed us to plug in arbitrary machine learning models to get model-assisted estimators. It might be attractive to have an analogous option in sensitivity analysis, potentially using the exact same models for the conditional outcome model 𝜇 and the propensity score 𝑒 that we used for estimation. And this is exactly what Veitch and Zaveri [65] give us. And they are even able to derive a closed-form expression for confounding bias, assuming the models we use for 𝜇 and 𝑒 are well-specified, something that Rosenbaum and Rubin [62] and Imbens [63] didn't do in their simple setting.

Holy Shit; There Are a Lot of Options Although we only highlighted a few options above, there are many different approaches to sensitivity analysis, and people don't agree on which ones are best. This means that sensitivity analysis is an active area of current research. See Liu et al. [66] for a review of methods that preceded 2013. Rosenbaum is another key figure in sensitivity analysis with his several different approaches [67–69]. Here is a non-exhaustive list of a few other flexible sensitivity analysis methods that you might be interested in looking into: Franks et al. [70], Yadlowsky et al. [71], Vanderweele and Arah [72], and Ding and VanderWeele [73].

[62]: Rosenbaum and Rubin (1983), 'Assessing Sensitivity to an Unobserved Binary Covariate in an Observational Study with Binary Outcome'
[63]: Imbens (2003), 'Sensitivity to Exogeneity Assumptions in Program Evaluation'
[64]: Cinelli and Hazlett (2020), 'Making sense of sensitivity: extending omitted variable bias'
[65]: Veitch and Zaveri (2020), Sense and Sensitivity Analysis: Simple Post-Hoc Analysis of Bias Due to Unobserved Confounding
[66]: Liu et al. (2013), 'An introduction to sensitivity analysis for unobserved confounding in nonexperimental prevention research'
[67]: Rosenbaum (2002), Observational Studies
[68]: Rosenbaum (2010), Design of Observational Studies
[69]: Rosenbaum (2017), Observation and Experiment
[70]: Franks et al. (2019), 'Flexible Sensitivity Analysis for Observational Studies Without Observable Implications'
[71]: Yadlowsky et al. (2020), Bounds on the conditional and average treatment effect with unobserved confounding factors
[72]: Vanderweele and Arah (2011), 'Bias formulas for sensitivity analysis of unmeasured confounding for general outcomes, treatments, and confounders'
[73]: Ding and VanderWeele (2016), 'Sensitivity Analysis Without Assumptions'
9 Instrumental Variables

In this chapter: 9.1 What is an Instrument? · 9.2 No Nonparametric Identification of the ATE · 9.3 Warm-Up: Binary Linear Setting · 9.4 Continuous Linear Setting · 9.5 Nonparametric Identification of Local ATE (New Potential Notation with Instruments; Principal Stratification; Local ATE) · 9.6 More General Settings for ATE Identification

How can we identify causal effects when we are in the presence of unobserved confounding? One popular way is to find and use instrumental variables. An instrument (instrumental variable) 𝑍 has three key qualities. It affects the treatment 𝑇, it affects 𝑌 only through 𝑇, and the effect of 𝑍 on 𝑌 is unconfounded. We depict these qualities in Figure 9.1. These qualities allow us to use 𝑍 to isolate the causal association flowing from 𝑇 to 𝑌. The intuition is that changes in 𝑍 will be reflected in 𝑇 and lead to corresponding changes in 𝑌. And these specifically 𝑍-focused changes are unconfounded (unlike the changes to 𝑇 induced by the unobserved confounder 𝑈), so they allow us to isolate the causal association that flows from 𝑇 to 𝑌.

Figure 9.1: Graph where 𝑈 is an unobserved confounder of the effect of 𝑇 on 𝑌 and 𝑍 is an instrumental variable.

9.1 What is an Instrument?

There are three main assumptions that must be satisfied for a variable 𝑍
to be considered an instrument. The first is that 𝑍 must be relevant in
the sense that it must influence 𝑇 .

Assumption 9.1 (Relevance) 𝑍 has a causal effect on 𝑇

Graphically, the relevance assumption corresponds to the existence of an


active edge from 𝑍 to 𝑇 in the causal graph. The second assumption is
known as the exclusion restriction.

Assumption 9.2 (Exclusion Restriction) 𝑍's causal effect on 𝑌 is fully mediated by 𝑇

This assumption is known as the exclusion restriction because it excludes


𝑍 from the structural equation for 𝑌 and from any other structural
equations that would make causal association flow from 𝑍 to 𝑌 without
going through 𝑇 . Graphically, this means that we’ve excluded enough
potential edges between variables in the causal graph so that all causal
paths from 𝑍 to 𝑌 go through 𝑇 . Finally, we assume that the causal effect
of 𝑍 on 𝑌 is unconfounded:

Assumption 9.3 (Instrumental Unconfoundedness) There are no back-


door paths from 𝑍 to 𝑌 .

Conditional Instruments We phrased Assumption 9.3 as uncondi-


tional unconfoundedness, but all the math for instrumental variables still
works if we have unconfoundedness conditional on observed variables as
well. We just have to make sure we condition on those relevant variables.
In this case, you might see 𝑍 referred to as a conditional instrument.

9.2 No Nonparametric Identification of the ATE

You might be wondering “if instrumental variables allow us to identify


causal effects, then why didn’t we see them back in Chapter 6 Non-
parametric Identification?” The answer is that instrumental variables
don’t nonparametrically identify the causal effect. We have nonparametric
identification when we don’t have to make any assumptions about the
parametric form. With instrumental variables, we must make assumptions
about the parametric form (e.g. linear) to identify causal effects.
We saw the following useful necessary condition for nonparametric identification in Section 6.3: For each backdoor path from 𝑇 to any child that is an ancestor of 𝑌, it is possible to block that path [18, p. 92]. And we can see in Figure 9.2 that there is a backdoor path from 𝑇 to 𝑌 that cannot be blocked: 𝑇 ← 𝑈 → 𝑌. So this necessary condition tells us that we can't use the instrument 𝑍 to nonparametrically identify the effect of 𝑇 on 𝑌.

Figure 9.2: Graph where 𝑈 is an unobserved confounder of the effect of 𝑇 on 𝑌 and 𝑍 is an instrumental variable.

[18]: Pearl (2009), Causality

9.3 Warm-Up: Binary Linear Setting

As a warm-up, we’ll start in the setting where 𝑇 and 𝑍 are binary and
where we make the parametric assumption that 𝑌 is a linear function of
𝑇 and 𝑈 :

Assumption 9.4 (Linear Outcome)

𝑌 := 𝛿𝑇 + 𝛼 𝑢 𝑈 (9.1)

The fact that 𝑍 doesn’t appear in Equation 9.1 is a consequence of the


exclusion restriction (Assumption 9.2).
Then, with this assumption in mind, we’ll try to identify the causal effect
𝛿. Because we have the intuition that 𝑍 will be useful for identifying
the effect of 𝑇 on 𝑌 , we’ll start with the associational difference for the
𝑍 -𝑌 relationship: 𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0]. By immediately applying
Assumption 9.4, we have the following:

𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0] (9.2)


= 𝔼[𝛿𝑇 + 𝛼 𝑢 𝑈 | 𝑍 = 1] − 𝔼[𝛿𝑇 + 𝛼 𝑢 𝑈 | 𝑍 = 0] (9.3)

Using linearity of expectation and rearranging a bit:

= 𝛿 (𝔼[𝑇 | 𝑍 = 1] − 𝔼[𝑇 | 𝑍 = 0]) + 𝛼 𝑢 (𝔼[𝑈 | 𝑍 = 1] − 𝔼[𝑈 | 𝑍 = 0])


(9.4)

Now, we use the instrumental unconfoundedness assumption (Assump-


tion 9.3). This means that 𝑍 and 𝑈 are independent, which allows us to
get rid of the 𝑈 term:

= 𝛿 (𝔼[𝑇 | 𝑍 = 1] − 𝔼[𝑇 | 𝑍 = 0]) + 𝛼 𝑢 (𝔼[𝑈] − 𝔼[𝑈]) (9.5)


= 𝛿 (𝔼[𝑇 | 𝑍 = 1] − 𝔼[𝑇 | 𝑍 = 0]) (9.6)

Then, we can solve for 𝛿 to get the Wald estimand:

Proposition 9.1

𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0]
𝛿= (9.7)
𝔼[𝑇 | 𝑍 = 1] − 𝔼[𝑇 | 𝑍 = 0]

Because of Assumption 9.1, we know that the denominator is non-zero, so the right-hand side isn't undefined. Then, we just plug in empirical means in place of these conditional expectations to get the Wald estimator [74]:

𝛿ˆ = ( (1/𝑛₁) Σᵢ:𝑧ᵢ=1 𝑌ᵢ − (1/𝑛₀) Σᵢ:𝑧ᵢ=0 𝑌ᵢ ) / ( (1/𝑛₁) Σᵢ:𝑧ᵢ=1 𝑇ᵢ − (1/𝑛₀) Σᵢ:𝑧ᵢ=0 𝑇ᵢ )   (9.8)

where 𝑛₁ is the number of samples where 𝑍 = 1 and 𝑛₀ is the number of samples where 𝑍 = 0.

[74]: Wald (1940), 'The Fitting of Straight Lines if Both Variables are Subject to Error'

Active reading exercise: Where did we use each of Assumptions 9.1 to 9.4 in the above derivation of Equation 9.7?
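A minimal sketch of the Wald estimator in code, assuming 1-D numpy arrays `Z`, `T`, and `Y` for a sample:

```python
# Minimal sketch: the Wald estimator (Equation 9.8).
import numpy as np

def wald_estimator(Z, T, Y):
    # Difference in mean outcomes across instrument values,
    # divided by the difference in mean treatments across instrument values.
    numerator = Y[Z == 1].mean() - Y[Z == 0].mean()
    denominator = T[Z == 1].mean() - T[Z == 0].mean()
    return numerator / denominator
```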
Causal Effects as Multiplying Path Coefficients When the structural
equations are linear, you can think of the causal association flowing from a variable 𝐴 to a variable 𝐵 as the product of the coefficients along the directed path from 𝐴 to 𝐵. If there are multiple paths, you just sum the causal associations along all those paths. However, we don't have direct access to the causal association. Rather, we can measure total association, and unblocked backdoor paths also contribute to total association, which is why 𝔼[𝑌 | 𝑇 = 1] − 𝔼[𝑌 | 𝑇 = 0] ≠ 𝛿. So how can we identify the effect of 𝑇 on 𝑌 in Figure 9.3? Because there are no backdoor paths from the instrument 𝑍 to 𝑌, we can trivially identify the effect of 𝑍 on 𝑌: 𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0] = 𝛼𝑧 𝛿. Similarly, we can identify the effect of the instrument on 𝑇: 𝔼[𝑇 | 𝑍 = 1] − 𝔼[𝑇 | 𝑍 = 0] = 𝛼𝑧. Then, we can divide the effect of 𝑍 on 𝑌 by the effect of 𝑍 on 𝑇 to identify 𝛿: (𝛼𝑧 𝛿)/𝛼𝑧 = 𝛿. And this quotient is exactly the Wald estimand in Proposition 9.1.

Figure 9.3: Graph where 𝑈 is an unobserved confounder of the effect of 𝑇 on 𝑌 and 𝑍 is an instrumental variable, with path coefficient 𝛼𝑧 on 𝑍 → 𝑇 and 𝛿 on 𝑇 → 𝑌.

9.4 Continuous Linear Setting

We’ll now consider the setting where 𝑇 and 𝑍 are continuous, rather
than binary. We’ll still assume the linear form for 𝑌 (Assumption 9.4),
which means that the causal effect of 𝑇 on 𝑌 is 𝛿. In the continuous
setting, we get the natural continuous analog of the Wald estimand:

Proposition 9.2
Cov(𝑌, 𝑍)
𝛿= (9.9)
Cov(𝑇, 𝑍)

Proof. Just as we started with 𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0] in the previous


section, here, we’ll start with the continuous analog Cov(𝑌, 𝑍). We start
with a classic covariance identity:

Cov(𝑌, 𝑍) = 𝔼[𝑌𝑍] − 𝔼[𝑌]𝔼[𝑍] (9.10)

Then, applying the linear outcome assumption (Assumption 9.4):

= 𝔼[(𝛿𝑇 + 𝛼 𝑢 𝑈)𝑍] − 𝔼[𝛿𝑇 + 𝛼 𝑢 𝑈]𝔼[𝑍] (9.11)

Distributing and rearranging:

= 𝛿𝔼[𝑇𝑍] + 𝛼 𝑢 𝔼[𝑈 𝑍] − 𝛿𝔼[𝑇]𝔼[𝑍] − 𝛼 𝑢 𝔼[𝑈]𝔼[𝑍] (9.12)


= 𝛿 (𝔼[𝑇𝑍] − 𝔼[𝑇]𝔼[𝑍]) + 𝛼 𝑢 (𝔼[𝑈 𝑍] − 𝔼[𝑈]𝔼[𝑍]) (9.13)

Now, we see that we can apply the same covariance identity again:

= 𝛿Cov(𝑇, 𝑍) + 𝛼 𝑢 Cov(𝑈 , 𝑍) (9.14)

And Cov(𝑈 , 𝑍) = 0 by the instrumental unconfoundedness assumption


(Assumption 9.3):

= 𝛿Cov(𝑇, 𝑍) (9.15)

Finally, we solve for 𝛿 :


𝛿 = Cov(𝑌, 𝑍) / Cov(𝑇, 𝑍)   (9.16)

where the relevance assumption (Assumption 9.1) tells us that the denominator is non-zero.

Active reading exercise: Where did we use the exclusion restriction assumption (Assumption 9.2) in this proof?
nominator is non-zero.
This leads us to the following natural estimator, similar to the Wald estimator:

𝛿ˆ = Ĉov(𝑌, 𝑍) / Ĉov(𝑇, 𝑍)   (9.17)

Figure 9.4: Graph where 𝑈 is an unobserved confounder of the effect of 𝑇 on 𝑌 and 𝑍 is an instrumental variable.

Another equivalent estimator is what's known as the two-stage least squares estimator (2SLS). The two stages are as follows:
1. Linearly regress 𝑇 on 𝑍 to estimate 𝔼[𝑇 | 𝑍]. This gives us the projection of 𝑇 onto 𝑍: 𝑇ˆ.
2. Linearly regress 𝑌 on 𝑇ˆ to estimate 𝔼[𝑌 | 𝑇ˆ]. Obtain our estimate 𝛿ˆ as the fitted coefficient in front of 𝑇ˆ.

There is helpful intuition that comes with the 2SLS estimator. To see this, start with the canonical instrumental variable graph we've been using (Figure 9.4). In stage one, we are projecting 𝑇 onto 𝑍 to get 𝑇ˆ as a function of only 𝑍: 𝑇ˆ = 𝔼ˆ[𝑇 | 𝑍]. Then, imagine a graph where 𝑇 is replaced with 𝑇ˆ (Figure 9.5). Because 𝑇ˆ isn't a function of 𝑈, we can think of removing the 𝑈 → 𝑇ˆ edge in this graph. Now, because there are no backdoor paths from 𝑇ˆ to 𝑌, we can get that association is causation in stage two, where we simply regress 𝑌 on 𝑇ˆ to estimate the causal effect. Note: We can also use 2SLS in the binary setting we discussed in Section 9.3.

Figure 9.5: Augmented version of Figure 9.4, where 𝑇 is replaced with 𝑇ˆ = 𝔼ˆ[𝑇 | 𝑍], which doesn't depend on 𝑈, so it no longer has an incoming edge from 𝑈.
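Here is a minimal sketch of the two stages using ordinary least squares; it only reproduces the 2SLS point estimate (naively reading standard errors off the second-stage regression would not be valid), and the array names are assumptions for the example.

```python
# Minimal sketch: two-stage least squares (2SLS) point estimate.
# Assumes 1-D numpy arrays Z (instrument), T (treatment), Y (outcome).
import numpy as np
from sklearn.linear_model import LinearRegression

def two_stage_least_squares(Z, T, Y):
    Z = Z.reshape(-1, 1)
    # Stage 1: regress T on Z to get T_hat, the projection of T onto Z
    T_hat = LinearRegression().fit(Z, T).predict(Z).reshape(-1, 1)
    # Stage 2: regress Y on T_hat; the coefficient on T_hat is the estimate of delta
    return LinearRegression().fit(T_hat, Y).coef_[0]
```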

9.5 Nonparametric Identification of Local ATE

The problem with the previous two sections is that we’ve made the strong
parametric assumption of linearity (Assumption 9.4). For example, this
assumption requires homogeneity (that the treatment effect is the same
for every unit). There are other variants that encode the homogeneity
assumption (see, e.g., Hernán and Robins [7, Section 16.3]), and they are all strong assumptions. Ideally, we'd be able to use instrumental variables for identification without making any parametric assumptions such as linearity or homogeneity. And we can. We just need to settle for a more specific causal estimand than the ATE and swap the linearity assumption out for a new assumption. We will do this in the binary setting, so both 𝑇 and 𝑍 are binary. Before we can do that, we must define a bit of new notation in Section 9.5.1 and introduce principal stratification in Section 9.5.2.

[7]: Hernán and Robins (2020), Causal Inference: What If

9.5.1 New Potential Notation with Instruments

Just like we use 𝑌(1) ≜ 𝑌(𝑇 = 1) to denote the potential outcome we would observe if we were to take treatment and 𝑌(0) ≜ 𝑌(𝑇 = 0) to denote the potential outcome we would observe if we were to not take treatment, we will define similar potential notation with instruments. We'll think of the instrument 𝑍 as encouragement for the treatment, so if we have 𝑍 = 1, we're encouraged to take the treatment, and if we have 𝑍 = 0, we're encouraged to not take the treatment. Let 𝑇(1) ≜ 𝑇(𝑍 = 1) denote the treatment we would take if we were to get instrument value 1. Similarly, let 𝑇(0) ≜ 𝑇(𝑍 = 0) denote the treatment we would take if we were to get instrument value 0.
Then, we have the same for potential outcomes where we’re intervening
on the instrument, rather than the treatment: 𝑌(𝑍 = 1) denotes the
outcome we would observe if we were to be encouraged to take the
treatment and 𝑌(𝑍 = 0) denotes the outcome we would observe if we
were to be encouraged to not take the treatment.

9.5.2 Principal Stratification

We will segment the population into four principal strata, based on the
relationship between the encouragement 𝑍 and the treatment taken 𝑇 .
There are four strata because there is one for each combination of the
values the binary variables 𝑍 and 𝑇 can take on.

Definition 9.1 (Principal Strata)


1. Compliers - always take the treatment that they’re encouraged to take.
Namely, 𝑇(1) = 1 and 𝑇(0) = 0.

2. Always-takers - always take the treatment, regardless of encouragement.


Namely, 𝑇(1) = 1 and 𝑇(0) = 1.
3. Never-takers - never take the treatment, regardless of encouragement.
Namely, 𝑇(1) = 0 and 𝑇(0) = 0.
4. Defiers - always take the opposite treatment of the treatment that they
are encouraged to take. Namely, 𝑇(1) = 0 and 𝑇(0) = 1.
Different Causal Graphs Importantly, these strata have different causal graphs. While the treatment that the compliers and defiers take depends on the encouragement (instrument), the treatment that the always-takers and never-takers take does not. Therefore, the compliers and defiers have the normal causal graph (Figure 9.6), whereas the always-takers and never-takers have the same causal graph but with the 𝑍 → 𝑇 edge removed (Figure 9.7). This means that the causal effect of 𝑍 on 𝑇 is zero for always-takers and never-takers. Then, because of the exclusion restriction, this means that the causal effect of 𝑍 on 𝑌 is zero for the always-takers and never-takers. This will be important for the upcoming derivation.

Figure 9.6: Causal graph for the compliers and defiers.
Figure 9.7: Causal graph for the always-takers and never-takers.

Can't Identify Stratum Given some observed value of 𝑍 and 𝑇, we can't actually identify which stratum we're in. There are four combinations of the binary variables 𝑍 and 𝑇; for each of these combinations, we'll note that more than one stratum is compatible with the observed combinations of values.
1. 𝑍 = 0, 𝑇 = 0. Compatible strata: compliers or never-takers
2. 𝑍 = 0, 𝑇 = 1. Compatible strata: defiers or always-takers
3. 𝑍 = 1, 𝑇 = 0. Compatible strata: defiers or never-takers
4. 𝑍 = 1, 𝑇 = 1. Compatible strata: compliers or always-takers

Active reading exercise: Ensure that you follow why these are the compatible strata for each of these combinations of observed values.
This means that we can’t identify if a given unit is a complier, a defier, an
always-taker, or a never-taker.

9.5.3 Local ATE

Although we won’t be able to use instrumental variables to nonpara-


metrically identify the ATE in the presence of unobserved confounding
(Section 9.2), we will be able to nonparametrically identify what’s known
as the local ATE. The local average treatment effect (LATE) is also known
as the complier average causal effect (CACE), as it is the ATE among the
compliers.

Definition 9.2 (Local Average Treatment Effect (LATE) / Complier


Average Causal Effect (CACE))

𝔼[𝑌(𝑇 = 1) − 𝑌(𝑇 = 0) | 𝑇(𝑍 = 1) = 1, 𝑇(𝑍 = 0) = 0] (9.18)

To identify the LATE, although we will no longer need the linearity as-
sumption (Assumption 9.4), we will need to introduce a new assumption
known as monotonicity.

Assumption 9.5 (Monotonicity)

∀𝑖, 𝑇𝑖 (𝑍 = 1) ≥ 𝑇𝑖 (𝑍 = 0) (9.19)

Monotonicity means that if we are encouraged to take the treatment


(𝑍 = 1), we are either more likely or equally likely to take the treatment
than we would be if we were encouraged to not take the treatment (𝑍 = 0).
Importantly, this means that we are assuming that there are no defiers.
This is because the compliers satisfy 𝑇(1) > 𝑇(0), the always-takers Compliers: 𝑇(1) = 1 , 𝑇(0) = 0
and never-takers satisfy 𝑇(1) = 𝑇(0), but the defiers don’t satisfy either Always-takers: 𝑇(1) = 1 , 𝑇(0) = 1
of these; among the defiers, 𝑇(1) < 𝑇(0), which is a violation of the Never-takers: 𝑇(1) = 0 , 𝑇(0) = 0
monotonicity assumption. Defiers: 𝑇(1) = 0 , 𝑇(0) = 1
We’ve now introduced the key concepts of principal strata and the
monotonicity assumption. Importantly, we saw that the causal effect of 𝑍
on 𝑌 is zero among the always-takers and never-takers (Section 9.5.2),
and we just saw that the monotonicity assumption implies that there are no
defiers. With this in mind, we are now ready to derive the nonparametric
identification result for the LATE estimand.

Theorem 9.3 (LATE Nonparametric Identification) Given that 𝑍 is an


instrument, 𝑍 and 𝑇 are binary variables, and that monotonicity holds, the
following is true:

𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0]
𝔼[𝑌(1) − 𝑌(0) | 𝑇(1) = 1, 𝑇(0) = 0] =
𝔼[𝑇 | 𝑍 = 1] − 𝔼[𝑇 | 𝑍 = 0]
(9.20)

Proof. Because we’re interested in the causal effect of 𝑇 on 𝑌 and because


we know that we'll use the instrument 𝑍, we'll start with the causal effect of
𝑍 on 𝑌 and decompose it into weighted stratum-specific causal effects
using the law of total probability:

𝔼[𝑌(𝑍 = 1) − 𝑌(𝑍 = 0)]


= 𝔼[𝑌(𝑍 = 1) − 𝑌(𝑍 = 0) | 𝑇(1) = 1 , 𝑇(0) = 0] 𝑃(𝑇(1) = 1 , 𝑇(0) = 0)
+ 𝔼[𝑌(𝑍 = 1) − 𝑌(𝑍 = 0) | 𝑇(1) = 0 , 𝑇(0) = 1] 𝑃(𝑇(1) = 0 , 𝑇(0) = 1)
+ 𝔼[𝑌(𝑍 = 1) − 𝑌(𝑍 = 0) | 𝑇(1) = 1 , 𝑇(0) = 1] 𝑃(𝑇(1) = 1 , 𝑇(0) = 1)
+ 𝔼[𝑌(𝑍 = 1) − 𝑌(𝑍 = 0) | 𝑇(1) = 0 , 𝑇(0) = 0] 𝑃(𝑇(1) = 0 , 𝑇(0) = 0)
(9.21)

The first term corresponds to the compliers, the second term corresponds to the defiers, the third term corresponds to the always-takers, and the last term corresponds to the never-takers. As we discussed in Section 9.5.2,
the causal effect of 𝑍 on 𝑌 among the always-takers and never-takers is
zero, so we can remove those terms.

= 𝔼[𝑌(𝑍 = 1) − 𝑌(𝑍 = 0) | 𝑇(1) = 1 , 𝑇(0) = 0] 𝑃(𝑇(1) = 1 , 𝑇(0) = 0)


+ 𝔼[𝑌(𝑍 = 1) − 𝑌(𝑍 = 0) | 𝑇(1) = 0 , 𝑇(0) = 1] 𝑃(𝑇(1) = 0 , 𝑇(0) = 1)
(9.22)

Because we’ve made the monotonicity assumption, we know that there


are no defiers (𝑃(𝑇(1) = 0 , 𝑇(0) = 1) = 0), so the defiers term is also zero.

= 𝔼[𝑌(𝑍 = 1) − 𝑌(𝑍 = 0) | 𝑇(1) = 1 , 𝑇(0) = 0] 𝑃(𝑇(1) = 1 , 𝑇(0) = 0)


(9.23)

Now, if we solve for this effect of 𝑍 on 𝑌 among the compliers, we get



the following:

𝔼[𝑌(𝑍 = 1) − 𝑌(𝑍 = 0) | 𝑇(1) = 1, 𝑇(0) = 0] = 𝔼[𝑌(𝑍 = 1) − 𝑌(𝑍 = 0)] / 𝑃(𝑇(1) = 1, 𝑇(0) = 0)    (9.24)
And because these are the compliers, people who will take whichever
treatment they are encouraged to take, 𝑌(𝑍 = 1) and 𝑌(𝑍 = 0) are really
equal to 𝑌(𝑇 = 1) and 𝑌(𝑇 = 0), respectively, so we can change the
left-hand side of Equation 9.24 to the LATE, the causal estimand that
we’re trying to identify:

𝔼[𝑌(𝑇 = 1) − 𝑌(𝑇 = 0) | 𝑇(1) = 1, 𝑇(0) = 0] (9.25)


= 𝔼[𝑌(𝑍 = 1) − 𝑌(𝑍 = 0)] / 𝑃(𝑇(1) = 1, 𝑇(0) = 0)    (9.26)

Now, we apply the instrumental unconfoundedness assumption


(Assumption 9.3) to identify the numerator.

= (𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0]) / 𝑃(𝑇(1) = 1, 𝑇(0) = 0)    (9.27)

All that’s left is to identify the denominator, the probability of being a


complier. However, we mentioned that we can’t identify the compliers in
Section 9.5.2, so how can we do this? This is where we’ll need to be a bit
clever. We'll get this probability by taking everyone (probability 1) and subtracting out the never-takers and the always-takers, since there are no defiers, due to monotonicity (Assumption 9.5).

= (𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0]) / (1 − 𝑃(𝑇 = 0 | 𝑍 = 1) − 𝑃(𝑇 = 1 | 𝑍 = 0))    (9.28)

To understand how we got the above equality, consider that everyone


either has 𝑍 = 1 or 𝑍 = 0. We can subtract out all of the never-takers
by removing those that had 𝑇 = 0 among the 𝑍 = 1 subpopulation
(𝑃(𝑇 = 0 | 𝑍 = 1)). Similarly, we can subtract out all of the always-takers
by removing those that had 𝑇 = 1 among the 𝑍 = 0 subpopulation
(𝑃(𝑇 = 1 | 𝑍 = 0)). We know that this removes all of the never-takers
and always-takers because there are no defiers and because we’ve looked
at both the 𝑍 = 1 subpopulation and the 𝑍 = 0 subpopulation. Now, we
just do a bit of manipulation:

= (𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0]) / (1 − (1 − 𝑃(𝑇 = 1 | 𝑍 = 1)) − 𝑃(𝑇 = 1 | 𝑍 = 0))    (9.29)
= (𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0]) / (𝑃(𝑇 = 1 | 𝑍 = 1) − 𝑃(𝑇 = 1 | 𝑍 = 0))    (9.30)

Finally, because 𝑇 is a binary variable, we can swap out probabilities of


𝑇 = 1 for expectations:

= (𝔼[𝑌 | 𝑍 = 1] − 𝔼[𝑌 | 𝑍 = 0]) / (𝔼[𝑇 | 𝑍 = 1] − 𝔼[𝑇 | 𝑍 = 0])    (9.31)

This is exactly the Wald estimand that we saw back in the linear setting
(Section 9.3) in Equation 9.7. However, this time, it is the corresponding
statistical estimand of the local ATE 𝔼[𝑌(𝑇 = 1) − 𝑌(𝑇 = 0) | 𝑇(1) =
1 , 𝑇(0) = 0], also known as the complier average causal effect (CACE). This
LATE/CACE causal estimand is in contrast to the ATE causal estimand
that we saw in Section 9.3: 𝔼[𝑌(𝑇 = 1) − 𝑌(𝑇 = 0)]. The difference
is that the complier average causal effect is the ATE specifically in the
subpopulation of compliers, rather than the total population. It’s local
(LATE) to that subpopulation, rather than being global over the whole
population like the ATE is. So we’ve seen two different assumptions that
get us to the Wald estimand with instrumental variables:
1. Linearity (or more generally homogeneity)
2. Monotonicity
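To make this concrete, here is a minimal sketch of the plug-in Wald estimator corresponding to Equation 9.31, where each conditional expectation is replaced by a sample mean. The function name and array-based interface are illustrative choices, not from any particular library.

```python
import numpy as np

def wald_estimate(z, t, y):
    """Plug-in estimate of the Wald estimand (Equation 9.31):
    (E[Y | Z=1] - E[Y | Z=0]) / (E[T | Z=1] - E[T | Z=0]).
    Under monotonicity this estimates the LATE/CACE; under
    linearity/homogeneity it estimates the ATE."""
    z, t, y = (np.asarray(a, dtype=float) for a in (z, t, y))
    numerator = y[z == 1].mean() - y[z == 0].mean()    # E[Y | Z=1] - E[Y | Z=0]
    denominator = t[z == 1].mean() - t[z == 0].mean()  # E[T | Z=1] - E[T | Z=0]
    return numerator / denominator
```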
Problems with LATE/CACE There are a few problems with the Wald
estimand for LATE, though. The first is that monotonicity might not be
satisfied in your setting of interest. The second is that, even if monotonicity
is satisfied, you might not be interested in the causal effect specifically
among the compliers, especially because you can’t even identify who the
compliers are (see Section 9.5.2). Rather, the regular ATE is often a more
useful quantity to know.

9.6 More General Settings for ATE Identification

A common, more general instrumental variable setting is to


consider that the outcome is generated by a complex function of treatment
and observed covariates plus some additive unobserved confounders:

𝑌 := 𝑓 (𝑇, 𝑊) + 𝑈 (9.32)

See, for example, Hartford et al. [75] and Xu et al. [76] for using deep learning to model 𝑓. See references in those papers for using other models, such as kernel methods, to model 𝑓. In those models, and given that 𝑈 enters the structural equation for 𝑌 additively, you can get identification with instrumental variables.
Alternatively, we could give up on point identification of causal effects,
instead settle for set identification (partial identification), and use instru-
mental variables to get bounds on causal effects. For more on that, see
Pearl [18, Section 8.2]. Additionally, settling for identifying a set, rather than a point, allows us to relax the additive noise assumption above in Equation 9.32. For example, Kilbertus et al. [77] considers the setting where 𝑈 doesn't enter the structural equation for 𝑌 additively:

𝑌 := 𝑓 (𝑇, 𝑈) (9.33)
10 Difference in Differences

Note: the following chapter is much more rough than usual and currently does not contain as many figures and intuition as the corresponding lecture.

10.1 Preliminaries

We first introduced the unconfoundedness assumption (Assumption 2.1)


in Chapter 2:
(𝑌(1), 𝑌(0)) ⊥⊥ 𝑇 (10.1)
Recall that this is equivalent to assuming that there are no unblocked
backdoor paths from 𝑇 to 𝑌 in the causal graph. When this is the case,
we have that association is causation. In other words, it gives us the
following (hopefully familiar) identification of the ATE:

𝔼[𝑌(1) − 𝑌(0)] = 𝔼[𝑌(1)] − 𝔼[𝑌(0)] (10.2)


= 𝔼[𝑌(1) | 𝑇 = 1] − 𝔼[𝑌(0) | 𝑇 = 0] (10.3)
= 𝔼[𝑌 | 𝑇 = 1] − 𝔼[𝑌 | 𝑇 = 0] (10.4)

where we used this unconfoundedness in Equation 10.3.


However, the ATE is not the only average causal effect that we might be
interested in. It is often the case that practitioners are interested in the ATE
specifically in the treated subpopulation. This is known as the average
treatment effect on the treated (ATT): 𝔼[𝑌(1) − 𝑌(0) | 𝑇 = 1]. We can make a
weaker assumption if we are only interested in the ATT, rather than the
ATE:
𝑌(0) ⊥ ⊥𝑇 (10.5)
We only have to assume that 𝑌(0) is unconfounded here, rather than that
both 𝑌(0) and 𝑌(1) are unconfounded. We show this in the following
proof:

𝔼[𝑌(1) − 𝑌(0) | 𝑇 = 1] = 𝔼[𝑌(1) | 𝑇 = 1] − 𝔼[𝑌(0) | 𝑇 = 1] (10.6)


= 𝔼[𝑌 | 𝑇 = 1] − 𝔼[𝑌(0) | 𝑇 = 1] (10.7)
= 𝔼[𝑌 | 𝑇 = 1] − 𝔼[𝑌(0) | 𝑇 = 0] (10.8)
= 𝔼[𝑌 | 𝑇 = 1] − 𝔼[𝑌 | 𝑇 = 0] (10.9)

where we used this weaker unconfoundedness in Equation 10.8.


We are generally interested in the ATT estimand with difference-in-
differences, but we will use a different identifying assumption.

10.2 Introducing Time

We will now introduce the time dimension. Using information from the
time dimension will be key for us to get identification without assuming
the usual unconfoundedness. We’ll use 𝜏 for the variable for time.
Setting As usual, we have a treatment group (𝑇 = 1) and a control
group (𝑇 = 0). However, now there is also time, and the treatment group
only gets the treatment after a certain time. So we have some time 𝜏 = 1
that denotes a time after the treatment has been administered to the
treatment group and some time 𝜏 = 0 that denotes some time before
the treatment has been administered to the treatment group. Because
the control group never gets the treatment, the control group hasn’t
received treatment at either time 𝜏 = 0 or time 𝜏 = 1. We will denote
the random variable for potential outcome under treatment 𝑡 at time
𝜏 as 𝑌𝜏 (𝑡). Then, the causal estimand we’re interested in is the average
difference in potential outcomes after treatment has been administered
(in time period 𝜏 = 1) in the treatment group:

𝔼[𝑌1 (1) − 𝑌1 (0) | 𝑇 = 1] (10.10)

In other words, we’re interested in the ATT after the treatment has been
administered.

10.3 Identification

10.3.1 Assumptions

You can just treat 𝑌1 and 𝑌0 as two different random variables. So even
though we have a time subscript now, we still have trivial identification
via consistency (recall Assumption 2.5) when the value inside of the
parenthesis for the potential outcome matches the conditioning value for
𝑇:

Assumption 10.1 (Consistency) If the treatment is 𝑇 , then the observed


outcome 𝑌𝜏 at time 𝜏 is the potential outcome under treatment 𝑇 . Formally,

∀𝜏, 𝑇 = 𝑡 =⇒ 𝑌𝜏 = 𝑌𝜏 (𝑡) (10.11)

We could write this equivalently as follows:

∀𝜏, 𝑌𝜏 = 𝑌𝜏 (𝑇) (10.12)

Consistency is what tells us that the causal estimand 𝔼[𝑌𝜏 (1) | 𝑇 =


1] equals the statistical estimand 𝔼[𝑌𝜏 | 𝑇 = 1], and, similarly, that
𝔼[𝑌𝜏 (0) | 𝑇 = 0] = 𝔼[𝑌𝜏 | 𝑇 = 0]. In contrast, 𝔼[𝑌𝜏 (1) | 𝑇 = 0] and
𝔼[𝑌𝜏 (0) | 𝑇 = 1] are counterfactual causal estimands, so consistency does
not directly identify these quantities for us. Note: In our derivations
in this chapter, we are also implicitly assuming the no interference
assumption (Assumption 2.4) extended to this setting where we have a
time subscript.

We have now arrived at the defining assumption of difference-in-differences:


parallel trends. This assumption states that the trend (over time) in the
treatment group would match the trend in the control group (over time)
if the treatment group were not given treatment.

Assumption 10.2 (Parallel Trends)

𝔼[𝑌1 (0) − 𝑌0 (0) | 𝑇 = 1] = 𝔼[𝑌1 (0) − 𝑌0 (0) | 𝑇 = 0] (10.13)

This is like an unconfoundedness assumption, but about differences of potential outcomes:

(𝑌1(0) − 𝑌0(0)) ⊥⊥ 𝑇 (10.14)

(Compare with the regular unconfoundedness we assumed for the ATT: 𝑌(0) ⊥⊥ 𝑇 (10.5 revisited).)

So you could see this as like the regular unconfoundedness we saw in Equation 10.5, but where treatment is independent of a difference of potential outcomes, rather than being independent of the potential outcomes themselves.
Then, we need one final assumption. This is the assumption that the
treatment has no effect on the treatment group before it is administered.

Assumption 10.3 (No Pretreatment Effect)

𝔼[𝑌0 (1) − 𝑌0 (0) | 𝑇 = 1] = 0 (10.15)

This assumption may seem like it's obviously true, but that isn't necessarily the case. For example, if participants anticipate the treatment, then they might be able to change their behavior before the treatment is actually administered, which would violate this no-pretreatment-effect assumption.

10.3.2 Main Result and Proof

Using the assumptions in the previous section, we can show that the
ATT is equal to the difference between the differences across time in
each treatment group. We state this mathematically in the following
proposition.

Proposition 10.1 (Difference-in-differences Identification) Given consis-


tency, parallel trends, and no pretreatment effect, we have the following:

𝔼[𝑌1(1) − 𝑌1(0) | 𝑇 = 1]
= (𝔼[𝑌1 | 𝑇 = 1] − 𝔼[𝑌0 | 𝑇 = 1]) − (𝔼[𝑌1 | 𝑇 = 0] − 𝔼[𝑌0 | 𝑇 = 0]) (10.16)

Active reading exercise: How would you estimate the statistical estimand on the right-hand side of Equation 10.16?

Proof. As usual, we start with linearity of expectation:

𝔼[𝑌1 (1) − 𝑌1 (0) | 𝑇 = 1] = 𝔼[𝑌1 (1) | 𝑇 = 1] − 𝔼[𝑌1 (0) | 𝑇 = 1] (10.17)

We can immediately identify the treated potential outcome in the treated


group using consistency:

= 𝔼[𝑌1 | 𝑇 = 1] − 𝔼[𝑌1 (0) | 𝑇 = 1] (10.18)



So we've identified the first term, but the second term remains to be identified. To do that, we'll solve for this term in the parallel trends assumption (Assumption 10.2: 𝔼[𝑌1(0) | 𝑇 = 1] − 𝔼[𝑌0(0) | 𝑇 = 1] = 𝔼[𝑌1(0) | 𝑇 = 0] − 𝔼[𝑌0(0) | 𝑇 = 0]):

𝔼[𝑌1(0) | 𝑇 = 1] = 𝔼[𝑌0(0) | 𝑇 = 1] + 𝔼[𝑌1(0) | 𝑇 = 0] − 𝔼[𝑌0(0) | 𝑇 = 0] (10.19)
We can use consistency to identify the last two terms:

= 𝔼[𝑌0 (0) | 𝑇 = 1] + 𝔼[𝑌1 | 𝑇 = 0] − 𝔼[𝑌0 | 𝑇 = 0]


(10.20)

But the first term is counterfactual. This is where we need the no pretreatment effect assumption (Assumption 10.3: 𝔼[𝑌0(1) | 𝑇 = 1] − 𝔼[𝑌0(0) | 𝑇 = 1] = 0):

= 𝔼[𝑌0(1) | 𝑇 = 1] + 𝔼[𝑌1 | 𝑇 = 0] − 𝔼[𝑌0 | 𝑇 = 0] (10.21)

Now, we can use consistency to complete the identification:

= 𝔼[𝑌0 | 𝑇 = 1] + 𝔼[𝑌1 | 𝑇 = 0] − 𝔼[𝑌0 | 𝑇 = 0] (10.22)

Now that we’ve identified 𝔼[𝑌1 (0) | 𝑇 = 1], we can plug Equation 10.22
back into Equation 10.18 to complete the proof:

𝔼[𝑌1 (1) | 𝑇 = 1] − 𝔼[𝑌1 (0) | 𝑇 = 1]


= 𝔼[𝑌1 | 𝑇 = 1] − (𝔼[𝑌0 | 𝑇 = 1] + 𝔼[𝑌1 | 𝑇 = 0] − 𝔼[𝑌0 | 𝑇 = 0])
(10.23)
= (𝔼[𝑌1 | 𝑇 = 1] − 𝔼[𝑌0 | 𝑇 = 1]) − (𝔼[𝑌1 | 𝑇 = 0] − 𝔼[𝑌0 | 𝑇 = 0])
(10.24)
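As a rough answer to the active reading exercise above: one natural estimator replaces each conditional expectation on the right-hand side of Equation 10.16 with the corresponding sample mean. Below is a minimal sketch, assuming we observe each unit's outcome, a treatment-group indicator, and a pre/post-period indicator; the names are illustrative.

```python
import numpy as np

def did_estimate(y, treated, post):
    """Plug-in estimate of the right-hand side of Equation 10.16:
    (E[Y1 | T=1] - E[Y0 | T=1]) - (E[Y1 | T=0] - E[Y0 | T=0]).

    y       : outcomes
    treated : 1 if the unit is in the treatment group (T = 1), else 0
    post    : 1 if the observation is from time tau = 1, else 0 (tau = 0)
    """
    y, treated, post = (np.asarray(a, dtype=float) for a in (y, treated, post))
    treated_trend = y[(treated == 1) & (post == 1)].mean() - y[(treated == 1) & (post == 0)].mean()
    control_trend = y[(treated == 0) & (post == 1)].mean() - y[(treated == 0) & (post == 0)].mean()
    return treated_trend - control_trend
```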

10.4 Major Problems

The first major problem with the difference-in-differences method for


causal effect estimation is that the parallel trends assumption is often not
satisfied. We can try to fix this by controlling for relevant confounders 𝑊
and trying to satisfy the controlled parallel trends assumption:

Assumption 10.4 (Controlled Parallel Trends)

𝔼[𝑌1 (0) − 𝑌0 (0) | 𝑇 = 1, 𝑊] = 𝔼[𝑌1 (0) − 𝑌0 (0) | 𝑇 = 0, 𝑊] (10.25)

This is commonly done in practice, but it still might not be possible to


satisfy this weaker version of the parallel trends assumption. For example,
if there is an interaction term between treatment 𝑇 and time 𝜏 in the
structural equation for 𝑌 , we will never have parallel trends.
Additionally, the parallel trends assumption is scale-specific. For example,
if we satisfy parallel trends, this doesn’t imply that we satisfy parallel
trends under some transformation of 𝑌 . The logarithm is one common

such transformation. This is because the parallel trends assumption


is an assumption about differences, which makes it not fully nonpara-
metric. In this sense, the parallel trends assumption is semi-parametric.
And, similarly, the difference-in-differences method is a semi-parametric
method.
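To see this scale-specificity concretely, here is a small numeric illustration with made-up group means: the untreated outcome rises by 10 in both groups, so parallel trends holds in levels, but it fails after a log transform of 𝑌.

```python
import numpy as np

# Hypothetical untreated-outcome means (these numbers are made up):
control_pre, control_post = 10.0, 20.0
treated_pre, treated_post = 20.0, 30.0   # counterfactual untreated means for the treated group

# Parallel in levels: both trends are +10
print(control_post - control_pre, treated_post - treated_pre)           # 10.0 10.0

# Not parallel after a log transform: ~0.69 vs ~0.41
print(np.log(control_post / control_pre), np.log(treated_post / treated_pre))
```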
11 Causal Discovery from Observational Data

Throughout this book, we have done causal inference, assuming we know the causal graph. What if we don't know the graph? Can we learn it? As you might expect, based on this being a running theme in this book, it will depend on what assumptions we are willing to make. We will refer to this problem as structure identification, which is distinct from the causal estimand identification that we've seen in the book up until now.

11.1 Independence-Based Causal Discovery

11.1.1 Assumptions and Theorem

The main assumption we’ve seen that relates the graph to the distribution
is the Markov assumption. The Markov assumption tells us if variables are
d-separated in the graph 𝐺 , then they are independent in the distribution
𝑃 (Theorem 3.1):

𝑋 ⊥⊥𝐺 𝑌 | 𝑍 =⇒ 𝑋 ⊥⊥𝑃 𝑌 | 𝑍 (3.20 revisited)

Maybe we can detect independencies in the data and then use that
to infer the causal graph. However, going from independencies in the
distribution 𝑃 to d-separations in the graph 𝐺 isn’t something that the
Markov assumption gives us (see Equation 3.20 above). Rather, we need
the converse of the Markov assumption. This is known as the faithfulness
assumption.

Assumption 11.1 (Faithfulness)

𝑋 ⊥⊥𝐺 𝑌 | 𝑍 ⇐= 𝑋 ⊥⊥𝑃 𝑌 | 𝑍 (11.1)

This assumption allows us to infer d-separations in the graph from


independencies in the distribution. Faithfulness, along with the Markov
assumption, actually implies minimality (Assumption 3.2), so it is a
stronger assumption. Faithfulness is a much less attractive assumption
than the Markov assumption because it is easy to think of counterexam-
ples (where two variables are independent in 𝑃 , but there are unblocked
paths between them in 𝐺).

Faithfulness Counterexample Consider 𝐴 and 𝐷 in the causal graph with coefficients in Figure 11.1 (𝐴 → 𝐵 with coefficient 𝛼, 𝐴 → 𝐶 with coefficient 𝛾, 𝐵 → 𝐷 with coefficient 𝛽, and 𝐶 → 𝐷 with coefficient 𝛿). We have a violation of faithfulness when the 𝐴 → 𝐵 → 𝐷 path cancels out the 𝐴 → 𝐶 → 𝐷 path. To concretely see how this could happen, consider the SCM that this graph represents:

𝐵 := 𝛼𝐴 (11.2)
𝐶 := 𝛾𝐴 (11.3)
𝐷 := 𝛽𝐵 + 𝛿𝐶 (11.4)

[Figure 11.1: Faithfulness counterexample graph.]

We can solve for the dependence between 𝐴 and 𝐷 by plugging in for 𝐵


and 𝐶 in Equation 11.4 to get the following:

𝐷 = (𝛼𝛽 + 𝛾𝛿)𝐴 (11.5)

This means that the association flowing from 𝐴 to 𝐷 is 𝛼𝛽 + 𝛾𝛿 in this


example. The two paths would cancel if 𝛼𝛽 = −𝛾𝛿, which would make 𝐴 ⊥⊥ 𝐷. This violation of faithfulness would incorrectly lead us to
believe that there are no paths between 𝐴 and 𝐷 in the graph.
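We can also check this cancellation numerically. The following sketch simulates the SCM in Equations 11.2 to 11.4 with coefficients chosen so that 𝛼𝛽 = −𝛾𝛿 (independent noise terms are added only to keep the variables non-degenerate, which is our own addition); the sample correlation between 𝐴 and 𝐷 comes out near zero even though there are unblocked paths between them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Coefficients chosen so that alpha * beta = -gamma * delta (the paths cancel)
alpha, beta, gamma, delta = 2.0, 1.5, 3.0, -1.0

a = rng.normal(size=n)
b = alpha * a + rng.normal(size=n)             # B := alpha * A (+ noise)
c = gamma * a + rng.normal(size=n)             # C := gamma * A (+ noise)
d = beta * b + delta * c + rng.normal(size=n)  # D := beta * B + delta * C (+ noise)

# Close to zero, so an independence test would (incorrectly) suggest
# that there is no unblocked path between A and D.
print(np.corrcoef(a, d)[0, 1])
```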
In addition to faithfulness, many methods also assume that there are no
unobserved confounders, which is known as causal sufficiency.

Assumption 11.2 (Causal Sufficiency) There are no unobserved con-


founders of any of the variables in the graph.

Then, under the Markov, faithfulness, causal sufficiency, and acyclicity


assumptions, we can partially identify the causal graph. We can’t com-
pletely identify the causal graph because different graphs correspond
to the same set of independencies. For example, consider the graphs in
Figure 11.2.

[Figure 11.2: Three Markov equivalent graphs. (a) Chain directed to the right: 𝑋1 → 𝑋2 → 𝑋3. (b) Chain directed to the left: 𝑋1 ← 𝑋2 ← 𝑋3. (c) Fork: 𝑋1 ← 𝑋2 → 𝑋3.]

Although these are all distinct graphs, they correspond to the same set
of independence/dependence assumptions. Recall from Section 3.5 that
𝑋1 ⊥ ⊥ 𝑋3 | 𝑋2 in distributions that are Markov with respect to any of these
three graphs in Figure 11.2. We also saw that minimality told us that 𝑋1
and 𝑋2 are dependent and that 𝑋2 and 𝑋3 are dependent. And the stronger
faithfulness assumption additionally tells us that in any distributions
that are faithful with respect to any of these graphs, 𝑋1 and 𝑋3 are
dependent if we don’t condition on 𝑋2 . So using the presence/absence
of (conditional) independencies in the data isn’t enough to distinguish
these three graphs from each other; these graphs are Markov equivalent. We say that two graphs are Markov equivalent if they correspond to the same set of conditional independencies. Given a graph, we refer to its Markov equivalence class as the set of graphs that encode the same conditional independencies. Under faithfulness, we are able to identify a graph from conditional independencies in the data if it is the only graph in its Markov equivalence class. An example of a graph that is the only one in its Markov equivalence class is the basic immorality that we show in Figure 11.3. Recall from Section 3.6 that immoralities are distinct from the two other basic graphical building blocks (chains and forks) in that in Figure 11.3, 𝑋1 is (unconditionally) independent of 𝑋3, and 𝑋1 and 𝑋3 become dependent if we condition on 𝑋2. This means that while the basic chains and fork in Figure 11.2 are in the same Markov equivalence class, the basic immorality is by itself in its own Markov equivalence class.

[Figure 11.3: Immoralities are in their own Markov equivalence class: 𝑋1 → 𝑋2 ← 𝑋3.]

We’ve seen that we can identify the causal graph if it’s a basic immorality,
but what else can we identify? We saw that chains and forks are all in
the same Markov equivalence class, but that doesn’t mean that we can’t
get any information from distributions that are Markov and faithful with
respect to those graphs. What do all the chains and forks in Figure 11.2
have in common? They share the same skeleton. A graph's skeleton is the structure we get if we replace all of its directed edges with undirected edges. We depict the skeleton of a basic chain and a basic fork in Figure 11.4.

[Figure 11.4: Chain/fork skeleton: 𝑋1 − 𝑋2 − 𝑋3.]
A graph's skeleton also gives us important conditional independence information that we can use to distinguish it from graphs with different skeletons. For example, if we add an 𝑋1 → 𝑋3 edge to the chain in Figure 11.2a, we get the complete graph in Figure 11.5 (recall that a complete graph is one where there is an edge connecting every pair of nodes). In this graph, unlike in a chain or fork graph, 𝑋1 and 𝑋3 are not independent when we condition on 𝑋2. So this graph is not in the same Markov equivalence class as the chains and fork in Figure 11.2. And we can see that graphically by the fact that this graph has a different skeleton than those graphs (this graph has an additional edge between 𝑋1 and 𝑋3).

[Figure 11.5: Complete graph on 𝑋1, 𝑋2, 𝑋3.]
To recap, we’ve pointed out two structural qualities that we can use to
distinguish graphs from each other:
1. Immoralities
2. Skeleton
And it turns out that we can determine whether graphs are in the same or
different Markov equivalence classes using these two structural qualities,
due to a result by Verma and Pearl [78] and Frydenberg [79]:

Proposition 11.1 (Markov Equivalence via Immoral Skeletons) Two graphs are Markov equivalent if and only if they have the same skeleton and same immoralities.

This means that, using conditional independencies in the data, we cannot


distinguish graphs that have the same skeletons and same immoralities.
For example, we cannot distinguish the two-node graph 𝑋 → 𝑌 from
𝑋 ← 𝑌 using just conditional independence information (active reading exercise: check that these graphs encode the same conditional independencies). But we can hope to learn the graph's skeleton and immoralities; this is known as the essential graph or CPDAG (Completed Partially Directed Acyclic Graph). One popular algorithm for learning the essential graph is the PC algorithm.
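Proposition 11.1 is straightforward to operationalize when we already have candidate graphs in hand. Below is a minimal sketch that checks Markov equivalence of two DAGs by comparing skeletons and immoralities; the DAG representation (a dict mapping each node to its set of parents) and the function names are our own illustrative choices.

```python
from itertools import combinations

def skeleton(dag):
    """Undirected edge set of a DAG given as a dict: node -> set of parents."""
    return {frozenset((p, c)) for c, parents in dag.items() for p in parents}

def immoralities(dag):
    """Set of immoralities (x, z, y): x -> z <- y with no edge between x and y."""
    skel = skeleton(dag)
    result = set()
    for z, parents in dag.items():
        for x, y in combinations(sorted(parents), 2):
            if frozenset((x, y)) not in skel:
                result.add((x, z, y))
    return result

def markov_equivalent(dag1, dag2):
    """Proposition 11.1: same skeleton and same immoralities."""
    return skeleton(dag1) == skeleton(dag2) and immoralities(dag1) == immoralities(dag2)

# The chain X1 -> X2 -> X3 and the fork X1 <- X2 -> X3 are Markov equivalent,
# but the immorality X1 -> X2 <- X3 is not equivalent to either.
chain = {"X1": set(), "X2": {"X1"}, "X3": {"X2"}}
fork = {"X2": set(), "X1": {"X2"}, "X3": {"X2"}}
collider = {"X1": set(), "X3": set(), "X2": {"X1", "X3"}}
print(markov_equivalent(chain, fork))      # True
print(markov_equivalent(chain, collider))  # False
```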

11.1.2 The PC Algorithm

PC [80] starts with a complete undirected graph and then trims it down and orients edges via three steps:
1. Identify the skeleton.
2. Identify immoralities and orient them.
3. Orient qualifying edges that are incident on colliders.
We'll use the true graph in Figure 11.6 as a concrete example as we explain each of these steps.

[Figure 11.6: True graph for PC example: 𝐴 → 𝐶 ← 𝐵, 𝐶 → 𝐷, 𝐶 → 𝐸.]

Identify the Skeleton We discover the skeleton by starting with a


complete graph (Figure 11.7a) and then removing edges 𝑋 − 𝑌 where
𝑋⊥ ⊥ 𝑌 | 𝑍 for some (potentially empty) conditioning set 𝑍 . So in our
example, we would start with the empty conditioning set and discover
that 𝐴 ⊥ ⊥ 𝐵 (since the only path from 𝐴 to 𝐵 in Figure 11.6 is blocked
by the collider 𝐶 ); this means we can remove the 𝐴 − 𝐵 edge, which
gives us the graph in Figure 11.7b. Then, we would move to conditioning
sets of size one and find that conditioning on 𝐶 tells us that every other
pair of variables is conditionally independent given 𝐶 , which allows
us to remove all edges that aren’t incident on 𝐶 , resulting in the graph
in Figure 11.7c. And, indeed, this is the skeleton of the true graph in
Figure 11.6. More generally, PC would continue with larger conditioning sets to see if we can remove more edges, but conditioning sets of size one are enough to discover the skeleton in this example.

[Figure 11.7: Illustration of the process of step 1 of PC, where we start with the complete graph (left) and remove edges until we've identified the skeleton of the graph (right), given that the true graph is the one in Figure 11.6. (a) Complete undirected graph that we start with. (b) Undirected graph that remains after removing 𝑋 − 𝑌 edges where 𝑋 ⊥⊥ 𝑌. (c) Undirected graph that remains after removing 𝑋 − 𝑌 edges where 𝑋 ⊥⊥ 𝑌 | 𝑍.]

Identifying the Immoralities Now for any paths 𝑋 − 𝑍 − 𝑌 in our working graph where we discovered that there is no edge between 𝑋 and 𝑌 in our previous step, if 𝑍 was not in the conditioning set that makes 𝑋 and 𝑌 conditionally independent, then we know 𝑋 − 𝑍 − 𝑌 forms an immorality. In other words, this means that 𝑋 ⊥̸⊥ 𝑌 | 𝑍, which is a property of an immorality that distinguishes it from chains and forks (Section 3.6), so we can orient these edges to get 𝑋 → 𝑍 ← 𝑌. In our example, this takes us from Figure 11.7c to Figure 11.8.

[Figure 11.8: Graph from PC after we've oriented the immoralities.]

Orienting Qualifying Edges Incident on Colliders In the final step, we take advantage of the fact that we might be able to orient more edges since we know we discovered all of the immoralities in the previous step. Any edge 𝑍 − 𝑌 that is part of a partially directed path of the form 𝑋 → 𝑍 − 𝑌, where there is no edge connecting 𝑋 and 𝑌, can be oriented as 𝑍 → 𝑌 (this is called orientation propagation). This is because if the true graph has the edge 𝑍 ← 𝑌, we would have found this in the previous step as that would have formed an immorality 𝑋 → 𝑍 ← 𝑌. Since we didn't find that immorality in the previous step, we know that the true direction is 𝑍 → 𝑌. In our example, this means we can orient the final two remaining edges, taking us from Figure 11.8 to Figure 11.9. It turns out that in this example, we are lucky that we can orient all of the remaining edges in this last step, but this is not the case in general. For example, we discussed that we wouldn't be able to distinguish simple chain graphs and simple fork graphs from each other.

[Figure 11.9: Graph from PC after we've oriented edges that would form immoralities if they were oriented in the other (incorrect) direction.]
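To make step 1 more concrete, here is a rough sketch of the skeleton-identification phase for continuous data, using a crude partial-correlation check in place of a proper conditional independence test. This is only an illustration of the idea: real implementations use formal hypothesis tests, restrict conditioning sets to the current neighbors of each pair, and record the separation sets for step 2. All names below are our own.

```python
import numpy as np
from itertools import combinations

def looks_independent(data, i, j, cond, threshold=0.02):
    """Crude conditional-independence check: regress columns i and j on the
    conditioning set and see whether the residuals are (nearly) uncorrelated."""
    x, y = data[:, i], data[:, j]
    if cond:
        Z = np.column_stack([np.ones(data.shape[0]), data[:, list(cond)]])
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return abs(np.corrcoef(x, y)[0, 1]) < threshold

def pc_skeleton(data, max_cond_size=2):
    """Step 1 of PC: start from the complete undirected graph and remove the
    edge i - j whenever some conditioning set makes i and j look independent.
    Returns the remaining edges and the separation sets (used in step 2)."""
    d = data.shape[1]
    edges = {(i, j) for i in range(d) for j in range(i + 1, d)}
    sepsets = {}
    for size in range(max_cond_size + 1):
        for (i, j) in sorted(edges):
            others = [k for k in range(d) if k not in (i, j)]
            for cond in combinations(others, size):
                if looks_independent(data, i, j, cond):
                    edges.discard((i, j))
                    sepsets[(i, j)] = set(cond)
                    break
    return edges, sepsets
```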

Dropping Assumptions There are algorithms that allow us to drop


various assumptions. The FCI (Fast Causal Inference) algorithm [80] works without assuming causal sufficiency (Assumption 11.2). The CCD algorithm [81] works without assuming acyclicity. And there is various work on SAT-based causal discovery that allows us to drop both of the above assumptions [82, 83].

Hardness of Conditional Independence Testing All methods that rely on conditional independence tests, such as PC, FCI, SAT-based algorithms, etc., have an important practical issue associated with them. Conditional independence tests are hard, and it can sometimes require a lot of data to get accurate test results [84]. If we have infinite data, this isn't an issue, but we don't have infinite data in practice.

11.1.3 Can We Get Any Better Identification?

We’ve seen that assuming the Markov assumption and faithfulness can
only get us so far; with those assumptions, we can only identify a graph
up to its Markov equivalence class. If we make more assumptions, can
we identify the graph more precisely than just its Markov equivalence
class?
Well, if we are in the case where the distributions are multinomial, we
cannot [85]. Or if we are in the common toy case where the SCMs are linear with Gaussian noise, we cannot [86]. So we have the following completeness result due to Geiger and Pearl [86] and Meek [85]:

Theorem 11.2 (Markov Completeness) If we have multinomial distribu-


tions or linear Gaussian structural equations, we can only identify a graph
up to its Markov equivalence class.

What if we don’t have multinomial distributions and don’t have linear


Gaussian SCMs, though?

11.2 Semi-Parametric Causal Discovery

In Theorem 11.2, we saw that, if we are in the linear Gaussian setting,


the best we can do is identify the Markov equivalence class; we cannot
hope to identify graphs that are in non-singleton Markov equivalence
classes. But what if we aren’t in the linear Gaussian setting? Can we
identify graphs if we are not in the linear Gaussian setting? We consider
the linear non-Gaussian noise setting in Section 11.2.2 and the nonlinear
additive noise setting in Section 11.2.3. It turns out that in both of these
settings, we can identify the causal graph. And we don’t have to assume
faithfulness (Assumption 11.1) in these settings.
By considering these settings, we are making semi-parametric assump-
tions (about functional form). If we don’t make any assumptions about
functional form, we cannot even identify the direction of the edge in a
two-node graph. We emphasize this in the next section before moving on
to the semi-parametric assumptions that allow us to identify the graph.

11.2.1 No Identifiability Without Parametric Assumptions

Markov Perspective Consider the two-variable setting, where the two


options of causal graphs are 𝑋 → 𝑌 and 𝑋 ← 𝑌 . Note that these
two graphs are Markov equivalent. Both don’t encode any conditional
independence assumptions, so both can describe arbitrary distributions
𝑃(𝑥, 𝑦). This means that conditional independencies in the data cannot
help us distinguish between 𝑋 → 𝑌 and 𝑋 ← 𝑌 . Using conditional
independencies, the best we can do is discover the corresponding essential
graph 𝑋 − 𝑌 .
SCMs Perspective How about if we consider this problem from the
perspective of SCMs; can we somehow distinguish 𝑋 → 𝑌 from 𝑋 ← 𝑌
using SCMs? For an SCM, we want to write one variable as a function of
the other variable and some noise term variable. As you might expect,
if we don’t make any assumptions, there exist SCMs with the implied
causal graph 𝑋 → 𝑌 and SCMs with the implied causal graph 𝑋 ← 𝑌
that both generate data according to 𝑃(𝑥, 𝑦).

Proposition 11.3 (Non-Identifiability of Two-Node Graphs) For every


joint distribution 𝑃(𝑥, 𝑦) on two real-valued random variables, there is an
SCM in either direction that generates data consistent with 𝑃(𝑥, 𝑦).
Mathematically, there exists a function 𝑓𝑌 such that

𝑌 = 𝑓𝑌(𝑋, 𝑈𝑌), 𝑋 ⊥⊥ 𝑈𝑌 (11.6)

and there exists a function 𝑓𝑋 such that

𝑋 = 𝑓𝑋(𝑌, 𝑈𝑋), 𝑌 ⊥⊥ 𝑈𝑋 (11.7)

where 𝑈𝑌 and 𝑈𝑋 are real-valued random variables.

See, e.g., Peters et al. [14, p. 44] for a short proof. Similarly, this non-identifiability result can be extended to more general graphs that have more than two variables [see, e.g., 14, p. 135].
However, if we make assumptions about the parametric form of the
SCM, we can distinguish 𝑋 → 𝑌 from 𝑋 ← 𝑌 and identify graphs more
generally. That’s what we’ll see in the rest of this chapter.

11.2.2 Linear Non-Gaussian Noise

We saw in Theorem 11.2 that we cannot distinguish graphs within the


same Markov equivalence class if the structural equations are linear with
Gaussian noise 𝑈 . For example, this means that we cannot distinguish
𝑋 → 𝑌 from 𝑋 ← 𝑌 . However, if the noise term is non-Gaussian, then
we can identify the causal graph. As usual, we give this key assumption
of non-Gaussianity its own box:

Assumption 11.3 (Linear Non-Gaussian) All structural equations (causal



mechanisms that generate the data) are of the following form:

𝑌 := 𝑓 (𝑋) + 𝑈 (11.8)

where 𝑓 is a linear function, 𝑋 ⊥⊥ 𝑈, and 𝑈 is distributed as a non-Gaussian random variable.
Then, in this linear non-Gaussian setting, we can identify which of
graphs 𝑋 → 𝑌 and 𝑋 ← 𝑌 is the true causal graph. We’ll first present
the theorem and proof and then get to the intuition.

Theorem 11.4 (Identifiability in Linear Non-Gaussian Setting) In the


linear non-Gaussian setting, if the true SCM is

𝑌 := 𝑓 (𝑋) + 𝑈 , 𝑋 ⊥⊥ 𝑈 , (11.9)

then, there does not exist an SCM in the reverse direction

𝑋 := 𝑔(𝑌) + 𝑈̃, 𝑌 ⊥⊥ 𝑈̃ (11.10)
that can generate data consistent with 𝑃(𝑥, 𝑦).

Proof. We'll first introduce an important result from Darmois [87] and Skitovich [88] that we'll use to prove this theorem:

Theorem 11.5 (Darmois-Skitovich) Let 𝑋1, . . . , 𝑋𝑛 be independent, non-degenerate random variables. If there exist coefficients 𝛼1, . . . , 𝛼𝑛 and 𝛽1, . . . , 𝛽𝑛 that are all non-zero such that the two linear combinations
𝐵 = 𝛽 1 𝑋1 + . . . + 𝛽 𝑛 𝑋 𝑛

are independent, then each 𝑋𝑖 is normally distributed.

We will use the contrapositive of the special case of this theorem for
𝑛 = 2 to do almost all of the work for this proof:

Corollary 11.6 If either of the independent random variables 𝑋1 or 𝑋2 is


non-Gaussian, then there are no linear combinations

𝐴 = 𝛼 1 𝑋1 + 𝛼 2 𝑋2 and
𝐵 = 𝛽 1 𝑋1 + 𝛽 2 𝑋2

such that 𝐴 and 𝐵 are independent (so 𝐴 and 𝐵 must be dependent).

Proof Outline With the above corollary in mind, our proof strategy is
to write 𝑌 and 𝑈 ˜ as linear combinations of 𝑋 and 𝑈 . By doing this, we
are effectively mapping our variables in Equations 11.9 and 11.10 onto the
variables in the corollary as follows: 𝑌 onto 𝐴, 𝑈̃ onto 𝐵, 𝑋 onto 𝑋1, and 𝑈 onto 𝑋2. Then, we can apply the above corollary of the Darmois-Skitovich Theorem to have that 𝑌 and 𝑈̃ must be dependent, which violates the
reverse direction SCM in Equation 11.10. We now proceed with the proof.

We already have that we can write 𝑌 as a linear combination of 𝑋 and


𝑈 , since we’ve assumed the true structural equation in Equation 11.9 is
linear:
𝑌 = 𝛿𝑋 + 𝑈 (11.11)

Then, to get 𝑈̃ as a linear combination of 𝑋 and 𝑈, we take the hypothesized reverse SCM

𝑋 = 𝛿̃𝑌 + 𝑈̃ (11.12)
from Equation 11.10, solve for 𝑈̃, and plug in Equation 11.11 for 𝑌:

𝑈̃ = 𝑋 − 𝛿̃𝑌 (11.13)
= 𝑋 − 𝛿̃(𝛿𝑋 + 𝑈) (11.14)
= (1 − 𝛿̃𝛿)𝑋 − 𝛿̃𝑈 (11.15)

Therefore, we’ve written both 𝑌 and 𝑈 ˜ as linear combinations of the


independent random variables 𝑋 and 𝑈 . This allows us to apply Corol-
lary 11.6 of the Darmois-Skitovich Theorem to get that 𝑌 and 𝑈̃ must be dependent: 𝑌 ⊥̸⊥ 𝑈̃. This violates the reverse direction SCM:

𝑋 := 𝑔(𝑌) + 𝑈̃, 𝑌 ⊥⊥ 𝑈̃ (11.10 revisited)

We’ve given the proof here for just two variables, but it can be extended
to the more general setting with multiple variables (see [89] and [14, Section 7.1.4]).

When we fit the data in the causal direction, we get residuals that are
independent of the input variable, but when we fit the data in the
anti-causal direction, we get residuals that are dependent on the input
variable. We depict the regression line 𝑓ˆ we get if we linearly regress 𝑌
on 𝑋 (causal direction) in Figure 11.10a, and we depict the regression
line 𝑔ˆ we get if we linearly regress 𝑋 on 𝑌 (anti-causal direction) in
Figure 11.10b. Just from these fits, you can see that the forward model (fit
in the causal direction) looks more pleasing than the backward model
(fit in the anti-causal direction).
To make this graphical intuition more clear, we plot the residuals of the forward model 𝑓ˆ (causal direction) and the backward model 𝑔ˆ (anti-causal direction) in Figure 11.11. (Recall the forward model SCM: 𝑌 := 𝑓(𝑋) + 𝑈, 𝑋 ⊥⊥ 𝑈 (11.9 revisited), and the backward model SCM: 𝑋 := 𝑔(𝑌) + 𝑈̃, 𝑌 ⊥⊥ 𝑈̃ (11.10 revisited).) The residuals in the forward direction correspond to 𝑈̂ = 𝑌 − 𝑓ˆ(𝑋), and the residuals in the backward direction correspond to 𝑋 − 𝑔ˆ(𝑌). As you can see in Figure 11.11a, the residuals of the forward model look independent of the input variable 𝑋 (on the x-axis). However, in Figure 11.11b, the residuals of the backward model don't look independent of the input variable 𝑌 (on the x-axis) at all. Clearly, the range of the residuals (on the vertical axis) changes as we move across values of 𝑌 (from left to right).

[Figure 11.10: Linear fits (in both directions) of the linear non-Gaussian data. (a) Causal direction fit: linear fit that results from regressing 𝑌 on 𝑋. (b) Anti-causal direction fit: linear fit that results from regressing 𝑋 on 𝑌.]

[Figure 11.11: Residuals of linear models (in both directions) fit to the linear non-Gaussian data. (a) Causal direction residuals: residuals that result from linearly regressing 𝑌 on 𝑋. (b) Anti-causal direction residuals: residuals that result from linearly regressing 𝑋 on 𝑌.]
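The following sketch reproduces this intuition numerically on simulated linear non-Gaussian data (a uniform cause and uniform noise, which are our own choices for illustration). Since least-squares residuals are uncorrelated with the regressor by construction, the check below correlates magnitudes as a crude probe of dependence beyond correlation; a proper analysis would use an independence test such as HSIC.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(-1, 1, n)         # cause (non-Gaussian)
u = rng.uniform(-0.5, 0.5, n)     # non-Gaussian noise
y = x + u                         # true SCM: Y := X + U

def linear_residuals(inp, out):
    """Fit out ~ slope * inp + intercept by least squares; return residuals."""
    slope, intercept = np.polyfit(inp, out, 1)
    return out - (slope * inp + intercept)

res_forward = linear_residuals(x, y)    # regress Y on X (causal direction)
res_backward = linear_residuals(y, x)   # regress X on Y (anti-causal direction)

# Check for dependence beyond correlation by correlating magnitudes:
print(np.corrcoef(np.abs(x), np.abs(res_forward))[0, 1])   # roughly 0
print(np.corrcoef(np.abs(y), np.abs(res_backward))[0, 1])  # clearly nonzero (negative)
```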

11.2.3 Nonlinear Models

Nonlinear Additive Noise Setting We can also get identifiability of


the causal graph in the nonlinear additive noise setting [90, 91]. This requires the nonlinear additive noise assumption (below) and other more technical assumptions that we refer you to Hoyer et al. [90] and Peters et al. [91] for.

Assumption 11.4 (Nonlinear Additive Noise) All causal mechanisms are


nonlinear where the noise enters additively. Mathematically,

∀𝑖 , 𝑋𝑖 := 𝑓 (pa𝑖 ) + 𝑈 𝑖 (11.16)

where 𝑓 is nonlinear and pa𝑖 denotes the parents of 𝑋𝑖 .

Post-Nonlinear Setting What if you don't believe that the noise realistically enters additively? This motivates post-nonlinear models, where
there is another nonlinear transformation after adding the noise as in

Assumption 11.5 below. This setting can also yield identifiability (under
another technical condition). See Zhang and Hyvärinen [92] for more details.

Assumption 11.5 (Post-Nonlinear)

∀𝑖 , 𝑋𝑖 := 𝑔( 𝑓 (pa𝑖 ) + 𝑈 𝑖 ) (11.17)

where 𝑓 is nonlinear and pa𝑖 denotes the parents of 𝑋𝑖 .

11.3 Further Resources

We conclude this chapter by pointing you to some relevant resources


for where to start learning more (in addition to the references in this
chapter). These references were also used as inspiration when forming
this chapter. See Eberhardt [93] and Glymour et al. [94] for two great review articles from people at the frontier of causal discovery research. And then if you want a whole book on this stuff, Peters et al. [14] wrote a popular one!
12 Causal Discovery from Interventional Data

12.1 Structural Interventions

12.1.1 Single-Node Interventions

Coming Soon

12.1.2 Multi-Node Interventions

Coming Soon

12.2 Parametric Interventions

12.2.1 Coming Soon

12.3 Interventional Markov Equivalence

12.3.1 Coming Soon

12.4 Miscellaneous Other Settings

12.4.1 Coming Soon


13 Transfer Learning and Transportability

13.1 Causal Insights for Transfer Learning

13.1.1 Coming Soon

13.2 Transportability of Causal Effects Across Populations

13.2.1 Coming Soon


14 Counterfactuals and Mediation

14.1 Counterfactuals Basics

14.1.1 Coming Soon

14.2 Important Application: Mediation

14.2.1 Coming Soon


Appendix
A Proofs

A.1 Proof of Equation 6.1 from Section 6.1

Claim Given the causal graph in Figure A.1, 𝑃(𝑚 | do(𝑡)) = 𝑃(𝑚 | 𝑡).

Proof. We first apply the Bayesian network factorization (Definition 3.1):

𝑃(𝑤, 𝑡, 𝑚, 𝑦) = 𝑃(𝑤) 𝑃(𝑡 | 𝑤) 𝑃(𝑚 | 𝑡) 𝑃(𝑦 | 𝑤, 𝑚) (A.1)

Next, we apply the truncated factorization (Proposition 4.1):

𝑃(𝑤, 𝑚, 𝑦 | do(𝑡)) = 𝑃(𝑤) 𝑃(𝑚 | 𝑡) 𝑃(𝑦 | 𝑤, 𝑚) (A.2)

[Figure A.1: Causal graph where 𝑊 is unobserved, so we cannot block the backdoor path 𝑇 ← 𝑊 → 𝑌 (edges: 𝑊 → 𝑇, 𝑊 → 𝑌, 𝑇 → 𝑀, 𝑀 → 𝑌).]

Finally, we marginalize out 𝑤 and 𝑦:
∑𝑤 ∑𝑦 𝑃(𝑤, 𝑚, 𝑦 | do(𝑡)) = ∑𝑤 ∑𝑦 𝑃(𝑤) 𝑃(𝑚 | 𝑡) 𝑃(𝑦 | 𝑤, 𝑚) (A.3)
𝑃(𝑚 | do(𝑡)) = ∑𝑤 𝑃(𝑤) 𝑃(𝑚 | 𝑡) ( ∑𝑦 𝑃(𝑦 | 𝑤, 𝑚) ) (A.4)
= 𝑃(𝑚 | 𝑡) (A.5)

A.2 Proof of Propensity Score Theorem (7.1)

Claim (𝑌(1), 𝑌(0)) ⊥⊥ 𝑇 | 𝑊 =⇒ (𝑌(1), 𝑌(0)) ⊥⊥ 𝑇 | 𝑒(𝑊).

Proof. Assuming (𝑌(1), 𝑌(0)) ⊥ ⊥ 𝑇 | 𝑊 , we will prove (𝑌(1), 𝑌(0)) ⊥⊥ 𝑇 |


𝑒(𝑊) by showing that 𝑃(𝑇 = 1 | 𝑌(𝑡), 𝑒(𝑊)) does not depend on 𝑌(𝑡),
where 𝑌(𝑡) is either potential outcome.

First, because 𝑇 is binary, we can turn this probability into an expectation:

𝑃(𝑇 = 1 | 𝑌(𝑡), 𝑒(𝑊)) = 𝔼[𝑇 | 𝑌(𝑡), 𝑒(𝑊)] (A.6)

Then, using the law of iterated expectations, we can introduce 𝑊 :

= 𝔼 [𝔼[𝑇 | 𝑌(𝑡), 𝑒(𝑊), 𝑊] | 𝑌(𝑡), 𝑒(𝑊)] (A.7)



Because we have now conditioned on all of 𝑊 and 𝑒(𝑊) is a function of


𝑊 , it is redundant, so we can remove 𝑒(𝑊) from the inner expectation:

= 𝔼 [𝔼[𝑇 | 𝑌(𝑡), 𝑊] | 𝑌(𝑡), 𝑒(𝑊)] (A.8)

Now, we apply the unconfoundedness assumption we started with to


remove 𝑌(𝑡) from the inner expectation:

= 𝔼 [𝔼[𝑇 | 𝑊] | 𝑌(𝑡), 𝑒(𝑊)] (A.9)

Again, using the fact that 𝑇 is binary, we can reduce the inner expectation to 𝑃(𝑇 = 1 | 𝑊), which is 𝑒(𝑊), something that is already conditioned on:

= 𝔼 [𝑃(𝑇 = 1 | 𝑊) | 𝑌(𝑡), 𝑒(𝑊)] (A.10)


= 𝔼 [𝑒(𝑊) | 𝑌(𝑡), 𝑒(𝑊)] (A.11)
= 𝑒(𝑊) (A.12)

Because this does not depend on 𝑌(𝑡), we’ve proven that 𝑇 is independent
of 𝑌(𝑡) given 𝑒(𝑊).

A.3 Proof of IPW Estimand (7.18)


Claim Under unconfoundedness and positivity, 𝔼[𝑌(𝑡)] = 𝔼[1(𝑇 = 𝑡) 𝑌 / 𝑃(𝑡 | 𝑊)].

Proof. We will start with the statistical estimand that we get from the ad-
justment formula (Theorem 2.1). Given unconfoundedness and positivity,
the adjustment formula tells us

𝔼[𝑌(𝑡)] = 𝔼[𝔼[𝑌 | 𝑡, 𝑊]] (A.13)

We'll assume the variables are discrete to break these expectations into
sums (replace with integrals if continuous):
= ∑𝑤 ( ∑𝑦 𝑦 𝑃(𝑦 | 𝑡, 𝑤) ) 𝑃(𝑤) (A.14)

To get 𝑃(𝑡 | 𝑤) in there, we multiply by 𝑃(𝑡 | 𝑤)/𝑃(𝑡 | 𝑤):

= ∑𝑤 ∑𝑦 𝑦 𝑃(𝑦 | 𝑡, 𝑤) (𝑃(𝑡 | 𝑤)/𝑃(𝑡 | 𝑤)) 𝑃(𝑤) (A.15)

Then, noticing that 𝑃(𝑦 | 𝑡, 𝑤) 𝑃(𝑡 | 𝑤) 𝑃(𝑤) is the joint distribution:

= ∑𝑤 ∑𝑦 𝑦 (1/𝑃(𝑡 | 𝑤)) 𝑃(𝑦, 𝑡, 𝑤) (A.16)

∑𝑦 𝑦 𝑃(𝑦, 𝑡, 𝑤) is nearly ∑𝑦 𝑦 𝑃(𝑦) = 𝔼[𝑌], but because 𝑇 = 𝑡 and 𝑊 = 𝑤 are in the probability, the terms of this sum are only non-zero if 𝑇 = 𝑡 and 𝑊 = 𝑤. Therefore, we get the indicator random variable for this event in the expectation that is over all three random variables (𝑇, 𝑊, and 𝑌):

= ∑𝑤 (1/𝑃(𝑡 | 𝑤)) 𝔼[1(𝑇 = 𝑡, 𝑊 = 𝑤) 𝑌] (A.17)

Now, the ∑𝑤 1/𝑃(𝑡 | 𝑤) that remains is a weighted expectation over 𝑊. Because we are now marginalizing over 𝑊, 𝑤 becomes a random variable (𝑊), and the 𝑊 = 𝑤 inside the indicator becomes redundant. This gives us the following:

= 𝔼[1(𝑇 = 𝑡) 𝑌 / 𝑃(𝑡 | 𝑊)] (A.18)

Note: For some people, it might be more natural to skip straight from
Equation A.16 to Equation A.18.
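As a concrete illustration of the estimand in Equation A.18, here is a minimal sketch of the corresponding plug-in estimator, with the propensity score 𝑃(𝑇 = 1 | 𝑊) fit by logistic regression. The function name and interface are our own illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_mean(W, t, y, treatment_value=1):
    """Plug-in estimate of E[Y(t)] via Equation A.18:
    the sample mean of 1(T = t) * Y / P(T = t | W),
    with the propensity score P(T = 1 | W) fit by logistic regression."""
    model = LogisticRegression().fit(W, t)
    p_treated = model.predict_proba(W)[:, 1]                  # estimated P(T = 1 | W)
    p_t = p_treated if treatment_value == 1 else 1.0 - p_treated
    indicator = (np.asarray(t) == treatment_value).astype(float)
    return float(np.mean(indicator * np.asarray(y) / p_t))

# The corresponding IPW estimate of the ATE is then:
# ate_hat = ipw_mean(W, t, y, 1) - ipw_mean(W, t, y, 0)
```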
Bibliography

Here are the references in citation order.

[1] Tyler Vigen. Spurious correlations. https://www.tylervigen.com/spurious- correlations. 2015


(cited on page 3).
[2] Jerzy Splawa-Neyman. ‘On the Application of Probability Theory to Agricultural Experiments. Essay
on Principles. Section 9.’ Trans. by D. M. Dabrowska and T. P. Speed. In: Statistical Science 5.4 (1923
[1990]), pp. 465–472 (cited on page 6).
[3] Donald B. Rubin. ‘Estimating causal effects of treatments in randomized and nonrandomized studies.’
In: Journal of educational Psychology 66.5 (1974), p. 688 (cited on pages 6, 7).
[4] Jasjeet S. Sekhon. ‘The Neyman-Rubin Model of Causal Inference and Estimation via Matching
Methods’. In: Oxford handbook of political methodology (2008), pp. 271– (cited on page 6).
[5] Paul W. Holland. ‘Statistics and Causal Inference’. In: Journal of the American Statistical Association 81.396
(1986), pp. 945–960. doi: 10.1080/01621459.1986.10478354 (cited on pages 8, 42).
[6] Alexander D’Amour, Peng Ding, Avi Feller, Lihua Lei, and Jasjeet Sekhon. Overlap in Observational
Studies with High-Dimensional Covariates. 2017 (cited on page 13).
[7] Miguel A Hernán and James M Robins. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC,
2020 (cited on pages 14, 27, 90).
[8] Miguel Angel Luque-Fernandez, Michael Schomaker, Daniel Redondo-Sanchez, Maria Jose Sanchez
Perez, Anand Vaidya, and Mireille E Schnitzer. ‘Educational Note: Paradoxical collider effect in the
analysis of non-communicable disease epidemiological data: a reproducible illustration and web
application’. In: International Journal of Epidemiology 48.2 (Dec. 2018), pp. 640–653. doi: 10.1093/ije/
dyy275 (cited on pages 16, 45).
[9] Salim S. Virani et al. ‘Heart Disease and Stroke Statistics—2020 Update: A Report From the American
Heart Association’. In: Circulation (Mar. 2020), pp. 640–653. doi: 10.1161/cir.0000000000000757
(cited on page 16).
[10] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer
Series in Statistics. New York, NY, USA: Springer New York Inc., 2001 (cited on pages 17, 63).
[11] Dominik Janzing, David Balduzzi, Moritz Grosse-Wentrup, and Bernhard Schölkopf. ‘Quantifying
causal influences’. In: Ann. Statist. 41.5 (Oct. 2013), pp. 2324–2358. doi: 10.1214/13-AOS1145 (cited on
page 18).
[12] Stephen L. Morgan and Christopher Winship. Counterfactuals and Causal Inference: Methods and Principles
for Social Research. 2nd ed. Analytical Methods for Social Research. Cambridge University Press, 2014
(cited on pages 18, 78).
[13] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. Adaptive
Computation and Machine Learning. The MIT Press, 2009 (cited on pages 21, 29, 47).
[14] J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms.
Cambridge, MA, USA: MIT Press, 2017 (cited on pages 21, 105, 107, 109).
[15] Judea Pearl, Madelyn Glymour, and Nicholas P Jewell. Causal inference in statistics: A primer. John Wiley
& Sons, 2016 (cited on page 25).
[16] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco,
CA, USA: Morgan Kaufmann Publishers Inc., 1988 (cited on page 29).
[17] Judea Pearl. ‘Causal inference in statistics: An overview’. In: Statist. Surv. 3 (2009), pp. 96–146. doi:
10.1214/09-SS057 (cited on page 42).
[18] Judea Pearl. Causality. Cambridge University Press, 2009 (cited on pages 44, 47, 48, 59, 87, 94).
[19] Felix Elwert and Christopher Winship. ‘Endogenous Selection Bias: The Problem of Conditioning on a
Collider Variable.’ In: Annual review of sociology 40 (2014), pp. 31–53 (cited on page 44).
[20] David Galles and Judea Pearl. ‘An Axiomatic Characterization of Causal Counterfactuals’. In: Founda-
tions of Science 3.1 (1998), pp. 151–182. doi: 10.1023/A:1009602825894 (cited on page 47).
[21] Joseph Y. Halpern. ‘Axiomatizing Causal Reasoning’. In: Proceedings of the Fourteenth Conference on
Uncertainty in Artificial Intelligence. UAI’98. Madison, Wisconsin: Morgan Kaufmann Publishers Inc.,
1998, pp. 202–210 (cited on page 47).
[22] Elizabeth L. Ogburn and Tyler J. VanderWeele. ‘Causal Diagrams for Interference’. In: Statist. Sci. 29.4
(Nov. 2014), pp. 559–578. doi: 10.1214/14-STS501 (cited on page 48).
[23] J. Pearl. ‘On the consistency rule in causal inference: axiom, definition, assumption, or theorem?’ In:
Epidemiology 21.6 (Nov. 2010), pp. 872–875 (cited on page 48).
[24] Judea Pearl. ‘Causal diagrams for empirical research’. In: Biometrika 82.4 (Dec. 1995), pp. 669–688. doi:
10.1093/biomet/82.4.669 (cited on pages 55, 56).
[25] Ilya Shpitser and Judea Pearl. ‘Identification of Joint Interventional Distributions in Recursive Semi-
Markovian Causal Models’. In: Proceedings of the 21st National Conference on Artificial Intelligence - Volume
2. AAAI’06. Boston, Massachusetts: AAAI Press, 2006, pp. 1219–1226 (cited on pages 57, 60).
[26] Yimin Huang and Marco Valtorta. ‘Pearl’s Calculus of Intervention is Complete’. In: Proceedings of the
Twenty-Second Conference on Uncertainty in Artificial Intelligence. UAI’06. Cambridge, MA, USA: AUAI
Press, 2006, pp. 217–224 (cited on page 57).
[27] Jin Tian and Judea Pearl. ‘A General Identification Condition for Causal Effects’. In: Eighteenth National
Conference on Artificial Intelligence. Edmonton, Alberta, Canada: American Association for Artificial
Intelligence, 2002, pp. 567–573 (cited on page 59).
[28] Ilya Shpitser and Judea Pearl. ‘Identification of Conditional Interventional Distributions’. In: Proceedings
of the Twenty-Second Conference on Uncertainty in Artificial Intelligence. UAI’06. Cambridge, MA, USA:
AUAI Press, 2006, pp. 437–444 (cited on page 60).
[29] F. Pedregosa et al. ‘Scikit-learn: Machine Learning in Python’. In: Journal of Machine Learning Research
12 (2011), pp. 2825–2830 (cited on pages 62, 64).
[30] Sören R. Künzel, Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. ‘Metalearners for estimating heterogeneous
treatment effects using machine learning’. In: Proceedings of the National Academy of Sciences 116.10 (2019),
pp. 4156–4165. doi: 10.1073/pnas.1804597116 (cited on pages 64–67).
[31] Uri Shalit, Fredrik D. Johansson, and David Sontag. ‘Estimating individual treatment effect: general-
ization bounds and algorithms’. In: ed. by Doina Precup and Yee Whye Teh. Vol. 70. Proceedings of
Machine Learning Research. International Convention Centre, Sydney, Australia: PMLR, June 2017,
pp. 3076–3085 (cited on pages 65, 66).
[32] Paul R. Rosenbaum and Donald B. Rubin. ‘The central role of the propensity score in observational
studies for causal effects’. In: Biometrika 70.1 (Apr. 1983), pp. 41–55. doi: 10.1093/biomet/70.1.41
(cited on page 67).
[33] D. G. Horvitz and D. J. Thompson. ‘A Generalization of Sampling Without Replacement from a
Finite Universe’. In: Journal of the American Statistical Association 47.260 (1952), pp. 663–685. doi:
10.1080/01621459.1952.10483446 (cited on page 69).
[34] Jason Abrevaya, Yu-Chin Hsu, and Robert P. Lieli. ‘Estimating Conditional Average Treatment Effects’. In:
Journal of Business & Economic Statistics 33.4 (2015), pp. 485–505. doi: 10.1080/07350015.2014.975555
(cited on page 70).
[35] Joseph D. Y. Kang and Joseph L. Schafer. ‘Demystifying Double Robustness: A Comparison of
Alternative Strategies for Estimating a Population Mean from Incomplete Data’. In: Statist. Sci. 22.4
(Nov. 2007), pp. 523–539. doi: 10.1214/07-STS227 (cited on page 70).
[36] Paul N Zivich and Alexander Breskin. Machine learning for causal inference: on the use of cross-fit estimators.
2020 (cited on page 70).
[37] Vincent Dorie, Jennifer Hill, Uri Shalit, Marc Scott, and Dan Cervone. ‘Automated versus Do-It-Yourself
Methods for Causal Inference: Lessons Learned from a Data Analysis Competition’. In: Statist. Sci. 34.1
(Feb. 2019), pp. 43–68. doi: 10.1214/18-STS667 (cited on page 70).
[38] Shaun R. Seaman and Stijn Vansteelandt. ‘Introduction to Double Robust Methods for Incomplete
Data’. In: Statist. Sci. 33.2 (May 2018), pp. 184–197. doi: 10.1214/18-STS647 (cited on page 70).
[39] Anastasios Tsiatis. Semiparametric theory and missing data. Springer Science & Business Media, 2007
(cited on page 70).
[40] James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. ‘Estimation of Regression Coefficients When
Some Regressors are not Always Observed’. In: Journal of the American Statistical Association 89.427
(1994), pp. 846–866. doi: 10.1080/01621459.1994.10476818 (cited on page 70).
[41] Heejung Bang and James M. Robins. ‘Doubly Robust Estimation in Missing Data and Causal Inference
Models’. In: Biometrics 61.4 (2005), pp. 962–973. doi: 10.1111/j.1541-0420.2005.00377.x (cited on
page 70).
[42] Mark J Van Der Laan and Daniel Rubin. ‘Targeted maximum likelihood learning’. In: The international
journal of biostatistics 2.1 (2006) (cited on page 70).
[43] Megan S. Schuler and Sherri Rose. ‘Targeted Maximum Likelihood Estimation for Causal Inference in
Observational Studies’. In: American Journal of Epidemiology 185.1 (Jan. 2017), pp. 65–73. doi: 10.1093/
aje/kww165 (cited on page 70).
[44] Mark J Van der Laan and Sherri Rose. Targeted learning: causal inference for observational and experimental
data. Springer Science & Business Media, 2011 (cited on page 70).
[45] Elizabeth A. Stuart. ‘Matching Methods for Causal Inference: A Review and a Look Forward’. In: Statist.
Sci. 25.1 (Feb. 2010), pp. 1–21. doi: 10.1214/09-STS313 (cited on page 71).
[46] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney
Newey, and James Robins. ‘Double/debiased machine learning for treatment and structural parameters’.
In: The Econometrics Journal 21.1 (2018), pp. C1–C68. doi: 10.1111/ectj.12097 (cited on page 71).
[47] Chris Felton. Chernozhukov et al. on Double / Debiased Machine Learning. https://scholar.princeton.
edu/sites/default/files/bstewart/files/felton.chern _ .slides.20190318.pdf. 2018 (cited
on page 71).
[48] Vasilis Syrgkanis. Orthogonal/Double Machine Learning. https://econml.azurewebsites.net/spec/
estimation/dml.html. Accessed: 16 September 2020. 2019 (cited on page 71).
[49] Dylan J. Foster and Vasilis Syrgkanis. Orthogonal Statistical Learning. 2019 (cited on page 71).
[50] Susan Athey and Guido Imbens. ‘Recursive partitioning for heterogeneous causal effects’. In: Proceedings
of the National Academy of Sciences 113.27 (2016), pp. 7353–7360. doi: 10.1073/pnas.1510489113 (cited
on page 71).
[51] Stefan Wager and Susan Athey. ‘Estimation and Inference of Heterogeneous Treatment Effects using
Random Forests’. In: Journal of the American Statistical Association 113.523 (2018), pp. 1228–1242. doi:
10.1080/01621459.2017.1319839 (cited on page 71).
[52] Susan Athey, Julie Tibshirani, and Stefan Wager. ‘Generalized random forests’. In: Ann. Statist. 47.2
(Apr. 2019), pp. 1148–1178. doi: 10.1214/18-AOS1709 (cited on page 71).
[53] Charles F. Manski. Partial Identification of Probability Distributions: Springer Series in Statistics. English.
Springer, 2003 (cited on pages 73, 82).
[54] Charles Manski. ‘Anatomy of the Selection Problem’. In: Journal of Human Resources 24.3 (1989),
pp. 343–360 (cited on pages 73, 82).
[55] Charles F. Manski. ‘Nonparametric Bounds on Treatment Effects’. In: The American Economic Review
80.2 (1990), pp. 319–323 (cited on pages 73, 74, 79–82).
[56] Charles F. Manski. ‘Identification Problems in the Social Sciences’. In: Sociological Methodology 23 (1993),
pp. 1–56 (cited on pages 73, 82).
[57] Charles F. Manski. ‘The selection problem’. In: Advances in Econometrics: Sixth World Congress. Ed. by
Christopher A.Editor Sims. Vol. 1. Econometric Society Monographs. Cambridge University Press,
1994, pp. 143–170. doi: 10.1017/CCOL0521444594.004 (cited on pages 73, 82).
[58] Charles F. Manski. ‘Monotone Treatment Response’. In: Econometrica 65.6 (1997), pp. 1311–1334 (cited
on pages 73, 76, 82).
[59] Charles F. Manski and John V. Pepper. ‘Monotone Instrumental Variables: With an Application to the
Returns to Schooling’. In: Econometrica 68.4 (2000), pp. 997–1010 (cited on pages 73, 78, 82).
[60] Charles F. Manski. Public Policy in an Uncertain World. Harvard University Press, 2013 (cited on pages 73,
82).
[61] Carlos Cinelli, Daniel Kumor, Bryant Chen, Judea Pearl, and Elias Bareinboim. ‘Sensitivity Analysis
of Linear Structural Causal Models’. In: ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov.
Vol. 97. Proceedings of Machine Learning Research. Long Beach, California, USA: PMLR, Sept. 2019,
pp. 1252–1261 (cited on page 84).
[62] P. R. Rosenbaum and D. B. Rubin. ‘Assessing Sensitivity to an Unobserved Binary Covariate in
an Observational Study with Binary Outcome’. In: Journal of the Royal Statistical Society. Series B
(Methodological) 45.2 (1983), pp. 212–218 (cited on page 85).
[63] Guido W. Imbens. ‘Sensitivity to Exogeneity Assumptions in Program Evaluation’. In: American
Economic Review 93.2 (May 2003), pp. 126–132. doi: 10.1257/000282803321946921 (cited on page 85).
[64] Carlos Cinelli and Chad Hazlett. ‘Making sense of sensitivity: extending omitted variable bias’.
In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82.1 (2020), pp. 39–67. doi:
10.1111/rssb.12348 (cited on page 85).
[65] Victor Veitch and Anisha Zaveri. Sense and Sensitivity Analysis: Simple Post-Hoc Analysis of Bias Due to
Unobserved Confounding. 2020 (cited on page 85).
[66] W. Liu, S. J. Kuramoto, and E. A. Stuart. ‘An introduction to sensitivity analysis for unobserved confounding in nonexperimental prevention research’. In: Prevention Science 14.6 (Dec. 2013), pp. 570–580 (cited on page 85).
[67] Paul R. Rosenbaum. Observational Studies. Springer, 2002 (cited on page 85).
[68] Paul R Rosenbaum. Design of Observational Studies. Vol. 10. Springer, 2010 (cited on page 85).
[69] Paul R Rosenbaum. Observation and Experiment. Harvard University Press, 2017 (cited on page 85).
[70] Alexander M. Franks, Alexander D’Amour, and Avi Feller. ‘Flexible Sensitivity Analysis for Observational Studies Without Observable Implications’. In: Journal of the American Statistical Association 0.0 (2019), pp. 1–33. doi: 10.1080/01621459.2019.1604369 (cited on page 85).
[71] Steve Yadlowsky, Hongseok Namkoong, Sanjay Basu, John Duchi, and Lu Tian. Bounds on the conditional
and average treatment effect with unobserved confounding factors. 2020 (cited on page 85).
[72] T. J. VanderWeele and O. A. Arah. ‘Bias formulas for sensitivity analysis of unmeasured confounding
for general outcomes, treatments, and confounders’. In: Epidemiology 22.1 (Jan. 2011), pp. 42–52 (cited
on page 85).
[73] P. Ding and T. J. VanderWeele. ‘Sensitivity Analysis Without Assumptions’. In: Epidemiology 27.3 (May
2016), pp. 368–377 (cited on page 85).
[74] Abraham Wald. ‘The Fitting of Straight Lines if Both Variables are Subject to Error’. In: Ann. Math.
Statist. 11.3 (Sept. 1940), pp. 284–300. doi: 10.1214/aoms/1177731868 (cited on page 88).
[75] Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. ‘Deep IV: A Flexible Approach
for Counterfactual Prediction’. In: ed. by Doina Precup and Yee Whye Teh. Vol. 70. Proceedings of
Machine Learning Research. International Convention Centre, Sydney, Australia: PMLR, June 2017,
pp. 1414–1423 (cited on page 94).
[76] Liyuan Xu, Yutian Chen, Siddarth Srinivasan, Nando de Freitas, Arnaud Doucet, and Arthur Gretton.
Learning Deep Features in Instrumental Variable Regression. 2020 (cited on page 94).
[77] Niki Kilbertus, Matt J. Kusner, and Ricardo Silva. A Class of Algorithms for General Instrumental Variable
Models. 2020 (cited on page 94).
[78] Thomas Verma and Judea Pearl. ‘Equivalence and Synthesis of Causal Models’. In: Proceedings of the
Sixth Annual Conference on Uncertainty in Artificial Intelligence. UAI ’90. USA: Elsevier Science Inc., 1990,
pp. 255–270 (cited on page 102).
[79] Morten Frydenberg. ‘The Chain Graph Markov Property’. In: Scandinavian Journal of Statistics 17.4
(1990), pp. 333–353 (cited on page 102).
[80] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, Jan.
2001 (cited on pages 102, 104).
[81] Thomas Richardson. ‘Feedback Models: Interpretation and Discovery’. PhD thesis. 1996 (cited on
page 104).
[82] Antti Hyttinen, Patrik O. Hoyer, Frederick Eberhardt, and Matti Järvisalo. ‘Discovering Cyclic Causal
Models with Latent Variables: A General SAT-Based Procedure’. In: Proceedings of the Twenty-Ninth
Conference on Uncertainty in Artificial Intelligence. UAI’13. Bellevue, WA: AUAI Press, 2013, pp. 301–310
(cited on page 104).
[83] Antti Hyttinen, Frederick Eberhardt, and Matti Järvisalo. ‘Constraint-Based Causal Discovery: Conflict
Resolution with Answer Set Programming’. In: Proceedings of the Thirtieth Conference on Uncertainty in
Artificial Intelligence. UAI’14. Quebec City, Quebec, Canada: AUAI Press, 2014, pp. 340–349 (cited on
page 104).
[84] Rajen D. Shah and Jonas Peters. ‘The hardness of conditional independence testing and the generalised
covariance measure’. In: Ann. Statist. 48.3 (June 2020), pp. 1514–1538. doi: 10.1214/19-AOS1857 (cited
on page 104).
[85] Christopher Meek. ‘Strong Completeness and Faithfulness in Bayesian Networks’. In: Proceedings of
the Eleventh Conference on Uncertainty in Artificial Intelligence. UAI’95. Montréal, Qué, Canada: Morgan
Kaufmann Publishers Inc., 1995, pp. 411–418 (cited on page 104).
[86] Dan Geiger and Judea Pearl. ‘On the Logic of Causal Models’. In: Proceedings of the Fourth Annual
Conference on Uncertainty in Artificial Intelligence. UAI ’88. NLD: North-Holland Publishing Co., 1988,
pp. 3–14 (cited on page 104).
[87] G. Darmois. ‘Analyse générale des liaisons stochastiques: étude particulière de l’analyse factorielle
linéaire’. In: Revue de l’Institut International de Statistique / Review of the International Statistical Institute
21.1/2 (1953), pp. 2–8 (cited on page 106).
[88] V. P. Skitovich. ‘Linear forms of independent random variables and the normal distribution law’. In:
Izvestiia Akademii Nauk SSSR, Serija Matematiceskie 18 (1954), pp. 185–200 (cited on page 106).
[89] Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, and Antti Kerminen. ‘A Linear Non-Gaussian
Acyclic Model for Causal Discovery’. In: Journal of Machine Learning Research 7.72 (2006), pp. 2003–2030
(cited on page 107).
[90] Patrik Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. ‘Nonlinear causal
discovery with additive noise models’. In: Advances in Neural Information Processing Systems. Ed. by
D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou. Vol. 21. Curran Associates, Inc., 2009, pp. 689–696
(cited on page 108).
[91] Jonas Peters, Joris M. Mooij, Dominik Janzing, and Bernhard Schölkopf. ‘Causal Discovery with
Continuous Additive Noise Models’. In: Journal of Machine Learning Research 15.58 (2014), pp. 2009–2053
(cited on page 108).
[92] Kun Zhang and Aapo Hyvärinen. ‘On the Identifiability of the Post-Nonlinear Causal Model’. In:
Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. UAI ’09. Montreal,
Quebec, Canada: AUAI Press, 2009, pp. 647–655 (cited on page 109).
[93] Frederick Eberhardt. ‘Introduction to the foundations of causal discovery’. In: International Journal of
Data Science and Analytics 3.2 (2017), pp. 81–91. doi: 10.1007/s41060-016-0038-6 (cited on page 109).
[94] Clark Glymour, Kun Zhang, and Peter Spirtes. ‘Review of Causal Discovery Methods Based on
Graphical Models’. In: Frontiers in Genetics 10 (2019), p. 524. doi: 10.3389/fgene.2019.00524 (cited on
page 109).
Alphabetical Index

2SLS, 89
adjustment formula, 11
ancestor, 19
association, 4
    causal association, 4, 30
    confounding association, 4, 30
associational difference, 8
ATT, 95
average treatment effect (ATE), 8, 62
backdoor adjustment, 38
backdoor criterion, 37
backdoor path, 37
Bayesian network, 21
    chain rule, 21
    factorization, 21
Berkson’s paradox, 27
blocked path, 25, 26, 28
causal Bayesian networks, 35
causal effect
    average, 8, 62
    complier average, 91
    conditional average, 62
    individual, 7, 62
    individualized average, 62
    unit-level, 7
causal estimand, 15, 33
causal forests, 71
causal graph, 22
    non-strict, 23
    strict, 22
causal mechanism, 34, 41
causal sufficiency, 101
causal tree, 71
cause, 22, 41
child, 19
collider, 26
collider bias, 43
COM estimator, 63
common cause, 4
common support, 13
comparability, 49
complier average causal effect (CACE), 91
conditional average treatment effect (CATE), 62
conditional instrument, 87
conditional outcome model, 63
confounder, 4
consistent, 70
correlation, 4
counterfactual, 8
covariate balance, 49
CPDAG, 102
curse of dimensionality, 13
cycle, 19
d-connected, 29
d-separated, 29
d-separation, 29
data generating process, 27
descendant, 19
direct cause, 22, 41
directed acyclic graph (DAG), 19
directed graph, 19
directed path, 19
do-calculus, 55
do-operator, 32
double robustness, 70
doubly robust, 70
edge, 19
endogenous, 41
Equationtown, 53
essential graph, 102
estimand, 15
    causal, 15, 33
    statistical, 15, 33
estimate, 15
estimation, 15, 16
estimator, 15
exchangeability, 9, 50
exogenous, 41
extrapolation, 13
factual, 8
faithfulness, 100
frontdoor adjustment, 53
frontdoor criterion, 53
generalized random forests, 71
global Markov assumption, 29
graph, 19
hedge criterion, 60
homogeneity, 90
identifiability, 10, 33
identification, 10, 16, 33
    nonparametric, 57
    parametric, 57, 63
    structure, 100
ignorability, 9
immorality, 20, 26
individual treatment effect (ITE), 7, 62
individualized average treatment effect (IATE), 62
instrument, 86
    conditional, 87
instrumental variable, 86
interference, 13
interventional distribution, 32
interventional SCM, 42
inverse probability weighting, 68
IPW, 68
local average treatment effect (LATE), 91
local Markov assumption, 20
lurking variable, 4
M-bias, 44, 47
magnification, 43
magnify, 43
manipulated graph, 34, 55
Markov compatibility, 21
Markov equivalence, 101
Markovian, 41
mediator, 43
minimality, 21
misspecification, 18
model-assisted estimation, 16
model-assisted estimator, 16, 17
monotone treatment response (MTR), 76
monotonicity, 91
no-assumptions bound, 75
node, 19
non-Markovian, 41
nonparametric, 40
nonparametric identification, 57
observational data, 33
observational distribution, 33
observational-counterfactual decomposition, 75
optimal treatment selection (OTS), 79
orientation propagation, 103
overlap, 13
parallel trends, 96
parametric identification, 57, 63
parent, 19
path, 19
    blocked, 25, 26, 28
    unblocked, 25, 27, 28
positivity, 12
post-intervention, 33
post-treatment covariates, 44
potential outcome, 6
pre-intervention, 33
propensity score, 67
pseudo-populations, 68
randomized control trials (RCTs), 49
randomized experiments, 49
semi-Markovian, 41
Simpson’s paradox, 1
skeleton, 102
spurious correlation, 3
statistical estimand, 15, 33
structural causal model (SCM), 40
structural equation, 40
structure identification, 100
sufficient adjustment set, 38
SUTVA, 14
targeted maximum likelihood estimation, 70
TARNet, 65
terminology machine gun, 19
TMLE, 70
treatment assignment mechanism, 49
truncated factorization, 35
two-stage least squares, 89
unblocked path, 25, 27, 28
unconfounded children criterion, 59
unconfoundedness, 11
undirected graph, 19
unit-level treatment effect, 7
vertex, 19
Wald Estimand, 88
Wald estimator, 88
X-learner, 66