
Causal Inference in Statistics:

A Primer
Solution Manual
Text Authors:
Judea Pearl, Madelyn Glymour, and Nicholas Jewell
Solution Authors:
Judea Pearl, Ang Li, Andrew Forney, and Johannes Textor
About This Manual
This document provides solutions, explanations, and intuition for the study questions posed
in Causal Inference in Statistics: A Primer. Students are encouraged to attempt each study
question by hand before consulting the answers herein.

Online Access
As the authors make updates to the text and solution manual, changes and errata will be
posted at the following links:
Textbook Information & Update site: http://bayes.cs.ucla.edu/PRIMER/
Solution Manual Information & Update site: http://bayes.cs.ucla.edu/PRIMER/Manual

Interactive Tutorial using DAGitty


The authors have collaborated with Johannes Textor, the maker of DAGitty (a browser-
based environment for creating, editing, and analyzing causal models), to provide interactive
tutorials for classroom use and self-study. We provide solutions to some exercises in the R
environment for statistical computing, based on the DAGitty R package. Each question with
an accompanying DAGitty explanation has been tagged accordingly, and students may find a
complete list of these examples at the following link:
[DAGitty] Textbook Companion site: http://dagitty.net/primer/
Study Questions and Solutions for
Chapter 1

Study question 1.2.1.


What is wrong with the following claims?
(a) “Data show that income and marriage have a high positive correlation. Therefore, your
earnings will increase if you get married.”
(b) “Data show that as the number of fires increase, so does the number of fire fighters.
Therefore, to cut down on fires, you should reduce the number of fire fighters.”
(c) “Data show that people who hurry tend to be late to their meetings. Don’t hurry, or
you’ll be late.”

Solution to study question 1.2.1

The three claims are obviously wrong, and in subsequent sections of this book we will acquire
the tools to prove them wrong formally. At this stage, however, we will merely explain the
observed correlations using alternative models that do not support the cited claims; for each
problem, the explanation introduces a new variable.

Part (a)
Consider an alternative explanation with a third variable, charm, which has a causal influ-
ence on both income and marriage (charming individuals have a higher propensity to marry
and be promoted in their jobs), but where marriage has no causal influence on income. This
explanation supports the observed data (that marriage and income are highly correlated) but
does not allow us to conclude that marrying will increase one’s income.

Part (b)
Consider that the number of fire fighters employed in a district is a direct response to the
frequency of fires in that area. In a natural scenario, a higher frequency of fires causes addi-
tional fire fighters to be hired, and hiring fire fighters decreases the number of fires that would
break out had they not been hired. Hence, hiring fewer fire fighters will actually increase the
frequency of fires.

Part (c)
Let us consider the reason that an individual might hurry to an appointment: having woken up
late, they believe that a slow pace will not get them to the appointment on time. Waking late
is thus a common cause of both hurrying and arriving late to the meeting. This will
cause a high correlation between hurrying and arriving late, even though for a fixed waking
time, hurrying would actually decrease one’s likelihood of arriving late.

Study question 1.2.2.


A baseball batter, Tim, has a better batting average than his teammate Frank. However,
someone notices that Frank has a better batting average than Tim against both right-handed
and left-handed pitchers. How can this happen? (Present your answer in a table.)

Solution to study question 1.2.2

Observe that this problem requests that we create a Simpson’s reversal in our data. [Hint: We
can use Table 1.2 from the text to scaffold our answer.] Consider the following, somewhat
unrealistic (but without loss of generality) batting averages for Frank and Tim against Right-
and Left-handed pitchers:

                Frank                                 Tim
Right-handed    81 hits out of 87 at-bats (.931)      234 hits out of 270 at-bats (.867)
Left-handed     192 hits out of 263 at-bats (.730)    55 hits out of 80 at-bats (.688)
Combined Data   273 hits out of 350 at-bats (.780)    289 hits out of 350 at-bats (.826)
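
The reversal is a matter of arithmetic and can be checked mechanically; a quick sketch in R (the variable names are ours):

```r
# Hits and at-bats from the table above, split by pitcher handedness.
hits   <- c(frank_R = 81, frank_L = 192, tim_R = 234, tim_L = 55)
atbats <- c(frank_R = 87, frank_L = 263, tim_R = 270, tim_L = 80)

round(hits / atbats, 3)                        # per-hand: Frank beats Tim both times
frank <- sum(hits[c("frank_R", "frank_L")]) / sum(atbats[c("frank_R", "frank_L")])
tim   <- sum(hits[c("tim_R", "tim_L")])     / sum(atbats[c("tim_R", "tim_L")])
round(c(frank = frank, tim = tim), 3)          # combined: Tim beats Frank
```

The trick is the unequal group sizes: Frank's at-bats are mostly against left-handed pitchers (the harder case for both players), while Tim's are mostly against right-handed pitchers.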

Study question 1.2.3.


Determine, for each of the following causal stories, whether you should use the aggregate or
the segregated data to determine the true effect of treatment on recovery.
(a) There are two treatments used on kidney stones: Treatment 𝐴 and Treatment 𝐵. Doctors
are more likely to use Treatment 𝐴 on large (and therefore, more severe) stones and more
likely to use Treatment 𝐵 on small stones. Should a patient who doesn’t know the size
of his or her stone examine the general population data, or the stone size-specific data
when deciding which treatment they would like to request?
(b) There are two doctors in a small town. Each has performed 100 surgeries in his career,
which are of two types: one very difficult surgery, and one very easy surgery. The first
doctor performs the easy surgery much more often than the difficult surgery, and the

second doctor performs the difficult surgery more often than the easy surgery. You need
surgery, but you do not know whether your case is easy or difficult. Should you consult
the success rate of each doctor over all cases, or should you consult their success rates
for the easy and difficult cases separately when choosing a surgeon to perform your
operation?

Solution to study question 1.2.3

To answer each question in this section, we examine the causal relationships behind the
described scenario and use their structure to determine which interpretation of the data is
valid.

Part (a)
Here, the size of the stone is a common cause of the treatment choice and its recovery
outcome. In other words, the size of the stone both affects the likelihood of receiving one
treatment over the other, and also the chance of recovery since larger stones are more severe.
Moreover, treatment does not change the stone size. As such, the structure of this scenario is
identical to that of Example 1.2.1, in which treatment has no effect on sex. Similarly, whether
or not the patient knows the size of their stone, we should consult the segregated data
conditioned on stone size to make a correct decision.

Part (b)
The same logic as above applies. Paralleling the structure of Example 1.2.1, Difficulty of
surgery is a common cause of both doctor choice and recovery rates. In other words, the dif-
ficulty of a surgery affects both propensities for choosing one doctor over another as well as
the chance of success, since more difficult cases could inherently have less chance of success.
As such, whether or not the patient knows the difficulty of their surgery, we should consult
the segregated data conditioned on difficulty to make a causally-correct decision.

Study question 1.2.4.


In an attempt to estimate the effectiveness of a new drug, a randomized experiment is
conducted. In all, 50% of the patients are assigned to receive the new drug and 50% to receive
a placebo. A day before the actual experiment, a nurse hands out lollipops to some patients
who show signs of depression, mostly among those who have been assigned to treatment the
next day (i.e., the nurse’s round happened to take her through the treatment-bound ward).
Strangely, the experimental data revealed a Simpson’s reversal: Although the drug proved
beneficial to the population as a whole, drug takers were less likely to recover than nontakers,
among both lollipop receivers and lollipop nonreceivers. Assuming that lollipop sucking in
itself has no effect whatsoever on recovery, answer the following questions:

(a) Is the drug beneficial to the population as a whole or harmful?

(b) Does your answer contradict our gender example, where sex-specific data was deemed
more appropriate?

(c) Draw a graph (informally) that more or less captures the story. (Look ahead to Section
1.3 if you wish.)
(d) How would you explain the emergence of Simpson’s reversal in this story?
(e) Would your answer change if the lollipops were handed out (by the same criterion) a day
after the study?
[Hint: Use the fact that receiving a lollipop indicates a greater likelihood of being assigned
to drug treatment, as well as depression, which is a symptom of risk factors that lower the
likelihood of recovery.]

Solution to study question 1.2.4

The arguments behind this problem are somewhat intricate, but they can be made intuitive
if we take an extreme case and assume that, among those in the treatment ward, patients
received a lollipop regardless of their health, while among those in the placebo ward only
extremely sick patients were given a lollipop. Under these circumstances, the group of
lollipop-receiving patients would show a strong correlation between treatment and recov-
ery even if the treatment is totally ineffective; treated individuals consist of typical patients
while untreated individuals consist of only extremely sick people. Thus, the treatment will
appear to improve chances of recovery even if it has no physical effect on recovery. The same
applies to the group of lollipop-denied patients; a spurious correlation will appear between
treatment and recovery, merely because the untreated patients were chosen from among the
very sick.

Such spurious correlations will not occur in the aggregated population because if we dis-
regard the lollipop we find a perfectly randomized experiment; those chosen for treatment
are chosen at random from the population, in total disregard of their health status. Another
way to understand the difference in populations is to note that, when we compare treated
and untreated patients in the lollipop-receiving group we are comparing apples and oranges;
these two groups are not exchangeable in terms of their health status.

We conclude that, in this example, the aggregated data reveal the correct answer to our ques-
tion, while the disaggregated data is biased. In Chapter 3 of this book we will see that, in sto-
ries of this nature, disaggregation is to be avoided regardless of the specific lollipop-handling
strategy used by the nurse. We will further learn to identify such situations directly from the
graph, without invoking arguments concerning exchangeability or apples and oranges.

Part (a)
Per the above, we know that disaggregated data is biased, so we instead consult the aggre-
gated data and conclude that the drug is beneficial to the population.

Part (b)
Our decision here does not contradict the gender example from Table 1.1 where we deemed
it appropriate to consult the segregated data. In the gender example, gender was not merely

correlated with treatment and recovery, it was actually a cause of both. Not so in the present
story; lollipop receipt correlates with, but is not a cause of, either treatment or recovery. The
two different stories warrant different treatments.

Part (c)
Let X indicate treatment receipt, Z indicate lollipop receipt, Y indicate recovery, and U1, U2
indicate two unobserved factors that correlate Z with X and Y, respectively. The causal graph
illustrating our story can be drawn with the edges:

U1 → X, U1 → Z, U2 → Z, U2 → Y, X → Y
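
This structure can be written down with the DAGitty R package mentioned in the introduction; a minimal sketch (the `latent` tags mark U1 and U2 as unobserved):

```r
library(dagitty)

g <- dagitty("dag {
  U1 [latent]   U2 [latent]
  U1 -> X   U1 -> Z
  U2 -> Z   U2 -> Y
  X -> Y
}")

# No back-door path from X to Y is open, so the empty set is a valid
# adjustment set (use the aggregate data) ...
adjustmentSets(g, exposure = "X", outcome = "Y")
# ... while conditioning on the collider Z is NOT valid (segregated data biased):
isAdjustmentSet(g, "Z", exposure = "X", outcome = "Y")   # FALSE
```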
Part (d)
Suppose we (incorrectly) decided to use segregated data in which we condition on lollipop
receipt. Simpson's reversal could then show a benefit of the drug in the population (aggregate
data) and harm in both lollipop-specific groups (segregated data) through a "trick" of the
segregated group sizes, as we've seen several times in this chapter (Table 1.2, for example). Consider that
the "got lollipop" group consists of a subset of the "treated" group and that if we got unlucky
and gave lollipops to all of the treated individuals who were going to recover, it would give
the impression of (negative) association between treatment and recovery even when there is
no causal effect of treatment on recovery. The same argument applies for the "didn’t get lol-
lipop" group.

Part (e)
Our answer will not change, since lollipop receipt is still only spuriously connected to
treatment even if the lollipops were distributed after the study. In analyses of this kind, we
must always consult the "causal story" behind the data.

Study question 1.3.1.


Identify the variables and events invoked in the lollipop story of Study question 1.2.4.

Solution to study question 1.3.1

Variables: Let X indicate treatment/drug receipt, Z indicate lollipop receipt, and Y indicate
recovery status.
Events: "X = 1 and Z = 1 and Y = 1" indicates the event in which an individual takes the
drug, receives a lollipop, and recovers (the same applies for other values of each variable).

Table 1.5: The proportion of males and females achieving a given education level

Gender    Highest education achieved     Occurrence (in hundreds of thousands)
Male      Never finished high school     112
Male      High school                    231
Male      College                        595
Male      Graduate school                242
Female    Never finished high school     136
Female    High school                    189
Female    College                        763
Female    Graduate school                172

Study question 1.3.2.


Consider Table 1.5 showing the relationship between gender and education level in the U.S.
adult population.
(a) Estimate 𝑃 (High School)
(b) Estimate 𝑃 (High School OR Female)
(c) Estimate 𝑃 (High School | Female)
(d) Estimate 𝑃 (Female | High School).

Solution to study question 1.3.2

Using Table 1.5, for each of the specified quantities of interest, we simply sum over the cases
in the matching attributes and divide by the appropriate population.

Part (a)
By marginalization, we can write:
$$
P(\text{High School}) = P(\text{High School}, \text{Male}) + P(\text{High School}, \text{Female}) = \frac{231 + 189}{112 + 231 + 595 + 242 + 136 + 189 + 763 + 172} = \frac{420}{2440} = 0.1721
$$

Part (b)
Summing over all cases falling in either the High School or Female categories, we have:

$$
P(\text{Female or High School}) = P(\text{Female}) + P(\text{Male}, \text{High School}) = \frac{189 + 136 + 763 + 172 + 231}{2440} = \frac{1491}{2440} = 0.6111
$$

Part (c)
By Bayes' conditioning, we can write:

$$
P(\text{High School} \mid \text{Female}) = \frac{P(\text{High School}, \text{Female})}{P(\text{Female})} = \frac{189}{136 + 189 + 763 + 172} = 0.15
$$
Part (d)
Again by Bayes' conditioning, we can write:

$$
P(\text{Female} \mid \text{High School}) = \frac{P(\text{Female}, \text{High School})}{P(\text{High School})} = \frac{189}{231 + 189} = 0.45
$$
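
These four quantities can also be read off the table programmatically; a sketch in R, transcribing Table 1.5 into a data frame:

```r
ed <- data.frame(
  gender = rep(c("Male", "Female"), each = 4),
  level  = rep(c("None", "HighSchool", "College", "Graduate"), times = 2),
  n      = c(112, 231, 595, 242, 136, 189, 763, 172)
)
N  <- sum(ed$n)
hs <- ed$level == "HighSchool"
f  <- ed$gender == "Female"

sum(ed$n[hs]) / N                  # (a) P(High School)           = 0.1721
sum(ed$n[hs | f]) / N              # (b) P(High School or Female) = 0.6111
sum(ed$n[hs & f]) / sum(ed$n[f])   # (c) P(High School | Female)  = 0.15
sum(ed$n[hs & f]) / sum(ed$n[hs])  # (d) P(Female | High School)  = 0.45
```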
Study question 1.3.3.
Consider the casino problem described in Section 1.3.7.
(a) Compute 𝑃 (“craps”|“11”) assuming that there are twice as many roulette tables as craps
games at the casino.
(b) Compute 𝑃 (“roulette”|“10”) assuming that there are twice as many craps games as
roulette tables at the casino.

Solution to study question 1.3.3


Part (a)
Assuming that there are twice as many roulette tables as craps games at the casino, we have:

$$
P(\text{"roulette"}) = 2/3, \qquad P(\text{"craps"}) = 1/3
$$

So, by the law of total probability, we can write the needed quantity P("11") in terms of what we know:

$$
\begin{aligned}
P(\text{"11"}) &= P(\text{"11"} \mid \text{"craps"})\,P(\text{"craps"}) + P(\text{"11"} \mid \text{"roulette"})\,P(\text{"roulette"}) \\
&= \frac{1}{18} \cdot \frac{1}{3} + \frac{1}{38} \cdot \frac{2}{3} = \frac{37}{1026} = 0.036
\end{aligned}
$$

$$
P(\text{"craps"} \mid \text{"11"}) = \frac{P(\text{"craps"}, \text{"11"})}{P(\text{"11"})} = \frac{(1/18)(1/3)}{37/1026} = 0.514
$$

Part (b)
Assuming that there are twice as many craps games as roulette tables at the casino, we have:

$$
P(\text{"roulette"}) = 1/3, \qquad P(\text{"craps"}) = 2/3
$$

We can use the same tactic as in (a) (the law of total probability) to write our target quantity in terms of what we know:

$$
\begin{aligned}
P(\text{"10"}) &= P(\text{"10"} \mid \text{"craps"})\,P(\text{"craps"}) + P(\text{"10"} \mid \text{"roulette"})\,P(\text{"roulette"}) \\
&= \frac{1}{12} \cdot \frac{2}{3} + \frac{1}{38} \cdot \frac{1}{3} = \frac{11}{171} = 0.064
\end{aligned}
$$

$$
P(\text{"roulette"} \mid \text{"10"}) = \frac{P(\text{"roulette"}, \text{"10"})}{P(\text{"10"})} = \frac{(1/38)(1/3)}{11/171} = 0.136
$$

Study question 1.3.4.


Suppose we have three cards. Card 1 has two black faces, one on each side; Card 2 has two
white faces; and Card 3 has one white face and one black face. You select a card at random
and place it on the table. You find that it is black on the face-up side. What is the probability
that the face-down side of the card is also black?
(a) Use your intuition to argue that the probability that the face-down side of the card is also
black is 1/2. Why might it be greater than 1/2?
(b) Express the probabilities and conditional probabilities that you find easy to estimate (for
example, 𝑃 (𝐶𝐷 = Black)), in terms of the following variables:

𝐼 = Identity of the card selected (Card 1, Card 2, or Card 3)


𝐶𝐷 = Color of the face-down side (Black, White)
𝐶𝑈 = Color of the face-up side (Black, White)

Find the probability that the face-down side of the selected card is black, using your
estimates above.
(c) Use Bayes’ theorem to find the correct probability of a randomly selected card’s back
being black if you observe that its front is black.

Solution to study question 1.3.4

Part (a)

The face-up side is black, so the card is either Card 1 or Card 3. Since the cards had equal
probabilities of being selected, intuition suggests that the probability that the face-down side
is also black is 1/2. However, the cards do not have equal probabilities conditioned on the
evidence; if the face-up side is black, the card is more likely to be Card 1, so the probability
that the face-down side is also black is greater than 1/2.

Part (b)

Since we don’t know which card is face-up, we’ll use the law of total probability indexing on
the card number to compute our quantity of interest.

$$
\begin{aligned}
P(C_D = \text{Black}) &= P(C_D = \text{Black} \mid I = 1)P(I = 1) + P(C_D = \text{Black} \mid I = 2)P(I = 2) \\
&\quad + P(C_D = \text{Black} \mid I = 3)P(I = 3) \\
&= 1 \cdot 1/3 + 0 \cdot 1/3 + 1/2 \cdot 1/3 = 1/2
\end{aligned}
$$

Part (c)

This is a straightforward application of Bayes' theorem:

$$
P(I = 1 \mid C_U = \text{Black}) = \frac{P(C_U = \text{Black} \mid I = 1)\,P(I = 1)}{P(C_U = \text{Black})} = \frac{1 \cdot 1/3}{1/2} = 2/3
$$

Since Card 1 is the only card that is black on both sides, the probability that the face-down side is also black is $P(C_D = \text{Black} \mid C_U = \text{Black}) = P(I = 1 \mid C_U = \text{Black}) = 2/3$.
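
A quick Monte Carlo check of this answer; a sketch in R with faces encoded as 1 = black, 0 = white:

```r
set.seed(1)
cards <- list(c(1, 1), c(0, 0), c(1, 0))    # Cards 1, 2, and 3

n  <- 1e5
up <- down <- numeric(n)
for (i in seq_len(n)) {
  faces   <- sample(cards[[sample(3, 1)]])  # draw a card, randomize which face is up
  up[i]   <- faces[1]
  down[i] <- faces[2]
}
mean(down[up == 1])   # P(face-down black | face-up black) ~ 2/3
```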

Study question 1.3.5 (Monty Hall).


Prove, using Bayes’ theorem, that switching doors improves your chances of winning the car
in the Monty Hall problem.

Solution to study question 1.3.5

We’ll adopt the variable labeling used in the text, where each may have values indicating
doors 𝐴, 𝐵, 𝐶: Let 𝑋 indicate the door chosen by the player, 𝑌 indicate the door hiding the
car, and 𝑍 indicate the door opened by the host. We want to prove that:

𝑃 (𝑌 = 𝐴|𝑋 = 𝐴, 𝑍 = 𝐶) < 𝑃 (𝑌 = 𝐵|𝑋 = 𝐴, 𝑍 = 𝐶)



So, we'll compute the components of this expression necessary to illustrate this inequality,
and then combine them.

$$
\begin{aligned}
P(Z = C \mid X = A) &= P(Z = C \mid X = A, Y = A)P(Y = A) \\
&\quad + P(Z = C \mid X = A, Y = B)P(Y = B) \\
&\quad + P(Z = C \mid X = A, Y = C)P(Y = C) \\
&= 1/2 \cdot 1/3 + 1 \cdot 1/3 + 0 \cdot 1/3 = 1/2
\end{aligned}
$$

$$
P(Y = A \mid X = A, Z = C) = \frac{P(Z = C \mid X = A, Y = A)\,P(Y = A \mid X = A)}{P(Z = C \mid X = A)} = \frac{(1/2)(1/3)}{1/2} = 1/3
$$

$$
P(Y = B \mid X = A, Z = C) = 1 - P(Y = A \mid X = A, Z = C) - P(Y = C \mid X = A, Z = C) = 1 - 1/3 - 0 = 2/3
$$

Thus, switching doors doubles the chances of winning the car.
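
The same conclusion follows from simulation; a sketch in R in which the player always chooses door A:

```r
set.seed(1)
doors <- c("A", "B", "C")
n <- 1e5
stay_wins <- switch_wins <- logical(n)

for (i in seq_len(n)) {
  car  <- sample(doors, 1)                          # Y: door hiding the car
  pick <- "A"                                       # X: player's choice
  open <- sample(setdiff(doors, c(pick, car)), 1)   # Z: host avoids pick and car
  stay_wins[i]   <- (pick == car)
  switch_wins[i] <- (setdiff(doors, c(pick, open)) == car)
}
c(stay = mean(stay_wins), switch = mean(switch_wins))  # ~ 1/3 vs ~ 2/3
```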

Study question 1.3.6.

(a) Prove that, in general, both 𝜎𝑋𝑌 and 𝜌𝑋𝑌 vanish when 𝑋 and 𝑌 are independent.
[Hint: Use Eqs. (1.16) and (1.17).]
(b) Give an example of two variables that are highly dependent and, yet, their correlation
coefficient vanishes.

Solution to study question 1.3.6


Part (a)
By assumption, X and Y are independent, allowing us to write:

$$
E(XY) = \sum_{x,y} xy \, P(x, y) = \sum_{x,y} xy \, P(x)P(y) = \sum_x x P(x) \cdot \sum_y y P(y) = E(X)E(Y)
$$

Using this decomposition, we can now show:

$$
\sigma_{XY} = E[(X - E(X))(Y - E(Y))] = E(XY) - 2E(X)E(Y) + E(X)E(Y) = E(XY) - E(X)E(Y) = 0
$$

and since $\rho_{XY} = \sigma_{XY}/(\sigma_X \sigma_Y)$, it follows that $\rho_{XY} = 0$ as well.
Part (b)
Consider an abstract gambling game with a player and "the house" (e.g., a casino dealer). Let
X represent the possible winnings/losses of the player and Y represent the winnings/losses
of the house, such that X ∈ {−1, 1}, Y ∈ {−1, 0, 1}. In this game, the winnings of the house
depend on the winnings of the player, as illustrated in the table that follows.
Furthermore, let P(X = −1) = P(X = 1) = 0.5, so that E(X) = 0.

P(Y | X)    X = −1    X = 1
Y = −1      0.5       0
Y = 0       0         1
Y = 1       0.5       0

Above, we see that X and Y are dependent (P(Y = 1 | X = −1) ≠ P(Y = 1 | X = 1)), yet:

$$
\begin{aligned}
E(XY) &= \sum_x \sum_y xy \, P(y \mid x)P(x) \\
&= (-1)(-1)(0.5)(0.5) + (-1)(0)(0)(0.5) + (-1)(1)(0.5)(0.5) \\
&\quad + (1)(-1)(0)(0.5) + (1)(0)(1)(0.5) + (1)(1)(0)(0.5) \\
&= 0.25 - 0.25 = 0
\end{aligned}
$$

Since E(XY) = 0 and E(X)E(Y) = 0, we have Cov(X, Y) = 0. Thus, we have found a scenario
in which two variables are dependent, but their correlation coefficient vanishes.
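
We can confirm this example numerically by enumerating the joint distribution; a sketch in R using the table above:

```r
px  <- c(0.5, 0.5)                  # P(x) for x = -1, 1
pyx <- rbind(c(0.5, 0),             # P(y | x): rows y = -1, 0, 1
             c(0,   1),             #           cols x = -1, 1
             c(0.5, 0))
xs <- c(-1, 1)
ys <- c(-1, 0, 1)

joint <- sweep(pyx, 2, px, `*`)     # P(x, y) = P(y | x) P(x)
EX    <- sum(xs * colSums(joint))   # 0
EY    <- sum(ys * rowSums(joint))   # 0
EXY   <- sum(outer(ys, xs) * joint) # 0
EXY - EX * EY                       # covariance vanishes despite dependence
```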

Study question 1.3.7.


Two fair coins are flipped simultaneously to determine the payoffs of two players in the town’s
casino. Player 1 wins a dollar if and only if at least one coin lands on head. Player 2 receives
a dollar if and only if the two coins land on the same face. Let 𝑋 stand for the payoff of
Player 1 and 𝑌 for the payoff of Player 2.
(a) Find and describe the probability distributions
𝑃 (𝑥), 𝑃 (𝑦), 𝑃 (𝑥, 𝑦), 𝑃 (𝑦|𝑥) and 𝑃 (𝑥|𝑦)

(b) Using the descriptions in (a), compute the following measures:

𝐸[𝑋], 𝐸[𝑌 ], 𝐸[𝑌 |𝑋 = 𝑥], 𝐸[𝑋|𝑌 = 𝑦]

𝑉 𝑎𝑟(𝑋), 𝑉 𝑎𝑟(𝑌 ), 𝐶𝑜𝑣(𝑋, 𝑌 ), 𝜌𝑋𝑌

(c) Given that Player 2 won a dollar, what is your best guess of Player 1’s payoff?
(d) Given that Player 1 won a dollar, what is your best guess of Player 2’s payoff?
(e) Are there two events, 𝑋 = 𝑥 and 𝑌 = 𝑦, that are mutually independent?

Solution to study question 1.3.7

Let X and Y stand for the winnings of Player 1 and Player 2, respectively.

Part (a)
The descriptions of these distributions are as follows:
𝑃 (𝑥): The probability that player 1 gets 𝑥 dollars.
𝑃 (𝑦): The probability that player 2 gets 𝑦 dollars.
𝑃 (𝑥, 𝑦): The probability that player 1 gets 𝑥 dollars and player 2 gets 𝑦 dollars.
𝑃 (𝑦|𝑥): The probability that player 2 gets 𝑦 dollars given that player 1 gets 𝑥 dollars.
𝑃 (𝑥|𝑦): The probability that player 1 gets 𝑥 dollars given that player 2 gets 𝑦 dollars.

Part (b)
We’ll compute each measure by its definition, using the fact that each coin flip is fair and
independent:

First, observe that Player 1 wins a dollar if at least 1 of the coins lands on heads. Another
way to think about this scenario is that Player 1 loses if both coins land on tails, which we
can subtract from 1 to find the probability of them winning. Specifically:

$$
P(X = 1) = 1 - P(X = 0) = 1 - P(\text{tails}_1)P(\text{tails}_2) = 1 - \frac{1}{2} \cdot \frac{1}{2} = \frac{3}{4}
$$

Computing the expected value follows from Eq. (1.10), summing over all outcomes and their associated probabilities:

$$
E[X] = \sum_x x \, P(x) = 1 \cdot P(X = 1) + 0 \cdot P(X = 0) = 3/4
$$

We’ll use a similar approach to computing the winning probability for Player 2 as well as the
expected value of their winnings. Observe that the winning conditions for Player 2 are when
both coins land on the same face, specifically:

$$
P(Y = 1) = P(\text{heads}_1)P(\text{heads}_2) + P(\text{tails}_1)P(\text{tails}_2) = \frac{1}{2} \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{2}
$$

$$
E[Y] = \sum_y y \, P(y) = 1 \cdot P(Y = 1) + 0 \cdot P(Y = 0) = 1/2
$$

To compute the conditional expected values, we will use Eq. (1.13), which sums over all
possible values of the query and weights each by its conditional probability:

$$
E[Y \mid X = x] = \sum_y y \, P(y \mid X = x) = 1 \cdot P(Y = 1 \mid X = x) + 0 \cdot P(Y = 0 \mid X = x) = P(Y = 1 \mid X = x)
$$

$$
E[X \mid Y = y] = \sum_x x \, P(x \mid Y = y) = 1 \cdot P(X = 1 \mid Y = y) + 0 \cdot P(X = 0 \mid Y = y) = P(X = 1 \mid Y = y)
$$

Next, we can compute the variances of each variable using Eq. (1.15), their covariance using
Eq. (1.16), and their correlation coefficient using Eq. (1.17).

$$
\begin{aligned}
Var(X) &= E[(X - 3/4)^2] = (1 - 3/4)^2 \cdot P(X = 1) + (0 - 3/4)^2 \cdot P(X = 0) \\
&= \frac{1}{16} \cdot \frac{3}{4} + \frac{9}{16} \cdot \frac{1}{4} = \frac{3}{16}
\end{aligned}
$$

$$
\begin{aligned}
Var(Y) &= E[(Y - 1/2)^2] = (1 - 1/2)^2 \cdot P(Y = 1) + (0 - 1/2)^2 \cdot P(Y = 0) \\
&= \frac{1}{4} \cdot \frac{1}{2} + \frac{1}{4} \cdot \frac{1}{2} = \frac{1}{4}
\end{aligned}
$$

$$
\begin{aligned}
Cov(X, Y) &= E[(X - 3/4)(Y - 1/2)] \\
&= (1/4)(1/2)\,P(X = 1, Y = 1) + (-3/4)(1/2)\,P(X = 0, Y = 1) \\
&\quad + (1/4)(-1/2)\,P(X = 1, Y = 0) + (-3/4)(-1/2)\,P(X = 0, Y = 0) \\
&= (1/4)(1/2)(1/4) - (3/4)(1/2)(1/4) - (1/4)(1/2)(1/2) + 0 = -\frac{1}{8}
\end{aligned}
$$

$$
\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = \frac{-1/8}{\sqrt{3/16}\sqrt{1/4}} = -1/\sqrt{3}
$$
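
Since the sample space has only four equally likely outcomes, all of these measures can be verified by direct enumeration; a sketch in R:

```r
coins <- expand.grid(c1 = 0:1, c2 = 0:1)        # four equally likely flips (H = 1)
X <- as.numeric(coins$c1 == 1 | coins$c2 == 1)  # Player 1: at least one head
Y <- as.numeric(coins$c1 == coins$c2)           # Player 2: both faces match

EX <- mean(X); EY <- mean(Y)                    # 3/4 and 1/2
covXY <- mean(X * Y) - EX * EY                  # -1/8
varX  <- mean(X^2) - EX^2                       # 3/16
varY  <- mean(Y^2) - EY^2                       # 1/4
covXY / sqrt(varX * varY)                       # -1/sqrt(3) = -0.577
```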

Part (c)
To answer this query, we know that if both 𝑋 = 1 and 𝑌 = 1, then the outcome of the two
coins must have been both heads, meaning that 𝑃 (𝑋 = 1, 𝑌 = 1) = 1/4. Furthermore, we
can phrase our query as 𝐸[𝑋|𝑌 = 1], since we are interested in the expectation of Player 1’s
winnings having observed that Player 2 won a dollar. Combining this knowledge with our
solution to each conditional expected value from part (b) above, we have:

$$
E[X \mid Y = 1] = P(X = 1 \mid Y = 1) = \frac{P(X = 1, Y = 1)}{P(Y = 1)} = \frac{1/4}{1/2} = 1/2
$$

Part (d)
We use the same strategy as in part (c) above, and have:
$$
E[Y \mid X = 1] = P(Y = 1 \mid X = 1) = \frac{P(X = 1, Y = 1)}{P(X = 1)} = \frac{1/4}{3/4} = 1/3
$$

Part (e)
Consider what we know about the joint events:
𝑃 (𝑋 = 1, 𝑌 = 1) = 1/4
𝑃 (𝑋 = 0, 𝑌 = 1) = 1/4
𝑃 (𝑋 = 1, 𝑌 = 0) = 1/2
𝑃 (𝑋 = 0, 𝑌 = 0) = 0
Now, examining their priors, we have:
𝑃 (𝑋 = 1) = 3/4
𝑃 (𝑋 = 0) = 1/4
𝑃 (𝑌 = 1) = 𝑃 (𝑌 = 0) = 1/2
Plainly, there are no two values for 𝑋 and 𝑌 such that the product of their priors will equal
their joint, i.e., for no two values 𝑋 = 𝑥, 𝑌 = 𝑦 do we have: 𝑃 (𝑌 = 𝑦, 𝑋 = 𝑥) = 𝑃 (𝑌 =
𝑦) * 𝑃 (𝑋 = 𝑥). Therefore, we conclude that there are no two mutually independent events.

Study question 1.3.8.


Compute the following theoretical measures of the outcome of a single game of craps (one
roll of two independent dice), where 𝑋 stands for the outcome of Die 1, 𝑍 for the outcome of
Die 2, and 𝑌 for their sum.
(a)

𝐸[𝑋], 𝐸[𝑌 ], 𝐸[𝑌 |𝑋 = 𝑥], 𝐸[𝑋|𝑌 = 𝑦], for each value of 𝑥 and 𝑦, and
𝑉 𝑎𝑟(𝑋), 𝑉 𝑎𝑟(𝑌 ), 𝐶𝑜𝑣(𝑋, 𝑌 ), 𝜌𝑋𝑌 , 𝐶𝑜𝑣(𝑋, 𝑍)

Table 1.6 describes the outcomes of 12 craps games.

Table 1.6: Results of 12 rolls of two fair dice

          X (Die 1)   Z (Die 2)   Y (Sum)
Roll 1 6 3 9
Roll 2 3 4 7
Roll 3 4 6 10
Roll 4 6 2 8
Roll 5 6 4 10
Roll 6 5 3 8
Roll 7 1 5 6
Roll 8 3 5 8
Roll 9 6 5 11
Roll 10 3 5 8
Roll 11 5 3 8
Roll 12 4 5 9

(b) Find the sample estimates of the measures computed in (a), based on the data from
Table 1.6. [Hint: Many software packages are available for doing this computation for
you.]
(c) Use the results in (a) to determine the best estimate of the sum, 𝑌 , given that we measured
𝑋 = 3.
(d) What is the best estimate of 𝑋, given that we measured 𝑌 = 4?
(e) What is the best estimate of 𝑋, given that we measured 𝑌 = 4 and 𝑍 = 1? Explain why
it is not the same as in (d).

Solution to study question 1.3.8


Part (a)
Because we are playing craps, the outcomes of the two dice, X and Z, are, by assumption,
independent. However, Y is independent of neither X nor Z, since it represents their sum.
So, let us first deduce the expected value of each variable individually, exploiting the fact
that each die takes one of six equally likely integer values between 1 and 6.

$$
E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = 3.5 = E[Z]
$$

$$
E[Y] = E[X + Z] = E[X] + E[Z] = 7
$$
Next we consider E[Y | X = x] for each value x. To determine these quantities, we again
exploit the facts that Y = X + Z and that X and Z are independent (so that E[Z | X = x] = E[Z] = 3.5):

$$
E[Y \mid X = x] = E[X + Z \mid X = x] = E[X \mid X = x] + E[Z \mid X = x] = x + 3.5
$$

giving E[Y | X = 1] = 4.5, E[Y | X = 2] = 5.5, and so on, up to E[Y | X = 6] = 9.5.

By a similar token, we consider the quantities E[X | Y = y] for each y. These can be computed
by the same method, exploiting the knowledge that:

$$
E[X \mid Y = y] = E[Y - Z \mid Y = y] = E[Y \mid Y = y] - E[Z \mid Y = y] = y - E[Z \mid Y = y]
$$

By symmetry of the two dice, E[Z | Y = y] = E[X | Y = y], so

$$
2\,E[X \mid Y = y] = y \quad\Longrightarrow\quad E[X \mid Y = y] = y/2
$$

So, using this derivation, we see that:

𝐸[𝑋|𝑌 = 2] = 1
𝐸[𝑋|𝑌 = 3] = 1.5
𝐸[𝑋|𝑌 = 4] = 2
𝐸[𝑋|𝑌 = 5] = 2.5
𝐸[𝑋|𝑌 = 6] = 3
𝑒𝑡𝑐.

Next we compute the variances of our variables, which is a simple application of Eq. (1.15):

$$
\begin{aligned}
Var(X) &= E[(X - E[X])^2] = E[(X - 3.5)^2] \\
&= (1 - 3.5)^2 \tfrac{1}{6} + (2 - 3.5)^2 \tfrac{1}{6} + (3 - 3.5)^2 \tfrac{1}{6} + (4 - 3.5)^2 \tfrac{1}{6} + (5 - 3.5)^2 \tfrac{1}{6} + (6 - 3.5)^2 \tfrac{1}{6} \\
&= 2.917 = Var(Z)
\end{aligned}
$$

$$
\begin{aligned}
Var(Y) &= E[(Y - E[Y])^2] = E[(Y - 7)^2] = E[Y^2 - 14Y + 49] = E[Y^2] - 98 + 49 \\
&= E[X^2 + 2XZ + Z^2] - 49 = 2\left(E[X^2] + E[XZ]\right) - 49 \\
&= 2\left(\frac{91}{6} + \frac{21 \cdot 21}{36}\right) - 49 = 5.833
\end{aligned}
$$

Now knowing our variances, we can compute Cov(X, Y) and Cov(X, Z) through application
of Eq. (1.16):

$$
\begin{aligned}
Cov(X, Y) &= E[(X - E[X])(Y - E[Y])] = E[(X - 3.5)(Y - 7)] \\
&= E[XY - 3.5Y - 7X + 24.5] = E[XY] - 24.5 - 24.5 + 24.5 \\
&= E[X(X + Z)] - 24.5 = E[X^2] + E[XZ] - 24.5 \\
&= \frac{91}{6} + \frac{21 \cdot 21}{36} - 24.5 = 2.917
\end{aligned}
$$

$$
Cov(X, Z) = E[(X - 3.5)(Z - 3.5)] = E[XZ] - 3.5^2 = \frac{21 \cdot 21}{36} - 3.5^2 = 0
$$

Intuitively, we can check our answer that Cov(X, Z) = 0 because X and Z are independent
dice rolls. Finally, we compute the correlation between X and Y by Eq. (1.17):

$$
\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = \frac{2.917}{\sqrt{2.917}\sqrt{5.833}} = 0.707
$$
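
All of the theoretical values above follow from enumerating the 36 equally likely dice outcomes; a sketch in R:

```r
rolls <- expand.grid(X = 1:6, Z = 1:6)   # the 36 equally likely outcomes
X <- rolls$X; Z <- rolls$Z; Y <- X + Z

EX    <- mean(X); EY <- mean(Y)          # 3.5 and 7
varX  <- mean(X^2) - EX^2                # 2.917
varY  <- mean(Y^2) - EY^2                # 5.833
covXY <- mean(X * Y) - EX * EY           # 2.917
covXZ <- mean(X * Z) - EX * mean(Z)      # 0
covXY / sqrt(varX * varY)                # 0.707
tapply(Y, X, mean)                       # E[Y | X = x] = x + 3.5
```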

Part (b)
You can use standard statistical software, e.g., R, MATLAB, or Python (NumPy), to calculate
the quantities of interest from the sample in Table 1.6. The same computational strategies we
used in part (a) apply, except that now our frequencies come from the data rather than from
our analysis of craps. Specifically, we get:

𝐸[𝑋] = 4.33
𝐸[𝑌 ] = 8.5
𝑉 𝑎𝑟(𝑋) = 2.389
𝑉 𝑎𝑟(𝑌 ) = 1.75
𝐶𝑜𝑣(𝑋, 𝑌 ) = 1.545
𝐶𝑜𝑣(𝑋, 𝑍) = −1.06
𝜌𝑋𝑌 = 0.756
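
For instance, in R. Note that the figures above mix conventions: the variances use the population (divide-by-n) convention, while the covariances match R's sample (divide-by-(n − 1)) convention; the sketch below reproduces them exactly:

```r
X <- c(6, 3, 4, 6, 6, 5, 1, 3, 6, 3, 5, 4)   # Die 1, from Table 1.6
Z <- c(3, 4, 6, 2, 4, 3, 5, 5, 5, 5, 3, 5)   # Die 2
Y <- X + Z
n <- length(X)

c(EX = mean(X), EY = mean(Y))                # 4.33 and 8.5
var(X) * (n - 1) / n                         # 2.389 (population convention)
var(Y) * (n - 1) / n                         # 1.75
cov(X, Y)                                    # 1.545 (sample convention)
cov(X, Z)                                    # -1.06
cov(X, Y) / sqrt(var(X) * var(Y) * ((n - 1) / n)^2)  # 0.756
```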
Part (c)
From our computations in part (a), we have 𝐸[𝑌 |𝑋 = 3] = 6.5

Part (d)
From our computations in part (a), we have 𝐸[𝑋|𝑌 = 4] = 2

Part (e)
We can compute 𝐸[𝑋|𝑌 = 4, 𝑍 = 1] by application of Eq. (1.13), summing over the six
possible values of 𝑋 and the probabilities of 𝑌 = 4, 𝑍 = 1 associated with each:
$$
\begin{aligned}
E[X \mid Y = 4, Z = 1] &= \sum_x x \, P(X = x \mid Y = 4, Z = 1) \\
&= 1 \cdot 0 + 2 \cdot 0 + 3 \cdot 1 + 4 \cdot 0 + 5 \cdot 0 + 6 \cdot 0 = 3
\end{aligned}
$$

Intuitively, this quantity is not the same as in part (d) because knowing that Z = 1 precludes
some values of X, given that Y = X + Z = 4. For example, in (d) we allowed for the
possibility that X = 2 (with Z = 2 making the sum Y = 4), which is impossible here
given that Z = 1.

Study question 1.3.9.

(a) Prove Eq. (1.22) using the orthogonality principle. [Hint: Follow the treatment of Eq.
(1.26).]

(b) Find all partial regression coefficients

𝑅𝑌 𝑋·𝑍 , 𝑅𝑋𝑌 ·𝑍 , 𝑅𝑌 𝑍·𝑋 , 𝑅𝑍𝑌 ·𝑋 , 𝑅𝑋𝑍·𝑌 , and 𝑅𝑍𝑋·𝑌

for the craps game described in Study question 1.3.8. [Hint: Apply Eq. (1.27) and use
the variances and covariances computed for part (a) of this question.]

Solution to study question 1.3.9

Part (a)
Write the best linear approximation of Y as in Eq. (1.21), $Y = a + bX + \epsilon$, where, by the
orthogonality principle, the residual satisfies $E[\epsilon] = 0$ and $E[X\epsilon] = 0$. Taking
expectations of both sides gives:

$$
E[Y] = a + b\,E[X]
$$

Multiplying both sides by X before taking expectations (following the treatment of Eq. (1.26)),
and using $E[X\epsilon] = 0$, we have:

$$
E[XY] = a\,E[X] + b\,E[X^2]
$$

Substituting $a = E[Y] - b\,E[X]$ from the first equation into the second yields
$E[XY] - E[X]E[Y] = b\,(E[X^2] - E^2[X])$, which proves Eq. (1.22):

$$
b = \frac{E[XY] - E[X]E[Y]}{E[X^2] - E^2[X]} = \frac{\sigma_{XY}}{\sigma_X^2}
$$

Part (b)
From our answer to study question 1.3.8, we have $\sigma_{XY} = \sigma_{ZY} = 2.917$, $\sigma_{XZ} = 0$, $\sigma_Y^2 = 5.833$, and $\sigma_X^2 = \sigma_Z^2 = 2.917$.

Using the above, we can apply Eq. (1.27) to compute each partial regression coefficient:

$$
\begin{aligned}
R_{YX \cdot Z} &= \frac{\sigma_Z^2 \sigma_{YX} - \sigma_{YZ}\sigma_{ZX}}{\sigma_X^2 \sigma_Z^2 - \sigma_{XZ}^2} = \frac{2.917^2 - 0}{2.917^2 - 0} = 1 \\
R_{XY \cdot Z} &= \frac{\sigma_Z^2 \sigma_{XY} - \sigma_{XZ}\sigma_{ZY}}{\sigma_Y^2 \sigma_Z^2 - \sigma_{YZ}^2} = \frac{2.917^2 - 0}{5.833 \cdot 2.917 - 2.917^2} = 1 \\
R_{YZ \cdot X} &= \frac{\sigma_X^2 \sigma_{YZ} - \sigma_{YX}\sigma_{XZ}}{\sigma_Z^2 \sigma_X^2 - \sigma_{ZX}^2} = \frac{2.917^2 - 0}{2.917^2 - 0} = 1 \\
R_{ZY \cdot X} &= \frac{\sigma_X^2 \sigma_{ZY} - \sigma_{ZX}\sigma_{XY}}{\sigma_Y^2 \sigma_X^2 - \sigma_{YX}^2} = \frac{2.917^2 - 0}{5.833 \cdot 2.917 - 2.917^2} = 1 \\
R_{XZ \cdot Y} &= \frac{\sigma_Y^2 \sigma_{XZ} - \sigma_{XY}\sigma_{YZ}}{\sigma_Z^2 \sigma_Y^2 - \sigma_{ZY}^2} = \frac{0 - 2.917^2}{5.833 \cdot 2.917 - 2.917^2} = -1 \\
R_{ZX \cdot Y} &= \frac{\sigma_Y^2 \sigma_{ZX} - \sigma_{ZY}\sigma_{YX}}{\sigma_Y^2 \sigma_X^2 - \sigma_{YX}^2} = \frac{0 - 2.917^2}{5.833 \cdot 2.917 - 2.917^2} = -1
\end{aligned}
$$
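
Partial regression coefficients are simply slopes in a multiple regression, so these values can be cross-checked by fitting linear models to the 36-outcome enumeration used for question 1.3.8; a sketch in R:

```r
rolls <- expand.grid(X = 1:6, Z = 1:6)          # all 36 equally likely outcomes
rolls$Y <- rolls$X + rolls$Z

coef(lm(Y ~ X + Z, data = rolls))[c("X", "Z")]  # R_{YX.Z} = 1,  R_{YZ.X} = 1
coef(lm(X ~ Y + Z, data = rolls))[c("Y", "Z")]  # R_{XY.Z} = 1,  R_{XZ.Y} = -1
coef(lm(Z ~ Y + X, data = rolls))[c("Y", "X")]  # R_{ZY.X} = 1,  R_{ZX.Y} = -1
```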
Study question 1.4.1.
Consider the graph shown in Figure 1.8:

Figure 1.8: A directed graph over the nodes X, W, Y, Z, and T, used in Study question 1.4.1,
with edges X → Y, X → W, W → Y, W → Z, Y → Z, Y → T, and Z → T

(a) Name all of the parents of 𝑍.


(b) Name all the ancestors of 𝑍.
(c) Name all the children of 𝑊 .
(d) Name all the descendants of 𝑊 .
(e) Draw all (simple) paths between 𝑋 and 𝑇 (i.e., no node should appear more than once).
(f) Draw all the directed paths between 𝑋 and 𝑇 .

Solution to study question 1.4.1

A basic interactive tutorial where students can practice their knowledge of graph termi-
nology is provided at dagitty.net/learn/graphs/

A more advanced tutorial where students can apply these terms to recognize causal con-
cepts like "mediator" and "confounder" is provided at dagitty.net/learn/graphs/roles.html

An R solution of this exercise is provided at dagitty.net/primer/1.4.1

Part (a)
Parents of 𝑍: 𝑊, 𝑌

Part (b)
Ancestors of 𝑍: 𝑋, 𝑊, 𝑌

Part (c)
Children of 𝑊 : 𝑌, 𝑍

Part (d)
Descendants of 𝑊 : 𝑌, 𝑍, 𝑇

Part (e)
Assuming cycles are not allowed, the simple paths between 𝑋 and 𝑇 are:
{𝑋, 𝑌, 𝑇 }, {𝑋, 𝑌, 𝑍, 𝑇 }, {𝑋, 𝑌, 𝑊, 𝑍, 𝑇 }, {𝑋, 𝑊, 𝑌, 𝑇 }, {𝑋, 𝑊, 𝑌, 𝑍, 𝑇 }, {𝑋, 𝑊, 𝑍, 𝑇 },
{𝑋, 𝑊, 𝑍, 𝑌, 𝑇 }

Part (f)
The directed paths between 𝑋 and 𝑇 (those following the arrows' directions) are:
{𝑋, 𝑌, 𝑇 }, {𝑋, 𝑌, 𝑍, 𝑇 }, {𝑋, 𝑊, 𝑌, 𝑇 }, {𝑋, 𝑊, 𝑌, 𝑍, 𝑇 }, {𝑋, 𝑊, 𝑍, 𝑇 }
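
All six answers can be reproduced with the DAGitty R package; a sketch using the edge list of Figure 1.8:

```r
library(dagitty)

g <- dagitty("dag {
  X -> Y   X -> W
  W -> Y   W -> Z
  Y -> Z   Y -> T
  Z -> T
}")

parents(g, "Z")                            # W, Y
setdiff(ancestors(g, "Z"), "Z")            # X, W, Y (dagitty includes Z itself)
children(g, "W")                           # Y, Z
setdiff(descendants(g, "W"), "W")          # Y, Z, T
paths(g, "X", "T")$paths                   # all simple paths
paths(g, "X", "T", directed = TRUE)$paths  # directed paths only
```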

Study question 1.5.1.


Suppose we have the following SCM. Assume all exogenous variables are independent and
that the expected value of each is 0.
SCM 1.5.1.

$$
V = \{X, Y, Z\}, \quad U = \{U_X, U_Y, U_Z\}, \quad F = \{f_X, f_Y, f_Z\}
$$

$$
f_X: X = U_X, \qquad f_Y: Y = \frac{X}{3} + U_Y, \qquad f_Z: Z = \frac{Y}{16} + U_Z
$$

(a) Draw the graph that complies with the model.


(b) Determine the best guess of the value (expected value) of 𝑍, given that we observe 𝑌 = 3.

(c) Determine the best guess of the value of 𝑍, given that we observe 𝑋 = 3.
(d) Determine the best guess of the value of 𝑍, given that we observe 𝑋 = 1 and 𝑌 = 3.
(e) Assume that all exogenous variables are normally distributed with zero means and unit
variance, that is, 𝜎 = 1.
(i) Determine the best guess of 𝑋, given that we observed 𝑌 = 2.
(ii) (Advanced) Determine the best guess of 𝑌 , given that we observed 𝑋 = 1 and
𝑍 = 3. [Hint: You may wish to use the technique of multiple regression, together
with the fact that, for every three normally distributed variables, say 𝑋, 𝑌 , and 𝑍,
we have 𝐸[𝑌 |𝑋 = 𝑥, 𝑍 = 𝑧] = 𝑅𝑌 𝑋·𝑍 𝑥 + 𝑅𝑌 𝑍·𝑋 𝑧.]

Solution to study question 1.5.1

An R solution of this exercise is provided at dagitty.net/primer/1.5.1

Part (a)
The following graph complies with the given model:

X → Y → Z, with U_X → X, U_Y → Y, and U_Z → Z
Part (b)
Assuming that 𝑈𝑋 , 𝑈𝑌 , 𝑈𝑍 are independent and have 0 means, we have:

$$
E[Z \mid Y = 3] = E[Y/16 + U_Z \mid Y = 3] = \frac{E[Y \mid Y = 3]}{16} + E[U_Z \mid Y = 3] = \frac{3}{16} + 0 = \frac{3}{16}
$$
Part (c)

$$
\begin{aligned}
E[Z \mid X = 3] &= E[Y/16 + U_Z \mid X = 3] = \frac{E[Y \mid X = 3]}{16} + E[U_Z \mid X = 3] \\
&= \frac{E[X/3 + U_Y \mid X = 3]}{16} + 0 = \frac{3/3 + 0}{16} = \frac{1}{16}
\end{aligned}
$$

Part (d)

$$
E[Z \mid X = 1, Y = 3] = E[Y/16 + U_Z \mid X = 1, Y = 3] = \frac{E[Y \mid X = 1, Y = 3]}{16} + 0 = \frac{3}{16}
$$
(e.i)
We must be careful here: writing $E[X \mid Y = 2] = 3E[Y \mid Y = 2] - 3E[U_Y \mid Y = 2]$
is valid, but $E[U_Y \mid Y = 2] \neq 0$, because $U_Y$ and $Y$ are dependent. Since all
variables are normally distributed with zero means, the best guess of X given Y = y is the
regression slope times y:

$$
E[X \mid Y = 2] = \frac{\sigma_{XY}}{\sigma_Y^2} \cdot 2 = \frac{1/3}{10/9} \cdot 2 = 0.6
$$

where $\sigma_{XY} = Cov(X, X/3 + U_Y) = 1/3$ and $\sigma_Y^2 = 1/9 + 1 = 10/9$ (these
values reappear in (e.ii) below).
(e.ii)
Following the hint, for normally distributed variables we have $E[Y \mid X = x, Z = z] = R_{YX \cdot Z}\,x + R_{YZ \cdot X}\,z$, where, by Eq. (1.27),

$$
R_{YX \cdot Z} = \frac{\sigma_Z^2 \sigma_{YX} - \sigma_{YZ}\sigma_{XZ}}{\sigma_X^2 \sigma_Z^2 - \sigma_{XZ}^2}, \qquad
R_{YZ \cdot X} = \frac{\sigma_X^2 \sigma_{YZ} - \sigma_{YX}\sigma_{XZ}}{\sigma_X^2 \sigma_Z^2 - \sigma_{XZ}^2}
$$

Since $\sigma_X^2 = 1$, $\sigma_Y^2 = 10/9$, $\sigma_Z^2 = 10/(256 \cdot 9) + 1$, $\sigma_{XY} = 1/3$, $\sigma_{YZ} = 5/72$, and $\sigma_{XZ} = 1/48$, we obtain $R_{YX \cdot Z} = 256/771$ and $R_{YZ \cdot X} = 48/771$, and therefore

$$
E[Y \mid X = 1, Z = 3] = \frac{256}{771} \cdot 1 + \frac{48}{771} \cdot 3 = \frac{400}{771} \approx 0.519
$$
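
Both parts of (e) can be checked by simulating the SCM and regressing; a sketch in R with standard normal exogenous terms:

```r
set.seed(1)
n <- 1e6
X <- rnorm(n)                      # X = U_X
Y <- X / 3  + rnorm(n)             # Y = X/3  + U_Y
Z <- Y / 16 + rnorm(n)             # Z = Y/16 + U_Z

coef(lm(X ~ Y))["Y"] * 2           # (e.i)  E[X | Y = 2]        ~ 0.6
fit <- lm(Y ~ X + Z)
sum(coef(fit) * c(1, 1, 3))        # (e.ii) E[Y | X = 1, Z = 3] ~ 400/771 = 0.519
```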

Study question 1.5.2.


Assume that a population of patients contains a fraction 𝑟 of individuals who suffer from a
certain fatal syndrome 𝑍, which simultaneously makes it uncomfortable for them to take a
life-prolonging drug 𝑋 (Figure 1.8).

Let 𝑍 = 𝑧1 and 𝑍 = 𝑧0 represent, respectively, the presence and absence of the syndrome,
𝑌 = 𝑦1 and 𝑌 = 𝑦0 represent death and survival, respectively, and 𝑋 = 𝑥1 and 𝑋 = 𝑥0
represent taking and not taking the drug. Assume that patients not carrying the syndrome,
𝑍 = 𝑧0 , die with probability 𝑝2 if they take the drug and with probability 𝑝1 if they don’t.
Patients carrying the syndrome, 𝑍 = 𝑧1 , on the other hand, die with probability 𝑝3 if they
do not take the drug and with probability p4 if they do take the drug. Further, patients having
the syndrome are more likely to avoid the drug, with probabilities q1 = P(x1 | z0) and
q2 = P(x1 | z1).
(a) Based on this model, compute the joint distributions 𝑃 (𝑥, 𝑦, 𝑧), 𝑃 (𝑥, 𝑦), 𝑃 (𝑥, 𝑧) and
𝑃 (𝑦, 𝑧) for all values of x, 𝑦, and 𝑧, in terms of the parameters (𝑟, 𝑝1 , 𝑝2 , 𝑝3 , 𝑝4 , 𝑞1 , 𝑞2 ).
[Hint: Use the product decomposition of Section 1.5.2.]

(b) Calculate the difference 𝑃 (𝑦1 |𝑥1 ) − 𝑃 (𝑦1 |𝑥𝑜 ) for three populations: (1) those carrying
the syndrome, (2) those not carrying the syndrome and (3) the population as a whole.

(c) Using your results for (b), find a combination of parameters that exhibits Simpson’s
reversal.

Figure 1.8: Model showing an unobserved syndrome, Z, affecting both treatment (X) and
outcome (Y); its edges are X ← Z → Y and X → Y

Solution to study question 1.5.2

Part (a)

The following two tables describe the given conditional distributions (each row lists the
probability of the indicated value assignment):

P(y | x, z)    y    x    z
p1             1    0    0
p2             1    1    0
p3             1    0    1
p4             1    1    1

P(x | z)    x    z
q1          1    0
q2          1    1

We also have that P(z1) = r.

By the chain rule, we know that P(x, y, z) = P(y | x, z) P(x | z) P(z).

So, substituting the table rows into the above factorization, we have:

𝑃 (𝑥0 , 𝑦0 , 𝑧0 ) = (1 − 𝑝1 )(1 − 𝑞1 )(1 − 𝑟)


𝑃 (𝑥0 , 𝑦0 , 𝑧1 ) = (1 − 𝑝3 )(1 − 𝑞2 )(𝑟)
𝑃 (𝑥0 , 𝑦1 , 𝑧0 ) = (𝑝1 )(1 − 𝑞1 )(1 − 𝑟)
𝑃 (𝑥0 , 𝑦1 , 𝑧1 ) = (𝑝3 )(1 − 𝑞2 )(𝑟)
𝑃 (𝑥1 , 𝑦0 , 𝑧0 ) = (1 − 𝑝2 )(𝑞1 )(1 − 𝑟)
𝑃 (𝑥1 , 𝑦0 , 𝑧1 ) = (1 − 𝑝4 )(𝑞2 )(𝑟)
𝑃 (𝑥1 , 𝑦1 , 𝑧0 ) = (𝑝2 )(𝑞1 )(1 − 𝑟)
𝑃 (𝑥1 , 𝑦1 , 𝑧1 ) = (𝑝4 )(𝑞2 )(𝑟)

By marginalization, we know that: 𝑃 (𝑥, 𝑦) = Σ𝑧 𝑃 (𝑥, 𝑦, 𝑧)


So, again substituting our table rows into the above, the joint distribution of 𝑋, 𝑌
reads:

𝑃 (𝑥0 , 𝑦0 ) = 𝑃 (𝑥0 , 𝑦0 , 𝑧0 ) + 𝑃 (𝑥0 , 𝑦0 , 𝑧1 ) = (1 − 𝑝1 )(1 − 𝑞1 )(1 − 𝑟) + (1 − 𝑝3 )(1 − 𝑞2 )(𝑟)


𝑃 (𝑥0 , 𝑦1 ) = 𝑃 (𝑥0 , 𝑦1 , 𝑧0 ) + 𝑃 (𝑥0 , 𝑦1 , 𝑧1 ) = (𝑝1 )(1 − 𝑞1 )(1 − 𝑟) + (𝑝3 )(1 − 𝑞2 )(𝑟)
𝑃 (𝑥1 , 𝑦0 ) = 𝑃 (𝑥1 , 𝑦0 , 𝑧0 ) + 𝑃 (𝑥1 , 𝑦0 , 𝑧1 ) = (1 − 𝑝2 )(𝑞1 )(1 − 𝑟) + (1 − 𝑝4 )(𝑞2 )(𝑟)
𝑃 (𝑥1 , 𝑦1 ) = 𝑃 (𝑥1 , 𝑦1 , 𝑧0 ) + 𝑃 (𝑥1 , 𝑦1 , 𝑧1 ) = (𝑝2 )(𝑞1 )(1 − 𝑟) + (𝑝4 )(𝑞2 )(𝑟)

Similarly, for 𝑋, 𝑍 we know that: 𝑃 (𝑥, 𝑧) = Σ𝑦 𝑃 (𝑥, 𝑦, 𝑧)

𝑃 (𝑥0 , 𝑧0 ) = (1 − 𝑝1 )(1 − 𝑞1 )(1 − 𝑟) + (𝑝1 )(1 − 𝑞1 )(1 − 𝑟)


𝑃 (𝑥0 , 𝑧1 ) = (1 − 𝑝3 )(1 − 𝑞2 )(𝑟) + (𝑝3 )(1 − 𝑞2 )(𝑟)
𝑃 (𝑥1 , 𝑧0 ) = (1 − 𝑝2 )(𝑞1 )(1 − 𝑟) + (𝑝2 )(𝑞1 )(1 − 𝑟)
𝑃 (𝑥1 , 𝑧1 ) = (1 − 𝑝4 )(𝑞2 )(𝑟) + (𝑝4 )(𝑞2 )(𝑟)

And for 𝑌, 𝑍, we know that: 𝑃 (𝑦, 𝑧) = Σ𝑥 𝑃 (𝑥, 𝑦, 𝑧)

𝑃 (𝑦0 , 𝑧0 ) = (1 − 𝑝1 )(1 − 𝑞1 )(1 − 𝑟) + (1 − 𝑝2 )(𝑞1 )(1 − 𝑟)


𝑃 (𝑦0 , 𝑧1 ) = (1 − 𝑝3 )(1 − 𝑞2 )(𝑟) + (1 − 𝑝4 )(𝑞2 )(𝑟)
𝑃 (𝑦1 , 𝑧0 ) = (𝑝1 )(1 − 𝑞1 )(1 − 𝑟) + (𝑝2 )(𝑞1 )(1 − 𝑟)
𝑃 (𝑦1 , 𝑧1 ) = (𝑝3 )(1 − 𝑞2 )(𝑟) + (𝑝4 )(𝑞2 )(𝑟)

Part (b)

(b.1)

𝑃 (𝑦1 |𝑥1 , 𝑧1 ) − 𝑃 (𝑦1 |𝑥0 , 𝑧1 ) = 𝑝4 − 𝑝3



(b.2)

𝑃 (𝑦1 |𝑥1 , 𝑧0 ) − 𝑃 (𝑦1 |𝑥0 , 𝑧0 ) = 𝑝2 − 𝑝1

(b.3)

$$
\begin{aligned}
P(y_1 \mid x_1) - P(y_1 \mid x_0) &= \frac{P(y_1, x_1)}{P(x_1)} - \frac{P(y_1, x_0)}{P(x_0)} \\
&= \frac{P(x_1, y_1)}{P(x_1, y_1) + P(x_1, y_0)} - \frac{P(x_0, y_1)}{P(x_0, y_1) + P(x_0, y_0)} \\
&= \frac{p_2 q_1 (1 - r) + p_4 q_2 r}{r q_2 + (1 - r) q_1} - \frac{p_1 (1 - q_1)(1 - r) + p_3 (1 - q_2) r}{r(1 - q_2) + (1 - r)(1 - q_1)}
\end{aligned}
$$

Part (c)

To elicit Simpson’s reversal, we want to find a combination of parameters such that parts
(b.1) and (b.2) above have a different sign than (b.3). As such, consider the following
parameterization:

𝑝1 = 0.1, 𝑝2 = 0, 𝑝3 = 0.3, 𝑝4 = 0.2, 𝑞1 = 0, 𝑞2 = 1, 𝑟 = 0.1

Now, substituting the above into (b.1), (b.2), and (b.3), we have:

$$
P(y_1 \mid x_1, z_1) - P(y_1 \mid x_0, z_1) = p_4 - p_3 = -0.1
$$

$$
P(y_1 \mid x_1, z_0) - P(y_1 \mid x_0, z_0) = p_2 - p_1 = -0.1
$$

$$
P(y_1 \mid x_1) - P(y_1 \mid x_0) = \frac{p_2 q_1 (1 - r) + p_4 q_2 r}{r q_2 + (1 - r) q_1} - \frac{p_1 (1 - q_1)(1 - r) + p_3 (1 - q_2) r}{r(1 - q_2) + (1 - r)(1 - q_1)} = 0.2 - 0.1 = 0.1
$$
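
This parameterization is easy to verify numerically by building the joint distribution from the product decomposition; a sketch in R:

```r
reversal <- function(p1, p2, p3, p4, q1, q2, r) {
  py1 <- matrix(c(p1, p2, p3, p4), nrow = 2)   # P(y1 | x, z): rows x, cols z
  px1 <- c(q1, q2)                             # P(x1 | z)
  joint <- function(x, y, z) {                 # P(x, y, z) by the chain rule
    pz <- if (z == 1) r else 1 - r
    px <- if (x == 1) px1[z + 1] else 1 - px1[z + 1]
    py <- if (y == 1) py1[x + 1, z + 1] else 1 - py1[x + 1, z + 1]
    pz * px * py
  }
  py1_given_x <- function(x)                   # P(y1 | x), marginalizing over z
    sum(joint(x, 1, 0), joint(x, 1, 1)) /
    sum(joint(x, 0, 0), joint(x, 0, 1), joint(x, 1, 0), joint(x, 1, 1))
  c(z1 = p4 - p3, z0 = p2 - p1,
    aggregate = py1_given_x(1) - py1_given_x(0))
}

reversal(p1 = 0.1, p2 = 0, p3 = 0.3, p4 = 0.2, q1 = 0, q2 = 1, r = 0.1)
# z1 = -0.1, z0 = -0.1, aggregate = 0.1: a Simpson's reversal
```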

Study question 1.5.3.


Consider a graph 𝑋1 → 𝑋2 → 𝑋3 → 𝑋4 of binary random variables, and assume that the
conditional probabilities between any two consecutive variables are given by

𝑃 (𝑋𝑖 = 1|𝑋𝑖−1 = 1) = 𝑝
𝑃 (𝑋𝑖 = 1|𝑋𝑖−1 = 0) = 𝑞
𝑃 (𝑋1 = 1) = 𝑝0

Compute the following probabilities

𝑃 (𝑋1 = 1, 𝑋2 = 0, 𝑋3 = 1, 𝑋4 = 0)
𝑃 (𝑋4 = 1|𝑋1 = 1)
𝑃 (𝑋1 = 1|𝑋4 = 1)
𝑃 (𝑋3 = 1|𝑋1 = 0, 𝑋4 = 1)

Solution to study question 1.5.3


Part (1)
We’ll begin the first problem using the chain rule and factor it as:

$$
\begin{aligned}
P(X_1 = 1, X_2 = 0, X_3 = 1, X_4 = 0) &= P(X_1 = 1)\,P(X_2 = 0 \mid X_1 = 1)\,P(X_3 = 1 \mid X_2 = 0)\,P(X_4 = 0 \mid X_3 = 1) \\
&= p_0 (1 - p)\, q\, (1 - p) = p_0 (1 - p)^2 q
\end{aligned}
$$
We can compute the next three quantities using marginalization and Bayes’ conditioning.
Part (2)

$$
P(X_4 = 1 \mid X_1 = 1) = \frac{P(X_4 = 1, X_1 = 1)}{P(X_1 = 1)} = \frac{\sum_{X_2, X_3} P(X_4 = 1, X_1 = 1, X_2, X_3)}{P(X_1 = 1)} = \frac{p_0 p^3 + 2 p_0 p q (1 - p) + p_0 q (1 - p)(1 - q)}{p_0}
$$

Part (3)

$$
P(X_1 = 1 \mid X_4 = 1) = \frac{P(X_1 = 1, X_4 = 1)}{P(X_4 = 1)} = \frac{\sum_{X_2, X_3} P(X_4 = 1, X_1 = 1, X_2, X_3)}{\sum_{X_1, X_2, X_3} P(X_1, X_2, X_3, X_4 = 1)}
$$

$$
= \frac{p_0 p^3 + 2 p_0 p q(1 - p) + p_0 q(1 - p)(1 - q)}{p_0 p^3 + 2 p_0 p q(1 - p) + p_0 q(1 - p)(1 - q) + (1 - p_0)\left(q p^2 + q^2(1 - p) + q p(1 - q) + q(1 - q)^2\right)}
$$

Part (4)

$$
P(X_3 = 1 \mid X_1 = 0, X_4 = 1) = \frac{P(X_1 = 0, X_3 = 1, X_4 = 1)}{P(X_1 = 0, X_4 = 1)} = \frac{\sum_{X_2} P(X_2, X_1 = 0, X_3 = 1, X_4 = 1)}{\sum_{X_2, X_3} P(X_2, X_3, X_1 = 0, X_4 = 1)}
$$

$$
= \frac{(1 - p_0)(1 - q) p q + (1 - p_0) q p^2}{(1 - p_0)\left(q p^2 + q^2(1 - p) + q p(1 - q) + q(1 - q)^2\right)}
$$
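
These expressions can be verified by brute-force enumeration of the joint distribution for any numerical choice of the parameters; a sketch in R:

```r
# Joint distribution of the chain X1 -> X2 -> X3 -> X4.
chain_joint <- function(p, q, p0) {
  g <- expand.grid(x1 = 0:1, x2 = 0:1, x3 = 0:1, x4 = 0:1)
  trans <- function(to, from)          # P(X_i = to | X_{i-1} = from)
    ifelse(to == 1, ifelse(from == 1, p, q), ifelse(from == 1, 1 - p, 1 - q))
  g$pr <- ifelse(g$x1 == 1, p0, 1 - p0) *
    trans(g$x2, g$x1) * trans(g$x3, g$x2) * trans(g$x4, g$x3)
  g
}

j <- chain_joint(p = 0.8, q = 0.3, p0 = 0.5)
sum(j$pr[j$x4 == 1 & j$x1 == 1]) / sum(j$pr[j$x1 == 1])   # P(X4 = 1 | X1 = 1)
sum(j$pr[j$x3 == 1 & j$x1 == 0 & j$x4 == 1]) /
  sum(j$pr[j$x1 == 0 & j$x4 == 1])                        # P(X3 = 1 | X1 = 0, X4 = 1)
```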

Study question 1.5.4.


Define the structural model that corresponds to the Monty Hall problem, and use it to
describe the joint distribution of all variables.

Solution to study question 1.5.4

Again, we’ll adopt the variables from the text: let 𝑋 indicate the door chosen by the player,
𝑌 indicate the door hiding the car, and 𝑍 indicate the door opened by the host.

From the story, we know that each door has an equal chance of being the winner, and that
each has an equal chance of being selected by the player. This suggests that both 𝑋 and 𝑌
are selected independently from one another as a function of some unmodeled factors.

Furthermore, we know that the door revealed by the host, 𝑍, will be neither the one opened
by the player, 𝑋, nor the winning door, 𝑌 , and in the event that 𝑋 = 𝑌 , then 𝑍 has an equal
chance of being one of the two remaining doors. These observations suggest that 𝑍 must be
a function of not only 𝑋, 𝑌 , but also of some unmodeled factors whenever 𝑋 = 𝑌 .

Combining our observations from above gives us the following model specification:

$$
V = \{X, Y, Z\}, \quad U = \{U_X, U_Y, U_Z\}, \quad F = \{f_Z\}
$$

$$
X = U_X, \qquad Y = U_Y, \qquad Z = f_Z(X, Y, U_Z)
$$
We can also depict this model graphically, with Z a common effect of X and Y:

X → Z ← Y, with U_X → X, U_Y → Y, and U_Z → Z

And lastly, the joint distribution can be factorized by using the chain rule and independence
relations (i.e., the Markovian factorization, Eq. (1.29), which decomposes a joint distribution
into family factors) to write:
𝑃 (𝑋, 𝑌, 𝑍) = 𝑃 (𝑍|𝑋, 𝑌 )𝑃 (𝑌 )𝑃 (𝑋)

So, to explore some example queries, we begin by acknowledging that each door has an equal
chance of being the winner and of being chosen by the player, namely:

$$
P(Y = y) = 1/3 \ \text{for every door } y, \qquad P(X = x) = 1/3 \ \text{for every door } x
$$

Furthermore, we know that the host cannot reveal the winning door or the one chosen by the
player, so:

$$
P(Z = z \mid X = x, Y = y) = 0 \quad \text{whenever } z = x \text{ or } z = y
$$

Lastly, we know that the host must open the last remaining door when X and Y differ, and
has an equal chance of choosing either of the two remaining doors when X = Y, giving us:

$$
P(Z = z \mid X = x, Y = y) =
\begin{cases}
1 & z \neq x,\ z \neq y,\ x \neq y \\
1/2 & z \neq x,\ z \neq y,\ x = y
\end{cases}
$$

These observations allow us to compute arbitrary probability queries because we know the
decomposition of the joint distribution. For example, for doors A, B, and C:

$$
P(X = A, Y = B, Z = C) = P(Z = C \mid X = A, Y = B)\,P(Y = B)\,P(X = A) = 1 \cdot \frac{1}{3} \cdot \frac{1}{3} = \frac{1}{9}
$$
Study Questions and Solutions for
Chapter 2

Study question 2.3.1.

X → R → S → T ← U ← V → Y

Figure 2.5: A directed graph for demonstrating conditional independence (error terms are not
shown explicitly)

(a) List all pairs of variables in Figure 2.5 that are independent conditional on the set
𝑍 = {𝑅, 𝑉 }.
(b) For each pair of nonadjacent variables in Figure 2.5, give a set of variables that, when
conditioned on, renders that pair independent.

X → R → S → T ← U ← V → Y, with T → P

Figure 2.6: A directed graph in which P is a descendant of a collider

(c) List all pairs of variables in Figure 2.6 that are independent conditional on the set
𝑍 = {𝑅, 𝑃 }.
(d) For each pair of nonadjacent variables in Figure 2.6, give a set of variables that, when
conditioned on, renders that pair independent.
(e) Suppose we generate data by the model described in Figure 2.5, and we fit them with the
linear equation 𝑌 = 𝑎 + 𝑏𝑋 + 𝑐𝑍. Which of the variables in the model may be chosen
for 𝑍 so as to guarantee that the slope 𝑏 would be equal to zero? [Hint: Recall, a nonzero
slope implies that 𝑌 and 𝑋 are dependent given 𝑍.]

(f) Continuing question (e), suppose we fit the data with the equation:

𝑌 = 𝑎 + 𝑏𝑋 + 𝑐𝑅 + 𝑑𝑆 + 𝑒𝑇 + 𝑓 𝑃

which of the coefficients would be zero?

Solution to study question 2.3.1

An interactive tutorial explaining d-separation is provided at dagitty.net/learn/dsep/

An R solution of this exercise is provided at dagitty.net/primer/2.3.1

To determine if two variables are independent in a network, we consider all simple paths
between them and determine if all paths are "blocked" by Rules 1, 2, and 3 in this chapter
that detail the graphical criteria for conditional independence. If a single path is not blocked,
then the two variables are dependent. They are therefore independent (or conditionally inde-
pendent, when conditioning on some other variables) whenever all simple paths between
them are blocked.

Part (a)
In the table below, each row may be read as: “𝑋 is independent of 𝑌 given 𝑍 because..."
where 𝑋 is the variable listed in the first column of every row, 𝑌 is listed in the second col-
umn, 𝑍 in the third, and an explanation in the fourth. Note that for each row, 𝑋 and 𝑌 may
be swapped with the same valid claim of independence.

X Y Z Reason
𝑋 𝑆 {𝑅, 𝑉 } 𝑋 → 𝑅 → 𝑆 is blocked at chain 𝑋 → 𝑅 → 𝑆 (𝑅 is given)
𝑋 𝑇 {𝑅, 𝑉 } 𝑋 → 𝑅 → 𝑆 → 𝑇 is blocked at chain 𝑋 → 𝑅 → 𝑆 (𝑅 is given)
𝑋 𝑈 {𝑅, 𝑉 } 𝑋 → 𝑅 → 𝑆 → 𝑇 ← 𝑈 is blocked at chain 𝑋 → 𝑅 → 𝑆 (𝑅 is
given)
𝑋 𝑌 {𝑅, 𝑉 } 𝑋 → 𝑅 → 𝑆 → 𝑇 → 𝑈 ← 𝑉 ← 𝑌 is blocked at chain 𝑋 → 𝑅 →
𝑆 (𝑅 is given)
𝑆 𝑈 {𝑅, 𝑉 } 𝑆 → 𝑇 ← 𝑈 is blocked at collider 𝑆 → 𝑇 ← 𝑈 (neither 𝑇 nor
descendants of 𝑇 given)
𝑆 𝑌 {𝑅, 𝑉 } 𝑆 → 𝑇 ← 𝑈 ← 𝑉 → 𝑌 is blocked at collider 𝑆 → 𝑇 ← 𝑈 (neither
𝑇 nor descendants of 𝑇 given)
𝑇 𝑌 {𝑅, 𝑉 } 𝑇 ← 𝑈 ← 𝑉 → 𝑌 is blocked at fork 𝑈 ← 𝑉 → 𝑌 (𝑉 is given)
𝑈 𝑌 {𝑅, 𝑉 } 𝑈 ← 𝑉 → 𝑌 is blocked at fork 𝑈 ← 𝑉 → 𝑌 (𝑉 is given)
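
Each of these claims can be confirmed mechanically with DAGitty; a sketch for the graph of Figure 2.5:

```r
library(dagitty)

g <- dagitty("dag {
  X -> R   R -> S   S -> T
  U -> T   V -> U   V -> Y
}")

dseparated(g, "X", "Y", c("R", "V"))   # TRUE: blocked at the chain through R
dseparated(g, "S", "U", c("R", "V"))   # TRUE: the collider at T stays blocked
dseparated(g, "T", "Y", c("R", "V"))   # TRUE: blocked at the fork at V
dseparated(g, "S", "T", c("R", "V"))   # FALSE: S and T are adjacent
```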

Part (b)
Answers to part (b) can be found in the following table, formatted in the same way as (a)
above:

X Y Z Reason
𝑋 𝑆 {𝑅} 𝑋 → 𝑅 → 𝑆 is blocked at chain 𝑋 → 𝑅 → 𝑆 (𝑅 is given)
𝑋 𝑇 {𝑅} 𝑋 → 𝑅 → 𝑆 → 𝑇 is blocked at chain 𝑋 → 𝑅 → 𝑆 (𝑅 is given)
𝑋 𝑈 {𝑅} 𝑋 → 𝑅 → 𝑆 → 𝑇 ← 𝑈 is blocked at chain 𝑋 → 𝑅 → 𝑆 (𝑅 is given)
𝑋 𝑉 {𝑅} 𝑋 → 𝑅 → 𝑆 → 𝑇 ← 𝑈 ← 𝑉 is blocked at chain 𝑋 → 𝑅 → 𝑆 (𝑅 is
given)
𝑋 𝑌 {𝑅} 𝑋 → 𝑅 → 𝑆 → 𝑇 ← 𝑈 ← 𝑉 → 𝑌 is blocked at chain 𝑋 → 𝑅 → 𝑆
(𝑅 is given)
𝑅 𝑇 {𝑆} 𝑅 → 𝑆 → 𝑇 is blocked at chain 𝑅 → 𝑆 → 𝑇 (𝑆 is given)
𝑅 𝑈 {𝑆} 𝑅 → 𝑆 → 𝑇 ← 𝑈 is blocked at chain 𝑅 → 𝑆 → 𝑇 (𝑆 is given)
𝑅 𝑉 {𝑆} 𝑅 → 𝑆 → 𝑇 ← 𝑈 ← 𝑉 is blocked at chain 𝑅 → 𝑆 → 𝑇 (𝑆 is given)
𝑅 𝑌 {𝑆} 𝑅 → 𝑆 → 𝑇 ← 𝑈 ← 𝑉 → 𝑌 is blocked at chain 𝑅 → 𝑆 → 𝑇 (𝑆 is
given)
𝑆 𝑈 {} 𝑆 → 𝑇 ← 𝑈 is blocked at collider 𝑆 → 𝑇 ← 𝑈 (neither 𝑇 nor
descendants of 𝑇 are given)
𝑆 𝑉 {} 𝑆 → 𝑇 ← 𝑈 ← 𝑉 is blocked at collider 𝑆 → 𝑇 ← 𝑈 (neither 𝑇 nor
descendants of 𝑇 are given)
𝑆 𝑌 {} 𝑆 → 𝑇 ← 𝑈 ← 𝑉 → 𝑌 is blocked at collider 𝑆 → 𝑇 ← 𝑈 (neither 𝑇
nor descendants of 𝑇 are given)
𝑇 𝑉 {𝑈 } 𝑇 ← 𝑈 ← 𝑉 is blocked at chain 𝑇 ← 𝑈 ← 𝑉 (𝑈 is given)
𝑇 𝑌 {𝑈 } 𝑇 ← 𝑈 ← 𝑉 → 𝑌 is blocked at chain 𝑇 ← 𝑈 ← 𝑉 (𝑈 is given)
𝑈 𝑌 {𝑉 } 𝑈 ← 𝑉 → 𝑌 is blocked at fork 𝑈 ← 𝑉 → 𝑌 (𝑉 is given)
Part (c)

Observe that conditioning on {𝑅, 𝑃 } blocks only the chain 𝑋 → 𝑅 → 𝑆 (𝑅 is given) and
opens the collider 𝑆 → 𝑇 ← 𝑈 (𝑃 , a descendant of 𝑇 , is given). Thus, we render only 𝑋
independent of all other variables in the model; specifically, the pairs of independent vari-
ables conditioned on {𝑅, 𝑃 } are: (𝑋, 𝑅), (𝑋, 𝑆), (𝑋, 𝑇 ), (𝑋, 𝑃 ), (𝑋, 𝑈 ), (𝑋, 𝑉 ), (𝑋, 𝑌 )

Part (d)
Now that we’re familiar with Figure 2.6, we can summarize independence relationships in
the following table, which is similar to the previous two in parts (a) and (b) except that every
row may be read, “variable 𝑋 is independent of all variables in set 𝑌 given set 𝑍.”

X Y Z Reason
𝑋 {𝑆, 𝑇, 𝑈, 𝑉, 𝑌, 𝑃 } {𝑅} Chain blocked at 𝑋 → 𝑅 → 𝑆 (𝑅 is given)
𝑅 {𝑇, 𝑈, 𝑉, 𝑌, 𝑃 } {𝑆} Chain blocked at 𝑅 → 𝑆 → 𝑇 (𝑆 is given)
𝑆 {𝑈, 𝑉, 𝑌 } {} Collider blocked at 𝑆 → 𝑇 ← 𝑈 (Neither 𝑇 nor
descendants of 𝑇 are given)
𝑃 {𝑆, 𝑈, 𝑉, 𝑌 } {𝑇 } Chains blocked at 𝑆 → 𝑇 → 𝑃 (𝑇 is given) and 𝑈 →
𝑇 → 𝑃 (𝑇 is given)
𝑇 {𝑉, 𝑌 } {𝑈 } Chain blocked at 𝑉 → 𝑈 → 𝑇 (𝑈 is given)
𝑈 {𝑌 } {𝑉 } Fork blocked at 𝑈 ← 𝑉 → 𝑌 (𝑉 is given)

Part (e)
Since 𝑌 and 𝑋 are independent conditional on any member of the set {𝑅, 𝑆, 𝑈, 𝑉 }, we may
choose 𝑍 to be any of these variables.

Part (f)
To determine which slopes will be equal to zero, we can again consider if 𝑌 is independent
of each variable given the other variables on the RHS of our equation. Specifically:
1. 𝑏 (the slope associated with 𝑋) will be zero, since 𝑌 and 𝑋 are independent given
𝑅, 𝑆, 𝑇, 𝑃 .
2. 𝑐 (the slope associated with 𝑅) will be zero, since 𝑌 and 𝑅 are independent given
𝑋, 𝑆, 𝑇, 𝑃 .
3. 𝑓 (the slope associated with 𝑃 ) will be zero, since 𝑌 and 𝑃 are independent given
𝑋, 𝑅, 𝑆, 𝑇 .

Study question 2.4.1.


Figure 2.9 below represents a causal graph from which the error terms have been deleted.
Assume that all those errors are mutually independent.

Figure 2.9: A causal graph used in study question 2.4.1; all 𝑈 terms (not shown) are assumed
independent. Its edges are Z1 → X, Z1 → Z3, Z2 → Z3, Z2 → Y, Z3 → X, Z3 → Y,
X → W, and W → Y.

(a) For each pair of nonadjacent nodes in this graph, find a set of variables that 𝑑-separates
that pair. What does this list tell us about independencies in the data?
(b) Repeat question (a) assuming that only variables in the set {𝑍3 , 𝑊, 𝑋, 𝑍1 } can be
measured.
(c) For each pair of nonadjacent nodes in the graph, determine whether they are
independent conditional on all other variables.
(d) For every variable 𝑉 in the graph find a minimal set of nodes that renders 𝑉 independent
of all other variables in the graph.
(e) Suppose we wish to estimate the value of 𝑌 from measurements taken on all other
variables in the model. Find the smallest set of variables that would yield as good an
estimate of 𝑌 as when we measured all variables.
(f) Repeat question (e) assuming that we wish to estimate the value of 𝑍2 .
(g) Suppose we wish to predict the value of 𝑍2 from measurements of 𝑍3 . Would the quality
of our prediction improve if we add measurement of 𝑊 ? Explain.

Solution to study question 2.4.1

An R solution of this exercise is provided at dagitty.net/primer/2.4.1

Part (a)
Recall that two variables are d-separated if all simple paths between them are blocked by the
given set of variables (see Definition 2.4.1, and Rules 1, 2, and 3 for the graphical criteria for
conditional independence). Below, we list all pairs of variables that are d-separated, along
with the paths that must be blocked between them.
(𝑋, 𝑌 ) are independent conditioned on set {𝑍1 , 𝑍3 , 𝑊 }:
1. 𝑋 → 𝑊 → 𝑌 is blocked at chain 𝑋 → 𝑊 → 𝑌 (𝑊 is given).
2. 𝑋 ← 𝑍3 → 𝑌 is blocked at fork 𝑋 ← 𝑍3 → 𝑌 (𝑍3 is given).
3. 𝑋 ← 𝑍3 ← 𝑍2 → 𝑌 is blocked at chain 𝑋 ← 𝑍3 ← 𝑍2 (𝑍3 is given).
4. 𝑋 ← 𝑍1 → 𝑍3 → 𝑌 is blocked at chain 𝑍1 → 𝑍3 → 𝑌 (𝑍3 is given).
5. 𝑋 ← 𝑍1 → 𝑍3 ← 𝑍2 → 𝑌 is blocked at fork 𝑋 ← 𝑍1 → 𝑍3 (𝑍1 is given).

(𝑋, 𝑍2 ) are independent conditioned on set {𝑍1 , 𝑍3 }:


1. 𝑋 ← 𝑍1 → 𝑍3 ← 𝑍2 is blocked at fork 𝑋 ← 𝑍1 → 𝑍3 (𝑍1 is given).
2. 𝑋 ← 𝑍1 → 𝑍3 → 𝑌 ← 𝑍2 is blocked at fork 𝑋 ← 𝑍1 → 𝑍3 (𝑍1 is given).
3. 𝑋 ← 𝑍3 ← 𝑍2 is blocked at chain 𝑋 ← 𝑍3 ← 𝑍2 (𝑍3 is given).
4. 𝑋 ← 𝑍3 → 𝑌 ← 𝑍2 is blocked at fork 𝑋 ← 𝑍3 → 𝑌 (𝑍3 is given).
5. 𝑋 → 𝑊 → 𝑌 ← 𝑍2 is blocked at collider 𝑊 → 𝑌 ← 𝑍2 (𝑌 is not given).

6. 𝑋 → 𝑊 → 𝑌 ← 𝑍3 ← 𝑍2 is blocked at collider 𝑊 → 𝑌 ← 𝑍3 (𝑌 is not given).

Viewing the above d-separation path analysis, we can similarly deduce the remaining
independent pairs.
∙ (𝑌, 𝑍1 ) are independent conditioned on set {𝑋, 𝑍2 , 𝑍3 }.
∙ (𝑊, 𝑍1 ) are independent conditioned on set {𝑋}.
∙ (𝑊, 𝑍2 ) are independent conditioned on set {𝑋}.
∙ (𝑊, 𝑍3 ) are independent conditioned on set {𝑋}.
∙ (𝑍1 , 𝑍2 ) are independent conditioned on set {}.
What does this list tell us about independencies in the data? Assuming that the data were
generated by this model, every conditional independence above will also be manifest in the data.

Part (b)
Using the path analyses we performed in (a) above, we can determine if each pair of variables
can be d-separated conditioning only on variables in the set {𝑍3 , 𝑊, 𝑋, 𝑍1 }.
∙ (𝑋, 𝑌 ) are independent conditioned on set {𝑍1 , 𝑍3 , 𝑊 }.
∙ (𝑋, 𝑍2 ) are independent conditioned on set {𝑍1 , 𝑍3 }.
∙ (𝑊, 𝑍1 ) are independent conditioned on set {𝑋}.
∙ (𝑊, 𝑍2 ) are independent conditioned on set {𝑋}.
∙ (𝑊, 𝑍3 ) are independent conditioned on set {𝑋}.
∙ (𝑍1 , 𝑍2 ) are independent conditioned on set {}.
(𝑌, 𝑍1 ) were independent when we could condition on {𝑋, 𝑍2 , 𝑍3 }, but now there is no such
set that will render (𝑌, 𝑍1 ) independent. Since we can no longer condition on 𝑍2 , we must
address two paths:
1. 𝑍1 → 𝑍3 ← 𝑍2 → 𝑌 is blocked if we do not condition on 𝑍3 , but open if we do.
2. 𝑍1 → 𝑍3 → 𝑌 is blocked if we condition on 𝑍3 , but open if we do not.
Observe that these two requirements are mutually exclusive, so one of these two paths
remains open, and so (𝑌, 𝑍1 ) cannot be d-separated using the covariates available.

Part (c)
Again using our path analysis from (a), we have:
∙ (𝑋, 𝑌 ) are independent given {𝑍1 , 𝑍2 , 𝑍3 , 𝑊 }.
∙ (𝑋, 𝑍2 ) are independent given {𝑍1 , 𝑍3 , 𝑊, 𝑌 }.
∙ (𝑌, 𝑍1 ) are independent given {𝑍2 , 𝑍3 , 𝑋, 𝑊 }.
∙ (𝑊, 𝑍1 ) are independent given {𝑍2 , 𝑍3 , 𝑋, 𝑌 }.

∙ (𝑊, 𝑍2 ) are dependent given {𝑍1 , 𝑍3 , 𝑋, 𝑌 }; there is an open path 𝑊 → 𝑌 ← 𝑍2


(𝑊 → 𝑌 ← 𝑍2 is a collider and 𝑌 is given).
∙ (𝑊, 𝑍3 ) are dependent given {𝑍1 , 𝑍2 , 𝑋, 𝑌 }: there is an open path 𝑊 → 𝑌 ← 𝑍3
(𝑊 → 𝑌 ← 𝑍3 is a collider and 𝑌 is given).
∙ (𝑍1 , 𝑍2 ) are dependent given {𝑍3 , 𝑋, 𝑊, 𝑌 }: there is an open path 𝑍1 → 𝑍3 ← 𝑍2
(𝑍1 → 𝑍3 ← 𝑍2 is a collider and 𝑍3 is given).

Part (d)
Consider the set of variables Z comprising the parents, children, and spouses (i.e., the other
parents of V's children) of V: Z = {parents(V), children(V), spouses(V)}.
Let us first convince ourselves that conditioning on Z is guaranteed to render 𝑉 independent
of all other variables in the graph.
1. 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑉 ): conditioning on the parents of 𝑉 will block any forks and chains
incumbent to 𝑉 .
2. 𝑐ℎ𝑖𝑙𝑑𝑟𝑒𝑛(𝑉 ): conditioning on the children of 𝑉 will block any forks and chains
emanating from 𝑉 .
3. 𝑠𝑝𝑜𝑢𝑠𝑒𝑠(𝑉 ): conditioning on the spouses of 𝑉 will block paths that were opened
at a collider formed on a child of 𝑉 (which was opened when we conditioned on
𝑐ℎ𝑖𝑙𝑑𝑟𝑒𝑛(𝑉 )).
Conditioning on Z is guaranteed to d-separate 𝑉 from all other nodes in the model (which is
referred to as the Markov Blanket for a node 𝑉 , sometimes denoted 𝑀 𝐵(𝑉 )).

𝑀 𝐵(𝑋) = {𝑊, 𝑍1 , 𝑍3 }
𝑀 𝐵(𝑌 ) = {𝑍2 , 𝑍3 , 𝑊 }
𝑀 𝐵(𝑊 ) = {𝑋, 𝑌, 𝑍2 , 𝑍3 }
𝑀 𝐵(𝑍1 ) = {𝑋, 𝑍2 , 𝑍3 }
𝑀 𝐵(𝑍2 ) = {𝑍1 , 𝑍3 , 𝑌, 𝑊 }
𝑀 𝐵(𝑍3 ) = {𝑍1 , 𝑍2 , 𝑋, 𝑊, 𝑌 }
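These blankets can also be computed mechanically. The following is a minimal R sketch using the dagitty package (the same tool referenced throughout this manual); the explicit edge list is our own transcription of Figure 2.9:

library(dagitty)

# Figure 2.9 (our transcription): Z1 -> X, Z1 -> Z3, Z2 -> Z3, Z2 -> Y,
# Z3 -> X, Z3 -> Y, X -> W, W -> Y
g <- dagitty("dag {
  Z1 -> X ; Z1 -> Z3 ; Z2 -> Z3 ; Z2 -> Y
  Z3 -> X ; Z3 -> Y ; X -> W ; W -> Y
}")

# Markov blanket = parents, children, and co-parents (spouses) of each node
for (v in names(g)) {
  cat(v, ":", markovBlanket(g, v), "\n")
}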

Part (e)
The minimal set would be the Markov Blanket of 𝑌 , 𝑀 𝐵(𝑌 ) = {𝑍2 , 𝑍3 , 𝑊 }, since {𝑋, 𝑍1 }
are independent from 𝑌 given 𝑀 𝐵(𝑌 ), and so their addition would not improve our estimate.

Part (f)
The minimal set would be the Markov Blanket of 𝑍2 , 𝑀 𝐵(𝑍2 ) = {𝑍1 , 𝑍3 , 𝑌, 𝑊 }, since
{𝑋} is independent from 𝑍2 given 𝑀 𝐵(𝑍2 ). Observe that here we include 𝑌 since it
improves our estimate of 𝑍2 (they are dependent), but this inclusion opens a path that was
not open when we conditioned on all variables: 𝑌 ← 𝑊 ← 𝑋 ← 𝑍1 → 𝑍3 ← 𝑍2 . We then
block this path by conditioning on 𝑊 .

Part (g)
Yes, since 𝑍2 and 𝑊 are dependent given 𝑍3 (there is an open path 𝑊 ← 𝑋 ← 𝑍1 → 𝑍3 ←
𝑍2 such that information about 𝑊 provides information about 𝑍2 ).

Study question 2.5.1.


(a) Which of the arrows in Figure 2.9 can be reversed without being detected by any
statistical test? [Hint: Use the criterion for equivalence class.]
(b) List all graphs that are observationally equivalent to the one in Figure 2.9.
(c) List the arrows in Figure 2.9 whose directionality can be determined from non-
experimental data.
(d) Write down a regression equation for 𝑌 such that, if a certain coefficient in that equation
is non-zero the model of Figure 2.9 is wrong.
(e) Repeat question (d) for variable 𝑍3 .
(f) Repeat question (e) assuming the 𝑋 is not measured.
(g) How many regression equations of the type described in (d) and (e) are needed to
ensure that the model is fully tested, namely, that if it passes all these tests it cannot be
refuted by additional tests of this kind? [Hint: Ensure that you test every vanishing partial
regression coefficient that is implied by the product decomposition (1.29).]

Solution to study question 2.5.1

An R solution of this exercise is provided at dagitty.net/primer/2.5.1

Part (a)
To determine which arrows can be reversed without being detectable by a statistical test, we
consider models that are in the equivalence class of Figure 2.9 (see paragraphs previous to
these study questions in the text). Accordingly, we first find the v-structures in the graph (i.e.,
colliders whose parents are not adjacent), which are:
∙ 𝑍1 → 𝑍3 ← 𝑍2
∙ 𝑍3 → 𝑌 ← 𝑊
∙ 𝑍2 → 𝑌 ← 𝑊
So, to find other models in this equivalence class, we may flip the direction of edges such that
the resulting model abides by two criteria:
∙ We neither create nor destroy any v-structures.
∙ We must not introduce a cycle into the resulting graph. Note that while d-separation
remains valid in linear cyclic models, it is not valid in general (e.g., when the structural
functions are nonlinear).

With these constraints, we conclude that there are no such edges that can be reversed within
the model to find another within its equivalence class. We would verify this claim by testing
our constraints against each edge reversal. Here is a complete list of our tests and a reason (of
which there might be several) why each reversal fails to produce a model in the equivalence
class of Figure 2.9 (note, we define the “old model” as the one pre-edge-reversal, and the
“new model” as the one resulting from the reversal):

Edge Reason
𝑍1 → 𝑋 Creates a cycle in new model, 𝑍1 → 𝑍3 → 𝑋 → 𝑍1
𝑍1 → 𝑍3 Destroys a v-structure in old model, 𝑍1 → 𝑍3 ← 𝑍2
𝑍3 → 𝑋 Creates a v-structure in new model, 𝑋 → 𝑍3 ← 𝑍2
𝑍3 → 𝑌 Creates a v-structure in new model, 𝑌 → 𝑍3 ← 𝑍1
𝑋→𝑊 Creates a v-structure in new model, 𝑊 → 𝑋 ← 𝑍1
𝑊 →𝑌 Destroys a v-structure in old model, 𝑊 → 𝑌 ← 𝑍2
𝑍2 → 𝑍3 Destroys a v-structure in old model, 𝑍1 → 𝑍3 ← 𝑍2
𝑍2 → 𝑌 Destroys a v-structure in old model, 𝑍2 → 𝑌 ← 𝑊

Part (b)
There are no additional models in the equivalence class of Figure 2.9 for the reasons stated
in part (a) above.

Part (c)
No edge may be reversed to produce a model in the equivalence class of Figure 2.9 (see
explanation in part (a)). Therefore, all edge directionalities in the graph may be determined
from non-experimental data.

Part (d)
The model in 2.4.1(d) implies that 𝑌 is independent of 𝑍1 given {𝑍2 , 𝑍3 , 𝑊 }.
So, suppose we fit the data with the equation:

𝑦 = 𝑟2 𝑧2 + 𝑟3 𝑧3 + 𝑟𝑤 𝑤 + 𝑟1 𝑧1

If 𝑟1 is non-zero in the fitted equation, then the model of Figure 2.9 is wrong since the data
violates the conditional independence between 𝑌 and 𝑍1 as claimed by the model.
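To make the test concrete, here is a minimal R sketch of how it might be run; the linear coefficients and standard-normal noise terms below are arbitrary choices of ours, standing in for the unknown data-generating process:

set.seed(1)
n  <- 1e5
z1 <- rnorm(n); z2 <- rnorm(n)
z3 <- 0.5*z1 + 0.5*z2 + rnorm(n)          # Z1 -> Z3 <- Z2
x  <- 0.7*z1 + 0.3*z3 + rnorm(n)          # Z1 -> X <- Z3
w  <- 0.6*x + rnorm(n)                    # X -> W
y  <- 0.4*w + 0.5*z3 + 0.5*z2 + rnorm(n)  # W, Z3, Z2 -> Y

fit <- lm(y ~ z2 + z3 + w + z1)
summary(fit)$coefficients["z1", ]         # estimate should be ~ 0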

Part (e)
The model in 2.4.1(d) implies that 𝑍3 is independent of 𝑊 given {𝑋}.
So, suppose we fit the data with the equation:

𝑧3 = 𝑟𝑥 𝑥 + 𝑟𝑤 𝑤

If 𝑟𝑤 is non-zero in the fitted equation, then the model of Figure 2.9 is wrong since the data
violates the conditional independence between 𝑍3 and 𝑊 as claimed by the model.

Part (f)
No such regression exists because no variable can be separated from 𝑍3 by any set of
observed variables.

Part (g)
According to Equation 1.29, the joint probability distribution can be factorized as:
𝑃 (𝑍1 , 𝑍2 , 𝑍3 , 𝑋, 𝑊, 𝑌 ) = 𝑃 (𝑍1 )𝑃 (𝑍2 )𝑃 (𝑍3 |𝑍1 , 𝑍2 )𝑃 (𝑋|𝑍1 , 𝑍3 )𝑃 (𝑊 |𝑋)𝑃 (𝑌 |𝑊, 𝑍2 , 𝑍3 )
So, to fully test the model, we need to examine every factor in this factorization and establish
that, conditional on its parents, every variable 𝑉 is independent of its non-descendants. The
corresponding regression equations necessary to perform these tests are:
1. 𝑧2 = 𝑟1 𝑧1 with vanishing 𝑟1 .
2. 𝑥 = 𝑟1 𝑧1 + 𝑟2 𝑧2 + 𝑟3 𝑧3 with vanishing 𝑟2 .
3. 𝑤 = 𝑟𝑥 𝑥 + 𝑟1 𝑧1 + 𝑟2 𝑧2 + 𝑟3 𝑧3 with vanishing 𝑟1 , 𝑟2 , 𝑟3 .
4. 𝑦 = 𝑟𝑤 𝑤 + 𝑟𝑥 𝑥 + 𝑟1 𝑧1 + 𝑟2 𝑧2 + 𝑟3 𝑧3 with vanishing 𝑟𝑥 , 𝑟1 .
So, in total, to fully test the model we would need 4 regression equations, through which we
would perform 1 + 1 + 3 + 2 = 7 tests for vanishing regression coefficients.
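Rather than enumerating these regressions by hand, one can ask the dagitty R package for a basis set of testable implications (one block of independencies per variable, given its parents); assuming the same transcription of Figure 2.9 as before, the output should correspond to the seven tests counted above:

library(dagitty)
g <- dagitty("dag {
  Z1 -> X ; Z1 -> Z3 ; Z2 -> Z3 ; Z2 -> Y
  Z3 -> X ; Z3 -> Y ; X -> W ; W -> Y
}")
# Each listed independence can be tested as a vanishing
# partial regression coefficient, as in parts (d) and (e).
impliedConditionalIndependencies(g, type = "basis.set")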
Study Questions and Solutions for
Chapter 3

Study question 3.2.1.


Referring to Study question 1.5.2 (Figure 1.8) and the parameters listed therein,
(a) Compute 𝑃 (𝑦|𝑑𝑜(𝑥)) for all values of 𝑥 and 𝑦, by simulating the intervention 𝑑𝑜(𝑥) on
the model.
(b) Compute 𝑃 (𝑦|𝑑𝑜(𝑥)) for all values of 𝑥 and 𝑦, using the adjustment formula (3.5).
(c) Compute the ACE

𝐴𝐶𝐸 = 𝑃 (𝑦1 |𝑑𝑜(𝑥1 )) − 𝑃 (𝑦1 |𝑑𝑜(𝑥0 ))

and compare it to the Risk Difference

𝑅𝐷 = 𝑃 (𝑦1 |𝑥1 ) − 𝑃 (𝑦1 |𝑥0 ).

(d) Find a combination of parameters that exhibit Simpson’s reversal (as in Study question
1.5.2(c)) and show explicitly that the overall causal effect of the drug is obtained from
the aggregate data.

Solution to study question 3.2.1

First, let’s save the graph for ease of reference:

[Figure 1.8 model: 𝑍 → 𝑋, 𝑍 → 𝑌 , 𝑋 → 𝑌 , with 𝑍 a common cause of 𝑋 and 𝑌 ]

Part (a)
Now, to compute 𝑃 (𝑦|𝑑𝑜(𝑥)) for all values of 𝑥 and 𝑦, we can consider the mutilated model
𝑚 wherein all causal influences to 𝑋 are severed, and 𝑋 is forced to some value 𝑥:

[Mutilated model 𝑚: the arrow 𝑍 → 𝑋 is removed; 𝑍 → 𝑌 and 𝑋 → 𝑌 remain, with 𝑋 held fixed at 𝑥]

With this mutilated model in hand, we can write the product decomposition, Eq. (1.29), for
𝑚 to solve for our quantities of interest. Observe three facts that will help us with our task:
𝑃 (𝑌 |𝑍, 𝑋) = 𝑃𝑚 (𝑌 |𝑍, 𝑋), 𝑃 (𝑍) = 𝑃𝑚 (𝑍), and 𝑃𝑚 (𝑍|𝑋) = 𝑃𝑚 (𝑍) = 𝑃 (𝑍).

𝑃 (𝑌 = 𝑦|𝑑𝑜(𝑋 = 𝑥)) = 𝑃𝑚 (𝑌 = 𝑦|𝑋 = 𝑥)
= Σ𝑧 𝑃𝑚 (𝑌 = 𝑦, 𝑍 = 𝑧|𝑋 = 𝑥)
= Σ𝑧 𝑃𝑚 (𝑌 = 𝑦|𝑍 = 𝑧, 𝑋 = 𝑥)𝑃𝑚 (𝑍 = 𝑧|𝑋 = 𝑥)
= Σ𝑧 𝑃𝑚 (𝑌 = 𝑦|𝑍 = 𝑧, 𝑋 = 𝑥)𝑃𝑚 (𝑍 = 𝑧)
= Σ𝑧 𝑃 (𝑌 = 𝑦|𝑍 = 𝑧, 𝑋 = 𝑥)𝑃 (𝑍 = 𝑧)

𝑃 (𝑦1 |𝑑𝑜(𝑥1 )) = 𝑃 (𝑦1 |𝑥1 , 𝑧1 )𝑃 (𝑧1 ) + 𝑃 (𝑦1 |𝑥1 , 𝑧0 )𝑃 (𝑧0 ) = 𝑟𝑝4 + (1 − 𝑟)𝑝2
𝑃 (𝑦1 |𝑑𝑜(𝑥0 )) = 𝑃 (𝑦1 |𝑥0 , 𝑧1 )𝑃 (𝑧1 ) + 𝑃 (𝑦1 |𝑥0 , 𝑧0 )𝑃 (𝑧0 ) = 𝑟𝑝3 + (1 − 𝑟)𝑝1
𝑃 (𝑦0 |𝑑𝑜(𝑥1 )) = 1 − 𝑃 (𝑦1 |𝑑𝑜(𝑥1 )) = 1 − (𝑟𝑝4 + (1 − 𝑟)𝑝2 )
𝑃 (𝑦0 |𝑑𝑜(𝑥0 )) = 1 − 𝑃 (𝑦1 |𝑑𝑜(𝑥0 )) = 1 − (𝑟𝑝3 + (1 − 𝑟)𝑝1 )

Part (b)
By the adjustment formula, Eq. (3.6), we have the same as in (a):

𝑃 (𝑦1 |𝑑𝑜(𝑥1 )) = 𝑃 (𝑦1 |𝑥1 , 𝑧1 )𝑃 (𝑧1 ) + 𝑃 (𝑦1 |𝑥1 , 𝑧0 )𝑃 (𝑧0 ) = 𝑟𝑝4 + (1 − 𝑟)𝑝2
𝑃 (𝑦1 |𝑑𝑜(𝑥0 )) = 𝑃 (𝑦1 |𝑥0 , 𝑧1 )𝑃 (𝑧1 ) + 𝑃 (𝑦1 |𝑥0 , 𝑧0 )𝑃 (𝑧0 ) = 𝑟𝑝3 + (1 − 𝑟)𝑝1
𝑃 (𝑦0 |𝑑𝑜(𝑥1 )) = 1 − (𝑟𝑝4 + (1 − 𝑟)𝑝2 )
𝑃 (𝑦0 |𝑑𝑜(𝑥0 )) = 1 − (𝑟𝑝3 + (1 − 𝑟)𝑝1 )

Part (c)
To find the ACE, we simply substitute our computations from (a) into the ACE formula, giving us:

𝐴𝐶𝐸 = 𝑃 (𝑦1 |𝑑𝑜(𝑥1 )) − 𝑃 (𝑦1 |𝑑𝑜(𝑥0 ))


= 𝑟𝑝4 + (1 − 𝑟)𝑝2 − 𝑟𝑝3 − (1 − 𝑟)𝑝1
To compute RD, we use study question 1.5.2(d), and obtain:

𝑅𝐷 = 𝑃 (𝑦1 |𝑥1 ) − 𝑃 (𝑦1 |𝑥0 )
= [𝑟𝑝4 𝑞2 + (1 − 𝑟)𝑝2 𝑞1 ] / [𝑟𝑞2 + (1 − 𝑟)𝑞1 ] − [𝑟𝑝3 (1 − 𝑞2 ) + (1 − 𝑟)𝑝1 (1 − 𝑞1 )] / [𝑟(1 − 𝑞2 ) + (1 − 𝑟)(1 − 𝑞1 )]
We see that, in general, the 𝐴𝐶𝐸 ̸= 𝑅𝐷: the Average Causal Effect (ACE) measures the
effect on 𝑌 from intervening and forcing 𝑋 to change from 𝑥0 to 𝑥1 . In contrast, the Risk
Difference (RD) measures the effect on 𝑌 from observing change from 𝑥0 to 𝑥1 .

Comparing the expressions for ACE and RD, we see that when 𝑟 = 0, 𝑞1 ̸= 0, 𝑞1 ̸= 1, the
ACE is equivalent to RD, namely:
𝐴𝐶𝐸 − 𝑅𝐷 = 𝑝2 − 𝑝1 − 𝑝2 + 𝑝1 = 0
Part (d)
Using our answer to study question 1.5.2(c), we note that the following parameterization will
yield Simpson’s reversal:
𝑝1 = 0.1, 𝑝2 = 0, 𝑝3 = 0.3, 𝑝4 = 0.2, 𝑞1 = 0, 𝑞2 = 1, 𝑟 = 0.1
Recall that this parameterization yields Simpson’s reversal because, for each value of 𝑧,
the difference 𝑃 (𝑦|𝑥1 , 𝑧) − 𝑃 (𝑦|𝑥0 , 𝑧) is negative (the same sign as the ACE computed
below), whereas the aggregate risk difference is positive: 𝑅𝐷 = 0.2 − 0.1 = +0.1. We also
know that, because 𝑍 is the only confounder, to determine the overall causal effect of the
drug on recovery, we are interested in its average influence across all 𝑍-specific conditions.
To aid us in this analysis, we can use the ACE from part (c) above and consult the segregated
data, namely:

𝐴𝐶𝐸 = 𝑃 (𝑦1 |𝑑𝑜(𝑥1 )) − 𝑃 (𝑦1 |𝑑𝑜(𝑥0 ))


= 𝑟𝑝4 + (1 − 𝑟)𝑝2 − 𝑟𝑝3 − (1 − 𝑟)𝑝1
= 0.02 + 0 − 0.03 − 0.09
= −0.1
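As a numerical sanity check, the following lines of plain R arithmetic reproduce both quantities under this parameterization, making the reversal explicit (segregated effect negative, aggregate association positive):

p1 <- 0.1; p2 <- 0; p3 <- 0.3; p4 <- 0.2
q1 <- 0;   q2 <- 1; r  <- 0.1

# Adjustment formula (segregated data)
ACE <- (r*p4 + (1-r)*p2) - (r*p3 + (1-r)*p1)

# Aggregate risk difference
RD <- (r*p4*q2 + (1-r)*p2*q1) / (r*q2 + (1-r)*q1) -
      (r*p3*(1-q2) + (1-r)*p1*(1-q1)) / (r*(1-q2) + (1-r)*(1-q1))

c(ACE = ACE, RD = RD)   # ACE = -0.1, RD = +0.1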
Study question 3.3.1.
Consider the graph in Figure 3.8:
(a) List all of the sets of variables that satisfy the backdoor criterion to determine the causal
effect of 𝑋 on 𝑌 .
(b) List all of the minimal sets of variables that satisfy the backdoor criterion to determine
the causal effect of 𝑋 on Y (i.e., any set of variables such that, if you removed any one of
the variables from the set, it would no longer meet the criterion).

[Figure 3.8: nodes 𝐴, 𝐵, 𝐶, 𝐷, 𝑍, 𝑊, 𝑋, 𝑌 with edges 𝐵 → 𝐴, 𝐵 → 𝑍, 𝐶 → 𝑍, 𝐶 → 𝐷,
𝐴 → 𝑋, 𝑍 → 𝑋, 𝑍 → 𝑌 , 𝐷 → 𝑌 , 𝑋 → 𝑊 , 𝑊 → 𝑌 ; edge list as read from the solutions
below, see also dagitty.net/m331]
Figure 3.8: Causal graph used to illustrate the backdoor criterion in the following study
questions

(c) List all minimal sets of variables that need be measured in order to identify the effect of
𝐷 on 𝑌 . Repeat, for the effect of {𝑊, 𝐷} on 𝑌 .

Solution to study question 3.3.1

The graph of this exercise is available at dagitty.net/m331. Students can interactively


modify this graph and see the implications for the backdoor adjustment sets.

An R solution of this exercise is provided at dagitty.net/primer/3.3.1

Part (a)
Let us use the abbreviation "backdoor admissible" to denote a set of variables that satisfy
the backdoor criterion of Definition 3.3.1; for the present model, a backdoor admissible set
Z blocks all spurious paths between 𝑋 and 𝑌 while leaving all directed paths from 𝑋 to 𝑌
open. We can easily verify that the following sets satisfy the backdoor criterion to determine
the causal effect of 𝑋 on 𝑌 .
1. Sets of 2 nodes: {𝑍, 𝐴}, {𝑍, 𝐵}, {𝑍, 𝐶}, {𝑍, 𝐷}
2. Sets of 3 nodes: {𝑍, 𝐴, 𝐵}, {𝑍, 𝐴, 𝐶}, {𝑍, 𝐴, 𝐷}, {𝑍, 𝐵, 𝐶}, {𝑍, 𝐵, 𝐷}, {𝑍, 𝐶, 𝐷}
3. Sets of 4 nodes: {𝑍, 𝐴, 𝐵, 𝐶}, {𝑍, 𝐴, 𝐵, 𝐷}, {𝑍, 𝐴, 𝐶, 𝐷}, {𝑍, 𝐵, 𝐶, 𝐷}
4. Sets of 5 nodes: {𝑍, 𝐴, 𝐵, 𝐶, 𝐷}
Part (b)
According to (a), the following 4 sets are minimal, since in every other set, a node could be
removed and still ensure that the backdoor criterion is satisfied:
{𝑍, 𝐴}, {𝑍, 𝐵}, {𝑍, 𝐶}, {𝑍, 𝐷}.

Part (c)
For identifying the effect of 𝐷 on 𝑌 :
We want to find a set Z that blocks all backdoor paths from 𝐷 to 𝑌 . Notice that the set {𝐶} is
one solution, so any other set that contains 𝐶 is not minimal. Also, if the set does not contain
𝐶, then it must contain 𝑍, otherwise, we have an open path 𝑌 ← 𝑍 ← 𝐶 → 𝐷. However,
by including 𝑍, the backdoor path 𝑌 ← 𝑊 ← 𝑋 ← 𝐴 ← 𝐵 → 𝑍 ← 𝐶 → 𝐷 is open. To
block this path, we add any of 𝐴, 𝐵, 𝑋, or 𝑊 . So, there are a total of 5 minimal sets:
{𝐶}, {𝑍, 𝐴}, {𝑍, 𝐵}, {𝑍, 𝑋}, {𝑍, 𝑊 }.

For identifying the effect of {𝑊, 𝐷} on 𝑌 :


Again, we want to verify that the only open paths from {𝑊, 𝐷} to 𝑌 are the direct edges.
So similar to above, if the set Z contains 𝑍, then all backdoor paths from 𝐷 to 𝑌 or from
𝑊 to 𝑌 will be blocked. Also if the set Z does not contain 𝑍, then it must contain 𝐶 and
𝑋, otherwise, we have an open path through 𝐷 ← 𝐶 → 𝑍 → 𝑌 or 𝑊 ← 𝑋 ← 𝑍 → 𝑌 . So
there are a total of 2 minimal sets as follows:
{𝑍}, {𝐶, 𝑋}.
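All of these enumerations can be reproduced with the dagitty R package; the edge list below is our reading of Figure 3.8 (the graph itself is posted at dagitty.net/m331):

library(dagitty)
g <- dagitty("dag {
  B -> A ; B -> Z ; C -> Z ; C -> D
  A -> X ; Z -> X ; Z -> Y ; X -> W ; W -> Y ; D -> Y
}")

adjustmentSets(g, "X", "Y", type = "all")            # part (a)
adjustmentSets(g, "X", "Y", type = "minimal")        # part (b)
adjustmentSets(g, "D", "Y", type = "minimal")        # part (c): effect of D
adjustmentSets(g, c("W", "D"), "Y", type = "minimal")# part (c): effect of {W, D}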

Study question 3.3.2. (Lord’s Paradox).


At the beginning of the year, a boarding school offers its students a choice between two
meal plans for the year: Plan 𝐴 and Plan 𝐵. The students’ weights are recorded at the
beginning and the end of the year. To determine how each plan affects students’ weight
gain, the school hired two statisticians who, oddly, reached different conclusions. The first
statistician calculated the difference between each student’s weight in June (𝑊𝐹 ) and in
September (𝑊𝐼 ) and found that the average weight gain in each plan was zero.
The second statistician divided the students into several subgroups, one for each initial
weight, 𝑊𝐼 . He finds that for each initial weight, the final weight for Plan 𝐵 is higher than
the final weight for Plan 𝐴.

Figure 3.9: Scatter plot with students’ initial weights on the 𝑥-axis and final weights on the
𝑦-axis. The vertical line indicates students whose initial weights are the same, and whose
final weights are higher (on average) for plan B compared with plan A

So, the first statistician concluded that there was no effect of diet on weight gain and the
second concluded there was.

Figure 3.9 illustrates data sets that can cause the two statisticians to reach conflicting
conclusions. Statistician-1 examined the weight gain 𝑊𝐹 − 𝑊𝐼 which, for each student, is
represented by the shortest distance to the 45∘ line. Indeed, the average gain for each diet
plan is zero; the two groups are each situated symmetrically relative to the zero-gain line,
𝑊𝐹 = 𝑊𝐼 . Statistician-2, on the other hand, compared the final weights of plan 𝐴 students
to those of plan-𝐵 students who entered school with the same initial weight 𝑊0 and, as the
vertical line in the figure indicates, plan 𝐵 students are situated above plan 𝐴 students along
this vertical line. The same will be the case for any other vertical line, regardless of 𝑊0 .

(a) Draw a causal graph representing the situation.

(b) Determine which statistician is correct.

(c) How is this example related to Simpson’s paradox?

Solution to study question 3.3.2

Part (a)

First, let us configure the variables we’ll use in our model. Let 𝑋 be the students’ meal
plan choice, 𝑊𝐹 be the students’ final weights, and 𝑊𝐼 be the students’ initial weights.
We hypothesize that a student’s initial weight influences both their choice of meal plan and
their final weight. Additionally, meal plan influences final weight. So, the causal graph is as
follows:

[Graph: 𝑊𝐼 → 𝑋, 𝑊𝐼 → 𝑊𝐹 , 𝑋 → 𝑊𝐹 ]

Part (b)

The second statistician is correct, since the initial weight 𝑊𝐼 is a common cause of plan
choice 𝑋 and final weight 𝑊𝐹 . As such, when we estimate the effect of 𝑋 on 𝑊𝐹 , we
should stratify the data on initial weight. Indeed, 𝑊𝐼 satisfies the backdoor criterion for the
causal effect of 𝑋 on 𝑊𝐹 . The first statistician mistakenly used the aggregated data, failing
to account for the confounder 𝑊𝐼 .

Statistician 1’s argument sounds compelling only because it is expressed in terms of the
gain 𝐺 = 𝑊𝐹 − 𝑊𝐼 , which people perceive to be the quantity of interest, and which Figure
3.9 clearly shows to have the same mean in Diet A as in Diet B. However, once we add 𝐺 to
the graph (see below) the error in this argument becomes clear: to compute the effect of 𝑋
on 𝐺 we still need to adjust for 𝑊𝐼 .

[Graph with 𝐺 added: 𝑊𝐼 → 𝑋, 𝑊𝐼 → 𝑊𝐹 , 𝑋 → 𝑊𝐹 , plus 𝑊𝐼 → 𝐺 and 𝑊𝐹 → 𝐺, where 𝐺 = 𝑊𝐹 − 𝑊𝐼 ]
Part (c)
Comparing the model in Part (a) to the standard model in Simpson’s paradox (e.g., Fig. 1.8)
we see that the structures of the two models are the same, with a slightly different causal story.
The difference is that, in Simpson’s paradox, we have complete reversal while in Lord’s para-
dox, we are going from inequality to equality. To visualize this transition, we can examine the
𝑊𝐼 -specific distributions of 𝑊𝐹 for each of the diets, and ask whether the two distributions
differ. This we do by projecting samples corresponding to an initial weight 𝑊0 on to the 𝑊𝐹
axis, as shown in the graph below.

We see that for individuals having the same initial weight, 𝑊0 , their final weight will be
higher in Plan B than in Plan A (on the average). The distributions corresponding to the two
scatter plots are shifted. On the other hand, if we project the two scatter plots onto the G axis,
the two distributions coincide. Thus, the segregated data (conditioned on 𝑊𝐼 ) yields prefer-
ence of one diet over the other, while the unsegregated data (unconditioned on 𝑊𝐼 ) claims
equality for the two diets.

In Simpson’s paradox, on the other hand, we encounter sign reversal as we go from the
segregated to the unsegregated data. This is shown, for example, in the age-specific slopes of
Figure 1.1 in the text, which have opposite sign to the slope of the aggregated ellipse.
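A small simulation makes this geometry tangible. The sketch below is our own construction (all numbers arbitrary): it generates data matching the pattern of Figure 3.9, in which each plan's average gain is zero yet plan B dominates at every fixed initial weight.

set.seed(1)
n    <- 1e5
plan <- rbinom(n, 1, 0.5)                # 0 = plan A, 1 = plan B
mu   <- ifelse(plan == 1, 1, -1)         # plan B students start heavier
wi   <- rnorm(n, mean = mu)              # initial weight
wf   <- (wi + mu)/2 + rnorm(n, sd = 0.3) # regression toward the plan mean

# Statistician 1: average gain per plan (both ~ 0)
tapply(wf - wi, plan, mean)

# Statistician 2: final weight by plan, adjusting for initial weight
coef(lm(wf ~ plan + wi))["plan"]         # ~ +1 in favor of plan B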

Study question 3.3.3.


Revisit the lollipop story of Study question 1.2.4 and answer the following questions:

(a) Draw a graph that captures the story.

(b) Determine which variables need to be adjusted for by applying the backdoor criterion.

(c) Write the adjustment formula for the effect of the drug on recovery.

(d) Repeat questions (a)–(c) assuming that the nurse gave lollipops a day after the study,
still preferring patients who received treatment over those who received placebo.

Solution to study question 3.3.3

Students who wish to study Simpson’s paradox in more detail can use the interactive
simulator of the “Simpson Machine” at dagitty.net/learn/simpson/

Part (a)

As with any modeling problem, we begin by formalizing our variable choices. Let 𝑋 indicate
Treatment receipt, 𝑍 indicate Lollipop receipt, 𝑌 indicate Recovery, and 𝑈1 , 𝑈2 indicate 2
unobserved factors that correlate 𝑍 with 𝑋 and 𝑌 , respectively. The causal graph will be:

[Graph: 𝑈1 → 𝑋, 𝑈1 → 𝑍, 𝑈2 → 𝑍, 𝑈2 → 𝑌 , 𝑋 → 𝑌 ]
Part (b)

By Definition 3.3.1, the backdoor criterion, to estimate the effect of X on Y, we need not
adjust for any variable, since 𝑈1 → 𝑍 ← 𝑈2 is a collider that is closed when 𝑍 is not given.
As we discussed in the solution to study question 1.2.4, the structure of the graph permits us
to skip considerations of exchangeability (i.e., comparing apples and oranges), and get the
answer mechanically and reliably.

Part (c)

According to (b), since we need not adjust for any covariates to block any spurious paths, we
may simply say that: 𝑃 (𝑦|𝑑𝑜(𝑥)) = 𝑃 (𝑦|𝑥)

Part (d)

Our answers do not change; timing of the Lollipop receipt does not change the causal struc-
ture of the model, as long as receiving a Lollipop is assumed to have no effect on either
treatment or outcome. In other words, 𝑍 is not an ancestor of either 𝑋 or 𝑌 .

Study question 3.4.1.


Assume that in Figure 3.8, only 𝑋, 𝑌, and one additional variable can be measured. Which
variable would allow the identification of the effect of 𝑋 on 𝑌 ? What would that effect be?

Solution to study question 3.4.1

Variable 𝑊 satisfies the Front-Door criterion in Definition 3.4.1, so 𝑊 would allow the
identification of the effect of 𝑋 on 𝑌 , namely:
𝑃 (𝑦|𝑑𝑜(𝑥)) = Σ𝑤 𝑃 (𝑤|𝑥)Σ𝑥′ 𝑃 (𝑦|𝑥′ , 𝑤)𝑃 (𝑥′ )

Study question 3.4.2.


I went to a pharmacy to buy a certain drug, and I found that it was available in two different
bottles: one priced at $1, the other at $10. I asked the druggist, “What’s the difference?”
and he told me, “The $10 bottle is fresh, whereas the $1 bottle has been on the shelf for
3 years. But, you know, data shows that the percentage of recovery is much higher among
those who bought the cheap stuff. Amazing isn’t it?” I asked if the aged drug was ever tested.
He said, “Yes, and this is even more amazing; 95% of the aged drug and only 5% of the fresh
drug has lost the active ingredient, yet the percentage of recovery among those who got bad
bottles, with none of the active ingredient, is still much higher than among those who got
good bottles, with the active ingredient.”
Before ordering a cheap bottle, it occurred to me to have a good look at the data. The
data were, for each previous customer, the type of bottle purchased (aged or fresh), the
concentration of the active ingredient in the bottle (high or low), and whether the customer
recovered from the illness. The data perfectly confirmed the druggist’s story. However, after
making some additional calculations, I decided to buy the expensive bottle after all; even
without testing its content, I could determine that a fresh bottle would offer the average
patient a greater chance of recovery.
Based on two very reasonable assumptions, the data show clearly that the fresh drug is more
effective. The assumptions are as follows:
(i) Customers had no information about the chemical content (high or low) of the
specific bottle of the drug that they were buying; their choices were influenced by
price and shelf-age alone.
(ii) The effect of the drug on any given individual depends only on its chemical content,
not on its shelf age (fresh or aged).
(a) Describe this scenario in a causal graph.
(b) Construct a data set compatible with the story and the decision to buy the expensive
bottle.
(c) Determine the effect of choosing the fresh versus the aged drug by using assumptions (i)
and (ii) and the data given in (b).

Solution to study question 3.4.2


Part (a)
Let 𝑋 denote Drug Price (the cheap / old vs. expensive / fresh), 𝑍 indicate chemical Potency
of the drug, and 𝑌 indicate Recovery. We hypothesize that the age of the Drug affects its

chemical Potency, which in turn affects Recovery. We also hypothesize the existence of an
unobserved confounder that makes people influenced by Price to have more health problems.
The graph is as follows:

[Graph: 𝑋 → 𝑍 → 𝑌 , with an unobserved confounder 𝑈 such that 𝑋 ← 𝑈 → 𝑌 ]
Part (b)
(Hint: the data are the same as in Table 3.2 in the book, and they share the same causal
relations and corresponding graphical model.)

                 Cheap (400)      Expensive (400)    All Subjects (800)
                 Low     High     Low     High       Low     High
Bottles          380     20       20      380        400     400
Recovery         323     18       1       38         324     56
No Recovery      57      2        19      342        76      344

Part (c)
Using the data in (b) above, our knowledge that 𝑍 satisfies the front-door criterion, and Theorem 3.4.1,
we can compare the causal effects of the two drug types to see which is superior. Let 𝑦1 denote
recovery, 𝑥1 denote choosing the expensive drug, and 𝑧1 denote high chemical content.

𝑃 (𝑦1 |𝑑𝑜(𝑥1 )) = Σ𝑧 𝑃 (𝑧|𝑥1 )Σ𝑥′ 𝑃 (𝑦1 |𝑥′ , 𝑧)𝑃 (𝑥′ )


= 𝑃 (𝑧1 |𝑥1 )[𝑃 (𝑦1 |𝑥1 , 𝑧1 )𝑃 (𝑥1 ) + 𝑃 (𝑦1 |𝑥0 , 𝑧1 )𝑃 (𝑥0 )]
+ 𝑃 (𝑧0 |𝑥1 )[𝑃 (𝑦1 |𝑥1 , 𝑧0 )𝑃 (𝑥1 ) + 𝑃 (𝑦1 |𝑥0 , 𝑧0 )𝑃 (𝑥0 )]
= 0.95 * (0.1 * 0.5 + 0.9 * 0.5) + 0.05 * (0.05 * 0.5 + 0.85 * 0.5)
= 0.4975
𝑃 (𝑦1 |𝑑𝑜(𝑥0 )) = Σ𝑧 𝑃 (𝑧|𝑥0 )Σ𝑥′ 𝑃 (𝑦1 |𝑥′ , 𝑧)𝑃 (𝑥′ )
= 𝑃 (𝑧1 |𝑥0 )[𝑃 (𝑦1 |𝑥1 , 𝑧1 )𝑃 (𝑥1 ) + 𝑃 (𝑦1 |𝑥0 , 𝑧1 )𝑃 (𝑥0 )]
+ 𝑃 (𝑧0 |𝑥0 )[𝑃 (𝑦1 |𝑥1 , 𝑧0 )𝑃 (𝑥1 ) + 𝑃 (𝑦1 |𝑥0 , 𝑧0 )𝑃 (𝑥0 )]
= 0.05 * (0.1 * 0.5 + 0.9 * 0.5) + 0.95 * (0.05 * 0.5 + 0.85 * 0.5)
= 0.4525
𝐴𝐶𝐸 = 𝑃 (𝑦1 |𝑑𝑜(𝑥1 )) − 𝑃 (𝑦1 |𝑑𝑜(𝑥0 ))
= 0.045 > 0
So, we see that recovery is indeed more likely when using the expensive drug than the cheap
one.
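The same front-door computation can be reproduced in a few lines of R, reading the conditional probabilities straight off the table in (b):

# Conditional probabilities from the table (x1 = expensive, z1 = high potency)
P.z1.x1 <- 380/400; P.z1.x0 <- 20/400    # P(high | expensive), P(high | cheap)
P.x1    <- 400/800                        # P(expensive)
P.y1    <- c(x0z0 = 323/380, x0z1 = 18/20,
             x1z0 = 1/20,    x1z1 = 38/380)

fd <- function(pz1) {                     # front-door adjustment, Theorem 3.4.1
  pz1     * (P.y1["x1z1"]*P.x1 + P.y1["x0z1"]*(1 - P.x1)) +
  (1-pz1) * (P.y1["x1z0"]*P.x1 + P.y1["x0z0"]*(1 - P.x1))
}
fd(P.z1.x1) - fd(P.z1.x0)                 # ACE = 0.4975 - 0.4525 = 0.045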

We can rationalize these findings by first remembering that our data is observational, and
that an unobserved confounder between drug choice and recovery might create the illusion
that the cheap drug is more effective. This would happen, for instance, if the more frugal
customers are also on a more healthy diet. To eliminate such illusions we must evaluate the
ACE which, given our story, is obtained from the front door formula.

We can also foster an intuition for how the front-door formula is useful. First, we can com-
pute the causal effect of the drug’s active ingredients on recovery. Let 𝑝 (respectively, 𝑝′ )
represent the recovery probability of a randomly chosen person forced to take a drug high
(respectively, low) in the active ingredients, i.e.,

𝑝 = 𝑃 (𝑌 = 𝑟𝑒𝑐𝑜𝑣𝑒𝑟|𝑑𝑜(𝑍 = ℎ𝑖𝑔ℎ))
𝑝′ = 𝑃 (𝑌 = 𝑟𝑒𝑐𝑜𝑣𝑒𝑟|𝑑𝑜(𝑍 = 𝑙𝑜𝑤))

To counteract possible confounding between recovery and active ingredient potency, we


adjust for price. We find the recovery probability for those who bought cheap bottles, then
for those who bought expensive bottles, and we take a weighted average of the two based on
the relative size of the two groups, doing the same with those given a bottle with low potency.
Taking the difference gives us the causal effect difference, 𝑝 − 𝑝′ , averaged over the entire
population.

Now that we know the causal effect of the ingredient, we reason as follows: If I choose a
cheap bottle, I stand a 5% chance of getting a good bottle, with recovery probability 𝑝, and
95% chance of getting a bad bottle with a recovery probability of 𝑝′ . Thus, the average prob-
ability of recovery on choosing a cheap bottle is .05𝑝 + .95𝑝′ . Things turn around if I buy
an expensive bottle, giving me an average probability of recovery of .05𝑝′ + .95𝑝. Thus the
difference between the two choices (cheap minus expensive) amounts to:

.05(𝑝 − 𝑝′ ) + .95(𝑝′ − 𝑝) = .9(𝑝′ − 𝑝)

So, if 𝑝 > 𝑝′ , this difference is negative, and it is clearly advantageous to buy the expensive bottle.

Study question 3.5.1.


Consider the causal model of Figure 3.8.
(a) Find an expression for the 𝑐-specific effect of 𝑋 on 𝑌 .
(b) Identify a set of four variables that need to be measured in order to estimate the z-specific
effect of 𝑋 on 𝑌 , and find an expression for the size of that effect.
(c) Using your answer to part (b), determine the expected value of 𝑌 under a 𝑍-dependent
strategy where 𝑋 is set to 0 when 𝑍 is smaller or equal to 2 and 𝑋 is set to 1 when 𝑍 is
larger than 2. (Assume 𝑍 takes on integer values from 1 to 5.)

Solution to study question 3.5.1

The graph of this exercise is available at dagitty.net/m331. Students can solve parts (a)
and (b) interactively in class by forcing adjustment for single covariates (move mouse
pointer over the variable and press "a" key).

An R solution of this exercise is provided at dagitty.net/primer/3.5.1

Repeating Figure 3.8 for ease of reference:

[Figure 3.8 edges: 𝐵 → 𝐴, 𝐵 → 𝑍, 𝐶 → 𝑍, 𝐶 → 𝐷, 𝐴 → 𝑋, 𝑍 → 𝑋, 𝑍 → 𝑌 , 𝐷 → 𝑌 , 𝑋 → 𝑊 , 𝑊 → 𝑌 ]
Part (a)
By Rule 2, we must adjust for a set of variables that satisfies the backdoor criterion,
conditional on 𝐶. We observe that when we condition on 𝐶, there is still an open
backdoor path 𝑋 ← 𝑍 → 𝑌 , which we can block by conditioning on 𝑍. So, we may claim
that:

𝑃 (𝑌 = 𝑦|𝑑𝑜(𝑋 = 𝑥), 𝐶 = 𝑐) = Σ𝑧 𝑃 (𝑌 = 𝑦|𝑋 = 𝑥, 𝑍 = 𝑧, 𝐶 = 𝑐)𝑃 (𝑍 = 𝑧|𝐶 = 𝑐)

Above, Rule 2 is applicable because the set {𝑍, 𝐶} satisfies the backdoor criterion to assess
the 𝑐-specific effect of 𝑋 on 𝑌 .

Part (b)
Again using Rule 2, we see that {𝑋, 𝑌, 𝑍, 𝐶} is such a set since {𝑍, 𝐶} satisfies the backdoor
criterion. We can then write:
𝑃 (𝑌 = 𝑦|𝑑𝑜(𝑋 = 𝑥), 𝑍 = 𝑧) = Σ𝑐 𝑃 (𝑦|𝑥, 𝑧, 𝑐)𝑃 (𝑐|𝑧)

Advanced students may be challenged to show that {𝑋, 𝑌, 𝑍, 𝑊 } is also such a set, since 𝑊
satisfies the front-door criterion when 𝑍 is specified.

Part (c)
Since our choice of 𝑋 relies upon the value of 𝑍, we need to adopt the conditional policy
𝑑𝑜(𝑋 = 𝑔(𝑍)), where:
𝑔(𝑍) = 0 if 𝑍 ≤ 2, and 𝑔(𝑍) = 1 if 𝑍 > 2

We then assume that 𝑍 ∈ {1, 2, 3, 4, 5}, so by Eq. (3.17) we have:


𝑃 (𝑌 = 𝑦|𝑑𝑜(𝑋 = 𝑔(𝑍))) = Σ𝑧 𝑃 (𝑌 = 𝑦|𝑑𝑜(𝑋 = 𝑔(𝑧)), 𝑍 = 𝑧)𝑃 (𝑍 = 𝑧)
= 𝑃 (𝑌 = 𝑦|𝑑𝑜(𝑋 = 0), 𝑍 = 1)𝑃 (𝑍 = 1)
+ 𝑃 (𝑌 = 𝑦|𝑑𝑜(𝑋 = 0), 𝑍 = 2)𝑃 (𝑍 = 2)
+ 𝑃 (𝑌 = 𝑦|𝑑𝑜(𝑋 = 1), 𝑍 = 3)𝑃 (𝑍 = 3)
+ 𝑃 (𝑌 = 𝑦|𝑑𝑜(𝑋 = 1), 𝑍 = 4)𝑃 (𝑍 = 4)
+ 𝑃 (𝑌 = 𝑦|𝑑𝑜(𝑋 = 1), 𝑍 = 5)𝑃 (𝑍 = 5)
So, for each term above in the format 𝑃 (𝑌 = 𝑦|𝑑𝑜(𝑋 = 𝑥), 𝑍 = 𝑧), we can substitute our
findings from (b) to find an expression free of the do-operator.
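Schematically, the final step is just a 𝑍-weighted mixture. In the R sketch below every number is hypothetical (the question supplies no distribution for 𝑍 and no interventional probabilities), so the code only illustrates the bookkeeping:

# Hypothetical ingredients: P(Z = z) and P(Y = 1 | do(X = x), Z = z)
pz       <- c(0.1, 0.2, 0.3, 0.25, 0.15)       # z = 1..5
py.do.x0 <- c(0.50, 0.55, 0.60, 0.65, 0.70)    # z-specific effects from part (b)
py.do.x1 <- c(0.60, 0.62, 0.75, 0.80, 0.85)

g  <- ifelse(1:5 <= 2, 0, 1)                   # the Z-dependent policy g(Z)
py <- ifelse(g == 0, py.do.x0, py.do.x1)
sum(py * pz)                                   # P(Y = 1 | do(X = g(Z)))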

Study question 3.8.1.


Consider the causal model of Figure 3.10.

[Figure 3.10: a linear SEM with edges (and path coefficients) 𝑍1 → 𝑊1 (𝑎1 ), 𝑍1 → 𝑍3 (𝑎3 ),
𝑍2 → 𝑍3 (𝑏3 ), 𝑍2 → 𝑊2 (𝑐2 ), 𝑊1 → 𝑋 (𝑡1 ), 𝑍3 → 𝑋 (𝑡2 ), 𝑋 → 𝑊3 (𝑐3 ), 𝑊3 → 𝑌 (𝑎),
𝑍3 → 𝑌 (𝑏), 𝑊2 → 𝑌 (𝑐); edge list as read from the solutions below]
Figure 3.10

(a) Identify three testable implications of this model.


(b) Identify a testable implication assuming that only 𝑋, 𝑌, 𝑊3 , and 𝑍3 are observed.
(c) For each of the parameters in the model, write a regression equation in which one of the
coefficients is equal to that parameter. Identify the parameters for which more than one
such equation exists.
(d) Suppose 𝑋, 𝑌 and 𝑊3 are the only variables observed. Which parameters can be
identified from the data? Can the total effect of 𝑋 on 𝑌 be estimated?
(e) If we regress 𝑍1 on all other variables in the model, which regression coefficient will be
zero?
(f) The model in Figure 3.10 implies that certain regression coefficients will remain invariant
when an additional variable is added as a regressor. Identify five such coefficients with
their added regressors.
(g) Assume that variables 𝑍2 and 𝑊2 cannot be measured. Find a way to estimate 𝑏 using
regression coefficients. [Hint: Find a way to turn 𝑍1 into an instrumental variable for 𝑏.]

Solution to study question 3.8.1


Part (a)
Testable implications are conditional independence relationships implied by the structure of
the graph. These conditional independences translate into vanishing regression coefficients
in the data. Examining Figure 3.10, three regression equations that could be used to test the
model could be:

1. 𝑊3 is independent of 𝑊1 given 𝑋, giving us the regression equation:


𝑊3 = 𝑟𝑋 𝑋 + 𝑟𝑊1 𝑊1 with 𝑟𝑊1 = 0. This means that if we fit the data to the line
𝑊3 = 𝑟𝑋 𝑋 + 𝑟𝑊1 𝑊1 , we expect to find 𝑟𝑊1 = 0, or else the model is wrong.
2. 𝑊1 is independent of 𝑍3 given 𝑍1 , giving us the regression equation:
𝑊1 = 𝑟𝑍1 𝑍1 + 𝑟𝑍3 𝑍3 with 𝑟𝑍3 = 0
3. 𝑌 is independent of 𝑍1 given 𝑊1 , 𝑍2 , and 𝑍3 , giving us the regression equation:
𝑌 = 𝑟𝑍1 𝑍1 + 𝑟𝑊1 𝑊1 + 𝑟𝑍2 𝑍2 + 𝑟𝑍3 𝑍3 with 𝑟𝑍1 = 0

Part (b)
The only conditional independence that involves the measured variables is the one between
𝑍3 and 𝑊3 given 𝑋, which leads to 𝑟𝑍3 = 0 in the corresponding regression equation:
𝑊3 = 𝑟𝑍3 𝑍3 + 𝑟𝑋 𝑋 with 𝑟𝑍3 = 0

Part (c)
(i) If we regress a variable on its parents, we get a regression equation whose coefficients
equal the model parameters. Therefore:
1. 𝑎 = 𝑟𝑊3 , 𝑏 = 𝑟𝑍3 , 𝑐 = 𝑟𝑊2 in the equation:

𝑌 = 𝑟𝑊3 𝑊3 + 𝑟𝑍3 𝑍3 + 𝑟𝑊2 𝑊2

2. 𝑎1 = 𝑟𝑍1 in:

𝑊1 = 𝑟𝑍1 𝑍1

3. 𝑎3 = 𝑟𝑍1 , 𝑏3 = 𝑟𝑍2 in:

𝑍3 = 𝑟𝑍1 𝑍1 + 𝑟𝑍2 𝑍2

4. 𝑐2 = 𝑟𝑍2 in:

𝑊2 = 𝑟𝑍2 𝑍2

5. 𝑐3 = 𝑟𝑋 in:

𝑊3 = 𝑟𝑋 𝑋

6. 𝑡1 = 𝑟𝑊1 , 𝑡2 = 𝑟𝑍3 :

𝑋 = 𝑟𝑊1 𝑊1 + 𝑟𝑍3 𝑍3

(ii) The "Regression Rule for Identification" tells us that, if 𝐺𝛼 has several backdoor sets,
each would lead to a regression equation in which 𝛼 is a coefficient. Therefore, 𝑎, 𝑏, 𝑐 can be
identified by:

1. 𝑎 = 𝑟𝑊3 , 𝑏 = 𝑟𝑍3 , 𝑐 = 𝑟𝑊2 in:

𝑌 = 𝑟𝑊3 𝑊3 + 𝑟𝑍3 𝑍3 + 𝑟𝑊2 𝑊2

Or, by 𝑎 = 𝑟𝑊3 , 𝑏 = 𝑟𝑍3 in:

𝑌 = 𝑟𝑊3 𝑊3 + 𝑟𝑍3 𝑍3 + 𝑟𝑍2 𝑍2

(in this second equation the coefficient of 𝑍2 identifies the product 𝑐2 𝑐 rather than 𝑐 itself, since 𝑊2 is omitted).

2. Likewise, 𝑡1 can be identified either by 𝑡1 = 𝑟𝑊1 in:

𝑋 = 𝑟𝑊1 𝑊1 + 𝑟𝑍1 𝑍1

Or, by 𝑡1 = 𝑟𝑊1 in:

𝑋 = 𝑟𝑊1 𝑊1 + 𝑟𝑍3 𝑍3

(either 𝑍1 or 𝑍3 blocks the backdoor path 𝑊1 ← 𝑍1 → 𝑍3 → 𝑋).

Part (d)
To determine which parameters are estimable from data, we consult the "Regression Rule
for Identification." The parameter 𝑐3 can be estimated from data because 𝑊3 = 𝑟𝑋 𝑋 + 𝑈3′ =
𝑐3 𝑋 + 𝑈3′ (there are no backdoor paths from 𝑋 to 𝑊3 ). Likewise, 𝑎 = 𝑟𝑌 𝑊3 ·𝑋 , since 𝑊3 is
𝑑-separated from 𝑌 given 𝑋 in 𝐺𝑎 , the graph with the edge 𝑊3 → 𝑌 removed.

Lastly, we note that 𝑊3 is a front-door admissible variable for attaining the total effect of 𝑋
on 𝑌 , and so the effect is estimable. Indeed the total effect of 𝑋 on 𝑌 is simply the product
of 𝑎 * 𝑐3 , which we identified above.

Part (e)
Regressing 𝑍1 on all other variables in the model gives:
𝑍1 = 𝑟𝑍2 𝑍2 + 𝑟𝑍3 𝑍3 + 𝑟𝑋 𝑋 + 𝑟𝑤1 𝑊1 + 𝑟𝑊2 𝑊2 + 𝑟𝑊3 𝑊3 + 𝑟𝑌 𝑌
By 𝑑-separation, we see that 𝑍1 is independent of {𝑋, 𝑊3 , 𝑌, 𝑊2 } given 𝑊1 , 𝑍3 , 𝑍2 . There-
fore, 𝑟𝑋 = 0, 𝑟𝑊2 = 0, 𝑟𝑊3 = 0, 𝑟𝑌 = 0

Part (f)
In order for a coefficient to remain invariant under the addition of a new regressor,
the dependent variable must be independent of the added regressor given all of the old
regressors.
Thus, for example, if we regress 𝑊1 on 𝑍3 and 𝑋, adding 𝑊3 would keep all regression
coefficients intact, but adding 𝑌 or 𝑍2 would change them, because the path 𝑌 ← 𝑊2 ←
𝑍2 → 𝑍3 ← 𝑍1 → 𝑊1 is opened by conditioning on 𝑍3 . If we regress 𝑊1 on 𝑍1 , then we
can add 𝑍3 , 𝑍2 , or 𝑊2 without changing the regression coefficient.

Part (g)
Note that if we condition on 𝑊1 , we turn 𝑍1 into an instrument relative to the effect 𝜏 of 𝑍3
on 𝑌 . Using this idea, we can write the regression of 𝑌 on 𝑍1 given 𝑊1 , as the product 𝜏 𝑎3
where 𝜏 = 𝑡2 𝑐3 𝑎 + 𝑏. Since each of 𝑡2 , 𝑐3 , 𝑎 can be separately identified (see Parts (c) and (d) above),
we can then solve for 𝑏. Formally, we have:

𝑡2 𝑐3 𝑎 + 𝑏 = 𝑟𝑍1 /𝑟𝑍1′

Where 𝑟𝑍1 and 𝑟𝑍1′ are the regression coefficients in the following equations:

𝑌 = 𝑟𝑍1 𝑍1 + 𝑟𝑊1 𝑊1 + 𝜖
𝑍3 = 𝑟𝑍1′ 𝑍1 + 𝜖
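A quick simulation confirms the recipe. In the sketch below all structural coefficients are arbitrary choices of ours; the ratio of the two fitted coefficients recovers 𝜏 = 𝑡2 𝑐3 𝑎 + 𝑏, from which 𝑏 follows:

set.seed(1)
n  <- 5e5
a1 <- 0.5; a3 <- 0.7; b3 <- 0.4; c2 <- 0.6
t1 <- 0.3; t2 <- 0.5; c3 <- 0.8; a <- 0.9; b <- 0.25; cc <- 0.35  # cc avoids masking c()

z1 <- rnorm(n); z2 <- rnorm(n)
z3 <- a3*z1 + b3*z2 + rnorm(n)
w1 <- a1*z1 + rnorm(n)
w2 <- c2*z2 + rnorm(n)
x  <- t1*w1 + t2*z3 + rnorm(n)
w3 <- c3*x + rnorm(n)
y  <- a*w3 + b*z3 + cc*w2 + rnorm(n)

r   <- coef(lm(y  ~ z1 + w1))["z1"]  # = a3 * tau (Z1 as instrument, given W1)
rp  <- coef(lm(z3 ~ z1))["z1"]       # = a3
tau <- r / rp                        # = t2*c3*a + b
tau - t2*c3*a                        # recovers b = 0.25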
Study Questions and Solutions for
Chapter 4

Study question 4.3.1.


Consider the model in Figure 4.3 and assume that 𝑈1 and 𝑈2 are two independent Gaussian
variables, each with zero mean and unit variance.

[Figure: 𝑋 (College) → 𝑍 (Skill) → 𝑌 (Salary), with path coefficients 𝑎 on 𝑋 → 𝑍 and 𝑏 on 𝑍 → 𝑌 , and exogenous 𝑈1 → 𝑋, 𝑈2 → 𝑍]

Figure 4.3: A model representing Eq. (4.7), illustrating the causal relations between college
education (𝑋), skills (𝑍), and salary (𝑌 )

(a) Find the expected salary of workers at skill level 𝑍 = 𝑧 had they received 𝑥 years of
college education. [Hint: Use Theorem 4.3.2, with 𝑒 : 𝑍 = 𝑧, and the fact that for any two
Gaussian variables, say 𝑋 and 𝑍, we have 𝐸[𝑋|𝑍 = 𝑧] = 𝐸[𝑋] + 𝑅𝑋𝑍 (𝑧 − 𝐸[𝑍]).
Use the material in Sections 3.8.2 and 3.8.3 to express all regression coefficients in terms
of structural parameters, and show that 𝐸[𝑌𝑥 |𝑍 = 𝑧] = 𝑎𝑏𝑥 + 𝑏𝑧/(1 + 𝑎2 ).]
(b) Based on the solution for (a), show that the skill-specific effect of education on salary is
independent of the skill level.

Solution to study question 4.3.1

Part (a)

The quantity of interest will be 𝐸[𝑌𝑥 |𝑍 = 𝑧] in the linear model:

[𝑈1 → 𝑋, 𝑈2 → 𝑍; 𝑋 → 𝑍 (coefficient 𝑎), 𝑍 → 𝑌 (coefficient 𝑏)]

Using the counterfactual formula of Theorem 4.3.2

𝐸[𝑌𝑥 |𝑒] = 𝐸[𝑌 |𝑒] + 𝜏 [𝑥 − 𝐸(𝑋|𝑒)]

we insert 𝑒 = {𝑍 = 𝑧}, and obtain

𝐸[𝑌𝑥 |𝑍 = 𝑧] = 𝐸[𝑌 |𝑧] + 𝜏 (𝑥 − 𝐸[𝑋|𝑧])

Assuming 𝑈1 and 𝑈2 are standardized, we have

𝐸[𝑋|𝑧] = 𝛽𝑋𝑍 𝑧 = 𝛽𝑍𝑋 (𝜎𝑋²/𝜎𝑍²) 𝑧 = 𝑎 · [1/𝑉 𝑎𝑟(𝑎𝑋 + 𝑈2 )] · 𝑧 = 𝑎𝑧/(1 + 𝑎²)

which gives

𝐸[𝑌𝑥 |𝑍 = 𝑧] = 𝑏𝑧 + 𝑎𝑏(𝑥 − 𝑎𝑧/(1 + 𝑎²)) = 𝑎𝑏𝑥 + 𝑏𝑧/(1 + 𝑎²)

Part (b)

We want to show that 𝐸[𝑌1 − 𝑌0 |𝑍 = 𝑧] = 𝐸[𝑌1 − 𝑌0 ]. According to (a), we know that:

𝐸[𝑌1 − 𝑌0 |𝑍 = 𝑧] = 𝐸[𝑌1 |𝑍 = 𝑧] − 𝐸[𝑌0 |𝑍 = 𝑧]
= [𝑎𝑏 + 𝑏𝑧/(1 + 𝑎²)] − [0 + 𝑏𝑧/(1 + 𝑎²)]
= 𝑎𝑏

Also, we know that 𝐸[𝑌𝑥 ] = 𝐸[𝑌 ] + 𝑅𝑌 𝑋 (𝑥 − 𝐸[𝑋]), so we may conclude:

𝐸[𝑌1 − 𝑌0 ] = 𝐸[𝑌1 ] − 𝐸[𝑌0 ]


= 𝐸[𝑌 ] + 𝑎𝑏(1 − 𝐸[𝑋]) − 𝐸[𝑌 ] − 𝑎𝑏(0 − 𝐸[𝑋])
= 𝑎𝑏
= 𝐸[𝑌1 − 𝑌0 |𝑍 = 𝑧]
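Readers who prefer simulation to algebra can check the result numerically. The R sketch below (with arbitrary values 𝑎 = 0.8, 𝑏 = 0.5, 𝑥 = 2, 𝑧 = 1) generates the counterfactual 𝑌𝑥 directly from the structural equations and compares its conditional mean with the closed form 𝑎𝑏𝑥 + 𝑏𝑧/(1 + 𝑎²):

set.seed(1)
n <- 1e6
a <- 0.8; b <- 0.5; x <- 2; z <- 1

u1 <- rnorm(n); u2 <- rnorm(n)
X  <- u1
Z  <- a*X + u2
Yx <- b*(a*x + u2)           # counterfactual: Z had X been x, keeping the same U2

sel <- abs(Z - z) < 0.05     # condition on Z ~ z
mean(Yx[sel])                # Monte Carlo estimate of E[Yx | Z = z]
a*b*x + b*z/(1 + a^2)        # closed form, ~ 1.105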

Study question 4.3.2.


(a) Describe how the parameters 𝑎, 𝑏, 𝑐 of Model 2 (Figure 4.1) can be estimated from
nonexperimental data.
(b) In the model of Figure 4.3, find the effect of education on those students whose salary is
𝑌 = 1. [Hint: use Theorem 4.3.2 to compute 𝐸[𝑌1 − 𝑌0 |𝑌 = 1].]
(c) Estimate 𝜏 and the 𝐸𝑇 𝑇 = 𝐸[𝑌1 − 𝑌0 |𝑋 = 1] for the model described in Eq. (4.19)
[Hint: Use the basic definition of counterfactuals, Eq. (4.5) and the equality 𝐸[𝑍|𝑋 =
𝑥′ ] = 𝑅𝑍𝑋 𝑥′ .]

[Figure: 𝑋 (Encouragement) → 𝐻 (Homework) → 𝑌 (Exam Score), with 𝑎 = 0.5 on 𝑋 → 𝐻, 𝑐 = 0.4 on 𝐻 → 𝑌 , and a direct edge 𝑋 → 𝑌 with 𝑏 = 0.7]
Figure 4.1: A model depicting the effect of Encouragement (𝑋) on student’s score
[Figure: the same model with 𝐻 held fixed at 𝐻 = 2: 𝑋 → 𝑌 (𝑏 = 0.7) and 𝐻 = 2 → 𝑌 (𝑐 = 0.4)]

Figure 4.2: Answering a counterfactual question about a specific student’s score, predicated
on the assumption that homework would have increased to 𝐻 = 2

Solution to study question 4.3.2


Part (a)
If the model is correct, then we can estimate the parameters from non-experimental data
by simple regression. To estimate 𝑎, we regress 𝐻 on 𝑋 and compute the slope via Eq.
(1.22):
𝐻 = 𝛼 + 𝑎𝑥 + 𝜖ℎ


In other words, our estimate of 𝑎 is the coefficient of 𝑥 in the equation 𝐻 = 𝛼 + 𝑎𝑥 which


fits the data best. Similarly (using Eqs. (1.27) and (1.28)), our estimates of 𝑏 and 𝑐 are the
coefficients of 𝑥 and ℎ, respectively, in the regression equation 𝑌 = 𝛾 + 𝑏𝑥 + 𝑐ℎ.

These "best fit" coefficients can be computed by efficient "least-square" algorithms.

Part (b)
By Theorem 4.3.2 we know that:

𝐸[𝑌1 − 𝑌0 |𝑌 = 1] = 𝐸[𝑌 |𝑌 = 1] + 𝜏 (1 − 𝐸[𝑋|𝑌 = 1]) − 𝐸[𝑌 |𝑌 = 1] − 𝜏 (0 − 𝐸[𝑋|𝑌 = 1])


=𝜏
= 𝑎𝑏

Part (c)
First, we define the model 𝑀 as follows:

𝑀:
𝐻 = 𝑈𝐻
𝑋 = 𝑎𝐻 + 𝑈𝑋
𝑌 = 𝑏𝑋 + 𝑐𝐻 + 𝛿𝑋𝐻 + 𝑈𝑌
From 𝑀 , we may also describe the mutilated model 𝑀𝑥 , representing the intervention
𝑋 = 𝑥, as given by:

𝑀𝑥 :
𝐻 = 𝑈𝐻
𝑋=𝑥
𝑌 = 𝑏𝑥 + 𝑐𝐻 + 𝛿𝑥𝐻 + 𝑈𝑌
So by definition, to compute the counterfactual quantities needed for the ETT, we use Eq.
(4.5) and write:

𝐸[𝑌𝑥 |𝑋 = 𝑥′ ] = 𝐸[𝑏𝑥 + 𝑐𝐻 + 𝛿𝑥𝐻 + 𝑈𝑌 |𝑋 = 𝑥′ ]
= 𝑏𝑥 + (𝑐 + 𝛿𝑥)𝐸[𝐻|𝑋 = 𝑥′ ]
= 𝑏𝑥 + (𝑐 + 𝛿𝑥)𝛽𝐻𝑋 𝑥′
= 𝑏𝑥 + (𝑐 + 𝛿𝑥)𝛽𝑋𝐻 (𝜎𝐻²/𝜎𝑋²) 𝑥′
= 𝑏𝑥 + (𝑐 + 𝛿𝑥) 𝑎𝑥′/𝑉 𝑎𝑟(𝑎𝐻 + 𝑈𝑋 )
= 𝑏𝑥 + (𝑐 + 𝛿𝑥) 𝑎𝑥′/(1 + 𝑎²)
Consequently, we can compute the ETT using Eq. (4.18) as:

𝐸𝑇 𝑇 = 𝐸[𝑌1 − 𝑌0 |𝑋 = 1]
= 𝐸[𝑌1 |𝑋 = 1] − 𝐸[𝑌0 |𝑋 = 1]
= 𝑏 + (𝑐 + 𝛿)𝐸[𝐻|𝑋 = 1] − 𝑐𝑎/(1 + 𝑎²)
= 𝑏 + (𝑐 + 𝛿) 𝑎/(1 + 𝑎²) − 𝑐𝑎/(1 + 𝑎²)
= 𝑏 + 𝛿𝑎/(1 + 𝑎²)

Knowing also that:

𝜏 = 𝐸[𝑏 + 𝑐𝐻 + 𝛿𝐻 + 𝑈𝑌 ] − 𝐸[𝑐𝐻 + 𝑈𝑌 ]
= 𝑏 + 𝛿𝐸[𝐻]
= 𝑏

We may conclude:

𝐸𝑇 𝑇 − 𝜏 = 𝛿𝑎/(1 + 𝑎²)

Study question 4.3.3.


(a) Prove that, if 𝑋 is binary, the effect of treatment on the treated can be estimated from
both observational and experimental data. Hint: Decompose 𝐸[𝑌𝑥 ] into
𝐸[𝑌𝑥 ] = 𝐸[𝑌𝑥 |𝑥′ ]𝑃 (𝑥′ ) + 𝐸[𝑌𝑥 |𝑥]𝑃 (𝑥)

(b) Apply the result of Question (a) to Simpson’s story with the nonexperimental data of
Table 1.1, and estimate the effect of treatment on those who used the drug by choice.
[Hint: Estimate 𝐸[𝑌𝑥 ] assuming that gender is the only confounder.]
(c) Repeat Question (b) using the fact that 𝑍 in Figure 3.3, satisfies the backdoor criterion.
Show that the answers to (b) and (c) coincide.

[Figure 3.3: 𝑍 → 𝑋, 𝑍 → 𝑌 , 𝑋 → 𝑌 , with exogenous 𝑈𝑍 → 𝑍, 𝑈𝑋 → 𝑋, 𝑈𝑌 → 𝑌 ]
Figure 3.3: A graphical model representing the effects of a new drug, with 𝑍 representing
gender, 𝑋 standing for drug usage, and 𝑌 standing for recovery

Solution to study question 4.3.3


Part (a)
We begin by noting that, since 𝑋 is binary, we can simply use the law of total probability and
write:

𝐸[𝑌𝑥 ] = 𝐸[𝑌𝑥 |𝑋 = 𝑥]𝑃 (𝑋 = 𝑥) + 𝐸[𝑌𝑥 |𝑋 = 𝑥′ ]𝑃 (𝑋 = 𝑥′ )

By the consistency axiom, we also know that:

𝐸[𝑌𝑥 |𝑋 = 𝑥] = 𝐸[𝑌 |𝑋 = 𝑥]
∴ 𝐸[𝑌𝑥 ] = 𝐸[𝑌 |𝑋 = 𝑥]𝑃 (𝑋 = 𝑥) + 𝐸[𝑌𝑥 |𝑋 = 𝑥′ ]𝑃 (𝑋 = 𝑥′ )

The consistency axiom intuitively follows from the notion that a counterfactual predicated on
an actual observation is not counterfactual (here we observed 𝑋 = 𝑥 and were hypothesizing
about 𝑌𝑥 ). The term 𝐸[𝑌 |𝑋 = 𝑥]𝑃 (𝑋 = 𝑥) is already estimable from observational data, so
it remains to address the other. Solving for 𝐸[𝑌𝑥 |𝑋 = 𝑥′ ] gives:

𝐸[𝑌𝑥 |𝑋 = 𝑥′ ] = (𝐸[𝑌𝑥 ] − 𝐸[𝑌 |𝑋 = 𝑥]𝑃 (𝑋 = 𝑥)) / 𝑃 (𝑋 = 𝑥′ )

Now, substituting back into our equation for the ETT:

𝐸𝑇 𝑇 = 𝐸[𝑌𝑥 − 𝑌𝑥′ |𝑋 = 𝑥]
= 𝐸[𝑌 |𝑋 = 𝑥] − 𝐸[𝑌𝑥′ |𝑋 = 𝑥]
= 𝐸[𝑌 |𝑋 = 𝑥] − (𝐸[𝑌 |𝑑𝑜(𝑋 = 𝑥′ )] − 𝐸[𝑌 |𝑋 = 𝑥′ ]𝑃 (𝑋 = 𝑥′ )) / 𝑃 (𝑋 = 𝑥)
We see that the effect of treatment on the treated can be estimated from non-experimental
(do-free expressions) and experimental data (expressions with the do-operator).

Part (b)
Because Gender is the only confounder, we can adjust for it and substitute the values of Table
1.1, specifically:

𝐸[𝑌𝑋=0 ] = 𝑃 (𝑌 = 1|𝑑𝑜(𝑋 = 0))
= Σ𝑧 𝑃 (𝑌 = 1|𝑋 = 0, 𝑍 = 𝑧)𝑃 (𝑍 = 𝑧)
= 𝑃 (𝑌 = 1|𝑋 = 0, 𝑍 = 1)𝑃 (𝑍 = 1) + 𝑃 (𝑌 = 1|𝑋 = 0, 𝑍 = 0)𝑃 (𝑍 = 0)
= 0.87 * 357/700 + 0.69 * 343/700
= 0.78

Now, we can use this estimation in concert with our findings from (a):

𝐸𝑇 𝑇 = 𝐸[𝑌 = 1|𝑋 = 1] − (𝐸[𝑌 = 1|𝑑𝑜(𝑋 = 0)] − 𝐸[𝑌 = 1|𝑋 = 0]𝑃 (𝑋 = 0)) / 𝑃 (𝑋 = 1)
= 0.78 − (0.78 − 0.83 * 0.5)/0.5
= 0.05
Part (c)
Because 𝑍 satisfies the backdoor criterion, we can directly write:
𝐸𝑇 𝑇 = 𝐸[𝑌1 − 𝑌0 |𝑋 = 1]
= 𝐸[𝑌𝑋=1 |𝑋 = 1] − 𝐸[𝑌𝑋=0 |𝑋 = 1]
= 𝐸[𝑌 |𝑋 = 1] − Σ𝑧 𝑃 (𝑌 = 1|𝑋 = 0, 𝑍 = 𝑧)𝑃 (𝑍 = 𝑧|𝑋 = 1)
= 𝐸[𝑌 |𝑋 = 1] − 𝑃 (𝑌 = 1|𝑋 = 0, 𝑍 = 1)𝑃 (𝑍 = 1|𝑋 = 1)
− 𝑃 (𝑌 = 1|𝑋 = 0, 𝑍 = 0)𝑃 (𝑍 = 0|𝑋 = 1)
= 0.78 − 0.87 * 87/350 − 0.69 * 263/350
= 0.05
We see that this method agrees with our findings from (b).
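Both routes reduce to a few lines of R arithmetic on the Table 1.1 quantities quoted above (the two answers differ only through rounding of the quoted probabilities):

# Observational quantities quoted from Table 1.1
p.y1.x1 <- 0.78;  p.y1.x0 <- 0.83;  p.x1 <- 0.5
p.y1.x0.z1 <- 0.87; p.y1.x0.z0 <- 0.69
p.z1 <- 357/700; p.z1.x1 <- 87/350; p.z0.x1 <- 263/350

# Route (b): adjustment formula for E[Y | do(X = 0)], then the ETT identity
ey.do.x0 <- p.y1.x0.z1*p.z1 + p.y1.x0.z0*(1 - p.z1)
ett.b <- p.y1.x1 - (ey.do.x0 - p.y1.x0*(1 - p.x1)) / p.x1

# Route (c): direct backdoor expression for E[Y0 | X = 1]
ett.c <- p.y1.x1 - (p.y1.x0.z1*p.z1.x1 + p.y1.x0.z0*p.z0.x1)

c(ett.b, ett.c)   # both ~ 0.05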

Study question 4.3.4. Joe has never smoked before but, as a result of peer pressure and
other personal factors, he decided to start smoking. He buys a pack of cigarettes, comes home
and asks himself: “I am about to start smoking, should I?”
(a) Formulate Joe’s question mathematically, in terms of ETT, assuming that the outcome of
interest is lung cancer.
(b) What type of data would enable Joe to estimate his chances of getting cancer given that
he goes ahead with the decision to smoke, versus refraining from smoking.
(c) Use the data in Table 3.1 to estimate the chances associated with the decision in (b).

Solution to study question 4.3.4

Part (a)

Let 𝑌 stand for lung cancer, with 𝑌 = 1 denoting that it is present. Let 𝑋 be Joe’s choice,
with 𝑋 = 1 indicating that he has decided to start smoking. So, we want to compute the
effect of treatment on the treated to determine if Joe should start smoking given that he was
about to; specifically: 𝐸𝑇 𝑇 = 𝐸[𝑌1 − 𝑌0 |𝑋 = 1]. If 𝐸𝑇 𝑇 > 0, it means that smoking
yields a higher chance of lung cancer for Joe (given that he was about to start smoking) than
refraining from smoking.

Part (b)

Referencing our findings from study question 4.3.3, if we can find a set of variables that
satisfies the backdoor or front-door criterion for the effect of 𝑋 on 𝑌 , then we only need non-
experimental data (see study question 4.3.3(c)); otherwise, we need both non-experimental and
experimental data (see study question 4.3.3(b)).

Part (c)

First, let us recall the graphical model associated with this problem from Figure 3.10 in the
book, wherein 𝑍 indicates the presence of Tar deposits in the lung, and 𝑈 , an unmeasured
Genotype that influences individuals to both smoke and get cancer:

[Graph: 𝑋 → 𝑍 → 𝑌 , with unmeasured 𝑈 such that 𝑋 ← 𝑈 → 𝑌 ]

Since 𝑍 satisfies the front-door criterion, by Theorem 3.4.1, and again referencing Table 3.1,
we have:

𝐸[𝑌 |𝑑𝑜(𝑋 = 0)] = Σ𝑧 𝑃 (𝑍 = 𝑧|𝑋 = 0) Σ𝑥′ 𝑃 (𝑌 = 1|𝑋 = 𝑥′ , 𝑍 = 𝑧)𝑃 (𝑋 = 𝑥′ )
= 𝑃 (𝑍 = 1|𝑋 = 0)[𝑃 (𝑌 = 1|𝑋 = 1, 𝑍 = 1)𝑃 (𝑋 = 1) + 𝑃 (𝑌 = 1|𝑋 = 0, 𝑍 = 1)𝑃 (𝑋 = 0)]
+ 𝑃 (𝑍 = 0|𝑋 = 0)[𝑃 (𝑌 = 1|𝑋 = 1, 𝑍 = 0)𝑃 (𝑋 = 1) + 𝑃 (𝑌 = 1|𝑋 = 0, 𝑍 = 0)𝑃 (𝑋 = 0)]
= 20/400 * [0.15 * 0.5 + 0.95 * 0.5] + 380/400 * [0.1 * 0.5 + 0.9 * 0.5]
= 0.5025

So, from our findings in 4.3.3(a), we know:

𝐸𝑇 𝑇 = 𝐸[𝑌1 − 𝑌0 |𝑋 = 1]
= 𝐸[𝑌 |𝑋 = 1] − 𝐸[𝑌0 |𝑋 = 1]
= 𝐸[𝑌 |𝑋 = 1] − (𝐸[𝑌 |𝑑𝑜(𝑋 = 0)] − 𝐸[𝑌 |𝑋 = 0]𝑃 (𝑋 = 0)) / 𝑃 (𝑋 = 1)
= 0.15 − (0.5025 − 0.9025 * 0.5)/0.5
= 0.0475 > 0

Since our ETT is greater than 0, the chance of cancer had Joe gone ahead and smoked (given
that he was about to start) is greater than the chance of cancer had he refrained. Therefore,
Joe should refrain from smoking.

In this solution, we relied on two assumptions: (1) 𝑋 is binary and (2) the model satis-
fies the front-door criterion. A more advanced analysis shows that assumption (2) is sufficient
for estimating ETT; 𝑋 need not be binary once we have a front-door structure (Shpitser and
Pearl, 2009).

Study question 4.5.1.


Consider the dilemma faced by Ms. Jones, as described in Example 4.4.3. Assume that,
in addition to the experimental results of Fisher et al. (2002), she also gains access to an
observational study, according to which the probability of recurrent tumor in all patients
(regardless of therapy) is 30%, whereas among the recurrent cases, 70% did not choose
therapy. Use the bounds provided in Eq. (4.30) to update her estimate that her decision was
necessary for remission.

Solution to study question 4.5.1

First, according to the problem description, let 𝑋 represent treatment (with 𝑥′ representing
lumpectomy alone and 𝑥 representing Ms. Jones’ decision: lumpectomy plus radiation) and
𝑌 represent recovery (with 𝑦 ′ representing recurrence of cancer, and 𝑦 representing the out-
come for Ms. Jones: no recurrence). We also are given that:

𝑃 (𝑦 ′ ) = 0.3
𝑃 (𝑥′ |𝑦 ′ ) = 0.7
𝑃 (𝑦|𝑑𝑜(𝑥)) = 0.39
𝑃 (𝑦|𝑑𝑜(𝑥′ )) = 0.14

Our goal is to determine if Ms. Jones’ decision was necessary for remission. So, we’ll see if
PN is more probable than not using the lower bound afforded by Eq. (4.30).

𝑃 𝑁 ≥ [𝑃 (𝑦) − 𝑃 (𝑦|𝑑𝑜(𝑥′ ))] / 𝑃 (𝑥, 𝑦)
= [𝑃 (𝑦) − 𝑃 (𝑦|𝑑𝑜(𝑥′ ))] / [𝑃 (𝑦|𝑥)𝑃 (𝑥)]
= [𝑃 (𝑦) − 𝑃 (𝑦|𝑑𝑜(𝑥′ ))] / [(1 − 𝑃 (𝑥|𝑦 ′ )𝑃 (𝑦 ′ )/𝑃 (𝑥))𝑃 (𝑥)]
= [𝑃 (𝑦) − 𝑃 (𝑦|𝑑𝑜(𝑥′ ))] / [𝑃 (𝑥) − 𝑃 (𝑥|𝑦 ′ )𝑃 (𝑦 ′ )]

At this point we consider whether we have all of the components for our computation, and
see that we can find all parameters except for 𝑃 (𝑥). However, because we are computing a
lower bound for the PN, we can consider the parameterization that would yield its smallest
value, namely, when the denominator is as large as possible with 𝑃 (𝑥) = 1. So, with this
assumption, we write:

𝑃 𝑁 ≥ [𝑃 (𝑦) − 𝑃 (𝑦|𝑑𝑜(𝑥′ ))] / [𝑃 (𝑥) − 𝑃 (𝑥|𝑦 ′ )𝑃 (𝑦 ′ )]
= (0.7 − 0.14) / (1 − (1 − 0.7) * 0.3)
= 0.62 > 0.5

Since PN is greater than 0.5, we conclude that Ms. Jones’ decision was likely necessary for
remission.
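The bound itself is a one-line computation; in R, with 𝑃 (𝑥) set to its worst-case value of 1:

p.y <- 0.7; p.y.do.x0 <- 0.14   # P(y) and P(y | do(x'))
p.x.given.yp <- 1 - 0.7         # P(x | y') = 1 - P(x' | y')
p.yp <- 0.3                     # P(y')
p.x  <- 1                       # worst case for the lower bound

(p.y - p.y.do.x0) / (p.x - p.x.given.yp * p.yp)   # 0.56 / 0.91 = 0.615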

Study question 4.5.2.


Consider the structural model:

𝑦 = 𝛽 1 𝑚 + 𝛽2 𝑡 + 𝑢 𝑦 (4.53)
𝑚 = 𝛾1 𝑡 + 𝑢𝑚 (4.54)

(a) Use the basic definition of the natural effects (Eqs. (4.46)–(4.47)) to determine TE, NDE
and NIE.
(b) Repeat (a) assuming that 𝑢𝑦 is correlated with 𝑢𝑚 .

Solution to study question 4.5.2


Part (a)
To compute the NDE, we measure the expected increase in 𝑌 as the treatment 𝑇 changes
from 𝑇 = 0 to 𝑇 = 1 while the mediator 𝑀 is set to the value it would have attained prior to

the change (i.e., under 𝑇 = 0), so by Eq. (4.46):

𝑁 𝐷𝐸 = 𝐸[𝑌1,𝑀0 − 𝑌0,𝑀0 ]
= 𝐸[𝑌1,𝑀0 ] − 𝐸[𝑌0,𝑀0 ]
= (𝛽1 [𝛾1 * 0 + 𝑢𝑚 ] + 𝛽2 * 1 + 𝑢𝑦 ) − (𝛽1 [𝛾1 * 0 + 𝑢𝑚 ] + 𝛽2 * 0 + 𝑢𝑦 )
= 𝛽2 − 0
= 𝛽2

Similarly, the NIE is defined as the expected increase in 𝑌 when the treatment is held constant
at 𝑇 = 0 and 𝑀 changes to whatever value it would have attained under 𝑇 = 1. By Eq.
(4.47), we have:

𝑁 𝐼𝐸 = 𝐸[𝑌0,𝑀1 − 𝑌0,𝑀0 ]
= 𝐸[𝑌0,𝑀1 ] − 𝐸[𝑌0,𝑀0 ]
= (𝛽1 [𝛾1 * 1 + 𝑢𝑚 ] + 𝛽2 * 0 + 𝑢𝑦 ) − (𝛽1 [𝛾1 * 0 + 𝑢𝑚 ] + 𝛽2 * 0 + 𝑢𝑦 )
= 𝛾1 𝛽 1 − 0
= 𝛾1 𝛽 1

Finally, the total effect can be computed using Eq. (4.48):

𝑇 𝐸 = 𝐸[𝑌1 − 𝑌0 ]
= 𝑁 𝐷𝐸 + 𝑁 𝐼𝐸
= 𝛽2 + 𝛾 1 𝛽1

Note that, in this question, we did not have to assume that the treatment is randomized or,
equivalently, that 𝑢𝑡 is not correlated with 𝑢𝑦 or 𝑢𝑚 . This is because we have the functional
form of the equations (linear), and we take the structural parameters as given.

Part (b)
The computations will remain the same since none of the above require that 𝑢𝑦 is uncorre-
lated with 𝑢𝑚 . This is because we are dealing with a linear system with given parameters.
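This claim is easy to verify by brute force. In the R sketch below (coefficients arbitrary, and 𝑢𝑦 deliberately correlated with 𝑢𝑚 ), the simulated natural effects still come out at 𝛽2 and 𝛾1 𝛽1 :

set.seed(1)
n  <- 1e6
b1 <- 0.5; b2 <- 0.3; g1 <- 0.8

# Correlated errors: cor(um, uy) != 0
um <- rnorm(n)
uy <- 0.6*um + rnorm(n)

m0 <- g1*0 + um; m1 <- g1*1 + um        # M under t = 0 and t = 1
y  <- function(t, m) b1*m + b2*t + uy   # structural equation for Y

NDE <- mean(y(1, m0) - y(0, m0))        # = b2
NIE <- mean(y(0, m1) - y(0, m0))        # = g1 * b1
TE  <- mean(y(1, m1) - y(0, m0))        # = b2 + g1*b1, per Eq. (4.48)
round(c(NDE = NDE, NIE = NIE, TE = TE), 3)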

Study question 4.5.3.


Consider the structural model:

𝑦 = 𝛽1 𝑚 + 𝛽2 𝑡 + 𝛽3 𝑡𝑚 + 𝛽4 𝑤 + 𝑢𝑦 (4.55)
𝑚 = 𝛾1 𝑡 + 𝛾2 𝑤 + 𝑢𝑚 (4.56)
𝑤 = 𝛼𝑡 + 𝑢𝑤 (4.57)

with 𝛽3 𝑡𝑚 representing an interaction term.



(a) Use the basic definition of the natural effects (Eqs. (4.46) and (4.47)) (treating 𝑀 as
the mediator), to determine the portion of the effect for which mediation is necessary
(𝑇 𝐸 − 𝑁 𝐷𝐸) and the portion for which mediation is sufficient (𝑁 𝐼𝐸). Hint: Show
that:

𝑁 𝐷𝐸 = 𝛽2 + 𝛼𝛽4 (4.58)
𝑁 𝐼𝐸 = 𝛽1 (𝛾1 + 𝛼𝛾2 ) (4.59)
𝑇 𝐸 = 𝛽2 + (𝛾1 + 𝛼𝛾2 )(𝛽3 + 𝛽1 ) + 𝛼𝛽4 (4.60)
𝑇 𝐸 − 𝑁 𝐷𝐸 = (𝛽1 + 𝛽3 )(𝛾1 + 𝛼𝛾2 ) (4.61)

(b) Repeat, using 𝑊 as the mediator.

Solution to study question 4.5.3


Part (a)
We can use the same definitions and strategy as in Study Question 4.5.2. Let us first compute
some values of our covariates for ease of reference within our solution:

𝐸[𝑊0 ] = 𝛼 * 0 = 0
𝐸[𝑊1 ] = 𝛼 * 1 = 𝛼
𝐸[𝑀0 ] = 𝛾1 * 0 + 𝛾2 * 0 = 0
𝐸[𝑀1 ] = 𝛾1 * 1 + 𝛾2 𝛼 = 𝛾1 + 𝛾2 𝛼
Now we can compute our target quantities:
𝑁 𝐷𝐸 = 𝐸[𝑌1,𝑀0 − 𝑌0,𝑀0 ]
= 𝐸[𝑌1,𝑀0 ] − 𝐸[𝑌0,𝑀0 ]
= (𝛽1 * 0 + 𝛽2 * 1 + 𝛽3 * 1 * 0 + 𝛽4 𝛼) − (𝛽1 * 0 + 𝛽2 * 0 + 𝛽3 * 0 * 0 + 𝛽4 * 0)
= 𝛽2 + 𝛼𝛽4
𝑁 𝐼𝐸 = 𝐸[𝑌0,𝑀1 − 𝑌0,𝑀0 ]
= 𝐸[𝑌0,𝑀1 ] − 𝐸[𝑌0,𝑀0 ]
= (𝛽1 [𝛾1 + 𝛾2 𝛼] + 𝛽2 * 0 + 𝛽3 * 0 + 𝛽4 * 0) − (𝛽1 * 0 + 𝛽2 * 0 + 𝛽3 * 0 * 0 + 𝛽4 * 0)
= 𝛽1 (𝛾1 + 𝛼𝛾2 )
𝑇 𝐸 = 𝐸[𝑌1 − 𝑌0 ]
= 𝐸[𝑌1 ] − 𝐸[𝑌0 ]
= (𝛽1 [𝛾1 + 𝛾2 𝛼] + 𝛽2 * 1 + 𝛽3 [𝛾1 + 𝛾2 𝛼] + 𝛽4 𝛼) − (𝛽1 * 0 + 𝛽2 * 0 + 𝛽3 * 0 * 0 + 𝛽4 * 0)
= 𝛽2 + (𝛾1 + 𝛼𝛾2 )(𝛽1 + 𝛽3 ) + 𝛼𝛽4

So, combining our computations from above, we find that the portion of the effect for which
mediation is necessary is:

𝑇 𝐸 − 𝑁 𝐷𝐸 = 𝛽2 + (𝛾1 + 𝛼𝛾2 )(𝛽1 + 𝛽3 ) + 𝛼𝛽4 − (𝛽2 + 𝛼𝛽4 )


= (𝛾1 + 𝛼𝛾2 )(𝛽1 + 𝛽3 )
Part (b)
We’ll use the same strategy that we did for part (a), first computing the values of 𝑊, 𝑀 under
the changing treatment 𝑇 .

𝐸[𝑊0 ] = 𝛼 * 0 = 0
𝐸[𝑊1 ] = 𝛼 * 1 = 𝛼
𝐸[𝑀0,𝑊0 ] = 𝛾1 * 0 + 𝛾2 * 0 = 0
𝐸[𝑀0,𝑊1 ] = 𝛾1 * 0 + 𝛾2 * 𝛼 = 𝛾2 * 𝛼
𝐸[𝑀1,𝑊0 ] = 𝛾1 * 1 + 𝛾2 * 0 = 𝛾1
𝐸[𝑀1,𝑊1 ] = 𝛾1 * 1 + 𝛾2 * 𝛼
Once more, using the above to compute our target quantities, we have:

𝑁 𝐷𝐸 = 𝐸[𝑌1,𝑊0 − 𝑌0,𝑊0 ]
= 𝐸[𝑌1,𝑊0 ] − 𝐸[𝑌0,𝑊0 ]
= (𝛽1 [𝛾1 ] + 𝛽2 * 1 + 𝛽3 * 1[𝛾1 ] + 𝛽4 * 0) − (𝛽1 * 0 + 𝛽2 * 0 + 𝛽3 * 0 + 𝛽4 * 0)
= 𝛽1 𝛾 1 + 𝛽2 + 𝛽3 𝛾 1
𝑁 𝐼𝐸 = 𝐸[𝑌0,𝑊1 − 𝑌0,𝑊0 ]
= 𝐸[𝑌0,𝑊1 ] − 𝐸[𝑌0,𝑊0 ]
= (𝛽1 [𝛾2 𝛼] + 𝛽2 * 0 + 𝛽3 * 0 + 𝛽4 * 𝛼) − (𝛽1 * 0 + 𝛽2 * 0 + 𝛽3 * 0 + 𝛽4 * 0)
= 𝛼𝛽1 𝛾2 + 𝛼𝛽4
𝑇 𝐸 = 𝐸[𝑌1 − 𝑌0 ]
= 𝐸[𝑌1 ] − 𝐸[𝑌0 ]
= (𝛽1 [𝛾1 + 𝛾2 𝛼] + 𝛽2 * 1 + 𝛽3 [𝛾1 + 𝛾2 𝛼] + 𝛽4 𝛼) − (𝛽1 * 0 + 𝛽2 * 0 + 𝛽3 * 0 * 0 + 𝛽4 * 0)
= 𝛽2 + (𝛾1 + 𝛼𝛾2 )(𝛽1 + 𝛽3 ) + 𝛼𝛽4
So, combining our computations from above, we know that the portion of the effect for which
mediation is necessary is:

𝑇 𝐸 − 𝑁 𝐷𝐸 = 𝛽2 + (𝛾1 + 𝛼𝛾2 )(𝛽1 + 𝛽3 ) + 𝛼𝛽4 − [𝛽1 𝛾1 + 𝛽2 + 𝛽3 𝛾1 ]


= 𝛽2 + 𝛽1 𝛾1 + 𝛽1 𝛼𝛾2 + 𝛽3 𝛾1 + 𝛽3 𝛼𝛾2 + 𝛼𝛽4 − 𝛽1 𝛾1 − 𝛽2 − 𝛽3 𝛾1
= 𝛼𝛾2 (𝛽1 + 𝛽3 ) + 𝛼𝛽4

Study question 4.5.4.


Apply the mediation formulas provided in this section to the discrimination case discussed in
Section 4.4.4, and determine the extent to which ABC International practiced discrimination
in their hiring criteria. Use the data in Tables 4.6 and 4.7, with 𝑇 = 1 standing for male
applicants, 𝑀 = 1 standing for highly qualified applicants, and 𝑌 = 1 standing for hiring.
(Find the proportion of the hiring disparity that is due to gender, and the proportion that
could be explained by disparity in qualification alone.)

Solution to study question 4.5.4

[Hint: this is precisely the numerical problem on homework mediation as presented in the
text; see Tables 4.5, 4.7 and computations that follow]. Our goal is to find the proportion
of hiring disparity that is due to gender, and the proportion that could be explained by dis-
parity in qualification alone. Using the same strategies as the homework-training program
example: the quantity 𝑁 𝐼𝐸/𝑇 𝐸 tells us what proportion of the disparity is due to qualifica-
tion alone and the quantity 1 − 𝑁 𝐷𝐸/𝑇 𝐸 tells us what proportion of the disparity is due to
gender. Assuming that there exists no unobserved confounding, we’ll compute our quantities
of interest using Eqs. (4.51), (4.52), (4.44):
𝑁 𝐷𝐸 = Σ𝑚 [𝐸[𝑌 |𝑇 = 1, 𝑀 = 𝑚] − 𝐸[𝑌 |𝑇 = 0, 𝑀 = 𝑚]]𝑃 (𝑀 = 𝑚|𝑇 = 0)
= [𝐸[𝑌 |𝑇 = 1, 𝑀 = 0] − 𝐸[𝑌 |𝑇 = 0, 𝑀 = 0]]𝑃 (𝑀 = 0|𝑇 = 0)
+ [𝐸[𝑌 |𝑇 = 1, 𝑀 = 1] − 𝐸[𝑌 |𝑇 = 0, 𝑀 = 1]]𝑃 (𝑀 = 1|𝑇 = 0)
= (0.4 − 0.2)(1 − 0.4) + (0.8 − 0.3)0.4
= 0.32
𝑁 𝐼𝐸 = Σ𝑚 𝐸[𝑌 |𝑇 = 0, 𝑀 = 𝑚][𝑃 (𝑀 = 𝑚|𝑇 = 1) − 𝑃 (𝑀 = 𝑚|𝑇 = 0)]
= 𝐸[𝑌 |𝑇 = 0, 𝑀 = 0][𝑃 (𝑀 = 0|𝑇 = 1) − 𝑃 (𝑀 = 0|𝑇 = 0)]
+ 𝐸[𝑌 |𝑇 = 0, 𝑀 = 1][𝑃 (𝑀 = 1|𝑇 = 1) − 𝑃 (𝑀 = 1|𝑇 = 0)]
= 0.2(0.25 − 0.6) + 0.3(0.75 − 0.4)
= (0.75 − 0.4)(0.3 − 0.2)
= 0.035

𝑇 𝐸 = 𝐸[𝑌1 − 𝑌0 ]
= 𝐸[𝑌1 ] − 𝐸[𝑌0 ]
= 𝐸[𝑌 |𝑑𝑜(𝑇 = 1)] − 𝐸[𝑌 |𝑑𝑜(𝑇 = 0)]
= Σ𝑚 𝐸[𝑌 |𝑑𝑜(𝑇 = 1), 𝑀 = 𝑚]𝑃 (𝑀 = 𝑚|𝑑𝑜(𝑇 = 1)) − Σ𝑚 𝐸[𝑌 |𝑑𝑜(𝑇 = 0), 𝑀 = 𝑚]𝑃 (𝑀 = 𝑚|𝑑𝑜(𝑇 = 0))
= Σ𝑚 𝐸[𝑌 |𝑇 = 1, 𝑀 = 𝑚]𝑃 (𝑀 = 𝑚|𝑇 = 1) − Σ𝑚 𝐸[𝑌 |𝑇 = 0, 𝑀 = 𝑚]𝑃 (𝑀 = 𝑚|𝑇 = 0)
= [𝐸[𝑌 |𝑇 = 1, 𝑀 = 0]𝑃 (𝑀 = 0|𝑇 = 1) + 𝐸[𝑌 |𝑇 = 1, 𝑀 = 1]𝑃 (𝑀 = 1|𝑇 = 1)]
− [𝐸[𝑌 |𝑇 = 0, 𝑀 = 0]𝑃 (𝑀 = 0|𝑇 = 0) + 𝐸[𝑌 |𝑇 = 0, 𝑀 = 1]𝑃 (𝑀 = 1|𝑇 = 0)]
= [0.4 * (1 − 0.75) + 0.8 * 0.75] − [0.2 * (1 − 0.4) + 0.3 * 0.4]
= 0.46

Now, we have all the components needed to make claims about the hiring disparity; in
particular:

𝑁 𝐼𝐸/𝑇 𝐸 = 0.035/0.46 = 0.07


1 − 𝑁 𝐷𝐸/𝑇 𝐸 = 1 − 0.32/0.46 = 0.304

So, from the above, we conclude that 30.4% of the hiring disparity is due to gender, and 7%
of it could be explained by disparity in qualification alone.
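For completeness, the mediation formulas used above reduce to the following R arithmetic (conditional expectations and probabilities as quoted from Tables 4.6 and 4.7):

# E[Y | T = t, M = m] (rows t = 0,1; columns m = 0,1) and P(M = 1 | T = t)
ey <- matrix(c(0.2, 0.3,
               0.4, 0.8),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("0", "1"), c("0", "1")))
pm1 <- c(`0` = 0.4, `1` = 0.75)

NDE <- (ey["1","0"] - ey["0","0"])*(1 - pm1["0"]) +
       (ey["1","1"] - ey["0","1"])*pm1["0"]
NIE <- (ey["0","1"] - ey["0","0"])*(pm1["1"] - pm1["0"])
TE  <- (ey["1","0"]*(1 - pm1["1"]) + ey["1","1"]*pm1["1"]) -
       (ey["0","0"]*(1 - pm1["0"]) + ey["0","1"]*pm1["0"])

c(NIE/TE, 1 - NDE/TE)   # ~ 0.076 and ~ 0.304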
