Causal Inference in Statistics: A Primer
Solution Manual
Text Authors:
Judea Pearl, Madelyn Glymour, and Nicholas Jewell
Solution Authors:
Judea Pearl, Ang Li, Andrew Forney, and Johannes Textor
About This Manual
This document provides solutions, explanations, and intuition for the study questions posed in Causal Inference in Statistics: A Primer. Students are encouraged to attempt each study question by hand before consulting the answers herein.
Online Access
As the authors make updates to the text and solution manual, changes and errata will be
posted at the following links:
Textbook Information & Update site: http://bayes.cs.ucla.edu/PRIMER/
Solution Manual Information & Update site: http://bayes.cs.ucla.edu/PRIMER/Manual
The three claims are obviously wrong, and in subsequent sections of this book we will acquire
the tools to formally prove them wrong. At this stage, however, we will merely explain the
observed correlations using alternative models which do not support the claims cited.
For each problem, we will explain the observed correlation using new variables in the answers
below.
Part (a)
Consider an alternative explanation with a third variable, charm, which has a causal influence on both income and marriage (charming individuals have a higher propensity to marry and to be promoted in their jobs), but where marriage has no causal influence on income. This explanation supports the observed data (that marriage and income are highly correlated) but does not allow us to conclude that marrying will increase one's income.
Part (b)
Consider that the number of fire fighters employed in a district is a direct response to the frequency of fires in that area. In a natural scenario, a higher frequency of fires causes additional fire fighters to be hired, and hiring fire fighters decreases the number of fires that would break out had they not been hired. Hence, hiring fewer fire fighters will actually increase the frequency of fires.
Part (c)
Let us consider the reason that an individual might hurry to an appointment: they believe that a slow pace will not allow them to arrive at the appointment on time because they woke up late. So, waking late is a common cause of both hurrying and arriving late to the meeting. This will cause a high correlation between hurrying and arriving late, even though, for a fixed waking time, hurrying would actually decrease one's likelihood of arriving late.
Observe that this problem requests that we create a Simpson's reversal in our data. [Hint: We can use Table 1.2 from the text to scaffold our answer.] Consider the following, somewhat unrealistic (but without loss of generality) batting averages for Frank and Tim against right- and left-handed pitchers:

              Frank                                Tim
Right-handed  81 hits out of 87 at-bats (.931)     234 hits out of 270 at-bats (.867)
Left-handed   192 hits out of 263 at-bats (.730)   55 hits out of 80 at-bats (.688)
Combined      273 hits out of 350 at-bats (.780)   289 hits out of 350 at-bats (.826)
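The reversal can be verified mechanically. A minimal Python sketch (ours, not part of the original solution) recomputing the stratum-specific and pooled averages:

```python
# Frank beats Tim against both pitcher types, yet Tim wins on the pooled data.
frank = {"right": (81, 87), "left": (192, 263)}   # (hits, at-bats)
tim = {"right": (234, 270), "left": (55, 80)}

def avg(hits, at_bats):
    return hits / at_bats

def pool(d):
    return (sum(h for h, _ in d.values()), sum(a for _, a in d.values()))

for hand in ("right", "left"):
    print(hand, avg(*frank[hand]), avg(*tim[hand]))   # Frank higher in both
print("combined", avg(*pool(frank)), avg(*pool(tim)))  # Tim higher combined
```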
second doctor performs the difficult surgery more often than the easy surgery. You need surgery, but you do not know whether your case is easy or difficult. Should you consult the success rate of each doctor over all cases, or should you consult their success rates for the easy and difficult cases separately in choosing which surgeon should perform your operation?
To answer each of the questions in this section, we consider the causal structure behind the described scenario to determine which interpretation of the data is valid.
Part (a)
Here, the size of the stone is a common cause of the treatment choice and the recovery outcome. In other words, the size of the stone affects both the likelihood of receiving one treatment over the other and the chance of recovery, since larger stones are more severe. Moreover, treatment does not change the stone size. As such, the structure of this scenario is identical to that of Example 1.2.1, in which treatment does not affect sex. Similarly, whether or not the patient knows the size of their stone, we should consult the segregated data conditioned on stone size to make a correct decision.
Part (b)
The same logic as above applies. Paralleling the structure of Example 1.2.1, difficulty of surgery is a common cause of both doctor choice and recovery rates. In other words, the difficulty of a surgery affects both the propensity for choosing one doctor over another and the chance of success, since more difficult cases inherently have a lower chance of success. As such, whether or not the patient knows the difficulty of their surgery, we should consult the segregated data conditioned on difficulty to make a causally correct decision.
(b) Does your answer contradict our gender example, where sex-specific data was deemed
more appropriate?
(c) Draw a graph (informally) that more or less captures the story. (Look ahead to Section
1.3 if you wish.)
(d) How would you explain the emergence of Simpson’s reversal in this story?
(e) Would your answer change if the lollipops were handed out (by the same criterion) a day
after the study?
[Hint: Use the fact that receiving a lollipop indicates a greater likelihood of being assigned
to drug treatment, as well as depression, which is a symptom of risk factors that lower the
likelihood of recovery.]
The arguments behind this problem are somewhat intricate, but they can be made intuitive if we take an extreme case and assume that, among those in the treatment ward, patients received a lollipop regardless of their health, while among those in the placebo ward only extremely sick patients were given a lollipop. Under these circumstances, the group of lollipop-receiving patients would show a strong correlation between treatment and recovery even if the treatment is totally ineffective; treated individuals consist of typical patients while untreated individuals consist of only extremely sick people. Thus, the treatment will appear to improve chances of recovery even if it has no physical effect on recovery. The same applies to the group of lollipop-denied patients; a spurious correlation will appear between treatment and recovery, merely because the untreated patients were chosen among the very sick.
Such spurious correlations will not occur in the aggregated population because if we disregard the lollipop we find a perfectly randomized experiment; those chosen for treatment are chosen at random from the population, in total disregard of their health status. Another way to understand the difference in populations is to note that, when we compare treated and untreated patients in the lollipop-receiving group, we are comparing apples and oranges; these two groups are not exchangeable in terms of their health status.
We conclude that, in this example, the aggregated data reveal the correct answer to our question, while the disaggregated data are biased. In Chapter 3 of this book we will see that, in stories of this nature, disaggregation is to be avoided regardless of the specific lollipop-handling strategy used by the nurse. We will further learn to identify such situations directly from the graph, without invoking arguments concerning exchangeability or apples and oranges.
Part (a)
Per the above, we know that the disaggregated data are biased, so we instead consult the aggregated data and conclude that the drug is beneficial to the population.
Part (b)
Our decision here does not contradict the gender example from Table 1.1 where we deemed
it appropriate to consult the segregated data. In the gender example, gender was not merely
correlated with treatment and recovery, it was actually a cause of both. Not so in the present
story; lollipop receipt correlates with, but is not a cause of, either treatment or recovery. The
two different stories warrant different treatments.
Part (c)
Let 𝑋 indicate treatment receipt, 𝑍 indicate lollipop receipt, 𝑌 indicate recovery, and 𝑈1 , 𝑈2
indicate two unobserved factors that correlate 𝑍 with 𝑋 and 𝑌 , respectively. The causal
graph illustrating our story can be modeled as:
(Graph: U1 → X, U1 → Z, U2 → Z, U2 → Y, and X → Y)
Part (d)
Suppose we (incorrectly) decided to use segregated data, conditioning on lollipop receipt. Simpson's reversal could display a benefit of the drug to the population (aggregate data) and harm to both lollipop-specific groups (segregated data) by a "trick" of the segregated group sizes, as we've seen many times in this chapter (Table 1.2, for example). Consider that the "got lollipop" group consists of a subset of the "treated" group, and that if we got unlucky and gave lollipops to all of the treated individuals who were going to recover, it would give the impression of a (negative) association between treatment and recovery even when there is no causal effect of treatment on recovery. The same argument applies to the "didn't get lollipop" group.
Part (e)
Our answer will not change, since lollipop receipt is still only spuriously connected to treatment, even if the lollipops were distributed after the study. In these analyses, we always consult the "causal story" behind the data.
Variables: Let X indicate Treatment/Drug receipt, Z indicate Lollipop receipt, and Y indicate Recovery Status.
Events: "X = 1 and Z = 1 and Y = 1" indicates the event where an individual takes the drug, receives a lollipop, and recovers (the same applies for other values of each variable).
Using Table 1.5, for each of the specified quantities of interest, we simply sum over the cases with the matching attributes and divide by the appropriate population.
Part (a)
By marginalization, we can write:
P(High School) = P(High School, Male) + P(High School, Female)
= (231 + 189)/(112 + 231 + 595 + 242 + 136 + 189 + 763 + 172)
= 420/2440 = 0.1721
Part (b)
Summing over all cases falling in either the High School or Female categories, we have:
P(Female or High School) = P(Female) + P(Male, High School)
= (189 + 136 + 763 + 172 + 231)/2440
= 1491/2440 = 0.6111
Part (c)
By Bayes' conditioning, we can write:
P(High School|Female) = P(High School, Female)/P(Female)
= 189/(136 + 189 + 763 + 172)
= 189/1260 = 0.15
Part (d)
Again by Bayes' conditioning, we can write:
P(Female|High School) = P(Female, High School)/P(High School)
= 189/(231 + 189)
= 189/420 = 0.45
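These four quantities can be recomputed directly from the cell counts. A short Python sketch (ours; it assumes the count layout used in the denominators above, where the second entry of each row is the High School column):

```python
male = [112, 231, 595, 242]     # male[1] = High School count
female = [136, 189, 763, 172]   # female[1] = High School count
total = sum(male) + sum(female)                  # 2440

print((male[1] + female[1]) / total)             # (a) P(High School) = 0.1721
print((sum(female) + male[1]) / total)           # (b) P(Female or HS) = 0.6111
print(female[1] / sum(female))                   # (c) P(HS | Female) = 0.15
print(female[1] / (male[1] + female[1]))         # (d) P(Female | HS) = 0.45
```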
Study question 1.3.3.
Consider the casino problem described in Section 1.3.7
(a) Compute 𝑃 (“craps”|“11”) assuming that there are twice as many roulette tables as craps
games at the casino.
(b) Compute 𝑃 (“roulette”|“10”) assuming that there are twice as many craps games as
roulette tables at the casino.
Part (a)
Assuming that there are twice as many roulette tables as craps games at the casino, we have:
P("roulette") = 2/3
P("craps") = 1/3
So, by the law of total probability, we can write our target quantity 𝑃 (“11”) in terms of what
we know:
𝑃 (“11”) = 𝑃 (“11”|“𝑐𝑟𝑎𝑝𝑠”)𝑃 (“𝑐𝑟𝑎𝑝𝑠”) + 𝑃 (“11”|“𝑟𝑜𝑢𝑙𝑒𝑡𝑡𝑒”)𝑃 (“𝑟𝑜𝑢𝑙𝑒𝑡𝑡𝑒”)
= 1/18 * 1/3 + 1/38 * 2/3
= 37/1026
= 0.036
P("craps"|"11") = P("craps", "11")/P("11")
= (1/18 * 1/3)/(37/1026)
= 19/37 = 0.514
Part (b)
Assuming that there are twice as many craps games as roulette tables at the casino, we have:
𝑃 (“𝑟𝑜𝑢𝑙𝑒𝑡𝑡𝑒”) = 1/3
𝑃 (“𝑐𝑟𝑎𝑝𝑠”) = 2/3
We can use the same tactic as in (a) (the law of total probability) to write our target quantity
in terms of what we know:
𝑃 (“10”) = 𝑃 (“10”|“𝑐𝑟𝑎𝑝𝑠”)𝑃 (“𝑐𝑟𝑎𝑝𝑠”) + 𝑃 (“10”|“𝑟𝑜𝑢𝑙𝑒𝑡𝑡𝑒”)𝑃 (“𝑟𝑜𝑢𝑙𝑒𝑡𝑡𝑒”)
= 1/12 * 2/3 + 1/38 * 1/3
= 11/171
= 0.064
P("roulette"|"10") = P("roulette", "10")/P("10")
= (1/38 * 1/3)/(11/171)
= 3/22 = 0.136
Find the probability that the face-down side of the selected card is black, using your
estimates above.
(c) Use Bayes’ theorem to find the correct probability of a randomly selected card’s back
being black if you observe that its front is black.
Part (a)
The face-up side is black, so it is either card 1 or card 3. Given that cards have equal probabil-
ities of being selected, the probability that the face-down side of the card is also black is 1/2.
However, cards do not have equal probabilities conditioned on the evidence; if the face-up
side is black, the card is more likely to be card 1, so the probability that the face-down side
of the card is also black is greater than 1/2.
Part (b)
Since we don’t know which card is face-up, we’ll use the law of total probability indexing on
the card number to compute our quantity of interest.
Part (c)
We’ll adopt the variable labeling used in the text, where each may have values indicating
doors 𝐴, 𝐵, 𝐶: Let 𝑋 indicate the door chosen by the player, 𝑌 indicate the door hiding the
car, and 𝑍 indicate the door opened by the host. We want to prove that:
P(Y = B|X = A, Z = C) > P(Y = A|X = A, Z = C)
So, we'll compute the components of this expression necessary to illustrate this inequality, and then combine them.
𝑃 (𝑍 = 𝐶|𝑋 = 𝐴) = 𝑃 (𝑍 = 𝐶|𝑋 = 𝐴, 𝑌 = 𝐴)𝑃 (𝑌 = 𝐴)
+ 𝑃 (𝑍 = 𝐶|𝑋 = 𝐴, 𝑌 = 𝐵)𝑃 (𝑌 = 𝐵)
+ 𝑃 (𝑍 = 𝐶|𝑋 = 𝐴, 𝑌 = 𝐶)𝑃 (𝑌 = 𝐶)
= 1/2 * 1/3 + 1 * 1/3 + 0 * 1/3
= 1/2
P(Y = A|X = A, Z = C) = P(Z = C|X = A, Y = A)P(Y = A|X = A)/P(Z = C|X = A)
= (1/2 * 1/3)/(1/2)
= 1/3
𝑃 (𝑌 = 𝐵|𝑋 = 𝐴, 𝑍 = 𝐶) = 1 − 𝑃 (𝑌 = 𝐴|𝑋 = 𝐴, 𝑍 = 𝐶) − 𝑃 (𝑌 = 𝐶|𝑋 = 𝐴, 𝑍 = 𝐶)
= 1 − 1/3 − 0
= 2/3
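Since P(Y = B|X = A, Z = C) = 2/3 > 1/3 = P(Y = A|X = A, Z = C), switching doors is the better strategy. A quick Monte Carlo sketch (ours) confirming the two conditional probabilities:

```python
import random

random.seed(0)
doors = ["A", "B", "C"]
counts, trials = {"A": 0, "B": 0}, 0
for _ in range(100_000):
    car = random.choice(doors)                      # Y: uniformly hidden car
    pick = "A"                                      # X: player always picks A
    opened = random.choice([d for d in doors if d != pick and d != car])
    if opened == "C":                               # condition on Z = C
        counts[car] += 1
        trials += 1
print(counts["A"] / trials, counts["B"] / trials)   # ~1/3 and ~2/3
```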
(a) Prove that, in general, both 𝜎𝑋𝑌 and 𝜌𝑋𝑌 vanish when 𝑋 and 𝑌 are independent.
[Hint: Use Eqs. (1.16) and (1.17).]
(b) Give an example of two variables that are highly dependent and, yet, their correlation
coefficient vanishes.
Part (a)
If X and Y are independent, then P(x, y) = P(x)P(y) for all x and y, so
E(XY) = Σ_x Σ_y xy P(x)P(y) = (Σ_x x P(x))(Σ_y y P(y)) = E(X)E(Y)
Hence, by Eq. (1.16), σ_XY = E(XY) − E(X)E(Y) = 0, and by Eq. (1.17), ρ_XY = σ_XY/(σ_X σ_Y) = 0.
Part (b)
Consider two variables X and Y with P(X = −1) = P(X = 1) = 1/2 and the conditional distribution:

P(Y|X)    X = −1    X = 1
Y = −1    0.5       0
Y = 0     0         1
Y = 1     0.5       0
Above, we see that X and Y are dependent (𝑃 (𝑌 = 1|𝑋 = −1) ̸= 𝑃 (𝑌 = 1|𝑋 = 1)), yet:
E(XY) = Σ_x Σ_y xy P(x, y) = Σ_x Σ_y xy P(y|x)P(x)
= (−1)(−1)(0.5)(0.5) + (−1)(1)(0.5)(0.5) + (1)(0)(1)(0.5)
= 0
so the covariance (and hence the correlation coefficient) vanishes despite the strong dependence.
(c) Given that Player 2 won a dollar, what is your best guess of Player 1’s payoff?
(d) Given that Player 1 won a dollar, what is your best guess of Player 2’s payoff?
(e) Are there two events, 𝑋 = 𝑥 and 𝑌 = 𝑦, that are mutually independent?
Let X and Y stand for the winnings of Player 1 and Player 2, respectively.
Part (a)
The descriptions of these distributions are as follows:
𝑃 (𝑥): The probability that player 1 gets 𝑥 dollars.
𝑃 (𝑦): The probability that player 2 gets 𝑦 dollars.
𝑃 (𝑥, 𝑦): The probability that player 1 gets 𝑥 dollars and player 2 gets 𝑦 dollars.
𝑃 (𝑦|𝑥): The probability that player 2 gets 𝑦 dollars given that player 1 gets 𝑥 dollars.
𝑃 (𝑥|𝑦): The probability that player 1 gets 𝑥 dollars given that player 2 gets 𝑦 dollars.
Part (b)
We’ll compute each measure by its definition, using the fact that each coin flip is fair and
independent:
First, observe that Player 1 wins a dollar if at least 1 of the coins lands on heads. Another way to think about this scenario is that Player 1 loses if both coins land on tails, which we can subtract from 1 to find the probability of them winning. Specifically:
P(X = 1) = 1 − P(both tails) = 1 − 1/4 = 3/4
Computing the expected value follows from Eq. (1.10), summing over all outcomes and their associated probabilities:
E[X] = Σ_x x * P(x) = 1 * P(X = 1) + 0 * P(X = 0) = 3/4
We'll use a similar approach to compute the winning probability for Player 2 as well as the expected value of their winnings. Observe that Player 2 wins when both coins land on the same face, specifically:
P(Y = 1) = P(HH) + P(TT) = 1/4 + 1/4 = 1/2, and hence E[Y] = 1/2
To compute the conditional expected values, we will use Eq. (1.13), which intuitively sums
over all possible values of the query and weights by the conditional probability of each:
E[Y|X = x] = Σ_y y * P(y|X = x) = 1 * P(Y = 1|X = x) + 0 * P(Y = 0|X = x) = P(Y = 1|X = x)
E[X|Y = y] = Σ_x x * P(x|Y = y) = 1 * P(X = 1|Y = y) + 0 * P(X = 0|Y = y) = P(X = 1|Y = y)
Next, we can compute the variances of each variable using Eq. (1.15), their covariance using Eq. (1.16), and their correlation coefficient using Eq. (1.17):
Var(X) = E[X²] − E[X]² = 3/4 − (3/4)² = 3/16
Var(Y) = E[Y²] − E[Y]² = 1/2 − (1/2)² = 1/4
Cov(X, Y) = E[XY] − E[X]E[Y] = 1/4 − (3/4)(1/2) = −1/8
ρ_XY = Cov(X, Y)/(σ_X σ_Y) = (−1/8)/(√(3/16) √(1/4)) = −1/√3 ≈ −0.577
Part (c)
To answer this query, we know that if both 𝑋 = 1 and 𝑌 = 1, then the outcome of the two
coins must have been both heads, meaning that 𝑃 (𝑋 = 1, 𝑌 = 1) = 1/4. Furthermore, we
can phrase our query as 𝐸[𝑋|𝑌 = 1], since we are interested in the expectation of Player 1’s
winnings having observed that Player 2 won a dollar. Combining this knowledge with our
solution to each conditional expected value from part (b) above, we have:
E[X|Y = 1] = P(X = 1|Y = 1)
= P(X = 1, Y = 1)/P(Y = 1)
= (1/4)/(1/2)
= 1/2
Part (d)
We use the same strategy as in part (c) above, and have:
E[Y|X = 1] = P(Y = 1|X = 1)
= P(X = 1, Y = 1)/P(X = 1)
= (1/4)/(3/4)
= 1/3
Part (e)
Consider what we know about the joint events:
𝑃 (𝑋 = 1, 𝑌 = 1) = 1/4
𝑃 (𝑋 = 0, 𝑌 = 1) = 1/4
𝑃 (𝑋 = 1, 𝑌 = 0) = 1/2
𝑃 (𝑋 = 0, 𝑌 = 0) = 0
Now, examining their priors, we have:
𝑃 (𝑋 = 1) = 3/4
𝑃 (𝑋 = 0) = 1/4
𝑃 (𝑌 = 1) = 𝑃 (𝑌 = 0) = 1/2
Plainly, there are no two values for 𝑋 and 𝑌 such that the product of their priors will equal
their joint, i.e., for no two values 𝑋 = 𝑥, 𝑌 = 𝑦 do we have: 𝑃 (𝑌 = 𝑦, 𝑋 = 𝑥) = 𝑃 (𝑌 =
𝑦) * 𝑃 (𝑋 = 𝑥). Therefore, we conclude that there are no two mutually independent events.
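All of the quantities in this study question can be checked by exhaustively enumerating the four equally likely coin outcomes. A short sketch (ours):

```python
from itertools import product
from fractions import Fraction

# X = 1 iff at least one head; Y = 1 iff both coins match; each outcome has prob 1/4.
points = []
for c1, c2 in product("HT", repeat=2):
    points.append((1 if "H" in (c1, c2) else 0, 1 if c1 == c2 else 0))

q = Fraction(1, 4)
def E(f):
    return sum(q * f(x, y) for x, y in points)

ex, ey = E(lambda x, y: x), E(lambda x, y: y)
print(ex, ey)                             # 3/4 and 1/2
print(E(lambda x, y: x * x) - ex**2)      # Var(X) = 3/16
print(E(lambda x, y: y * y) - ey**2)      # Var(Y) = 1/4
print(E(lambda x, y: x * y) - ex * ey)    # Cov(X, Y) = -1/8
p11 = E(lambda x, y: x * y)               # P(X = 1, Y = 1) = 1/4
print(p11 / ey, p11 / ex)                 # E[X|Y=1] = 1/2, E[Y|X=1] = 1/3
```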
𝐸[𝑋], 𝐸[𝑌 ], 𝐸[𝑌 |𝑋 = 𝑥], 𝐸[𝑋|𝑌 = 𝑦], for each value of 𝑥 and 𝑦, and
𝑉 𝑎𝑟(𝑋), 𝑉 𝑎𝑟(𝑌 ), 𝐶𝑜𝑣(𝑋, 𝑌 ), 𝜌𝑋𝑌 , 𝐶𝑜𝑣(𝑋, 𝑍)
(b) Find the sample estimates of the measures computed in (a), based on the data from
Table 1.6. [Hint: Many software packages are available for doing this computation for
you.]
(c) Use the results in (a) to determine the best estimate of the sum, 𝑌 , given that we measured
𝑋 = 3.
(d) What is the best estimate of 𝑋, given that we measured 𝑌 = 4?
(e) What is the best estimate of 𝑋, given that we measured 𝑌 = 4 and 𝑍 = 1? Explain why
it is not the same as in (d).
representative of their sum. So, let us first deduce the expected values of each variable individually, exploiting the fact that each die takes each integer value from 1 to 6 with equal probability:
E[X] = E[Z] = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5, and E[Y] = E[X + Z] = E[X] + E[Z] = 7
Likewise, for any x, E[Y|X = x] = x + E[Z] = x + 3.5.
By a similar token, we consider the quantities E[X|Y = y] for all y. These can be computed by the same method, exploiting the symmetry between X and Z (which gives E[X|Y = y] = y/2):
𝐸[𝑋|𝑌 = 2] = 1
𝐸[𝑋|𝑌 = 3] = 1.5
𝐸[𝑋|𝑌 = 4] = 2
𝐸[𝑋|𝑌 = 5] = 2.5
𝐸[𝑋|𝑌 = 6] = 3
𝑒𝑡𝑐.
Next we compute the variances of our variables, which is a simple application of Eq. (1.15):
Var(X) = Var(Z) = E[X²] − E[X]² = 91/6 − 3.5² = 35/12 ≈ 2.917
Var(Y) = Var(X) + Var(Z) = 35/6 ≈ 5.833 (since X and Z are independent)
Now knowing our variances, we can compute Cov(X, Y) and Cov(X, Z) through application of Eq. (1.16):
Cov(X, Z) = 0 (the two dice are rolled independently)
Cov(X, Y) = Cov(X, X + Z) = Var(X) + Cov(X, Z) = 2.917
Intuitively, we can check our answer that 𝐶𝑜𝑣(𝑋, 𝑍) = 0 because 𝑋 and 𝑍 are independent
dice rolls. Finally, we compute the correlation between 𝑋 and 𝑌 by Eq. (1.17):
ρ_XY = Cov(X, Y)/(σ_X σ_Y) = 2.917/(√2.917 * √5.833) = 0.707
Part (b)
You can use programming packages in R (see the DAGitty package), MATLAB, Python (NumPy), etc., to calculate the quantities of interest from the sample in Table 1.6. The same computational strategies we used in part (a) apply, except that now our frequencies come from the data rather than from our analysis of craps. Specifically, we get:
𝐸[𝑋] = 4.33
𝐸[𝑌 ] = 8.5
𝑉 𝑎𝑟(𝑋) = 2.389
𝑉 𝑎𝑟(𝑌 ) = 1.75
𝐶𝑜𝑣(𝑋, 𝑌 ) = 1.545
𝐶𝑜𝑣(𝑋, 𝑍) = −1.06
𝜌𝑋𝑌 = 0.756
Part (c)
From our computations in part (a), we have 𝐸[𝑌 |𝑋 = 3] = 6.5
Part (d)
From our computations in part (a), we have 𝐸[𝑋|𝑌 = 4] = 2
Part (e)
We can compute 𝐸[𝑋|𝑌 = 4, 𝑍 = 1] by application of Eq. (1.13), summing over the six
possible values of 𝑋 and the probabilities of 𝑌 = 4, 𝑍 = 1 associated with each:
E[X|Y = 4, Z = 1] = Σ_x x * P(X = x|Y = 4, Z = 1)
= 1 * P(X = 1|Y = 4, Z = 1) + 2 * P(X = 2|Y = 4, Z = 1) + 3 * P(X = 3|Y = 4, Z = 1)
+ 4 * P(X = 4|Y = 4, Z = 1) + 5 * P(X = 5|Y = 4, Z = 1) + 6 * P(X = 6|Y = 4, Z = 1)
= 1 * 0 + 2 * 0 + 3 * 1 + 4 * 0 + 5 * 0 + 6 * 0
= 3
Intuitively, this quantity will not be the same as in part (d) because knowing that Z = 1 precludes some values of X given that Y = X + Z = 4. For example, in (d), we allowed for the possibility that X = 2 (and that therefore Z = 2 in order to sum to Y = 4), which is impossible in this problem given that Z = 1.
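The conditional expectations in parts (c)-(e) can be confirmed by enumerating the 36 equally likely rolls. A small sketch (ours):

```python
from itertools import product

rolls = [(x, z, x + z) for x, z in product(range(1, 7), repeat=2)]  # (X, Z, Y)

def cond_mean(idx, pred):
    match = [r for r in rolls if pred(r)]
    return sum(r[idx] for r in match) / len(match)

print(cond_mean(2, lambda r: r[0] == 3))                # E[Y|X=3] = 6.5
print(cond_mean(0, lambda r: r[2] == 4))                # E[X|Y=4] = 2.0
print(cond_mean(0, lambda r: r[2] == 4 and r[1] == 1))  # E[X|Y=4,Z=1] = 3.0
```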
(a) Prove Eq. (1.22) using the orthogonality principle. [Hint: Follow the treatment of Eq.
(1.26).]
for the craps game described in Study question 1.3.8. [Hint: Apply Eq. (1.27) and use
the variances and covariances computed for part (a) of this question.]
Part (a)
By assumption, from Eq. (1.21), we have:
Y = a + bX
Thus, using the hint to follow the treatment of Eq. (1.26): the orthogonality principle requires the residual ε = Y − (a + bX) to satisfy E[ε] = 0 and E[εX] = 0. The first condition gives E[Y] = a + bE[X]; substituting this into the second gives Cov(X, Y) = b Var(X), so that b = σ_XY/σ_X² and a = E[Y] − bE[X], which is Eq. (1.22).
Part (b)
From our answer to study question 1.3.8, we have:
σ_XY = σ_ZY = 2.917, σ_XZ = 0, σ_Y² = 5.833, σ_X² = σ_Z² = 2.917.
Using the above, we can use Eq. (1.27) to compute each partial regression coefficient:
R_YX·Z = (σ_Z² σ_YX − σ_YZ σ_ZX)/(σ_X² σ_Z² − σ_XZ²) = (2.917² − 0)/(2.917² − 0) = 1
R_XY·Z = (σ_Z² σ_XY − σ_XZ σ_ZY)/(σ_Y² σ_Z² − σ_YZ²) = (2.917² − 0)/(5.833 * 2.917 − 2.917²) = 1
R_YZ·X = (σ_X² σ_YZ − σ_YX σ_XZ)/(σ_Z² σ_X² − σ_ZX²) = (2.917² − 0)/(2.917² − 0) = 1
R_ZY·X = (σ_X² σ_ZY − σ_ZX σ_XY)/(σ_Y² σ_X² − σ_YX²) = (2.917² − 0)/(5.833 * 2.917 − 2.917²) = 1
R_XZ·Y = (σ_Y² σ_XZ − σ_XY σ_YZ)/(σ_Z² σ_Y² − σ_ZY²) = (0 − 2.917²)/(5.833 * 2.917 − 2.917²) = −1
R_ZX·Y = (σ_Y² σ_ZX − σ_ZY σ_YX)/(σ_Y² σ_X² − σ_YX²) = (0 − 2.917²)/(5.833 * 2.917 − 2.917²) = −1
Study question 1.4.1.
Consider the graph shown in Figure 1.8:
(Figure 1.8: A directed graph used in Study question 1.4.1; its edges, as implied by the answers below, are X → Y, X → W, W → Y, W → Z, Y → Z, Y → T, Z → T)
A basic interactive tutorial where students can practice their knowledge of graph termi-
nology is provided at dagitty.net/learn/graphs/
A more advanced tutorial where students can apply these terms to recognize causal con-
cepts like "mediator" and "confounder" is provided at dagitty.net/learn/graphs/roles.html
Part (a)
Parents of 𝑍: 𝑊, 𝑌
Part (b)
Ancestors of 𝑍: 𝑋, 𝑊, 𝑌
Part (c)
Children of 𝑊 : 𝑌, 𝑍
Part (d)
Descendants of 𝑊 : 𝑌, 𝑍, 𝑇
Part (e)
Assuming cycles are not allowed, the simple paths between 𝑋 and 𝑇 are:
{𝑋, 𝑌, 𝑇 }, {𝑋, 𝑌, 𝑍, 𝑇 }, {𝑋, 𝑌, 𝑊, 𝑍, 𝑇 }, {𝑋, 𝑊, 𝑌, 𝑇 }, {𝑋, 𝑊, 𝑌, 𝑍, 𝑇 }, {𝑋, 𝑊, 𝑍, 𝑇 },
{𝑋, 𝑊, 𝑍, 𝑌, 𝑇 }
Part (f)
Assuming cycles are not allowed, the directed paths between 𝑋 and 𝑇 are:
{𝑋, 𝑌, 𝑇 }, {𝑋, 𝑌, 𝑍, 𝑇 }, {𝑋, 𝑊, 𝑌, 𝑇 }, {𝑋, 𝑊, 𝑌, 𝑍, 𝑇 }, {𝑋, 𝑊, 𝑍, 𝑇 }
(c) Determine the best guess of the value of 𝑍, given that we observe 𝑋 = 3.
(d) Determine the best guess of the value of 𝑍, given that we observe 𝑋 = 1 and 𝑌 = 3.
(e) Assume that all exogenous variables are normally distributed with zero means and unit
variance, that is, 𝜎 = 1.
(i) Determine the best guess of 𝑋, given that we observed 𝑌 = 2.
(ii) (Advanced) Determine the best guess of 𝑌 , given that we observed 𝑋 = 1 and
𝑍 = 3. [Hint: You may wish to use the technique of multiple regression, together
with the fact that, for every three normally distributed variables, say 𝑋, 𝑌 , and 𝑍,
we have 𝐸[𝑌 |𝑋 = 𝑥, 𝑍 = 𝑧] = 𝑅𝑌 𝑋·𝑍 𝑥 + 𝑅𝑌 𝑍·𝑋 𝑧.]
Part (a)
The following graph complies with the given model:
(Graph: U_X → X, U_Y → Y, U_Z → Z, with X → Y → Z)
Part (b)
Assuming that 𝑈𝑋 , 𝑈𝑌 , 𝑈𝑍 are independent and have 0 means, we have:
Let 𝑍 = 𝑧1 and 𝑍 = 𝑧0 represent, respectively, the presence and absence of the syndrome,
𝑌 = 𝑦1 and 𝑌 = 𝑦0 represent death and survival, respectively, and 𝑋 = 𝑥1 and 𝑋 = 𝑥0
represent taking and not taking the drug. Assume that patients not carrying the syndrome,
𝑍 = 𝑧0 , die with probability 𝑝2 if they take the drug and with probability 𝑝1 if they don’t.
Patients carrying the syndrome, 𝑍 = 𝑧1 , on the other hand, die with probability 𝑝3 if they
do not take the drug and with probability p4 if they do take the drug. Further, patients having the syndrome are more likely to avoid the drug, with probabilities q1 = P(x1|z0) and q2 = P(x1|z1).
(a) Based on this model, compute the joint distributions 𝑃 (𝑥, 𝑦, 𝑧), 𝑃 (𝑥, 𝑦), 𝑃 (𝑥, 𝑧) and
𝑃 (𝑦, 𝑧) for all values of x, 𝑦, and 𝑧, in terms of the parameters (𝑟, 𝑝1 , 𝑝2 , 𝑝3 , 𝑝4 , 𝑞1 , 𝑞2 ).
[Hint: Use the product decomposition of Section 1.5.2.]
(b) Calculate the difference P(y1|x1) − P(y1|x0) for three populations: (1) those carrying the syndrome, (2) those not carrying the syndrome, and (3) the population as a whole.
(c) Using your results for (b), find a combination of parameters that exhibits Simpson’s
reversal.
(Graph: Z → X, Z → Y, X → Y)
Figure 1.8: Model showing an unobserved syndrome, 𝑍, affecting both treatment (𝑋) and
outcome (𝑌 )
Part (a)
𝑃 (𝑌 |𝑋, 𝑍) y x z
𝑝1 1 0 0
𝑝2 1 1 0
𝑝3 1 0 1
𝑝4 1 1 1
𝑃 (𝑋|𝑍) x z
𝑞1 1 0
𝑞2 1 1
By the chain rule, we know that: 𝑃 (𝑥, 𝑦, 𝑧) = 𝑃 (𝑦|𝑥, 𝑧)𝑃 (𝑥|𝑧)𝑃 (𝑧)
So, substituting the table rows into the above factorization, we have:
P(x1, y1, z1) = p4 q2 r,  P(x1, y0, z1) = (1 − p4) q2 r
P(x0, y1, z1) = p3 (1 − q2) r,  P(x0, y0, z1) = (1 − p3)(1 − q2) r
P(x1, y1, z0) = p2 q1 (1 − r),  P(x1, y0, z0) = (1 − p2) q1 (1 − r)
P(x0, y1, z0) = p1 (1 − q1)(1 − r),  P(x0, y0, z0) = (1 − p1)(1 − q1)(1 − r)
The marginals P(x, y), P(x, z), and P(y, z) are obtained by summing the appropriate entries.
Part (b)
(b.1) For those carrying the syndrome:
P(y1|x1, z1) − P(y1|x0, z1) = p4 − p3
(b.2) For those not carrying the syndrome:
P(y1|x1, z0) − P(y1|x0, z0) = p2 − p1
(b.3) For the population as a whole:
P(y1|x1) − P(y1|x0) = P(y1, x1)/P(x1) − P(y1, x0)/P(x0)
= P(y1, x1)/[P(x1, y1) + P(x1, y0)] − P(y1, x0)/[P(x0, y1) + P(x0, y0)]
= [p2 q1 (1 − r) + p4 q2 r]/[r q2 + (1 − r) q1] − [p1 (1 − q1)(1 − r) + p3 (1 − q2) r]/[r (1 − q2) + (1 − r)(1 − q1)]
Part (c)
To elicit Simpson's reversal, we want to find a combination of parameters such that parts (b.1) and (b.2) above have a different sign than (b.3). As such, consider the following parameterization:
p1 = 0.1, p2 = 0, p3 = 0.3, p4 = 0.2, q1 = 0, q2 = 1, r = 0.1
Now, substituting the above into (b.1), (b.2), and (b.3), we have:
(b.1) = p4 − p3 = −0.1
(b.2) = p2 − p1 = −0.1
(b.3) = 0.02/0.1 − 0.09/0.9 = 0.2 − 0.1 = +0.1
so the drug appears harmful within each stratum of Z, yet beneficial in the population as a whole.
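These three numbers are easy to verify in code; a quick sketch (ours):

```python
p1, p2, p3, p4 = 0.1, 0.0, 0.3, 0.2
q1, q2, r = 0.0, 1.0, 0.1

print(p4 - p3)   # (b.1) = -0.1
print(p2 - p1)   # (b.2) = -0.1
b3 = ((p2 * q1 * (1 - r) + p4 * q2 * r) / (r * q2 + (1 - r) * q1)
      - (p1 * (1 - q1) * (1 - r) + p3 * (1 - q2) * r) / (r * (1 - q2) + (1 - r) * (1 - q1)))
print(b3)        # (b.3) = +0.1: the sign reverses in the aggregate
```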
𝑃 (𝑋𝑖 = 1|𝑋𝑖−1 = 1) = 𝑝
𝑃 (𝑋𝑖 = 1|𝑋𝑖−1 = 0) = 𝑞
𝑃 (𝑋1 = 1) = 𝑝0
𝑃 (𝑋1 = 1, 𝑋2 = 0, 𝑋3 = 1, 𝑋4 = 0)
𝑃 (𝑋4 = 1|𝑋1 = 1)
𝑃 (𝑋1 = 1|𝑋4 = 1)
P(X3 = 1|X1 = 0, X4 = 1)
Part (1)
By the product decomposition along the chain:
P(X1 = 1, X2 = 0, X3 = 1, X4 = 0) = P(X1 = 1)P(X2 = 0|X1 = 1)P(X3 = 1|X2 = 0)P(X4 = 0|X3 = 1)
= p0 (1 − p) q (1 − p) = p0 q (1 − p)²
Part (2)
P(X4 = 1|X1 = 1) = P(X4 = 1, X1 = 1)/P(X1 = 1)
= Σ_{X2,X3} P(X4 = 1, X1 = 1, X2, X3)/P(X1 = 1)
= [p0 p³ + 2 p0 pq(1 − p) + p0 q(1 − p)(1 − q)]/p0
= p³ + 2pq(1 − p) + q(1 − p)(1 − q)
Part (3)
P(X1 = 1|X4 = 1) = P(X1 = 1, X4 = 1)/P(X4 = 1)
= Σ_{X2,X3} P(X4 = 1, X1 = 1, X2, X3) / Σ_{X1,X2,X3} P(X1, X2, X3, X4 = 1)
= [p0 p³ + 2 p0 pq(1 − p) + p0 q(1 − p)(1 − q)] / [p0 p³ + 2 p0 pq(1 − p) + p0 q(1 − p)(1 − q) + (1 − p0)(qp² + q²(1 − p) + qp(1 − q) + q(1 − q)²)]
Part (4)
P(X3 = 1|X1 = 0, X4 = 1) = P(X1 = 0, X3 = 1, X4 = 1)/P(X1 = 0, X4 = 1)
= Σ_{X2} P(X2, X1 = 0, X3 = 1, X4 = 1) / Σ_{X2,X3} P(X2, X3, X1 = 0, X4 = 1)
= [(1 − p0)(1 − q)pq + (1 − p0)qp²] / [(1 − p0)(qp² + q²(1 − p) + qp(1 − q) + q(1 − q)²)]
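These closed-form expressions can be validated by brute force over all 16 trajectories. A sketch (ours), with arbitrary illustrative values for p, q, and p0:

```python
from itertools import product

p, q, p0 = 0.6, 0.3, 0.5   # made-up parameter values

def prob(x):
    pr = p0 if x[0] == 1 else 1 - p0
    for prev, cur in zip(x, x[1:]):
        step = p if prev == 1 else q        # P(X_i = 1 | X_{i-1})
        pr *= step if cur == 1 else 1 - step
    return pr

paths = list(product([0, 1], repeat=4))
def P(pred):
    return sum(prob(x) for x in paths if pred(x))

print(P(lambda x: x == (1, 0, 1, 0)))                                 # Part (1)
print(P(lambda x: x[0] == 1 and x[3] == 1) / P(lambda x: x[0] == 1))  # Part (2)
print(P(lambda x: x[0] == 1 and x[3] == 1) / P(lambda x: x[3] == 1))  # Part (3)
print(P(lambda x: x[0] == 0 and x[2] == 1 and x[3] == 1)
      / P(lambda x: x[0] == 0 and x[3] == 1))                         # Part (4)
```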
Again, we’ll adopt the variables from the text: let 𝑋 indicate the door chosen by the player,
𝑌 indicate the door hiding the car, and 𝑍 indicate the door opened by the host.
From the story, we know that each door has an equal chance of being the winner, and that
each has an equal chance of being selected by the player. This suggests that both 𝑋 and 𝑌
are selected independently from one another as a function of some unmodeled factors.
Furthermore, we know that the door revealed by the host, 𝑍, will be neither the one opened
by the player, 𝑋, nor the winning door, 𝑌 , and in the event that 𝑋 = 𝑌 , then 𝑍 has an equal
chance of being one of the two remaining doors. These observations suggest that 𝑍 must be
a function of not only 𝑋, 𝑌 , but also of some unmodeled factors whenever 𝑋 = 𝑌 .
Combining our observations from above gives us the following model specification:
V = {X, Y, Z}, U = {U_X, U_Y, U_Z}, F = {f},
X = U_X, Y = U_Y, Z = f(X, Y, U_Z)
We can also depict this model graphically:
(Graph: U_X → X, U_Y → Y, U_Z → Z, with X → Z ← Y)
And lastly, the joint distribution can be factorized by using the chain rule and independence
relations (i.e., the Markovian factorization, Eq. (1.29), which decomposes a joint distribution
into family factors) to write:
𝑃 (𝑋, 𝑌, 𝑍) = 𝑃 (𝑍|𝑋, 𝑌 )𝑃 (𝑌 )𝑃 (𝑋)
So, to explore some example queries, we begin by acknowledging that each door has an equal chance of being the winner and of being the one chosen by the player, namely:
P(X = x) = P(Y = y) = 1/3 for every door x and y
Furthermore, we know that the host cannot reveal the winning door or the one chosen by the player, so:
P(Z = z|X = x, Y = y) = 0 whenever z = x or z = y
Lastly, we know that the host must open the last remaining door if X and Y are different, and has an equal chance of choosing one of the remaining two when X = Y, giving us:
P(Z = z|X = x, Y = y) = 1 if x ≠ y and z ∉ {x, y}, and P(Z = z|X = x, Y = y) = 1/2 if x = y and z ≠ x
These observations allow us to compute arbitrary probability queries because we know the decomposition of the joint distribution. For example, for doors A, B, and C:
P(X = A, Y = B, Z = C) = P(Z = C|X = A, Y = B)P(X = A)P(Y = B) = 1 * 1/3 * 1/3 = 1/9
Figure 2.5: A directed graph for demonstrating conditional independence (error terms are not shown explicitly); its edges are X → R → S → T ← U ← V → Y
(a) List all pairs of variables in Figure 2.5 that are independent conditional on the set
𝑍 = {𝑅, 𝑉 }.
(b) For each pair of nonadjacent variables in Figure 2.5, give a set of variables that, when
conditioned on, renders that pair independent.
Figure 2.6: A directed graph in which P is a descendant of a collider; its edges are X → R → S → T ← U ← V → Y, together with T → P
(c) List all pairs of variables in Figure 2.6 that are independent conditional on the set
𝑍 = {𝑅, 𝑃 }.
(d) For each pair of nonadjacent variables in Figure 2.6, give a set of variables that, when conditioned on, renders that pair independent.
(e) Suppose we generate data by the model described in Figure 2.5, and we fit them with the
linear equation 𝑌 = 𝑎 + 𝑏𝑋 + 𝑐𝑍 . Which of the variables in the model may be chosen
for Z so as to guarantee that the slope b would be equal to zero? [Hint: Recall, a nonzero slope implies that Y and X are dependent given Z.]
(f) Continuing question (e), suppose we fit the data with the equation:
𝑌 = 𝑎 + 𝑏𝑋 + 𝑐𝑅 + 𝑑𝑆 + 𝑒𝑇 + 𝑓 𝑃
To determine if two variables are independent in a network, we consider all simple paths
between them and determine if all paths are "blocked" by Rules 1, 2, and 3 in this chapter
that detail the graphical criteria for conditional independence. If a single path is not blocked,
then the two variables are dependent. They are therefore independent (or conditionally inde-
pendent, when conditioning on some other variables) whenever all simple paths between
them are blocked.
Part (a)
In the table below, each row may be read as: “𝑋 is independent of 𝑌 given 𝑍 because..."
where 𝑋 is the variable listed in the first column of every row, 𝑌 is listed in the second col-
umn, 𝑍 in the third, and an explanation in the fourth. Note that for each row, 𝑋 and 𝑌 may
be swapped with the same valid claim of independence.
X Y Z Reason
𝑋 𝑆 {𝑅, 𝑉 } 𝑋 → 𝑅 → 𝑆 is blocked at chain 𝑋 → 𝑅 → 𝑆 (𝑅 is given)
𝑋 𝑇 {𝑅, 𝑉 } 𝑋 → 𝑅 → 𝑆 → 𝑇 is blocked at chain 𝑋 → 𝑅 → 𝑆 (𝑅 is given)
𝑋 𝑈 {𝑅, 𝑉 } 𝑋 → 𝑅 → 𝑆 → 𝑇 ← 𝑈 is blocked at chain 𝑋 → 𝑅 → 𝑆 (𝑅 is
given)
X Y {R, V} X → R → S → T ← U ← V → Y is blocked at chain X → R → S (R is given)
𝑆 𝑈 {𝑅, 𝑉 } 𝑆 → 𝑇 ← 𝑈 is blocked at collider 𝑆 → 𝑇 ← 𝑈 (neither 𝑇 nor
descendants of 𝑇 given)
𝑆 𝑌 {𝑅, 𝑉 } 𝑆 → 𝑇 ← 𝑈 ← 𝑉 → 𝑌 is blocked at collider 𝑆 → 𝑇 ← 𝑈 (neither
𝑇 nor descendants of 𝑇 given)
𝑇 𝑌 {𝑅, 𝑉 } 𝑇 ← 𝑈 ← 𝑉 → 𝑌 is blocked at fork 𝑈 ← 𝑉 → 𝑌 (𝑉 is given)
𝑈 𝑌 {𝑅, 𝑉 } 𝑈 ← 𝑉 → 𝑌 is blocked at fork 𝑈 ← 𝑉 → 𝑌 (𝑉 is given)
Part (b)
Answers to part (b) can be found in the following table, formatted in the same way as (a)
above:
X Y Z Reason
𝑋 𝑆 {𝑅} 𝑋 → 𝑅 → 𝑆 is blocked at chain 𝑋 → 𝑅 → 𝑆 (𝑅 is given)
𝑋 𝑇 {𝑅} 𝑋 → 𝑅 → 𝑆 → 𝑇 is blocked at chain 𝑋 → 𝑅 → 𝑆 (𝑅 is given)
𝑋 𝑈 {𝑅} 𝑋 → 𝑅 → 𝑆 → 𝑇 ← 𝑈 is blocked at chain 𝑋 → 𝑅 → 𝑆 (𝑅 is given)
𝑋 𝑉 {𝑅} 𝑋 → 𝑅 → 𝑆 → 𝑇 ← 𝑈 ← 𝑉 is blocked at chain 𝑋 → 𝑅 → 𝑆 (𝑅 is
given)
𝑋 𝑌 {𝑅} 𝑋 → 𝑅 → 𝑆 → 𝑇 ← 𝑈 ← 𝑉 → 𝑌 is blocked at chain 𝑋 → 𝑅 → 𝑆
(𝑅 is given)
𝑅 𝑇 {𝑆} 𝑅 → 𝑆 → 𝑇 is blocked at chain 𝑅 → 𝑆 → 𝑇 (𝑆 is given)
𝑅 𝑈 {𝑆} 𝑅 → 𝑆 → 𝑇 ← 𝑈 is blocked at chain 𝑅 → 𝑆 → 𝑇 (𝑆 is given)
𝑅 𝑉 {𝑆} 𝑅 → 𝑆 → 𝑇 ← 𝑈 ← 𝑉 is blocked at chain 𝑅 → 𝑆 → 𝑇 (𝑆 is given)
𝑅 𝑌 {𝑆} 𝑅 → 𝑆 → 𝑇 ← 𝑈 ← 𝑉 → 𝑌 is blocked at chain 𝑅 → 𝑆 → 𝑇 (𝑆 is
given)
𝑆 𝑈 {} 𝑆 → 𝑇 ← 𝑈 is blocked at collider 𝑆 → 𝑇 ← 𝑈 (neither 𝑇 nor
descendants of 𝑇 are given)
𝑆 𝑉 {} 𝑆 → 𝑇 ← 𝑈 ← 𝑉 is blocked at collider 𝑆 → 𝑇 ← 𝑈 (neither 𝑇 nor
descendants of 𝑇 are given)
𝑆 𝑌 {} 𝑆 → 𝑇 ← 𝑈 ← 𝑉 → 𝑌 is blocked at collider 𝑆 → 𝑇 ← 𝑈 (neither 𝑇
nor descendants of 𝑇 are given)
𝑇 𝑉 {𝑈 } 𝑇 ← 𝑈 ← 𝑉 is blocked at chain 𝑇 ← 𝑈 ← 𝑉 (𝑈 is given)
𝑇 𝑌 {𝑈 } 𝑇 ← 𝑈 ← 𝑉 → 𝑌 is blocked at chain 𝑇 ← 𝑈 ← 𝑉 (𝑈 is given)
𝑈 𝑌 {𝑉 } 𝑈 ← 𝑉 → 𝑌 is blocked at fork 𝑈 ← 𝑉 → 𝑌 (𝑉 is given)
Part (c)
Observe that conditioning on {R, P} blocks only the chain X → R → S (R is given) and opens the collider S → T ← U (P, a descendant of T, is given). Thus, we render only X independent of all other variables in the model; specifically, the pairs of independent variables conditioned on {R, P} are: (X, R), (X, S), (X, T), (X, P), (X, U), (X, V), (X, Y)
Part (d)
Now that we’re familiar with Figure 2.6, we can summarize independence relationships in
the following table, which is similar to the previous two in parts (a) and (b) except that every
row may be read, “variable 𝑋 is independent of all variables in set 𝑌 given set 𝑍.”
X Y Z Reason
𝑋 {𝑆, 𝑇, 𝑈, 𝑉, 𝑌, 𝑃 } {𝑅} Chain blocked at 𝑋 → 𝑅 → 𝑆 (𝑅 is given)
𝑅 {𝑇, 𝑈, 𝑉, 𝑌, 𝑃 } {𝑆} Chain blocked at 𝑅 → 𝑆 → 𝑇 (𝑆 is given)
𝑆 {𝑈, 𝑉, 𝑌 } {} Collider blocked at 𝑆 → 𝑇 ← 𝑈 (Neither 𝑇 nor
descendants of 𝑇 are given)
𝑃 {𝑆, 𝑈, 𝑉, 𝑌 } {𝑇 } Chains blocked at 𝑆 → 𝑇 → 𝑃 (𝑇 is given) and 𝑈 →
𝑇 → 𝑃 (𝑇 is given)
𝑇 {𝑉, 𝑌 } {𝑈 } Chain blocked at 𝑉 → 𝑈 → 𝑇 (𝑈 is given)
𝑈 {𝑌 } {𝑉 } Fork blocked at 𝑈 ← 𝑉 → 𝑌 (𝑉 is given)
Part (e)
Since 𝑌 and 𝑋 are independent conditional on any member of the set {𝑅, 𝑆, 𝑈, 𝑉 }, we may
choose 𝑍 to be any of these variables.
Part (f)
To determine which slopes will be equal to zero, we can again consider if 𝑌 is independent
of each variable given the other variables on the RHS of our equation. Specifically:
1. 𝑏 (the slope associated with 𝑋) will be zero, since 𝑌 and 𝑋 are independent given
𝑅, 𝑆, 𝑇, 𝑃 .
2. 𝑐 (the slope associated with 𝑅) will be zero, since 𝑌 and 𝑅 are independent given
𝑋, 𝑆, 𝑇, 𝑃 .
3. 𝑓 (the slope associated with 𝑃 ) will be zero, since 𝑌 and 𝑃 are independent given
𝑋, 𝑅, 𝑆, 𝑇 .
Figure 2.9: A causal graph used in study question 2.4.1 (all U terms, not shown, are assumed independent); its edges are Z1 → Z3, Z2 → Z3, Z1 → X, Z3 → X, Z2 → Y, Z3 → Y, X → W, W → Y
(a) For each pair of nonadjacent nodes in this graph, find a set of variables that 𝑑-separates
that pair. What does this list tell us about independencies in the data?
(b) Repeat question (a) assuming that only variables in the set {𝑍3 , 𝑊, 𝑋, 𝑍1 } can be
measured.
(c) For each pair of nonadjacent nodes in the graph, determined whether they are
independent conditional on all other variables.
(d) For every variable 𝑉 in the graph find a minimal set of nodes that renders 𝑉 independent
of all other variables in the graph.
(e) Suppose we wish to estimate the value of 𝑌 from measurements taken on all other
variables in the model. Find the smallest set of variables that would yield as good an
estimate of 𝑌 as when we measured all variables.
(f) Repeat question (e) assuming that we wish to estimate the value of 𝑍2 .
(g) Suppose we wish to predict the value of 𝑍2 from measurements of 𝑍3 . Would the quality
of our prediction improve if we add measurement of 𝑊 ? Explain.
Part (a)
Recall that two variables are d-separated if all simple paths between them are blocked by the
given set of variables (see Definition 2.4.1, and Rules 1, 2, and 3 for the graphical criteria for
conditional independence). Below, we list all pairs of variables that are d-separated, along
with the paths that must be blocked between them.
(𝑋, 𝑌 ) are independent conditioned on set {𝑍1 , 𝑍3 , 𝑊 }:
1. X → W → Y is blocked at chain X → W → Y (W is given).
2. X ← Z3 → Y is blocked at fork X ← Z3 → Y (Z3 is given).
3. X ← Z3 ← Z2 → Y is blocked at chain Z2 → Z3 → X (Z3 is given).
4. X ← Z1 → Z3 → Y is blocked at chain Z1 → Z3 → Y (Z3 is given).
5. X ← Z1 → Z3 ← Z2 → Y is blocked at fork X ← Z1 → Z3 (Z1 is given).
Viewing the above d-separation path analysis, we can similarly deduce the remaining
independent pairs.
∙ (𝑌, 𝑍1 ) are independent conditioned on set {𝑋, 𝑍2 , 𝑍3 }.
∙ (𝑊, 𝑍1 ) are independent conditioned on set {𝑋}.
∙ (𝑊, 𝑍2 ) are independent conditioned on set {𝑋}.
∙ (𝑊, 𝑍3 ) are independent conditioned on set {𝑋}.
∙ (𝑍1 , 𝑍2 ) are independent conditioned on set {}.
What does this list of independencies tell us about those in the data? Assuming that the data
was generated from this model, then the independencies will also be manifest in the data.
Part (b)
Using the path analyses we performed in (a) above, we can determine if each pair of variables
can be d-separated conditioning only on variables in the set {𝑍3 , 𝑊, 𝑋, 𝑍1 }.
∙ (𝑋, 𝑌 ) are independent conditioned on set {𝑍1 , 𝑍3 , 𝑊 }.
∙ (𝑋, 𝑍2 ) are independent conditioned on set {𝑍1 , 𝑍3 }.
∙ (𝑊, 𝑍1 ) are independent conditioned on set {𝑋}.
∙ (𝑊, 𝑍2 ) are independent conditioned on set {𝑋}.
∙ (𝑊, 𝑍3 ) are independent conditioned on set {𝑋}.
∙ (𝑍1 , 𝑍2 ) are independent conditioned on set {}.
(𝑌, 𝑍1 ) were independent when we could condition on {𝑋, 𝑍2 , 𝑍3 }, but now there is no such
set that will render (𝑌, 𝑍1 ) independent. Since we can no longer condition on 𝑍2 , we must
address two paths:
1. 𝑍1 → 𝑍3 ← 𝑍2 → 𝑌 is blocked if we do not condition on 𝑍3 , but open if we do.
2. 𝑍1 → 𝑍3 → 𝑌 is blocked if we condition on 𝑍3 , but open if we do not.
Observe that these two requirements are mutually exclusive, so one of these two paths
remains open, and so (𝑌, 𝑍1 ) cannot be d-separated using the covariates available.
Part (c)
Again using our path analysis from (a), we have:
∙ (𝑋, 𝑌 ) are independent given {𝑍1 , 𝑍2 , 𝑍3 , 𝑊 }.
∙ (𝑋, 𝑍2 ) are independent given {𝑍1 , 𝑍3 , 𝑊, 𝑌 }.
∙ (𝑌, 𝑍1 ) are independent given {𝑍2 , 𝑍3 , 𝑋, 𝑊 }.
∙ (𝑊, 𝑍1 ) are independent given {𝑍2 , 𝑍3 , 𝑋, 𝑌 }.
Part (d)
Consider the set of variables Z comprised of the parents, children, and spouses of 𝑉 : Z
= {𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑉 ), 𝑐ℎ𝑖𝑙𝑑𝑟𝑒𝑛(𝑉 ), 𝑠𝑝𝑜𝑢𝑠𝑒𝑠(𝑉 )}.
Let us first convince ourselves that conditioning on Z is guaranteed to render V independent of all other variables in the graph.
1. parents(V): conditioning on the parents of V will block any forks and chains entering V.
2. 𝑐ℎ𝑖𝑙𝑑𝑟𝑒𝑛(𝑉 ): conditioning on the children of 𝑉 will block any forks and chains
emanating from 𝑉 .
3. 𝑠𝑝𝑜𝑢𝑠𝑒𝑠(𝑉 ): conditioning on the spouses of 𝑉 will block paths that were opened
at a collider formed on a child of 𝑉 (which was opened when we conditioned on
𝑐ℎ𝑖𝑙𝑑𝑟𝑒𝑛(𝑉 )).
Conditioning on Z is guaranteed to d-separate 𝑉 from all other nodes in the model (which is
referred to as the Markov Blanket for a node 𝑉 , sometimes denoted 𝑀 𝐵(𝑉 )).
𝑀 𝐵(𝑋) = {𝑊, 𝑍1 , 𝑍3 }
𝑀 𝐵(𝑌 ) = {𝑍2 , 𝑍3 , 𝑊 }
𝑀 𝐵(𝑊 ) = {𝑋, 𝑌, 𝑍2 , 𝑍3 }
𝑀 𝐵(𝑍1 ) = {𝑋, 𝑍2 , 𝑍3 }
𝑀 𝐵(𝑍2 ) = {𝑍1 , 𝑍3 , 𝑌, 𝑊 }
MB(Z3) = {Z1, Z2, X, W, Y}
Part (e)
The minimal set would be the Markov Blanket of 𝑌 , 𝑀 𝐵(𝑌 ) = {𝑍2 , 𝑍3 , 𝑊 }, since {𝑋, 𝑍1 }
are independent from 𝑌 given 𝑀 𝐵(𝑌 ), and so their addition would not improve our estimate.
Part (f)
The minimal set would be the Markov Blanket of 𝑍2 , 𝑀 𝐵(𝑍2 ) = {𝑍1 , 𝑍3 , 𝑌, 𝑊 }, since
{𝑋} is independent from 𝑍2 given 𝑀 𝐵(𝑍2 ). Observe that here we include 𝑌 since it
improves our estimate of 𝑍2 (they are dependent), but this inclusion opens a path that was
not open when we conditioned on all variables: 𝑌 ← 𝑊 ← 𝑋 ← 𝑍1 → 𝑍3 ← 𝑍2 . We then
block this path by conditioning on 𝑊 .
Part (g)
Yes, since 𝑍2 and 𝑊 are dependent given 𝑍3 (there is an open path 𝑊 ← 𝑋 ← 𝑍1 → 𝑍3 ←
𝑍2 such that information about 𝑊 provides information about 𝑍2 ).
Part (a)
To determine which arrows can be reversed without being detectable by a statistical test, we
consider models that are in the equivalence class of Figure 2.9 (see paragraphs previous to
these study questions in the text). Accordingly, we first find the v-structures in the graph (i.e.,
colliders whose parents are not adjacent), which are:
∙ 𝑍1 → 𝑍3 ← 𝑍2
∙ 𝑍3 → 𝑌 ← 𝑊
∙ 𝑍2 → 𝑌 ← 𝑊
So, to find other models in this equivalence class, we may flip the direction of edges such that
the resulting model abides by two criteria:
∙ We neither create nor destroy any v-structures.
∙ We must not introduce a cycle into the resulting graph. Note that while d-separation is valid in certain linear cyclic models, it is not valid for cyclic models in general (e.g., when the functions are nonlinear).
With these constraints, we conclude that there are no such edges that can be reversed within
the model to find another within its equivalence class. We would verify this claim by testing
our constraints against each edge reversal. Here is a complete list of our tests and a reason (of
which there might be several) why each reversal fails to produce a model in the equivalence
class of Figure 2.9 (note, we define the “old model” as the one pre-edge-reversal, and the
“new model” as the one resulting from the reversal):
Edge Reason
𝑍1 → 𝑋 Creates a cycle in new model, 𝑍1 → 𝑍3 → 𝑋 → 𝑍1
𝑍1 → 𝑍3 Destroys a v-structure in old model, 𝑍1 → 𝑍3 ← 𝑍2
𝑍3 → 𝑋 Creates a v-structure in new model, 𝑋 → 𝑍3 ← 𝑍2
𝑍3 → 𝑌 Creates a v-structure in new model, 𝑌 → 𝑍3 ← 𝑍1
𝑋→𝑊 Creates a v-structure in new model, 𝑊 → 𝑋 ← 𝑍1
𝑊 →𝑌 Destroys a v-structure in old model, 𝑊 → 𝑌 ← 𝑍2
𝑍2 → 𝑍3 Destroys a v-structure in old model, 𝑍1 → 𝑍3 ← 𝑍2
𝑍2 → 𝑌 Destroys a v-structure in old model, 𝑍2 → 𝑌 ← 𝑊
Part (b)
There are no additional models in the equivalence class of Figure 2.9 for the reasons stated
in part (a) above.
Part (c)
No edge may be reversed to produce a model in the equivalence class of Figure 2.9 (see
explanation in part (a)). Therefore, all edge directionalities in the graph may be determined
from non-experimental data.
Part (d)
The model in 2.4.1(d) implies that 𝑌 is independent of 𝑍1 given {𝑍2 , 𝑍3 , 𝑊 }.
So, suppose we fit the data with the equation:
𝑦 = 𝑟2 𝑧2 + 𝑟3 𝑧3 + 𝑟𝑤 𝑤 + 𝑟1 𝑧1
If 𝑟1 is non-zero in the fitted equation, then the model of Figure 2.9 is wrong since the data
violates the conditional independence between 𝑌 and 𝑍1 as claimed by the model.
Part (e)
The model in 2.4.1(d) implies that 𝑍3 is independent of 𝑊 given {𝑋}.
So, suppose we fit the data with the equation:
𝑧3 = 𝑟𝑥 𝑥 + 𝑟𝑤 𝑤
If 𝑟𝑤 is non-zero in the fitted equation, then the model of Figure 2.9 is wrong since the data
violates the conditional independence between 𝑍3 and 𝑊 as claimed by the model.
Part (f)
No such regression exists because no variable can be separated from 𝑍3 by any set of
observed variables.
Part (g)
According to Equation 1.29, the joint probability distribution can be factorized as:
𝑃 (𝑍1 , 𝑍2 , 𝑍3 , 𝑋, 𝑊, 𝑌 ) = 𝑃 (𝑍1 )𝑃 (𝑍2 )𝑃 (𝑍3 |𝑍1 , 𝑍2 )𝑃 (𝑋|𝑍1 , 𝑍3 )𝑃 (𝑊 |𝑋)𝑃 (𝑌 |𝑊, 𝑍2 , 𝑍3 )
So, to fully test the model, we need to examine every factor in this factorization and establish
that, conditional on its parents, every variable 𝑉 is independent of its non-descendants. The
corresponding regression equations necessary to perform these tests are:
1. 𝑧2 = 𝑟1 𝑧1 with vanishing 𝑟1 .
2. 𝑥 = 𝑟1 𝑧1 + 𝑟2 𝑧2 + 𝑟3 𝑧3 with vanishing 𝑟2 .
3. 𝑤 = 𝑟𝑥 𝑥 + 𝑟1 𝑧1 + 𝑟2 𝑧2 + 𝑟3 𝑧3 with vanishing 𝑟1 , 𝑟2 , 𝑟3 .
4. 𝑦 = 𝑟𝑤 𝑤 + 𝑟𝑥 𝑥 + 𝑟1 𝑧1 + 𝑟2 𝑧2 + 𝑟3 𝑧3 with vanishing 𝑟𝑥 , 𝑟1 .
So, in total, to fully test the model we would need 4 regression equations, through which we
would perform 1 + 1 + 3 + 2 = 7 tests for vanishing regression coefficients.
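The seven tests can be exercised on synthetic data. Below is a simulation sketch (ours, not part of the manual): it draws samples from one linear SEM compatible with Figure 2.9, with all structural coefficients arbitrarily set to 1, and fits the four regressions by least squares; the flagged coefficients should all be near zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
z3 = z1 + z2 + rng.normal(size=n)
x = z1 + z3 + rng.normal(size=n)
w = x + rng.normal(size=n)
y = w + z2 + z3 + rng.normal(size=n)

def ols(target, *regressors):
    A = np.column_stack(regressors)
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return np.round(coef, 2)

print(ols(z2, z1))               # test 1: r1 ~ 0
print(ols(x, z1, z2, z3))        # test 2: coefficient on z2 ~ 0
print(ols(w, x, z1, z2, z3))     # test 3: z1, z2, z3 coefficients ~ 0
print(ols(y, w, x, z1, z2, z3))  # test 4: x and z1 coefficients ~ 0
```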
Study Questions and Solutions for
Chapter 3
(d) Find a combination of parameters that exhibit Simpson's reversal (as in Study question 1.5.2(c)) and show explicitly that the overall causal effect of the drug is obtained from the desegregated data.
(Model, as in Study question 1.5.2: Z → X, Z → Y, X → Y)
Part (a)
Now, to compute 𝑃 (𝑦|𝑑𝑜(𝑥)) for all values of 𝑥 and 𝑦, we can consider the mutilated model
𝑚 wherein all causal influences to 𝑋 are severed, and 𝑋 is forced to some value 𝑥:
(Mutilated model m: the arrow Z → X is removed, leaving Z → Y and X → Y, with X held fixed at x)
With this mutilated model in hand, we can write the product decomposition, Eq. (1.29), for
𝑚 to solve for our quantities of interest. Observe three facts that will help us with our task:
𝑃 (𝑌 |𝑍, 𝑋) = 𝑃𝑚 (𝑌 |𝑍, 𝑋), 𝑃 (𝑍) = 𝑃𝑚 (𝑍), and 𝑃𝑚 (𝑍|𝑋) = 𝑃𝑚 (𝑍) = 𝑃 (𝑍).
𝑃 (𝑦1 |𝑑𝑜(𝑥1 )) = 𝑃 (𝑦1 |𝑥1 , 𝑧1 )𝑃 (𝑧1 ) + 𝑃 (𝑦1 |𝑥1 , 𝑧0 )𝑃 (𝑧0 ) = 𝑟𝑝4 + (1 − 𝑟)𝑝2
𝑃 (𝑦1 |𝑑𝑜(𝑥0 )) = 𝑃 (𝑦1 |𝑥0 , 𝑧1 )𝑃 (𝑧1 ) + 𝑃 (𝑦1 |𝑥0 , 𝑧0 )𝑃 (𝑧0 ) = 𝑟𝑝3 + (1 − 𝑟)𝑝1
𝑃 (𝑦0 |𝑑𝑜(𝑥1 )) = 1 − 𝑃 (𝑦1 |𝑑𝑜(𝑥1 )) = 1 − (𝑟𝑝4 + (1 − 𝑟)𝑝2 )
𝑃 (𝑦0 |𝑑𝑜(𝑥0 )) = 1 − 𝑃 (𝑦1 |𝑑𝑜(𝑥0 )) = 1 − (𝑟𝑝3 + (1 − 𝑟)𝑝1 )
Part (b)
By the adjustment formula, Eq. (3.6), we have the same as in (a):
𝑃 (𝑦1 |𝑑𝑜(𝑥1 )) = 𝑃 (𝑦1 |𝑥1 , 𝑧1 )𝑃 (𝑧1 ) + 𝑃 (𝑦1 |𝑥1 , 𝑧0 )𝑃 (𝑧0 ) = 𝑟𝑝4 + (1 − 𝑟)𝑝2
𝑃 (𝑦1 |𝑑𝑜(𝑥0 )) = 𝑃 (𝑦1 |𝑥0 , 𝑧1 )𝑃 (𝑧1 ) + 𝑃 (𝑦1 |𝑥0 , 𝑧0 )𝑃 (𝑧0 ) = 𝑟𝑝3 + (1 − 𝑟)𝑝1
𝑃 (𝑦0 |𝑑𝑜(𝑥1 )) = 1 − (𝑟𝑝4 + (1 − 𝑟)𝑝2 )
𝑃 (𝑦0 |𝑑𝑜(𝑥0 )) = 1 − (𝑟𝑝3 + (1 − 𝑟)𝑝1 )
Part (c)
To find the ACE, we simply substitute our computations from (a) into the ACE formula, giv-
ing us:
Comparing the expressions for ACE and RD, we see that when 𝑟 = 0, 𝑞1 ̸= 0, 𝑞1 ̸= 1, the
ACE is equivalent to RD, namely:
𝐴𝐶𝐸 − 𝑅𝐷 = 𝑝2 − 𝑝1 − 𝑝2 + 𝑝1 = 0
Part (d)
Using our answer to study question 1.5.2(c), we note that the following parameterization will
yield Simpson’s reversal:
𝑝1 = 0.1, 𝑝2 = 0, 𝑝3 = 0.3, 𝑝4 = 0.2, 𝑞1 = 0, 𝑞2 = 1, 𝑟 = 0.1
Recall that this parameterization yields Simpson's reversal because, for each value of z, the difference P(y1|x1, z) − P(y1|x0, z) has the same sign as the ACE, while the aggregate difference P(y1|x1) − P(y1|x0) has the opposite sign. We also know that, because Z is the only confounder, to determine the overall causal effect of the drug on recovery we are interested in its average influence across all Z-specific conditions. Using the ACE from part (c) above, which consults the segregated data, we have:
ACE = r(p4 − p3) + (1 − r)(p2 − p1) = 0.1(0.2 − 0.3) + 0.9(0 − 0.1) = −0.1
which agrees in sign with the two z-specific differences (−0.1 each), whereas the aggregate difference (+0.1) points the wrong way.
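A numeric sketch (ours) of the interventional contrast for these parameters:

```python
p1, p2, p3, p4 = 0.1, 0.0, 0.3, 0.2
r = 0.1

# ACE via the adjustment formula of parts (a)/(b):
p_do_x1 = r * p4 + (1 - r) * p2
p_do_x0 = r * p3 + (1 - r) * p1
print(p_do_x1 - p_do_x0)   # -0.1, opposite in sign to the aggregate difference (+0.1)
```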
(Figure: edges B → A, A → X, B → Z, C → Z, C → D, Z → X, Z → Y, D → Y, X → W, W → Y)
Figure 3.8: Causal graph used to illustrate the backdoor criterion in the following study
questions
(c) List all minimal sets of variables that need be measured in order to identify the effect of
𝐷 on 𝑌 . Repeat, for the effect of {𝑊, 𝐷} on 𝑌 .
Part (a)
Let us use the abbreviation "backdoor admissible" to denote a set of variables that satisfy
the backdoor criterion of Definition 3.3.1; for the present model, a backdoor admissible set
Z blocks all spurious paths between 𝑋 and 𝑌 while leaving all directed paths from 𝑋 to 𝑌
open. We can easily verify that the following sets satisfy the backdoor criterion to determine
the causal effect of 𝑋 on 𝑌 .
1. Sets of 2 nodes: {𝑍, 𝐴}, {𝑍, 𝐵}, {𝑍, 𝐶}, {𝑍, 𝐷}
2. Sets of 3 nodes: {𝑍, 𝐴, 𝐵}, {𝑍, 𝐴, 𝐶}, {𝑍, 𝐴, 𝐷}, {𝑍, 𝐵, 𝐶}, {𝑍, 𝐵, 𝐷}, {𝑍, 𝐶, 𝐷}
3. Sets of 4 nodes: {𝑍, 𝐴, 𝐵, 𝐶}, {𝑍, 𝐴, 𝐵, 𝐷}, {𝑍, 𝐴, 𝐶, 𝐷}, {𝑍, 𝐵, 𝐶, 𝐷}
4. Sets of 5 nodes: {𝑍, 𝐴, 𝐵, 𝐶, 𝐷}
Part (b)
According to (a), the following 4 sets are minimal, since in every other set, a node could be
removed and still ensure that the backdoor criterion is satisfied:
{𝑍, 𝐴}, {𝑍, 𝐵}, {𝑍, 𝐶}, {𝑍, 𝐷}.
Part (c)
For identifying the effect of 𝐷 on 𝑌 :
We want to find a set Z that blocks all backdoor paths from 𝐷 to 𝑌 . Notice that the set {𝐶} is
one solution, so any other set that contains 𝐶 is not minimal. Also, if the set does not contain
𝐶, then it must contain 𝑍, otherwise, we have an open path 𝑌 ← 𝑍 ← 𝐶 → 𝐷. However,
by including 𝑍, the backdoor path 𝑌 ← 𝑊 ← 𝑋 ← 𝐴 ← 𝐵 → 𝑍 ← 𝐶 → 𝐷 is open. To
block this path, we add any of 𝐴, 𝐵, 𝑋, or 𝑊 . So, there are a total of 5 minimal sets:
{𝐶}, {𝑍, 𝐴}, {𝑍, 𝐵}, {𝑍, 𝑋}, {𝑍, 𝑊 }.
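These backdoor claims can be verified programmatically via d-separation. A sketch (ours), assuming a NetworkX version (2.8 or later) that provides nx.d_separated:

```python
import networkx as nx

edges = [("B", "A"), ("A", "X"), ("B", "Z"), ("C", "Z"), ("C", "D"),
         ("Z", "X"), ("Z", "Y"), ("D", "Y"), ("X", "W"), ("W", "Y")]
g = nx.DiGraph(edges)

def backdoor(g, x, y, s):
    # S satisfies the backdoor criterion for (x, y) if it contains no
    # descendant of x and d-separates x from y once x's outgoing edges are cut.
    if s & nx.descendants(g, x):
        return False
    g_cut = g.copy()
    g_cut.remove_edges_from(list(g.out_edges(x)))
    return nx.d_separated(g_cut, {x}, {y}, s)

for s in [{"Z"}, {"Z", "A"}, {"Z", "B"}, {"Z", "C"}, {"Z", "D"}]:
    print("X->Y", sorted(s), backdoor(g, "X", "Y", s))   # {Z} alone fails
print("D->Y", ["C"], backdoor(g, "D", "Y", {"C"}))       # True
```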
Figure 3.9: Scatter plot with students’ initial weights on the 𝑥-axis and final weights on the
𝑦-axis. The vertical line indicates students whose initial weights are the same, and whose
final weights are higher (on average) for plan B compared with plan A
48 STUDY QUESTIONS AND SOLUTIONS FOR CHAPTER 3
So, the first statistician concluded that there was no effect of diet on weight gain and the
second concluded there was.
Figure 3.9 illustrates data sets that can cause the two statisticians to reach conflicting
conclusions. Statistician-1 examined the weight gain 𝑊𝐹 − 𝑊𝐼 which, for each student, is
represented by the shortest distance to the 45∘ line. Indeed, the average gain for each diet
plan is zero; the two groups are each situated symmetrically relative to the zero-gain line,
𝑊𝐹 = 𝑊𝐼 . Statistician-2, on the other hand, compared the final weights of plan 𝐴 students
to those of plan-𝐵 students who entered school with the same initial weight 𝑊0 and, as the
vertical line in the figure indicates, plan 𝐵 students are situated above plan 𝐴 students along
this vertical line. The same will be the case for any other vertical line, regardless of 𝑊0 .
Part (a)
First, let us configure the variables we’ll use in our model. Let 𝑋 be the students’ meal
plan choice, 𝑊𝐹 be the students’ final weights, and 𝑊𝐼 be the students’ initial weights.
We hypothesize that a student’s initial weight influences both their choice of meal plan and
their final weight. Additionally, meal plan influences final weight. So, the causal graph is as
follows:
(Graph: W_I → X, W_I → W_F, X → W_F)
Part (b)
The second statistician is correct, since the initial weight W_I is the common cause of plan choice X and final weight W_F. As such, when we estimate the effect of X on W_F, we should segregate the data on initial weight. Also, W_I satisfies the backdoor criterion for determining the causal effect of X on W_F. The first statistician mistakenly used the aggregated data, failing to account for the confounder W_I.
Statistician 1’s argument sounds compelling only because it is expressed in terms of the
gain 𝐺 = 𝑊𝐹 − 𝑊𝐼 , which people perceive to be the quantity of interest, and which Figure
3.9 clearly shows to have the same mean in Diet A as in Diet B. However, once we add 𝐺 to
the graph (see below) the error in this argument becomes clear: to compute the effect of 𝑋
on 𝐺 we still need to adjust for 𝑊𝐼 .
(Graph: W_I → X, W_I → W_F, X → W_F, with W_I → G ← W_F, where G = W_F − W_I)
Part (c)
Comparing the model in Part (a) to the standard model in Simpson’s paradox (e.g., Fig. 1.8)
we see that the structures of the two models are the same, with a slightly different causal story.
The difference is that, in Simpson’s paradox, we have complete reversal while in Lord’s para-
dox, we are going from inequality to equality. To visualize this transition, we can examine the
𝑊𝐼 -specific distributions of 𝑊𝐹 for each of the diets, and ask whether the two distributions
differ. This we do by projecting the samples corresponding to an initial weight W0 onto the W_F axis, as shown in the graph below.
We see that for individuals having the same initial weight, 𝑊0 , their final weight will be
higher in Plan B than in Plan A (on the average). The distributions corresponding to the two
scatter plots are shifted. On the other hand, if we project the two scatter plots onto the G axis,
the two distributions coincide. Thus, the segregated data (conditioned on 𝑊𝐼 ) yields prefer-
ence of one diet over the other, while the unsegregated data (unconditioned on 𝑊𝐼 ) claims
equality for the two diets.
In Simpson’s paradox, on the other hand, we encounter sign reversal as we go from the
segregated to the unsegregated data. This is shown, for example, in the age-specific slopes of
Figure 1.1 in the text, which have opposite sign to the slope of the aggregated ellipse.
(b) Determine which variables need to be adjusted for by applying the backdoor criterion.
(c) Write the adjustment formula for the effect of the drug on recovery.
(d) Repeat questions (a)–(c) assuming that the nurse gave lollipops a day after the study,
still preferring patients who received treatment over those who received placebo.
Students who wish to study Simpson’s paradox in more detail can use the interactive
simulator of the “Simpson Machine” at dagitty.net/learn/simpson/
Part (a)
As with any modeling problem, we begin by formalizing our variable choices. Let 𝑋 indicate
Treatment receipt, 𝑍 indicate Lollipop receipt, 𝑌 indicate Recovery, and 𝑈1 , 𝑈2 indicate 2
unobserved factors that correlate 𝑍 with 𝑋 and 𝑌 , respectively. The causal graph will be:
(Graph: U1 → X, U1 → Z, U2 → Z, U2 → Y, and X → Y)
Part (b)
By Definition 3.3.1, the backdoor criterion, to estimate the effect of X on Y, we need not
adjust for any variable, since 𝑈1 → 𝑍 ← 𝑈2 is a collider that is closed when 𝑍 is not given.
As we discussed in the solution to study question 1.2.4, the structure of the graph permits us
to skip considerations of exchangeability (i.e., comparing apples and oranges), and get the
answer mechanically and reliably.
Part (c)
According to (b), since we need not adjust for any covariates to block any spurious paths, we
may simply say that: 𝑃 (𝑦|𝑑𝑜(𝑥)) = 𝑃 (𝑦|𝑥)
Part (d)
Our answers do not change; timing of the Lollipop receipt does not change the causal struc-
ture of the model, as long as receiving a Lollipop is assumed to have no effect on either
treatment or outcome. In other words, 𝑍 is not an ancestor of either 𝑋 or 𝑌 .
Variable 𝑊 satisfies the Front-Door criterion in Definition 3.4.1, so 𝑊 would allow the
identification of the effect of 𝑋 on 𝑌 , namely:
𝑃 (𝑦|𝑑𝑜(𝑥)) = Σ𝑤 𝑃 (𝑤|𝑥)Σ𝑥′ 𝑃 (𝑦|𝑥′ , 𝑤)𝑃 (𝑥′ )
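To make the formula concrete, here is a minimal sketch (ours) for binary variables: the joint P(x, w, y) is synthesized from a toy SCM with made-up parameters (none of these numbers come from the text), and the front-door estimate is compared with the true interventional probability:

```python
from itertools import product

pu = {1: 0.4, 0: 0.6}             # P(u): unobserved confounder
px_u = {1: 0.8, 0: 0.3}           # P(x=1 | u)
pw_x = {1: 0.7, 0: 0.2}           # P(w=1 | x): mediator
py_wu = {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.5, (0, 0): 0.1}  # P(y=1 | w, u)

def bern(p, v):
    return p if v == 1 else 1 - p

joint = {}
for x, w, y in product((0, 1), repeat=3):
    joint[x, w, y] = sum(pu[u] * bern(px_u[u], x) * bern(pw_x[x], w)
                         * bern(py_wu[w, u], y) for u in (0, 1))

def front_door(joint, x):
    p_x = {a: sum(joint[a, w, y] for w in (0, 1) for y in (0, 1)) for a in (0, 1)}
    total = 0.0
    for w in (0, 1):
        p_w_given_x = sum(joint[x, w, y] for y in (0, 1)) / p_x[x]
        inner = sum(joint[xp, w, 1] / sum(joint[xp, w, y] for y in (0, 1)) * p_x[xp]
                    for xp in (0, 1))
        total += p_w_given_x * inner
    return total

def truth(x):  # P(y=1 | do(x)) computed from the generating SCM itself
    return sum(pu[u] * bern(pw_x[x], w) * py_wu[w, u]
               for u in (0, 1) for w in (0, 1))

print(front_door(joint, 1), truth(1))   # the two agree
print(front_door(joint, 0), truth(0))
```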
chemical Potency, which in turn affects Recovery. We also hypothesize the existence of an unobserved confounder that makes people who are influenced by Price more likely to have health problems. The graph is as follows:
(Graph: X → Z → Y, with an unobserved confounder U: X ← U → Y)
Part (ii)
(Hint: the data are the same as in Table 3.2 in the book, and they share the same causal relations and corresponding graphical model.)
Part (iii)
Using (ii) above, our knowledge that 𝑍 satisfies the Front-Door criterion, and Theorem 3.4.1,
we can compare the causal effects of the two drug types to see which is superior. Let 𝑦1 denote
recovery, 𝑥1 denote choosing the expensive drug, and 𝑧1 denote high chemical content.
We can rationalize these findings by first remembering that our data is observational, and
that an unobserved confounder between drug choice and recovery might create the illusion
that the cheap drug is more effective. This would happen, for instance, if the more frugal
customers are also on a more healthy diet. To eliminate such illusions we must evaluate the
ACE which, given our story, is obtained from the front door formula.
We can also foster an intuition for how the front-door formula is useful. First, we can com-
pute the causal effect of the drug’s active ingredients on recovery. Let 𝑝 (respectively, 𝑝′ )
represent the recovery probability of a randomly chosen person forced to take a drug high
(respectively, low) in the active ingredients, i.e.,
𝑝 = 𝑃 (𝑌 = 𝑟𝑒𝑐𝑜𝑣𝑒𝑟|𝑑𝑜(𝑍 = ℎ𝑖𝑔ℎ))
𝑝′ = 𝑃 (𝑌 = 𝑟𝑒𝑐𝑜𝑣𝑒𝑟|𝑑𝑜(𝑍 = 𝑙𝑜𝑤))
Now that we know the causal effect of the ingredient, we reason as follows: If I choose a cheap bottle, I stand a 5% chance of getting a good bottle, with recovery probability p, and a 95% chance of getting a bad bottle, with recovery probability p′. Thus, the average probability of recovery on choosing a cheap bottle is .05p + .95p′. Things turn around if I buy an expensive bottle, giving me an average probability of recovery of .05p′ + .95p. Thus the difference between the two choices amounts to:
(.05p′ + .95p) − (.05p + .95p′) = .90(p − p′)
The graph of this exercise is available at dagitty.net/m331. Students can solve parts (a)
and (b) interactively in class by forcing adjustment for single covariates (move mouse
pointer over the variable and press "a" key).
(Figure: the same graph as Figure 3.8, with edges B → A, A → X, B → Z, C → Z, C → D, Z → X, Z → Y, D → Y, X → W, W → Y)
Part (a)
By Rule 2, we must adjust for a set of variables that satisfies the backdoor criterion, conditional on C. We observe that when we condition on C, there is still an open backdoor path X ← Z → Y, which we can block by conditioning on Z. So, we may claim that:
P(Y = y|do(X = x), C = c) = Σ_z P(Y = y|X = x, Z = z, C = c)P(Z = z|C = c)
Above, Rule 2 is applicable because the set {Z, C} satisfies the backdoor criterion to assess the c-specific effect of X on Y.
Part (b)
Again using Rule 2, we see that {𝑋, 𝑌, 𝑍, 𝐶} is such a set since {𝑍, 𝐶} satisfies the backdoor
criterion. We can then write:
∑︀
𝑃 (𝑌 = 𝑦|𝑑𝑜(𝑋 = 𝑥), 𝑍 = 𝑧) = 𝑐 𝑃 (𝑦|𝑥, 𝑧, 𝑐)𝑃 (𝑐)
Advanced students may be challenged to show that {𝑋, 𝑌, 𝑍, 𝑊 } is also such a set, since 𝑊
satisfies the front-door criterion when 𝑍 is specified.
Part (c)
Since our choice of X relies upon the value of Z, we need to adopt the conditional policy do(X = g(Z)), where:
g(Z) = 0 if Z ≤ 2, and g(Z) = 1 if Z > 2
(Figure 3.10: a linear path model with edges and parameters Z1 → W1 (a1), Z1 → Z3 (a3), Z2 → Z3 (b3), Z2 → W2 (c2), W1 → X (t1), Z3 → X (t2), X → W3 (c3), W3 → Y (a), Z3 → Y (b), W2 → Y (c))
Part (b)
The only conditional independence that involves the measured variables is the one between
𝑍3 and 𝑊3 given 𝑋, which leads to 𝑟𝑍3 = 0 in the corresponding regression equation:
𝑊3 = 𝑟𝑍3 𝑍3 + 𝑟𝑋 𝑋 with 𝑟𝑍3 = 0
Part (c)
(i) If we regress a variable on its parents, we get a regression equation whose coefficients
equal the model parameters. Therefore:
1. a = r_{W3}, b = r_{Z3}, c = r_{W2} in the equation:
Y = r_{W3} W3 + r_{Z3} Z3 + r_{W2} W2
2. a1 = r_{Z1} in:
W1 = r_{Z1} Z1
3. a3 = r_{Z1}, b3 = r_{Z2} in:
Z3 = r_{Z1} Z1 + r_{Z2} Z2
4. c2 = r_{Z2} in:
W2 = r_{Z2} Z2
5. c3 = r_X in:
W3 = r_X X
6. t1 = r_{W1}, t2 = r_{Z3} in:
X = r_{W1} W1 + r_{Z3} Z3
(ii) The "Regression Rule for Identification" tells us that, if 𝐺𝛼 has several backdoor sets,
each would lead to a regression equation in which 𝛼 is a coefficient. Therefore, 𝑎, 𝑏, 𝑐 can be
identified by:
58 STUDY QUESTIONS AND SOLUTIONS FOR CHAPTER 3
𝑋 = 𝑟𝑊1 𝑊1 + 𝑟𝑍2 𝑍2
𝑋 = 𝑟𝑊1 𝑊1 + 𝑟𝑍2 𝑍3
Part (d)
To determine which parameters are estimable from data, we consult "The Regression Rule
for Identification." For example, the parameter 𝑐3 can be estimated from data because
𝑊3 = 𝑟𝑋 𝑋 + 𝑈3′ = 𝑐3 𝑋 + 𝑈3′ , since 𝑊3 is 𝑑-separated from 𝑌 given 𝑋 in 𝐺𝑊3 . Like-
wise, 𝑎 = 𝑟𝑌 𝑊3 ·𝑋 .
Lastly, we note that 𝑊3 is a front-door admissible variable for attaining the total effect of 𝑋
on 𝑌 , and so the effect is estimable. Indeed the total effect of 𝑋 on 𝑌 is simply the product
of 𝑎 * 𝑐3 , which we identified above.
Part (e)
Regressing Z1 on all other variables in the model gives:

Z1 = r_Z2 Z2 + r_Z3 Z3 + r_X X + r_W1 W1 + r_W2 W2 + r_W3 W3 + r_Y Y

By d-separation, we see that Z1 is independent of {X, W3, Y, W2} given {W1, Z3, Z2}.
Therefore, r_X = r_W2 = r_W3 = r_Y = 0.
Part (f)
In order for a coefficient to remain invariant under the addition of a new regressor,
the dependent variable must be independent of the added regressor given all of the old
regressors.
Thus, for example, if we regress W1 on Z3 and X, adding W3 as a regressor would keep all
regression coefficients intact, but adding Y or Z2 would change them, because the path
Y ← W2 ← Z2 → Z3 ← Z1 → W1 is opened by conditioning on Z3. If we regress W1 on Z1,
then we can add Z3, Z2, or W2 without changing the regression coefficient.
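A simulation sketch of these claims follows. The structural equations below are read off Figure 3.10 as reconstructed above, with every path coefficient set to an arbitrary illustrative value; treat this as an illustration, not as the text's specification.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
a1, a3, b3, c2, t1, t2, c3, a, b, c = 0.5, 0.6, 0.7, 0.4, 0.3, 0.5, 0.6, 0.7, 0.4, 0.5

# Linear SEM assumed to match Figure 3.10
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
w1 = a1 * z1 + rng.normal(size=n)
z3 = a3 * z1 + b3 * z2 + rng.normal(size=n)
w2 = c2 * z2 + rng.normal(size=n)
x = t1 * w1 + t2 * z3 + rng.normal(size=n)
w3 = c3 * x + rng.normal(size=n)
y = a * w3 + b * z3 + c * w2 + rng.normal(size=n)

def coefs(target, *regressors):
    """Least-squares coefficients of target on the given regressors."""
    beta, *_ = np.linalg.lstsq(np.column_stack(regressors), target, rcond=None)
    return beta

print(coefs(w1, z3, x))           # baseline coefficients
print(coefs(w1, z3, x, w3)[:2])   # ~unchanged: W1 indep. of W3 given {Z3, X}
print(coefs(w1, z3, x, y)[:2])    # changed: adding Y opens Y<-W2<-Z2->Z3<-Z1->W1
```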
Part (g)
Note that if we condition on W1, we turn Z1 into an instrument relative to the effect τ of Z3
on Y. Using this idea, we can write the coefficient of Z1 in the regression of Y on {Z1, W1}
as the product τ a3, where τ = t2 c3 a + b. Since each of t2, c3, and a can be separately
identified (see Parts (c) and (d) above), we can then solve for b. Formally, we have:

t2 c3 a + b = r_Z1 / r′_Z1

where r_Z1 and r′_Z1 are the regression coefficients in the following equations:

Y = r_Z1 Z1 + r_W1 W1 + ε
Z3 = r′_Z1 Z1 + ε
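A numeric sketch of this identity, using the same reconstructed Figure 3.10 equations (and the same arbitrary parameter values) as the sketch under Part (f):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
a1, a3, b3, c2, t1, t2, c3, a, b, c = 0.5, 0.6, 0.7, 0.4, 0.3, 0.5, 0.6, 0.7, 0.4, 0.5

z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
w1 = a1 * z1 + rng.normal(size=n)
z3 = a3 * z1 + b3 * z2 + rng.normal(size=n)
w2 = c2 * z2 + rng.normal(size=n)
x = t1 * w1 + t2 * z3 + rng.normal(size=n)
w3 = c3 * x + rng.normal(size=n)
y = a * w3 + b * z3 + c * w2 + rng.normal(size=n)

# r_Z1: coefficient of Z1 in the regression of Y on {Z1, W1}
r_z1 = np.linalg.lstsq(np.column_stack([z1, w1]), y, rcond=None)[0][0]
# r'_Z1: coefficient of Z1 in the regression of Z3 on Z1
r_z1p = np.linalg.lstsq(z1[:, None], z3, rcond=None)[0][0]

print(r_z1 / r_z1p)       # ~tau = t2*c3*a + b
print(t2 * c3 * a + b)    # 0.5*0.6*0.7 + 0.4 = 0.61
```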
Study Questions and Solutions for
Chapter 4
[Figure 4.3: A model representing Eq. (4.7), illustrating the causal relations between college
education (X), skills (Z), and salary (Y); X → Z with coefficient a, Z → Y with coefficient b,
and exogenous terms U1 → X, U2 → Z.]
(a) Find the expected salary of workers at skill level 𝑍 = 𝑧 had they received 𝑥 years of
college education. [Hint: Use Theorem 4.3.2, with 𝑒 : 𝑍 = 𝑧, and the fact that for any two
Gaussian variables, say X and Z, we have E[X|Z = z] = E[X] + R_XZ (z − E[Z]).
Use the material in Sections 3.8.2 and 3.8.3 to express all regression coefficients in terms
of structural parameters, and show that 𝐸[𝑌𝑥 |𝑍 = 𝑧] = 𝑎𝑏𝑥 + 𝑏𝑧/(1 + 𝑎2 ).]
(b) Based on the solution for (a), show that the skill-specific effect of education on salary is
independent of the skill level.
Part (a)
[Model: X → Z → Y with path coefficients a (X → Z) and b (Z → Y), and exogenous terms
U1 → X, U2 → Z.]

By Theorem 4.3.2, E[Y_x | Z = z] = E[Y | Z = z] + T(x − E[X | Z = z]), where T = ab is the
total effect of X on Y and E[Y | Z = z] = bz. Since the exogenous terms have unit variance,

E[X | Z = z] = β_XZ z = β_ZX (σ²_X / σ²_Z) z = a (1 / Var(aX + U2)) z = az / (1 + a²)

which gives

E[Y_x | Z = z] = bz + ab(x − az / (1 + a²)) = abx + bz / (1 + a²)
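A quick simulation check of this result, assuming unit-variance Gaussian exogenous terms, an independent error u3 on Y, and illustrative values of a and b (the tolerance-based conditioning on Z = z is, of course, only approximate):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000_000
a, b = 0.8, 0.5
x_int, z_obs = 1.0, 0.7            # intervention X = x and observed skill Z = z

u1, u2, u3 = rng.normal(size=(3, n))
X = u1
Z = a * X + u2                     # observed skill
# Counterfactual salary had education been set to x_int, keeping each unit's u2, u3
Y_x = b * (a * x_int + u2) + u3

mask = np.abs(Z - z_obs) < 0.01    # approximate conditioning on Z = z
print(Y_x[mask].mean())                          # simulated E[Y_x | Z = z]
print(a * b * x_int + b * z_obs / (1 + a**2))    # closed-form abx + bz/(1+a^2)
```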
[Figure 4.1: A model depicting the effect of Encouragement (X) on a student's Exam Score (Y) through Homework (H). Figure 4.2: Answering a counterfactual question about a specific student's score, predicated on the assumption that homework would have increased to H = 2; the path coefficients shown include b = 0.7 and c = 0.4.]

Part (b)
By Theorem 4.3.2 and the solution to Part (a), we know that E[Y_x | Z = z] = abx + bz/(1 + a²).
Hence the skill-specific effect of raising education from x to x′ is E[Y_x′ | Z = z] − E[Y_x | Z = z] = ab(x′ − x),
which does not depend on the skill level z.
Part (c)
First, we define the model 𝑀 as follows:
𝑀:
𝐻 = 𝑈𝐻
𝑋 = 𝑎𝐻 + 𝑈𝑋
𝑌 = 𝑏𝑋 + 𝑐𝐻 + 𝛿𝑋𝐻 + 𝑈𝑌
From 𝑀 , we may also describe the mutilated model 𝑀𝑥 , representing the intervention
𝑋 = 𝑥, as given by:
𝑀𝑥 :
𝐻 = 𝑈𝐻
𝑋=𝑥
𝑌 = 𝑏𝑥 + 𝑐𝐻 + 𝛿𝑥𝐻 + 𝑈𝑌
So, by definition, the counterfactual quantities needed for the ETT can be computed from
the mutilated model M_x via Eq. (4.5).
(b) Apply the result of Question (a) to Simpson’s story with the nonexperimental data of
Table 1.1, and estimate the effect of treatment on those who used the drug by choice.
[Hint: Estimate 𝐸[𝑌𝑥 ] assuming that gender is the only confounder.]
(c) Repeat Question (b) using the fact that Z in Figure 3.3 satisfies the backdoor criterion.
Show that the answers to (b) and (c) coincide.
[Figure 3.3: A graphical model representing the effects of a new drug, with Z representing
gender, X standing for drug usage, and Y standing for recovery; Z → X, Z → Y, and X → Y,
with exogenous terms UZ, UX, UY.]
Part (a)

By the consistency axiom, E[Y_x | X = x] = E[Y | X = x]. Therefore:

E[Y_x] = E[Y | X = x]P(X = x) + E[Y_x | X = x′]P(X = x′)
The consistency axiom intuitively follows from the notion that a counterfactual predicated on
an actual observation is not counterfactual (here we observed 𝑋 = 𝑥 and were hypothesizing
about 𝑌𝑥 ). The term 𝐸[𝑌 |𝑋 = 𝑥]𝑃 (𝑋 = 𝑥) is already estimable from observational data, so
it remains to address the other term. Solving for E[Y_x | X = x′] gives:

E[Y_x | X = x′] = (E[Y_x] − E[Y | X = x]P(X = x)) / P(X = x′)
Now, substituting back into our equation for the ETT:

ETT = E[Y_x − Y_x′ | X = x]
    = E[Y | X = x] − E[Y_x′ | X = x]
    = E[Y | X = x] − (E[Y | do(X = x′)] − E[Y | X = x′]P(X = x′)) / P(X = x)
We see that the effect of treatment on the treated can be estimated from a combination of
non-experimental data (the do-free expressions) and experimental data (the expression
containing the do-operator).
Part (b)
Because Gender is the only confounder, we can adjust for it to obtain E[Y | do(X = x′)] and
then substitute the values of Table 1.1 into the result of Part (a), as sketched below.
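A minimal numeric sketch of both routes, assuming the Table 1.1 counts from the text (81/87 and 192/263 recoveries with the drug among men and women, respectively, and 234/270 and 55/80 without it):

```python
# ETT for Simpson's drug example, assuming the Table 1.1 counts
n   = {('m', 1): 87, ('f', 1): 263, ('m', 0): 270, ('f', 0): 80}   # at-risk counts
rec = {('m', 1): 81, ('f', 1): 192, ('m', 0): 234, ('f', 0): 55}   # recoveries

total = sum(n.values())                        # 700 patients
p_x1 = sum(n[z, 1] for z in 'mf') / total      # P(X = 1) = 0.5
E_y = {k: rec[k] / n[k] for k in n}            # E[Y | Z = z, X = x]
E_y_x1 = sum(rec[z, 1] for z in 'mf') / sum(n[z, 1] for z in 'mf')
E_y_x0 = sum(rec[z, 0] for z in 'mf') / sum(n[z, 0] for z in 'mf')

# Route (b): adjust for gender to get E[Y_0] = E[Y | do(X = 0)],
# then plug into the formula derived in Part (a).
p_z = {z: (n[z, 1] + n[z, 0]) / total for z in 'mf'}
E_y0 = sum(E_y[z, 0] * p_z[z] for z in 'mf')
ett_b = E_y_x1 - (E_y0 - E_y_x0 * (1 - p_x1)) / p_x1

# Route (c): Z satisfies the backdoor criterion, so
# E[Y_0 | X = 1] = sum_z E[Y | X = 0, z] P(z | X = 1).
p_z_x1 = {z: n[z, 1] / sum(n[w, 1] for w in 'mf') for z in 'mf'}
ett_c = sum((E_y[z, 1] - E_y[z, 0]) * p_z_x1[z] for z in 'mf')

print(ett_b, ett_c)   # both ~0.048: the two routes coincide, as part (c) requires
```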
Study question 4.3.4. Joe has never smoked before but, as a result of peer pressure and
other personal factors, he decided to start smoking. He buys a pack of cigarettes, comes home,
and asks himself: "I am about to start smoking, should I?"
(a) Formulate Joe’s question mathematically, in terms of ETT, assuming that the outcome of
interest is lung cancer.
(b) What type of data would enable Joe to estimate his chances of getting cancer given that
he goes ahead with the decision to smoke, versus refraining from smoking?
(c) Use the data in Table 3.1 to estimate the chances associated with the decision in (b).
Part (a)
Let 𝑌 stand for lung cancer, with 𝑌 = 1 denoting that it is present. Let 𝑋 be Joe’s choice,
with X = 1 indicating that he has decided to start smoking. So, we want to compute the
effect of treatment on the treated to determine whether Joe should start smoking given that
he was about to, or specifically: ETT = E[Y1 − Y0 | X = 1]. If ETT > 0, it means that
smoking yields a higher chance of lung cancer for Joe (given that he was about to start
smoking) than refraining from smoking does.
Part (b)
Referencing our findings from study question 4.4.1, if we can find a set of variables that
satisfies the backdoor or front-door criterion for the effect of X on Y, then we only need non-
experimental data (see study question 4.4.1c); otherwise, we need both non-experimental and
experimental data (see study question 4.4.1b).
Part (c)
First, let us recall the graphical model associated with this problem from Figure 3.10 in the
book, wherein 𝑍 indicates the presence of Tar deposits in the lung, and 𝑈 , an unmeasured
Genotype that influences individuals to both smoke and get cancer:
[Model: X → Z → Y, with the latent genotype U influencing both X and Y.]
Since 𝑍 satisfies the front-door criterion, by Theorem 3.4.1, and again referencing Table 3.1,
we have:
E[Y | do(X = 0)] = Σ_z P(Z = z | X = 0) Σ_x′ P(Y = 1 | X = x′, Z = z) P(X = x′) = 0.5025
ETT = E[Y1 − Y0 | X = 1]
    = E[Y | X = 1] − E[Y0 | X = 1]
    = E[Y | X = 1] − (E[Y | do(X = 0)] − E[Y | X = 0]P(X = 0)) / P(X = 1)
    = 0.15 − (0.5025 − 0.9025 × 0.5) / 0.5
    = 0.0475 > 0
Since our ETT is greater than 0, we know that the chance of cancer if Joe smokes (given
that he was about to start smoking) is greater than the chance had he refrained. Therefore,
Joe should refrain from smoking.
In this solution, we relied on two assumptions: (1) X is binary, and (2) the model satisfies
the front-door criterion. A more advanced analysis shows that assumption (2) alone is
sufficient for estimating the ETT; X need not be binary once we have a front-door structure
(Shpitser and Pearl, 2009).
First, according to the problem description, let X represent treatment (with x′ representing
lumpectomy alone and x representing Ms. Jones' decision: lumpectomy plus radiation) and
Y represent recovery (with y′ representing recurrence of cancer, and y representing the out-
come for Ms. Jones: no recurrence). We are also given that:
𝑃 (𝑦 ′ ) = 0.3
𝑃 (𝑥′ |𝑦 ′ ) = 0.7
𝑃 (𝑦|𝑑𝑜(𝑥)) = 0.39
𝑃 (𝑦|𝑑𝑜(𝑥′ )) = 0.14
Our goal is to determine whether Ms. Jones' decision was necessary for remission. So, we'll
check whether PN exceeds 0.5 ("more probable than not"), using the lower bound afforded
by Eq. (4.30).
PN ≥ (P(y) − P(y | do(x′))) / P(x, y)
   = (P(y) − P(y | do(x′))) / (P(y | x)P(x))
   = (P(y) − P(y | do(x′))) / ((1 − P(x | y′)P(y′) / P(x)) P(x))
   = (P(y) − P(y | do(x′))) / (P(x) − P(x | y′)P(y′))
At this point we consider whether we have all of the components for our computation, and
see that we can find all parameters except for 𝑃 (𝑥). However, because we are computing a
lower bound for the PN, we can consider the parameterization that would yield its smallest
value, namely, when the denominator is as large as possible with 𝑃 (𝑥) = 1. So, with this
assumption, we write:
PN ≥ (P(y) − P(y | do(x′))) / (P(x) − P(x | y′)P(y′))
   ≥ (0.7 − 0.14) / (1 − (1 − 0.7) × 0.3)
   ≈ 0.62 > 0.5
Since PN is greater than 0.5, we conclude that Ms. Jones’ decision was likely necessary for
remission.
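As a minimal arithmetic check of the bound (no new assumptions beyond the quantities listed above):

```python
# Lower bound on PN with P(x) set to its most conservative value, 1
p_y     = 1 - 0.3          # P(y) = 1 - P(y')
p_do_xp = 0.14             # P(y | do(x'))
p_x_yp  = 1 - 0.7          # P(x | y') = 1 - P(x' | y')
p_yp    = 0.3              # P(y')

pn_lower = (p_y - p_do_xp) / (1 - p_x_yp * p_yp)
print(pn_lower)            # ~0.615 (i.e., ~0.62) > 0.5
```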
y = β1 m + β2 t + u_y (4.53)
m = γ1 t + u_m (4.54)
(a) Use the basic definition of the natural effects (Eqs. (4.46)–(4.47)) to determine TE, NDE
and NIE.
(b) Repeat (a) assuming that 𝑢𝑦 is correlated with 𝑢𝑚 .
Part (a)

The NDE is defined as the expected increase in Y when the treatment changes from T = 0
to T = 1 while the mediator is held at whatever value it would have attained under T = 0.
By Eq. (4.46), we have:

NDE = E[Y_{1,M0} − Y_{0,M0}]
    = E[Y_{1,M0}] − E[Y_{0,M0}]
= (𝛽1 [𝛾1 * 0 + 𝑢𝑚 ] + 𝛽2 * 1 + 𝑢𝑦 ) − (𝛽1 [𝛾1 * 0 + 𝑢𝑚 ] + 𝛽2 * 0 + 𝑢𝑦 )
= 𝛽2 − 0
= 𝛽2
Similarly, the NIE is defined as the expected increase in 𝑌 when the treatment is held constant
at 𝑇 = 0 and 𝑀 changes to whatever value it would have attained under 𝑇 = 1. By Eq.
(4.47), we have:
𝑁 𝐼𝐸 = 𝐸[𝑌0,𝑀1 − 𝑌0,𝑀0 ]
= 𝐸[𝑌0,𝑀1 ] − 𝐸[𝑌0,𝑀0 ]
= (𝛽1 [𝛾1 * 1 + 𝑢𝑚 ] + 𝛽2 * 0 + 𝑢𝑦 ) − (𝛽1 [𝛾1 * 0 + 𝑢𝑚 ] + 𝛽2 * 0 + 𝑢𝑦 )
= 𝛾1 𝛽 1 − 0
= 𝛾1 𝛽 1
𝑇 𝐸 = 𝐸[𝑌1 − 𝑌0 ]
= 𝑁 𝐷𝐸 + 𝑁 𝐼𝐸
= 𝛽2 + 𝛾 1 𝛽1
Note that, in this question, we did not have to assume that the treatment is randomized or,
equivalently, that u_t is uncorrelated with u_y or u_m. This is because we have the functional
form of the equations (linear), and we take the structural parameters as given.
Part (b)
The computations will remain the same since none of the above require that 𝑢𝑦 is uncorre-
lated with 𝑢𝑚 . This is because we are dealing with a linear system with given parameters.
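A simulation sketch of this invariance, with illustrative parameter values and strongly correlated errors (the correlation structure is my choice, not the text's):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
beta1, beta2, gamma1 = 0.5, 0.3, 0.8

# Draw correlated errors u_m, u_y (correlation 0.9)
cov = [[1.0, 0.9], [0.9, 1.0]]
u_m, u_y = rng.multivariate_normal([0, 0], cov, size=n).T

def y(t, m):
    """Structural equation for Y, Eq. (4.53), evaluated per unit."""
    return beta1 * m + beta2 * t + u_y

m0 = gamma1 * 0 + u_m      # mediator value had T been 0
m1 = gamma1 * 1 + u_m      # mediator value had T been 1

nde = (y(1, m0) - y(0, m0)).mean()
nie = (y(0, m1) - y(0, m0)).mean()
print(nde, nie)            # ~beta2 = 0.3 and ~gamma1*beta1 = 0.4, despite the correlation
```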
𝑦 = 𝛽1 𝑚 + 𝛽2 𝑡 + 𝛽3 𝑡𝑚 + 𝛽4 𝑤 + 𝑢𝑦 (4.55)
𝑚 = 𝛾1 𝑡 + 𝛾2 𝑤 + 𝑢𝑚 (4.56)
𝑤 = 𝛼𝑡 + 𝑢𝑤 (4.57)
(a) Use the basic definition of the natural effects (Eqs. (4.46) and (4.47)) (treating 𝑀 as
the mediator), to determine the portion of the effect for which mediation is necessary
(𝑇 𝐸 − 𝑁 𝐷𝐸) and the portion for which mediation is sufficient (𝑁 𝐼𝐸). Hint: Show
that:
𝑁 𝐷𝐸 = 𝛽2 + 𝛼𝛽4 (4.58)
𝑁 𝐼𝐸 = 𝛽1 (𝛾1 + 𝛼𝛾2 ) (4.59)
𝑇 𝐸 = 𝛽2 + (𝛾1 + 𝛼𝛾2 )(𝛽3 + 𝛽1 ) + 𝛼𝛽4 (4.60)
𝑇 𝐸 − 𝑁 𝐷𝐸 = (𝛽1 + 𝛽3 )(𝛾1 + 𝛼𝛾2 ) (4.61)
Part (a)

First, we compute the expected values of W and M under each treatment setting:

E[W0] = α · 0 = 0
E[W1] = α · 1 = α
E[M0] = γ1 · 0 + γ2 · 0 = 0
E[M1] = γ1 · 1 + γ2 α = γ1 + γ2 α
Now we can compute our target quantities:
𝑁 𝐷𝐸 = 𝐸[𝑌1,𝑀0 − 𝑌0,𝑀0 ]
= 𝐸[𝑌1,𝑀0 ] − 𝐸[𝑌0,𝑀0 ]
= (𝛽1 * 0 + 𝛽2 * 1 + 𝛽3 * 1 * 0 + 𝛽4 𝛼) − (𝛽1 * 0 + 𝛽2 * 0 + 𝛽3 * 0 * 0 + 𝛽4 * 0)
= 𝛽2 + 𝛼𝛽4
𝑁 𝐼𝐸 = 𝐸[𝑌0,𝑀1 − 𝑌0,𝑀0 ]
= 𝐸[𝑌0,𝑀1 ] − 𝐸[𝑌0,𝑀0 ]
= (𝛽1 [𝛾1 + 𝛾2 𝛼] + 𝛽2 * 0 + 𝛽3 * 0 + 𝛽4 * 0) − (𝛽1 * 0 + 𝛽2 * 0 + 𝛽3 * 0 * 0 + 𝛽4 * 0)
= 𝛽1 (𝛾1 + 𝛼𝛾2 )
𝑇 𝐸 = 𝐸[𝑌1 − 𝑌0 ]
= 𝐸[𝑌1 ] − 𝐸[𝑌0 ]
= (𝛽1 [𝛾1 + 𝛾2 𝛼] + 𝛽2 * 1 + 𝛽3 [𝛾1 + 𝛾2 𝛼] + 𝛽4 𝛼) − (𝛽1 * 0 + 𝛽2 * 0 + 𝛽3 * 0 * 0 + 𝛽4 * 0)
= 𝛽2 + (𝛾1 + 𝛼𝛾2 )(𝛽1 + 𝛽3 ) + 𝛼𝛽4
So, combining our computations from above, we find that the portion of the effect for which
mediation is necessary is TE − NDE = (β1 + β3)(γ1 + αγ2), and the portion for which
mediation is sufficient is NIE = β1(γ1 + αγ2).
Turning now to W as the mediator, we first compute:

E[W0] = α · 0 = 0
E[W1] = α · 1 = α
E[M_{0,W0}] = γ1 · 0 + γ2 · 0 = 0
E[M_{0,W1}] = γ1 · 0 + γ2 α = γ2 α
E[M_{1,W0}] = γ1 · 1 + γ2 · 0 = γ1
E[M_{1,W1}] = γ1 · 1 + γ2 α = γ1 + γ2 α
Once more, using the above to compute our target quantities, we have:
𝑁 𝐷𝐸 = 𝐸[𝑌1,𝑊0 − 𝑌0,𝑊0 ]
= 𝐸[𝑌1,𝑊0 ] − 𝐸[𝑌0,𝑊0 ]
= (𝛽1 [𝛾1 ] + 𝛽2 * 1 + 𝛽3 * 1[𝛾1 ] + 𝛽4 * 0) − (𝛽1 * 0 + 𝛽2 * 0 + 𝛽3 * 0 + 𝛽4 * 0)
= 𝛽1 𝛾 1 + 𝛽2 + 𝛽3 𝛾 1
𝑁 𝐼𝐸 = 𝐸[𝑌0,𝑊1 − 𝑌0,𝑊0 ]
= 𝐸[𝑌0,𝑊1 ] − 𝐸[𝑌0,𝑊0 ]
= (𝛽1 [𝛾2 𝛼] + 𝛽2 * 0 + 𝛽3 * 0 + 𝛽4 * 𝛼) − (𝛽1 * 0 + 𝛽2 * 0 + 𝛽3 * 0 + 𝛽4 * 0)
= 𝛼𝛽1 𝛾2 + 𝛼𝛽4
𝑇 𝐸 = 𝐸[𝑌1 − 𝑌0 ]
= 𝐸[𝑌1 ] − 𝐸[𝑌0 ]
= (𝛽1 [𝛾1 + 𝛾2 𝛼] + 𝛽2 * 1 + 𝛽3 [𝛾1 + 𝛾2 𝛼] + 𝛽4 𝛼) − (𝛽1 * 0 + 𝛽2 * 0 + 𝛽3 * 0 * 0 + 𝛽4 * 0)
= 𝛽2 + (𝛾1 + 𝛼𝛾2 )(𝛽1 + 𝛽3 ) + 𝛼𝛽4
So, combining our computations from above, we know that the portion of the effect for which
mediation is necessary is TE − NDE = αγ2(β1 + β3) + αβ4, and the portion for which
mediation is sufficient is NIE = αβ1γ2 + αβ4.
[Hint: this is precisely the numerical problem on homework mediation as presented in the
text; see Tables 4.5, 4.7 and computations that follow]. Our goal is to find the proportion
of hiring disparity that is due to gender, and the proportion that could be explained by dis-
parity in qualification alone. Using the same strategies as the homework-training program
example: the quantity 𝑁 𝐼𝐸/𝑇 𝐸 tells us what proportion of the disparity is due to qualifica-
tion alone and the quantity 1 − 𝑁 𝐷𝐸/𝑇 𝐸 tells us what proportion of the disparity is due to
gender. Assuming that there exists no unobserved confounding, we’ll compute our quantities
of interest using Eqs. (4.51), (4.52), (4.44):
NDE = Σ_m [E[Y | T = 1, M = m] − E[Y | T = 0, M = m]] P(M = m | T = 0)
NIE = Σ_m E[Y | T = 0, M = m] [P(M = m | T = 1) − P(M = m | T = 0)]
TE = E[Y1 − Y0]
   = E[Y1] − E[Y0]
   = E[Y | do(T = 1)] − E[Y | do(T = 0)]
   = Σ_m E[Y | do(T = 1), M = m] P(M = m | do(T = 1)) − Σ_m E[Y | do(T = 0), M = m] P(M = m | do(T = 0))
   = Σ_m E[Y | T = 1, M = m] P(M = m | T = 1) − Σ_m E[Y | T = 0, M = m] P(M = m | T = 0)
Now, we have all the components needed to make claims about the hiring disparity; in
particular, we conclude that 30.4% of the hiring disparity is due to gender (1 − NDE/TE),
while 7% could be explained by disparity in qualification alone (NIE/TE).
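For completeness, here is a sketch of the mediation-formula computation itself. The conditional expectations and mediator distributions below are hypothetical placeholders, not the numbers behind the 30.4%/7% conclusion:

```python
# Hypothetical inputs: E[Y | T = t, M = m] and P(M = m | T = t) for binary M
E_y = {(0, 0): 0.60, (0, 1): 0.70, (1, 0): 0.75, (1, 1): 0.90}   # E[Y | T, M]
P_m = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}                 # P(M = m | T = t)

nde = sum((E_y[1, m] - E_y[0, m]) * P_m[0][m] for m in (0, 1))
nie = sum(E_y[0, m] * (P_m[1][m] - P_m[0][m]) for m in (0, 1))
te  = sum(E_y[1, m] * P_m[1][m] for m in (0, 1)) \
    - sum(E_y[0, m] * P_m[0][m] for m in (0, 1))

print("due to qualification alone (NIE/TE):", nie / te)
print("due to gender (1 - NDE/TE):", 1 - nde / te)
```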