ML Section 16: Causality
Stefan Harmeling
▸ The more firemen are sent to a fire, the more damage is done.
▸ Children who get tutored get worse grades than children who do
not get tutored.
▸ In the early elementary school years, astrological sign is
correlated with IQ, but this correlation weakens with age and
disappears by adulthood.
▸ When people play golf they are more likely to be rich.
▸ The PDF of this book can be downloaded from the MIT Press
website (look for "Open Access Title").
▸ "Previews" available at: http://bayes.cs.ucla.edu/PRIMER/
Correlation coefficient:
ρ(X, Y) = Cov(X, Y) / (σ(X) σ(Y))
Sample mean:
x̄ = (1/n) ∑_{i=1}^n x_i
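As a quick sketch of these formulas (hypothetical data, numpy assumed available): the sample correlation picks up the association induced by a common cause Z even though X and Y do not influence each other, which is exactly the situation behind the firemen/tutoring examples above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# common cause Z, plus independent noise for X and Y (no X->Y or Y->X link)
z = rng.normal(size=n)
x = z + 0.5 * rng.normal(size=n)
y = z + 0.5 * rng.normal(size=n)

# sample mean and Pearson correlation, as in the formulas above
x_bar = x.sum() / n                                          # same as x.mean()
rho = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))   # Cov / (sigma sigma)

print(x_bar, rho)  # rho is clearly positive although X and Y are not causally linked
```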
M = "being married"
H = "being happy"
Words:
1. M causes H: if you get married, you are likely more happy.
2. H causes M: if you are happy, you are more likely to get married.
3. A common cause C causes both H and M.
4. Being married and being happy have nothing to do with each other.
Graphs:
1. M → H
2. H → M
3. M ← C → H
4. M   H (no edge)
Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 23
Four scenarios as computer programs
1. M = f(); H = g(M);
2. H = f(); M = g(H);
3. C = f(); M = g(C); H = h(C);
4. M = f(); H = g();
where f, g, h are programs with independent random choices.
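A minimal simulation of these four programs (continuous stand-ins for M and H, numpy assumed): scenarios 1-3 all produce correlated variables, scenario 4 does not; so observational correlation alone cannot distinguish the first three.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

def scenario(k):
    """Sample (M, H) from one of the four generative programs."""
    if k == 1:                                  # M causes H
        m = rng.normal(size=n)
        h = m + rng.normal(size=n)
    elif k == 2:                                # H causes M
        h = rng.normal(size=n)
        m = h + rng.normal(size=n)
    elif k == 3:                                # common cause C
        c = rng.normal(size=n)
        m = c + rng.normal(size=n)
        h = c + rng.normal(size=n)
    else:                                       # independent
        m = rng.normal(size=n)
        h = rng.normal(size=n)
    return m, h

corrs = {k: np.corrcoef(*scenario(k))[0, 1] for k in (1, 2, 3, 4)}
print(corrs)  # scenarios 1-3: clearly nonzero; scenario 4: near zero
```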
1. Observational questions
▸ "what if I see A" or "what is?"
▸ p(Y ∣ A)
2. Interventional questions
▸ "what if I do A" or "what if?"
▸ p(Y ∣ do(A))
3. Counterfactual questions
▸ "what if I did things differently" or "why?"
▸ p(Y_{A′} ∣ do(A))
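The gap between the first two levels can be simulated directly (hypothetical confounded model, numpy assumed): a confounder Z causes both A and Y, so p(Y ∣ A=1) differs from p(Y ∣ do(A=1)); the intervention is simulated by mutilating the graph, i.e. overriding the assignment of A while leaving the rest of the program intact.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# confounder Z causes both A and Y; A has no effect on Y in this toy model
z = rng.integers(0, 2, size=n)
a = np.where(rng.random(n) < 0.9, z, 1 - z)   # A mostly copies Z
y = z + 0.1 * rng.normal(size=n)              # Y depends only on Z

# observational: "what if I see A=1" -> condition on A=1
e_y_see = y[a == 1].mean()

# interventional: "what if I do A=1" -> replace the assignment of A by A := 1;
# Y's program does not mention A, so fixing A leaves Y's distribution unchanged
e_y_do = y.mean()

print(e_y_see, e_y_do)  # seeing A=1 suggests high Y, doing A=1 does not
```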
[Figure 1.1: a causal model (top) is obtained by causal learning from observations & outcomes incl. changes & interventions, and is used for causal reasoning; it subsumes statistical learning and probabilistic reasoning (bottom).]
Figure 1.1: Terminology used by the present book for various probabilistic inference problems (bottom) and causal inference problems (top); see Section 1.3. Note that we use the term "inference" to include both learning and reasoning.
Peters, Janzing, Schölkopf: Elements of Causal Inference, MIT Press, 2017.
"[...] not distract us from the fact, however, that the ill-posedness of the usual statistical problems is still there (and thus it is important to worry about the capacity of function classes also in causality, such as by using additive noise models — see Section 4.1.4 below), only confounded by an additional difficulty arising from the fact that we are trying to estimate a richer structure than just a probabilistic one."
Causal discovery
Going from the data to the model
[Figure: example scatter plots of bivariate data sets for causal discovery]
▸ Reference: J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, B. Schölkopf: "Distinguishing cause from effect using observational data: methods and benchmarks", Journal of Machine Learning Research 17(32):1-102, 2016
▸ There exist methods that can do causal discovery for two variables! (not perfectly, but slightly better than chance); see also Elements of Causal Inference (Peters et al.)
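One family of such methods can be sketched with an additive noise model (toy data, numpy assumed; the scoring rule here is a deliberately crude stand-in for the independence tests used in the paper): fit a regression in both directions and prefer the direction whose residuals look independent of the input, scored by correlating |residual| with the input.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# ground truth: X -> Y with additive noise (an additive noise model, ANM)
x = rng.uniform(0.0, 2.0, size=n)
y = np.exp(x) + rng.uniform(-0.3, 0.3, size=n)

def dependence_score(inp, out):
    """Fit a polynomial regression out ~ inp and score how much the
    residual magnitude still depends on the input (0 = looks independent)."""
    coeffs = np.polyfit(inp, out, deg=5)
    resid = out - np.polyval(coeffs, inp)
    return abs(np.corrcoef(np.abs(resid), inp)[0, 1])

score_xy = dependence_score(x, y)  # causal direction: residuals ~ the true noise
score_yx = dependence_score(y, x)  # anti-causal: residual spread varies with y

print(score_xy, score_yx)
print("inferred direction:", "X -> Y" if score_xy < score_yx else "Y -> X")
```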
Causal vs anti-causal learning
▸ Consider MNIST.
▸ For digit recognition we want to learn a function
f ∶ X → Y
▸ Question: Is this the causal direction?
Example: MNIST - two factorizations (writing vs reading)
Causal direction (WRITING): given a label y, generate an image x.
▸ p(x, y) = p(y) p(x ∣ y)
▸ p(y): the digit frequencies
▸ p(x ∣ y): how to draw the digits (the mechanism)
Anti-causal direction (READING): given an image x, infer the label y.
▸ p(x, y) = p(x) p(y ∣ x)
▸ Learning only p(x ∣ y) learns the mechanism but ignores the digit
frequencies of the dataset, which might be irrelevant in other
applications, since the digit frequencies might be different
(covariate shift).
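The two MNIST factorizations can be checked numerically on a tiny discrete toy model (hypothetical numbers, numpy assumed): both orderings give the same joint, but changing the digit frequencies p(y) changes p(y ∣ x) while leaving the mechanism p(x ∣ y) untouched.

```python
import numpy as np

# toy "MNIST": 2 labels y, 3 image types x (hypothetical numbers)
p_y = np.array([0.7, 0.3])                        # digit frequencies p(y)
p_x_given_y = np.array([[0.6, 0.3, 0.1],          # how to "draw" each digit:
                        [0.1, 0.2, 0.7]])         # the mechanism p(x|y)

# causal factorization: p(x, y) = p(y) p(x|y)
joint = p_y[:, None] * p_x_given_y

# anti-causal factorization of the SAME joint: p(x, y) = p(x) p(y|x)
p_x = joint.sum(axis=0)
p_y_given_x = joint / p_x
assert np.allclose(p_y_given_x * p_x, joint)      # both factorizations agree

# now change only the digit frequencies (e.g. a different application)
q_y = np.array([0.2, 0.8])
q_joint = q_y[:, None] * p_x_given_y              # p(x|y) is reused unchanged
q_y_given_x = q_joint / q_joint.sum(axis=0)

print(p_y_given_x)
print(q_y_given_x)  # p(y|x) changed, although p(x|y) did not
```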
[Figure 5.1 diagrams: independent noises N_X and N_Y feed X and Y; a mechanism j maps X to Y]
Figure 5.1: Top: a complicated mechanism j called the ribosome translates mRNA information X into a protein chain Y. Predicting the protein from the mRNA is an example of a causal learning problem, where the direction of prediction (green arrow) is aligned with the direction of causation (red). Bottom: In handwritten digit recognition, we try to infer the class label Y (i.e., the writer's intention) from an image X produced by a writer. This is an anticausal problem.
Causal and anti-causal learning
▸ the causal mechanism: φ ∶ C → E (cause C, effect E)
Learning the causal direction
▸ find f that calculates the effect given the cause
f ∶ C → E
▸ "learning to write"
Learning the anti-causal direction
▸ find f that calculates the cause given the effect
f ∶ E → C
▸ "learning to read"
Semi-supervised learning
▸ Given n iid labeled data points (x1, y1), . . . , (xn, yn) ∼ p(x, y)
and additional unlabeled points x ∼ p(x).
1. Cluster assumption
▸ Assume that the points x ∼ p(x) can be clustered and points inside
a cluster have the same label y.
Learning the causal direction
▸ f ∶ X → Y where X is cause, Y is effect
▸ p(x) and p(y ∣ x) are independent mechanisms, i.e. SSL should
not work.
Learning the anti-causal direction
▸ find f that calculates the cause Y given the effect X
▸ f ∶ X → Y where X is effect, Y is cause
▸ here p(x) can carry information about p(y ∣ x), i.e. SSL can work.
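A sketch of why unlabeled data can help in the anti-causal direction (toy 1D data, numpy assumed; the 2-means step is a hypothetical stand-in for a real SSL method): Y is the cause, X ∣ Y=k is Gaussian around 4k, and clustering the unlabeled x recovers a decision threshold near the optimal value 2 via the cluster assumption, without using any labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# anticausal generative process: label Y (cause) -> feature X (effect)
y_true = rng.integers(0, 2, size=2000)
x_unlab = rng.normal(loc=4.0 * y_true, scale=1.0)   # unlabeled pool

# simple 1D 2-means on the unlabeled data (cluster assumption)
c0, c1 = x_unlab.min(), x_unlab.max()               # initialize at the extremes
for _ in range(20):
    assign = np.abs(x_unlab - c1) < np.abs(x_unlab - c0)
    c0, c1 = x_unlab[~assign].mean(), x_unlab[assign].mean()

ssl_threshold = (c0 + c1) / 2.0   # boundary between the two clusters
print(ssl_threshold)              # close to the optimal threshold 2.0
```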
[Figure 5.2: for 26 UCI benchmark data sets (ba-sc, br-c, br-w, col, col.O, cr-a, cr-g, diab, he-c, he-h, he-s, hep, ion, iris, kr-kp, lab, lett, mush, seg, sick, son, splic, vehi, vote, vow, wave), the relative decrease of error (in %) when using self-training instead of a base classifier; data sets are grouped as causal, anticausal/confounded, or unclear.]
Figure 5.2: The benefit of SSL depends on the causal structure. Each column of points
corresponds to a benchmark data set from the UCI repository and shows the performance
of six different base classifiers augmented with self-training, a generic method for SSL.
Performance is measured by percentage decrease of error relative to the base classifier,
that is, (error(base) − error(self-train))/error(base). Self-training overall does not help for
the causal data sets, but it does help for some of the anticausal/confounded data sets [from
Schölkopf et al., 2012].
Covariate shift
Simple description
▸ Covariate shift means that the training and testing distributions of the
input differ in a regression task f ∶ X → Y.
Regression task
▸ Given data (x1, y1), . . . , (xn, yn) ∼ p(x, y)
▸ Predict y given x.
▸ Train (e.g.) with L2 loss a function f ∶ X → Y
Learning the causal direction
▸ Since p(x) and p(y ∣ x) are independent mechanisms, we can still
use our learned regression function E(Y ∣ X = x), since it does not
depend on p(x).
▸ So covariate shift is no problem in the causal direction.
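A sketch of this claim (toy linear mechanism, numpy assumed): fit E(Y ∣ X=x) on training inputs from one p(x), then evaluate under a shifted input distribution that stays inside the training support; the test error remains at the noise level because p(y ∣ x) did not change.

```python
import numpy as np

rng = np.random.default_rng(0)
noise_sd = 0.2

def mechanism(x):
    """p(y|x): the causal mechanism, fixed throughout."""
    return 2.0 * x + 1.0

# training inputs from p(x)
x_tr = rng.normal(0.0, 1.0, size=2000)
y_tr = mechanism(x_tr) + noise_sd * rng.normal(size=2000)

# fit a linear model for E(Y | X = x)
slope, intercept = np.polyfit(x_tr, y_tr, deg=1)

# test inputs from a shifted p'(x) (still inside the training support)
x_te = rng.uniform(-0.5, 0.5, size=2000)
y_te = mechanism(x_te) + noise_sd * rng.normal(size=2000)

mse = np.mean((slope * x_te + intercept - y_te) ** 2)
print(mse)   # stays near noise_sd**2 = 0.04 despite the input shift
```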
Learning the anti-causal direction
▸ find f that calculates the cause Y given the effect X
f ∶ X → Y
[Figure 5.4 plots: the input density p(x) (left) and a shifted density p′(x) (right), each a mixture of two Gaussians with different mixture weights]
Figure 5.4: Example where P_X changes to P′_X in a way that suggests that P_Y has changed and P_{X∣Y} remained the same. When Y is binary and known to be the cause of X, observing that P_X is a mixture of two Gaussians makes it plausible that the two modes correspond to the two different labels y = 0, 1. Then, the influence of Y on X consists just in shifting the mean of the Gaussian (which amounts to an ANM — see Section 4.1.4), which is certainly a simple explanation for the joint distribution. Observing furthermore that the weights of the mixture changed from one data set to another one makes it likely that this change is due to the change of P_Y.
From Peters et al.: "[...] the influence of Y consists only in shifting the mean of X. Under this assumption, we do not need any (x, y)-pairs to learn the relation between X and Y. Assume now that in a second data set we observe the same mixture of two Gaussian distributions but with different weights (see Figure 5.4, right). Then, the most natural conclusion reads that the weights have changed because the same equation (5.4) still holds but only P_Y has changed. Accordingly, we would no longer use the same P_{Y∣X} for our prediction and reconstruct P′_{Y∣X} from P′_X. The example illustrates that in the anticausal scenario the changes of P_X and P_{Y∣X} may be related and that this relation may be due to the fact that P_Y has changed and P_{X∣Y} remained the same. [...]"
Example
▸ X = Y + N_X    (5.4)
▸ p(x) is a mixture of two Gaussians.
▸ learning f ∶ X → Y finds a boundary exactly halfway between
y = 0 and y = 1
▸ Suppose our test data comes from
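The adaptation described in the excerpt above can be sketched as follows (toy 1D mixture, numpy assumed): the class-conditionals p(x ∣ y) are kept fixed and only the class prior is re-estimated from unlabeled shifted inputs with a few EM steps, after which Bayes' rule gives the adapted p′(y ∣ x).

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x, mu):
    """Fixed class-conditional p(x|y): density of N(mu, 1)."""
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

mu0, mu1 = 0.0, 4.0     # known mechanism p(x|y), unchanged under the shift

# shifted test set: the prior p'(y) changed to (0.8, 0.2), but we only see x
y_new = (rng.random(5000) < 0.2).astype(int)
x_new = rng.normal(loc=np.where(y_new == 1, mu1, mu0), scale=1.0)

# EM over the mixture weight w = p'(y=0), with the components held fixed
w = 0.5
for _ in range(50):
    resp0 = w * phi(x_new, mu0)
    resp0 = resp0 / (resp0 + (1 - w) * phi(x_new, mu1))   # responsibilities
    w = resp0.mean()                                      # M-step for the weight

print(w)   # close to the true new prior p'(y=0) = 0.8

# Bayes' rule with the re-estimated prior gives the adapted p'(y=0 | x)
post0 = w * phi(x_new, mu0) / (w * phi(x_new, mu0) + (1 - w) * phi(x_new, mu1))
```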