
Machine Learning

Section 16: Causality

Stefan Harmeling

22. December 2021 (WS 2021/22)

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 1


[Figure: chocolate consumption vs. number of Nobel laureates per country]
Messerli, The New England Journal of Medicine 367(16), 2012

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 2


[Figure: number of pirates vs. global average temperature]
PiratesVsTemp.svg: RedAndr derivative work: Mikhail Ryazanov (talk), PiratesVsTemp(en), CC BY-SA 3.0

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 3


More examples

From a post of Peter Flom on http://stats.stackexchange.com/questions/36/examples-for-teaching-correlation-does-not-mean-causation:

▸ The more firemen are sent to a fire, the more damage is done.
▸ Children who get tutored get worse grades than children who do
not get tutored.
▸ In the early elementary school years, astrological sign is
correlated with IQ, but this correlation weakens with age and
disappears by adulthood.
▸ When people play golf they are more likely to be rich.

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 4


Books

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 5


Recommended books

▸ The PDF of this book can be downloaded from the MIT Press website (look for “Open Access Title”).
▸ “Previews” available at: http://bayes.cs.ucla.edu/PRIMER/

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 6


More good books

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 7


Flow chart for causality books

from Brady Neal’s blog at https://www.bradyneal.com/which-causal-inference-book

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 8


Correlations

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 9


“Storks vs number of births”

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 10


Correlation coefficient

ρ(X, Y) = Cov(X, Y) / (σ(X) σ(Y))

▸ Cov(X , Y ) is the covariance between variables X and Y


▸ σ(X ) is the standard deviation

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 11


Empirical correlation coefficient

For samples x_1, . . . , x_n and y_1, . . . , y_n

r = r_xy = ∑_{i=1}^n (x_i − x̄)(y_i − ȳ) / ( √(∑_{i=1}^n (x_i − x̄)²) · √(∑_{i=1}^n (y_i − ȳ)²) )

with sample mean

x̄ = (1/n) ∑_{i=1}^n x_i
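A quick sanity check in Python (my own sketch, not from the slides): compute r directly by the formula above and compare it with numpy's built-in routine.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)          # linearly related, so r should be close to 1

xc, yc = x - x.mean(), y - y.mean()          # subtract the sample means
r = (xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum())

print(r, np.corrcoef(x, y)[0, 1])            # both numbers agree (about 0.89 here)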

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 12


Correlation measures only similarity
between curves.
▸ The empirical correlation coefficient rho is basically a normalized inner
product between two vectors.
▸ rho is a second-order statistic.
▸ rho views scatterplots through Gaussian glasses.

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 13


However, correlation does not imply
causation!

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 14


Nonetheless, correlation gives hints to
interesting relationships that can be
investigated further.

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 15


Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 16
Empirical correlation coefficient

For samples x_1, . . . , x_n and y_1, . . . , y_n

r = r_xy = ∑_{i=1}^n (x_i − x̄)(y_i − ȳ) / ( √(∑_{i=1}^n (x_i − x̄)²) · √(∑_{i=1}^n (y_i − ȳ)²) )

with sample mean

x̄ = (1/n) ∑_{i=1}^n x_i

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 17


Correlation coefficient of scatterplots

[Figure: rows of scatterplots with their correlation coefficients.
Top row: 1.0, 0.8, 0.4, 0.0, -0.4, -0.8, -1.0 (decreasing linear relationship).
Middle row: 1.0, 1.0, 1.0, -1.0, -1.0, -1.0 (perfectly linear relationships with different slopes).
Bottom row: 0.0 for every plot (clearly dependent, yet uncorrelated, patterns).]


▸ Thus uncorrelatedness (ρ = 0) does not imply independence.
▸ Even worse, X and Y could be uncorrelated even though X causes Y, e.g.
the example at the bottom left with X on the horizontal and Y on the vertical axis.
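A minimal example (my own, in the spirit of the bottom row of the figure): Y is a deterministic function of X, yet the correlation coefficient is close to zero.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)
y = x**2                                   # Y is completely determined by X

print(np.corrcoef(x, y)[0, 1])             # approximately 0: uncorrelated, yet dependent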
Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 18
Some insights

▸ Correlation measures similarity between curves.


▸ Correlation is not statistical dependence.
▸ Correlation does not imply causation, but gives hints to possibly
interesting relations.
▸ Uncorrelatedness does not imply independence.
▸ X and Y could be uncorrelated, even though X causes Y .

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 19


Mechanisms

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 20


Story
Studies show that married people are happier.
Possible conclusions
1. Does it mean that marrying makes us happier?
2. Or is it that unhappy people just do not marry?
3. Or is there some common cause that makes us happy and more
likely to marry?
4. Or are the studies wrong, and happiness and marrying are
unrelated?
Notation

M = "being married"
H = "being happy"

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 21


Four scenarios as words

Words:
1. If you get married, you are likely to be happier.

M causes H.

2. If you are happier, you are more likely to marry.

H causes M.

3. There is a common cause that makes you more likely to be happy
and to get married.

C causes H and M.

4. Being married and being happy have nothing to do with each other.

H and M are independent.

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 22


Four scenarios as graphs
1. If you get married, you are likely to be happier.

M → H

2. If you are happier, you are more likely to marry.

H → M

3. There is a common cause that makes you more likely to be happy
and to get married.

M ← C → H

4. Being married and being happy have nothing to do with each other.

H   M   (no edge)
Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 23
Four scenarios as computer programs

1. If you get married, you are likely to be happier.

M = f(); H = g(M);

2. If you are happier, you are more likely to marry.

H = f(); M = g(H);

3. There is a common cause that makes you more likely to be happy
and to get married.

C = f(); M = g(C); H = h(C);

4. Being married and being happy have nothing to do with each other.

M = f(); H = g();
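A small simulation in Python of scenario 3 (the structure is from the slide, the numbers are my own): a common cause C drives both M and H, and M and H come out correlated although neither causes the other.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

C = rng.random(n) < 0.5                                       # C = f()
M = (rng.random(n) < np.where(C, 0.7, 0.3)).astype(float)     # M = g(C)
H = (rng.random(n) < np.where(C, 0.8, 0.4)).astype(float)     # H = h(C)

print(np.corrcoef(M, H)[0, 1])                                # clearly positive, about 0.16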

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 24


Four scenarios as probabilities

1. If you get married, you are likely to be happier.

p(M, H) = p(M) p(H∣M)

2. If you are happier, you are more likely to marry.

p(M, H) = p(H) p(M∣H)

3. There is a common cause that makes you more likely to be happy
and to get married.

p(M, H) = ∑_C p(C) p(M∣C) p(H∣C)

4. Being married and being happy have nothing to do with each other.

p(M, H) = p(M) p(H)

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 25


Graphs / computer programs vs probabilities

M → H          p(M, H) = p(M) p(H∣M)

H → M          p(M, H) = p(H) p(M∣H)

M ← C → H      p(M, H) = ∑_C p(C) p(M∣C) p(H∣C)

H   M          p(M, H) = p(M) p(H)

▸ the graph (and the computer program) describes the mechanism


▸ the joint probability p(M, H) does not describe the mechanism
▸ the joint probability does not describe what happens when we set
M or H to some values
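A minimal sketch in Python of the last two points (my own toy numbers): the same joint p(M, H) can be produced by the mechanism "M causes H" and by the mechanism "H causes M", so the joint alone cannot tell the mechanisms apart.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Mechanism 1: M causes H, i.e. M = f(); H = g(M);
M1 = rng.random(n) < 0.4                              # p(M=1) = 0.4
H1 = rng.random(n) < np.where(M1, 0.8, 0.5)           # p(H=1|M)

# Refactorize the resulting joint the other way around: p(H) p(M|H)
pH = H1.mean()
pM_given_H1 = M1[H1].mean()
pM_given_H0 = M1[~H1].mean()

# Mechanism 2: H causes M, i.e. H = f(); M = g(H); with the derived parameters
H2 = rng.random(n) < pH
M2 = rng.random(n) < np.where(H2, pM_given_H1, pM_given_H0)

def joint(M, H):
    return np.array([[np.mean(~M & ~H), np.mean(~M & H)],
                     [np.mean( M & ~H), np.mean( M & H)]])

print(joint(M1, H1))    # the two joint tables agree up to sampling noise,
print(joint(M2, H2))    # although the underlying mechanisms differ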

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 26


Pearl’s causal hierarchy

extending the notation of probabilities. . .

1. Observational questions
▸ “what if I see A” or “what is?”

p(Y ∣A)

2. Interventional statements
▸ “what if I do A” or “what if?”

p(Y ∣do(A))

3. Counterfactual statements
▸ “what if I did things differently” or “why?”

p(Y_{A′} ∣ do(A))
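A rough illustration in Python of the difference between level 1 and level 2 (my own toy model, not from the slides): when a common cause C drives both A and Y, "seeing A = 1" and "doing A = 1" give different answers.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

C = rng.random(n) < 0.5
A = rng.random(n) < np.where(C, 0.8, 0.2)    # C influences A
Y = rng.random(n) < np.where(C, 0.9, 0.3)    # C influences Y; A has no effect on Y

print(Y[A].mean())    # p(Y=1 | A=1)      ~ 0.78: observing A=1 makes C=1 more likely
print(Y.mean())       # p(Y=1 | do(A=1))  ~ 0.60: forcing A leaves Y untouched in this model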

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 27


Causal learning/discovery vs causal
reasoning/inference
From Peters et al
[Figure 1.1 from the book: causal learning goes from "observations & outcomes incl.
changes & interventions" to a "causal model", and causal reasoning goes back;
statistical learning goes from "observations & outcomes" to a "probabilistic model",
and probabilistic reasoning goes back; the causal notions subsume the statistical ones.]

Figure 1.1: Terminology used by the present book for various probabilistic inference
problems (bottom) and causal inference problems (top); see Section 1.3. Note that we use
the term “inference” to include both learning and reasoning.
Peters et al., Elements of Causal Inference, MIT Press, 2017.

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 28
Causal discovery
Going from the data to the model

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 29



Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 30



Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 31



Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 32



Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 33



Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 34


Tübingen pairs data set

▸ Reference:
J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, B. Schölkopf:
"Distinguishing cause from effect using observational data:
methods and benchmarks", Journal of Machine Learning
Research 17(32):1-102, 2016
▸ There exist methods that can do causal discovery for two
variables! (however, not perfectly, but slightly better than chance),
see also Elements of Causal Inference (Peters et al.)

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 35


Be careful to distinguish correctly:

Causal discovery (aka learning)
▸ given data, infer a causal model

Causal inference (aka reasoning)
▸ given a causal model, infer conclusions

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 36


Causal learning/discovery vs causal
reasoning/inference
From Peters et al
[Figure 1.1 from the book: causal learning goes from "observations & outcomes incl.
changes & interventions" to a "causal model", and causal reasoning goes back;
statistical learning goes from "observations & outcomes" to a "probabilistic model",
and probabilistic reasoning goes back; the causal notions subsume the statistical ones.]

Figure 1.1: Terminology used by the present book for various probabilistic inference
problems (bottom) and causal inference problems (top); see Section 1.3. Note that we use
the term “inference” to include both learning and reasoning.
Peters et al., Elements of Causal Inference, MIT Press, 2017.

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 37
Causal vs anti-causal learning

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 38


Example: MNIST - causal and anti-causal learning

▸ Consider MNIST: images x ∈ X of handwritten digits with labels y ∈ Y.
▸ For digit recognition we want to learn a function

f ∶ X → Y

▸ Question: Is this the causal direction?

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 39
Example: MNIST - two factorizations
Causal direction
▸ Given a label y, generate an image x.

p(x, y ) = p(y ) p(x∣y )

▸ Learning p(x∣y ) learns to write, i.e. learns the writing mechanism.


▸ The label frequencies p(y) are irrelevant for this.
▸ Forward in time, we first have y , then we have x.
Anti-causal direction
▸ Given an image x, what is the label y?

p(x, y ) = p(x) p(y ∣x)

▸ Learning p(y ∣x) learns to read.


▸ The label frequencies are relevant for this: if in doubt, we take the
more likely digit.
▸ Backwards in time: given the resulting image, go back to the thought
(of the writer).
Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 40
Example: MNIST - writing vs reading

Causal direction - WRITING: given a label y, generate an image x

p(x, y) = p(y) p(x∣y)

▸ p(y): the digit frequencies
▸ p(x∣y): how to draw digits, i.e. the mechanism

Anti-causal direction - READING: given an image x, find the label y

p(x, y) = p(x) p(y∣x)

▸ Learning only p(x∣y) learns the mechanism but ignores the digit
frequencies of the dataset, which might be different in another
dataset ("covariate shift", where p(y∣x) changes).
▸ The learning focusses on the mechanism.

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 41


Example from Peters
Figure and caption copied from the ECI book

[Figure 5.1: two cause-effect diagrams with variables X and Y, noise variables N_X and N_Y, and mechanisms id and j]

Figure 5.1: Top: a complicated mechanism j called the ribosome translates mRNA
information X into a protein chain Y. Predicting the protein from the mRNA is an example of
a causal learning problem, where the direction of prediction (green arrow) is aligned with
the direction of causation (red). Bottom: In handwritten digit recognition, we try to infer
the class label Y (i.e., the writer’s intention) from an image X produced by a writer. This
is an anticausal problem.
Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 42
Causal and anti-causal learning

True causal model

φ∶C→E

Learning the causal direction


▸ find f that calculates the effect given the cause

f ∶C→E

▸ “learning to write”
Learning the anti-causal direction
▸ find f that calculates the cause given the effect

f ∶E →C

▸ “learning to read”

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 43


Semi-supervised learning

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 44


What is semi-supervised learning (1)

Regression task (supervised)


▸ Given data (x1 , y1 ), . . . , (xn , yn ) ∼ p(x, y )
▸ Predict y given x.
▸ For the L2 loss

E(Y − f(X))² = ∫ (y − f(x))² p(x, y) dx dy

the minimizer

f⁰ = argmin_f E(Y − f(X))²

can be written (by minimizing pointwise for each x) as

f⁰(x) = E(Y ∣X = x) = ∫ y p(y∣x) dy

i.e. it only depends on p(y∣x).

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 45


What is semi-supervised learning (2)

Supervised learning
▸ Given n iid labeled data points:

(x1 , y1 ), . . . , (xn , yn ) ∼ p(x, y )

where x is the location (the point) and y its label.


▸ Then L2 regression is just using p(y ∣x)

Semi-supervised learning (SSL)


▸ Additionally we have unlabeled data:

xn+1 , . . . , xn+m ∼ p(x)

▸ The “hope” is that information about p(x) tells us something about


p(y ∣x).
▸ This hope holds under certain assumptions (next slide...).

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 46


Simple example of semi-supervised learning
from https://en.wikipedia.org/wiki/Semi-supervised_learning

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 47


What is semi-supervised learning (3)

1. Cluster assumption
▸ Assume that the points x ∼ p(x) can be clustered and points inside
a cluster have the same label y .

2. Low-density separation assumption


▸ Assume that the decision boundary of the classification problem, i.e.
where p(Y = 1∣X = x) ≈ 0.5 lies in an area where p(x) is small.
▸ (quite similar to cluster assumption)

3. Semi-supervised smoothness assumption


▸ The conditional mean x ↦ E(Y ∣X = x) is smooth where p(x) is
large.
▸ In causal words, the cause and the mechanism, p(x) and p(y ∣x),
should be somehow dependent.
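A minimal Python sketch of the cluster assumption (my own illustration, not from the slides): the unlabeled points reveal the clusters of p(x), and a handful of labels then names them.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two well-separated clusters, one per class; only 10 labels are revealed.
x = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])[:, None]
y = np.concatenate([np.zeros(500), np.ones(500)])
labeled = np.r_[0:5, 500:505]

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)   # fit on ALL x, labeled or not
comp = gmm.predict(x)

# Name each mixture component by the majority label of its few labeled points.
comp_to_label = {c: int(round(y[labeled][comp[labeled] == c].mean())) for c in (0, 1)}
y_hat = np.array([comp_to_label[c] for c in comp])

print((y_hat == y).mean())    # high accuracy from only 10 labels, thanks to p(x)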

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 48


SSL and (anti-)causal direction

Learning the causal direction


▸ find f that calculates the effect Y given the cause X

f ∶X → Y X is cause, Y is effect

▸ p(x) and p(y ∣x) are independent mechanisms, i.e. SSL should
not work.
Learning the anti-causal direction
▸ find f that calculates the cause Y given the effect X

f ∶X → Y X is effect, Y is cause

▸ p(x) might contain information about p(y ∣x).


▸ Meta-study on next slides supports this reasoning.

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 49


Figure and caption copied from the ECI book

[Figure 5.2: relative decrease of error when using self-training instead of a base
classifier, for UCI benchmark data sets grouped into anticausal/confounded, causal,
and unclear]

Figure 5.2: The benefit of SSL depends on the causal structure. Each column of points
corresponds to a benchmark data set from the UCI repository and shows the performance
of six different base classifiers augmented with self-training, a generic method for SSL.
Performance is measured by percentage decrease of error relative to the base classifier,
that is, (error(base) − error(self-train))/error(base). Self-training overall does not help for
the causal data sets, but it does help for some of the anticausal/confounded data sets [from
Schölkopf et al., 2012].
Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 50
Covariate shift

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 51


What is a covariate?
source: https://en.wikipedia.org/wiki/Dependent_and_independent_variables#Statistics_synonyms

f ∶X →Y

Synonyms for X                       Synonyms for Y

independent variable                 dependent variable
predictor variable                   response variable
regressor                            regressand
covariate                            criterion
controlled variable                  predicted variable
manipulated variable                 measured variable
explanatory variable                 explained variable
exposure variable                    experimental variable
risk factor                          responding variable
feature (in machine learning!)       outcome variable
input variable (my favorite)         output variable
                                     label
Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 52
What is covariate shift?

Simple description
▸ Covariate shift means that the training and test distributions of the
input differ in a regression task.
Regression task
▸ Given data (x1 , y1 ), . . . , (xn , yn ) ∼ p(x, y )
▸ Predict y given x.
▸ Train (e.g.) with L2 loss

E(Y − f (X ))2 = ∫ (y − f (x))2 p(x, y ) dx dy

▸ Test with the L2 loss at different input locations X ∼ q(x):

E(Y − f (X ))2 = ∫ (y − f (x))2 q(x) p(y ∣x) dx dy

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 53


Covariate shift and (anti-)causal direction

Learning the causal direction


▸ find f that calculates the effect Y given the cause X

f ∶X →Y

▸ Since p(x) and p(y∣x) are independent mechanisms, we can still
use our learned regression function E(Y ∣X = x), because it does not
depend on p(x).
▸ So covariate shift is no problem in the causal direction.
Learning the anti-causal direction
▸ find f that calculates the cause Y given the effect X

f ∶X →Y

▸ p(x) might contain information about p(y ∣x)


▸ once we change p(x) to q(x) this can change p(y ∣x) as well!

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 54


Covariate shift for the anti-causal case

[Figure 5.4 from the ECI book: two densities p(x) and p′(x), each a mixture of two
Gaussians, but with different mixture weights]

Figure 5.4: Example where P_X changes to P′_X in a way that suggests that P_Y has changed
and P_X∣Y remained the same. When Y is binary and known to be the cause of X, observing
that P_X is a mixture of two Gaussians makes it plausible that the two modes correspond to
the two different labels y = 0, 1. Then, the influence of Y on X consists just in shifting the
mean of the Gaussian (which amounts to an ANM; see Section 4.1.4), which is certainly
a simple explanation for the joint distribution. Observing furthermore that the weights of
the mixture changed from one data set to another one makes it likely that this change is
due to the change of P_Y.

Example
▸ Consider the following causal model:

Y ∼ coin(0.5) ∈ {0, 1}    X = Y + N_X    N_X ∼ N(0, 1)

▸ p(x) is a mixture of two Gaussians.
▸ Learning f ∶ X → Y finds a boundary exactly halfway between
y = 0 and y = 1.
▸ Suppose our test data comes from

q(x) = 0.9 N(x; 0, 1) + 0.1 N(x; 1, 1)

▸ This also shifts the (optimal) boundary: points near the old boundary that
used to be labeled y = 1 should now be labeled y = 0.
▸ So for learning in the anti-causal direction, covariate shift can possibly
destroy the results, since p(y∣x) changes as p(x) changes to q(x).
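A rough numeric check in Python of this example (my own sketch; the slide's parameters are assumed): at the old decision boundary x = 0.5 the posterior p(Y = 1∣x) drops from 0.5 to 0.1 once the mixture weights change, i.e. p(y∣x) is not the same anymore.

from scipy.stats import norm

def posterior_y1(x, prior1):
    # p(Y=1 | X=x) for X = Y + N(0,1) with p(Y=1) = prior1
    a = prior1 * norm.pdf(x, loc=1.0, scale=1.0)
    b = (1.0 - prior1) * norm.pdf(x, loc=0.0, scale=1.0)
    return a / (a + b)

x0 = 0.5                                   # boundary learned on the training distribution
print(posterior_y1(x0, prior1=0.5))        # 0.5 under the training distribution p(x)
print(posterior_y1(x0, prior1=0.1))        # 0.1 under the test distribution q(x)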
Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 55
Summary: causal perspective on SSL and covariate shift
▸ given data sampled from the joint p(x, y)
▸ two ways to factorize it: p(x, y) = p(x) p(y∣x) = p(y) p(x∣y)
▸ machine learning estimates a function f ∶ X → Y
1. Causal direction X → Y: causal learning
▸ p(x) and p(y∣x) are independent mechanisms
▸ e.g. learn the mapping from mRNA to the protein
▸ SSL does not work: more examples for X do not tell us more
about p(y∣x), which describes f
▸ Covariate shift is no problem: if p(x) changes, the learned function will
still work, since p(y∣x) does not change.
2. Causal direction Y → X: anti-causal learning
▸ p(y) and p(x∣y) are independent mechanisms, but typically p(x) and
p(y∣x) are somehow coupled
▸ e.g. learn reading, from digit images to labels (MNIST)
▸ SSL might help because of the above coupling
▸ Covariate shift is problematic: changing p(x) possibly changes
p(y∣x)

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 56


End of today

Machine Learning / Stefan Harmeling / 22. December 2021 (WS 2021/22) 57
