
Introduction to information theory and coding

Louis WEHENKEL
Set of slides No 2

• Entropies and information measures


• Chain rules for entropy and information
• More about independence, and conditional independence
• Translation of these properties into properties of information measures
• Data processing inequality
• Bayesian networks and decision trees

IT 2005-2, slide 1
Conditional (a posteriori) entropy

H(X|Y) = − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Xi ∩ Yj) log P(Xi|Yj).   (1)

The entropy of X knowing that Y = Yj is


H(X|Yj) = − ∑_{i=1}^{n} P(Xi|Yj) log P(Xi|Yj),   (2)

it is nonnegative (it is an entropy) and one has


H(X|Y) = ∑_{j=1}^{m} P(Yj) H(X|Yj),   (3)

hence the latter is also nonnegative.


And concavity of Hn implies H(X|Y) ≤ H(X), which is a fundamental property!

IT 2005-2, slide 2
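The definitions above are easy to check numerically. Below is a minimal Python sketch (the joint table is an arbitrary illustrative choice, not taken from the slides) computing H(X|Y) via equation (3) and verifying H(X|Y) ≤ H(X):

```python
import numpy as np

# Hypothetical joint distribution P(Xi ∩ Yj): rows index X, columns index Y.
P = np.array([[0.25, 0.25],
              [0.00, 0.50]])

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P_Y = P.sum(axis=0)                          # marginal P(Yj)
# Equation (3): H(X|Y) = sum_j P(Yj) H(X|Yj)
H_X_given_Y = sum(P_Y[j] * entropy(P[:, j] / P_Y[j])
                  for j in range(P.shape[1]))
H_X = entropy(P.sum(axis=1))                 # H(X) from the marginal of X

# Concavity consequence: conditioning cannot increase entropy.
assert H_X_given_Y <= H_X + 1e-12
print(H_X, H_X_given_Y)
```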
Joint entropy and its relationship with conditional entropy

H(X,Y) ≜ − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Xi ∩ Yj) log P(Xi ∩ Yj)
       = − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Yj)P(Xi|Yj) log( P(Yj)P(Xi|Yj) )
       = − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Yj)P(Xi|Yj) log P(Yj)
         − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Yj)P(Xi|Yj) log P(Xi|Yj)
       = − ∑_{j=1}^{m} P(Yj) ( ∑_{i=1}^{n} P(Xi|Yj) ) log P(Yj) + H(X|Y)
       = H(Y) + H(X|Y)
       = H(X) + H(Y|X).

IT 2005-2, slide 3
Inequalities related to the entropy
One deduces the following inequalities :
H(X , Y) ≥ max (H(X ), H(Y))
H(X , Y) ≤ H(X ) + H(Y)
Conclusion :
H(X , Y) ≤ H(X ) + H(Y) ≤ 2H(X , Y) (4)

Particular cases :
X and Y independent : H(X,Y) = H(X) + H(Y)
(because then P (Xi ∩ Yj ) = P (Xi )P (Yj )) (⇒ H(X |Y) = H(X ))
X function of Y : H(X , Y) = H(Y).
(because then H(X |Y) = 0) (since H(X |Yj ) = 0, ∀j = 1, . . . , m)

IT 2005-2, slide 4
Mutual information

I(X;Y) = ∑_{i=1}^{n} ∑_{j=1}^{m} P(Xi ∩ Yj) log [ P(Xi ∩ Yj) / ( P(Xi)P(Yj) ) ].   (5)
One can derive :

I(X ; Y) = H(X ) − H(X |Y) = H(Y) − H(Y|X )

and hence
I(X ; Y) = H(X ) + H(Y) − H(X , Y)
which we may also write as

H(X , Y) = H(X ) + H(Y) − I(X ; Y)

Main conclusion :

0 ≤ I(X ; Y) ≤ min{H(X ), H(Y)}

IT 2005-2, slide 5
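These identities can be verified numerically. The sketch below (joint table chosen arbitrarily for illustration) computes I(X;Y) from definition (5) and checks it against H(X)+H(Y)−H(X,Y) and H(X)−H(X|Y):

```python
import numpy as np

P = np.array([[0.3, 0.1],      # hypothetical joint P(Xi ∩ Yj)
              [0.2, 0.4]])

def H(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

Px, Py = P.sum(axis=1), P.sum(axis=0)

# Definition (5), skipping zero-probability cells.
I = sum(P[i, j] * np.log2(P[i, j] / (Px[i] * Py[j]))
        for i in range(2) for j in range(2) if P[i, j] > 0)

H_X_given_Y = H(P) - H(Py)                    # H(X|Y) = H(X,Y) - H(Y)
assert np.isclose(I, H(Px) + H(Py) - H(P))    # I = H(X) + H(Y) - H(X,Y)
assert np.isclose(I, H(Px) - H_X_given_Y)     # I = H(X) - H(X|Y)
assert 0 <= I <= min(H(Px), H(Py))            # main conclusion of the slide
```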
Exercises.

1. Show that indeed (and in the given order)

1. H(X , Y) = H(Y) + H(X |Y) = H(X ) + H(Y|X )

2. H(X , Y) ≥ max{H(X ), H(Y)}

3. H(X |Y) ≤ H(X )

4. H(X , Y) ≤ H(X ) + H(Y)

5. I(X ; Y) = H(X ) + H(Y) − H(X , Y) = H(X ) − H(X |Y)

2. A tournament between two teams consists of a sequence of at most 5 games which stops as soon as one of the two
teams has won three games. Let a and b denote the two teams and X a r.v. which represents the outcome of a tournament
between a and b. For example, X = aaa, babab, bbaaa are possible values of X (there are other possible values).
Let Y denote the random variable which denotes the number of games played (thus Y takes its values in {3, 4, 5}).

Suppose that the teams are of the same strength and the outcomes of the successive games are independent, and
compute H(X ),H(Y), H(X |Y) and H(Y|X ).

Let Z = {a, b} denote the random variable which identifies the team winning the tournament. Determine H(X |Z),
compare with H(X ) and justify the result. Determine H(Z|X ), and justify.

IT 2005-2, note 5
Summary

[Venn diagram: two overlapping regions representing H(X) and H(Y). The part of H(X) outside the overlap is H(X|Y), the part of H(Y) outside the overlap is H(Y|X), the overlap is I(X;Y), and the union of the two regions is H(X,Y).]

Particular cases
X and Y independent : I(X;Y) = 0 (necessary and sufficient).
X function of Y : I(X ; Y) = H(X ).
X one-to-one function of Y : I(X ; Y) = H(X ) = H(Y)

IT 2005-2, slide 6
Exercises.

1. Consider the following contingency table

      Y1    Y2
X1   1/3   1/3
X2    0    1/3

Compute (logarithms in base 2) :

1. H(X ), H(Y)

2. H(X |Y), H(Y|X )

3. H(X , Y)

4. H(Y) − H(Y|X )

5. I(X ; Y)

6. Draw a Venn diagram.

2. Consider three random variables X , Y, Z.

Prove that H(X,Y|Z) = H(X|Z) + H(Y|X,Z).

IT 2005-2, note 6
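For exercise 1, the requested quantities can be checked with a few lines of Python (a numerical companion, not a substitute for the pencil-and-paper derivation):

```python
import numpy as np

# Joint distribution read off the contingency table (rows X1, X2; columns Y1, Y2).
P = np.array([[1/3, 1/3],
              [0.0, 1/3]])

def H(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_X, H_Y, H_XY = H(P.sum(axis=1)), H(P.sum(axis=0)), H(P)
H_X_given_Y = H_XY - H_Y            # H(X|Y) = H(X,Y) - H(Y)
H_Y_given_X = H_XY - H_X
I_XY = H_X + H_Y - H_XY

print(f"H(X)={H_X:.4f}  H(Y)={H_Y:.4f}  H(X,Y)={H_XY:.4f}")
print(f"H(X|Y)={H_X_given_Y:.4f}  H(Y|X)={H_Y_given_X:.4f}  I(X;Y)={I_XY:.4f}")
```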
Other important properties
1. Chain rules
A. Entropies
H(X1, X2, ..., Xn) = ∑_{i=1}^{n} H(Xi | Xi−1, ..., X1)

B. Informations
I(X1, X2, ..., Xn; Y) = ∑_{i=1}^{n} I(Xi; Y | Xi−1, ..., X1)

NB: Conditional mutual information of X and Y given Z is defined by



I(X ; Y|Z) = H(X |Z) − H(X |Y, Z).
Almost the same as before, but one uses P(·|Zk) (and averages w.r.t. Zk).

IT 2005-2, slide 7
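The chain rules are easy to sanity-check numerically, since each conditional entropy is a difference of joint entropies. A sketch with an arbitrary random joint over three binary variables:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))
P /= P.sum()                     # arbitrary joint P(X1, X2, X3)

def H(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Chain rule A: H(X1,X2,X3) = H(X1) + H(X2|X1) + H(X3|X2,X1),
# writing each conditional term as a difference of joint entropies.
H1 = H(P.sum(axis=(1, 2)))                 # H(X1)
H2_given_1 = H(P.sum(axis=2)) - H1         # H(X1,X2) - H(X1)
H3_given_12 = H(P) - H(P.sum(axis=2))      # H(X1,X2,X3) - H(X1,X2)
assert np.isclose(H(P), H1 + H2_given_1 + H3_given_12)
```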
Outline of proofs.

Chain rule for entropies, by repeated application of the two variable expansion rule :
H(X1 , X2 ) = H(X1 ) + H(X2 |X1 ) (6)
H(X1 , X2 , X3 ) = H(X1 ) + H(X2 , X3 |X1 ) (7)
= H(X1 ) + H(X2 |X1 ) + H(X3 |X2 , X1 ) (8)
⋮   (9)
H(X1 , X2 , . . . , Xn ) = H(X1 ) + H(X2 |X1 ) + . . . + H(Xn |Xn−1 , . . . , X1 ) (10)
Chain rule for information :
I(X1 , X2 , . . . , Xn ; Y) = H(X1 , X2 , . . . , Xn ) − H(X1 , X2 , . . . , Xn |Y) (11)
= ∑_{i=1}^{n} H(Xi|Xi−1, ..., X1) − ∑_{i=1}^{n} H(Xi|Xi−1, ..., X1, Y)   (12)
= ∑_{i=1}^{n} I(Xi; Y|Xi−1, ..., X1)   (13)

Equivalent definition of I(X ; Y|Z)


I(X;Y|Z) = ∑_{i,j,k} P(Xi, Yj, Zk) log [ P(Xi, Yj|Zk) / ( P(Xi|Zk)P(Yj|Zk) ) ] = ∑_{k} P(Zk) I(X;Y|Zk)   (14)

IT 2005-2, note 7
2. Conditional independence and data processing inequality
Consider three discrete random variables : X , Y, Z
They are said to form a Markov chain if Z is conditionally indep. of X given Y.
Notation : Z ⊥ X |Y ⇔ Zi ⊥ Xj |Yk , ∀i, j, k.
In other words P (Z|X , Y) = P (Z|Y)
Interpretation :
Conditioning : suppose Y = Yk given ⇒ P (·) → P (·|Yk )
The probability measure becomes a conditional probability measure.
Cond. indep. ≡ independence under the conditional measure, for any Yk .
Independence is a symmetric relation : Z ⊥ X |Y ⇔ X ⊥ Z|Y.
X , Y, Z form a Markov chain which is denoted by X ↔ Y ↔ Z

[Figure: graphs over X, Y, Z representing the Markov chain X ↔ Y ↔ Z.]
IT 2005-2, slide 8
The graphical representation is again a particular case of a Bayesian belief network, which will be introduced more
precisely later on.

Bayesian belief networks provide a general and very powerful tool in order to handle conditional independence.
Conditional independence is very important as a notion, because for many physical problems it may be used to
represent causal relationships. Thus, the structure of conditional independence of stochastic models may be deduced
from physical causality and structure.

Consider a communication system composed of two channels in series : X represents messages chosen by a source,
Y messages at the receiving end of the first channel, and Z the messages at the receiving end of the second channel.
These three random variables obviously represent a Markov chain.

Similarly, look at an industrial two stage process : X represents the characteristics of the input material; Y the
characteristics of the output of the first stage and Z the characteristics of the output of the second stage. If Y is a
precise enough description, then again we have a Markov chain. This means that if we are able to observe the output
of the first stage, and want to predict what will happen during the second stage, the history X of the material is
irrelevant.

This notion of sufficiently precise description of a process at an intermediate stage, is what we call in system theory
the state of the system.

IT 2005-2, note 8
NB : these ideas may be applied to sets of variables :
X1, X2, ... ↔ Y1, Y2, ... ↔ Z1, Z2, ...
X1 ↔ X2 ↔ · · · ↔ Xk ↔ · · · ↔ Xn−1 ↔ Xn
Remarks.
If X ↔ Y ↔ Z then
P (X , Y, Z) = P (X )P (Y|X )P (Z|Y) = P (Z)P (Y|Z)P (X |Y).
Data processing inequality
If X ↔ Y ↔ Z form a Markov chain then I(X ; Y) ≥ I(X ; Z).

Indeed : chain rule of information applied in two ways to I(X ; Y, Z):


I(X ; Z) + I(X ; Y|Z) = I(X ; Y, Z) = I(X ; Y) + I(X ; Z|Y).
Since X and Z are conditionally independent, we have I(X; Z|Y) = 0, and hence
I(X ; Z) ≤ I(X ; Y).

IT 2005-2, slide 9
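The inequality can be illustrated with two binary symmetric channels in series (the flip probabilities 0.1 and 0.2 are illustrative choices):

```python
import numpy as np

def H(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_info(Pxy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) from a joint table."""
    return H(Pxy.sum(axis=1)) + H(Pxy.sum(axis=0)) - H(Pxy)

def bsc(eps):
    """Row-stochastic transition matrix of a binary symmetric channel."""
    return np.array([[1 - eps, eps], [eps, 1 - eps]])

Px = np.array([0.5, 0.5])                 # uniform source X
A, B = bsc(0.1), bsc(0.2)                 # X -> Y, then Y -> Z

Pxy = Px[:, None] * A                     # P(X,Y) = P(X) P(Y|X)
Pxz = Px[:, None] * (A @ B)               # channels in series: P(Z|X) = (A B)

# Data processing inequality: I(X;Z) <= I(X;Y)
assert mutual_info(Pxz) <= mutual_info(Pxy) + 1e-12
```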
Examples
If Z is a function of Y it is conditionally independent of X .
(Hence also X ↔ Y ↔ Y)
If Z is a function of Y and another r.v. independent of X and Y, it is also conditionally independent of X.
Interpretation
The theorem tells us that whatever we do with Y in terms of data processing, there is
no hope to gain more information about X than what is provided by Y :
⇒ no way to create information by data processing.
Questions:
If A is an event of positive probability, what is the value of P (A|A)?
What is the meaning (value) of P (X , Y, Y) ?
Is it true that P (Y|X , Y) = P (Y|Y) ?

IT 2005-2, slide 10
Another consequence
If X ↔ Y ↔ Z then I(X ; Y|Z) ≤ I(X ; Y).
In other words, in a Markov chain conditioning decreases mutual information.
This property is not true in general.
In other words, it is possible that I(X ; Y|Z) > I(X ; Y) when X , Y, Z do not form
a Markov chain.
For example
Consider the double coin flipping experiment.
Compute I(H1 ; S) and I(H1 ; S|H2 ).
This finishes our study of information measures (algebra).
We will come back later to these notions for continuous random variables.

IT 2005-2, slide 11
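An extreme numerical illustration of this possibility (it is also the setting of the exercise on the next note): X and Y independent fair bits with Z = X ⊕ Y give I(X;Y) = 0 but I(X;Y|Z) = 1 bit.

```python
import numpy as np

def H(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# X, Y independent fair bits; Z = X xor Y. Build the joint P(X,Y,Z).
P = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        P[x, y, x ^ y] = 0.25

I_XY = H(P.sum(axis=(1, 2))) + H(P.sum(axis=(0, 2))) - H(P.sum(axis=2))

# I(X;Y|Z) = H(X|Z) - H(X|Y,Z), each term a difference of joint entropies.
H_X_given_Z = H(P.sum(axis=1)) - H(P.sum(axis=(0, 1)))
H_X_given_YZ = H(P) - H(P.sum(axis=0))
I_XY_given_Z = H_X_given_Z - H_X_given_YZ

assert np.isclose(I_XY, 0.0)           # X and Y are independent...
assert np.isclose(I_XY_given_Z, 1.0)   # ...but fully dependent given Z
```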
Exercises

1. Let X , Y, Z be three binary random variables. One gives the following information :

• P (X = 0) = P (Y = 0) = 0.5,
• P (X , Y) = P (X )P (Y)
• Z = (X + Y) mod 2 (i.e. Z = 1 ⇔ X ≠ Y).
(a) What is the value of P (Z = 0) ?
(b) What is the value of H(X ), H(Y), H(Z) ?
(c) What is the value of H(X , Y), H(X , Z), H(Y, Z), H(X , Y, Z) ?
(d) What is the value of I(X ; Y), I(X ; Z), I(Y; Z) ?
(e) What is the value of I(X ; Y, Z), I(Y; X , Z), I(Z; X , Y) ?
(f) What is the value of I(X ; Y|Z), I(Y; X |Z), I(Z; X |Y) ?
(g) Can you draw a Venn diagram which summarizes the situation ?

2. Let X , Y, Z be three discrete random variables. Show that

(a) H(X , Y|Z) ≥ H(X |Z);


(b) I(X , Y; Z) ≥ I(X ; Z);
(c) H(X , Y, Z) − H(X , Y) ≤ H(X , Z) − H(X );
(d) I(X ; Z|Y) ≥ I(Z; Y|X ) − I(Z; Y) + I(X ; Z).

IT 2005-2, note 11
Graphical models for probabilistic inference
Classical logic :
- Start with a theory : set of axioms which are supposed to hold in the physical world
(if X has wings then X is a bird)
- Add observations from the real world : facts (Tweety has wings)
- Infer conclusions about other properties of the real world : Tweety is a bird.
Probabilistic logic :
Same, but statements and axioms are of probabilistic nature.
Inference : from a probabilistic model and observations from the real world, draw
conclusions about unobserved variables.
Graphical models : represent relationships among variables by a graph.
NB.: not all models are graphical...

IT 2005-2, slide 12
Main questions
1. How to build models : from first principles, from observations of nature, from both
2. How to use models : deductive inference
Now, we focus on probabilistic (deductive) inference with graphical models :
⇒ Bayesian networks, decision trees.
Model probabilistic relationships among a set of variables
- We will consider only discrete variables, but theory extends to continuous variables
- Bayesian networks : models for joint probability distributions P (A, B, . . . , U)
- Decision trees : models for conditional probability distributions P (A|B, . . . , U)

IT 2005-2, slide 13
Bayesian networks : models for P (A, B, . . . , U)
NB:
We consider only the case where A, B, ..., U take a finite number of values. Thus,
the number of possible combinations of values is also finite.
Thus P (A, B, . . . , U) can be represented explicitly as a multidimensional table of
numbers in [0; 1] : contingency table
But :
1. Explicit representation becomes quickly intractable (when the number of variables
increases).
2. Explicit representation says nothing about structural relationships of variables (e.g.
conditional independence)
Bayesian networks : compact representation, tractable, and interpretable (explicitly).

IT 2005-2, slide 14
Example of inference using an explicit representation :
Given P (A, B, C, D, E, F ) (model) and the fact (observation or hypothesis) that B =
Bj and C = Ck , what is the probability of event A = Ai ?
In other words compute : P (Ai |Bj , Ck )
Answer :
1. P(Ai|Bj, Ck) = P(Ai, Bj, Ck) / P(Bj, Ck).
2. P(Ai, Bj, Ck) = ∑_{D} ∑_{E} ∑_{F} P(Ai, Bj, Ck, D, E, F)
3. P(Bj, Ck) = ∑_{A} ∑_{D} ∑_{E} ∑_{F} P(A, Bj, Ck, D, E, F)
Comments :
Suppose that the variables assume three values each; then P(A, B, C, D, E, F) is
given by 3^6 − 1 = 728 numbers.
The two sums involve respectively 3^3 = 27 and 3^4 = 81 terms.
In applications (e.g. coding) : thousands of variables ⇒ trivial method breaks down.
IT 2005-2, slide 15
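The three steps translate directly into array operations. A sketch with a randomly filled explicit table (all variables ternary, as in the comments above):

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((3,) * 6)        # explicit table P(A,B,C,D,E,F): 3**6 entries
P /= P.sum()

i, j, k = 0, 1, 2               # query: P(A = Ai | B = Bj, C = Ck)

P_ABC = P.sum(axis=(3, 4, 5))   # step 2: marginalise out D, E, F
P_BC = P_ABC.sum(axis=0)        # step 3: additionally marginalise out A
posterior = P_ABC[i, j, k] / P_BC[j, k]   # step 1: Bayes

assert 0.0 <= posterior <= 1.0
```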
Same problem : we add some structural knowledge
Suppose we know (e.g. because of physical knowledge about the problem) that :
P (A, B, C, D, E, F ) = P (A, B, C)P (D, E, F |A)
and that
P (A, B, C) = P (B)P (C)P (A|BC)

Now we need to specify the model :


− For P(B) and P(C) we need 4 = 2 + 2 numbers.
− For P(A|B, C) we need 2 × 3 × 3 = 18.
− For P(D, E, F|A) we need 3 × (3^3 − 1) = 78.
⇒ Structural knowledge reduces the size of our model from 728 to 4 + 18 + 78 = 100.
Computation of P (Ai |Bj , Ck ) : trivial (table lookup)
What about computation of P (Bj , Ck |Ai ) ? (with and without structural knowledge)

IT 2005-2, slide 16
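The parameter counts above can be reproduced mechanically (each distribution over a ternary variable has 2 free numbers):

```python
v = 3                                   # number of values per variable
full = v**6 - 1                         # explicit table for P(A,B,C,D,E,F)

structured = (v - 1) + (v - 1)          # P(B) and P(C): 2 + 2 = 4
structured += (v - 1) * v * v           # P(A|B,C): 2 free numbers per (B,C) pair
structured += v * (v**3 - 1)            # P(D,E,F|A): 26 free numbers per value of A

print(full, structured)                 # 728 versus 100
```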
Models are useful to provide not only accurate but also compact representations of the reality. In general, there is a
tradeoff between model complexity and accuracy. Models are useful only if we are able to exploit them in order to
understand or predict behavior of reality : in most situations tractability is possible only at the expense of accuracy.

Next week, when we will focus on channel coding, we will see that in order to efficiently exploit noisy channels it
is necessary to manipulate very long sequences of symbols (long messages). For example, in the context of Turbo-
codes typical message lengths which are manipulated are in the interval [1000 . . . 100000]. This means that we need
to manipulate joint probability distributions of more than 1000 to 100000 binary variables, which would be totally
impossible if we were to use explicit table-lookup models.

For those who are not yet convinced, let us make the explicit calculation: if N = 1000, a channel code will
comprise 2^1000 ≈ 10^301 code words. If every electron of the Universe (there are about 10^80) was a 1000 GHz
processor able to store and retrieve the probability of such a code word in a single instruction, one could handle
10^12 × 10^80 ≈ 10^92 code words per second, and in a period equal to the age of the Universe (3 × 10^17 seconds),
these computers would handle 3 × 10^109 code words. To handle all words, we would still need to wait for a period
equal to roughly 10^191 times the age of our Universe!

Nevertheless, by using compact models it is possible to handle the channel encoding and decoding tasks efficiently
(in linear time with respect to the message length).

Later we will introduce stochastic process models. A stochastic process is a sequence of random variables corre-
sponding to successive time instants (we will only consider discrete time models in this course). As time can grow
indefinitely, a stochastic process is actually an infinite collection of random variables. Still, it is possible to devise
very compact probabilistic models of such processes : actually with a few numbers it is possible to characterize the
joint probability distribution of any finite subcollection of random variables of the process.

IT 2005-2, note 16
Bayesian network : definition
Directed acyclic graph :
- nodes model variables (one node for each variable)
- Arcs model causal relations among variables (conditional independence relations)

[Figure: example Bayesian network of a family: grand-parents GMM, GPM, GMP, GPP; parents M and P; children E, F1, F2 and grand-child e; arcs point from parents to children. The annotations P(M), D(M) and N D(M) show how node M partitions the graph.]

IT 2005-2, slide 17
The figure illustrates an example Bayesian network, which is supposed to model the relationships among the color
of eyes of different people in a family. The network actually models the ancestral relationships among the persons of
this family. Each node represents the color (blue or brown) of one person. The arcs indicate which are the children
of a person.

Note that this model does not pretend to be a correct view of Mendelean genetics. We will see later that this model
is slightly more complex, but can still be easily represented by a Bayesian network. For the time being, we will use
this naive picture of genetics as our running example to explain main concepts in Bayesian networks.

Terminology and notation

We use the same notation (round uppercase) to represent variables and nodes, since they are in one-to-one correspondence.

Let Xk denote a node in the graph G. Then we denote by :


- P(Xk) the set of parent nodes of Xk, i.e. the origins of the arcs pointing towards Xk.
- F(Xk) the children of Xk, i.e. the set {Xj ∈ G | Xk ∈ P(Xj)}
- D(Xk) the descendents of Xk, i.e. the set of nodes which are in F(Xk), or descendents of a node in F(Xk).
- N D(Xk) the non-descendents of Xk, i.e. the set G \ ({Xk} ∪ D(Xk))

The figure illustrates these notions for the node M, and shows how this node partitions the graph.

Defining property of Bayesian networks

For any variable X ∈ G, and any subset of variables W ⊆ N D(X), we have P(X|P(X), W) = P(X|P(X)),
i.e. once the parents of a variable are given, it becomes independent of all its other non-descendents.

IT 2005-2, note 17
Factorisation property
Suppose we are given a Bayesian network G = {X1 , . . . Xn } and for each variable
Xi ∈ G we are also given P (Xi |P(Xi )), then
P(X1, ..., Xn) = ∏_{i=1}^{n} P(Xi|P(Xi))

Note that for those variables for which P(Xi ) = ∅ we are given the prior P (Xi ).
Comments :
As long as the P (Xi |P(Xi )) are not specified, a Bayesian network is meant to repre-
sent all distributions which can be factorized in this way.
Any probability distribution may be represented in many ways by a Bayesian net-
work, but not necessarily all conditional independence structures may be derived
explicitly from a Bayesian network structure.
There exist probability distributions leading to independence relations which can not
be represented completely by any Bayesian network.

IT 2005-2, slide 18
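A minimal sketch of the factorisation property on a hypothetical three-node network X → Z ← Y with binary variables (the CPT values are arbitrary illustrative choices):

```python
import numpy as np

Px = np.array([0.6, 0.4])                       # P(X): X has no parents
Py = np.array([0.3, 0.7])                       # P(Y): Y has no parents
Pz_xy = np.array([[[0.9, 0.1], [0.5, 0.5]],
                  [[0.2, 0.8], [0.4, 0.6]]])    # Pz_xy[x, y, z] = P(z | x, y)

# Factorisation property: P(X,Y,Z) = P(X) P(Y) P(Z|X,Y)
P = Px[:, None, None] * Py[None, :, None] * Pz_xy
assert np.isclose(P.sum(), 1.0)

# Defining property: Y is a non-descendent of X and neither has parents,
# so the network implies X ⊥ Y, which the reconstructed joint confirms.
Pxy = P.sum(axis=2)
assert np.allclose(Pxy, np.outer(Px, Py))
```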
Simple examples : some (all ?) three variable networks

- Complete DAG (X → Y, X → Z, Y → Z) : P(X,Y,Z) = P(X)P(Y|X)P(Z|X,Y)
- No arcs : P(X,Y,Z) = P(X)P(Y)P(Z)
- Chain X → Y → Z : P(X,Y,Z) = P(X)P(Y|X)P(Z|Y)
- Chain Z → Y → X : P(X,Y,Z) = P(Z)P(Y|Z)P(X|Y)
- Fork X ← Z → Y : P(X,Y,Z) = P(Z)P(X|Z)P(Y|Z)
- Collider X → Z ← Y : P(X,Y,Z) = P(Z|X,Y)P(X)P(Y)

IT 2005-2, slide 19
The correct Mendel model of eye colors

Produced by JavaBayes tool

IT 2005-2, slide 20
Here is the complete model of our earlier example related to genetics.

We have added new variables denoted by GT... for each individual which denote the two versions of the gene which
determine the eye color for each individual. The variables may take on three values bb, bB, and BB, where b
stands for blue and B for brown. We assume that the prior (or marginal) probability of these three values for the
grand-parent generation are 0.25, 0.5, 0.25; all the other (conditional) probability distributions are deduced from
the Mendel model: the relation between parent and children genotypes assumes that one of the two chromosomes is
chosen at random (probability 0.5 each) and the relation between phenotype and genotype is deterministic, assuming
that B (brown) is the dominant character.

Notice that this network models the genotype (genes) of the individuals, and the relationship between the genotype
and the observed variables (eye colors). In spite of the fact that genotypes can not be observed directly, it is possible
to use this model to infer unobserved genotypes and phenotypes from the observed phenotypes (eye colors).

One particularity of this network is that all the conditional probabilities are identical (all people behave in the same
way from the viewpoint of our model). The network can be extended to a whole population and be used to model
relationships between successive generations and how one can observe genetic drift.

The present example is available on the web page http://www.montefiore.ulg.ac.be/˜lwh/javabayes, where you can
use a Java applet to simulate the network and see how it reacts to observations. On the same page you can also try
out the earlier naive version of the same problem, and compare the differences.

The prior probability distributions have been chosen so that without any observations all individuals have the same
marginal probability distribution of genotypes (and hence phenotypes). This is what we will later on denote by
stationary conditions. It turns out that, even if in the earlier generations the prior distribution is different from the
stationary distribution, after a large enough number of generations the system converges to the stationary distribu-
tion.

IT 2005-2, note 20
Graphical models of communication systems
First order Markov model of a black & white scanner

[Figure: a chain of binary (0/1) pixel variables, each depending on the previous one.]

Same compressed by a 4 bit block code

[Figure: the same chain of 0/1 pixel variables, with blocks of pixels feeding code-word variables.]

Same, encoded and sent through a noisy, memoryless communication channel

[Figure: the 0/1 source variables followed by the stages Source, Conv. code, Channel noise, Received message.]
IT 2005-2, slide 21
D-separation and conditional independence relations induced by a BN
Some comments on the notion of independence of sets of random variables
Let A = {X1, ..., Xl} and B = {Y1, ..., Ym} be two sets of random variables.
- What is the meaning of A ⊥ B ?
- Is it true that A ⊥ B ⇒ (∀i, j : Xi ⊥ Yj) ?
- And/or is the converse true, i.e. (∀i, j : Xi ⊥ Yj) ⇒ A ⊥ B ?
D-separation: definition
Let us denote by A, B, C three disjoint subsets of r.v. of a BN, and let us assume that
A and C are non empty.
Let us consider paths over the undirected version of the DAG, from A to C.
We say that A and C are d-separated by B if all paths from A to C are blocked by B.

IT 2005-2, slide 22
By definition, a path is blocked by B if it goes through a variable, say Xk, such that one of the following holds:

1. The pattern → Xk → appears in the path and Xk ∈ B
2. The pattern → Xk ← appears in the path and ({Xk} ∪ D(Xk)) ∩ B = ∅
3. The pattern ← Xk → appears in the path and Xk ∈ B

[Figure: the three blocking patterns on a path from A to C, annotated with the conditions Xk ∈ B and Xk ∉ B.]

Paths which are not blocked are said to be active (w.r.t. A, C and B).

IT 2005-2, slide 23
D-separation: fundamental property
If A, B, C are three disjoint sets of variables (B may be empty) of a Bayesian network,
then "A and C are d-separated by B" ⇒ A ⊥ C|B.
Notice that, if A and C are d-separated by B then any subset of A is d-separated from
any subset of C by B.
Notice also that we can change directions of some arrows in the graph without changing
d-separations, provided that we don't change the set of → Xk ← structures.
Thus, to represent the conditional independences one often uses so-called essential
graphs, obtained from a DAG by replacing arrows which do not participate in a V-structure
by lines.
Belief propagation
D-separation also leads to the design of effective belief propagation algorithms.
(See course notes and subsequent lessons).

IT 2005-2, slide 24
There is much more to say about Bayesian belief networks, but limited time in the context of this course does not
allow to go further in depth.

Bayesian networks were proposed in the eighties by Judea Pearl, in order to provide modelling tools for reasoning
under uncertainty in artificial intelligence (e.g. expert systems for medical diagnosis).

In the meanwhile, both theory and practice have progressed significantly, and although the field has not yet reached
full maturity there are already many significant real applications.

One of the complex questions, as regards inference, is to devise efficient algorithms to propagate evidence through
the network. If the network has a tree structure this is a rather easy task (a generalization of the forward-backward
algorithm used for hidden Markov chains leads to an efficient algorithm). If the network is not a tree, one approach
consists of grouping variables so as to yield a tree (the so-called junction tree algorithm); another approach is to use
approximate (but efficient) algorithms for probability propagation.

The other main problem under consideration in research concerns the automatic design of probabilistic models from
data. Here also, there is still a lot to do.

IT 2005-2, note 24
Probabilistic reasoning and questionnaires
Let us consider a medical diagnostic problem and its probabilistic model. Let us
denote by D a variable which is true when the patient under consideration has a
certain disease (say hepatitis).
Ω is the set of all possible patients who may visit an M.D.
In order to make a diagnosis, the doctor will typically look at the symptoms
(concentration of various types of blood cells, eye color, skin color, temperature, ...)
and ask questions about antecedents (factors such as age, nutrition, smoking, addiction
to heroin, ...).
Note that not all questions have the same relevance, and in general the relevance
of a question depends on the already observed variables. Typically, the
doctor would like to reach conclusions about the diagnosis by asking only relevant
and informative questions.
Problem : how to design an efficient strategy for the diagnosis ?

IT 2005-2, slide 25
Probabilistic model
Suppose that we have a model for P (D, A1 , . . . , An , S1 , . . . , Sm ).
We can measure the residual uncertainty of the diagnosis problem by

H(D|A1 , . . . , An , S1 , . . . , Sm )

i.e. the uncertainty which can not be reduced by observations.


If the disease is well known, hopefully this quantity will be small.
Note that, if we forbid the use of one of the possible observations, say A1 , then the
residual uncertainty increases

H(D|A2 , . . . , An , S1 , . . . , Sm ) ≥ H(D|A1 , . . . , An , S1 , . . . , Sm )

but this does not mean that all questions are relevant in all cases.
Suppose we are allowed only to ask one single question (observe one of Ai or Sj ),
then we would choose X ∈ {A1 , . . . , An , S1 , . . . , Sm } maximizing I(X ; D).

IT 2005-2, slide 26
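The selection rule in the last paragraph is a one-line computation once the joint model is available. A sketch with a hypothetical joint P(D, S1, S2, S3) over binary variables (the random table stands in for a real medical model):

```python
import numpy as np

def H(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_info(Pxy):
    return H(Pxy.sum(axis=1)) + H(Pxy.sum(axis=0)) - H(Pxy)

rng = np.random.default_rng(2)
P = rng.random((2, 2, 2, 2))            # hypothetical joint P(D, S1, S2, S3)
P /= P.sum()

# For each candidate observation Sq, marginalise the others and score I(Sq; D).
margin = {"S1": (2, 3), "S2": (1, 3), "S3": (1, 2)}
scores = {q: mutual_info(P.sum(axis=axes)) for q, axes in margin.items()}
best = max(scores, key=scores.get)       # the single most informative question
print(best, scores)
```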
Strategy : same as decision tree
[Figure: example decision tree. The root test is "Skin colour" (yellow / other). The yellow branch leads to "Did you visit Asia?" (y / n), followed by "Did you eat raw meat?" and "Do you take heroin?"; the other branch leads to "Eye color". One terminal node is labelled P(D=T) = 0.99.]

IT 2005-2, slide 27
The test nodes of the tree (square boxes on the figure) represent essentially questions or observations that may be
made by the doctor. The terminal nodes represent conclusions that will be drawn: the doctor stops asking questions
and decides that it is either very likely or very unlikely that the patient has hepatitis, or possibly decides that he is
still uncertain and the patient should go to a specialist (who will ask more questions).

The tree structure defines the strategy that the doctor will use to reach a decision : the top-node (root of the tree)
defines the first question, and successors define the substrategies depending on the obtained answer. Note that the
terminal nodes of the tree are a function T of the test variables : using the tree is equivalent to observing this variable.

Tree construction algorithms :

A good decision tree is one that minimizes the average conditional entropy at the leaf nodes and at the same time
minimizes complexity of the tree (different measures). If we know P (D, A1 , . . . , An , S1 , . . . , Sm ) (say we have
a Bayesian network) we can try to find an optimal tree, say one which minimizes

H(D|T ) + βComplexity

Brute force :
- generate all possible trees (there is only a finite number of trees)
- for each tree compute H(D|T ) + βComplexity (can be done using P (D, A1 , . . . , An , S1 , . . . , Sm ))
- keep the best one.

Hill climbing :
- select the variable maximizing I(X ; D) at the root node
- for each value Xi of X use P (D, A1 , . . . , An , S1 , . . . , Sm |Xi ) to build subtree.
- stop when H(D|T ) + βComplexity starts to increase.

IT 2005-2, note 27
