
Introduction to information theory and coding

Louis WEHENKEL
Set of slides No 2

• Entropies and information measures


• Chain rules for entropy and information
• More about independence, and conditional independence
• Translation of these properties into properties of information measures
• Data processing inequality
• Bayesian networks and decision trees

IT 2005-2, slide 1
Conditional (a posteriori) entropy

H(X|Y) = − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Xi ∩ Yj) log P(Xi|Yj).   (1)

The entropy of X knowing that Y = Yj is


H(X|Yj) = − ∑_{i=1}^{n} P(Xi|Yj) log P(Xi|Yj),   (2)

it is nonnegative (it is an entropy) and one has


H(X|Y) = ∑_{j=1}^{m} P(Yj) H(X|Yj),   (3)

hence the latter is also nonnegative.


And concavity of Hn implies H(X|Y) ≤ H(X), which is a fundamental property!

IT 2005-2, slide 2
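The definitions above are easy to check numerically. Below is a minimal Python sketch (the joint table is an arbitrary illustrative choice, not taken from the slides) computing H(X|Y) via equation (3) and verifying H(X|Y) ≤ H(X):

```python
import numpy as np

# Hypothetical joint distribution P(Xi ∩ Yj): rows index X, columns index Y.
P = np.array([[0.25, 0.25],
              [0.00, 0.50]])

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P_Y = P.sum(axis=0)                          # marginal P(Yj)
# Equation (3): H(X|Y) = sum_j P(Yj) H(X|Yj)
H_X_given_Y = sum(P_Y[j] * entropy(P[:, j] / P_Y[j])
                  for j in range(P.shape[1]))
H_X = entropy(P.sum(axis=1))                 # H(X) from the marginal of X

# Concavity consequence: conditioning cannot increase entropy.
assert H_X_given_Y <= H_X + 1e-12
print(H_X, H_X_given_Y)
```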
Joint entropy and its relationship with conditional entropy

H(X,Y) ≜ − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Xi ∩ Yj) log P(Xi ∩ Yj)
       = − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Yj)P(Xi|Yj) log( P(Yj)P(Xi|Yj) )
       = − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Yj)P(Xi|Yj) log P(Yj)
         − ∑_{i=1}^{n} ∑_{j=1}^{m} P(Yj)P(Xi|Yj) log P(Xi|Yj)
       = − ∑_{j=1}^{m} P(Yj) ( ∑_{i=1}^{n} P(Xi|Yj) ) log P(Yj) + H(X|Y)
       = H(Y) + H(X|Y)
       = H(X) + H(Y|X).

IT 2005-2, slide 3
Inequalities related to the entropy
One deduces the following inequalities :
H(X , Y) ≥ max (H(X ), H(Y))
H(X , Y) ≤ H(X ) + H(Y)
Conclusion :
H(X , Y) ≤ H(X ) + H(Y) ≤ 2H(X , Y) (4)

Particular cases :
X and Y independent : H(X,Y) = H(X) + H(Y)
(because then P (Xi ∩ Yj ) = P (Xi )P (Yj )) (⇒ H(X |Y) = H(X ))
X function of Y : H(X , Y) = H(Y).
(because then H(X |Y) = 0) (since H(X |Yj ) = 0, ∀j = 1, . . . , m)

IT 2005-2, slide 4
Mutual information

I(X;Y) = ∑_{i=1}^{n} ∑_{j=1}^{m} P(Xi ∩ Yj) log [ P(Xi ∩ Yj) / ( P(Xi)P(Yj) ) ].   (5)
One can derive :

I(X ; Y) = H(X ) − H(X |Y) = H(Y) − H(Y|X )

and hence
I(X ; Y) = H(X ) + H(Y) − H(X , Y)
which we may also write as

H(X , Y) = H(X ) + H(Y) − I(X ; Y)

Main conclusion :

0 ≤ I(X ; Y) ≤ min{H(X ), H(Y)}

IT 2005-2, slide 5
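These identities can be verified numerically. The sketch below (joint table chosen arbitrarily for illustration) computes I(X;Y) from definition (5) and checks it against H(X)+H(Y)−H(X,Y) and H(X)−H(X|Y):

```python
import numpy as np

P = np.array([[0.3, 0.1],      # hypothetical joint P(Xi ∩ Yj)
              [0.2, 0.4]])

def H(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

Px, Py = P.sum(axis=1), P.sum(axis=0)

# Definition (5), skipping zero-probability cells.
I = sum(P[i, j] * np.log2(P[i, j] / (Px[i] * Py[j]))
        for i in range(2) for j in range(2) if P[i, j] > 0)

H_X_given_Y = H(P) - H(Py)                    # H(X|Y) = H(X,Y) - H(Y)
assert np.isclose(I, H(Px) + H(Py) - H(P))    # I = H(X) + H(Y) - H(X,Y)
assert np.isclose(I, H(Px) - H_X_given_Y)     # I = H(X) - H(X|Y)
assert 0 <= I <= min(H(Px), H(Py))            # main conclusion of the slide
```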
Exercises.

1. Show that indeed (and in the given order)

1. H(X , Y) = H(Y) + H(X |Y) = H(X ) + H(Y|X )

2. H(X , Y) ≥ max{H(X ), H(Y)}

3. H(X |Y) ≤ H(X )

4. H(X , Y) ≤ H(X ) + H(Y)

5. I(X ; Y) = H(X ) + H(Y) − H(X , Y) = H(X ) − H(X |Y)

2. A tournament between two teams consists of a sequence of at most 5 games which stops as soon as one of the two
teams has won three games. Let a and b denote the two teams and X a r.v. which represents the outcome of a tournament
between a and b. For example, X = aaa, babab, bbaaa are possible values of X (there are other possible values).
Let Y denote the random variable which denotes the number of games played (thus Y takes its values in {3, 4, 5}).

Suppose that the teams are of the same strength and the outcomes of the successive games are independent, and
compute H(X ),H(Y), H(X |Y) and H(Y|X ).

Let Z = {a, b} denote the random variable which identifies the team winning the tournament. Determine H(X |Z),
compare with H(X ) and justify the result. Determine H(Z|X ), and justify.

IT 2005-2, note 5
Summary

[Venn diagram: two overlapping regions representing H(X) and H(Y). The part of H(X) outside the overlap is H(X|Y), the part of H(Y) outside the overlap is H(Y|X), the overlap is I(X;Y), and the union of the two regions is H(X,Y).]

Particular cases
X and Y independent : I(X;Y) = 0 (necessary and sufficient).
X function of Y : I(X ; Y) = H(X ).
X one-to-one function of Y : I(X ; Y) = H(X ) = H(Y)

IT 2005-2, slide 6
Exercises.

1. Consider the following contingency table

      Y1    Y2
X1   1/3   1/3
X2    0    1/3

Compute (logarithms in base 2) :

1. H(X ), H(Y)

2. H(X |Y), H(Y|X )

3. H(X , Y)

4. H(Y) − H(Y|X )

5. I(X ; Y)

6. Draw a Venn diagram.

2. Consider three random variables X , Y, Z.

Prove that H(X,Y|Z) = H(X|Z) + H(Y|X,Z).

IT 2005-2, note 6
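For exercise 1, the requested quantities can be checked with a few lines of Python (a numerical companion, not a substitute for the pencil-and-paper derivation):

```python
import numpy as np

# Joint distribution read off the contingency table (rows X1, X2; columns Y1, Y2).
P = np.array([[1/3, 1/3],
              [0.0, 1/3]])

def H(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_X, H_Y, H_XY = H(P.sum(axis=1)), H(P.sum(axis=0)), H(P)
H_X_given_Y = H_XY - H_Y            # H(X|Y) = H(X,Y) - H(Y)
H_Y_given_X = H_XY - H_X
I_XY = H_X + H_Y - H_XY

print(f"H(X)={H_X:.4f}  H(Y)={H_Y:.4f}  H(X,Y)={H_XY:.4f}")
print(f"H(X|Y)={H_X_given_Y:.4f}  H(Y|X)={H_Y_given_X:.4f}  I(X;Y)={I_XY:.4f}")
```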
Other important properties
1. Chain rules
A. Entropies
H(X1, X2, ..., Xn) = ∑_{i=1}^{n} H(Xi | Xi−1, ..., X1)

B. Informations
I(X1, X2, ..., Xn; Y) = ∑_{i=1}^{n} I(Xi; Y | Xi−1, ..., X1)

NB: Conditional mutual information of X and Y given Z is defined by



I(X ; Y|Z) = H(X |Z) − H(X |Y, Z).
Almost the same as before, but one uses P(·|Zk) (and averages w.r.t. Zk).

IT 2005-2, slide 7
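The chain rules are easy to sanity-check numerically, since each conditional entropy is a difference of joint entropies. A sketch with an arbitrary random joint over three binary variables:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))
P /= P.sum()                     # arbitrary joint P(X1, X2, X3)

def H(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Chain rule A: H(X1,X2,X3) = H(X1) + H(X2|X1) + H(X3|X2,X1),
# writing each conditional term as a difference of joint entropies.
H1 = H(P.sum(axis=(1, 2)))                 # H(X1)
H2_given_1 = H(P.sum(axis=2)) - H1         # H(X1,X2) - H(X1)
H3_given_12 = H(P) - H(P.sum(axis=2))      # H(X1,X2,X3) - H(X1,X2)
assert np.isclose(H(P), H1 + H2_given_1 + H3_given_12)
```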
Outline of proofs.

Chain rule for entropies, by repeated application of the two variable expansion rule :
H(X1 , X2 ) = H(X1 ) + H(X2 |X1 ) (6)
H(X1 , X2 , X3 ) = H(X1 ) + H(X2 , X3 |X1 ) (7)
= H(X1 ) + H(X2 |X1 ) + H(X3 |X2 , X1 ) (8)
⋮   (9)
H(X1 , X2 , . . . , Xn ) = H(X1 ) + H(X2 |X1 ) + . . . + H(Xn |Xn−1 , . . . , X1 ) (10)
Chain rule for information :
I(X1 , X2 , . . . , Xn ; Y) = H(X1 , X2 , . . . , Xn ) − H(X1 , X2 , . . . , Xn |Y) (11)
= ∑_{i=1}^{n} H(Xi|Xi−1, ..., X1) − ∑_{i=1}^{n} H(Xi|Xi−1, ..., X1, Y)   (12)
= ∑_{i=1}^{n} I(Xi; Y|Xi−1, ..., X1)   (13)

Equivalent definition of I(X ; Y|Z)


I(X;Y|Z) = ∑_{i,j,k} P(Xi, Yj, Zk) log [ P(Xi, Yj|Zk) / ( P(Xi|Zk)P(Yj|Zk) ) ] = ∑_{k} P(Zk) I(X;Y|Zk)   (14)

IT 2005-2, note 7
2. Conditional independence and data processing inequality
Consider three discrete random variables : X , Y, Z
They are said to form a Markov chain if Z is conditionally indep. of X given Y.
Notation : Z ⊥ X |Y ⇔ Zi ⊥ Xj |Yk , ∀i, j, k.
In other words P (Z|X , Y) = P (Z|Y)
Interpretation :
Conditioning : suppose Y = Yk given ⇒ P (·) → P (·|Yk )
The probability measure becomes a conditional probability measure.
Cond. indep. ≡ independence under the conditional measure, for any Yk .
Independence is a symmetric relation : Z ⊥ X |Y ⇔ X ⊥ Z|Y.
X , Y, Z form a Markov chain which is denoted by X ↔ Y ↔ Z

[Figure: graphs over X, Y, Z representing the Markov chain X ↔ Y ↔ Z.]
IT 2005-2, slide 8
The graphical representation is again a particular case of a Bayesian belief network, which will be introduced more
precisely later on.

Bayesian belief networks provide a general and very powerful tool in order to handle conditional independence.
Conditional independence is very important as a notion, because for many physical problems it may be used to
represent causal relationships. Thus, the structure of conditional independence of stochastic models may be deduced
from physical causality and structure.

Consider a communication system composed of two channels in series : X represents messages chosen by a source,
Y messages at the receiving end of the first channel, and Z the messages at the receiving end of the second channel.
These three random variables obviously represent a Markov chain.

Similarly, look at an industrial two stage process : X represents the characteristics of the input material; Y the
characteristics of the output of the first stage and Z the characteristics of the output of the second stage. If Y is a
precise enough description, then again we have a Markov chain. This means that if we are able to observe the output
of the first stage, and want to predict what will happen during the second stage, the history X of the material is
irrelevant.

This notion of sufficiently precise description of a process at an intermediate stage, is what we call in system theory
the state of the system.

IT 2005-2, note 8
NB : these ideas may be applied to sets of variables :
X1, X2, ... ↔ Y1, Y2, ... ↔ Z1, Z2, ...
X1 ↔ X2 ↔ · · · ↔ Xk ↔ · · · ↔ Xn−1 ↔ Xn
Remarks.
If X ↔ Y ↔ Z then
P (X , Y, Z) = P (X )P (Y|X )P (Z|Y) = P (Z)P (Y|Z)P (X |Y).
Data processing inequality
If X ↔ Y ↔ Z form a Markov chain then I(X ; Y) ≥ I(X ; Z).

Indeed : chain rule of information applied in two ways to I(X ; Y, Z):


I(X ; Z) + I(X ; Y|Z) = I(X ; Y, Z) = I(X ; Y) + I(X ; Z|Y).
Since X and Z are conditionally independent, we have I(X; Z|Y) = 0, and hence
I(X ; Z) ≤ I(X ; Y).

IT 2005-2, slide 9
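The inequality can be illustrated with two binary symmetric channels in series (the flip probabilities 0.1 and 0.2 are illustrative choices):

```python
import numpy as np

def H(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_info(Pxy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) from a joint table."""
    return H(Pxy.sum(axis=1)) + H(Pxy.sum(axis=0)) - H(Pxy)

def bsc(eps):
    """Row-stochastic transition matrix of a binary symmetric channel."""
    return np.array([[1 - eps, eps], [eps, 1 - eps]])

Px = np.array([0.5, 0.5])                 # uniform source X
A, B = bsc(0.1), bsc(0.2)                 # X -> Y, then Y -> Z

Pxy = Px[:, None] * A                     # P(X,Y) = P(X) P(Y|X)
Pxz = Px[:, None] * (A @ B)               # channels in series: P(Z|X) = (A B)

# Data processing inequality: I(X;Z) <= I(X;Y)
assert mutual_info(Pxz) <= mutual_info(Pxy) + 1e-12
```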
Examples
If Z is a function of Y it is conditionally independent of X .
(Hence also X ↔ Y ↔ Y)
If Z is a function of Y and another r.v. independent of X and Y, it is also conditionally independent of X.
Interpretation
The theorem tells us that whatever we do with Y in terms of data processing, there is
no hope to gain more information about X than what is provided by Y :
⇒ no way to create information by data processing.
Questions:
If A is an event of positive probability, what is the value of P (A|A)?
What is the meaning (value) of P (X , Y, Y) ?
Is it true that P (Y|X , Y) = P (Y|Y) ?

IT 2005-2, slide 10
Another consequence
If X ↔ Y ↔ Z then I(X ; Y|Z) ≤ I(X ; Y).
In other words, in a Markov chain conditioning decreases mutual information.
This property is not true in general.
In other words, it is possible that I(X ; Y|Z) > I(X ; Y) when X , Y, Z do not form
a Markov chain.
For example
Consider the double coin flipping experiment.
Compute I(H1 ; S) and I(H1 ; S|H2 ).
This finishes our study of information measures (algebra).
We will come back later to these notions for continuous random variables.

IT 2005-2, slide 11
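An extreme numerical illustration of this possibility (it is also the setting of the exercise on the next note): X and Y independent fair bits with Z = X ⊕ Y give I(X;Y) = 0 but I(X;Y|Z) = 1 bit.

```python
import numpy as np

def H(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# X, Y independent fair bits; Z = X xor Y. Build the joint P(X,Y,Z).
P = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        P[x, y, x ^ y] = 0.25

I_XY = H(P.sum(axis=(1, 2))) + H(P.sum(axis=(0, 2))) - H(P.sum(axis=2))

# I(X;Y|Z) = H(X|Z) - H(X|Y,Z), each term a difference of joint entropies.
H_X_given_Z = H(P.sum(axis=1)) - H(P.sum(axis=(0, 1)))
H_X_given_YZ = H(P) - H(P.sum(axis=0))
I_XY_given_Z = H_X_given_Z - H_X_given_YZ

assert np.isclose(I_XY, 0.0)           # X and Y are independent...
assert np.isclose(I_XY_given_Z, 1.0)   # ...but fully dependent given Z
```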
Exercises

1. Let X , Y, Z be three binary random variables. One gives the following information :

• P (X = 0) = P (Y = 0) = 0.5,
• P (X , Y) = P (X )P (Y)
• Z = (X + Y) mod 2 (i.e. Z = 1 ⇔ X ≠ Y).
(a) What is the value of P (Z = 0) ?
(b) What is the value of H(X ), H(Y), H(Z) ?
(c) What is the value of H(X , Y), H(X , Z), H(Y, Z), H(X , Y, Z) ?
(d) What is the value of I(X ; Y), I(X ; Z), I(Y; Z) ?
(e) What is the value of I(X ; Y, Z), I(Y; X , Z), I(Z; X , Y) ?
(f) What is the value of I(X ; Y|Z), I(Y; X |Z), I(Z; X |Y) ?
(g) Can you draw a Venn diagram which summarizes the situation ?

2. Let X , Y, Z be three discrete random variables. Show that

(a) H(X , Y|Z) ≥ H(X |Z);


(b) I(X , Y; Z) ≥ I(X ; Z);
(c) H(X , Y, Z) − H(X , Y) ≤ H(X , Z) − H(X );
(d) I(X ; Z|Y) ≥ I(Z; Y|X ) − I(Z; Y) + I(X ; Z).

IT 2005-2, note 11
Graphical models for probabilistic inference
Classical logic :
- Start with a theory : set of axioms which are supposed to hold in the physical world
(if X has wings then X is a bird)
- Add observations from the real world : facts (Tweety has wings)
- Infer conclusions about other properties of the real world : Tweety is a bird.
Probabilistic logic :
Same, but statements and axioms are of probabilistic nature.
Inference : from a probabilistic model and observations from the real world, draw
conclusions about unobserved variables.
Graphical models : represent relationships among variables by a graph.
NB.: not all models are graphical...

IT 2005-2, slide 12
Main questions
1. How to build models : from first principles, from observations of nature, from both
2. How to use models : deductive inference
Now, we focus on probabilistic (deductive) inference with graphical models :
⇒ Bayesian networks, decision trees.
Model probabilistic relationships among a set of variables
- We will consider only discrete variables, but theory extends to continuous variables
- Bayesian networks : models for joint probability distributions P (A, B, . . . , U)
- Decision trees : models for conditional probability distributions P (A|B, . . . , U)

IT 2005-2, slide 13
Bayesian networks : models for P (A, B, . . . , U)
NB:
We consider only the case where A, B, ..., U take a finite number of values. Thus,
the number of possible combinations of values is also finite.
Thus P (A, B, . . . , U) can be represented explicitly as a multidimensional table of
numbers in [0; 1] : contingency table
But :
1. Explicit representation becomes quickly intractable (when the number of variables
increases).
2. Explicit representation says nothing about structural relationships of variables (e.g.
conditional independence)
Bayesian networks : compact representation, tractable, and interpretable (explicitly).

IT 2005-2, slide 14
Example of inference using an explicit representation :
Given P (A, B, C, D, E, F ) (model) and the fact (observation or hypothesis) that B =
Bj and C = Ck , what is the probability of event A = Ai ?
In other words compute : P (Ai |Bj , Ck )
Answer :
1. P(Ai|Bj, Ck) = P(Ai, Bj, Ck) / P(Bj, Ck).
2. P(Ai, Bj, Ck) = ∑_{D} ∑_{E} ∑_{F} P(Ai, Bj, Ck, D, E, F)
3. P(Bj, Ck) = ∑_{A} ∑_{D} ∑_{E} ∑_{F} P(A, Bj, Ck, D, E, F)
Comments :
Suppose that the variables assume three values each; then P(A, B, C, D, E, F) is
given by 3^6 − 1 = 728 numbers.
The two sums involve respectively 3^3 = 27 and 3^4 = 81 terms.
In applications (e.g. coding) : thousands of variables ⇒ trivial method breaks down.
IT 2005-2, slide 15
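The three steps translate directly into array operations. A sketch with a randomly filled explicit table (all variables ternary, as in the comments above):

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((3,) * 6)        # explicit table P(A,B,C,D,E,F): 3**6 entries
P /= P.sum()

i, j, k = 0, 1, 2               # query: P(A = Ai | B = Bj, C = Ck)

P_ABC = P.sum(axis=(3, 4, 5))   # step 2: marginalise out D, E, F
P_BC = P_ABC.sum(axis=0)        # step 3: additionally marginalise out A
posterior = P_ABC[i, j, k] / P_BC[j, k]   # step 1: Bayes

assert 0.0 <= posterior <= 1.0
```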
Same problem : we add some structural knowledge
Suppose we know (e.g. because of physical knowledge about the problem) that :
P (A, B, C, D, E, F ) = P (A, B, C)P (D, E, F |A)
and that
P (A, B, C) = P (B)P (C)P (A|BC)

Now we need to specify the model :


− For P(B) and P(C) we need 4 = 2 + 2 numbers.
− For P(A|B, C) we need 2 × 3 × 3 = 18.
− For P(D, E, F|A) we need 3 × (3^3 − 1) = 78.
⇒ Structural knowledge reduces the size of our model from 728 to 4 + 18 + 78 = 100.
Computation of P (Ai |Bj , Ck ) : trivial (table lookup)
What about computation of P (Bj , Ck |Ai ) ? (with and without structural knowledge)

IT 2005-2, slide 16
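The parameter counts above can be reproduced mechanically (each distribution over a ternary variable has 2 free numbers):

```python
v = 3                                   # number of values per variable
full = v**6 - 1                         # explicit table for P(A,B,C,D,E,F)

structured = (v - 1) + (v - 1)          # P(B) and P(C): 2 + 2 = 4
structured += (v - 1) * v * v           # P(A|B,C): 2 free numbers per (B,C) pair
structured += v * (v**3 - 1)            # P(D,E,F|A): 26 free numbers per value of A

print(full, structured)                 # 728 versus 100
```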
Models are useful to provide not only accurate but also compact representations of the reality. In general, there is a
tradeoff between model complexity and accuracy. Models are useful only if we are able to exploit them in order to
understand or predict behavior of reality : in most situations tractability is possible only at the expense of accuracy.

Next week, when we will focus on channel coding, we will see that in order to efficiently exploit noisy channels it
is necessary to manipulate very long sequences of symbols (long messages). For example, in the context of Turbo-
codes typical message lengths which are manipulated are in the interval [1000 . . . 100000]. This means that we need
to manipulate joint probability distributions of more than 1000 to 100000 binary variables, which would be totally
impossible if we were to use explicit table-lookup models.

For those who are not yet convinced, let us make the explicit calculation: if N = 1000, a channel code will
comprise 2^1000 ≈ 10^301 code words. If every electron of the Universe (there are about 10^80) was a 1000 GHz
processor able to store and retrieve the probability of such a code word in a single instruction, one could handle
10^12 × 10^80 ≈ 10^92 code words per second, and in a period equal to the age of the Universe (3 × 10^17 seconds),
these computers would handle 3 × 10^109 code words. To handle all words, we would still need to wait for a period
equal to roughly 10^191 times the age of our Universe!

Nevertheless, by using compact models it is possible to handle the channel encoding and decoding tasks efficiently
(in linear time with respect to the message length).

Later we will introduce stochastic process models. A stochastic process is a sequence of random variables corre-
sponding to successive time instants (we will only consider discrete time models in this course). As time can grow
indefinitely, a stochastic process is actually an infinite collection of random variables. Still, it is possible to devise
very compact probabilistic models of such processes : actually with a few numbers it is possible to characterize the
joint probability distribution of any finite subcollection of random variables of the process.

IT 2005-2, note 16
Bayesian network : definition
Directed acyclic graph :
- nodes model variables (one node for each variable)
- Arcs model causal relations among variables (conditional independence relations)

[Figure: example Bayesian network of a family: grand-parents GMM, GPM, GMP, GPP; parents M and P; children E, F1, F2 and grand-child e; arcs point from parents to children. The annotations P(M), D(M) and N D(M) show how node M partitions the graph.]

IT 2005-2, slide 17
The figure illustrates an example Bayesian network, which is supposed to model the relationships among the color
of eyes of different people in a family. The network actually models the ancestral relationships among the persons of
this family. Each node represents the color (blue or brown) of one person. The arcs indicate which are the children
of a person.

Note that this model does not pretend to be a correct view of Mendelean genetics. We will see later that this model
is slightly more complex, but can still be easily represented by a Bayesian network. For the time being, we will use
this naive picture of genetics as our running example to explain main concepts in Bayesian networks.

Terminology and notation

We use the same notation (round uppercase) to represent variables and nodes, since they are in one-to-one correspondence.

Let Xk denote a node in the graph G. Then we denote by :


- P(Xk) the set of parent nodes of Xk, i.e. the origins of the arcs pointing towards Xk.
- F(Xk) the children of Xk, i.e. the set {Xj ∈ G | Xk ∈ P(Xj)}
- D(Xk) the descendents of Xk, i.e. the set of nodes which are in F(Xk), or descendents of a node in F(Xk).
- N D(Xk) the non-descendents of Xk, i.e. the set G \ ({Xk} ∪ D(Xk))

The figure illustrates these notions for the node M, and shows how this node partitions the graph.

Defining property of Bayesian networks

For any variable X ∈ G, and any subset of variables W ⊆ N D(X), we have P(X|P(X), W) = P(X|P(X)),
i.e. once the parents of a variable are given, it becomes independent of all its other non-descendents.

IT 2005-2, note 17
Factorisation property
Suppose we are given a Bayesian network G = {X1 , . . . Xn } and for each variable
Xi ∈ G we are also given P (Xi |P(Xi )), then
P(X1, ..., Xn) = ∏_{i=1}^{n} P(Xi|P(Xi))

Note that for those variables for which P(Xi ) = ∅ we are given the prior P (Xi ).
Comments :
As long as the P (Xi |P(Xi )) are not specified, a Bayesian network is meant to repre-
sent all distributions which can be factorized in this way.
Any probability distribution may be represented in many ways by a Bayesian net-
work, but not necessarily all conditional independence structures may be derived
explicitly from a Bayesian network structure.
There exist probability distributions leading to independence relations which can not
be represented completely by any Bayesian network.

IT 2005-2, slide 18
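A minimal sketch of the factorisation property on a hypothetical three-node network X → Z ← Y with binary variables (the CPT values are arbitrary illustrative choices):

```python
import numpy as np

Px = np.array([0.6, 0.4])                       # P(X): X has no parents
Py = np.array([0.3, 0.7])                       # P(Y): Y has no parents
Pz_xy = np.array([[[0.9, 0.1], [0.5, 0.5]],
                  [[0.2, 0.8], [0.4, 0.6]]])    # Pz_xy[x, y, z] = P(z | x, y)

# Factorisation property: P(X,Y,Z) = P(X) P(Y) P(Z|X,Y)
P = Px[:, None, None] * Py[None, :, None] * Pz_xy
assert np.isclose(P.sum(), 1.0)

# Defining property: Y is a non-descendent of X and neither has parents,
# so the network implies X ⊥ Y, which the reconstructed joint confirms.
Pxy = P.sum(axis=2)
assert np.allclose(Pxy, np.outer(Px, Py))
```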
Simple examples : some (all ?) three variable networks

- Complete DAG (X → Y, X → Z, Y → Z) : P(X,Y,Z) = P(X)P(Y|X)P(Z|X,Y)
- No arcs : P(X,Y,Z) = P(X)P(Y)P(Z)
- Chain X → Y → Z : P(X,Y,Z) = P(X)P(Y|X)P(Z|Y)
- Chain Z → Y → X : P(X,Y,Z) = P(Z)P(Y|Z)P(X|Y)
- Fork X ← Z → Y : P(X,Y,Z) = P(Z)P(X|Z)P(Y|Z)
- Collider X → Z ← Y : P(X,Y,Z) = P(Z|X,Y)P(X)P(Y)

IT 2005-2, slide 19
The correct Mendel model of eye colors

Produced by JavaBayes tool

IT 2005-2, slide 20
Here is the complete model of our earlier example related to genetics.

We have added new variables denoted by GT... for each individual which denote the two versions of the gene which
determine the eye color for each individual. The variables may take on three values bb, bB, and BB, where b
stands for blue and B for brown. We assume that the prior (or marginal) probability of these three values for the
grand-parent generation are 0.25, 0.5, 0.25; all the other (conditional) probability distributions are deduced from
the Mendel model: the relation between parent and children genotypes assumes that one of the two chromosomes is
chosen at random (probability 0.5 each) and the relation between phenotype and genotype is deterministic, assuming
that B (brown) is the dominant character.

Notice that this network models the genotype (genes) of the individuals, and the relationship between the genotype
and the observed variables (eye colors). In spite of the fact that genotypes can not be observed directly, it is possible
to use this model to infer unobserved genotypes and phenotypes from the observed phenotypes (eye colors).

One particularity of this network is that all the conditional probabilities are identical (all people behave in the same
way from the viewpoint of our model). The network can be extended to a whole population and be used to model
relationships between successive generations and how one can observe genetic drift.

The present example is available on the web page http://www.montefiore.ulg.ac.be/˜lwh/javabayes, where you can
use a Java applet to simulate the network and see how it reacts to observations. On the same page you can also try
out the earlier naive version of the same problem, and compare the differences.

The prior probability distributions have been chosen so that without any observations all individuals have the same
marginal probability distribution of genotypes (and hence phenotypes). This is what we will later on denote by
stationary conditions. It turns out that, even if in the earlier generations the prior distribution is different from the
stationary distribution, after a large enough number of generations the system converges to the stationary distribu-
tion.

IT 2005-2, note 20
Graphical models of communication systems
First order Markov model of a black & white scanner

[Figure: a chain of binary (0/1) pixel variables, each depending on the previous one.]

Same compressed by a 4 bit block code

[Figure: the same chain of 0/1 pixel variables, with blocks of pixels feeding code-word variables.]

Same, encoded and sent through a noisy, memoryless communication channel

[Figure: the 0/1 source variables followed by the stages Source, Conv. code, Channel noise, Received message.]
IT 2005-2, slide 21
D-separation and conditional independence relations induced by a BN
Some comments on the notion of independence of sets of random variables
Let A = {X1, ..., Xl} and B = {Y1, ..., Ym} be two sets of random variables.
- What is the meaning of A ⊥ B ?
- Is it true that A ⊥ B ⇒ (∀i, j : Xi ⊥ Yj) ?
- And/or is the converse true, i.e. (∀i, j : Xi ⊥ Yj) ⇒ A ⊥ B ?
D-separation: definition
Let us denote by A, B, C three disjoint subsets of r.v. of a BN, and let us assume that
A and C are non empty.
Let us consider paths over the undirected version of the DAG, from A to C.
We say that A and C are d-separated by B if all paths from A to C are blocked by B.

IT 2005-2, slide 22
By definition, a path is blocked by B if it goes through a variable, say Xk, such that one of the following holds:

1. The pattern → Xk → appears in the path and Xk ∈ B
2. The pattern → Xk ← appears in the path and ({Xk} ∪ D(Xk)) ∩ B = ∅
3. The pattern ← Xk → appears in the path and Xk ∈ B

[Figure: the three blocking patterns on a path from A to C, annotated with the conditions Xk ∈ B and Xk ∉ B.]

Paths which are not blocked are said to be active (w.r.t. A, C and B).

IT 2005-2, slide 23
D-separation: fundamental property
If A, B, C are three disjoint sets of variables (B may be empty) of a Bayesian network,
then "A and C are d-separated by B" ⇒ A ⊥ C|B.
Notice that, if A and C are d-separated by B then any subset of A is d-separated from
any subset of C by B.
Notice also that we can change directions of some arrows in the graph without changing
d-separations, provided that we don't change the set of → Xk ← structures.
Thus, to represent the conditional independences one often uses so-called essential
graphs, obtained from a DAG by replacing arrows which do not participate in a V-structure
by lines.
Belief propagation
D-separation also leads to the design of effective belief propagation algorithms.
(See course notes and subsequent lessons).

IT 2005-2, slide 24
There is much more to say about Bayesian belief networks, but limited time in the context of this course does not
allow to go further in depth.

Bayesian networks were proposed in the eighties by Judea Pearl, in order to provide modelling tools for reasoning
under uncertainty in artificial intelligence (e.g. expert systems for medical diagnosis).

In the meanwhile, both theory and practice have progressed significantly, and although the field has not yet reached
full maturity there are already many significant real applications.

One of the complex questions, as regards inference, is to devise efficient algorithms to propagate evidence through
the network. If the network has a tree structure this is a rather easy task (a generalization of the forward-backward
algorithm used for hidden Markov chains leads to an efficient algorithm). If the network is not a tree, one approach
consists of grouping variables so as to yield a tree (the so-called junction tree algorithm); another approach is to use
approximate (but efficient) algorithms for probability propagation.

The other main problem under consideration in research concerns the automatic design of probabilistic models from
data. Here also, there is still a lot to do.

IT 2005-2, note 24
Probabilistic reasoning and questionnaires
Let us consider a medical diagnostic problem and its probabilistic model. Let us
denote by D a variable which is true when the patient under consideration has a
certain disease (say hepatitis).
Ω is the set of all possible patients who may visit an M.D.
In order to make a diagnosis, the doctor will typically look at the symptoms
(concentration of various types of blood cells, eye color, skin color, temperature, ...)
and ask questions about antecedents (factors such as age, nutrition, smoking, addiction
to heroin, ...).
Note that not all questions have the same relevance, and in general the relevance
of a question depends on the already observed variables. Typically, the
doctor would like to reach conclusions about the diagnosis by asking only relevant
and informative questions.
Problem : how to design an efficient strategy for the diagnosis ?

IT 2005-2, slide 25
Probabilistic model
Suppose that we have a model for P (D, A1 , . . . , An , S1 , . . . , Sm ).
We can measure the residual uncertainty of the diagnosis problem by

H(D|A1 , . . . , An , S1 , . . . , Sm )

i.e. the uncertainty which can not be reduced by observations.


If the disease is well known, hopefully this quantity will be small.
Note that, if we forbid the use of one of the possible observations, say A1 , then the
residual uncertainty increases

H(D|A2 , . . . , An , S1 , . . . , Sm ) ≥ H(D|A1 , . . . , An , S1 , . . . , Sm )

but this does not mean that all questions are relevant in all cases.
Suppose we are allowed only to ask one single question (observe one of Ai or Sj ),
then we would choose X ∈ {A1 , . . . , An , S1 , . . . , Sm } maximizing I(X ; D).

IT 2005-2, slide 26
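The selection rule in the last paragraph is a one-line computation once the joint model is available. A sketch with a hypothetical joint P(D, S1, S2, S3) over binary variables (the random table stands in for a real medical model):

```python
import numpy as np

def H(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_info(Pxy):
    return H(Pxy.sum(axis=1)) + H(Pxy.sum(axis=0)) - H(Pxy)

rng = np.random.default_rng(2)
P = rng.random((2, 2, 2, 2))            # hypothetical joint P(D, S1, S2, S3)
P /= P.sum()

# For each candidate observation Sq, marginalise the others and score I(Sq; D).
margin = {"S1": (2, 3), "S2": (1, 3), "S3": (1, 2)}
scores = {q: mutual_info(P.sum(axis=axes)) for q, axes in margin.items()}
best = max(scores, key=scores.get)       # the single most informative question
print(best, scores)
```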
Strategy : same as decision tree
[Figure: example decision tree. The root test is "Skin colour" (yellow / other). The yellow branch leads to "Did you visit Asia?" (y / n), followed by "Did you eat raw meat?" and "Do you take heroin?"; the other branch leads to "Eye color". One terminal node is labelled P(D=T) = 0.99.]

IT 2005-2, slide 27
The test nodes of the tree (square boxes on the figure) represent essentially questions or observations that may be
made by the doctor. The terminal nodes represent conclusions that will be drawn: the doctor stops asking questions
and decides that it is either very likely or very unlikely that the patient has hepatitis, or possibly decides that he is
still uncertain and the patient should go to a specialist (who will ask more questions).

The tree structure defines the strategy that the doctor will use to reach a decision : the top-node (root of the tree)
defines the first question, and successors define the substrategies depending on the obtained answer. Note that the
terminal nodes of the tree are a function T of the test variables : using the tree is equivalent to observing this variable.

Tree construction algorithms :

A good decision tree is one that minimizes the average conditional entropy at the leaf nodes and at the same time
minimizes complexity of the tree (different measures). If we know P (D, A1 , . . . , An , S1 , . . . , Sm ) (say we have
a Bayesian network) we can try to find an optimal tree, say one which minimizes

H(D|T ) + βComplexity

Brute force :
- generate all possible trees (there is only a finite number of trees)
- for each tree compute H(D|T ) + βComplexity (can be done using P (D, A1 , . . . , An , S1 , . . . , Sm ))
- keep the best one.

Hill climbing :
- select the variable maximizing I(X ; D) at the root node
- for each value Xi of X use P (D, A1 , . . . , An , S1 , . . . , Sm |Xi ) to build subtree.
- stop when H(D|T ) + βComplexity starts to increase.

IT 2005-2, note 27
