1 Information Theory
Louis Wehenkel
ELEN060-2
Information and coding theory
February 2021
1 / 57
Outline
• Course organization
• Course objectives
2 / 57
Course material
• These slides : the slides tend to be self-explanatory; where necessary I have added
some notes.
The slides will be available from the WEB :
https://people.montefiore.uliege.be/lwh/Info/
• Your personal notes
• Detailed course notes (in French; available from the Centrale des cours).
• For further reading, some reference books in English
• J. Adámek, Foundations of coding, Wiley Interscience, 1991.
• T. M. Cover and J. A. Thomas, Elements of information theory, Wiley, 1991.
• R. Frey, Graphical models for machine learning and information theory, MIT Press,
1999.
• D. Hankerson, G. A. Harris, and P. D. Johnson Jr, Introduction to information
theory and data compression, CRC Press, 1997.
• D. J.C. MacKay, Information theory, inference, and learning algorithms, Cambridge
University Press 2003.
• D. Welsh, Codes and cryptography, Oxford Science Publications, 1998.
3 / 57
Course objectives
4 / 57
Notes
The course aims at introducing information theory and the practical aspects of data compression and
error-control coding. The theoretical concepts are illustrated using practical examples related to the
effective storage and transmission of digital and analog data. Recent developments in the field of
channel coding are also discussed (Turbo-codes).
More broadly, the goal of the course is to introduce the basic techniques for reasoning under
uncertainty as well as the computational and graphical tools which are broadly used in this area. In
particular, Bayesian networks and decision trees will be introduced, as well as elements of machine
learning and data mining.
The theoretical course is complemented by a series of computer laboratories, in which the students can
simulate data sources, data transmission channels, and use various software tools for data compression,
error-correction, probabilistic reasoning and data mining.
The course is addressed to engineering students (last year), who have some background in computer
science, general mathematics and elementary probability theory.
The following two slides aim at clarifying the distinction between deterministic and stochastic systems.
5 / 57
Classical system (deterministic view)
[Figure : block diagram of a system with observable inputs x, observable outputs y, and hidden inputs z
(perturbations); the relation between inputs and outputs is y = f(x, z).]
6 / 57
Notes
Classical system theory views a system essentially as a function (or a mapping, in the mathematical
sense) between inputs and outputs.
If the system is static, inputs and outputs are scalars (or vectors of scalars). If the system is dynamic,
inputs and outputs are temporal signals (continuous or discrete time); a dynamic system is thus viewed
as a mapping between input signals and output signals.
In classical system theory the issue of unobserved inputs and modeling imperfection is handled through
stability, sensitivity and robustness theories. In this context uncertainty is essentially modeled by subsets
of possible perturbations.
7 / 57
Stochastic system (probabilistic view)
[Figure : block diagram of a system with observable inputs distributed according to P(x) and observable
outputs distributed according to P(y); the system is described by the probabilistic relation P(y|x, z).]
8 / 57
Notes
Here we use probability theory as a tool (a kind of calculus) in order to model and quantify uncertainty.
Note that there are other possible choices (e.g. fuzzy set theory, evidence theory...) to model
uncertainty, but probability theory is the most mature and most widely accepted approach. Still, there
are philosophical arguments and controversies around the interpretation of probability in real life : e.g. the
classical (objective) notion of probability vs the Bayesian (subjective) notion of probability.
Theory of stochastic systems is more complex and more general than deterministic system theory.
Nevertheless, the present trend in many fields is to use probability theory and statistics more
systematically in order to build and use stochastic system models of reality. This is due to the fact that
in many real-life systems, uncertainty plays a major role. Within this context, there are two scientific
disciplines which become of growing importance for engineers :
1. Data mining and machine learning : how to build models for stochastic systems from
observations.
2. Information theory and probabilistic inference : how to use stochastic systems in an optimal
manner.
Examples of applications of stochastic system theory :
• Complex systems, where detailed models are intractable (biology, sociology, computer
networks. . . )
• How to take decisions involving an uncertain future (justifications of investments, portfolio
management) ?
• How to take decisions under partial/imperfect knowledge (medical diagnosis) ?
• Forecasting the future... (weather, ecosystems, stock exchange. . . )
• Modeling human behavior (economics, telecommunications, computer networks, road traffic...)
• Efficient storage and transmission of digital (and analog) data
9 / 57
Information and coding theory will be the main focus of the course
10 / 57
Relations between information theory and other disciplines
11 / 57
Shannon paradigm
[Figure : the Shannon paradigm. A SOURCE (computer, man, ...) sends a message through a
CHANNEL (optic fibre, magnetic tape, acoustic medium, ...) to a RECEIVER; the channel is subject to
perturbations (thermal noise, read or write errors, acoustic noise).]
acoustic noise
Message:
• sequence of symbols, analog signal (sound, image, smell. . . )
• messages are chosen at random
• channel perturbations are random
12 / 57
Notes
The foundations of information theory were laid down by Claude E. Shannon, shortly after the end of
the second world war, in a seminal paper entitled A mathematical theory of communication (1948). In
this paper, all the main theoretical ingredients of modern information theory were already present. In
particular, as we will see later, Shannon formulated and provided proofs of the two main coding
theorems.
Shannon theory of communication is based on the so-called Shannon paradigm, illustrated on this
slide : a data source produces a message which is sent to a receiver through an imperfect
communication channel. The possible source messages can generally be modeled by a sequence of
symbols, chosen in some way by the source which appears as unpredictable to the receiver. In other
words, before the message has been sent, the receiver has some uncertainty about what will be the next
message. It is precisely the existence of this uncertainty which makes communication necessary (or
useful) : after the message has been received, the corresponding uncertainty has been removed. We will
see later that information will be measured by this reduction in uncertainty.
Most real life physical channels are imperfect due to the existence of some form of noise. This means
that the message sent out will arrive in a corrupted version to the receiver (some received symbols are
different from those emitted), and again the corruption is unpredictable for the receiver and for the
source of the message.
The two main questions posed by Shannon in his early paper are as follows :
• Suppose the channel is perfect (no corruption) and that we have a probabilistic description
(model) of the source : what is the maximum rate of communication (source symbols per channel
use), provided that we use an appropriate source code ? This problem is presently termed the
source coding problem or the reversible data compression problem. We will see that the answer
to this question is given by the entropy of the source.
• Suppose now that the channel is noisy, what is then the maximum rate of communication without
errors between any source and receiver using this channel ? This is the so-called channel coding
problem or error-correction coding problem; we will see that the answer here is the capacity of the
channel, which is the upper bound of mutual information between input and output messages.
13 / 57
Use of source and channel coding
Normalized channel
1. Source redundancy may be useful to fight against noise, but is not necessarily adapted to
the channel characteristics.
2. Once redundancy has been removed from the source, all sources have the same behavior
(completely unpredictable behavior).
3. Channel coding : fight against channel noise without spoiling resources.
4. Coding includes conversion of alphabets.
14 / 57
Notes
The two main results of information theory are thus the characterization of upper bounds in terms of
data compression on the one hand, and of error-free communication on the other.
A further result of practical importance is that (in most, but not all situations) source and channel
coding problems can be decoupled. In other words, data compression algorithms can be designed
independently from the type of data communication channel that will be used to transmit (or store) the
data. Conversely, channel coding can be carried out irrespective of the type of data sources whose
information will be transmitted over the channel. This result has led to the partition of coding theory into its
two main subparts.
Source coding aims at removing redundancy in the source messages, so as to make them appear shorter
and purely random. On the other hand, channel coding aims at introducing redundancy into the
message, so as to make it possible to decode the message in spite of the uncertainty introduced by the
channel noise.
Because of the decomposition property, these problems are generally solved separately. However, there
are examples of situations where the decomposition breaks down (like some multi-user channels) and
also situations where from the engineering point of view it is much easier to solve the two problems
simultaneously than separately. This latter situation appears when the source redundancy is particularly
well adapted to the channel noise (e.g. spoken natural language redundancy is adapted to acoustic
noise).
Examples of digital sources are : scanned images, computer files of natural language text, computer
programs, binary executable files. . . . From your experience, you already know that compression rates of
such different sources may be quite different.
Examples of channels are : AM or FM modulated radio channel, ethernet cable, magnetic storage (tape
or hard-disk); computer RAM; CD-ROM. . . . Again, the characteristics of these channels may be quite
different and we will see that different coding techniques are also required.
15 / 57
Quantitative notion of information content
Aspects :
• unpredictable character of a message
• interpretation : truth, value, beauty. . .
Interpretation depends on the context, on the observer : too complex and probably not
appropriate as a measure...
Nota bene.
Probability theory (not statistics) provides the main mathematical tool of information
theory.
Notations : (Ω, E, P (·)) = probability space
16 / 57
Notes
17 / 57
Probability theory : definitions and notations
ω, ω′, ωi, . . . : elements of Ω (outcomes)
18 / 57
Probability space : requirements
• A ∈ E ⇒ ¬A ∈ E
• P (Ω) = 1
• If A1, A2, . . . ∈ E and Ai ∩ Aj = ∅ for i ≠ j : P(∪i Ai) = Σi P(Ai)
⇒ one says that P (·) is a probability measure
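To make these axioms concrete, here is a minimal Python sketch (an illustration of mine, not part of the course material) that checks them on a finite probability space built from a fair six-sided die, with E taken to be the set of all subsets of Ω.

```python
# Illustration with a fair six-sided die; E is taken to be all subsets of omega.
omega = {1, 2, 3, 4, 5, 6}
p_elem = {w: 1 / 6 for w in omega}          # elementary probabilities

def P(event):
    """Probability measure induced by the elementary probabilities."""
    return sum(p_elem[w] for w in event)

A = {1, 2}
not_A = omega - A                            # the complement of an event is also an event
assert abs(P(A) + P(not_A) - 1.0) < 1e-12    # consequence of the two axioms below

assert abs(P(omega) - 1.0) < 1e-12           # P(omega) = 1

A1, A2 = {1, 2}, {5}                         # pairwise disjoint events
assert A1.isdisjoint(A2)
assert abs(P(A1 | A2) - (P(A1) + P(A2))) < 1e-12   # additivity
```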
19 / 57
Finite sample spaces
20 / 57
Random variables
21 / 57
The random variable X (·) induces a probability measure on X = {X1 , . . . , Xn }
PX(X = Xi) ≜ PΩ(Xi), which we will simply denote by P(Xi).
We will denote by P (X ) the measure PX (·).
22 / 57
Some more notations . . .
X and Y two discrete r.v. on (Ω, E, P (·)).
Notation : X = {X1, . . . , Xn} and Y = {Y1, . . . , Ym}. (n and m finite).
Xi (resp. Yj ) value of X (resp. Y) ≡ subsets of Ω.
⇒ We identify a r.v. with the partition it induces on Ω.
Contingency table
       Y1    ···   Yj    ···   Ym
X1     p1,1  ···   p1,j  ···   p1,m   p1,·
 ⋮
Xi     pi,1  ···   pi,j  ···   pi,m   pi,·
 ⋮
Xn     pn,1  ···   pn,j  ···   pn,m   pn,·
       p·,1  ···   p·,j  ···   p·,m

where
pi,· ≡ P(Xi) ≡ P(X = Xi)
p·,j ≡ P(Yj) ≡ P(Y = Yj)
pi,j ≡ P(Xi ∩ Yj) ≡ P(Xi, Yj) ≡ P([X = Xi] ∧ [Y = Yj])
23 / 57
Complete system of events
24 / 57
Calculus of random variables
On a given Ω we may define an arbitrary number of r.v. In reality, random variables are
the only practical way to observe outcomes of a random experiment.
Thus, a random experiment is often defined by the properties of a collection of random
variables.
Composition of r.v. : X (·) is a r.v. defined on Ω and Y(·) is a random variable defined
on X , then Y(X (·)) is also a r.v. defined on Ω.
Concatenation Z of X = {X1 , . . . , Xn } and Y = {Y1 , . . . , Ym } defined on Ω :
Z = X , Y defined on Ω by Z(ω) = (X (ω), Y(ω)) ⇒ P (Z) = P (X , Y).
Independence of X = {X1 , . . . , Xn } and Y = {Y1 , . . . , Ym }
- Iff ∀i ≤ n, j ≤ m : Xi ⊥ Yj .
- Equivalent to factorisation of probability measure : P (X , Y) = P (X )P (Y)
- Otherwise P (X , Y) = P (X )P (Y|X ) = P (Y)P (X |Y)
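As a small numerical illustration of the factorisation criterion (my own illustrative numbers, not from the slides), one can check independence directly on a joint table:

```python
# Joint distribution P(X, Y) as a dict; the values are purely illustrative.
joint = {
    ('X1', 'Y1'): 0.12, ('X1', 'Y2'): 0.28,
    ('X2', 'Y1'): 0.18, ('X2', 'Y2'): 0.42,
}

# Marginals P(X) and P(Y), obtained by summing over the other variable.
PX, PY = {}, {}
for (x, y), p in joint.items():
    PX[x] = PX.get(x, 0.0) + p
    PY[y] = PY.get(y, 0.0) + p

# X and Y are independent iff every joint entry factorises.
independent = all(abs(p - PX[x] * PY[y]) < 1e-12 for (x, y), p in joint.items())
print(independent)   # True for these particular numbers
```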
25 / 57
Example 1 : coin flipping
[Figure : a first example of a Bayesian network, relating the two coin flips H1 and H2 to the variable S.]
Suppose the coins are both fair, and compute P (S).
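The slide leaves the precise definition of S to the figure; as a purely illustrative sketch, the code below assumes S indicates whether the two fair coins show the same face, and computes P(S) by enumerating the four equally likely outcomes.

```python
from itertools import product
from fractions import Fraction

# Assumption (for illustration only): S = 1 if the two coins agree, else 0.
p_outcome = Fraction(1, 4)                   # both coins fair and independent
P_S = {}
for h1, h2 in product(['H', 'T'], repeat=2):
    s = 1 if h1 == h2 else 0
    P_S[s] = P_S.get(s, Fraction(0)) + p_outcome

print(P_S)   # {1: Fraction(1, 2), 0: Fraction(1, 2)}
```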
26 / 57
Notes
27 / 57
Example 2 : binary source with independent symbols ωi ∈ {0, 1}
28 / 57
Notes
Comments.
The theory which will be developed is not really dependent on the base used to compute logarithms.
Base 2 will be the default, and fits well with binary codes as we will see.
You should convince yourself that the definition of self-information of an event fits with the intuitive
requirements of an information measure.
It is possible to show from some explicit hypotheses that there is no other possible choice for the
definition of a measure of information (remember the way the notion of thermodynamic entropy is
justified).
Nevertheless, some alternative measures of information have been proposed based on relaxing some of
the requirements and imposing some others.
To be really convinced that this measure is the right one, it is necessary to wait for the subsequent
lectures, so as to see what kind of implications this definition has.
Example : questionnaire, weighing strategies.
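As a quick numerical illustration of this intuition (using the usual definition h(x) = −log2 P(x), recalled here as an assumption rather than quoted from these notes), rarer events carry more self-information:

```python
import math

def self_information(p):
    """Self-information h = -log2(p), in Shannon, of an event of probability p > 0."""
    return -math.log2(p)

for p in (1.0, 0.5, 0.25, 0.01):
    print(p, self_information(p))
# p = 1    -> 0 Sh   (a certain event brings no information)
# p = 0.5  -> 1 Sh
# p = 0.25 -> 2 Sh
# p = 0.01 -> about 6.64 Sh (rare events are highly informative)
```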
29 / 57
Conditional information
30 / 57
Illustration : transmission of information
31 / 57
Mutual information of two events
32 / 57
Exercise
Source : at successive times t ∈ {t0 , t1 . . .} sends symbols s(t) chosen from a finite
alphabet S = {s1 , . . . , sn }.
Assumption : successive symbols are independent and chosen according to the same
probability measure (i.e. independent of t) ⇒ Memoryless and time-invariant
Notation : pi = P (s(t) = si )
Definition : Source entropy : H(S) ≜ E{h(s)} = − Σ_{i=1}^{n} pi log pi
Entropy : measures average information provided by the symbols sent by the source :
Shannon/symbol
If F denotes the frequency of operation of the source, then F · H(S) measures average
information per time unit : Shannon/second.
Note that, because of the law of large numbers the per-symbol information provided by
any long message produced by the source converges (almost surely) towards H(S).
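A minimal sketch of these two quantities in Python (the distribution and the frequency F below are illustrative choices, not values from the course):

```python
import math

def entropy(p):
    """H(S) = -sum_i pi log2 pi, in Shannon/symbol, for a memoryless source."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # illustrative source distribution
F = 1000                        # hypothetical frequency, in symbols per second

print(entropy(p))               # 1.75 Shannon/symbol
print(F * entropy(p))           # 1750.0 Shannon/second
```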
34 / 57
Notes
Examples.
Compute the entropy (per symbol) of the following sources :
• A source which always emits the same symbol.
• A source which emits zeroes and ones according to two fair coin flipping processes.
How can you simulate a fair coin flipping process with a coin which is not necessarily fair ?
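One classical answer to the last question (not spelled out in the notes) is von Neumann's trick: flip the biased coin twice and keep the result only when the two flips differ. A sketch:

```python
import random

def biased_flip(p_heads):
    """One flip of a coin that is not necessarily fair."""
    return 'H' if random.random() < p_heads else 'T'

def fair_flip(p_heads):
    """Von Neumann's trick: flip twice, keep the first outcome when the two
    flips differ, discard 'HH' and 'TT'. Since P(HT) = P(TH) = p(1-p),
    the output is fair for any bias 0 < p_heads < 1."""
    while True:
        a, b = biased_flip(p_heads), biased_flip(p_heads)
        if a != b:
            return a

flips = [fair_flip(0.7) for _ in range(100000)]
print(flips.count('H') / len(flips))   # close to 0.5
```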
35 / 57
Generalization : entropy of a random variable
36 / 57
Another remarkable property : concavity (consequences will appear later).
Means that ∀ p1 ≠ p2 ∈ [0, 1], ∀ q ∈ ]0, 1[ we have
37 / 57
More definitions
Suppose that X = {X1 , . . . , Xn } and Y = {Y1 , . . . , Ym } are two (discrete) r.v. defined
on a sample space Ω.
Joint entropy of X and Y defined by
H(X, Y) ≜ − Σ_{i=1}^{n} Σ_{j=1}^{m} P(Xi ∩ Yj) log P(Xi ∩ Yj).    (1)
38 / 57
Note that the joint entropy is nothing novel : it is just the entropy of the random variable Z = (X , Y).
For the time being consider conditional entropy and mutual information as purely mathematical
definitions. The fact that these definitions really make sense will become clear from the study of the
properties of these measures.
Exercise (computational) : Consider our double coin flipping experiment. Suppose the coins are both
fair.
Compute H(H1 ), H(H2 ), H(S). Compute H(H1 , H2 ), H(H2 , H1 ), H(H1 , S), H(H2 , S) and
H(H1 , H2 , S)
Compute H(H1 |H2 ), H(H2 |H1 ) and then H(S|H1 , H2 ) and H(H1 , H2 |S)
Compute I(H1 ; H2 ), I(H2 ; H1 ) and then I(S; H1 ) and I(S; H2 ) and I(S; H1 , H2 ).
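To check these computations numerically, a generic helper like the sketch below can be used. It assumes, purely for illustration, that the coins are fair and independent and that S indicates whether they agree; the actual definition of S is the one given on the earlier slide.

```python
import math
from itertools import product

# Joint distribution of (H1, H2, S); fair independent coins, and S = 1 iff
# the two coins agree (an assumption made for illustration only).
joint = {}
for h1, h2 in product([0, 1], repeat=2):
    joint[(h1, h2, 1 if h1 == h2 else 0)] = 0.25

def marginal(dist, idx):
    """Marginal distribution of the variables at the positions listed in idx."""
    out = {}
    for key, p in dist.items():
        sub = tuple(key[i] for i in idx)
        out[sub] = out.get(sub, 0.0) + p
    return out

def H(dist):
    """Entropy in Shannon of a distribution given as a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

H_H1, H_S = H(marginal(joint, [0])), H(marginal(joint, [2]))   # 1.0 and 1.0
H_H1H2 = H(marginal(joint, [0, 1]))                            # 2.0
H_all = H(joint)                                               # 2.0 (S is a function of H1, H2)
H_S_given_H1H2 = H_all - H_H1H2                                # 0.0 (chain rule)
I_S_H1 = H_S + H_H1 - H(marginal(joint, [0, 2]))               # 0.0 under this assumption
print(H_H1, H_S, H_H1H2, H_all, H_S_given_H1H2, I_S_H1)
```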
Experiment to work out for tomorrow.
You are given 12 balls, all of which are equal in weight except for one which is either lighter or heavier.
You are also given a two-pan balance to use. In each use of the balance you may put any number of the
12 balls on the left pan, and the same number (of the remaining) balls on the right pan. The result
may be one of three outcomes : equal weights on both pans; left pan heavier; right pan heavier. Your
task is to design a strategy to determine which is the odd ball and whether it is lighter or heavier in as
few (expected) uses of the balance as possible.
While thinking about this problem, you should consider the following questions :
• How can you measure information ? What is the most information you can get from a single
weighing ?
• How much information have you gained (on average) when you have identified the odd ball and
whether it is lighter or heavier ?
• What is the smallest number of weighings that might conceivably be sufficient to always identify
the odd ball and whether it is heavy or light ?
• As you design a strategy you can draw a tree showing for each of the three outcomes of a weighing
what weighing to do next. What is the probability of each of the possible outcomes of the first
weighing ?
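Without giving away the strategy, the counting suggested by the first bullets can be checked directly (a sketch under the assumption that all 24 hypotheses are equally likely):

```python
import math

hypotheses = 12 * 2                        # which ball, and whether lighter or heavier
total_uncertainty = math.log2(hypotheses)  # about 4.585 Sh to be gained overall
per_weighing = math.log2(3)                # at most about 1.585 Sh per weighing (3 outcomes)
lower_bound = math.ceil(total_uncertainty / per_weighing)
print(total_uncertainty, per_weighing, lower_bound)   # ..., ..., 3
```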
39 / 57
Properties of the function Hn (· · · )
Notation : Hn(p1, p2, . . . , pn) ≜ − Σ_{i=1}^{n} pi log pi
(the pi form a discrete probability distribution, i.e. pi ∈ [0, 1] and Σ_{i=1}^{n} pi = 1)
Maximal : ⇔ pi = 1/n, ∀i. (proof follows later)
40 / 57
Notes
The very important properties of the information and entropy measures which have been introduced
before are all consequences of the mathematical properties of the entropy function Hn defined and
stated on the slide.
The following slides provide proofs respectively of the maximality and concavity properties which are
not trivial.
In addition, let us recall the fact that the function is invariant with respect to any permutation of its
arguments.
Questions :
What is the entropy of a uniform distribution ?
Relate entropy function intuitively to uncertainty and thermodynamic entropy.
Nota bene :
Entropy function : function defined on a (convex) subset of Rn .
Entropy measure : function defined on the set of random variables defined on a given sample space.
⇒ strictly speaking these are two different notions, and that is the reason to use two different notations
(Hn vs H).
41 / 57
Maximum of entropy function (proof)
Gibbs inequality : (Lemma, useful also for the sequel ⇒ keep it in mind)
Formulation : let (p1, p2, . . . , pn) and (q1, q2, . . . , qn) be two probability distributions. Then,

Σ_{i=1}^{n} pi log (qi / pi) ≤ 0,    (5)
Homework : convince yourself that this remains true even when some of the pi = 0.
42 / 57
43 / 57
Theorem : Hn(p1, p2, . . . , pn) ≤ log n, with equality iff pi = 1/n, ∀i.
Proof
Let us apply Gibbs inequality with qi = 1/n.
We find

Σ_{i=1}^{n} pi log (1 / (n pi)) = Σ_{i=1}^{n} pi log (1 / pi) − Σ_{i=1}^{n} pi log n ≤ 0 ⇒

Hn(p1, p2, . . . , pn) = Σ_{i=1}^{n} pi log (1 / pi) ≤ Σ_{i=1}^{n} pi log n = log n,
44 / 57
Concavity of entropy function (proof)
Let (p1, p2, . . . , pn) and (q1, q2, . . . , qn) be two probability distributions and λ ∈ [0, 1],
then Hn(λp1 + (1 − λ)q1, . . . , λpn + (1 − λ)qn) ≥ λ Hn(p1, . . . , pn) + (1 − λ) Hn(q1, . . . , qn).
45 / 57
Graphically : mixing increases entropy (cf. thermodynamics)

Proof : f(x) = −x log x is concave on [0, 1] : f(Σ_{j=1}^{m} λj xj) ≥ Σ_{j=1}^{m} λj f(xj).

Thus we have : Hn(Σ_{j=1}^{m} λj p1j, . . . , Σ_{j=1}^{m} λj pnj) = Σ_{i=1}^{n} f(Σ_{j=1}^{m} λj pij)
≥ Σ_{i=1}^{n} [Σ_{j=1}^{m} λj f(pij)] = Σ_{j=1}^{m} λj Σ_{i=1}^{n} f(pij)
= Σ_{j=1}^{m} λj Hn(p1j, . . . , pnj)
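A quick numerical check of the statement, with illustrative distributions: the entropy of a mixture is at least the corresponding mixture of entropies.

```python
import math

def Hn(p):
    """Entropy function Hn(p1, ..., pn) in Shannon."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.9, 0.1]       # illustrative distributions
q = [0.2, 0.8]
lam = 0.5
mix = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]

print(Hn(mix))                            # entropy of the mixture (larger)
print(lam * Hn(p) + (1 - lam) * Hn(q))    # mixture of the entropies (smaller)
```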
46 / 57
Another interpretation (thermodynamic)
Suppose that you pick a molecule in one of the two containers and have to guess which kind of molecule you
will obtain : in both cases you can take into account in which container you pick the molecule.
⇒ there is less uncertainty in the left container than in the right. Compute entropies relevant to this
problem.
47 / 57
(Let us open a parenthesis : notions of convexity/concavity
Convex set : a set which contains all the line segments joining any two of its points.
C ⊂ R^p is convex (by def.) if
x, y ∈ C, λ ∈ [0, 1] ⇒ λx + (1 − λ)y ∈ C.
Examples:
In R : intervals, semi-intervals, R.
48 / 57
Examples : sets of probability distributions : n = 2 and n = 3
More generally :
Linear subspaces, semi-planes are convex
Any intersection of convex sets is also convex (ex. polyhedra)
Ellipsoids : {x | (x − xc)^T A^{-1} (x − xc) ≤ 1} (A positive definite)
49 / 57
Convex functions
x, y ∈ C, λ ∈ [0, 1] ⇒ f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
50 / 57
Strictly convex function
If equality holds only for the trivial cases
λ ∈ {0, 1} and/or x = y
51 / 57
Notion of convex linear combination,
(Σ_{j=1}^{m} λj xj), with λj ≥ 0 and Σ_{j=1}^{m} λj = 1.
52 / 57
Jensen’s inequality :
If f (·) is a convex function defined on R → R and X a real random variable
f (E{X }) ≤ E{f (X )} where, if the function is strictly convex, equality holds iff X is
constant almost surely.
Extension to vectors ...
Concave functions : ≤ −→ ≥...
Particular case : convex linear combinations
The λj ’s act as a discrete probability measure on Ω = {1, . . . , m}.
And xj denotes the value of X at point ω = j.
Hence : f(Σ_{j=1}^{m} λj xj) ≤ Σ_{j=1}^{m} λj f(xj)
Let us close the parenthesis...)
53 / 57
Let us return to the entropies.
One deduces
54 / 57
The fact that f(x) = −x log x is strictly concave is clear from the picture. Clearly log x is also concave.
All inequalities related to the entropy function may be easily deduced from the concavity of these two
functions and Jensen's inequality. For example, let us introduce here a new quantity called relative
entropy or Kullback-Leibler distance. Let P and Q be two discrete probability distributions defined on a
discrete Ω = {ω1, . . . , ωn}. The Kullback-Leibler distance (or relative entropy) of P w.r.t. Q is defined
by

D(P||Q) = Σ_{ω∈Ω} P(ω) log ( P(ω) / Q(ω) )    (8)

−D(P||Q) = − Σ_{ω∈Ω} P(ω) log ( P(ω) / Q(ω) ) = Σ_{ω∈Ω} P(ω) log ( Q(ω) / P(ω) )    (9)

≤ log Σ_{ω∈Ω} P(ω) ( Q(ω) / P(ω) ) = log Σ_{ω∈Ω} Q(ω) = log 1 = 0    (10)

where the inequality follows from Jensen's inequality applied to the concave function log x. Because the
function is strictly concave, equality holds only if Q(ω)/P(ω) is constant over Ω (and hence equal to 1),
which justifies the name of distance for the relative entropy (2).

(2) This is nothing else than Gibbs inequality, which we have already proven without using Jensen's inequality.
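A small sketch computing the relative entropy of equation (8) in base 2 and checking the properties derived above (the distributions are illustrative):

```python
import math

def kl_divergence(P, Q):
    """Relative entropy D(P||Q) = sum_w P(w) log2(P(w)/Q(w)), in Shannon.
    Terms with P(w) = 0 contribute 0; Q(w) must be > 0 wherever P(w) > 0."""
    return sum(p * math.log2(p / Q[w]) for w, p in P.items() if p > 0)

P = {'a': 0.5, 'b': 0.3, 'c': 0.2}
Q = {'a': 0.4, 'b': 0.4, 'c': 0.2}

print(kl_divergence(P, Q))   # > 0
print(kl_divergence(P, P))   # 0.0, only when the two distributions coincide
print(kl_divergence(Q, P))   # > 0 as well, but different: D is not symmetric
```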
55 / 57
Further reading
56 / 57
Frequently asked questions
57 / 57