0% found this document useful (0 votes)

22 views54 pages

Course 1

Uploaded by

Mayouf

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

22 views54 pages

Course 1

Uploaded by

Mayouf

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 54

Introduction to information theory and coding

Louis WEHENKEL

University of Liège - Department of Electrical and Computer Engineering

Institut Montefiore

• Course organization
• Course objectives
• Introduction to probabilistic reasoning
• Algebra of information measures
• Some exercises

IT 2012, slide 1
Course material :

• These slides : the slides tend to be self-explanatory; where necessary I have added some notes.
The slides will be available from the WEB :
“http://www.montefiore.ulg.ac.be/˜lwh/Cours/Info/”

• Your personal notes

• Detailed course notes (in french; centrale des cours).

• For further reading, some reference books in english

– J. Adámek, Foundations of coding, Wiley Interscience, 1991.

– T. M. Cover and J. A. Thomas, Elements of information theory, Wiley, 1991.
– R. Frey, Graphical models for machine learning and information theory, MIT Press, 1999.
– D. Hankerson, G. A. Harris, and P. D. Johnson Jr, Introduction to information theory and data compres-
sion, CRC Press, 1997.
– D. J.C. MacKay, Information theory, inference, and learning algorithms, Cambridge University Press
2003.
– D. Welsh, Codes and cryptography, Oxford Science Publications, 1998.

IT 2012, note 1
Course objectives

1. Introduce information theory

• Probabilistic (stochastic) systems
• Reasoning under uncertainty
• Quantifying information
• State and discuss coding theorems
2. Give an overview of coding theory and practice
• Data compression
• Error-control coding
• Automatic learning and data mining
3. Illustrate ideas with a large range of practical applications

NB. It is not sure that everything in the slides will be covered during the oral course.
You should read the slides and notes (especially those which I will skip) after each
course and before the next course.

IT 2012, slide 2
The course aims at introducing information theory and the practical aspects of data compression and error-control
coding. The theoretical concepts are illustrated using practical examples related to the effective storage and transmis-
sion of digital and analog data. Recent developments in the field of channel coding are also discussed (Turbo-codes).

More broadly, the goal of the course is to introduce the basic techniques for reasoning under uncertainty as well
as the computational and graphical tools which are broadly used in this area. In particular, Bayesian networks and
decision trees will be introduced, as well as elements of automatic learning and data mining.

The theoretical course is complemented by a series of computer laboratories, in which the students can simulate
data sources, data transmission channels, and use various software tools for data compression, error-correction,
probabilistic reasoning and data mining.

The course is addressed to engineering students (last year), which have some background in computer science,
general mathematics and elementary probability theory.

The following two slides aim at clarifying the disctinction between deterministic and stochastic systems.

IT 2012, note 2
Classical system (deterministic view)

Hidden inputs z
(perturbations)

Obervable System : Observable

inputs x Outputs y
Relation between inputs and
outputs : y = f(x,z)

IT 2012, slide 3
Classical system theory views a system essentially as a function (or a mapping, in the mathematical sense) between
inputs and outputs.

If the system is static, inputs and outputs are scalars (or vectors of scalars). If the system is dynamic, inputs and
outputs are temporal signals (continuous or discrete time); a dynamic system is thus viewed as a mapping between
input signals and output signals.

In classical system theory the issue of unobserved inputs and modeling imperfection is handled through stability,
sensitivity and robustness theories. In this context uncertainty is essentially modeled by subsets of possible pertur-
bations.

IT 2012, note 3
Stochastic system (probabilistic view)

Hidden inputs
(perturbations)

P(z)

Observable System : Observable

inputs Outputs
Probabilistic relation
P(x) P(y)
Outputs : P(y|x,z)

IT 2012, slide 4
Here we use probability theory as a tool (a kind of calculus) in order to model and quantify uncertainty. Note that
there are other possible choices (e.g. fuzzy set theory, evidence theory...) to model uncertainty, but probability theory
is the most mature and most widely accepted approach. Still, there are philosphical arguments and controverses
around the interpretation of probability in real-life : e.g. classical (objective) notion of probability vs bayesian
(subjective) notion of probability.

Theory of stochastic systems is more complex and more general than deterministic system theory. Nevertheless, the
present trend in many fileds is to use probability theory and statistics more systematically in order to build and use
stochastic system models of reality. This is due to the fact that in many real-life systems, uncertainty plays a major
role. Within this context, there are two scientific disciplines which become of growing importance for engineers :
1. Data mining and machine learning : how to build models for stochastic systems from observations.
2. Information theory and probabilistic inference : how to use stochastic systems in an optimal manner.

Examples of applications of stochastic system theory :

• Complex systems, where detailed models are intractable (biology, sociology, computer networks. . . )
• How to take decisions involving an uncertain future (justifications of investments, portfolio management) ?
• How to take decisions under partial/imperfect knowledge (medical diagnosis) ?
• Forecasting the future... (wheather, ecosystems, stock exchange. . . )
• Modeling human behavior (economics, telecommunications, computer networks, road traffic...)
• Efficient storage and transmission of digital (and analog) data

IT 2012, note 4
Information and coding theory will be the main focus of the course

1. What is it all about ?

• 2 complementary aspects
⇒ Information theory : general theoretical basis
⇒ Coding : compress, fight against noise, encrypt data

• Information theory
⇒ Notions of data source and data transmission channel
⇒ Quantitative measures of information content (or uncertainty)
⇒ Properties : 2 theorems (Shannon theorems) about feasibility limits
⇒ Discrete vs continuous signals
• Applications to coding
⇒ How to reach feasibility limits
⇒ Practical implementation aspects (gzip, Turbo-codes...)

IT 2012, slide 5
Relations between information theory and other disciplines

Théorie
des
Télé Probabilités
communications
Théorèmes
limites
Limites
Ev. rares
ultimes
Statistique
Tests

Théorie Inf. Fisher

Thermo de
Physique
dynamique
l’information
Th. des
portefeuilles

Complexité Jeux
Economie
de
Kolmogorov
Inégalités
Informatique
théorique

Mathématiques

IT 2012, slide 6
Shannon paradigm
optic fibre
computer computer
magnetique tape
man acoustic medium man

message
SOURCE CHANNEL RECEIVER

perturbations

thermal noise
read or write errors
acoustic noise
Message :
- sequence of symbols, analog signal (sound, image, smell. . . )
- messages are chosen at random
- channel perturbations are random

IT 2012, slide 7
The foundations of information theory where laid down by Claude E. Shannon, shortly after the end of the sec-
ond worldwar in a seminal paper entitled A mathematical theory of communication (1948). In this paper, all the
main theoretical ingedients of modern information theory where already present. In particular, as we will see later,
Shannon formulated and provided proofs of the two main coding theorems.

Shannon theory of communication is based on the so-called Shannon paradigm, illustrated on this slide : a data
source produces a message which is sent to a receiver through an imperfect communication channel. The possible
source messages can generally be modeled by a sequence of symbols, chosen in some way by the source which
appears as unpredictable to the receiver. In other words, before the message has been sent, the receiver has some
uncertainty about what will be the next message. It is precisely the existence of this uncertainty which makes
communication necessary (or useful) : after the message has been received, the corresponding uncertainty has been
removed. We will see later that information will be measured by this reduction in uncertainty.

Most real life physical channels are imperfect due to the existence of some form of noise. This means that the
message sent out will arrive in a corrupted version to the receiver (some received symbols are different from those
emitted), and again the corruption is unpredictable for the receiver and for the source of the message.

The two main questions posed by Shannon in his early paper are as follows :
• Suppose the channel is perfect (no corruption), and suppose we have a probabilistic description (model) of the
source, what is the maximum rate of communication (source symbols per channel usage), provided that we use
an appropriate source code. This problem is presently termed as the source coding problem or as the reversible
data compression problem. We will see that the answer to this question is given by the entropy of the source.

• Suppose now that the channel is noisy, what is then the maximum rate of communication without errors between
any source and receiver using this channel ? This is the so-called channel coding problem or error-correction
coding problem; we will see that the answer here is the capacity of the channel, which is the upper bound of
mutual information between input and output messages.

IT 2012, note 7
Use of source and channel coding
Normalized source Normalized receiver

Source CS CC Channel DC DS Reveiver

Normalized channel

Source coding : remove redundancy (make message as short as possible)

Channel coding : make data tranmission reliable (fight against noise)
Nota bene :
1. Source redundancy may be useful to fight againts noise, but is not necessarily adapted to
the channel characteristics.
2. Once redundancy has been removed from the source, all sources have the same behavior
(completely unpredictable behavior).
3. Channel coding : fight against channel noise without spoiling resources.
4. Coding includes conversion of alphabets.
IT 2012, slide 8
The two main results of information theory are thus the characterization of upper bounds in terms of data compression
on the one hand, and error-less communication on the other.

A further result of practical importance is that (in most, but not all situations) source and channel coding problems
can be decoupled. In other words, data compression algorithms can be designed independently from the type of data
communication channel that will be used to transmit (or store) the data. Conversely, channel coding can be carried
out irrespectively of the type of data sources that will be used to transmit information over them. This result has led
to the partition of coding theory into its two main subparts.

Source coding aims at removing redundancy in the source messages, so as to make them appear shorter and purely
random. On the other hand, channel coding aims at introducing redundancy into the message, so as to make it
possible to decode the message in spite of the uncertainty introduced by the channel noise.

Because of the decomposition property, these problems are generally solved separately. However, there are examples
of situations where the decomposition breaks down (like some multi-user channels) and also situations where from
the engineering point of view it is much easier to solve the two problems simultaneously than seperately. This latter
situation appears when the source redundancy is particularly well adapted to the channel noise (e.g. spoken natural
language redundancy is adapted to acoustic noise).

Examples of digital sources are : scanned images, computer files of natural language text, computer programs, binary
executable files. . . . From your experience, you already know that compression rates of such different sources may
be quite different.

Examples of channels are : AM or FM modulated radio channel, ethernet cable, magnetic storage (tape or hard-disk);
computer RAM; CD-ROM. . . . Again, the characteristics of these channels may be quite different and we will see
that different coding techniques are also required.

IT 2012, note 8
Quantitative notion of information content
Information provided by a message : vague but widely used term
Aspects :
- unpredictable character of a message
- interpretation : truth, value, beauty. . .
Interpretation depends on the context, on the observer : too complex and probably
not appropriate as a measure...
The unpredictable character may be measured ⇒ quantitative notion of information
⇒ A message carries more information if it is more unpredictable
⇒ Information quantity : decreasing function of probability of occurrence
Nota bene.
Probability theory (not statistics) provides the main mathematical tool of information
theory.
Notations : (Ω, E, P (·)) = probability space

IT 2012, slide 9
Information and coding.

Think of a simple coin flipping experiment (the coin is fair). How much information is gained when you learn (i) the
state of a flipped coin ; (ii) the states of two flipped coins; (iii) the outcome when a four-sided die is rolled ? How
much memory do you need to store these informations on a binary computer ?

Consider now the double coin flipping experiment, where the two coins are thrown together and are indistinguishable
once they have been thrown. Both coins are fair. What are the possible issues of this experiment ? What are the
probabilities of these issues ? How much information is gained when you observe any one of these issues ? How
much is gained in average per experiment (supposing that you repeat it indefinitely) ? Supposing that you have to
communicate the result to a friend through a binary channel, how could you code the outcome so that in the average,
you will minimize channel use ?

IT 2012, note 9
Probability theory : definitions et notations
Probability space : triplet (Ω, E, P (·))
Ω : universe of all possible outcomes of random experiment (sample space)
ω, ω ′ , ωi . . . : elements of Ω (outcomes)
E : denotes a set of subsets of Ω (called the event space)
A, B, C . . . : elements of E, i.e. events.
The event space models all possible observations.
An event corresponds to a logical statement about ω
Elementary events : singletons {ωi }
P (·) : probability measure (distribution)
P (·) : for each A ∈ E provides a number ∈ [0, 1].

IT 2012, slide 10
Probability space : requirements
Event space E : must satisfy the following properties
- Ω∈E
- A ∈ E ⇒ ¬A ∈ E
- ∀A1 , A2 , . . . ∈ E (finite or countable number) : ∈E
S
i Ai

⇒ one says that E is a σ-algebra

Measure P (·) : must satisfy the following properties (Kolmogorov axioms)
- P (A) ∈ [0, 1], ∀A ∈ E
- P (Ω) = 1
- If A1 , A2 , . . . ∈ E and Ai ∩ Aj = ∅ for i 6= j : P ( i Ai ) = i P (Ai )
S P

⇒ one says that P (·) is a probability measure

IT 2012, slide 11
Finite sample spaces
We will restrict ourselves to finite sample spaces : Ω is finite
We will use the maximal σ-algebra : E = 2Ω which contains all subsets of Ω.
Conditional probability
△ P (A∩B)
Definition : conditional probability measure : P (A|B) = P (B)

Careful : P (A|B) defined only if P (B) > 0.

IT 2012, slide 12
Random variables
From a physical viewpoint, a random variable is an elementary way of perceiving
(observing) outcomes.
From a mathematical viewpoint, a random variable is a function defined on Ω (the
values of this function may be observed).
Because we restrict ourselves to finite sample spaces, all the random variables are
necessarily discrete and finite : they have a finite number of possible values.
Let us denote by X (·) a function defined on Ω and by X = {X1 , . . . , Xn } its range
(set of possible values).
a value (say Xi ) of X (·) and

We will not distinguish between
the subset {ω ∈ Ω|X (ω) = Xi }

Thus a random variable can also be viewed as a partition {X1 , . . . , Xn } of Ω.

Theoretical requirement : Xi ∈ E (always true if Ω is finite and E maximal).

IT 2012, slide 13
The random variable X (·) induces a probability measure on X = {X1 , . . . , Xn }
△
PX (X = Xi ) = PΩ (Xi ), which we will simply denote by P (Xi ).
We will denote by P (X ) the measure PX (·).
X (·)

X
X2
Ω PX

PΩ X2

X3
X1
X1
X3

A random variable provides condensed information about the experiment.

IT 2012, slide 14
Some more notations . . .
X and Y two discrete r.v. on (Ω, E, P (·)).
Notation : X = {X1 , . . . , Xn } et Y = {Y1 , . . . , Ym }. (n and m finite).
Xi (resp. Yj ) value of X (resp. Y) ≡ subsets of Ω.
⇒ We identify a r.v. with the partition it induces on Ω.
Contingency table

Y1 ··· Yj · · · Ym
.. pi,· ≡ P (Xi ) ←− ∗
X1 . ≡ P (X = Xi )
.. ..
. .
p·,j ≡ P (Yj ) ←− ∗
Xi · · · · · · pi,j · · · · · · pi,·
.. .. ≡ P (Y = Yj )
. .
.. pi,j ≡ P (Xi ∩ Yj ) ≡ P (Xi , Yj ) ←− ∗
Xn . ≡ P ([X = Xi ] ∧ [Y = Yi ])
p·,j

IT 2012, slide 15
Complete system of events
Reminder : event ≡ subset of Ω
(E ≡ set of all events ⇒ σ-algebra)
Definition : A1 , . . . An form a complete system of events if
- S ∀i 6= j : Ai ∩ Aj = ∅ (they are incompatible two by two) and if
n
- i Ai = Ω (they cover Ω).

Conclusion : a discrete r.v. ≡ complete system of events.

Remark : we suppose that Ai 6= ∅
But, this does not imply that P (Ai ) > 0 !!!
Some authors S give a slightly different definition, where the second condition is re-
placed by : P ( ni Ai ) = 1.
Sn
If then i Ai 6= Ω, one may complete such a system by adjoining one more event
An+1 (non empty but of zero probability)

IT 2012, slide 16
Calculus of random variables
On a given Ω we may define an arbitrary number of r.v. In reality, random variables
are the only practical way to observe outcomes of a random experiment.
Thus, a random experiment is often defined by the properties of a collection of ran-
dom variables.
Composition of r.v. : X (·) is a r.v. defined on Ω and Y(·) is a random variable
defined on X , then Y(X (·)) is also a r.v. defined on Ω.
Concatenation Z of X = {X1 , . . . , Xn } and Y = {Y1 , . . . , Ym } defined on Ω :
Z = X , Y defined on Ω by Z(ω) = (X (ω), Y(ω)) ⇒ P (Z) = P (X , Y).
Independence of X = {X1 , . . . , Xn } and Y = {Y1 , . . . , Ym }
- Iff ∀i ≤ n, j ≤ m : Xi ⊥ Yj .
- Equivalent to factorisation of probability measure : P (X , Y) = P (X )P (Y)
- Otherwise P (X , Y) = P (X )P (Y|X ) = P (Y)P (X |Y)

IT 2012, slide 17
Example 1 : coin flipping
Experiment : throwing two coins at the same time.
Random variables :
- H1 ∈ {T, F } true if first coin falls on heads
- H2 ∈ {T, F } true if second coin falls on heads
- S ∈ {T, F } true if both coins fall on the same face

Thus S = (H1 ∧ H2 ) ∨ (¬H1 ∧ ¬H2 ).

Coins are independent : H1 ⊥ H2 , and P (S, H1 , H2 ) = P (S|H1 , H2 )P (H1 )P (H2 ).
Graphically : H1 H2

S
A first example of Bayesian network
Suppose the coins are both fair, and compute P (S).

IT 2012, slide 18
This is a very simple example (and classical) of a random experiment.

The first structural information we have is that the two coins behave independently (this is a very realistic, but
not perfectly true assumption). The second structural information we have is that the third random variable is a
(deterministic) function of the other two, in other words its value is a causal consequence of the values of the first
two random variables.

Using only the structural information one can depict graphically the relationship between the three variables, as
shown on the slide. We will see the precise definition of a Baysian belief network tomorrow, but this is actually
a very simple example of this very rich concept. Note that the absence of any link between the two first variables
indicates independence graphically.

To yield a full probabilistic description, we need to specify the following three probability measures : P (H1 ),
P (H2 ) and P (S|H1 , H2 ), i.e. essentially 2 real numbers (since P (S|H1 , H2 ) is already given by the functional
description of S).

If the coins are identical, then we have to specify only one number, e.g. P (H1 = T ).

If the coins are fair, then we know everything, i.e. P (H1 = T ) = 0.5.

Show that if the coins are fair, we have H1 ⊥ S ⊥ H2 . Still, we don’t have H1 ⊥ H2 |S. Explain intuitively.

In general (i.e. for unfair coins) however, we don’t have H1 ⊥ S. For example, suppose that both coins are biased
towards heads.

You can use the “javabayes” application on the computer to play around with this example. More on this later. . .

IT 2012, note 18
Example 2 : binary source with independent symbols ωi ∈ {0, 1}
P (1) : probability that the next symbol is 1.
P (0) = 1 − P (1) : probability that the next symbol is 0.
Let us denote by h(ω) the information provided by one symbol ω.
1
Then : - h(ω) = f ( P (ω) ) where f (·) is increasing and
- limx→1 f (x) = 0 (zero information if event is certain)
On the other hand (symbols are independent) :
For two successive symbols ω1 , ω2 we should have h(ω1 , ω2 ) = h(ω1 ) + h(ω2 ).
But : h(ω1 , ω2 ) = f ( P (ω11 ,ω2 ) ) = f ( P (ω1 )P
1
(ω2 ) )
⇒ f (xy) = f (x) + f (y) ⇒ f (·) ∝ log(·)
Definition : the self-information provided by the observation of an event A ∈ E is
given by : h(A) = − log2 P (A) Shannon
Note : h(A) ≥ 0. When P (A) → 0 : h(A) → +∞.

IT 2012, slide 19
Comments.

The theory which will be developped is not really dependent on the base used to compute logarithms. Base 2 will be
the default, and fits well with binary codes as we will see.

You should convince yourself that the definition of self-information of an event fits with the intuitive requirements
of an information measure.

It is possible to show from some explicit hypotheses that there is no other possible choice for the definition of a
measure of information (remember the way the notion of thermodynamic entropy is justified).

Nevertheless, some alternative measures of information have been proposed based on relaxing some of the require-
ments and imposing some others.

To be really convinced that this measure is the right one, it is necessary to wait for the subsequent lectures, so as to
see what kind of implications this definition has.

Example : questionnaire, weighting strategies.

IT 2012, note 19
Conditional information
Let us consider an event C = A ∩ B :
We have : h(C) = h(A ∩ B) = − log P (A ∩ B) = − log P (A)P (B|A)
= − log P (A) − log P (B|A)
One defines the conditional self-information of the event B given that (or supposing
that) the event A is true : h(B|A) = − log P (B|A)
Thus, once we know that ω ∈ A, the information provided by the observation that
ω ∈ B becomes − log P (B|A).
Note that : h(B|A) ≥ 0
One can write : h(A ∩ B) = h(A) + h(B|A) = h(B) + h(A|B)
In particular : A ⊥ B : h(A ∩ B) = h(A) + h(B)
Thus : h(A ∩ B) ≥ max{h(A), h(B)} ⇒ monotonicity of self-information

IT 2012, slide 20
Illustration : transmission of information
- Ω = Ωi × Ωo : all possible input/output pairs of a channel-source combination.
- A : denotes the observation of a input message; B an output message.
- Linked by transition probability P (B|A) (stochastic channel model).
- P (A) : what a receiver can guess about the sent message before it is sent (knowing
only the model of the source).
- P (A|B) : what a receiver can guess about the sent message after communication
has happened and the output message B has been received.
- P (B|A) represents what we can predict about the output, once we know which
message will be sent.
- Channel without noise (or deterministic) : P (B|A) associates one single possible
output to each input.
- For example if inputs and outputs are binary : P (ωi |ωo ) = δi,o

IT 2012, slide 21
Mutual information of two events
Definition : i(A; B) = h(A) − h(A|B).

Thus : i(A; B) = log PP(A|B)

(A) = log P (A∩B)
P (A)P (B) = i(B; A)

⇒ mutual information is by definition symmetric.

Discussion :
h(A|B) may be larger or smaller than h(A)
⇒ mutual information may be positive, zero or negative.
It is equal to zero iff the events are independent.
Particular cases :
If A ⊃ B then P (A|B) = 1 ⇒ h(A|B) = 0 ⇒ i(A; B) = h(A).
If A ⊃ B then h(B) ≥ h(A).
But converse is not true...

IT 2012, slide 22
Exercise.

Let us consider the coin throwing experiment.

Suppose that the two coins are identical (not necessarily fair), and say that p ∈ [0; 1] denotes the probability to get
heads for either coin.
1
Compute1 the following quantities under the two assumptions p = 2
(fair coins), and p = 1.0 (totally biased coins).

h(H1 = T ), h(H1 = F ), h(H2 = T ), h(H2 = F ).

h([H1 = T ] ∧ [H2 = T ])

h([H1 = T ] ∧ [H1 = F ])

h([S = T ]), h([S = F ]),

h([H1 = T ]|[H2 = T ])

h([H1 = T ]|[S = T ])

h([H1 = T ]|[S = T ], [H2 = T ]))

1
Use base 2 for the logarithms.

IT 2012, note 22
Entropy of a memoryless time-invariant source
Source : at successive times t ∈ {t0 , t1 . . .} sends symbols s(t) chosen from a finite
alphabet S = {s1 , . . . , sn }.
Assumption : successive symbols are independent and chosen according to the same
probability measure (i.e. independant of t) ⇒ Memoryless and time-invariant
Notation : pi = P (s(t) = si )
△ Pn
Definition : Source entropy : H(S) = E{h(s)} = − i=1 pi log pi

Entropy : measures average information provided by the symbols sent by the source :
Shannon/symbol
If F denotes the frequency of operation of the source, then F ·H(S) measures average
information per time unit : Shannon/second.
Note that, because of the law of large numbers the per-symbol information pro-
vided by any long message produced by the source converges (almost surely) towards
H(S).

IT 2012, slide 23
Examples.

Compute the entropy (per symbol) of the following sources :

A source which always emits the same symbol.

A source which emits zeroes and ones according to two fair coin flipping processes.

How can you simulate a fair coin flipping process with a coin which is not necessarily fair ?

IT 2012, note 23
Generalization : entropy of a random variable
Let X be a discrete r.v. : X defines the partition {X1 , . . . , Xn } of Ω.
△ Pn
Entropy of X : H(X ) = − i=1 P (Xi ) log P (Xi )

What if some probs. are zero : limx→0 x log x = 0 : the terms vanish by continuity.
Note : H(X ) does only depend on the values P (Xi )
Particular case : n = 2 (binary source, an event and its negation)
H(X ) = −p log p − (1 − p) log(1 − p) = H2 (p) where p denotes the probability of
(any) one of the two values of X .
Properties of H2 (p) :
H2 (p) = H2 (1 − p)
H2 (0) = H2 (1) = 0
H2 (0.5) = 1 et H2 (p) ≤ 1

IT 2012, slide 24
H2 (p)
H2 (qp1 + (1 − q)p2 )
1

H2 (p2 ) qH2 (p1 ) + (1 − q)H2 (p2 )

H2 (p1 )

0 p

0 p1 qp1 + (1 − q)p2 p2 1

Another remarkable property : concavity (consequences will appear later).

Means that ∀p1 6= p2 ∈ [0, 1], ∀q ∈]0, 1[ we have
H2 (qp1 + (1 − q)p2 ) > qH2 (p1 ) + (1 − q)H2 (p2 )

IT 2012, slide 25
More definitions
Suppose that X = {X1 , . . . , Xn } and Y = {Y1 , . . . , Ym } are two (discrete) r.v.
defined on a sample space Ω.
Joint entropy of X and Y defined by
n X
m
△ X
H(X , Y) = − P (Xi ∩ Yj ) log P (Xi ∩ Yj ). (1)
i=1 j=1

Conditional entropy of X given Y defined by

n X
X m
H(X |Y) = − P (Xi ∩ Yj ) log P (Xi |Yj ). (2)
i=1 j=1

Mutual information defined by

n X
m
X P (Xi ∩ Yj )
I(X ; Y) = + P (Xi ∩ Yj ) log . (3)
i=1 j=1
P (Xi )P (Yj )

IT 2012, slide 26
Note that the joint entropy is nothing novel : it is just the entropy of the random variable Z = (X , Y). For the time
being consider conditional entropy and mutual information as purely mathematical definitions. The fact that these
definitions really make sense will become clear from the study of the properties of these measures.

Exercise (computational) : Consider our double coin flipping experiment. Suppose the coins are both fair.

Compute H(H1 ), H(H2 ), H(S). Compute H(H1 , H2 ), H(H2 , H1 ), H(H1 , S), H(H2 , S) and H(H1 , H2 , S)

Compute H(H1 |H2 ), H(H2 |H1 ) and then H(S|H1 , H2 ) and H(H1 , H2 |S)

Compute I(H1 ; H2 ), I(H2 ; H1 ) and then I(S; H1 ) and I(S; H2 ) and I(S; H1 , H2 ).

Experiment to work out for tomorrow.

You are given 12 balls, all of which are equal in weight except for one which is either lighter or heavier. You are
also given a two-pan balance to use. In each use of the balance you may put any number of the 12 balls on the left
pan, and the same number (of the remaining) balls on the right pan. The result may be one of three outcomes : equal
weights on both pans; left pan heavier; right pan heavier. Your task is to design a strategy to determine which is the
odd ball and whether it is lighter or heavier in as few (expected) uses of the balance as possible.
While thinking about this problem, you should consider the following questions :
• How can you measure information ? What is the most information you can get from a single weighing ?
• How much information have you gained (in average) when you have identified the odd ball and whether it is
lighter or heavier ?
• What is the smallest number of weighings that might concevably be sufficient to always identify the odd ball and
whether it is heavy or light ?
• As you design a strategy you can draw a tree showing for each of the three outcomes of a weighing what weighing
to do next. What is the probability of each of the possible outcomes of the first weighing ?

IT 2012, note 26
Properties of the function Hn (· · ·)
△ Pn
Notation : Hn (p1 , p2 , . . . , pn ) = − i=1 pi log pi
Pn
( pi discrete probability distribution, i.e. pi ∈ [0, 1] and i=1 pi = 1)
Positivity : Hn (p1 , p2 , . . . , pn ) ≥ 0 (evident)
Annulation : Hn (p1 , p2 , . . . , pn ) = 0 ⇒ pi = δi,j (also evident)

Maximal : ⇔ pi = n1 , ∀i. (proof follows later)

Concavity : (proof follows)
Let (p1 , p2 , . . . , pn ) and (q1 , q2 , . . . , qn ) be two discrete probability distributions and
λ ∈ [0, 1] then
Hn (λp1 + (1 − λ)q1 , . . . , λpn + (1 − λ)qn )
≥ (4)
λHn (p1 , . . . , pn ) + (1 − λ)Hn (q1 , . . . , qn ),

IT 2012, slide 27
The very important properties of the information and entropy measures which have been introduced before, are all
consequences of the mathematical properties of the entropy function Hn defined and stated on the slide.

The following slides provide proofs respectively of the maximality and concavity properties which are not trivial.

In addition, let us recall the fact that the function is invariant with respect to any permutation of its arguments.

Questions :

What is the entropy of a uniform distribution ?

Relate entropy function intuitively to uncertainty and thermodynamic entropy.

Nota bene :

Entropy function : function defined on a (convex) subset of IRn .

Entropy measure : function defined on the set of random variables defined on a given sample space.

⇒ strictly speaking these are two different notions, and that is the reason to use two different notations (Hn vs H).

IT 2012, note 27
Maximum of entropy function (proof)
Gibbs inequality : (Lemma, useful also for the sequel ⇒ keep it in mind)
Formulation : let (p1 , p2 , . . . , pn ) et (q1 , q2 , . . . , qn ) two probability distributions.
Then,
n
X qi
pi log ≤ 0, (5)
i=1
pi

where equality holds if, and only if, ∀i : pi = qi .

Proof : (we use the fact that : ln x ≤ x − 1, with equality ⇔ x = 1, slide below)
Pn qi
Let us proove that i=1 pi ln pi ≤ 0

In ln x ≤ x − 1 replace x by pqii , multiply then by pi sum over index i, which gives

(when all pi are strictly positive)
Pn qi Pn qi Pn Pn
i=1 pi ln pi ≤ i=1 pi ( pi − 1) = i=1 qi − i=1 pi =1−1=0

Homework : convince yourself that this remains true even when some of the pi = 0.

IT 2012, slide 28
y
2.

1.5 y =x−1

0.5
y = ln x

0.0 x
0.0 0.5 1. 1.5 2. 2.5 3.
-.5

-1.

-1.5

-2.

IT 2012, slide 29
Theorem :
Hn (p1 , p2 , . . . , pn ) ≤ log n, with equality ⇔ ∀i : pi = n1 .
Proof
1
Let us apply Gibbs inequality with qi = n

We find
n n n
X 1 X 1 X
pi log = pi log − pi log n ≤ 0 ⇒
i=1
npi i=1
pi i=1
n n
X 1 X
Hn (p1 , p2 , . . . , pn ) = pi log ≤ pi log n = log n,
i=1
pi i=1
1
where equality holds if, and only if all pi = qi = n 2

IT 2012, slide 30
Concavity of entropy function (proof)
Let (p1 , p2 , . . . , pn ) and (q1 , q2 , . . . , qn ) be two probability distributions and λ ∈
[0, 1], then

Hn (λp1 + (1 − λ)q1 , . . . , λpn + (1 − λ)qn )

≥ (6)
λHn (p1 , . . . , pn ) + (1 − λ)Hn (q1 , . . . , qn ),

An (apparently) more general (but logically equivalent) formulation :

Mixture of an arbitrary number of probability distributions
Pm m
Hn ( j=1 λj pnj )
P
j=1 λj p1j , . . . ,
≥ (7)
Pm
j=1 λj Hn (p1j , . . . , pnj ),
Pm Pn
where λj ∈ [0, 1], i=j λj = 1, et ∀j : i=1 pij = 1.

IT 2012, slide 31
Graphiquely : mixing increases entropy (cf thermodynamics)

λ1 λ2 λ3 λ4 λ5

P
p1,1 p1,2 p1,5 λj p1,j
j

p2,1 p2,2
P
λj p2,j
j
→
p2,5
P
p3,1 p3,2 λj p3,j
j

p3,5
p4,2 P
p4,1 λj p4,j
p4,5 j

Pm Pm Pm
j=1 λj Hn (p1j , . . . , pnj ) ≤ Hn ( j=1 λj p1j , . . . , j=1 λj pnj )
Pm Pm
Proof : f (x) = −x log x is concave on [0, 1] : f ( j=1 λj xj ) ≥ j=1 λj f (xj ).
Pm Pm Pn Pm
Thus we have : Hn ( j=1 λj p1j , . . . , j=1 λj pnj ) = i=1 f ( j=1 λj pij )
≥ ni=1 [ m
Pm Pn
(p )] = i=1 f (pij )
P P
λ
j=1 j f ij j=1 jλ
Pm
= j=1 λj Hn (p1j , . . . , pnj )

IT 2012, slide 32
Another interpretation (thermodynamic)

Isolated containers red Containers communicate (at equilibrium)

blue Entropy is smaller (before mixing) Entropy is higher (after mixing)

Suppose that you pick a molecule in one of the two cuves and have to guess which kind of molecule you will obtain :
in both cases you can take into account in which container you pick the molecule.

⇒ there is less uncertainty in the left cuve than in the right. Compute entropies relevant to this problem.

IT 2012, note 32
(Let us open a parenthesis : notions of convexity/concavity
Convex set : a set which contains all the line segments joining any two of its points.
C ⊂ IRp is convex (by def.) if

x, y ∈ C, λ ∈ [0, 1] ⇒ λx + (1 − λ)y ∈ C.

Examples :

Convexe Non convexes

In IR : intervals, semi-intervals, IR.

IT 2012, slide 33
Examples : sets of probability distributions : n = 2 et n = 3

p2 p3
1 1
p2
1

p1
1 p1
1

More generally :
Linear subspaces, semi-planes are convex
Any intersection of convex sets is also convex (ex. polyhedra)
Ellipsoids : {x|(x − xc )T A−1 (x − xc ) ≤ 1} (A positive definite)

IT 2012, slide 34
Convex functions
f (·) : IRp → IR convex on a convex subset C of IRp if :

x, y ∈ C, λ ∈ [0, 1] ⇒

f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y).

y epigraph : {(x, y)|x ∈ C, f (x) ≤ y}

convex set ⇔ f (·) convex

A sum of convex functions is also convex

R
y = f (x) If g(x, y) if convex in x then also g(x, y)dy
fi convex, then max{fi } convex
f (x) convex ⇒ f (Ax + b) convex

x Similar properties hold for concave functions...

x1 λx1 + (1 − λ)x2 x2

IT 2012, slide 35
Strictly convex function
If equality holds only for the trivial cases

λ ∈ {0, 1} and/or x = y

(Strictly) concave function

If −f (·) is (strictly) convex.
Important properties
- A convex (or concave) function is continuous inside a convex set
- Every local mininum of a convex function is also a global minimum
Criteria :
- convex iff convex epigraph
- convex if second derivative (resp. Hessian) positive (resp. positive definite)
In practice : there exist powerful optimization algorithms...

IT 2012, slide 36
Notion of convex linear combination,
Pm Pm
( j λj xj ), with λj ≥ 0 and j λj = 1.

Convex hull of some points Set of points which are

convex linear combinations
of these points

These are the points which may be written

as a kind of weighted average of the starting points
(non-negative weights)
Associate weights ⇒ center of gravity

A convex hull is a convex set.

In fact, it is the smallest convex set which contains the points xj .
(≡ Intersection of all convex sets containing these points.)

IT 2012, slide 37
Jensen’s inequality :
If f (·) is a convex function defined on IR → IR and X a real random variable
f (E{X }) ≤ E{f (X )} where, if the function is strictly convex, equality holds iff X
is constant almost surely.
Extension to vectors ...
Concave functions : ≤ −→ ≥...
Particular case : convex linear combinations
The λj ’s act as a discrete probability measure on Ω = {1, . . . , m}.
And xj denotes the value of X at point ω = i.
Pm Pm
Hence : f ( j λj xj ) ≤ j λj f (xj )

Let us close the parenthesis...)

IT 2012, slide 38
Let us return to the entropies.
Strictly concave functions f (x) = −x log x and g(x) = log x.

y
0.5

One deduces
0.4
H(X ), H(Y), H(X , Y)
are maximal for uniform distri-
0.3
butions.

0.2 Also, H(X |Y) is maximal if

for each j P (X |Yj ) is uniform,
which is possible only if P (X )
0.1
is uniform and X indep. of Y.

0.0
x
0.0 0.25 0.5 0.75 1.

IT 2012, slide 39
The fact that f (x) = −x log x is strictly concave is clear from the picture. Clearly log x is also concave.

All inequalities related to the entropy function may be easily deduced from the concavity of these two functions and
Jensen’s inequality. For example, let us introduce here a new quantity called relative entropy or Kullback Leibler
distance. Let P and Q be two discrete probability distributions defined on a discrete Ω = {ω1 , . . . , ωn }. The
Kullback Leibler distance (or relative entropy) of P w.r.t. Q is defined by
X P (ω)
D(P ||Q) = P (ω) log (8)
Q(ω)
ω∈Ω

Jensen’s inequality allows us to prove that 2 D(P ||Q) ≥ 0 :

X P (ω) X Q(ω)
−D(P ||Q) = − P (ω) log = P (ω) log (9)
Q(ω) P (ω)
ω∈Ω ω∈Ω
! !
X Q(ω) X
≤ log P (ω) = log Q(ω) = log 1 = 0 (10)
P (ω)
ω∈Ω ω∈Ω

where the inequality follows from Jensen’s inequality applied to the concave function log x. Because the function is
Q(ω)
strictly concave, equality holds only if P (ω) is constant over Ω (and hence equal to 1), which justifies the name of
distance of the relative entropy.
2
This is nothing else than Gibbs inequality, which we have already proven without using Jensen’s inequality.

IT 2012, note 39

Object Oriented System Design Quantum
70% (10)
Object Oriented System Design Quantum
253 pages
Exam in Community Engagement Solidarity & Citizenship
100% (10)
Exam in Community Engagement Solidarity & Citizenship
4 pages
An Introduction to Information Theory
From Everand
An Introduction to Information Theory
Fazlollah M. Reza
No ratings yet
Sppa T3000
100% (1)
Sppa T3000
6 pages
Paper Id - 355 Smart Structure For Automated Rangoli
No ratings yet
Paper Id - 355 Smart Structure For Automated Rangoli
6 pages
1 Information Theory
No ratings yet
1 Information Theory
57 pages
Overview & Chapter 1
No ratings yet
Overview & Chapter 1
45 pages
01-Syllabus and Intro
No ratings yet
01-Syllabus and Intro
21 pages
Info Theory
No ratings yet
Info Theory
59 pages
Information and Coding Theory ECE533 - Lec-1 Introduction. Information Theory v4.0
No ratings yet
Information and Coding Theory ECE533 - Lec-1 Introduction. Information Theory v4.0
9 pages
Introduction To Information Theory
75% (4)
Introduction To Information Theory
178 pages
Information Theory and Coding
0% (1)
Information Theory and Coding
8 pages
Unit 1
100% (2)
Unit 1
45 pages
CHAPTER 1
No ratings yet
CHAPTER 1
22 pages
Lecture Notes in Information Theory Volume II
No ratings yet
Lecture Notes in Information Theory Volume II
293 pages
L1_CE4
No ratings yet
L1_CE4
39 pages
Lecture1
No ratings yet
Lecture1
19 pages
Ec23ec4211itc PPT
No ratings yet
Ec23ec4211itc PPT
148 pages
A Visual Introduction To Information Theory
No ratings yet
A Visual Introduction To Information Theory
43 pages
Inf Theory
No ratings yet
Inf Theory
17 pages
ECE 723 - Information Theory and Coding
No ratings yet
ECE 723 - Information Theory and Coding
23 pages
Nikesh Bajaj: Information Theory and Coding
No ratings yet
Nikesh Bajaj: Information Theory and Coding
10 pages
[Electronic Science] Norman Abramson - Information Theory and Coding (1963, McGraw-Hill Inc.,US) - Libgen.li
No ratings yet
[Electronic Science] Norman Abramson - Information Theory and Coding (1963, McGraw-Hill Inc.,US) - Libgen.li
111 pages
ECE 565a: Information Theory Instructor: Prof. Salman Avestimehr
No ratings yet
ECE 565a: Information Theory Instructor: Prof. Salman Avestimehr
49 pages
18 753 Information
No ratings yet
18 753 Information
71 pages
The Information Theory: C.E. Shannon, A Mathematical Theory of Communication'
No ratings yet
The Information Theory: C.E. Shannon, A Mathematical Theory of Communication'
43 pages
Information Theory and Coding (Lecture 1) : Dr. Farman Ullah
No ratings yet
Information Theory and Coding (Lecture 1) : Dr. Farman Ullah
32 pages
Lec35 - 210108062 - ZAINAB ALI
No ratings yet
Lec35 - 210108062 - ZAINAB ALI
9 pages
Information Theory &amp Coding (ECE) by Nitin Mittal
0% (1)
Information Theory &amp Coding (ECE) by Nitin Mittal
15 pages
DC Lecture Slides 1 - Information Theory
No ratings yet
DC Lecture Slides 1 - Information Theory
22 pages
COSC 402 Information Theory
No ratings yet
COSC 402 Information Theory
291 pages
Lect. No. Topics To Be Covered Learning Objectives: Chapter in The Text Book
No ratings yet
Lect. No. Topics To Be Covered Learning Objectives: Chapter in The Text Book
3 pages
Infonoise
No ratings yet
Infonoise
37 pages
Amount of Information I Log (1/P)
No ratings yet
Amount of Information I Log (1/P)
2 pages
Information Coding
No ratings yet
Information Coding
9 pages
Lec-1 Introduction. Information Theory
No ratings yet
Lec-1 Introduction. Information Theory
9 pages
Digital Communication Unit 5
No ratings yet
Digital Communication Unit 5
105 pages
Introduction to ITC
No ratings yet
Introduction to ITC
24 pages
ITC Lecture 01
No ratings yet
ITC Lecture 01
31 pages
Basic Lecture Information Theory and Coding
No ratings yet
Basic Lecture Information Theory and Coding
68 pages
ITC Lecture1
No ratings yet
ITC Lecture1
17 pages
C&C Combined Module Notes
No ratings yet
C&C Combined Module Notes
206 pages
21ECE72_Coding and Cryp Module 1
No ratings yet
21ECE72_Coding and Cryp Module 1
34 pages
Lecture 1 - Introduction
No ratings yet
Lecture 1 - Introduction
27 pages
Information Theory Final
No ratings yet
Information Theory Final
50 pages
EE-317: Information Theory: Fall 2008
No ratings yet
EE-317: Information Theory: Fall 2008
38 pages
Information T Information Theory and Coding: S.Chandramohan
No ratings yet
Information T Information Theory and Coding: S.Chandramohan
38 pages
Unit 1 INFORMATION ENTROPY FUNDAMENTALS
No ratings yet
Unit 1 INFORMATION ENTROPY FUNDAMENTALS
13 pages
Chapter 1 (A)
No ratings yet
Chapter 1 (A)
30 pages
Information Theory IT4592E: Dang Tuan Linh
No ratings yet
Information Theory IT4592E: Dang Tuan Linh
10 pages
Information Theory Lecture Notes
100% (1)
Information Theory Lecture Notes
97 pages
CSC 402 Information Theory3
No ratings yet
CSC 402 Information Theory3
292 pages
Information Theory PDF
No ratings yet
Information Theory PDF
26 pages
Information Theory
No ratings yet
Information Theory
37 pages
Information Theory and Coding - Solved Problems (PDFDrive)
No ratings yet
Information Theory and Coding - Solved Problems (PDFDrive)
517 pages
ITC
No ratings yet
ITC
3 pages
1067117361 (1)
No ratings yet
1067117361 (1)
72 pages
Troanary Photonic Storage Blueprint - How Light Based Logic can Redefine Computation and Data Storage: Volume 10 Troanary Photonic Storage Blueprint, #1
From Everand
Troanary Photonic Storage Blueprint - How Light Based Logic can Redefine Computation and Data Storage: Volume 10 Troanary Photonic Storage Blueprint, #1
Ylia Callan
No ratings yet
Photonics of Intelligence - Unleashing Troanary Computing: Volume 11 Photonics of Intelligence, #1
From Everand
Photonics of Intelligence - Unleashing Troanary Computing: Volume 11 Photonics of Intelligence, #1
Ylia Callan
No ratings yet
Algorithmic Information Theory: Fundamentals and Applications
From Everand
Algorithmic Information Theory: Fundamentals and Applications
Fouad Sabry
No ratings yet
A Trek Beyond Complexity: A Journey Through Discrete Math for Computing
From Everand
A Trek Beyond Complexity: A Journey Through Discrete Math for Computing
Pasquale De Marco
No ratings yet
Mivar NETs and logical inference with the linear complexity
From Everand
Mivar NETs and logical inference with the linear complexity
Varlamov, Oleg O.
No ratings yet
Big-O Notation Demystified: Definitive Reference for Developers and Engineers
From Everand
Big-O Notation Demystified: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
1 s2.0 S2405844024160999 Main
No ratings yet
1 s2.0 S2405844024160999 Main
53 pages
IET Generation Trans Dist - 2023 - Mansour - Applications of IoT and Digital Twin in Electrical Power Systems A
No ratings yet
IET Generation Trans Dist - 2023 - Mansour - Applications of IoT and Digital Twin in Electrical Power Systems A
23 pages
Report 2.3-Revised-Final - Kaal Harir Abdulle, 160041080
No ratings yet
Report 2.3-Revised-Final - Kaal Harir Abdulle, 160041080
36 pages
4 Information Theory
No ratings yet
4 Information Theory
53 pages
Public-Key Cryptography: CCA Secure PKE Hybrid Encryption
No ratings yet
Public-Key Cryptography: CCA Secure PKE Hybrid Encryption
18 pages
2 Information Theory
No ratings yet
2 Information Theory
40 pages
Symmetric-Key Encryption: Constructions: PRG, PRF Stream and Block Ciphers
No ratings yet
Symmetric-Key Encryption: Constructions: PRG, PRF Stream and Block Ciphers
19 pages
COMPUTER-STUDIES-SYLLABUS-1
No ratings yet
COMPUTER-STUDIES-SYLLABUS-1
64 pages
Quiz on Acctsys
No ratings yet
Quiz on Acctsys
2 pages
Hierarchical Task Analysis: Developments, Applications, and Extensions
No ratings yet
Hierarchical Task Analysis: Developments, Applications, and Extensions
25 pages
Social Impacts of (Information) Technology
100% (3)
Social Impacts of (Information) Technology
5 pages
Elective II: Traffic Engineering and Management
No ratings yet
Elective II: Traffic Engineering and Management
18 pages
Monitoring and Evaluation For Cyclical Planning
No ratings yet
Monitoring and Evaluation For Cyclical Planning
19 pages
Enabling Safe Autonomous Driving in Real-World
No ratings yet
Enabling Safe Autonomous Driving in Real-World
14 pages
Management Information System For Effective and Efficient Decision Making: A Case Study
No ratings yet
Management Information System For Effective and Efficient Decision Making: A Case Study
21 pages
Design and Research of Campus E-Commerce System Based On B/S
No ratings yet
Design and Research of Campus E-Commerce System Based On B/S
4 pages
Spe 191598
No ratings yet
Spe 191598
14 pages
12+HiranFinal
No ratings yet
12+HiranFinal
14 pages
Chapter 2
No ratings yet
Chapter 2
38 pages
Online Admission System Project Report
No ratings yet
Online Admission System Project Report
40 pages
On Comprehensive Project: BCA (Session 2010-2013)
No ratings yet
On Comprehensive Project: BCA (Session 2010-2013)
74 pages
Advanced Control Strategies Word
No ratings yet
Advanced Control Strategies Word
6 pages
High Performance Work Systems Literature Review
100% (1)
High Performance Work Systems Literature Review
5 pages
Artificial Intelligence Introduction
No ratings yet
Artificial Intelligence Introduction
37 pages
Role of Diagramming in Systems Investigation
100% (1)
Role of Diagramming in Systems Investigation
25 pages
Nature and Scope
No ratings yet
Nature and Scope
11 pages
DMC Template With CGPA Calculation Computer
100% (1)
DMC Template With CGPA Calculation Computer
19 pages
The Interactionist Approach: Client Individual or Group System Group, Agency, or Other
0% (1)
The Interactionist Approach: Client Individual or Group System Group, Agency, or Other
27 pages
Condition-Based Maintenance Implementation: A Literature Review Condition-Based Maintenance Implementation: A Literature Review
No ratings yet
Condition-Based Maintenance Implementation: A Literature Review Condition-Based Maintenance Implementation: A Literature Review
8 pages
Work Activity Studies Within The Framework Of Ergonomics Psychology And Economics Bedny instant download
No ratings yet
Work Activity Studies Within The Framework Of Ergonomics Psychology And Economics Bedny instant download
79 pages
Generative+Algorithms CaE Strip+Morphologies
No ratings yet
Generative+Algorithms CaE Strip+Morphologies
53 pages
A01 Production Assurance US
No ratings yet
A01 Production Assurance US
16 pages

Course 1

Uploaded by

Course 1

Uploaded by

Introduction to information theory and coding

University of Liège - Department of Electrical and Computer Engineering

• Your personal notes

• Detailed course notes (in french; centrale des cours).

• For further reading, some reference books in english

– J. Adámek, Foundations of coding, Wiley Interscience, 1991.

1. Introduce information theory

Obervable System : Observable

Observable System : Observable

Examples of applications of stochastic system theory :

1. What is it all about ?

Théorie Inf. Fisher

Source CS CC Channel DC DS Reveiver

Source coding : remove redundancy (make message as short as possible)

⇒ one says that E is a σ-algebra

⇒ one says that P (·) is a probability measure

Careful : P (A|B) defined only if P (B) > 0.

Thus a random variable can also be viewed as a partition {X1 , . . . , Xn } of Ω.

A random variable provides condensed information about the experiment.

Conclusion : a discrete r.v. ≡ complete system of events.

Thus S = (H1 ∧ H2 ) ∨ (¬H1 ∧ ¬H2 ).

Example : questionnaire, weighting strategies.

Thus : i(A; B) = log PP(A|B)

⇒ mutual information is by definition symmetric.

Let us consider the coin throwing experiment.

h(H1 = T ), h(H1 = F ), h(H2 = T ), h(H2 = F ).

h([S = T ]), h([S = F ]),

h([H1 = T ]|[S = T ], [H2 = T ]))

Compute the entropy (per symbol) of the following sources :

A source which always emits the same symbol.

H2 (p2 ) qH2 (p1 ) + (1 − q)H2 (p2 )

Another remarkable property : concavity (consequences will appear later).

Conditional entropy of X given Y defined by

Mutual information defined by

Experiment to work out for tomorrow.

Maximal : ⇔ pi = n1 , ∀i. (proof follows later)

What is the entropy of a uniform distribution ?

Relate entropy function intuitively to uncertainty and thermodynamic entropy.

Entropy function : function defined on a (convex) subset of IRn .

where equality holds if, and only if, ∀i : pi = qi .

In ln x ≤ x − 1 replace x by pqii , multiply then by pi sum over index i, which gives

Hn (λp1 + (1 − λ)q1 , . . . , λpn + (1 − λ)qn )

An (apparently) more general (but logically equivalent) formulation :

Isolated containers red Containers communicate (at equilibrium)

blue Entropy is smaller (before mixing) Entropy is higher (after mixing)

Convexe Non convexes

In IR : intervals, semi-intervals, IR.

f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y).

y epigraph : {(x, y)|x ∈ C, f (x) ≤ y}

A sum of convex functions is also convex

x Similar properties hold for concave functions...

(Strictly) concave function

Convex hull of some points Set of points which are

These are the points which may be written

A convex hull is a convex set.

Let us close the parenthesis...)

0.2 Also, H(X |Y) is maximal if

Jensen’s inequality allows us to prove that 2 D(P ||Q) ≥ 0 :

You might also like