
Information Theory

lecture notes, Fall 2021

First lecture (September 7, 2021)

Some “warming up” problems:


1. What contains more information: the date of someone’s birthday or the
month of someone’s birth?
Answer: Obviously the first one as it contains the second.
2. What contains more information: the month of someone’s birthday or the
name of the day of the week when this person was born?
Answer: The first one because it specifies one out of 12 possibilities, while the
other specifies one out of only 7 possibilities.
3. How many yes-no questions are needed (in the worst case, but for the best
strategy) to find a date of the year (someone’s birthday, say)?
Answer: It is ⌈log_2 366⌉ = 9.
That many may be needed because each question partitions the set of possibili-
ties into two subsets and if the answer always leaves us with the larger partition
class then we will reduce the number of possibilities to 1 only after 9 questions.
That many is enough, because if we always ask questions that partition the set
of still possible numbers into two subsets whose respective sizes are as close to
each other as possible (i.e., they differ by at most 1), then after 9 answers we
will have only one option remaining.
This argument shows that with 9 questions we can find one out of at most
512 numbers with similar rules. In general, if we have n options, then ⌈log_2 n⌉
questions are needed.
4. How many questions do we need if we must ask all of them together, without
knowing the answer to previous ones?
Answer: Somewhat surprisingly, the answer is the same. The situation is more
restricted now, so we need at least as many questions as before. The point is
that we do not need more. Here is an optimal strategy: write all the numbers
(or dates, up to 366 or 512 or n, in general) into binary form and let the ith
question be whether the ith binary digit is 0 (or 1). Clearly, this gives ⌈log_2 n⌉
questions and knowing the answer to all of them identifies the number (the date)
we look for.
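A minimal sketch of this non-adaptive strategy (dates treated as numbers 0, ..., 365; the function names are mine, not from the notes):

```python
import math

def questions_needed(n):
    """Number of yes/no questions needed to identify one of n options."""
    return math.ceil(math.log2(n))

def answers(value, n):
    """Answers to the pre-announced questions "is the i-th binary digit of the number 1?"."""
    k = questions_needed(n)
    return [(value >> i) & 1 for i in range(k)]

def identify(bits):
    """Reconstruct the value from the list of answers."""
    return sum(b << i for i, b in enumerate(bits))

birthday = 215                 # a day of the year, numbered 0..365
a = answers(birthday, 366)     # 9 answers, since ceil(log2 366) = 9
assert identify(a) == birthday
print(len(a))                  # -> 9
```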

We will measure information content by the number of binary digits (“bits”)


needed to “describe” the information. Thus the information contained in telling
one out of 512 numbers is 9 bits.
We feel that the above is plausible only if the probabilities of the possibilities are
close to each other. E.g., learning that someone did not hit the jackpot in
the lottery game this week seems to carry much less information than learning that the
same person actually did hit the jackpot. So we will have to take probabilities
into account as well.

Historical remark: Information theory is a branch of science that has a surpris-


ingly clear starting point: it is the publication of a two-part article by Claude
Elwood Shannon in 1948 entitled “A Mathematical Theory of Communication”.
Many basic results are already contained in this article. Naturally, many other
important results were obtained during the more than 70 years that passed
since then, and information theory today is considered a flourishing field of
both mathematics and electrical engineering.

Two main problems in information theory are:


Source coding
Channel coding
Goal of source coding: compressing data, that is encoding data with reduced
redundancy.
Goal of channel coding: safe data transmission, that is encoding messages so
that one can still correctly decode them after transmission in spite of channel
noise. (This is achieved by increasing redundancy in some clever way.)

Variable length source coding

Notation: For a finite set V , the set of all finite length sequences of elements of
V will be denoted by V ∗ .
Model: The source emits a sequence of random symbols that are elements of the source
alphabet X = {x^{(1)}, ..., x^{(r)}}.
Given a code alphabet Y = {y_1, ..., y_s} (with s elements) we seek an encoding
function f : X → Y* which encodes the source efficiently.
Meaning of "efficient": it uses sequences of y_i's that are as short as possible, while the
original x^{(j)} can always be reproduced correctly.
Meaning of "short": the average length of codewords should be small. The
average is calculated according to the probability distribution characterizing
the source: we assume that the emitted symbol is a random variable X and in
the ideal situation we know the distribution of X that governs the behavior of
the source.
Def. A uniquely decodable (UD) code is a function f : X → Y* satisfying that
for all u, v ∈ X*, u = u_1 u_2 ... u_k, v = v_1 v_2 ... v_m, u ≠ v implies f(u_1)f(u_2)...f(u_k) ≠
f(v_1)f(v_2)...f(v_m) (where f(a)f(b) means the sequence obtained by concate-
nating the sequences f(a) and f(b)).
Prefix code: no codeword f(x^{(i)}) is a prefix of another. A prefix code is always
UD.
Examples (codes are given here by their collection of codewords): C_1 = (0, 10, 110, 111)
is UD, even prefix. C_2 = (0, 10, 100, 101) is not prefix, not even UD: 100 can be
f(x^{(2)})f(x^{(1)}) as well as f(x^{(3)}). But C_3 = (0, 01) is UD, although not prefix.
Question: Why do we care about variable length and not simply use |X| code-
words of length ⌈log_s |X|⌉ each?
Answer: The average length may be better, as the following example shows. Let the probabilities of
emitting the symbols be p(x^{(1)}) = 1/2, p(x^{(2)}) = 1/4, p(x^{(3)}) = 1/8, p(x^{(4)}) =
1/8. The code f(x^{(1)}) = 0, f(x^{(2)}) = 10, f(x^{(3)}) = 110, f(x^{(4)}) = 111 has
average length 1 · 1/2 + 2 · 1/4 + 3 · 1/8 + 3 · 1/8 = 1.75 < 2 = log_2 4.
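A quick sketch verifying this average-length computation (the symbol names are mine; the probabilities and codewords are those of the example above):

```python
probs = {"x1": 0.5, "x2": 0.25, "x3": 0.125, "x4": 0.125}
code  = {"x1": "0", "x2": "10", "x3": "110", "x4": "111"}

# expected codeword length under the source distribution
avg_len = sum(probs[x] * len(code[x]) for x in probs)
print(avg_len)   # -> 1.75, better than the 2 bits of a fixed-length code
```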

Next we state and prove two basic theorems that belong together: they sort of
complement each other. (One could certainly look at them as the two parts of
a single theorem.)

Kraft-McMillan inequality

Theorem 1 (McMillan): If C = (f(x^{(1)}), ..., f(x^{(r)})) is a UD code over an
s-ary alphabet, then
$$\sum_{i=1}^{r} s^{-|f(x^{(i)})|} \le 1.$$

Theorem 2 (Kraft): If the positive integers l_1, ..., l_r satisfy
$$\sum_{i=1}^{r} s^{-l_i} \le 1,$$
then there exists an s-ary prefix code with codeword lengths l_1, ..., l_r.

Proof of McMillan's theorem. Consider
$$\left(\sum_{i=1}^{r} s^{-|f(x^{(i)})|}\right)^{k} = \sum_{v \in C^k} s^{-|v|} = \sum_{l=1}^{k \cdot l_{\max}} A_l s^{-l},$$
where A_l is the number of l-length strings of code symbols that arise as concatenations
of k codewords, and l_max is the length of the longest codeword f(x^{(i)}).
Since the code is UD, we cannot have more than s^l different source strings
resulting in such an l-length string, so A_l ≤ s^l. Thus the right hand side is at
most k · l_max, giving $\left(\sum_{i=1}^{r} s^{-|f(x^{(i)})|}\right)^{k} \le k \cdot l_{\max}$. Taking the kth root and the limit as
k → ∞, the result follows. 2

Proof of Kraft's theorem. Arrange the lengths in nondecreasing order, i.e., l_1 ≤
... ≤ l_r. Define the numbers w_1 := 0 and for j > 1 let
$$w_j := \sum_{i=1}^{j-1} s^{l_j - l_i}.$$
This gives $w_j = s^{l_j} \sum_{i=1}^{j-1} s^{-l_i} < s^{l_j} \sum_{i=1}^{j} s^{-l_i} \le s^{l_j}$, thus the s-ary form of w_j
has at most l_j digits. Let f(x^{(j)}) be the s-ary form of w_j, "padded" with 0's at
the beginning if necessary to make it have length exactly l_j, for every j. This
gives a code; we show it is prefix. Assume some f(x^{(j)}) is just the continuation
of another f(x^{(h)}). (Then l_j > l_h, so j > h.) Thus cutting the last l_j − l_h digits
of f(x^{(j)}) we get f(x^{(h)}). This "cutting" corresponds to division by s^{l_j − l_h} (plus
taking the integer part), so this would mean
$$w_h = \left\lfloor \frac{w_j}{s^{l_j - l_h}} \right\rfloor = \left\lfloor s^{l_h} \sum_{i=1}^{j-1} s^{-l_i} \right\rfloor = \left\lfloor s^{l_h} \sum_{i=1}^{h-1} s^{-l_i} + s^{l_h} \sum_{i=h}^{j-1} s^{-l_i} \right\rfloor \ge w_h + 1,$$
a contradiction. 2
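The construction in this proof translates directly into a short program. A minimal sketch (function names are mine):

```python
def to_base_s(n, s, width):
    """s-ary representation of n, padded with leading zeros to the given width."""
    digits = []
    for _ in range(width):
        digits.append(n % s)
        n //= s
    return "".join(str(d) for d in reversed(digits))

def kraft_prefix_code(lengths, s=2):
    """Construct an s-ary prefix code with the given codeword lengths,
    following the construction in the proof of Kraft's theorem.
    Requires sum of s**(-l) over the lengths to be at most 1."""
    assert sum(s ** (-l) for l in lengths) <= 1, "Kraft inequality violated"
    lengths = sorted(lengths)
    code = []
    for j, lj in enumerate(lengths):
        # w_j = sum_{i<j} s^(l_j - l_i), an integer since l_i <= l_j
        wj = sum(s ** (lj - li) for li in lengths[:j])
        code.append(to_base_s(wj, s, lj))
    return code

print(kraft_prefix_code([1, 2, 3, 3]))  # -> ['0', '10', '110', '111']
```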

Second lecture (September 14, 2021)

Idea: Kraft's theorem implies that there is a prefix code with codeword lengths
⌈log_s(1/p_1)⌉, ..., ⌈log_s(1/p_m)⌉, since
$$1 = \sum_{i=1}^{m} p_i = \sum_{i=1}^{m} s^{\log_s p_i} = \sum_{i=1}^{m} s^{-\log_s(1/p_i)} \ge \sum_{i=1}^{m} s^{-\lceil \log_s(1/p_i) \rceil}.$$

Such a code has average length
$$\sum_{i=1}^{m} p_i \left\lceil \log_s \frac{1}{p_i} \right\rceil < \sum_{i=1}^{m} p_i \left( \log_s \frac{1}{p_i} + 1 \right) \le \sum_{i=1}^{m} p_i \log_s \frac{1}{p_i} + \sum_{i=1}^{m} p_i = \sum_{i=1}^{m} p_i \log_s \frac{1}{p_i} + 1.$$

The sum $\sum_{i=1}^{m} p_i \log_s \frac{1}{p_i}$ is an important quantity in information theory. Usually
we define it with s = 2.

Convention: When no base for a logarithm is given, we mean it to be base 2.
Def. The entropy H(P) of the probability distribution P = (p_1, ..., p_r) is
defined as
$$H(P) = -\sum_{i=1}^{r} p_i \log p_i.$$
For r = 2 we speak about the binary entropy function of the distribution P =
(p, 1 − p) and denote it by h(p). Thus h(p) = −p log p − (1 − p) log(1 − p).
The quantity
$$H_s(P) := \sum_{i=1}^{r} p_i \log_s \frac{1}{p_i} = \frac{1}{\log s} H(P)$$
is sometimes called the s-ary or base s entropy of P . (Note however, that the
binary entropy function mentioned above is not the s = 2 case of this. Rather
the s = 2 case of the s-ary entropy we simply call entropy in accordance with
the convention that logarithms are to the base 2 if not said otherwise. If we
really want to emphasize that s = 2 we might say base-2 entropy.)
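A small sketch of these quantities (the function names are mine; entropies are in bits unless another base is passed):

```python
import math

def entropy(p, base=2):
    """H_s(P) = sum of p_i * log_s(1/p_i); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(1 / pi, base) for pi in p if pi > 0)

def binary_entropy(p):
    """h(p) = -p log p - (1-p) log(1-p)."""
    return entropy([p, 1 - p])

print(entropy([0.5, 0.25, 0.125, 0.125]))  # -> 1.75, matching the earlier example
print(binary_entropy(0.5))                 # -> 1.0
```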

Thus above we have proved the following.

Theorem 3 Let us have an information source emitting symbol x^{(i)} ∈ X with
probability p(x^{(i)}) = p_i, (i = 1, ..., r). There exists an s-ary prefix code for this
source with average codeword length less than $H_s(P) + 1 = \frac{H(P)}{\log s} + 1$.

Remark: The entropy function H(P ) is often interpreted as a measure of the


information content in a random variable X that has distribution P . Intuitively,
one can think about log(1/p_i) = − log p_i as the information gained when observing
that X just obtained its value having probability p_i. This interpretation would
then mean that the average information per observation obtained during several
observations is just H(P ). We think that the information content is measured
in bits (binary digits) thus it has to do with the number of binary digits needed
for an optimal encoding. Theorem 3 together with Theorem 4 below gives
justification for this interpretation.

Theorem 4 Let us have an information source emitting symbol x^{(i)} ∈ X with
probability p(x^{(i)}) = p_i, (i = 1, ..., r). For any s-ary UD code f : X → Y* of
this source we have
$$\sum_{i=1}^{r} p_i |f(x^{(i)})| \ge \frac{1}{\log s} H(P) = -\frac{1}{\log s} \sum_{i=1}^{r} p_i \log p_i = -\sum_{i=1}^{r} p_i \log_s p_i,$$
where P stands for the distribution (p_1, ..., p_r). Thus, for a binary (s = 2) UD
code the average codeword length is bounded from below by the entropy of the
distribution governing the system.

For the proof we will need the following simple tool from calculus that is often
very useful when proving theorems in information theory. Recall the notion of
convexity of a function first.
Def.: A function g : [a, b] → R is convex if for every x, y ∈ [a, b] and λ ∈ [0, 1]
we have
g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y).
We say that g is strictly convex if we have strict inequality whenever 0 < λ < 1
and x 6= y.
Jensen's inequality: Let g : [a, b] → R be a convex function. Then for any
x_1, ..., x_k ∈ [a, b] and non-negative reals α_1, ..., α_k satisfying $\sum_{i=1}^{k} \alpha_i = 1$, we
have
$$g\left(\sum_{i=1}^{k} \alpha_i x_i\right) \le \sum_{i=1}^{k} \alpha_i g(x_i).$$

Moreover, if g is strictly convex, then equality holds if and only if all xi ’s be-
longing to non-zero coefficients αi are equal.
We first formulate a consequence of Jensen’s inequality and prove Theorem 4.
The proof of Jensen’s inequality will follow only afterwards.

Corollary 5 If P = (p_1, ..., p_k) and Q = (q_1, ..., q_k) are two probability dis-
tributions, then
$$\sum_{i=1}^{k} p_i \log \frac{p_i}{q_i} \ge 0,$$
and equality holds iff qi = pi for every i.
Convention: To make the formulas above always meaningful, we use the "calcu-
lation rules" (for a ≥ 0, b > 0): 0·log(0/a) = 0·log(a/0) = 0, b·log(b/0) = +∞, and b·log(0/b) = −∞.
Proof. The function − log x is convex, thus by Jensen's inequality
$$\sum_{i=1}^{k} p_i \log \frac{p_i}{q_i} = \sum_{i=1}^{k} p_i \left( -\log \frac{q_i}{p_i} \right) \ge -\log \left( \sum_{i=1}^{k} p_i \frac{q_i}{p_i} \right) = -\log \left( \sum_{i=1}^{k} q_i \right) = 0.$$

The condition of equality also follows from the corresponding condition in


Jensen’s inequality. 2
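A small numerical illustration of Corollary 5; the sum on the left is often called the relative entropy (or Kullback–Leibler divergence) of P with respect to Q, although the notes do not use that name here (the function name is mine):

```python
import math

def relative_entropy(p, q):
    """sum of p_i * log2(p_i / q_i); a term with p_i = 0 contributes 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.25, 0.25]
Q = [1/3, 1/3, 1/3]
print(relative_entropy(P, Q))  # > 0, since P differs from Q
print(relative_entropy(P, P))  # = 0, equality exactly when Q = P
```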
Proof of Theorem 4. We know from the McMillan theorem that $\sum_{i=1}^{r} s^{-|f(x^{(i)})|} \le 1$.
Set $b = \sum_{i=1}^{r} s^{-|f(x^{(i)})|}$ and $q_i = \frac{s^{-|f(x^{(i)})|}}{b} \ge s^{-|f(x^{(i)})|}$. Then
$$\sum_{i=1}^{r} p_i |f(x^{(i)})| = -\sum_{i=1}^{r} p_i \log_s (q_i b) \ge -\sum_{i=1}^{r} p_i \log_s q_i = -\frac{1}{\log s} \sum_{i=1}^{r} p_i \log q_i.$$
Observe that $\sum_{i=1}^{r} q_i = 1$ and q_i ≥ 0 for every i (so (q_1, ..., q_r) can be con-
sidered a probability distribution). Thus by Corollary 5 of Jensen's inequality
above, we have that $-\sum_{i=1}^{r} p_i \log q_i \ge -\sum_{i=1}^{r} p_i \log p_i$ and the statement fol-
lows. 2

Now we prove Jensen’s inequality.

Proof of Jensen’s inequality: We argue by induction on k. The base case is
k = 2, when the statement simply follows from the definition of convexity. (It
is also obviously true for k = 1, in fact then the statement is simply trivial,
stating that g(x1 ) ≤ g(x1 ). But we do need the k = 2 case in the proof, so it
would not be enough to consider k = 1 alone as the base case.)
Now assume the statement is true for all k < ℓ; we now prove it for k = ℓ. We
can write
$$g\left(\sum_{i=1}^{\ell} \alpha_i x_i\right) = g\left(\alpha_\ell x_\ell + \sum_{i=1}^{\ell-1} \alpha_i x_i\right) = g\left(\alpha_\ell x_\ell + (1-\alpha_\ell) \sum_{i=1}^{\ell-1} \frac{\alpha_i}{1-\alpha_\ell} x_i\right) \le$$
$$\alpha_\ell g(x_\ell) + (1-\alpha_\ell)\, g\left(\sum_{i=1}^{\ell-1} \frac{\alpha_i}{1-\alpha_\ell} x_i\right) \le \alpha_\ell g(x_\ell) + (1-\alpha_\ell) \sum_{i=1}^{\ell-1} \frac{\alpha_i}{1-\alpha_\ell} g(x_i) =$$
$$\alpha_\ell g(x_\ell) + \sum_{i=1}^{\ell-1} \alpha_i g(x_i) = \sum_{i=1}^{\ell} \alpha_i g(x_i)$$
as claimed. Here the first inequality follows by the definition of convexity,
while the second from the induction hypothesis applied to k = ℓ − 1. The
statement for the conditions of equality also follows: to have equality in the
first inequality we need either that α_ℓ = 0 or that α_ℓ = 1, or if neither holds
then $x_\ell = \sum_{i=1}^{\ell-1} \frac{\alpha_i}{1-\alpha_\ell} x_i$. To have equality also in the second inequality we know
from the induction hypothesis that we need that all x_i's with i ≤ ℓ − 1 belonging
to nonzero coefficients α_i are equal. Then these are equal also to their weighted
average value $\sum_{i=1}^{\ell-1} \frac{\alpha_i}{1-\alpha_\ell} x_i$, so if α_ℓ is neither 0 nor 1, then x_ℓ must also be
equal to this common value. 2

Next we introduced another code construction, called the Shannon-Fano code:

We assume p_1 ≥ p_2 ≥ ... ≥ p_n > 0. Let w_1 = 0 and for j > 1 let $w_j = \sum_{i=1}^{j-1} p_i$.
Let the codeword f(x^{(j)}) be the s-ary representation of the number w_j (which
is always in the [0, 1) interval) without the starting integer part digit 0, and
with the minimal length such that it is not a prefix of any other such codeword.
The latter condition already ensures that the code is prefix.
This construction is very closely related to the one used in the proof of Kraft’s
theorem on which the proof of Theorem 3 was based. Nevertheless, below we give
a second proof of Theorem 3 directly using the Shannon-Fano code construction.

The above definition (of the Shannon-Fano code) implies that the first |f(x^{(j)})| − 1
digits of f(x^{(j)}) form a prefix of another codeword, and thus they must be a prefix of
a codeword coming from a closest number w_h, that is, from w_{j−1} or w_{j+1}. This implies
$$p_j = p(x^{(j)}) = w_{j+1} - w_j \le s^{-(|f(x^{(j)})|-1)}$$
or
$$p_{j-1} = p(x^{(j-1)}) = w_j - w_{j-1} \le s^{-(|f(x^{(j)})|-1)}.$$
By p_{j−1} ≥ p_j, in either case the first of the above two inequalities holds. Thus
log_s p_j ≤ −|f(x^{(j)})| + 1, implying
$$-p_j \log_s p_j \ge p_j (|f(x^{(j)})| - 1),$$
and thus
$$-\sum_{j=1}^{r} p_j \log_s p_j + 1 \ge \sum_{j=1}^{r} p_j |f(x^{(j)})|.$$
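A minimal sketch of a closely related binary variant of this construction: instead of the "minimal non-prefix length" rule above, it fixes the codeword length of the j-th symbol to ⌈log₂(1/p_j)⌉ and truncates the binary expansion of w_j to that many digits (this also yields a prefix code and satisfies the same average-length bound). The function name is mine; exact arithmetic is done with Fraction.

```python
import math
from fractions import Fraction

def shannon_code(probs):
    """Codeword of the j-th symbol: binary expansion of w_j = p_1 + ... + p_{j-1},
    truncated to ceil(log2(1/p_j)) digits.  Assumes probs is non-increasing."""
    probs = [Fraction(p) for p in probs]
    code, w = [], Fraction(0)
    for p in probs:
        length = math.ceil(math.log2(1 / p))
        # the i-th binary digit of w after the point is floor(w * 2^i) mod 2
        code.append("".join(str(math.floor(w * 2 ** i) % 2) for i in range(1, length + 1)))
        w += p
    return code

print(shannon_code([Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]))
# -> ['0', '10', '110', '111'], the code of the earlier example
```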

We have seen two constructions giving average codeword length close to the
lower bound H_s(P), but nothing guaranteed that any of these codes would be
best possible. So the question of how to find an optimal average length code
comes up. This will be answered by constructing the so-called Huffman code.
We will study this only for the binary case, i.e., when the size of the code alphabet
is s = 2.

The construction of the Huffman code we will learn about next time. Neverthe-
less, as a preparation, we already made some observations.

Assume p_1 ≥ ... ≥ p_n > 0, p_i = p(x^{(i)}), and that we have an optimal binary code
C = (f(x^{(1)}), ..., f(x^{(n)})), l_i := |f(x^{(i)})|. By the foregoing we can assume that
the code is prefix. (Note that the pn > 0 assumption is not a real restriction:
if we have 0-probability events, they need not be encoded. Or they could even
be encoded into long codewords, since their contribution to the average length
will be zero anyway.)

Observe:
We may also assume
(1) l_1 ≤ l_2 ≤ ... ≤ l_n. This is true, because if this is not satisfied, then we may
exchange codewords without increasing the average length.
(2) l_n = l_{n−1} and the two codewords f(x^{(n)}) and f(x^{(n−1)}) differ only in the last
digit. The first statement is true, because if not, then l_n > l_{n−1} by (1) above
and since the code is prefix, deleting the last digit of f(x^{(n)}) would result in a
prefix code with smaller average length, so the original code was not optimal.
The second statement is true, because exchanging the last digit of the codeword
f(x^{(n)}) we should get another codeword (otherwise this last digit could have
been deleted without ruining the prefix property), and if this codeword is not
f(x^{(n−1)}) but some f(x^{(i)}) with i ≠ n − 1, then we can simply exchange the
two without affecting the average length, as these two codewords both have the
same length |f(x^{(i)})| = |f(x^{(n)})| = |f(x^{(n−1)})|.
I only hinted at a third observation that, together with the previous ones, will
lead us to the construction of optimal codes. This is what we will start with
next time.

Third lecture (September 21, 2021)

Huffman code

After recalling the two observations made at the end of the lecture last time we
added the following third observation.
(3) Cutting the last digit of the two codewords f(x^{(n−1)}) and f(x^{(n)}) we obtain
an optimal binary prefix code for the distribution (p_1, p_2, ..., p_{n−2}, p_{n−1} + p_n).
This is true because the average length L of our code is L′ + p_{n−1} + p_n, where L′ is
the average length of the code obtained by identifying the codewords f(x^{(n−1)})
and f(x^{(n)}) by cutting their last digit. If there were a better (i.e., one with smaller
average length) prefix code for the distribution (p_1, p_2, ..., p_{n−2}, p_{n−1} + p_n), then
extending the codeword belonging to the probability p_{n−1} + p_n source symbol
once with a 0 digit and once with a 1 digit, we would obtain a better code than
our original one, contradicting that the original was optimal.
From these three observations the optimal code construction is immediate: add
the two smallest probabilities iteratively until only two probabilities remain. Give
these the (sub)words 0 and 1 and then follow the previous "adding up two prob-
abilities" process backwards, appending a 0 and a 1 to the end of the corresponding
codeword.
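A compact sketch of this merging procedure (one standard way to implement it, using a heap; the function name and the representation are mine):

```python
import heapq

def huffman(probs):
    """Binary Huffman code for the distribution probs (a list of probabilities).
    Returns the list of codewords in the same order as probs."""
    # each heap entry: (probability, tie-breaker, list of symbol indices in this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    code = [""] * len(probs)
    while len(heap) > 1:
        # merge the two smallest probabilities, prepending a digit to their codewords
        p0, _, grp0 = heapq.heappop(heap)
        p1, t1, grp1 = heapq.heappop(heap)
        for i in grp0:
            code[i] = "0" + code[i]
        for i in grp1:
            code[i] = "1" + code[i]
        heapq.heappush(heap, (p0 + p1, t1, grp0 + grp1))
    return code

P = [0.05, 0.10, 0.10, 0.11, 0.12, 0.13, 0.14, 0.25]
C = huffman(P)
print(C)
print(sum(p * len(c) for p, c in zip(P, C)))  # average length of the optimal code
```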
Example:
P = (0.05, 0.10, 0.10, 0.11, 0.12, 0.13, 0.14, 0.25).
(Note that now we listed the probabilities in non-decreasing rather than non-
increasing order. So in this ordering it is always the first two that are the
smallest ones and should be added up.) The in-between distributions are:
(0.10, 0.11, 0.12, 0.13, 0.14, 0.15 = 0.10 + 0.05, 0.25);
(0.12, 0.13, 0.14, 0.15, 0.21, 0.25); (0.14, 0.15, 0.21, 0.25, 0.25);
(0.21, 0.25, 0.25, 0.29); (0.25, 0.29, 0.46); (0.46, 0.54).
And the code obtained (writing it backwards for each stage of the construction):
(0, 1); (0, 10, 11); (00, 01, 10, 11);
(110, 111, 00, 01, 10); (010, 011, 110, 111, 00, 10);
(000, 001, 010, 011, 110, 111, 10), and finally

(1110, 1111, 000, 001, 010, 011, 110, 10).

In the second half of class today we solved some exercises.

1. What is the largest integer value of ℓ for which a prefix code of 8 codewords
with respective lengths 1, 2, 3, 4, 5, 6, 7, and ℓ over a binary alphabet does not
exist? (Notice that we do not assume anything about the relation between the
value of ℓ and the other given lengths.)
Sketch of solution: By the theorems of McMillan and Kraft such a code does
not exist if and only if
$$\frac{1}{2} + \frac{1}{2^2} + \frac{1}{2^3} + \frac{1}{2^4} + \frac{1}{2^5} + \frac{1}{2^6} + \frac{1}{2^7} + \frac{1}{2^\ell} > 1.$$
This is the case if and only if ℓ ≤ 6 (we would have equality for ℓ = 7). So the
requested largest number is ℓ = 6. 3

2. We have two dice with 1 dot on two faces, 2 dots on two faces, and 3 dots on
two faces. We roll the two dice together and want to encode the total number of
dots we see on the rolled faces of the two dice. Give the Shannon-Fano code for
alphabet size 2 and also for alphabet size 3 for this problem. Construct also a
binary code that has shortest average length, that is, one for which the expected
number of bits needed to encode the result of many rolls is as small as possible.
Sketch of solution: The result can be 2, 3, 4, 5, or 6 dots on the two faces seen,
and their probabilities can be calculated by the number of elementary events
giving the corresponding number. So the probabilities are 1/9, 2/9, 3/9, 2/9, 1/9,
respectively. Reordering these non-increasingly as 3/9, 2/9, 2/9, 1/9, 1/9, the corresponding
w_i values are 0, 3/9, 5/9, 7/9, 8/9 in the Shannon-Fano code construction. The binary Shannon-Fano code we obtain from these
values is (00, 01, 10, 110, 111), while the ternary Shannon-Fano code is:

(0, 10, 12, 21, 22).

To obtain the shortest average length we have to construct a Huffman code for the
above distribution. Doing this we can get, e.g., the codewords 000 and 001 for the two
results of probability 1/9 (i.e., 2 and 6) and 01, 10, 11 for the results 3, 4, and 5. (This
shows that the binary Shannon-Fano code also has optimal average length in this case.) 3

3. We roll a standard dice (i.e., one with 1, . . . , 6 dots on its 6 sides, respectively)
until we get a result second time. That is we roll twice if the second roll results
in the same number as the first one. We roll three times if the second roll gives
a different result than the first one but the third roll has a result that is equal
to the result of either the first or the second roll, etc. Let X denote the random
variable that is the number of times we have to roll the dice according to the
above rules. Give a binary prefix code encoding the outcome of X with optimal
average length.
We have calculated the probabilities with which X takes its possible values and
left the code construction for homework. Here is the calculation of said proba-
bilities. By definition X cannot take a value smaller than 2 and it also cannot
be larger than 7 (since we cannot roll seven different numbers with a dice).
P(X = 2) = 1/6, as this is the probability that the second roll has the same result
as the first one.
P(X = 3) = 5/6 · 1/3 = 5/18, since the probability that the second roll is different
from the first one is 5/6 and the probability that the third roll is equal to one of
the two numbers rolled at the first two rolls is 1/3.
With similar logic, we obtain that
P(X = 4) = 5/6 · 2/3 · 1/2 = 5/18 = 90/324,
P(X = 5) = 5/6 · 2/3 · 1/2 · 2/3 = 5/27 = 60/324,
P(X = 6) = 5/6 · 2/3 · 1/2 · 1/3 · 5/6 = 25/324,
P(X = 7) = 5/6 · 2/3 · 1/2 · 1/3 · 1/6 = 5/324.
The rest is to construct a Huffman code for this distribution and that remained
homework.

I gave also the following problem for homework.


4. Two people made two different Huffman codes for the distribution p1 ≥
p2 ≥ p3 ≥ p4 . The codewords of these codes are 0, 10, 110, 111 for one and
00, 01, 10, 11 for the other. Determine the distribution if we know that p3 = 1/6.

Fourth lecture (September 28, 2021)

We started by discussing the solutions of the two problems that remained home-
work last time. The problem formulations are repeated here for the reader’s
convenience.

3. We roll a standard dice (i.e., one with 1, . . . , 6 dots on its 6 sides, respectively)
until we get a result second time. That is we roll twice if the second roll results
in the same number as the first one. We roll three times if the second roll gives
a different result than the first one but the third roll has a result that is equal
to the result of either the first or the second roll, etc. Let X denote the random
variable that is the number of times we have to roll the dice according to the
above rules. Give a binary prefix code encoding the outcome of X with optimal
average length.
We have already calculated the probabilities with which X takes its possible
values last time and left the code construction for homework.
Thus we needed to find an optimal average length code, that is a Huffman code
for the distribution (1/6, 5/18, 5/18, 5/27, 25/324, 5/324) = (54/324, 90/324, 90/324, 60/324, 25/324, 5/324).
Applying the learnt algorithm to this distribution we obtain that the following
code will be optimal:
X = 2 : 110, X = 3 : 00, X = 4 : 01, X = 5 : 10, X = 6 : 1110, X = 7 : 1111.
3

4. Two people made two different Huffman codes for the distribution p1 ≥
p2 ≥ p3 ≥ p4 . The codewords of these codes are 0, 10, 110, 111 for one and
00, 01, 10, 11 for the other. Determine the distribution if we know that p3 = 1/6.
Solution of Exercise 4: When constructing the code, in the first step both people
had to create the distribution (p1 , p2 , p3 + p4 ). In the next step, however, one of
them had to create (p1 + p2 , p3 + p4 ), while the other created (p1 , p2 + p3 + p4 ).
If both led to Huffman codes, that means both steps are optimal, so it did not
matter whether we add p1 or p3 + p4 to p2 . This implies p1 = p3 + p4 . All the
probability values can be obtained from this as follows.
Since p4 ≤ p3 = 1/6, we have p1 = p3 + p4 ≤ 1/6 + 1/6 = 1/3. On the other
hand, p1 ≥ (1/2)(p1 + p2) = (1/2)(1 − (p3 + p4)) ≥ (1/2)(1 − 1/3) = 1/3. So p1 is not more
and not less than 1/3, thus p1 = 1/3. Thus p4 = p1 − p3 = 1/3 − 1/6 = 1/6
and p2 = 1 − 1/3 − 1/6 − 1/6 = 1/3. So the distribution is (1/3, 1/3, 1/6, 1/6).
Note that the above two codes give indeed the same optimal average length 2
for this distribution. 3

More on the entropy function

Notation:
p(x) = P rob(X = x),
p(y) = P rob(Y = y),
p(x, y) = P rob(X = x, Y = y),
p(x|y) = P rob(X = x|Y = y),
p(y|x) = P rob(Y = y|X = x).

$$H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y)$$
is simply the entropy of the joint distribution of the variable (X, Y).

Conditional entropy is defined as:
$$H(X|Y) = \sum_{y} p(y) H(X|Y = y) = -\sum_{y} p(y) \sum_{x} p(x|y) \log p(x|y) = -\sum_{x,y} p(x, y) \log p(x|y) = -\sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(y)}.$$
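A small sketch computing H(X, Y) and H(X|Y) from a joint distribution given as a dictionary (the helper names are mine; the last displayed formula above is used directly):

```python
import math
from collections import defaultdict

def H(dist):
    """Entropy of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def conditional_entropy(joint):
    """H(X|Y) = -sum over (x,y) of p(x,y) * log( p(x,y) / p(y) ) for joint = {(x, y): p}."""
    py = defaultdict(float)
    for (x, y), p in joint.items():
        py[y] += p
    return -sum(p * math.log2(p / py[y]) for (x, y), p in joint.items() if p > 0)

# example: X uniform on {0,1}, and Y agrees with X with probability 3/4
joint = {(0, 0): 3/8, (0, 1): 1/8, (1, 0): 1/8, (1, 1): 3/8}
print(H(joint))                    # H(X, Y)
print(conditional_entropy(joint))  # H(X|Y) = h(1/4), about 0.811
```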

Theorem 6 a)
$$0 \le H(X) \le \log n,$$
where n = |X|. H(X) = 0 iff X takes a fixed value with probability 1, and H(X) =
log n iff p(x) is uniform.
b)
$$H(X, Y) \le H(X) + H(Y)$$
with equality iff X and Y are independent.

Note the intuitive plausibility of these statements. (For example: The infor-
mation content of the pair (X, Y ) is not more than the sum of the information
X and Y contain separately. And equality means that they ”do not contain
information about each other”, that is, they are independent.)

Proof of a). 0 ≤ H(X) is clear by log p(x) ≤ 0 for all x. Equality can occur iff
p(x) = 1 for some x, then all other probabilities should be zero.
Applying Corollary 5 to qi = 1/n ∀i gives H(X) ≤ log n and also the condition
for equality.

Proof of b). Follows by applying Corollary 5 for p = p(x, y) and q = p(x)p(y).
In detail:
$$H(X) + H(Y) - H(X, Y) = -\sum_{x} \left(\sum_{y} p(x, y)\right) \log p(x) - \sum_{y} \left(\sum_{x} p(x, y)\right) \log p(y) + \sum_{x,y} p(x, y) \log p(x, y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} \ge 0.$$
Equality holds iff p(x, y) = p(x)p(y) for all x, y, i.e. iff X and Y are independent. 2
Some properties of conditional entropy are proven next.

Theorem 7 a)
H(X|Y ) = H(X, Y ) − H(Y ).
b)
0 ≤ H(X|Y ) ≤ H(X).

Proof of a). $H(X|Y) = -\sum_{x,y} p(x, y) \log p(x|y) = -\sum_{x,y} p(x, y) \log \frac{p(x,y)}{p(y)} = -\sum_{x,y} p(x, y) \log p(x, y) + \sum_{x,y} p(x, y) \log p(y) = H(X, Y) + \sum_{y} p(y) \log p(y) = H(X, Y) - H(Y).$
Proof of b). 0 ≤ H(X|Y ) follows from observing that H(X|Y ) is the expected
value of entropies that are non-negative by 0 ≤ H(X) being valid in general.
H(X|Y ) ≤ H(X) follows from a) as it is equivalent to H(X, Y ) = H(X|Y ) +
H(Y ), while we have already seen that H(X, Y ) ≤ H(X) + H(Y ). This also
gives that the condition of equality is exactly the same as it is in Theorem 6 b),
namely that X and Y are independent. 2

Exercises.
Exercise 5. Let X and Z be independent random variables such that Prob(X =
1) = p, Prob(X = 0) = 1 − p and Prob(Z = 0) = Prob(Z = 1) = 1/2. Let Y be
the random variable that is given as the modulo 2 sum of X and Z. Calculate
H(X), H(X|Z), H(X|Y), and H(X|Y, Z).

Solution of Exercise 5: We have H(X) = h(p) (directly follows), H(X|Z) = h(p)


(immediate by the independence of X and Z), H(X|Y ) = h(p) (follows by
realizing that H(X|Y = 0) = H(X|Y = 1) = h(p)), and H(X|Y, Z) = 0 (since
X is determined by the pair Y, Z). It is worth noticing that H(X|Y ) = h(p) =
H(X) means that X and Y are also independent. 3
Part of the following exercise remained homework (see details below).
Exercise 6. We roll two fair dice (like the one in problem 4 above) independently.
Let Z denote the product of the two numbers rolled. For i = 2, 3, and 4, we
denote by Xi the random variable which is 0 if Z is divisible by i and 1 otherwise.
(For example, X2 = 0 if Z is even and X2 = 1 otherwise.) Calculate the
entropy values H(X2 ), H(X3 ), H(X4 ) and the conditional entropies H(X2 |X3 )
and H(X2 |X4 ).
(Formulas containing the binary entropy function of precisely given numbers
like h(9/16) can be considered well determined and need not be calculated numer-
ically. Nevertheless, obvious simplifications are considered important to make;
for example, a 0 value should not be left there as a complicated sum.)

We have calculated the first two entropies that were asked, calculating the re-
maining three ones remained homework.
Those we have already calculated are as follows.
We have X_2 = 1 if both rolls are odd, and this event has probability 1/4. Thus
$$H(X_2) = h\left(\frac{1}{4}\right).$$

Similarly, we have X_3 = 1 if neither roll has a result divisible by 3, which has
probability (2/3)^2 = 4/9, so
$$H(X_3) = h\left(\frac{4}{9}\right).$$

Fifth lecture (October 5, 2021)

We started by discussing the homework, that is, the remaining part of the solution
of Exercise 6.
Recall the problem formulation:
Exercise 6. We roll two fair dice (like the one in problem 4 above) independently.
Let Z denote the product of the two numbers rolled. For i = 2, 3, and 4, we
denote by Xi the random variable which is 0 if Z is divisible by i and 1 otherwise.
(For example, X2 = 0 if Z is even and X2 = 1 otherwise.) Calculate the
entropy values H(X2 ), H(X3 ), H(X4 ) and the conditional entropies H(X2 |X3 )
and H(X2 |X4 ).
Last time we have already calculated H(X2 ) and H(X3 ), here is the rest of the
solution.
We have X_4 = 1 if either both rolls are odd, which is true for 9 of the possible
36 outcomes, or exactly one of them is even but not equal to 4, which can happen
in 2 · 3 · 2 = 12 ways. So altogether the probability of X_4 = 1 is (9 + 12)/36 = 7/12, and
$$H(X_4) = h\left(\frac{7}{12}\right).$$

We observe that P(X_2 = 1|X_3 = 0) = P(X_2 = 1|X_3 = 1) = 1/4 = P(X_2 = 1),
so knowing X_3 does not change the probabilities with which X_2 takes its values.
This already shows
$$H(X_2|X_3) = P(X_3 = 1)H(X_2|X_3 = 1) + P(X_3 = 0)H(X_2|X_3 = 0) = H(X_2) = h\left(\frac{1}{4}\right).$$
If X_4 = 0 then we surely have X_2 = 0, too, since an integer divisible by 4 is
also divisible by 2. Thus P(X_2 = 0|X_4 = 0) = 1 and P(X_2 = 1|X_4 = 0) = 0,
implying H(X_2|X_4 = 0) = 0. We also need the probabilities P(X_2 = 0|X_4 = 1)
and P(X_2 = 1|X_4 = 1). We have already seen above that X_4 = 1 can happen
in 21 ways out of the 36 possible outcomes of the two rolls. We have also seen
that there are 9 of these 21 cases when neither number is even, that is, X_2 = 1.
Thus P(X_2 = 1|X_4 = 1) = 9/21 = 3/7 and P(X_2 = 0|X_4 = 1) = 1 − 3/7 = 4/7.
So the requested conditional entropy is
$$H(X_2|X_4) = P(X_4 = 0)H(X_2|X_4 = 0) + P(X_4 = 1)H(X_2|X_4 = 1) = \frac{7}{12}\, h\left(\frac{3}{7}\right).$$
3

Chain rule

The following theorem is called Chain rule.

Theorem 8 (Chain rule)

H(X1 , . . . , Xn ) = H(X1 )+H(X2 |X1 )+H(X3 |X1 , X2 )+. . .+H(Xn |X1 , . . . , Xn−1 ).

Proof: Goes by induction. Clear for n = 1, and it is just Theorem 7 a) for n = 2.


Having it for n − 1, apply Theorem 7 a) for Y = Xn and X = (X1 , . . . , Xn−1 )
in the form H(X, Y ) = H(X) + H(Y |X). It gives

H(X1 , . . . , Xn ) = H((X1 , . . . , Xn−1 ), Xn ) =

H(X1 , . . . , Xn−1 ) + H(Xn |(X1 , . . . , Xn−1 )) =


H(X1 ) + H(X2 |X1 ) + . . . + H(Xn−1 |X1 , . . . , Xn−2 ) + H(Xn |X1 , . . . , Xn−1 ).

2

Recall the theorem we proved last time stating

0 ≤ H(X|Y ) = H(X, Y ) − H(Y ).

We prove a consequence of this next.

Corollary 9 For any function g(X) of a random variable X we have

H(g(X)) ≤ H(X).

Proof. Since g(X) is determined by X we have H(g(X)|X) = 0. Thus using


Theorem 7 a) we can write

H(X) = H(X)+H(g(X)|X) = H(X, g(X)) = H(g(X))+H(X|g(X)) ≥ H(g(X)).

We also see that the condition of equality is H(X|g(X)) = 0, which is equivalent


to g(X) determining X, i.e. to g being invertible. 2
From the above we see that

H(X) − H(X|Y ) = H(X) + H(Y ) − H(X, Y ) = H(Y ) − H(Y |X).

This quantity is thus intuitively the difference of the amount of information X


contains if we do and if we do not know Y . We can think about it as the amount
of information Y carries about X. And we see that we get the same value if we
exchange the role of X and Y . This interpretation is also consistent with the
fact that the above value is 0 if and only if X and Y are independent. These
thoughts motivate the following definition.
Def. For two random variables X and Y, their mutual information I(X, Y ) is
defined as
I(X, Y ) = H(X) + H(Y ) − H(X, Y ).
By the foregoing we also have I(X, Y ) = H(X) − H(X|Y ) = H(Y ) − H(Y |X).
Later we will see that mutual information is a basic quantity that also comes
up as a central value in certain coding theorems.
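A small self-contained sketch of this quantity for a joint distribution given as a dictionary (the function name is mine):

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(X, Y) = H(X) + H(Y) - H(X, Y) for a joint distribution {(x, y): p}."""
    H = lambda d: -sum(p * math.log2(p) for p in d.values() if p > 0)
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return H(px) + H(py) - H(joint)

print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))  # Y determined by X: I = H(X) = 1
print(mutual_information({(x, y): 0.25 for x in (0, 1) for y in (0, 1)}))  # independent: I = 0
```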

We solved two more exercises:

Exercise 7. (Taken from the book: Thomas Cover and Joy Thomas Elements
of Information Theory, Second Edition, Wiley, 2006.)
The World Series is a seven-game series that terminates as soon as either team
wins four games. Let X be the random variable that represents the outcome
of a World Series between teams A and B; examples of possible values of X
are AAAA, BABABAB, and BBBAAAA. Let Y be the number of games
played, which ranges from 4 to 7. Assuming that A and B are equally matched
(that is, both have a 50% chance to win each game) and that the games are
independent, calculate H(X), H(Y ), H(Y |X), and H(X|Y ).
Solution of Exercise 7: Since X determines Y, we can immediately write H(Y|X) =
0. Also, since H(X|Y) + H(Y) = H(Y|X) + H(X) = H(X), we will be able
to calculate H(X|Y) as the difference between H(X) and H(Y). So it will be
enough to calculate H(X) and H(Y). The value of Y can be 4 in two different
ways (AAAA and BBBB), both with probability 1/2^4, so Prob(Y = 4) = 1/8. Y
can be 5 in 8 different ways: there are 4 ways to have one game won by B among
the first four games while the fifth is won by A, and there are four other op-
tions when we change the role of A and B. Each of these options has probability
1/2^5, thus Prob(Y = 5) = 1/4. There are $\binom{5}{2} = 10$ ways to have exactly two games
won by B among the first five while the sixth is won by A, and we have 10 more
options to have six games by changing the role of A and B. Each of these has prob-
ability 1/2^6. So Prob(Y = 6) = 20/2^6 = 5/16. Similarly, we have $2\binom{6}{3} = 40$ options
to have Y = 7, each with probability 1/2^7. So Prob(Y = 7) = 40/2^7 = 5/16. Thus
$$H(Y) = \frac{1}{8}\log 8 + \frac{1}{4}\log 4 + 2\cdot\frac{5}{16}\log\frac{16}{5} = \frac{3}{8} + \frac{1}{2} + \frac{40}{16} - \frac{5}{8}\log 5 = \frac{27}{8} - \frac{5}{8}\log 5.$$
Now to calculate H(X), note that each particular outcome of i games has probability
1/2^i. So the distribution of X contains the value 1/16 twice, 1/32 8 times, 1/64 20 times,
and 1/128 40 times. Thus
$$H(X) = \frac{2}{16}\log 16 + \frac{8}{32}\log 32 + \frac{20}{64}\log 64 + \frac{40}{128}\log 128 = \frac{1}{2} + \frac{5}{4} + \frac{30}{16} + \frac{35}{16} = \frac{93}{16}.$$
Thus we obtained
$$H(X) = \frac{93}{16}, \quad H(Y) = \frac{27}{8} - \frac{5}{8}\log 5,$$
$$H(X|Y) = \frac{93}{16} - \left(\frac{27}{8} - \frac{5}{8}\log 5\right) = \frac{39}{16} + \frac{5}{8}\log 5, \quad H(Y|X) = 0.$$
3
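The values above can also be checked by brute-force enumeration of all possible series; a small self-contained sketch (all names are mine):

```python
import math
from itertools import product

def H(dist):
    """Entropy of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# X: the full outcome sequence; Y: its length.  A length-n sequence is a valid
# series iff the team winning game n has exactly 3 wins among the first n-1 games.
dist_X, dist_Y = {}, {}
for n in range(4, 8):
    for seq in product("AB", repeat=n):
        if seq[:-1].count(seq[-1]) == 3:
            dist_X["".join(seq)] = 2.0 ** -n
            dist_Y[n] = dist_Y.get(n, 0.0) + 2.0 ** -n

print(H(dist_X))              # 93/16 = 5.8125
print(H(dist_Y))              # 27/8 - (5/8)*log2(5), about 1.924
print(H(dist_X) - H(dist_Y))  # H(X|Y) = 39/16 + (5/8)*log2(5)
```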
Exercise 8.
Which ones of the following codes can and which ones cannot be a Huffman
code?
a) 0, 10, 111, 101
b) 00, 010, 011, 10, 110
c) 1, 000, 001, 010, 011
Solution: The code in part (a) is not even prefix (the fourth codeword starts
with the second one), so it cannot be a Huffman code.
The code in part (b) cannot have optimal average length because deleting the
last digit of the last codeword we still have a prefix code. Therefore this code
cannot be a Huffman code either.
The code in part (c) can be a Huffman code; to prove this we give a distribution
P to which it can belong. Let P = (1/2, 1/8, 1/8, 1/8, 1/8). Performing the
construction for this distribution shows that it results in the code given in (c)
if we choose the binary digits accordingly. (One can also easily check that the
average length is exactly H(P), so it must be optimal as it cannot be less.) 3

Sixth lecture (October 12, 2021)

Def.: A source X1 , X2 , . . . is memoryless if the Xi ’s are independent.


Def.: A source is stationary if X_1, ..., X_k and X_{n+1}, ..., X_{n+k} have the same
distribution for every n and k.

If the source is stationary and memoryless, then H(X1 , X2 , . . . , Xk ) = kH(X1 ).

Let f : X^k → Y* be a UD code of the stationary and memoryless source
X = X_1, X_2, ....
We know there exists a code (e.g. a Shannon-Fano code, or a Huffman code achieving
optimal average length) satisfying
$$L = E(|f(X_1, \ldots, X_k)|) \le \frac{H(X_1, \ldots, X_k)}{\log s} + 1.$$
(Here E(.) means expected value.) Then the per-letter average length of this
code is $\frac{1}{k} L \le \frac{1}{k}\left(\frac{H(X_1,\ldots,X_k)}{\log s} + 1\right) = \frac{H(X_1)}{\log s} + \frac{1}{k}$. So $\frac{1}{k} L$ can get arbitrarily close
to the lower bound $H_s(X_1) = \frac{H(X_1)}{\log s}$.

Entropy of a source

The entropy of a source in general is defined as follows.


Definition. The entropy of a source emitting the sequence of random variables
X_1, X_2, ... is
$$\lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n),$$
provided that this limit exists.

The above limit trivially exists for the stationary memoryless sources defined above.
Indeed, if the source is stationary and memoryless, then H(X_1, X_2, ..., X_n) =
nH(X_1), so we have $\lim_{n\to\infty} \frac{1}{n} H(X_1, \ldots, X_n) = \lim_{n\to\infty} \frac{1}{n} nH(X_1) = H(X_1)$.
In fact, once a source is stationary it always has an entropy; it need not be
memoryless.

Theorem 10 If a source X_1, X_2, ... is stationary, then its entropy exists and is
equal to
$$\lim_{n \to \infty} H(X_n | X_1, \ldots, X_{n-1}).$$

Remark: Note that $\lim_{n\to\infty} H(X_n|X_1, \ldots, X_{n-1})$ can be much smaller than
H(X_1). Think about a source with source alphabet {0, 1} that emits the same
symbol as the previous one with probability 9/10 and the opposite with prob-
ability 1/10. In the long run we have the same number of 0's and 1's, Prob(X_1 =
1) = Prob(X_1 = 0) = 1/2, so H(X_1) = 1, while $\lim_{n\to\infty} H(X_n|X_1, \ldots, X_{n-1}) = H(X_n|X_{n-1}) = h(0.9) < 1$.

Proof. By the source being stationary, we have

H(Xn |X1 , . . . , Xn−1 ) = H(Xn+1 |X2 , . . . , Xn ) ≥ H(Xn+1 |X1 , X2 , . . . , Xn ).

Thus the sequence H(Xi |X1 , . . . , Xi−1 ) is non-increasing and since all its ele-
ments are non-negative, it has a limit.

From the Chain rule we can write
$$\frac{1}{n} H(X_1, \ldots, X_n) = \frac{1}{n}\left( H(X_1) + \sum_{i=2}^{n} H(X_i | X_1, \ldots, X_{i-1}) \right).$$
To complete the proof we refer to a lemma of Toeplitz that says that if $\{a_n\}_{n=1}^{\infty}$
is a convergent sequence of reals with $\lim_{n\to\infty} a_n = a$, then defining $b_n := \frac{1}{n}\sum_{i=1}^{n} a_i$,
we have that $\{b_n\}_{n=1}^{\infty}$ is also convergent and $\lim_{n\to\infty} b_n = a$, too.
Applying this to $a_n := H(X_n | X_1, \ldots, X_{n-1})$ the statement follows. 2
Note that the proof implies that the sequence $\frac{1}{n} H(X_1, \ldots, X_n)$ is also non-
increasing.

Markov chains, Markov source

Def. A stochastic process Z = Z_1, Z_2, ... is Markov (or Markovian) if for
every k we have P(Z_k | Z_1, ..., Z_{k-1}) = P(Z_k | Z_{k-1}). We say that the variables
Z_1, Z_2, ... form a Markov chain.
Intuitively the above definition means that knowing just the previous Zi tells
us everything we could know about the next one even if we knew the complete
past. Such situations often occur.
Example. If Z_i denotes the number of heads we have when tossing a fair coin
i times, then Z_{i+1} = Z_i or Z_{i+1} = Z_i + 1, each with probability 1/2, and we
cannot say more than this even if we know the values of Z_{i−2}, Z_{i−3}, etc. Thus
Z_1, Z_2, ... is a Markov chain.
The pixels of a picture can be modeled by a Markov chain: After a black pixel
we have another black pixel with high probability and a white one with small
probability and this can be considered independent of earlier pixels (though
clearly, this independence is not completely true).
A Markov chain Z is homogenous if P (Zn |Zn−1 ) is independent of n.
A Markov chain is stationary if all its stochastic parameters are invariant in
“time”, that is, they are not dependent on the indices of the Zi ’s involved. In
particular, a stationary Markov-chain is always homogenous. However, being
stationary means more: it also requires that the distribution of Zi (and not only
the conditional probabilities P (Zi |Zi−1 )) are independent of i. In particular,
for a stationary Markov chain already Z1 is distributed according to the same
stationary distribution according to which Zn is distributed for a (very) large
n.
The entropy of a homogenous Markov chain Z is H(Z2 |Z1 ), this follows from
Theorem 10 above, cf. also the above Remark.
A general Markov source is a stochastic process X, for which each Xi can be
written as a function of two random variables, namely Xi = F (Zi , Yi ) where Z
is a homogenous Markov chain and Y is a stationary and memoryless source
that is independent of Z.
A Markov source can model a situation where, for example, Z is a text or speech
and Y is the noise. When the noise does not affect the outcome of the source,
that is, F (z, y) = z for every z and y, then the entropy of the Markov source is
simply H(Z2 |Z1 ).

Homogenous Markov chains tend to a stationary distribution whose entropy
might be much larger than the entropy of the Markov chain. When the homoge-
nous Markov chain has r states its behavior is described by an r × r stochastic
matrix (each row is a probability distribution) A defined by A[i, j] = Prob(Z_2 =
j|Z_1 = i). Then the stationary distribution p is the probability distribution
(p_1, ..., p_r) for which pA = p.
Example. Let A be the 2 × 2 matrix with first row (p, 1 − p) and second row
(1 − p, p). Then the stationary distribution is (1/2, 1/2), while the entropy of the Markov
chain is h(p).
Example. Let the 3 × 3 transition probability matrix A of a Markov chain Z
with three states A, B, C have first row (0, 1, 0), second row (0, 2/3, 1/3), and third
row (2/3, 0, 1/3). We determine the entropy of the source whose outcome is the
actual state of this Markov chain.
This needs the calculation of the stationary distribution. Let a, b, c denote the
stationary probabilities of the system being in state A, B, C, respectively. Then
from the first column we have a = (2/3)c giving c = (3/2)a, and from the second column
we have b = a + (2/3)b giving b = 3a. (The third column gives c = (1/3)b + (1/3)c, that is
b = 2c, which already follows from the first two equations, so this is redundant
and we do not need to use this third equation.) Using a + b + c = 1 we obtain
a + 3a + (3/2)a = (11/2)a = 1, so a = 2/11, b = 3a = 6/11, and c = (3/2)a = 3/11. Now we
can write
$$H(Z) = \frac{2}{11} H(Z_n|Z_{n-1} = A) + \frac{6}{11} H(Z_n|Z_{n-1} = B) + \frac{3}{11} H(Z_n|Z_{n-1} = C)$$
$$= \frac{2}{11} H(0, 1, 0) + \frac{6}{11} H\left(0, \frac{2}{3}, \frac{1}{3}\right) + \frac{3}{11} H\left(\frac{2}{3}, 0, \frac{1}{3}\right) = 0 + \frac{6}{11} h(1/3) + \frac{3}{11} h(1/3) = \frac{9}{11} h(1/3).$$
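A small sketch that computes the stationary distribution and the entropy rate for this example (it approximates pA = p by power iteration rather than solving the linear system exactly; all names are mine):

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def stationary(A, iters=1000):
    """Approximate the stationary distribution by repeatedly applying p <- pA."""
    n = len(A)
    p = [1 / n] * n
    for _ in range(iters):
        p = [sum(p[i] * A[i][j] for i in range(n)) for j in range(n)]
    return p

def markov_entropy_rate(A):
    """H(Z) = sum over i of p_i * H(row_i) for a homogeneous chain with matrix A."""
    p = stationary(A)
    return sum(pi * entropy(row) for pi, row in zip(p, A))

A = [[0, 1, 0], [0, 2/3, 1/3], [2/3, 0, 1/3]]
print(stationary(A))           # close to (2/11, 6/11, 3/11)
print(markov_entropy_rate(A))  # close to (9/11) * h(1/3), about 0.75
```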

We have solved the following exercises.


Exercise 9. This exercise is similar to the previous example, only the numbers
differ. Let the 3 × 3 transition probability matrix A of a Markov chain X with
three states A, B, C have first row: 7/8, 1/8, 0, second row: 0, 7/8, 1/8, third
row: 1/3, 1/3, 1/3. Determine the entropy of the source whose outcome is the
actual state of this Markov chain.
Solution: We need to calculate the stationary distribution. Let a, b, c denote
the stationary probabilities of the system being in state A, B, C, respectively.
Then from the first column we have a = (7/8)a + (1/3)c giving a = (8/3)c, and from the
third column we have c = (1/8)b + (1/3)c giving b = (16/3)c. Using a + b + c = 1 we
obtain (8/3)c + (16/3)c + c = 1, which implies c = 3/27 = 1/9. Thus a = 8/27, b = 16/27, and the
requested entropy value is
$$H(X) = \frac{8}{27} H(X_n|X_{n-1} = A) + \frac{16}{27} H(X_n|X_{n-1} = B) + \frac{1}{9} H(X_n|X_{n-1} = C)$$
$$= \frac{8}{27} h(1/8) + \frac{16}{27} h(1/8) + \frac{1}{9} \log 3 = \frac{8}{9} h(1/8) + \frac{1}{9} \log 3.$$
3
Exercise 10: Let X_1, X_2, ... be a Markov chain for which Prob(X_1 = 0) =
Prob(X_1 = 1) = 1/2 and let the transition probabilities for i ≥ 1 be given by
Prob(X_{i+1} = 0|X_i = 0) = Prob(X_{i+1} = 1|X_i = 0) = 1/2, while Prob(X_{i+1} =
0|X_i = 1) = 0 and Prob(X_{i+1} = 1|X_i = 1) = 1. Calculate the entropy of the
source whose outcome is the resulting sequence of random variables X1 , X2 , . . ..
Intuitively the solution is quite clear: this source emits some number (perhaps
zero) of 0's first, but after the first 1 it will emit only 1's. As i gets larger and
larger, the probability of X_i = 0 gets smaller and smaller (in fact it will be 1/2^i), so
if i is large, then X_i is almost certainly 1. Therefore the uncertainty about the
value of X_i approaches zero, so the entropy of the source should be 0.
This intuition is easy to confirm by calculation: by Theorem 10
$$H(X) = \lim_{n\to\infty} H(X_n|X_1, \ldots, X_{n-1}) = \lim_{n\to\infty} H(X_n|X_{n-1}) = \lim_{n\to\infty} \left( \mathrm{Prob}(X_{n-1} = 0)\, h(1/2) + \mathrm{Prob}(X_{n-1} = 1)\, h(0) \right) = \lim_{n\to\infty} \left( \frac{1}{2^{n-1}} + \left(1 - \frac{1}{2^{n-1}}\right) \cdot 0 \right) = 0.$$
Note that the exercise can also be solved similarly as the previous one: realizing
that the stationary distribution is concentrated on the value 1 we get that the
entropy of the source is 0 · h(1/2) + 1 · h(1) = 0. 3

Seventh lecture (October 19, 2021)

We started the class by discussing the solutions of the midterm problems. After
that we turned to the topic of universal source coding.

Universal source coding, Lempel-Ziv type algorithms

The Huffman code gives optimal average length but it assumes knowledge of the source
statistics: we assumed we know the probabilities p_i with which the characters x_i
are emitted by the source. When we have to compress information this may not be
the case. Or we may want to compress earlier than we could know such statistics. (Think
about compressing a text. In principle we could first read it through, compile the
source statistics and then encode. But we may prefer to encode right at the
moment we proceed with its reading.) The term universal source coding refers to
coding the source in such a way that we do not have to know the source statistics
in advance. The following algorithms are devised for such situations. Although
they usually cannot provide as good compression as the Huffman code, they still
do pretty well. Perhaps surprisingly, it can be shown that their compression rate
approaches the entropy rate of the source, that is the theoretical limit. (We will
learn the technique without proving this.) The examples provided in different
files (see references to them below) are from the textbook written in Hungarian
by László Györfi, Sándor Győri, and István Vajda ”Információ- és kódelmélet”
(Information Theory and Coding Theory), published by Typotex, 2000, 2002.

Remark: Huffman encoding can also be adapted to the situation when we cannot
create the optimal code in advance. Then we simply use the empirical source
statistics. This means that the probability of a character is considered to be
equal to the proportion of the number of its appearances up to a certain point
to the number of characters seen altogether up to that point. This statistics
is updated after every single character. For every new character we use the
Huffman code that would belong to the source statistics we calculated right
before the arrival of that character. We encode the character according to that
code and then update the statistics and modify the code accordingly. This can
be done simultaneously at the decoder (without communication), so the decoder
will always be aware of the code the encoder is using. For the latter we need
a rule how to handle equal probabilities that could result in different optimal
encodings, but such rules are not hard to agree on. Also, the encoder and the
decoder have to agree on a starting code. This can be one that assumes uniform
distribution on the source characters or one that estimates the source statistics
according to some a priori knowledge if that exists. (For example, if the source
is an English text, we can use the more or less known frequency of each letter
in the English language as the starting distribution and a Huffman encoding of
that.) Although this is possible, the methods to be discussed below are simpler.

We are going to discuss three versions of the same method: LZ77 (suggested
by Lempel and Ziv in 1977), LZ78 (suggested by Lempel and Ziv in 1978) and
LZW (which is a modification of LZ78 suggested by Welch).
First version: LZ77
There is a sliding window in which we see h_w = h_b + h_a characters, where h_b is the
number of characters we see backwards and ha is the number of characters we see
ahead. The algorithm looks at the not yet encoded part of the character flow in
the ”ahead part” of the window and looks for the longest identical subsequence
in the window that starts earlier. The output of the encoder is then a triple
(t, h, c), where t is the number of characters we have to step backwards to the
start of the longest subsequence identical to what comes ahead, h is the length
of this longest identical subsequence, and c is the codeword for the first new
character that is already not fitting in this longest subsequence. Note that the
longest identical subsequence should start in the backward part of the window
but may end in the ahead part, so t may be less than h.
For example, when we are encoding the sequence ...cabracadabrarrarrad..., the
...cabraca part is already encoded (so the coming part is dabrarrarrad...), and
we have hb = 7, ha = 6, then the first triple sent is (0, 0, f (d)) (where f (.)
is the codeword for the character in the argument), the second triple sent is
(7, 4, f (r)), etc., see the ”LZexamples” file. Note that for the next triple we will
have (3, 5, f (d)) showing an example when t < h.

Second version: LZ78


In LZ77 we build on the belief that similar parts of the text come close to each
other. The LZ78 version needs only that substantial parts are repeated but they
do not have to be right after each other. Also, LZ77 is sensitive to the window
size. In LZ78 we do not have this disadvantage.
We build a codebook and each time we encode we look for the longest new
segment that already appears in the codebook. The output is a pair (i, c)
where i is the index of the longest coming segment that already has a codeword
and c is the first new character after it. Apart from producing this output
the algorithm also extends the codebook by putting into it the shortest not yet
found segment, which is the concatenation of the segment with index i we found
and the character c. This new, one longer segment gets the next index and then
we go on with the encoding. For an example see again the ”LZexamples” file

Third version: LZW


This is the most popular version of the algorithm that is a modification of LZ78
as suggested by Welch. We now start with a codebook that already contains all
the one-character sequences. (They have an index which serves as a codeword
for them; we can think about their codeword as the s-ary, or simply binary
representation of this index.) We now read the longest new part p of the text
that can be found in the codebook and the next character, let it be a. Then the
output is simply the index of p, we extend the codebook with the new sequence
pa (that we obtain by simply putting a to the end of p) giving it the next index,
and we consider the extra character a as the beginning of the not yet encoded
part of the text. For an example see the file ”LZWexample”.
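A minimal sketch of the LZW encoder as just described (1-based dictionary indices as in class; the function name is mine):

```python
def lzw_encode(text, alphabet):
    """LZW encoding: returns the list of emitted indices."""
    dictionary = {ch: i + 1 for i, ch in enumerate(alphabet)}
    output, p = [], ""
    for a in text:
        if p + a in dictionary:
            p = p + a                                   # keep extending the matched segment
        else:
            output.append(dictionary[p])                # emit the index of the longest match
            dictionary[p + a] = len(dictionary) + 1     # the new segment pa gets the next index
            p = a                                       # the extra character starts the next segment
    if p:
        output.append(dictionary[p])
    return output

print(lzw_encode("abcabcbabcabc", "abc"))  # -> [1, 2, 3, 4, 3, 2, 7, 7], see Exercise 11 below
```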

After having seen the examples we have solved the following exercise.
Exercise 11. Let us have a source with alphabet X = {a, b, c}. Encode the
source sequence
abcabcbabcabc
with the Lempel-Ziv-Welch algorithm. (The dictionary originally contains the
codewords (i.e., the indices) for the one character sequences: 1 for a, 2 for b, 3
for c.)
Solution: Using the algorithm we learnt one obtains the code 1, 2, 3, 4, 3, 2, 7, 7.
The dictionary becomes: 1 : a; 2 : b; 3 : c; 4 : ab; 5 : bc; 6 : ca; 7 : abc; 8 :
cb; 9 : ba; 10 : abca. 3
The following exercise remained homework.
Exercise 12. Decode the following sequence that is the result of encoding a
sequence with the LZW method. The alphabet has 4 characters: a, b, c, d and
they are in the dictionary with indices 1, 2, 3, 4, respectively. The sequence to
be decoded is this:
1, 2, 2, 1, 4, 5, 7, 3, 10, 1, 12, 2

Eighth lecture (October 26, 2021)

We started by solving the homework problem. The decoded sequence is:

abbadabbacabbacab

and the dictionary at the end (which the decoder also must build, otherwise
could not successfully decode) contains

1 : a, 2 : b, 3 : c, 4 : d, 5 : ab, 6 : bb, 7 : ba, 8 : ad,

9 : da, 10 : abb, 11 : bac, 12 : ca, 13 : abba, 14 : ac.

Source coding with negligible probability of error


A disadvantage of using variable length codewords is that if a codeword becomes
erroneous, causing a mistake in the decoding, this mistake may propagate to the
subsequent codewords. This will not happen if all codewords have the same
length. Then, however, we cannot achieve any compression once we insist on
error-free decoding. If the k-length codewords encode r messages over an s-ary
alphabet then we must have s^k ≥ r, and this is clearly enough, too. But this has
nothing to do with the source statistics, so this way we encode everything with
the same average length as if the r messages were equally likely thus providing
maximum entropy. To overcome this problem we allow a negligible, but positive
probability of error.
We will use block codes f : X^k → Y^m.
Def. A code f : X^k → Y^m can be decoded with error probability at most
ε if there exists a function ϕ : Y^m → X^k such that
$$\mathrm{Prob}(\varphi(f(X_1, \ldots, X_k)) \ne (X_1, \ldots, X_k)) < \varepsilon.$$

We can think about the code as the pair (f, ϕ) where we may select ϕ to be
the decoding function achieving the smallest error probability for f . (The error
probability is then defined as the above quantity on the left hand side, that is,
as Prob(ϕ(f(X_1, ..., X_k)) ≠ (X_1, ..., X_k)).)
We are interested in codes with small error probability and small rate m/k. We
are able to give |Y|m distinct codewords, so we may have that many elements of
X k decoded in an error-free manner. Thus the error probability is minimized if
we choose the |Y|m largest probability elements of X k to encode in a one-to-one
way, while all the rest of the elements of X k will just be given some codeword
which will not be decoded to them.
Let N(k, ε) be the smallest number N for which, if x_1, ..., x_N are the N largest
probability k-length outputs of the source X, we have $\sum_{i=1}^{N} p(x_i) > 1 - \varepsilon$. The
quantity relevant for us is $\frac{\log N(k, \varepsilon)}{k}$. The main result here is that for a fairly
general class of sources this quantity tends to the entropy of the source. So even
in this setting it is the entropy of the source that gives the quantitative char-
acterization of the best encoding rate one can achieve. This verifies again the
intuition that interprets the entropy of the source as a measure of information
content in the source variables.
Def. We call a stationary source information stable if for every δ > 0
$$\lim_{k\to\infty} \mathrm{Prob}\left( \left| -\frac{1}{k} \log p(X_1, \ldots, X_k) - H(X) \right| > \delta \right) = 0.$$

Some remarks:
1) If X is stationary and memoryless, then it is information stable. Here is the
proof:
$$Y_k := -\frac{1}{k} \log p(X_1, \ldots, X_k) = -\frac{1}{k} \log(p(X_1)p(X_2)\cdots p(X_k)) = \frac{1}{k} \sum_{i=1}^{k} (-\log p(X_i)),$$
where the (− log p(X_i))'s are independent identically distributed (i.i.d.) random
variables. Observe that their expected value is just the entropy H(X) = H(X_i).
By the weak law of large numbers the Y_k's converge in probability to this com-
mon expected value, which means exactly that
$$\lim_{k\to\infty} \mathrm{Prob}\left( \left| -\frac{1}{k} \log p(X_1, \ldots, X_k) - H(X) \right| > \delta \right) = 0,$$
i.e. the information stability of the source.
2) The intuitive meaning of information stability is that if k is large enough
then there is a large probability set A ⊆ X^k for which x ∈ A implies
$$p(x) \approx 2^{-kH(X)}.$$
If Prob(A) is close to 1, this also implies
$$|A| \approx \frac{1}{p(x)} \approx 2^{kH(X)}.$$
Thus encoding the elements of A in a one-to-one manner will require about
kH(X) bits.
Here is the formal statement we are going to prove.

Theorem 11 Let the stationary source X be information stable. Then for every
0 < ε < 1,
$$\lim_{k\to\infty} \frac{1}{k} \log N(k, \varepsilon) = H(X).$$

Proof. Let Bk,ε be the set of the N (k, ε) highest probability k-length source
outputs and for every δ ∈ (0, 1) define
n o
Ak,δ := x ∈ X k : 2−k(H(X)+δ) ≤ p(x) ≤ 2−k(H(X)−δ) .

Then by X
1 ≥ P (Ak,δ ) = p(x) ≥ |Ak,δ |2−k(H(X)+δ)
x∈Ak,δ

we have
|Ak,δ | ≤ 2k(H(X)+δ) .

By information stability, we know that P(Ak,δ) is close to 1 (in particular, P(Ak,δ) > 1 − ε) for large enough k. Since Bk,ε is the smallest cardinality set with probability at least 1 − ε, we get that for large enough k

N(k, ε) = |Bk,ε| ≤ |Ak,δ| ≤ 2^(k(H(X)+δ)),

and so

(1/k) log N(k, ε) ≤ H(X) + δ.

Since δ can be arbitrarily small, this also implies

limsup_{k→∞} (1/k) log N(k, ε) ≤ H(X).

For the reverse inequality let k be large enough for P(Ak,δ) > (1+ε)/2. Then (denoting the complement of a set U by U^c) we have

(1+ε)/2 < P(Ak,δ) = P(Ak,δ ∩ Bk,ε) + P(Ak,δ ∩ Bk,ε^c) < P(Ak,δ ∩ Bk,ε) + ε,

that is

P(Ak,δ ∩ Bk,ε) > (1−ε)/2.

We can thus write

(1−ε)/2 < P(Ak,δ ∩ Bk,ε) ≤ |Bk,ε| · max_{x∈Ak,δ} p(x) ≤ |Bk,ε| · 2^(−k(H(X)−δ)) = N(k, ε) · 2^(−k(H(X)−δ)).

So we get

N(k, ε) > ((1−ε)/2) · 2^(k(H(X)−δ)).

Thus

(1/k) log N(k, ε) > H(X) − δ + (1/k) log((1−ε)/2).

Since δ > 0 can be arbitrarily small and log((1−ε)/2) is a constant independent of k, the latter implies

liminf_{k→∞} (1/k) log N(k, ε) ≥ H(X)

and thus the statement. 2

By the foregoing we have essentially proved the following coding theorem.

Theorem 12 Let X be a stationary source which is also information stable and 0 < ε < 1. Let us have a sequence of codes fk : X^k → Y^(mk) (|Y| = s) which encodes the source with less than ε probability of error. Then

liminf_{k→∞} mk/k ≥ H(X)/log s.

On the other hand, for every 0 < ε < 1 and δ > 0, if k is large enough then there exists an fk : X^k → Y^m code with probability of error less than ε and

m/k < H(X)/log s + δ.

The proof is immediate by realizing that if we want a code with probability of error less than ε, then the best we can do is to encode the N(k, ε) largest probability elements of X^k in a one-to-one way to codewords of length log_s N(k, ε) = log N(k, ε) / log s. Thus the compression rate will tend to lim_{k→∞} (1/k) log_s N(k, ε) = H(X)/log s.
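As a sanity check of Theorems 11 and 12 one can compute N(k, ε) by brute force for a small memoryless source and watch (1/k) log N(k, ε) approach H(X). The sketch below is only an illustration: the binary alphabet, its distribution and the value of ε are arbitrary choices, not taken from the lecture.

```python
import math
from itertools import product

# Illustrative memoryless source on a binary alphabet.
probs = {"0": 0.8, "1": 0.2}
H = -sum(p * math.log2(p) for p in probs.values())
eps = 0.1

def N_k_eps(k, eps):
    """Smallest N such that the N most probable k-blocks have total probability > 1 - eps."""
    block_probs = sorted(
        (math.prod(probs[x] for x in block) for block in product(probs, repeat=k)),
        reverse=True,
    )
    total = 0.0
    for N, p in enumerate(block_probs, start=1):
        total += p
        if total > 1 - eps:
            return N
    return len(block_probs)

for k in (4, 8, 12, 16):
    N = N_k_eps(k, eps)
    print(f"k={k:2d}   (1/k) log2 N(k, eps) = {math.log2(N) / k:.4f}   H(X) = {H:.4f}")
```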

Ninth lecture (November 2, 2021)

Quantization

In many practical situations the source variables are real numbers, thus have a
continuum range. If we want to use digital communication we have to discretize,
which means that some kind of ”rounding” is necessary.
Def. Let X = X1 , X2 , . . . be a stationary source, where the Xi ’s are real-
valued random variables. A (1-dimensional) quantized version of this source
is a sequence of discrete random variables (another source) Q(X1 ), Q(X2 ), . . .
obtained by a map Q : R → R where the range of the map is finite. The function
Q(.) is called the quantizer.

Goal: Quantize a source so that the distortion caused is small.


How can we measure the distortion? We will do it by using the quadratic distortion measure D(Q) defined for n-length blocks as

D(Q) = E( (1/n) Σ_{i=1}^{n} (Xi − Q(Xi))^2 ),

where E(.) means expected value.


Since our Xi's are identically distributed we have

D(Q) = E((X − Q(X))^2).

(Here X is meant to have the same distribution as all the Xi's.)
Let the range of Q(.) be the set {x1, . . . , xN}, where the xi's are real numbers. Q(.) is uniquely defined by the values x1, . . . , xN and the sets Bi = {x : Q(x) = xi}. Once we fix x1, . . . , xN, we will have the smallest distortion D(Q) if every x is "quantized" to the closest xi, i.e.,

Bi = {x : |x − xi| ≤ |x − xj| ∀ j ≠ i}.

(Note that this rule puts some values into two neighboring Bi's (considering x1 < x2 < . . . < xN, we have x = (xi + xi+1)/2 in both Bi and Bi+1). This can easily be resolved by saying that all these values go to (say) the smaller indexed Bi.)
If now we consider the Bi's fixed, then the smallest distortion D(Q) is obtained if the xi values lie in the barycenter of the Bi, which is E(X|Bi) := E(X | X ∈ Bi) = (∫_{Bi} x f(x) dx) / (∫_{Bi} f(x) dx), where f(x) is the density function of the random variable X.

(We will always assume that f (x) has all the ”nice” properties needed for the
existence of the integrals we mention.)
The previous claim (smallest distortion is achieved for given quantization intervals Bi when Q(x) = E(X|Bi) for x ∈ Bi) can be seen as follows.
This holds for all Bi separately, so it is enough to show it for one of them. By the linearity of expectation

E((X − c)^2) = E(X^2) − c(2E(X) − c),

and this is smallest when c(2E(X) − c) is largest. Since the sum of c and 2E(X) − c does not depend on c, one can see simply from the inequality between the arithmetic and geometric mean ((a+b)/2 ≥ √(ab) with equality iff a = b) that this product is largest when c = E(X). (At least this is the case if we can assume that both c and 2E(X) − c are non-negative and so the inequality (a+b)/2 ≥ √(ab) can be used. If this is not the case, we can still easily obtain that c(2E(X) − c) is maximized by c = E(X) by looking at the derivatives.)

Lloyd-Max algorithm
The above suggests an iterative algorithm to find a good quantizer: We fix some quantization levels x1 < . . . < xN first and optimize for them the Bi domains by defining them as above: let yi = (xi + xi+1)/2 for i = 1, . . . , N − 1 and

B1 := (−∞, y1],   Bi := (yi−1, yi], i = 2, . . . , N − 1,   BN := (yN−1, ∞).

Notice that in general there is no reason for the xi's to be automatically the barycenters of the domains Bi obtained in the previous step. So now we can consider these domains Bi fixed and optimize the quantization levels with respect to them by redefining them as the corresponding barycenters:

xi := (∫_{Bi} x f(x) dx) / (∫_{Bi} f(x) dx).

Now we can consider again the so-obtained xi's fixed and redefine the Bi's for them, and so on. After each step (or after each "odd" step when we optimize the Bi domains for the actual xi's) we can check whether the current distortion is below a certain threshold. If yes we stop the algorithm, if no, then we continue with further iterations.
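Below is a minimal sketch of this iteration in code. It is only an illustration: the density (a normal density discretized on a finite grid, so the barycenter integrals become finite sums), the initial levels and the fixed number of iterations are assumptions made for the example, not part of the lecture.

```python
import numpy as np

# Illustrative setup: a (truncated) standard normal density on a fine grid,
# treated as a probability mass function so that integrals become sums.
grid = np.linspace(-4.0, 4.0, 4001)
f = np.exp(-grid**2 / 2)
f /= f.sum()

def lloyd_max(levels, f, grid, iterations=100):
    """Alternate the two steps of the Lloyd-Max algorithm."""
    x = np.array(sorted(levels), dtype=float)
    for _ in range(iterations):
        # Step 1: optimize the domains B_i for the current levels
        # (boundaries are the midpoints between consecutive levels).
        boundaries = (x[:-1] + x[1:]) / 2
        cells = np.searchsorted(boundaries, grid)       # cell index of each grid point
        # Step 2: move each level to the barycenter of its domain.
        for i in range(len(x)):
            mask = cells == i
            if f[mask].sum() > 0:
                x[i] = (grid[mask] * f[mask]).sum() / f[mask].sum()
    cells = np.searchsorted((x[:-1] + x[1:]) / 2, grid)
    distortion = ((grid - x[cells])**2 * f).sum()
    return x, distortion

levels, D = lloyd_max([-2.0, -0.5, 0.5, 2.0], f, grid)
print("quantization levels:", np.round(levels, 3), "  distortion:", round(float(D), 4))
```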
Remarks.
1) It should be clear from the above that if either of the two steps above changes the xi quantization levels or the Bi domains, then the quantizer before that step was not optimal. It is possible, however, that neither step yields any change anymore while the quantizer is still not optimal. Here is an example. Let X be a random variable that takes its values on the finite set {1, 2, 3, 4} with uniform distribution. (That is, P(X = 1) = P(X = 2) = P(X = 3) = P(X = 4) = 1/4.) Let N = 2, that is, we are allowed to use two values for the quantization. There are three different quantizers for which neither of the above steps can cause any improvement, but only one of them is optimal. These three quantizers can be
described by
Q1 (1) = 1, Q1 (2) = Q1 (3) = Q1 (4) = 3;

Q2 (1) = Q2 (2) = 1.5, Q2 (3) = Q2 (4) = 3.5;

Q3 (1) = Q3 (2) = Q3 (3) = 2, Q3 (4) = 4.


It takes an easy calculation to check that D(Q1) = D(Q3) = 0.5, while D(Q2) = 0.25. Thus only Q2 is optimal, although neither Q1 nor Q3 can be improved by the Lloyd-Max algorithm.
2) Let us call a quantizer a Lloyd-Max quantizer if the two steps of the Lloyd-Max algorithm have no effect on it. In the previous remark we have seen
that a Lloyd-Max quantizer is not necessarily optimal. Fleischer gave a sufficient
condition for the optimality of a Lloyd-Max quantizer. It is in terms of the
density function f (x) of the random variable to be quantized. In particular, it
requires that log f (x) is concave.

Tenth lecture (November 9, 2021)

The above condition of Fleischer is satisfied by the density function of a random


variable uniformly distributed in an interval [a, b]. Thus a corollary of Fleischer’s
theorem is that there is only one Lloyd-Max quantizer with N levels for the
uniform distribution on [a, b]. It is not hard to see that this should be the uniform quantizer: the one belonging to Bi = {x : a + (i−1)(b−a)/N ≤ x ≤ a + i(b−a)/N} and quantization levels at the middle of these intervals. (The extreme points of the intervals belonging to two neighboring Bi's can be freely decided to belong to either of them.)

Distortion of the uniform quantizer


The simplest quantizer is the uniform quantizer; we investigate it a bit more closely. Note that we do not assume now that the distribution we work with is uniform. For simplicity we assume, however, that the density function of our random variable to be quantized is 0 outside the interval [−A, A], and it is continuous within [−A, A]. The N-level uniform quantizer is defined by the function

QN(x) = −A + (2i − 1)A/N

whenever

−A + 2(i − 1)A/N < x ≤ −A + 2iA/N.

(To be precise: for x = −A we also have QN(−A) = −A + A/N.)

The length of each interval whose elements are assigned the same value is then qN = 2A/N. The following theorem gives the distortion of the uniform quantizer asymptotically (as N goes to infinity) in terms of qN.

Theorem 13 If the density function f of the random variable X satisfies the


above requirements (continuous in [−A, A] and 0 outside it) then for the distor-
tion of the N -level uniform quantizer QN we have

lim_{N→∞} D(QN)/qN^2 = 1/12.

Proof. We will use the following notation. The extreme points of the quantization intervals are

yN,i = −A + 2iA/N,   i = 0, 1, . . . , N,

while the quantization levels are

xN,i = −A + (2i − 1)A/N,   i = 1, 2, . . . , N.

With this notation the distortion can be written by definition as

D(QN) = Σ_{i=1}^{N} ∫_{yN,i−1}^{yN,i} (x − xN,i)^2 f(x) dx.

We define the auxiliary density function fN(x) as

fN(x) := (1/qN) ∫_{yN,i−1}^{yN,i} f(z) dz   if x ∈ (yN,i−1, yN,i].

First we calculate the distortion D̂(QN) of QN with respect to this auxiliary density function.

D̂(QN) = Σ_{i=1}^{N} ∫_{yN,i−1}^{yN,i} (x − xN,i)^2 fN(x) dx
= Σ_{i=1}^{N} (1/qN) ∫_{yN,i−1}^{yN,i} f(z) dz · ∫_{yN,i−1}^{yN,i} (x − xN,i)^2 dx
= Σ_{i=1}^{N} (1/qN) ∫_{yN,i−1}^{yN,i} f(z) dz · ∫_{−qN/2}^{qN/2} x^2 dx
= (qN^2/12) Σ_{i=1}^{N} ∫_{yN,i−1}^{yN,i} f(z) dz = qN^2/12.

To finish the proof we will show that

lim_{N→∞} (D̂(QN) − D(QN)) / D̂(QN) = lim_{N→∞} (D̂(QN) − D(QN)) / (qN^2/12) = 0,

that is clearly enough.


Since f is continuous in the closed interval [−A, A] it is also uniformly continuous. Thus for every ε > 0 there exists N0 such that if N ≥ N0 then |f(x) − f(x′)| < ε whenever x, x′ ∈ (yN,i−1, yN,i) (since |yN,i−1 − yN,i| = qN, and qN → 0 as N → ∞).
So for N ≥ N0 we can write

|D̂(QN) − D(QN)| / (qN^2/12)
= (12/qN^2) · | Σ_{i=1}^{N} ∫_{yN,i−1}^{yN,i} (x − xN,i)^2 f(x) dx − Σ_{i=1}^{N} ∫_{yN,i−1}^{yN,i} (x − xN,i)^2 fN(x) dx |
≤ (12/qN^2) Σ_{i=1}^{N} ∫_{yN,i−1}^{yN,i} (x − xN,i)^2 |f(x) − fN(x)| dx
≤ (12/qN^2) Σ_{i=1}^{N} ∫_{−qN/2}^{qN/2} z^2 ε dz = (12/qN^2) · N · (qN^3/12) · ε = qN N ε = (2A/N) · N · ε = 2Aε,

that can be made arbitrarily small by choosing ε small enough. This completes the proof. 2

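The limit in Theorem 13 is easy to check numerically. The sketch below is an illustration only: the triangle-shaped density on [−A, A] and the grid resolution are arbitrary assumptions, and the integral defining D(QN) is approximated by a Riemann sum.

```python
import numpy as np

# Illustrative density on [-A, A]: continuous, zero outside the interval.
A = 1.0
grid = np.linspace(-A, A, 200001)
dx = grid[1] - grid[0]
f = 1.0 - 0.5 * np.abs(grid)           # triangle-shaped, positive on [-A, A]
f /= f.sum() * dx                      # normalize so that the Riemann sum of f is 1

for N in (4, 16, 64, 256):
    qN = 2 * A / N
    # Index i of the quantization interval of each grid point (x = -A handled as in the text).
    i = np.clip(np.ceil((grid + A) / qN), 1, N)
    levels = -A + (2 * i - 1) * A / N
    D = ((grid - levels)**2 * f).sum() * dx       # Riemann-sum approximation of D(Q_N)
    print(f"N={N:4d}   D(Q_N)/q_N^2 = {D / qN**2:.5f}   (1/12 = {1/12:.5f})")
```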
At the end of the class meeting we considered two exercises. We have solved
the first one, the second remained homework.

Exercise 13. Let X = (X1 , X2 , . . .) be a stationary source with entropy H(X).


Decide whether the entropy of the following sources exists and determine it if
it does.
a) Xa = (X1 , X1 , X2 , X2 , X3 , X3 , . . .) (all random variables are repeated once)
b) Xb = (X1 , X1 , X2 , X3 , X3 , X4 , X5 , X5 , X6 , . . .) (only the odd indexed random
variables are repeated)
c) Xc = (X1 , X2 , X2 , X3 , X3 , X3 , X4 , X4 , X4 , X4 , . . .) (the random variable with
index i is repeated i times)

Sketch of Solution:
a)

H(Xa) = lim_{n→∞} (1/(2n)) H(X1, X1, X2, X2, . . . , Xn, Xn)
= (1/2) lim_{n→∞} (1/n) H(X1, X1, X2, X2, . . . , Xn, Xn)
= (1/2) lim_{n→∞} (1/n) H(X1, X2, . . . , Xn) = (1/2) H(X).

b)

H(Xb) = lim_{n→∞} (1/(1.5n)) H(X1, X1, X2, X3, X3, X4, X5, X5, . . . , Xn)
= (2/3) lim_{n→∞} (1/n) H(X1, X1, X2, X3, X3, . . . , Xn)
= (2/3) lim_{n→∞} (1/n) H(X1, X2, . . . , Xn) = (2/3) H(X).

c)

H(Xc) = lim_{n→∞} (2/(n(n+1))) H(X1, X2, X2, X3, X3, X3, . . . , Xn, . . . , Xn)
= lim_{n→∞} (2/(n+1)) · (1/n) H(X1, X2, X2, X3, X3, X3, . . . , Xn, . . . , Xn)
= lim_{n→∞} (2/(n+1)) · (1/n) H(X1, X2, . . . , Xn) = lim_{n→∞} (2/(n+1)) H(X) = 0.

Exercise 14. (Homework)


Let X and Y be two random variables both taking values in the set {1, 2, 3}.
What should be the values of the joint probabilities

p(X = i, Y = j), 1 ≤ i, j ≤ 3

if we know that

p(X = 1) = 2/3, p(X = 2) = p(X = 3) = 1/6,

p(Y = 1) = 1/2, p(Y = 2) = p(Y = 3) = 1/4


and H(X, Y ) is as large as it can be under these constraints?

Eleventh lecture (November 23, 2021)

Channel coding

Channel model: stochastic matrix. Rows belong to input letters, columns belong
to output letters. Wi,j = W (yj |xi ), which is the probability of receiving yj when
xi was sent. We consider only discrete memoryless channels, so the input and
output alphabets are finite and the behaviour of the channel is described by the
same matrix W at every channel use. (In particular, the probabilities described
by this matrix do not depend on what happened in the past, e.g., what input
letters were sent previously and what output letters they resulted in.)

Example. (The input and output alphabets are denoted X , Y, respectively.)
Binary symmetric channel (BSC): X = Y = {0, 1}, W (1|1) = W (0|0) = 1 − p,
W (1|0) = W (0|1) = p.

Goal: Communicating reliably and efficiently.


Reliably means: with small probability of error.
Efficiently means: with as few channel uses as possible.

Code: A(n invertible) function f : M → X^n, where M is the set of possible messages. The relevance of M will be its size M := |M|. We can also think about the code as its codeword set {c^(1), . . . , c^(M)}.
We also need a decoding function ϕ : Y n → M that tells us which message we
decode a certain received sequence to. (One can also define codes as the pairs
of functions (f, ϕ).)
The probability of error if the message mi, that is the codeword c^(i), was sent is

Pe,i = Σ_{y : ϕ(y)≠mi} Prob(y was received | c^(i) was sent) = Σ_{y : ϕ(y)≠mi} Π_{r=1}^{n} W(yr | cr^(i)),

where yr and cr^(i) denote the rth character in the sequences y and c^(i), respectively.
We want small error independently of the probability distribution on the messages. So we define the average error probability, that is, the average of the Pe,i values over the M messages:

P̄e = (1/M) Σ_{i=1}^{M} Pe,i.

The efficiency of the code is measured by its rate:

R = (log2 M)/n.

Shannon’s Channel Coding Theorem, one of the most fundamental results in


information theory, says that discrete memoryless channels have a characteris-
tic value, their capacity, with the property that one can communicate reliably
with any rate below it, and one cannot, above it. Here "reliably" means "with arbitrarily small probability of error".
First we define the capacity CW of a discrete memoryless channel given by its
matrix W .
Def.
CW := max I(X, Y ),
where the maximization is over all joint distributions of the pair of random
variables (X, Y ) that satisfy that the conditional probability of Y given X is
what is prescribed by W .
The above expression can be rewritten as

CW = max Σ_{x∈X, y∈Y} p(x, y) log( p(x, y) / (p(x)p(y)) )

= max Σ_{x∈X, y∈Y} p(x)p(y|x) log( p(y|x) / Σ_{x′∈X} p(x′)p(y|x′) ).
The advantage of the last expression is that it shows very clearly that when
maximizing I(X, Y ) what we can vary is the distribution of X, that is the input
distribution. (All other values in the last expression are conditional probabilities
given by the channel matrix W .)
Example: For the binary symmetric channel we can calculate the capacity as
follows. I(X, Y ) = H(Y ) − H(Y |X) and it follows from the channel character-
istics that H(Y |X = 0) = H(Y |X = 1) = h(p), so H(Y |X) = h(p) irrespective
of the distribution of X. So I(X, Y ) = H(Y ) − h(p) ≤ log 2 − h(p) = 1 − h(p).
Observing that if we let X have uniform distribution, then Y will also have
uniform distribution (that results in H(Y ) = 1), we conclude that this upper
bound can be achieved. Thus the capacity of the binary symmetric channel is
1 − h(p).
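As a quick numerical cross-check (an illustration only; the crossover probabilities below are arbitrary test values), one can maximize I(X, Y) over the input distribution of a BSC by a simple grid search and compare the result with 1 − h(p).

```python
import math

def h(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mutual_information(q, p):
    """I(X,Y) = H(Y) - H(Y|X) for a BSC with crossover p and input P(X=1) = q."""
    py1 = q * (1 - p) + (1 - q) * p      # P(Y = 1)
    return h(py1) - h(p)

for p in (0.05, 0.1, 0.25):
    capacity = max(bsc_mutual_information(q / 1000, p) for q in range(1001))
    print(f"p = {p:.2f}:  grid-search maximum = {capacity:.4f},  1 - h(p) = {1 - h(p):.4f}")
```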

Now we state the Channel Coding Theorem:


Given a channel with capacity C, for every rate R < C there exists a sequence of codes with length n and number of codewords at least 2^(nR) such that the average probability of error P̄e (when using these codes for communication over the given channel) goes to zero as n tends to infinity.
Conversely, for any sequence of codes with length n and number of codewords at least 2^(nR), if using these codes for communication over a channel with capacity C the average error probability tends to zero as n goes to infinity, then we must have R ≤ C.
In short one can say that all rates below capacity are achievable with an arbi-
trarily small error probability, and this is not true for any rate above capacity.

Here is another example for calculating channel capacity. The following channel
is called the binary erasure channel: X = {0, 1}, Y = {0, 1, ∗}, W (1|1) =
W (0|0) = 1 − p, W (1|0) = W (0|1) = 0, W (∗|0) = W (∗|1) = p.
We need to calculate C = max I(X, Y) = max{H(Y) − H(Y|X)} = max{H(X) − H(X|Y)}. First we use the second form, then also show how the same result can be obtained from the first one. Assume that at the input we have the distribution given by Prob(X = 0) = q, Prob(X = 1) = 1 − q. Then H(X) − H(X|Y) = h(q) − (Prob(Y = 0)H(X|Y = 0) + Prob(Y = 1)H(X|Y = 1) + Prob(Y = ∗)H(X|Y = ∗)). Note that both H(X|Y = 0) = 0 and H(X|Y = 1) = 0, since Y = 0 and Y = 1 leave no uncertainty about the value of X. At the same time Prob(Y = ∗) = p and H(X|Y = ∗) = h(pq/p) = h(q), so we can write

C = max_q {H(X) − H(X|Y)} = max_q {h(q) − p·h(q)} = 1 − p.

Notice that this is indeed what one would intuitively expect.


Now we show how the same value can be obtained using the other formula.
We need to calculate C = max I(X, Y ) = max{H(Y )−H(Y |X)} = max{H(Y )−
h(p)}. So our task is to maximize H(Y ) (since h(p) is a fixed value). Let E be the
indicator variable taking value 1 when an erasure occurs, that is, when Y = ∗,
and 0 otherwise. Note that E is a function of Y , so H(Y ) = H(E, Y ). Then by
the Chain rule H(Y) = H(E, Y) = H(E) + H(Y|E) = h(p) + P(E = 1)H(Y|E = 1) + P(E = 0)H(Y|E = 0) = h(p) + p · 0 + (1 − p) · H(Y | Y ≠ ∗) ≤ h(p) + 1 − p.

Note that if the input distribution is set to be uniform then the last inequality
is an equality, so we get max H(Y ) = h(p) + 1 − p and thus

C = max I(X, Y ) = max{H(Y ) − h(p)} = 1 − p.
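The same kind of brute-force check works for the binary erasure channel (an illustration only; the erasure probabilities are arbitrary test values): computing I(X, Y) from the joint distribution for many input distributions and taking the maximum reproduces 1 − p.

```python
import math

def mutual_information(px, W):
    """I(X,Y) in bits for input distribution px and channel matrix W[x][y]."""
    ny = len(W[0])
    py = [sum(px[x] * W[x][y] for x in range(len(px))) for y in range(ny)]
    I = 0.0
    for x in range(len(px)):
        for y in range(ny):
            pxy = px[x] * W[x][y]
            if pxy > 0:
                I += pxy * math.log2(pxy / (px[x] * py[y]))
    return I

for p in (0.1, 0.3, 0.5):
    # Binary erasure channel; output order: 0, 1, * (erasure).
    W = [[1 - p, 0.0, p],
         [0.0, 1 - p, p]]
    capacity = max(mutual_information([q / 1000, 1 - q / 1000], W) for q in range(1001))
    print(f"p = {p:.1f}:  max_q I(X,Y) = {capacity:.4f},  1 - p = {1 - p:.4f}")
```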

In the last part of the class meeting we solved exercises. First we discussed the
solution of the homework (Exercise 14):
Let X and Y be two random variables both taking values in the set {1, 2, 3}.
What should be the values of the joint probabilities

p(X = i, Y = j), 1 ≤ i, j ≤ 3

if we know that

p(X = 1) = 2/3, p(X = 2) = p(X = 3) = 1/6,

p(Y = 1) = 1/2, p(Y = 2) = p(Y = 3) = 1/4


and H(X, Y ) is as large as it can be under these constraints?
Solution: The given marginal distributions determine H(X) and H(Y). The joint entropy satisfies H(X, Y) ≤ H(X) + H(Y), and we have equality if and only if X and Y are independent. So H(X, Y) is maximized if p(X = i, Y = j) = p(X =
i)p(Y = j) holds for all i, j. 3
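A few lines of code confirm this numerically (an illustration only; the second joint distribution is just one hand-made coupling of the same marginals): the product distribution attains H(X) + H(Y), while the other coupling gives a strictly smaller joint entropy.

```python
import math

def entropy(ps):
    """Entropy in bits of a probability vector (zero entries are ignored)."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

px = [2/3, 1/6, 1/6]
py = [1/2, 1/4, 1/4]

# The maximizer: X and Y independent, p(X=i, Y=j) = p(X=i) p(Y=j).
joint_independent = [px[i] * py[j] for i in range(3) for j in range(3)]

# Another joint distribution with the same marginals, for comparison.
joint_other = [1/2, 1/12, 1/12,
               0.0, 1/12, 1/12,
               0.0, 1/12, 1/12]

print("H(X) + H(Y)               =", round(entropy(px) + entropy(py), 4))
print("H(X,Y) for independence   =", round(entropy(joint_independent), 4))
print("H(X,Y) for other coupling =", round(entropy(joint_other), 4))
```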

Exercise 15. (Mod 11 channel)


We have a channel with input and output alphabet X = Y = {0, 1, . . . , 10}.
When input i is sent the output is one of i + 1, i + 2, and i + 3 (where addition
is meant modulo 11), each with probability 1/3. Determine the capacity of this
channel.
Solution:

C = max I(X, Y ) = max{H(Y ) − H(Y |X)} = max{H(Y ) − log 3},

as H(Y |X) = log 3 follows from the properties of the channel. Since Y can take
11 possible values, we also know that

H(Y ) ≤ log 11.

This implies that C ≤ log 11 − log 3 = log(11/3). We can observe that if we let the distribution of X, i.e., the input distribution, be uniform, then the distribution of Y will also be uniform. Thus H(Y) = log 11 is achievable, and so the capacity of the channel is equal to the above upper bound,

C = log(11/3).
3

The following exercise remained homework.


Exercise 16.
Call a channel weakly symmetric if the matrix W describing it has the following
properties. There is a probability distribution P = (p1, p2, . . . , pn) such that every row of W is a permutation of the numbers p1, . . . , pn. Another property that W satisfies is that the sum of its entries in every column is the same value.
Prove that for a weakly symmetric channel with output alphabet Y its capacity
CW can always be expressed as

CW = log |Y| − H(P ),

where P is the distribution we referred to when describing the properties of W .
(Note that the channel in the previous exercise as well as a binary symmetric
channel are weakly symmetric channels.)

At the twelfth class meeting (November 30, 2021) we solved exercises.

Twelfth lecture (Thirteenth class meeting) (December 7, 2021)

Let us state again the Channel Coding Theorem:


For every rate R < C there exists a sequence of codes with length n and number of codewords at least 2^(nR) such that the average probability of error P̄e goes to zero as n tends to infinity. (This is called the direct part of the theorem.)
Conversely, for any sequence of codes with length n, number of codewords at least 2^(nR) and average error probability tending to zero as n goes to infinity, we
must have R ≤ C. (This is called the converse part of the theorem.)

First we will prove the converse part. Even this we first show in a weaker form, namely we show that for zero-error codes we must have R ≤ C. We will use the following lemma.

Lemma 1 Let Y^n be the output of a discrete memoryless channel with capacity C resulting from the input X^n. Then

I(X^n, Y^n) ≤ nC.

Proof.

I(X^n, Y^n) = H(Y^n) − H(Y^n|X^n)
= H(Y^n) − Σ_{i=1}^{n} H(Yi | Y1, ..., Yi−1, X^n)
= H(Y^n) − Σ_{i=1}^{n} H(Yi | Xi) ≤ Σ_{i=1}^{n} H(Yi) − Σ_{i=1}^{n} H(Yi | Xi)
= Σ_{i=1}^{n} I(Xi, Yi) ≤ nC.

Here the second equality follows from the Chain rule, and the third equality uses the discrete memoryless property of the channel: among Y1, ..., Yi−1, X1, . . . , Xn the variable Yi depends only on Xi, which gives the equality of the conditional entropies used. (The other relations should be clear: the first and
fourth equality follows from the definition of mutual information, the first “≤”
is a consequence of the standard property of the entropy of joint distributions,
while the final inequality follows from the definition of channel capacity.) 2
Now assume that we communicate over a discrete memoryless channel of capacity C with zero error, that is, we have a code of length n with M = 2^(nR) codewords and P̄e = 0. Then

R ≤ C.
Here is the proof. Let the random variable that takes its values on the message
set M (that is its value is the index of the message mi to be sent) be denoted
by U . (We assume that U is uniformly distributed, so its entropy is log M .)
Now we can write

nR = H(U) = H(U|Y^n) + I(U, Y^n) = I(U, Y^n) = I(X^n, Y^n) ≤ Σ_{i=1}^{n} I(Xi, Yi) ≤ nC.

Here we used that if the code has error probability zero then the message U sent
is completely determined by the channel output Y n , therefore H(U |Y n ) = 0.
This explains the third equality above (the first two come from the appropriate
definitions). The fourth equality I(U, Y n ) = I(X n , Y n ) follows from considering
the coding function establishing a one-to-one correspondence between U and the
input codeword X n . (In some discussions X n is considered as a “processed”
version of U and then I(U, Y n ) ≤ I(X n , Y n ) follows, which also properly fits
the chain of inequalities above.) The last two inequalities are just proven in
Lemma 1 above. Now dividing by n we just get the required R ≤ C inequality.
2

To strengthen the above proof so that we get R ≤ C also for negligible (but
not necessarily zero) error probability codes we will need to estimate H(U |Y n )
from above in the calculation just presented, since we no longer can say it is 0
if Y n does not determine U with certainty.

To have an upper bound on H(U |Y n ) we need another lemma that is known as


Fano’s inequality.

Lemma 2 (Fano’s inequality) Let us have a discrete memoryless channel where


the input message U is uniformly distributed over 2^(nR) possible messages. After sending the codeword belonging to U through the channel we receive the output Y^n, from which we estimate U by Û. The error probability is P̄e = P(Û ≠ U) = (1/2^(nR)) Σ_i Pe,i. Then we have

H(U|Û) ≤ 1 + P̄e · nR.

Proof. Let E be the random variable defined by

E ∈ {0, 1},   E = 1 ⇔ Û ≠ U,

i.e., the indicator variable for decoding the received word with an error. Clearly, E is determined by the pair (U, Û), so H(E|U, Û) = 0.
Then using the Chain rule to expand H(E, U|Û) in two different ways, we can write

H(U|Û) = H(U|Û) + H(E|U, Û) = H(E, U|Û) = H(E|Û) + H(U|E, Û)

≤ h(P̄e) + P̄e log 2^(nR) ≤ 1 + P̄e nR,

giving the statement. For the inequalities we used that conditioning cannot increase entropy and that H(U|Û, E = 0) = 0 since E = 0 means that U = Û, thus H(U|E, Û) = P(E = 0)H(U|Û, E = 0) + P(E = 1)H(U|Û, E = 1) ≤ (1 − P̄e) · 0 + P̄e log 2^(nR). 2
In the proof below we will use the intuitively clear, but not yet explicitly stated
property of conditional entropy expressed by the following lemma. It can be
proven with a little work from Jensen's inequality.
Lemma 3 If X, Y are two random variables and Z = g(Y ) is a function of Y ,
then
H(X|Y ) ≤ H(X|Z).

Proof of the converse of the channel coding theorem. We follow the proof we
have seen for zero-error codes and plug in Fano’s inequality at the appropriate
place. (The notation is identical to that used in Fano’s inequality.)

nR = H(U ) = H(U |Û ) + I(U, Û ) ≤ 1 + P̄e nR + I(U, Û ) ≤

1 + P̄e nR + I(X n , Y n ) ≤ 1 + P̄e nR + nC.


Here the first inequality is by Fano’s inequality. The second inequality is a
consequence of Û being a function of Y n and thus I(U, Û ) = H(U ) − H(U |Û ) ≤
H(U ) − H(U |Y n ). We consider X n and U having a one-to-one correspondence
between them, so we can write H(U ) − H(U |Y n ) = H(X n ) − H(X n |Y n ). (In
a more general way we can refer to the so-called data processing inequality
that expresses that if the random variables A, B, Z form a Markov chain then
I(A, Z) ≤ I(A, B). In that case we can write the inequality even if we consider
X n simply a function, not necessarily a one-to-one function of U .)
So dividing by n we have obtained above that

R(1 − P̄e) ≤ C + 1/n.

Now letting n go to infinity we know that P̄e → 0 and 1/n → 0, so we get

R ≤ C

as stated. 2
Outline of the proof of the direct part of the Channel Coding Theorem:
The main idea of the proof is to randomly select ⌈2^(nR)⌉ codewords (for some R < C) according to the input distribution achieving the channel capacity C. Once the set of codewords is given there is an optimal decoding function belonging to it. However, for making the analysis more convenient a special decoding function is defined (which, though suboptimal, asymptotically still achieves the result we need). This is based on joint typicality. The set of jointly typical sequences (for some small ε > 0) is defined as

Aε^(n) = {(x, y) ∈ X^n × Y^n :
2^(−n(H(X)+ε)) ≤ p(x) ≤ 2^(−n(H(X)−ε)),
2^(−n(H(Y)+ε)) ≤ p(y) ≤ 2^(−n(H(Y)−ε)),
2^(−n(H(X,Y)+ε)) ≤ p(x, y) ≤ 2^(−n(H(X,Y)−ε)) }.

When a codeword is sent and y ∈ Y n is received at the output, we decide on the


codeword that is jointly typical with the received sequence, if there is a unique
such codeword. Thus we make a mistake if either
1. the received word is not jointly typical with the codeword sent, or
2. there is another codeword, which is jointly typical with the received word.
It can be shown that the average value (over all randomly selected codes) of the probability of each of these events goes to zero as n goes to infinity (using that R < C, which we ensured). Thus there must exist a code for which the error probability tends to zero, while it has the required size, that is, achieving the rate R. This proves
the statement in the direct part of the Channel Coding Theorem.

Hamming codes

Constructing codes with rate close to channel capacity is a hard task and was out of reach for the first few decades of information theory, which started in 1948 with the publication of Shannon's fundamental paper that also contained the Channel Coding Theorem. Here we just say a few words on one of the most basic (and most aesthetic) constructions of error-correcting codes, called Hamming codes. (These codes do not have rates close to channel capacity but at least show the basic ideas of error correction. We only mention the name of turbo codes, defined in recent decades, that achieve rates close to channel capacity.)
A typical trick of error detection is to use a parity bit. Having a binary code
in which every codeword contains an even number of ones, we will always be able to detect if a single error occurred, because that makes the number of ones odd. With length n we have 2^(n−1) binary sequences with this property. Receiving an erroneous sequence, we will not be able to tell which codeword was sent even if we know that exactly one error occurred (that is, only one bit was received erroneously). If, however, any two codewords differed in at least 3 bits, then, knowing that at most one error occurred, we would be able
to tell which codeword was sent as there is only one codeword within Hamming
distance 1 from the received sequence. (The Hamming distance of two sequences
is the number of positions where they differ.)
The Hamming code is a very simple and elegant construction of a binary code that has the ability of correcting one error: any two of its codewords differ in at least three positions. Here is how it is defined.
Let ℓ be a positive integer and n = 2^ℓ − 1. Let H be the ℓ × n matrix whose n columns are all the 2^ℓ − 1 nonzero binary sequences of length ℓ. A 0-1 sequence x of length n is a codeword if and only if Hx = 0 (where the arithmetic is modulo 2). Thus, by elementary linear algebra, the words of this code form an (n − ℓ)-dimensional subspace of the vector space {0, 1}^n; the number of elements in it is thus 2^(n−ℓ). And indeed, any two codewords differ in at least three coordinates. This can be seen as follows. Let c be any codeword and e a vector of errors, that is, the received word is c + e. Now this is another codeword if and only if H(c + e) = He = 0. That means that the modulo 2 sum of those columns of H that correspond to the 1's of e must be 0. Since no two columns of H are identical (and none of them is all-0), this requires that e has at least three 1's. So the minimum Hamming distance between two codewords of our code is at least three (in fact, it is three), thus the code can correct one error.
For ℓ = 1 the above construction is not yet interesting, and it is not very sophisticated for ℓ = 2 either: it contains two codewords of length n = 3, the all-0 and the all-1 vector. (Notice that these two words indeed have Hamming distance three.) The simplest non-trivial Hamming code is obtained for ℓ = 3. Then n = 2^3 − 1 = 7 and the number of codewords is 2^(7−3) = 16.
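The sketch below builds this ℓ = 3 code in a few lines and demonstrates single-error correction via the syndrome He of the received word. It is an illustration only: the particular ordering of the columns of H, the chosen codeword and the flipped position are arbitrary.

```python
import itertools
import numpy as np

ell = 3
n = 2**ell - 1                                     # n = 7
# Columns of H: all nonzero binary vectors of length 3 (one fixed ordering).
H = np.array([[(j >> (ell - 1 - i)) & 1 for j in range(1, n + 1)] for i in range(ell)])

# Codewords: all x in {0,1}^7 with Hx = 0 (mod 2).
codewords = [np.array(x) for x in itertools.product((0, 1), repeat=n)
             if not (H @ np.array(x) % 2).any()]
print("number of codewords:", len(codewords))      # expected: 16

# Single-error correction: flip one bit of a codeword, then locate it from the syndrome.
c = codewords[5]
e = np.zeros(n, dtype=int)
e[2] = 1                                           # error in position 2 (0-based)
received = (c + e) % 2
syndrome = H @ received % 2                        # equals the column of H at the error position
error_pos = next(j for j in range(n) if (H[:, j] == syndrome).all())
corrected = received.copy()
corrected[error_pos] ^= 1
print("error located at position:", error_pos)
print("corrected word equals the sent codeword:", bool((corrected == c).all()))
```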

Finally we discussed the following exercise that gives a somewhat surprising


application of Hamming codes:
In a tv show three people are playing a cooperative game to win some prize.
They each get a black or a white hat on their head. The probability of getting
a white hat is 1/2 as well as that of getting a black hat and these events are
independent for the three of them. (A coin is tossed independently three times
to decide who gets what.) Nobody sees the hat on their own head but they see
each other’s. Then the game is as follows: Each of them can guess the color
of the hat on their own head or pass. This they have to do without knowing
what the others do. (Yet they can have a strategy they agreed on before.) If all
of them pass they do not win. If anybody guesses incorrectly they also do not
win. In the remaining cases, that is when at least one of them makes a guess
and everybody who makes a guess guesses correctly, then they win.
The question is the probability with which they can win if they apply the best
possible strategy.

A more difficult question: What is the same maximum probability of winning


if we have seven players?

About the solution: Although one's first intuition may naturally be that it is impossible to beat winning probability 1/2, somewhat surprisingly this intuition is false. For three players we can give the following strategy that achieves
winning probability 3/4. If a player sees two different hats then she passes, if
she sees two hats of the same color, then she guesses that hers is the other color.
It is easy to check that out of the 8 possible configurations of black and white
hats on the head of the players they will lose only in those two cases when all
three hats are of the same color. They win in the other six cases out of the eight
thus achieving the claimed probability 3/4 of winning. (One can also prove that
this is best possible but that we did not do.)

In the seven player game it is possible to achieve even 7/8 as the probability of
winning. This is more tricky and uses the 7-length Hamming code. Identifying
the configuration given by the colors of the hats with a 7-length binary sequence
in the obvious way the players can use the following strategy. If the six bits of
such a sequence a player sees can be extended to a codeword of the 7-length
Hamming code (by an appropriate choice of the bit belonging to this player’s
hat) then the player guesses the opposite bit. In all other cases the player passes.
Then they will lose if the color configuration belongs to one of the 16 codewords
of the Hamming code. (In these cases all players will guess and all of them do
that incorrectly.) In all other cases the actual binary sequence differs in exactly
one bit from a unique codeword of the Hamming code. (This uniqueness is a
consequence of the main property of the Hamming code that any two of its
codewords differ in at least three positions.) So in these cases exactly one of the
players will guess (all others pass) and that player will guess correctly. So the probability of winning is 7 · 16/2^7 = 112/128 = 7/8. (This strategy is also optimal.)
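The following short simulation (illustrative code; it rebuilds the parity-check matrix H of the 7-bit Hamming code so that the snippet is self-contained) runs the strategy above on all 128 hat configurations and confirms the 7/8 winning probability.

```python
import itertools
import numpy as np

ell, n = 3, 7
H = np.array([[(j >> (ell - 1 - i)) & 1 for j in range(1, n + 1)] for i in range(ell)])

def is_codeword(v):
    return not (H @ np.array(v) % 2).any()

wins = 0
for config in itertools.product((0, 1), repeat=n):   # all possible hat configurations
    guesses = {}
    for player in range(n):
        for b in (0, 1):
            candidate = list(config)
            candidate[player] = b
            if is_codeword(candidate):
                guesses[player] = 1 - b               # guess the opposite of the completing bit
    # The team wins iff someone guesses and every guess made is correct.
    if guesses and all(config[p] == g for p, g in guesses.items()):
        wins += 1

print(f"winning configurations: {wins} / {2**n} = {wins / 2**n}")    # expected 112/128 = 7/8
```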

Remark: The above strategy can easily be generalized to longer Hamming codes
implying that as the number of players goes to infinity the winning probabil-
ity can approach 1. It is worth noting that the 3-player strategy can also be
considered as the same strategy using the trivial Hamming code on three bits
consisting of the codewords 000 and 111.
