IT Lectures
Preface
“There is a whole book of readymade, long and convincing, lavishly composed telegrams for all occasions. Sending such a telegram costs only twenty-five cents. You see, what gets transmitted over the telegraph is not the text of the telegram, but simply the number under which it is listed in the book, and the signature of the sender. This is quite a funny thing, reminiscent of Drugstore Breakfast #2. Everything is served up in a ready form, and the customer is totally freed from the unpleasant necessity to think, and to spend money on top of it.”
Little Golden America. Travelogue by I. Ilf and E. Petrov, 1937.
[Pre-Shannon encoding, courtesy of M. Raginsky]
Contents

Notations
I  Information measures
  4.7   Gaussian saddle point
  11.5  Sequential probability assignment: Krichevsky-Trofimov
  11.6  Individual sequence and universal prediction
  11.7  Lempel-Ziv compressor
  18.3  Bounds on C; Capacity of Stationary Memoryless Channels
  18.4  Examples of DMC
  18.5* Information Stability
V  Lossy data compression
  32.1  Binary vectors of average weights
  32.2  Shearer’s lemma & counting subgraphs
  32.3  Brégman’s Theorem
  32.4  Euclidean geometry: Bollobás-Thomason and Loomis-Whitney
Bibliography
Notations
• x+ = max{x, 0}.
• Standard big-O notations are used: e.g., for any positive sequences {an} and {bn}, an = O(bn) if there is an absolute constant c > 0 such that an ≤ c bn; an = Ω(bn) if bn = O(an); an = Θ(bn) if both an = O(bn) and an = Ω(bn); an = o(bn) or bn = ω(an) if an ≤ εn bn for some εn → 0.
Part I
Information measures
§ 1. Information measures: entropy and divergence
• A RV X is discrete if there exists a countable set X = {xj : j ≥ 1} such that Σ_{j≥1} PX(xj) = 1. The set X is called the alphabet of X, x ∈ X are atoms, and PX(x) is the probability mass function (pmf).
1.1 Entropy
Definition 1.1 (Entropy). Let X be a discrete RV with distribution PX . The entropy (or Shannon
entropy) of X is
H(X) = E[ log 1/PX(X) ] = Σ_{x∈X} PX(x) log 1/PX(x).
Definition 1.2 (Joint entropy). Let Xn = (X1 , X2 , . . . , Xn ) be a random vector with n components.
H(X^n) = H(X1, . . . , Xn) = E[ log 1/P_{X1,...,Xn}(X1, . . . , Xn) ].
Note: This is not really a new definition: Definition 1.2 is consistent with Definition 1.1 by treating
X n as a RV taking values on the product space.
Definition 1.3 (Conditional entropy).
H(X|Y) = E_{y∼PY}[ H(PX|Y=y) ] = E[ log 1/PX|Y(X|Y) ],
i.e., the entropy H(PX|Y=y) averaged over PY.
Note:
• Also write H(PX ) instead of H(X) (abuse of notation, as customary in information theory).
log2 ↔ bits
loge ↔ nats
log256 ↔ bytes
log ↔ arbitrary units, base always matches exp
Example (binary entropy): for X ∼ Bern(p),
H(X) = h(p) ≜ p log(1/p) + p̄ log(1/p̄),  where p̄ = 1 − p.
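As a quick numerical companion to the definitions above, here is a minimal Python sketch (the helper names are hypothetical, not from the notes) that evaluates H(X) for a finite pmf and the binary entropy h(p); it reproduces, e.g., h(1/4) ≈ 0.811 bits used later.

import numpy as np

def entropy(pmf, base=2.0):
    """Shannon entropy H(X) = sum_x P(x) log(1/P(x)) of a finite pmf."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                      # convention: 0 log(1/0) = 0
    return float(-(p * (np.log(p) / np.log(base))).sum())

def h(p, base=2.0):
    """Binary entropy h(p) = p log(1/p) + (1-p) log(1/(1-p))."""
    return entropy([p, 1.0 - p], base=base)

print(h(0.5))    # 1.0 bit
print(h(0.25))   # ~0.811 bits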
Review: Convexity
• Convex function: f : S → R is
– convex if f (αx + ᾱy) ≤ αf (x) + ᾱf (y) for all x, y ∈ S, α ∈ [0, 1].
– strictly convex if f (αx + ᾱy) < αf (x) + ᾱf (y) for all x 6= y ∈ S, α ∈ (0, 1).
– (strictly) concave if −f is (strictly) convex.
e.g., x ↦ x log x is strictly convex; the mean P ↦ ∫ x dP is convex but not strictly convex; the variance is concave (Q: is it strictly concave? Think of zero-mean distributions.).
Famous puzzle: A man says, “I am the average height and average weight of the population. Thus, I am an average man.” However, he is still considered a little overweight. Why?
Answer: The weight is roughly proportional to the volume, which, for us three-dimensional beings, is roughly proportional to the third power of the height. Let PX denote the distribution of the height among the population. Then by Jensen’s inequality, since x ↦ x³ is strictly convex on x > 0, we have (EX)³ < E[X³], regardless of the distribution of X.
Source: [Yos03, Puzzle 94] or online [Har].
2. (Uniform distribution maximizes entropy) For finite X , H(X) ≤ log |X |, with equality iff X is
uniform on X .
4. (Conditioning reduces entropy)
H(X|Y ) ≤ H(X), with equality iff X and Y are independent.
4. Later (Lecture 2)
5. E[ log 1/PXY(X, Y) ] = E[ log 1/(PX(X)·PY|X(Y|X)) ] = E[ log 1/PX(X) ] + E[ log 1/PY|X(Y|X) ], where the first term is H(X) and the second is H(Y|X).
6. Intuition: (X, f (X)) contains the same amount of information as X. Indeed, x 7→ (x, f (x)) is
one-to-one. Thus by 3 and 5:
H(X) = H(X, f (X)) = H(f (X)) + H(X|f (X)) ≥ H(f (X))
The bound is attained iff H(X|f (X)) = 0 which in turn happens iff X is a constant given
f (X).
7. Telescoping:
PX1 X2 ···Xn = PX1 PX2 |X1 · · · PXn |X n−1
then take the log.
Note: To give a preview of the operational meaning of entropy, let us play the game of 20
Questions. We are allowed to make queries about some unknown discrete RV X by asking yes-no
questions. The objective of the game is to guess the realized value of the RV X. For example,
X ∈ {a, b, c, d} with P [X = a] = 1/2, P [X = b] = 1/4, and P [X = c] = P [X = d] = 1/8. In this
case, we can ask “X = a?”. If not, proceed by asking “X = b?”. If not, ask “X = c?”, after
which we will know for sure the realization of X. The resulting average number of questions is
1/2 + 1/4 × 2 + 1/8 × 3 + 1/8 × 3 = 1.75, which equals H(X) in bits. An alternative strategy is to
ask “X ∈ {a, b} or {c, d}?” in the first round and then proceed to determine the value in the second round, which always requires two questions and does worse on average.
It turns out (Section 8.2) that the minimal average number of yes-no questions to pin down the
value of X is always between H(X) bits and H(X) + 1 bits. In this special case the above scheme
is optimal because (intuitively) it always splits the probability in half.
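A small sketch of this example (assuming the pmf and questioning order above) that computes the average number of questions for the sequential strategy and compares it with H(X):

import numpy as np

pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Ask "X = a?", then "X = b?", then "X = c?"; the number of questions needed
# to identify x is its position in the questioning order (capped at 3).
order = ["a", "b", "c", "d"]
questions = {x: min(i + 1, len(order) - 1) for i, x in enumerate(order)}

avg_questions = sum(p * questions[x] for x, p in pmf.items())
H = -sum(p * np.log2(p) for p in pmf.values())

print(avg_questions)  # 1.75
print(H)              # 1.75 bits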
1.2 Entropy: axiomatic characterization
One might wonder why entropy is defined as H(P) = Σ_i pi log(1/pi) and if there are other definitions.
Indeed, the information-theoretic definition of entropy is related to entropy in statistical physics.
Also, it arises as answers to specific operational problems, e.g., the minimum average number of bits
to describe a random variable as discussed above. Therefore it is fair to say that it is not pulled out
of thin air.
In his 1948 paper, Shannon also showed that entropy can be defined axiomatically, as a function
satisfying several natural conditions. Denote a probability distribution on m letters by P =
(p1 , . . . , pm ) and consider a functional Hm (p1 , . . . , pm ). If Hm obeys the following axioms:
a) Permutation invariance
c) Normalization: H2 ( 21 , 21 ) = log 2.
f) Continuity: H2 (p, 1 − p) → 0 as p → 0.
then Hm(p1, . . . , pm) = Σ_{i=1}^{m} pi log(1/pi) is the only possibility. The interested reader is referred to [CT06, p. 53] and the references therein.
Second law of thermodynamics: There does not exist a machine that operates in a cycle
(i.e. returns to its original state periodically), produces useful work and whose only
other effect on the outside world is drawing heat from a warm body. (That is, every
such machine, should expend some amount of heat to some cold body too!)1
Equivalent formulation is: There does not exist a cyclic process that transfers heat from a cold
body to a warm body (that is, every such process needs to be helped by expending some amount of
external work).
¹ Note that the reverse effect (that is, converting work into heat) is rather easy: friction is an example.
Notice that there is something annoying about the second law as compared to the first law. In
the first law there is a quantity that is conserved, and this is somehow logically easy to accept. The
second law seems a bit harder to believe in (and some engineers did not, and only their recurrent
failures to circumvent it finally convinced them). So Clausius, building on an ingenious work of
S. Carnot, figured out that there is an “explanation” to why any cyclic machine should expend
heat. He proposed that there must be some hidden quantity associated to the machine, entropy of it
(translated as “transformative content”), whose value must return to its original state. Furthermore,
under any reversible (i.e. quasi-stationary, or “very slow”) process operated on this machine the
change of entropy is proportional to the ratio of absorbed heat and the temperature of the machine:
∆S = ∆Q / T.   (1.2)
If heat Q is absorbed at temperature Thot then to return to the original state, one must return some
Q0 amount of heat, where Q0 can be significantly smaller than Q but never zero if Q0 is returned
at temperature 0 < Tcold < Thot .2 Further logical arguments can convince one that for irreversible
cyclic process the change of entropy at the end of the cycle can only be positive, and hence entropy
cannot reduce.
There were a great many experimentally verified consequences of the second law. However,
what is surprising is that the mysterious entropy did not have any formula for it (unlike, say, energy),
and thus had to be computed indirectly on the basis of relation (1.2). This was changed with the
revolutionary work of Boltzmann and Gibbs, who provided a microscopic explanation of the second
law based on statistical physics principles and showed that, e.g., for a system of n independent
particles (as in ideal gas) the entropy of a given macro-state can be computed as
S = k n Σ_{j=1}^{ℓ} pj log(1/pj),
where k is the Boltzmann constant, and we assumed that each particle can only be in one of ℓ molecular states (e.g. spin up/down, or if we quantize the phase volume into ℓ subcubes) and pj is the fraction of particles in the j-th molecular state.
Submodularity is similar to concavity, in the sense that “adding elements gives diminishing returns”.
Indeed consider T 0 ⊂ T and b 6∈ T . Then
f (T ∪ b) − f (T ) ≤ f (T 0 ∪ b) − f (T 0 ) .
Proof. Let A = X_{T1\T2}, B = X_{T1∩T2}, C = X_{T2\T1}. Then we need to show that H(A, B) + H(B, C) ≥ H(A, B, C) + H(B).
This follows from a simple chain
T1 ⊂ T2 =⇒ H(XT1 ) ≤ H(XT2 ) .
So fixing n, let us denote by Γn the set of all non-negative, monotone, submodular set-functions on [n]. Note that via an obvious enumeration of all non-empty subsets of [n], Γn is a closed convex cone in ℝ₊^{2ⁿ−1}. Similarly, let us denote by Γn* the set of all set-functions corresponding to distributions on X^n. Let us also denote by Γ̄n* the closure of Γn*. It is not hard to show, cf. [ZY97], that Γ̄n* is also a closed convex cone and that
Γn* ⊂ Γ̄n* ⊂ Γn.
The astonishing result of [ZY98] is that Γ̄n* ⊊ Γn for n ≥ 4.
This follows from the fundamental new information inequality not implied by the submodularity of entropy (and thus called a non-Shannon inequality). Namely, [ZY98] showed that for any 4-tuple of discrete random variables:
I(X3; X4) − I(X3; X4|X1) − I(X3; X4|X2) ≤ (1/2) I(X1; X2) + (1/4) I(X1; X3, X4) + (1/4) I(X2; X3, X4)
(to translate into an entropy inequality, see Theorem 2.4).
(1/n) H̄n ≤ · · · ≤ (1/k) H̄k ≤ · · · ≤ H̄1.   (1.10)
Furthermore, the sequence H̄k is increasing and concave in the sense of decreasing slope:
Thus, it is clear that (1.11) implies (1.10) since increasing m by one adds a smaller element to the
average. To prove (1.11) observe that from submodularity
Now average this inequality over all n! permutations of indices {1, . . . , n} to get
as claimed by (1.11).
Alternative proof: Notice that by “conditioning decreases entropy” we have
Theorem 1.4 (Shearer’s Lemma). Let X^n be a discrete n-dimensional RV and let S ⊆ [n] be a random variable independent of X^n and taking values in subsets of [n]. Then
H(X_S | S) ≥ H(X^n) · min_{i∈[n]} P[i ∈ S].   (1.12)
Remark 1.1. In the special case where S is uniform over all subsets of cardinality k, (1.12) reduces to Han’s inequality (1/n) H(X^n) ≤ (1/k) H̄k. The case of n = 3 and k = 2 can be used to give an entropy proof of the following well-known geometric result relating the size of a 3-D object to those of its 2-D projections: place N points in ℝ³ arbitrarily. Let N1, N2, N3 denote the number of distinct points projected onto the xy-, xz- and yz-planes, respectively. Then N1 N2 N3 ≥ N².
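A quick numerical sanity check of the projection inequality N1 N2 N3 ≥ N² on random integer point sets (a sketch, not part of the original notes):

import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    pts = {tuple(p) for p in rng.integers(0, 4, size=(50, 3))}
    N = len(pts)
    N1 = len({(x, y) for x, y, z in pts})   # projection onto the xy-plane
    N2 = len({(x, z) for x, y, z in pts})   # projection onto the xz-plane
    N3 = len({(y, z) for x, y, z in pts})   # projection onto the yz-plane
    assert N1 * N2 * N3 >= N ** 2
    print(N, N1 * N2 * N3)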
Proof. We will prove an equivalent (by taking a suitable limit) version: If C = (S1 , . . . , SM ) is a list
(possibly with repetitions) of subsets of [n] then
Σ_j H(X_{Sj}) ≥ H(X^n) · min_i deg(i),   (1.13)
where deg(i) ≜ #{j : i ∈ Sj}. Let us call C a chain if all subsets can be rearranged so that S1 ⊆ S2 ⊆ · · · ⊆ SM. For a chain, (1.13) is trivial, since the minimum on the right-hand side is either zero (if SM ≠ [n]) or equals the multiplicity of SM in C,³ in which case we have
Σ_j H(X_{Sj}) ≥ H(X_{SM}) · #{j : Sj = SM} = H(X^n) · min_i deg(i).
For the case of C not a chain, consider a pair of sets S1 , S2 that are not related by inclusion and
replace them in the collection with S1 ∩ S2 , S1 ∪ S2 . Submodularity (1.3) implies that the sum
on the left-hand side of (1.13) does not decrease under this replacement, while the values deg(i) are not changed. Since the total number of pairs that are not related by inclusion strictly decreases with each replacement, we must eventually arrive at a chain, for which (1.13) has already been shown.
³ Note that, consequently, for X^n without constant coordinates, and if C is a chain, (1.13) is only tight if C consists of only ∅ and [n] (with multiplicities). Thus if the degrees deg(i) are known and non-constant, then (1.13) can be improved, cf. [MT10].
Note: Han’s inequality holds for any submodular set-function. Shearer’s lemma holds for any
submodular set-function that is also non-negative.
Example: Another submodular set-function is
S 7→ I(XS ; XS c ) .
1.6 Divergence
Review: Measurability
In this course we will assume that all alphabets are standard Borel spaces. Some of the
nice properties of standard Borel spaces:
• All complete separable metric spaces, endowed with Borel σ-algebras are standard
Borel. In particular, countable alphabets and Rn and R∞ (space of sequences) are
standard Borel.
• If Xi, i = 1, . . . are standard Borel, then so is ∏_{i=1}^{∞} Xi.
where we agree:
(1) 0 · log(0/0) = 0;
(2) if there exists a with Q(a) = 0 but P(a) > 0, then D(P‖Q) = ∞.
Notes:
• (Radon-Nikodym theorem) Recall that for two measures P and Q, we say P is absolutely continuous w.r.t. Q (denoted by P ≪ Q) if Q(E) = 0 implies P(E) = 0 for all measurable E. If P ≪ Q, then there exists a function f : X → ℝ+ such that for any measurable set E,
P(E) = ∫_E f dQ.   [change of measure]
This f is denoted dP/dQ.
  – For discrete distributions, we can just take dP/dQ(x) to be the ratio of pmfs.
  – For continuous distributions, we can take dP/dQ(x) to be the ratio of pdfs.
• (Infinite values) D(P‖Q) can be ∞ also when P ≪ Q, but the two cases of D(P‖Q) = +∞ are consistent since D(P‖Q) = sup_Π D(P_Π‖Q_Π), where Π is a finite partition of the underlying space A (Theorem 3.7).
• (Asymmetry) D(P‖Q) ≠ D(Q‖P). Therefore divergence is not a distance. However, asymmetry can be very useful. Example: P(H) = P(T) = 1/2, Q(H) = 1. Upon observing HHHHHHH, one tends to believe it is Q but can never be absolutely sure; upon observing HHT, one knows for sure it is P. Indeed, D(P‖Q) = ∞, D(Q‖P) = 1 bit (see Lecture 13).
• (Pinsker’s inequality) There are many other measures for dissimilarity, e.g., total variation (L1-distance)
TV(P, Q) ≜ sup_E |P[E] − Q[E]| = (1/2) ∫ |dP − dQ|   (1.14)
         = (1/2) Σ_x |P(x) − Q(x)|   (discrete)
         = (1/2) ∫ |p(x) − q(x)| dx   (continuous)
Total variation is symmetric and in fact a distance. The famous Pinsker’s (or Pinsker-Csiszár) inequality relates D and TV (see Theorem 6.5):
TV(P, Q) ≤ √( D(P‖Q) / (2 log e) ).   (1.15)
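A minimal numerical check of (1.15) on pairs of Bernoulli distributions (a sketch; the divergence is computed in nats, so the bound reads TV ≤ √(D/2)):

import numpy as np

def kl(p, q):
    """D(Bern(p) || Bern(q)) in nats."""
    terms = []
    for a, b in ((p, q), (1 - p, 1 - q)):
        if a > 0:
            terms.append(a * np.log(a / b))
    return sum(terms)

def tv(p, q):
    """Total variation between Bern(p) and Bern(q)."""
    return abs(p - q)

for p, q in [(0.5, 0.4), (0.9, 0.5), (0.01, 0.3)]:
    assert tv(p, q) <= np.sqrt(kl(p, q) / 2) + 1e-12
    print(p, q, tv(p, q), np.sqrt(kl(p, q) / 2))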
• (Other divergences) A general class of divergence-like measures was proposed by Csiszár. Fixing a convex function f : ℝ+ → ℝ with f(1) = 0, we define the f-divergence Df as
Df(P‖Q) ≜ EQ[ f(dP/dQ) ].   (1.16)
This encompasses total variation, χ²-distance, Hellinger distance, Tsallis divergence, etc. Inequalities between various f-divergences such as (1.15) were once an active field of research. A complete solution was obtained by Harremoës and Vajda [HV11], who gave a simple method for obtaining the best possible inequalities between any pair of f-divergences.
[Figure: plots of the binary divergence d(p‖q) as a function of its arguments, with asymptotes −log q and −log q̄.]
D(N(m1, σ1²)‖N(m0, σ0²)) = (1/2) log(σ0²/σ1²) + (1/2) [ (m1 − m0)²/σ0² + σ1²/σ0² − 1 ] log e   (1.17)
Example (Complex Gaussian): A = ℂ. The pdf of Nc(m, σ²) is (1/(πσ²)) e^{−|x−m|²/σ²}, or equivalently,
Nc(m, σ²) = N( (Re(m), Im(m)), diag(σ²/2, σ²/2) ).
Then
D(Nc(m1, σ1²)‖Nc(m0, σ0²)) = log(σ0²/σ1²) + [ |m1 − m0|²/σ0² + σ1²/σ0² − 1 ] log e,
which follows from (1.18).
More generally, for the vector space A = ℂ^k, assuming det Σ0 ≠ 0,
D(Nc(m1, Σ1)‖Nc(m0, Σ0)) = log det Σ0 − log det Σ1 + (m1 − m0)^H Σ0^{−1} (m1 − m0) log e + tr(Σ0^{−1} Σ1 − I) log e.
Note: The definition of D(P‖Q) extends verbatim to measures P and Q (not necessarily probability measures), in which case D(P‖Q) can be negative. A sufficient condition for D(P‖Q) ≥ 0 is that P is a probability measure and Q is a sub-probability measure, i.e., ∫ dQ ≤ 1 = ∫ dP.
In particular, if X^k has probability density function (pdf) p, then h(X^k) = E[ log 1/p(X^k) ]; otherwise h(X^k) = −∞. Conditional differential entropy h(X^k|Y) ≜ E[ log 1/p_{X^k|Y}(X^k|Y) ], where p_{X^k|Y} is a conditional pdf.
Example: Gaussian. For X ∼ N(µ, σ²),
h(X) = (1/2) log(2πeσ²).   (1.20)
More generally, for X ∼ N(µ, Σ) in ℝ^d,
h(X) = (1/2) log((2πe)^d det Σ).   (1.21)
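The closed form (1.20) can be checked numerically; the following sketch estimates h(X) = E[log 1/p(X)] by Monte Carlo for X ∼ N(0, σ²) and compares with (1/2) log(2πeσ²) (in nats).

import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
x = rng.normal(0.0, sigma, size=1_000_000)

# h(X) = E[log 1/p(X)], estimated from samples of X ~ N(0, sigma^2)
log_pdf = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
h_mc = -log_pdf.mean()
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

print(h_mc, h_closed)   # both ~2.11 nats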
Warning: Even for continuous random variable X, h(X) can be positive, negative, take values
of ±∞ or even undefined.4
Nevertheless, differential entropy shares many properties with the usual Shannon entropy:
Theorem 1.6 (Properties of differential entropy). Assume that all differential entropies appearing
below exist and are finite (in particular all RVs have pdfs and conditional pdfs).
1. (Uniform distribution maximizes differential entropy) If P[X n ∈ S] = 1 then h(X n ) ≤
log Leb(S),5 with equality iff X n is uniform on S.
⁴ For an example, consider a piecewise-constant pdf taking value e^{(−1)^n n} on the n-th interval of width ∆n = (c/n²) e^{−(−1)^n n}.
⁵ Here Leb(S) is the same as the volume vol(S).
2. (Scaling and shifting) h(X^k + x) = h(X^k), h(αX^k) = h(X^k) + k log|α|, and for invertible A, h(AX^k) = h(X^k) + log|det A|.
3. (Conditioning reduces differential entropy) h(X|Y ) ≤ h(X) (here Y could be arbitrary, e.g.
discrete)
4. (Chain rule)
h(X^n) = Σ_{k=1}^{n} h(Xk | X^{k−1}).
§ 2. Information measures: mutual information
Proof. Let ϕ(x) ≜ x log x, which is strictly convex, and use Jensen’s inequality:
D(P‖Q) = Σ_X P(x) log(P(x)/Q(x)) = Σ_X Q(x) ϕ(P(x)/Q(x)) ≥ ϕ( Σ_X Q(x) P(x)/Q(x) ) = ϕ(1) = 0.
Definition 2.1. A conditional probability distribution (aka random transformation, transition probability kernel, Markov kernel, channel) K(·|·) has two arguments: the first argument is a measurable subset of Y, the second argument is an element of X. It must satisfy:
1. for every x ∈ X, K(·|x) is a probability measure on Y;
2. for every measurable A ⊆ Y, the map x ↦ K(A|x) is measurable.
In this case we will say that K acts from X to Y. In fact, we will abuse notation and write PY|X instead of K to suggest what spaces X and Y are.¹ Furthermore, if X and Y are connected by the random transformation PY|X we will write X −(PY|X)→ Y.
Example:
2. decoupled system: Y ⊥⊥ X ⟺ PY|X=x = PY.
3. additive noise (convolution): Y = X + Z with Z ⊥⊥ X ⟺ PY|X=x = P_{x+Z}.
We will use the following notations extensively:
• Multiplication: take PX and feed it into PY|X (X −(PY|X)→ Y) to get PXY = PX PY|X:
PXY(x, y) = PY|X(y|x) PX(x).
We will also write PX −(PY|X)→ PY.
Definition 2.2 (Conditional divergence).
D(PY|X‖QY|X|PX) = E_{x∼PX}[ D(PY|X=x‖QY|X=x) ]
               = Σ_{x∈X} PX(x) D(PY|X=x‖QY|X=x)   (X discrete)
               = ∫ pX(x) D(PY|X=x‖QY|X=x) dx   (X continuous)
Pictorially: [PX is fed into the two channels PY|X and QY|X, producing PY and QY respectively.]
Pictorially:
[PX and QX are fed into the same channel PY|X, producing PY and QY] ⟹ D(PX‖QX) ≥ D(PY‖QY).
Proof. We only illustrate these results for the case of finite alphabets. General case follows by doing
a careful analysis of Radon-Nikodym derivatives, introduction of regular branches of conditional
probability etc. For certain cases (e.g. separable metric spaces), however, we can simply discretize
alphabets and take the granularity of discretization to zero. This method will become clearer in
Lecture 3, once we understand continuity of D.
1. E_{x∼PX}[ D(PY|X=x‖QY|X=x) ] = E_{XY∼PX PY|X}[ log (PY|X PX)/(QY|X PX) ]
2. Disintegration: E_{XY}[ log PXY/QXY ] = E_{XY}[ log PY|X/QY|X + log PX/QX ]
5. Inequality follows from monotonicity. To get conditions for equality, notice that by the chain
rule for D:
Corollary 2.1.
D(P_{X1···Xn}‖Q_{X1} · · · Q_{Xn}) ≥ Σ D(P_{Xi}‖Q_{Xi}), with equality iff P_{X^n} = ∏_{j=1}^{n} P_{Xj}.
Note: In general we can have D(PXY kQXY ) ≶ D(PX kQX ) + D(PY kQY ). For example, if X = Y
under P and Q, then D(PXY kQXY ) = D(PX kQX ) < 2D(PX kQX ). Conversely, if PX = QX and
PY = QY but PXY 6= QXY we have D(PXY kQXY ) > 0 = D(PX kQX ) + D(PY kQY ).
Corollary 2.2. Y = f (X) ⇒ D(PY kQY ) ≤ D(PX kQX ), with equality if f is 1-1.
Note: D(PY kQY ) = D(PX kQX ) 6⇒ f is 1-1. Example: PX = Gaussian, QX = Laplace, Y = |X|.
Note: This method will be very useful in studying large deviations (Section 13.1 and Section 14.1) when applied to an event E which is highly likely under P but highly unlikely under Q.
Note:
• Intuition: I(X; Y ) measures the dependence between X and Y , or, the information about X
(resp. Y ) provided by Y (resp. X)
• I(X; Y ) is a functional of the joint distribution PXY , or equivalently, the pair (PX , PY |X ).
Proof. 1. I(X;Y) = E[ log PXY/(PX PY) ] = E[ log PY|X/PY ] = E[ log PX|Y/PX ].
2. Apply the data-processing inequality twice to the map (x, y) → (y, x) to get D(PXY‖PX PY) = D(PY X‖PY PX).
4. We will use the data-processing property of mutual information (to be proved shortly, see
Theorem 2.5). Consider the chain of data processing: (x, y) 7→ (f (x), y) 7→ (f −1 (f (x)), y).
Then
I(X; Y ) ≥ I(f (X); Y ) ≥ I(f −1 (f (X)); Y ) = I(X; Y )
5. Consider f (X1 , X2 ) = X1 .
2. If X is discrete, then
I(X; Y ) = H(X) − H(X|Y ).
If both X and Y are discrete, then
3. If X, Y are real-valued vectors, have joint pdf and all three differential entropies are finite
then
I(X; Y ) = h(X) + h(Y ) − h(X, Y )
If X has marginal pdf pX and conditional pdf pX|Y(x|y), then I(X;Y) = h(X) − h(X|Y).
4. If X or Y are discrete then I(X; Y ) ≤ min (H(X), H(Y )), with equality iff H(X|Y ) = 0 or
H(Y |X) = 0, i.e., one is a deterministic function of the other.
Proof. 1. By definition, I(X;X) = D(PX|X‖PX|PX) = E_{x∼PX}[ D(δx‖PX) ]. If PX is discrete, then D(δx‖PX) = log 1/PX(x) and I(X;X) = H(X). If PX is not discrete, then let A = {x : PX(x) > 0} denote the set of atoms of PX. Let ∆ = {(x, x) : x ∉ A} ⊂ X × X. Then PX,X(∆) = PX(A^c) > 0, but since
(PX × PX)(E) ≜ ∫_X PX(dx1) ∫_X PX(dx2) 1{(x1, x2) ∈ E}
assigns zero measure to ∆, we have P_{X,X} ⋢ PX × PX and hence I(X;X) = ∞.
4. Follows from 2.
Corollary 2.4 (Conditioning reduces entropy). X discrete: H(X|Y ) ≤ H(X), with equality iff
X⊥ ⊥Y.
Intuition: The amount of entropy reduction = mutual information
Example: It is important to note that conditioning reduces entropy on average, not per realization. Let X = U OR Y, where U, Y i.i.d.∼ Bern(1/2). Then X ∼ Bern(3/4) and H(X) = h(1/4) < 1 bit = H(X|Y = 0), i.e., conditioning on Y = 0 increases entropy. But on average, H(X|Y) = P[Y = 0] H(X|Y = 0) + P[Y = 1] H(X|Y = 1) = 1/2 bit < H(X), by the strong concavity of h(·).
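A short numerical rendering of this example (a sketch; entropies in bits):

import numpy as np

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

# X = U OR Y with U, Y i.i.d. Bern(1/2)
H_X = h(0.25)                 # X ~ Bern(3/4), so H(X) = h(1/4) ~ 0.811 bits
H_X_given_Y0 = h(0.5)         # given Y = 0, X = U ~ Bern(1/2): 1 bit
H_X_given_Y1 = 0.0            # given Y = 1, X = 1 deterministically
H_X_given_Y = 0.5 * H_X_given_Y0 + 0.5 * H_X_given_Y1   # = 0.5 bit

print(H_X, H_X_given_Y0, H_X_given_Y)   # 0.811..., 1.0, 0.5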
Note: Information, entropy and Venn diagrams:
1. The following Venn diagram illustrates the relationship between entropy, conditional entropy,
joint entropy, and mutual information.
[Venn diagram for two random variables, with regions labeled H(X), H(Y), and H(X, Y).]
2. If you do the same for 3 variables, you will discover that the triple intersection corresponds to
H(X1 ) + H(X2 ) + H(X3 ) − H(X1 , X2 ) − H(X2 , X3 ) − H(X1 , X3 ) + H(X1 , X2 , X3 ) (2.2)
which is sometimes denoted by I(X; Y ; Z). It can be both positive and negative (why?).
3. In general, one can treat random variables as sets (so that Xi corresponds to a set Ei and the pair (X1, X2) corresponds to E1 ∪ E2). Then we can define a unique signed measure µ on the finite algebra generated by these sets so that every information quantity is found by replacing
I/H → µ;  “;” → ∩;  “,” → ∪;  “|” → \ .
As an example, we have
H(X1 |X2 , X3 ) = µ(E1 \ (E2 ∪ E3 )) , (2.3)
I(X1 , X2 ; X3 |X4 ) = µ(((E1 ∪ E2 ) ∩ E3 ) \ E4 ) . (2.4)
By inclusion-exclusion, quantity (2.2) corresponds to µ(E1 ∩ E2 ∩ E3 ), which explains why µ
is not necessarily a positive measure. For an extensive discussion, see [CK81a, Chapter 1.3].
Proof. WLOG, by shifting and scaling if necessary, we can assume EX = EY = 0 and EX² = EY² = 1. Then ρ = E[XY]. By joint Gaussianity, Y = ρX + Z for some Z ∼ N(0, 1 − ρ²) ⊥⊥ X. Then using the divergence formula for Gaussians (1.17), we get
I(X;Y) = D(PY|X‖PY|PX)
       = E[ D(N(ρX, 1 − ρ²)‖N(0, 1)) ]
       = E[ (1/2) log 1/(1 − ρ²) + (log e/2) ((ρX)² + 1 − ρ² − 1) ]
       = (1/2) log 1/(1 − ρ²).
Alternatively, use the differential entropy representation (Theorem 2.4) and the entropy formula (1.20) for Gaussians:
I(X;Y) = h(Y) − h(Y|X) = h(Y) − h(Z) = (1/2) log(2πe) − (1/2) log(2πe(1 − ρ²)) = (1/2) log 1/(1 − ρ²).
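The closed-form answer can be cross-checked numerically (a sketch, not part of the original notes): draw X ∼ N(0, 1) and average the Gaussian divergence formula (1.17) over X, comparing with (1/2) log 1/(1 − ρ²) in nats.

import numpy as np

rng = np.random.default_rng(0)
rho = 0.6

# Closed form: I(X;Y) = 0.5 log 1/(1 - rho^2) (nats)
closed = 0.5 * np.log(1.0 / (1.0 - rho**2))

# Monte Carlo over X ~ N(0,1) of D(N(rho*X, 1-rho^2) || N(0,1)),
# using the Gaussian divergence formula (1.17) in nats.
x = rng.normal(size=1_000_000)
d = 0.5 * np.log(1.0 / (1.0 - rho**2)) + 0.5 * ((rho * x)**2 + (1.0 - rho**2) - 1.0)
print(closed, d.mean())   # both ~0.223 nats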
Note: Similar to the role of mutual information, the correlation coefficient also measures, in a certain sense, the dependency between random variables which are real-valued (or, more generally, live on an inner-product space). However, mutual information is invariant to bijections and more general: it can be defined not just for numerical random variables, but also for apples and oranges.
Example: Let X ∼ Bern(1/2) and N ∼ Bern(δ) be independent, and Y = X ⊕ N. Then
I(X;Y) = log 2 − h(δ).
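A direct numerical evaluation of this example (a sketch; the helper names are hypothetical): build the joint pmf of (X, Y), compute I(X;Y) from its definition, and compare with log 2 − h(δ) in bits.

import numpy as np
from itertools import product

def mi_bsc(delta):
    """I(X;Y) for X ~ Bern(1/2), Y = X xor N, N ~ Bern(delta), in bits."""
    pxy = {(x, x ^ n): 0.5 * (delta if n else 1 - delta)
           for x, n in product((0, 1), repeat=2)}
    px = {x: 0.5 for x in (0, 1)}
    py = {y: sum(pxy[(x, y)] for x in (0, 1)) for y in (0, 1)}
    return sum(p * np.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

def h(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for delta in (0.11, 0.25):
    print(mi_bsc(delta), 1 - h(delta))   # the two columns agree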
where the product of two random transformations is (PX|Z=z PY |Z=z )(x, y) , PX|Z (x|z)PY |Z (y|z),
under which X and Y are independent conditioned on Z.
b) I(X;Y|Z) ≤ I(X;Y)
2. P_{XYZ}/(P_{XY} P_Z) = ( P_{XZ}/(P_X P_Z) ) · ( P_{Y|XZ}/P_{Y|X} )
Example: Let X, Z i.i.d.∼ Bern(1/2) and Y = X ⊕ Z. Then I(X;Y) = 0 since X ⊥⊥ Y, but
I(X;Y|Z) = I(X; X ⊕ Z | Z) = H(X) = 1 bit.
Remark 2.3 (Data processing for mutual information via data processing of divergence). We
proved data processing for mutual information in Theorem 2.5 using Kolmogorov’s identity. In fact,
data processing for mutual information is implied by the data processing for divergence:
where note that for each x, we have PY|X=x −(PZ|Y)→ PZ|X=x and PY −(PZ|Y)→ PZ. Therefore, if we have a bi-variate functional of distributions D(P‖Q) which satisfies data processing, then we can define an “M.I.-like” quantity via I_D(X;Y) ≜ D(PY|X‖PY|PX) ≜ E_{x∼PX}[ D(PY|X=x‖PY) ], which will satisfy data processing on Markov chains. A rich class of examples arises by taking D = Df (an f-divergence, defined in (1.16)). That f-divergences satisfy data processing will be proved in Remark 4.2.
2.5 Strong data-processing inequalities
For many random transformations PY |X , it is possible to improve the data-processing inequality (2.1):
For any PX , QX we have
D(PY kQY ) ≤ ηKL D(PX kQX ) ,
where ηKL < 1 and depends on the channel PY |X only. Similarly, this gives an improvement in the
data-processing inequality for mutual information: for any PU,X with U → X → Y we have I(U;Y) ≤ ηKL I(U;X).
For example, for PY |X = BSC(δ) we have ηKL = (1 − 2δ)2 . Strong data-processing inequalities
quantify the intuitive observation that noise inside the channel PY |X must reduce the information
that Y carries about the data U , regardless of how smart the mapping U 7→ X is.
This is an active area of research, see [PW17] for a short summary.
Notes:
1. Given another kernel QY|X specified via ρ′(y|x) and µ2′ we may first replace µ2 and µ2′ via µ2″ = µ2 + µ2′ and thus assume that both PY|X and QY|X are specified in terms of the same dominating measure µ2″. (This modifies ρ(y|x) to ρ(y|x) dµ2/dµ2″ (y).)
2. Given two kernels PY|X and QY|X specified in terms of the same dominating measure µ2 and functions ρP(y|x) and ρQ(y|x), respectively, we may set
dPY|X/dQY|X ≜ ρP(y|x)/ρQ(y|x)
outside of {ρQ = 0}. When PY|X=x ≪ QY|X=x the above gives a version of the Radon-Nikodym derivative, which is automatically measurable in (x, y).
3. Given QY specified as dQY = q(y) dµ2, we may set
A0 = { x : ∫_{ {q=0} } ρ(y|x) µ2(dy) = 0 }.
Then PXY ≪ PX QY ⟺ PX[A0] = 1.
So, while our agreement resolves the two measurability problems above, it introduces a new
one. Indeed, given a joint distribution PXY on standard Borel spaces, it is always true that one
can extract a conditional distribution PY |X satisfying Definition 2.1 (this is called disintegration).
However, it is not guaranteed that PY |X will satisfy Agreement A1. To work around this issue as
well, we add another agreement:
Agreement A2: All joint distributions PXY are specified by means of data: µ1, µ2 — σ-finite measures on X and Y, respectively, and a measurable function λ(x, y) such that
PXY(E) ≜ ∫_E λ(x, y) µ1(dx) µ2(dy).
Notes:
1. Again, given a finite or countable collection of joint distributions PXY , QX,Y , . . . satisfying A2
we may without loss of generality assume they are defined in terms of a common µ1 , µ2 .
2. Given PXY satisfying A2 we can disintegrate it into a conditional (satisfying A1) and a marginal:
PY|X(A|x) = ∫_A ρ(y|x) µ2(dy),  ρ(y|x) ≜ λ(x, y)/p(x),   (2.7)
PX(A) = ∫_A p(x) µ1(dx),  p(x) ≜ ∫_Y λ(x, η) µ2(dη).   (2.8)
Remark 2.4. The first problem can also be resolved with the help of Doob’s version of the Radon-Nikodym theorem [Ç11, Chapter V.4, Theorem 4.44]: if the σ-algebra on Y is separable (satisfied whenever Y is a Polish space, for example) and PY|X=x ≪ QY|X=x, then there exists a jointly measurable version of the Radon-Nikodym derivative
(x, y) ↦ (dPY|X=x/dQY|X=x)(y).
§ 3. Sufficient statistic. Continuity of divergence and mutual information
• Let PT|X be some probability kernel. Let P_T^θ ≜ PT|X ∘ P_X^θ be the induced distribution on T for each θ.
We say that T is a sufficient statistic (s.s.) of X for θ if there exists a transition probability kernel PX|T so that P_X^θ PT|X = P_T^θ PX|T, i.e., PX|T can be chosen not to depend on θ.
Note:
• With T known, one can forget X (T contains all the information that is sufficient to make inference about θ). This is because X can be simulated from T alone without knowing θ, hence X is useless.
• θ need not be a random variable (the definition does not involve any distribution on θ)
1. T is a s.s. of X for θ.
2. ∀Pθ , θ → T → X.
4. ∀Pθ , I(θ; X) = I(θ; T ), i.e., data processing inequality for M.I. holds with equality.
Theorem 3.2 (Fisher’s factorization criterion). For all θ ∈ Θ, let PXθ have a density pθ with respect
to a measure µ (e.g., discrete – pmf, continuous – pdf ). Let T = T (X) be a deterministic function
of X. Then T is a s.s. of X for θ iff
pθ (x) = gθ (T (x))h(x)
Proof. We only give the proof in the discrete case (for the continuous case, replace Σ by ∫ dµ). Let t = T(x).
“⇒”: Suppose T is a s.s. of X for θ. Then pθ(x) = Pθ(X = x) = Pθ(X = x, T = t) = Pθ(X = x|T = t) Pθ(T = t) = P(X = x|T = T(x)) · Pθ(T = T(x)), where the first factor is h(x) (free of θ) and the second is gθ(T(x)).
“⇐”: Suppose the factorization holds. Then Pθ(X = x|T = t) = pθ(x) / Σ_{x′: T(x′)=t} pθ(x′) = gθ(t)h(x) / (gθ(t) Σ_{x′: T(x′)=t} h(x′)) = h(x) / Σ_{x′: T(x′)=t} h(x′), which is free of θ.
Example:
1. Normal mean model. Let θ ∈ ℝ and observations Xi ind.∼ N(θ, 1), i ∈ [n]. Then the sample mean X̄ = (1/n) Σ_j Xj is a s.s. of X^n for θ. [Exercise: verify that P_{X^n}^θ factorizes.]
2. Coin flips. Let Bi i.i.d.∼ Bern(θ). Then Σ_{i=1}^{n} Bi is a s.s. of B^n for θ.
3. Uniform distribution. Let Ui i.i.d.∼ uniform[0, θ]. Then max_{i∈[n]} Ui is a s.s. of U^n for θ.
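A numerical illustration of characterization 4 above for the coin-flip example (a sketch; θ is given an artificial two-point prior just so that the mutual informations can be computed):

import numpy as np
from itertools import product
from math import comb

def mi(joint):
    """Mutual information (bits) from a dict {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    return sum(p * np.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

n, prior = 4, {0.3: 0.5, 0.7: 0.5}     # theta uniform on {0.3, 0.7}

joint_full = {(th, xs): w * np.prod([th if x else 1 - th for x in xs])
              for th, w in prior.items()
              for xs in product((0, 1), repeat=n)}
joint_stat = {(th, k): w * comb(n, k) * th**k * (1 - th)**(n - k)
              for th, w in prior.items() for k in range(n + 1)}

print(mi(joint_full), mi(joint_stat))   # equal: T = sum(B_i) is sufficient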
which is precisely the definition of s.s. when θ ∈ {0, 1}. This example explains the condition for
equality in the data-processing for divergence:
[Diagram: PX and QX are fed into the same channel PY|X, producing PY and QY.]
Then, assuming D(PY‖QY) < ∞, we have
D(PX‖QX) = D(PY‖QY) + D(PX|Y‖QX|Y|PY) ≥ D(PY‖QY),
with equality iff D(PX|Y‖QX|Y|PY) = 0, which is equivalent to Y being a s.s. for testing PX vs QX, as desired.
3.2 Geometric interpretation of mutual information
Mutual information can be understood as the “weighted distance” from the conditional distribution
to the output distribution. For example, for discrete X:
I(X;Y) = D(PY|X‖PY|PX) = Σ_x D(PY|X=x‖PY) PX(x).
Theorem 3.3 (Golden formula). For all QY such that D(PY‖QY) < ∞,
I(X;Y) = D(PY|X‖QY|PX) − D(PY‖QY);
in particular, I(X;Y) = min_{QY} D(PY|X‖QY|PX), achieved at QY = PY.
Note: This representation is useful to bound mutual information from above by choosing some Q.
Theorem 3.4 (mutual information as distance to product distributions). I(X;Y) = min_{QX,QY} D(PXY‖QX QY).
Interpretation: The most general graphical model for the triplet (X, Y, Z) is a 3-clique (triangle).
What is the information flow on the edge X → Z? To answer, notice that removing this edge
restricts possible joint distributions to a Markov chain X → Y → Z. Thus, it is natural to ask what
is the minimum distance between a given PX,Y,Z and the set of all distributions QX,Y,Z satisfying
the Markov chain constraint. By the above calculation, optimal QX,Y,Z = PY PX|Y PZ|Y and hence
the distance is I(X; Z|Y ). It is natural to interpret this number as the information flowing on the
edge X → Z.
3.3 Variational characterizations of divergence:
Donsker-Varadhan
Why variational characterization (sup- or inf-representation): F (x) = supλ∈Λ fλ (x)
Theorem 3.5 (Donsker-Varadhan). Let P, Q be probability measures on X and let C denote the set of functions f : X → ℝ such that EQ[exp{f(X)}] < ∞. If D(P‖Q) < ∞ then for every f ∈ C the expectation EP[f(X)] exists and furthermore
D(P‖Q) = sup_{f∈C} { EP[f(X)] − log EQ[exp{f(X)}] }.   (3.1)
Proof. “≤”: take f = log dP/dQ.
“≥”: Fix f ∈ C and define a probability measure Qf (a tilted version of Q) via
Qf(dx) ≜ exp{f(x)} Q(dx) / ∫_X exp{f(x)} Q(dx),
or equivalently,
Remark 3.1. 1. What is Donsker-Varadhan good for? By setting f(x) = ε·g(x) with ε ≪ 1 and linearizing exp and log we can see that when D(P‖Q) is small, expectations under P can be approximated by expectations over Q (change of measure): EP[g(X)] ≈ EQ[g(X)]. This
holds for all functions g with finite exponential moment under Q. Total variation distance
provides a similar bound, but for a narrower class of bounded functions:
2. More formally, inequality EP [f (X)] ≤ log EQ [exp f (X)] + D(P kQ) is useful in estimating
EP [f (X)] for complicated distribution P (e.g. over large-dimensional vector X n with lots of
weak inter-coordinate dependencies) by making a smart choice of Q (e.g. with iid components).
3. In the next lecture we will show that P 7→ D(P kQ) is convex. A general method of obtaining
variational formulas like (3.1) is by Young-Fenchel duality. Indeed, (3.1) is exactly this
inequality since the Fenchel-Legendre conjugate of D(·kQ) is given by a convex map f 7→ Zf .
Theorem 3.6 (Weak lower-semicontinuity of divergence). Let X be a metric space with Borel σ-algebra H. If Pn and Qn converge weakly (in distribution) to P, Q, then
D(P‖Q) ≤ lim inf_{n→∞} D(Pn‖Qn).
Proof. First method: On a metric space X bounded continuous functions (Cb ) are dense in the set
of all integrable functions. Then in Donsker-Varadhan (3.1) we can replace C by Cb to get
D(Pn‖Qn) = sup_{f∈Cb} { EPn[f(X)] − log EQn[exp{f(X)}] }.
Recall Pn → P weakly if and only if EPn f (X) → EP f (X) for all f ∈ Cb . Taking the limit concludes
the proof.
Second method (less mysterious): Let A be the algebra of Borel sets E whose boundary has
zero (P + Q) measure, i.e.
A = {E ∈ H : (P + Q)(∂E) = 0} .
By the property of weak convergence Pn and Qn converge pointwise on A. Thus by (3.10) we have
D(P_A‖Q_A) ≤ lim inf_{n→∞} D(P_{n,A}‖Q_{n,A})
for all n. Note that this is an example for strict inequality in (3.2).
Note: Why do we care about continuity of information measures? Let’s take divergence as an
example.
1. Computation. For complicated P and Q, direct computation of D(P‖Q) might be hard. Instead, one may want to discretize them and compute numerically. Question: is this procedure stable, i.e., as the quantization becomes finer, is it guaranteed to converge to the true value? Yes! Continuity w.r.t. discretization is guaranteed by the next theorem.
2. Estimating information measures. In many statistical setups we do not know P or Q; if we estimate the distribution from data (e.g., estimate P by the empirical distribution P̂n from n samples) and then plug in, does D(P̂n‖Q) provide a good estimator for D(P‖Q)? Well, note from the first example that this is a bad idea if Q is continuous, since then D(P̂n‖Q) = ∞ for every n. In fact, if one convolves the empirical distribution with a tiny bit of, say, Gaussian
distribution, then it will always have a density. If we allow the variance of the Gaussian to
vanish with n appropriately, we will have convergence. This leads to the idea of kernel density
estimators. All these need regularity properties of divergence.
Theorem 3.7 (Gelfand-Yaglom-Perez [GKY56]). Let P, Q be two probability measures on X with σ-algebra F. Then
D(P‖Q) = sup_{{E1,...,En}} Σ_{i=1}^{n} P[Ei] log( P[Ei]/Q[Ei] ),   (3.3)
where the supremum is over all finite F-measurable partitions: ⋃_{j=1}^{n} Ej = X, Ej ∩ Ei = ∅, and 0 log(0/0) = 0 and log(1/0) = ∞ per our usual convention.
Remark 3.2. This theorem, in particular, allows us to prove all general identities and inequalities
for the cases of discrete random variables.
Proof. “≥”: Fix a finite partition E1, . . . , En. Define a function (quantizer/discretizer) f : X → {1, . . . , n} as follows: for any x, let f(x) denote the index j of the set Ej to which x belongs. Let X be distributed according to either P or Q and set Y = f(X). Applying the data processing inequality for divergence yields D(PY‖QY) ≤ D(P‖Q), and D(PY‖QY) is exactly the sum in (3.3) for this partition.
“≤”: To show that D(P‖Q) is indeed achievable, first note that if P ⋢ Q, then by definition there exists B such that Q(B) = 0 < P(B). Choosing the partition E1 = B and E2 = B^c, we have D(P‖Q) = ∞ = Σ_{i=1}^{2} P[Ei] log(P[Ei]/Q[Ei]). In the sequel we assume that P ≪ Q, hence the likelihood ratio dP/dQ is well-defined. Let us define a partition of X by partitioning the range of log dP/dQ: Ej = {x : log dP/dQ ∈ ε·[j − n/2, j + 1 − n/2)}, j = 1, . . . , n − 1, and En = {x : log dP/dQ < ε(1 − n/2) or log dP/dQ ≥ εn/2}. Note that on Ej, log dP/dQ ≤ ε(j + 1 − n/2) ≤ log(P(Ej)/Q(Ej)) + ε. Hence
Σ_{j=1}^{n−1} ∫_{Ej} dP log dP/dQ ≤ Σ_{j=1}^{n−1} [ εP(Ej) + P(Ej) log(P(Ej)/Q(Ej)) ] ≤ ε + Σ_{j=1}^{n} P(Ej) log(P(Ej)/Q(Ej)) + P(En) log(1/P(En)).
In other words,
Σ_{j=1}^{n} P(Ej) log(P(Ej)/Q(Ej)) ≥ ∫_{En^c} dP log dP/dQ − ε − P(En) log(1/P(En)).
Let n → ∞ and ε → 0 be such that εn → ∞ (e.g., ε = 1/√n). The proof is complete by noting that P(En) → 0 and
∫ 1{|log dP/dQ| ≤ εn} dP log dP/dQ → ∫ dP log dP/dQ = D(P‖Q)  as n → ∞.
Warning: Divergence is never continuous in the pair, even for finite alphabets; e.g., d(1/n ‖ 2^{−n}) ↛ 0.
Our next goal is to study continuity properties of divergence for general alphabets. First,
however, we need to understand dependence on the σ-algebra of the space. Indeed, divergence
D(P kQ) implicitly depends on the σ-algebra F defining the measurable space (X , F). To emphasize
the dependence on F we will write
D(PF kQF ) .
Shortly, we will prove that divergence is continuous under monotone limits:
For establishing the first result, it will be convenient to extend the definition of the divergence
D(PF kQF ) to any algebra of sets F and two positive additive set-functions P, Q on F. For this we
take (3.3) as the definition. Note that when F is not a σ-algebra or P, Q are not σ-additive, we do
not have Radon-Nikodym theorem and thus our original definition is not applicable.
Corollary 3.2 (Measure-theoretic properties of divergence). Let P, Q be probability measures on the measurable space (X, H). Assume all algebras below are sub-algebras of H. Then:
• (Monotonicity) If F ⊆ G then D(P_F‖Q_F) ≤ D(P_G‖Q_G).
• If F is (P + Q)-dense in G then² D(P_F‖Q_F) = D(P_G‖Q_G).
• (Monotone convergence theorem) Let F1 ⊆ F2 ⊆ . . . be an increasing sequence of algebras and let F = ∨_n Fn be the σ-algebra generated by them; then D(P_F‖Q_F) = lim_{n→∞} D(P_{Fn}‖Q_{Fn}).
In particular,
D(P_{X^∞}‖Q_{X^∞}) = lim_{n→∞} D(P_{X^n}‖Q_{X^n}).
² Note: F is µ-dense in G if ∀E ∈ G, ∀ε > 0, ∃E′ ∈ F s.t. µ[E∆E′] ≤ ε.
• (Lower-semicontinuity of divergence) If Pn → P and Qn → Q pointwise on the algebra F,
then3
D(PF kQF ) ≤ lim inf D(Pn,F kQn,F ) . (3.10)
n→∞
Proof. Straightforward applications of (3.3) and the observation that any algebra F is µ-dense in
the σ-algebra σ{F} it generates, for any µ on (X , H).4
Note: Pointwise convergence on H is weaker than convergence in total variation and stronger than
convergence in distribution (aka “weak convergence”). However, (3.10) can be extended to this
mode of convergence (see Theorem 3.6).
Finally, we address the continuity under the decreasing σ-algebra, i.e. (3.7).
Proposition 3.2. Let Fn ↘ F be a sequence of decreasing σ-algebras and P, Q two probability measures on F0. If D(P_{F0}‖Q_{F0}) < ∞ then we have
D(P_F‖Q_F) = lim_{n→∞} D(P_{Fn}‖Q_{Fn}).
The condition D(P_{F0}‖Q_{F0}) < ∞ cannot be dropped, cf. the example after (3.16).
Proof. Let X−n = (dP/dQ)|_{Fn}. Since X−n = EQ[ dP/dQ | Fn ], we have that (. . . , X−1, X0) is a uniformly integrable martingale. By the martingale convergence theorem in reversed time, cf. [Ç11, Theorem 5.4.17], we have almost surely
X−n → X−∞ ≜ (dP/dQ)|_F.   (3.12)
We need to prove that
EQ[X−n log X−n] → EQ[X−∞ log X−∞].
We will do so by decomposing x log x = x log+ x + x log− x, where log+ x = max(log x, 0) and log− x = min(log x, 0). Since x log− x is bounded, we have from the bounded convergence theorem:
EQ[X−n log− X−n] → EQ[X−∞ log− X−∞].
To prove a similar convergence for log+ we need to notice two things. First, the function
x 7→ x log+ x
is convex. Second, for any non-negative convex function φ s.t. E[φ(X0)] < ∞ the collection {Zn = φ(E[X0|Fn]), n ≥ 0} is uniformly integrable. Indeed, we have from Jensen’s inequality
P[Zn > c] ≤ (1/c) E[φ(E[X0|Fn])] ≤ E[φ(X0)]/c
³ Pn → P pointwise on some algebra F if ∀E ∈ F : Pn[E] → P[E].
⁴ This may be shown by transfinite induction: to each ordinal ω associate an algebra Fω generated by monotone limits of sets from Fω′ with ω′ < ω. Then σ{F} = Fω₀, where ω₀ is the first ordinal for which Fω₀ is a monotone class. But F is µ-dense in each Fω by transfinite induction.
and thus P[Zn > c] → 0 as c → ∞. Therefore, we have again by Jensen’s
Finally, since X−n log+ X−n is uniformly integrable, we have from (3.12)
PX,Y 7→ I(X; Y )
is continuous.
and (3.5).
Further properties of mutual information follow from I(X; Y ) = D(PXY kPX PY ) and correspond-
ing properties of divergence, e.g.
1. I(X;Y) = sup_f { E[f(X, Y)] − log E[exp{f(X, Ȳ)}] },
• Good example of strict inequality: Xn = Yn = (1/n) Z. In this case (Xn, Yn) →(d) (0, 0) but I(Xn; Yn) = H(Z) > 0 = I(0; 0).
• Even crazier example: Let (Xp, Yp) be uniformly distributed on the unit ℓp-ball in the plane: {(x, y) : |x|^p + |y|^p ≤ 1}. Then as p → 0, (Xp, Yp) →(d) (0, 0), but I(Xp; Yp) → ∞! (Homework)
3. I(X;Y) = sup_{{Ei}×{Fj}} Σ_{i,j} PXY[Ei × Fj] log( PXY[Ei × Fj] / (PX[Ei] PY[Fj]) ),⁵
This implies that all mutual information between two processes X^∞ and Y^∞ is contained in their finite-dimensional projections, leaving nothing for the tail σ-algebra.
5. (Monotone convergence II): Let Xtail be a random variable such that σ(Xtail) = ⋂_{n≥1} σ(X_n^∞). Then
I(Xtail; Y) = lim_{n→∞} I(X_n^∞; Y),   (3.16)
whenever the right-hand side is finite. This is a consequence of Prop. 3.2. Without the finiteness assumption the statement is incorrect. Indeed, consider Xj i.i.d.∼ Bern(1/2) and Y = X_0^∞. Then each I(X_n^∞; Y) = ∞, but Xtail = const a.e. by Kolmogorov’s 0-1 law, and thus the left-hand side of (3.16) is zero.
⁵ To prove this from (3.3) one needs to notice that the algebra of measurable rectangles is dense in the product σ-algebra.
§ 4. Extremization of mutual information: capacity saddle point
where for fixed f , P 7→ EP [f (X)] is affine, Q 7→ log EQ [exp{f (X)}] is concave. Therefore (P, Q) 7→
D(P kQ) is a pointwise supremum of convex functions, hence convex.
Remark 4.1. The first proof shows that for an arbitrary measure of similarity D(P kQ) convexity
of (P, Q) 7→ D(P kQ) is equivalent to “conditioning increases divergence” property of D. Convexity
can also be understood as “mixing decreases divergence”.
Remark 4.2 (f -divergences). Any f -divergence, cf. (1.16), satisfies all the key properties of the
usual divergence: positivity, monotonicity, data processing (DP), conditioning increases divergence
(CID) and convexity in the pair. Indeed, by previous remark the last two are equivalent. Furthermore,
proof of Theorem 2.2 showed that DP and CID are implied by monotonicity. Thus, consider PXY
and QXY and note
Df(PXY‖QXY) = E_{QXY}[ f( PXY/QXY ) ]   (4.1)
            = E_{QY} E_{QX|Y}[ f( (PY/QY) · (PX|Y/QX|Y) ) ]   (4.2)
            ≥ E_{QY}[ f( PY/QY ) ],   (4.3)
where the inequality follows by applying Jensen’s inequality to the convex function f. Finally, positivity follows from Jensen’s inequality and recalling that f(1) = 0 by assumption and E_{QY}[PY/QY] = 1.
¹ This is a general phenomenon: for a convex f(·) the perspective function (p, q) ↦ q f(p/q) is convex too.
Theorem 4.2 (Entropy). PX 7→ H(PX ) is concave.
Proof. If PX is on a finite alphabet, then the proof is complete by H(X) = log|X| − D(PX‖UX). Otherwise, set
PX|Y=0 = P0,  PX|Y=1 = P1,  P(Y = 0) = λ.
Then apply H(X|Y) ≤ H(X).
Proof.
• First proof: Introduce θ ∼ Bern(λ). Define PX|θ=0 = PX0 and PX|θ=1 = PX1. Then θ → X → Y and PX = λ̄PX0 + λPX1. Now I(X;Y) = I(X, θ; Y) = I(θ;Y) + I(X;Y|θ) ≥ I(X;Y|θ), which is our desired I(λ̄PX0 + λPX1, PY|X) ≥ λ̄ I(PX0, PY|X) + λ I(PX1, PY|X).
Second proof : I(X; Y ) = minQ D(PY |X kQ|PX ) – pointwise minimum of affine functions is
concave.
Third proof : Pick a Q and use the golden formula: I(X; Y ) = D(PY |X kQ|PX ) − D(PY kQ),
where PX 7→ D(PY kQ) is convex, as the composition of the PX 7→ PY (affine) and PY 7→
D(PY kQ) (convex).
(d/dλ) D(λP + λ̄Q‖Q) |_{λ=0} = 0.
If we exchange the arguments, the criterion is even simpler:
(d/dλ) D(Q‖λP + λ̄Q) |_{λ=0} = 0 ⟺ P ≪ Q.   (4.4)
Proof.
(1/λ) D(λP + λ̄Q‖Q) = EQ[ (1/λ)(λf + λ̄) log(λf + λ̄) ],
where f = dP/dQ. As λ → 0 the function under the expectation decreases to (f − 1) log e monotonically. Indeed, the function
λ ↦ g(λ) ≜ (λf + λ̄) log(λf + λ̄)
is convex and equals zero at λ = 0. Thus g(λ)/λ is increasing in λ. Moreover, by the convexity of x ↦ x log x:
(1/λ)(λf + λ̄) log(λf + λ̄) ≤ (1/λ)(λ f log f + λ̄ · 1 · log 1) = f log f,
and by assumption f log f is Q-integrable. Thus the Monotone Convergence Theorem applies.
To prove (4.4) first notice that if P ⋢ Q then there is a set E with p = P[E] > 0 = Q[E]. Applying data-processing for divergence to X ↦ 1_E(X), we get
D(Q‖λP + λ̄Q) ≥ d(0‖λp) = log 1/(1 − λp)
and the derivative is non-zero. If P ≪ Q, then let f = dP/dQ and notice simple inequalities for
λ ↦ D(λP + λ̄Q‖Q),
This is a very popular measure of distance between P and Q, frequently used in statistics. It has
many important properties, but we will only mention that χ2 dominates KL-divergence:
lim inf_{λ→0} (1/λ²) D(λP + λ̄Q‖Q) = (log e / 2) χ²(P‖Q),   (4.5)
where both sides are finite or infinite simultaneously. In fact, if χ²(P‖Q) < ∞ then
D(λP + λ̄Q‖Q) = (λ² log e / 2) χ²(P‖Q) + o(λ²),  λ → 0.
To that end notice that
D(P‖Q) = EQ[ g(dP/dQ) ],
where
g(x) ≜ x log x − (x − 1) log e.
Note that x ↦ g(x)/((x − 1)² log e) = ∫_0^1 s ds/(x(1 − s) + s) is decreasing in x on (0, ∞). Therefore
0 ≤ (1/λ²) g(λ̄ + λ dP/dQ) ≤ (dP/dQ − 1)² log e.   (4.6)
By the dominated convergence theorem (which is applicable since χ²(P‖Q) < ∞) we have
lim_{λ→0} (1/λ²) EQ[ g(λ̄ + λ dP/dQ) ] = EQ[ (g″(1)/2) (dP/dQ − 1)² ] = (log e/2) χ²(P‖Q).
Therefore, from (4.6) we conclude that if χ²(P‖Q) = ∞ then so is the LHS of (4.5).
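A quick numerical check of the expansion (4.5) for two fixed pmfs (a sketch; everything is computed in nats, so the factor log e disappears):

import numpy as np

P = np.array([0.2, 0.5, 0.3])
Q = np.array([0.4, 0.4, 0.2])

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

chi2 = float(np.sum((P - Q)**2 / Q))

for lam in (0.1, 0.01, 0.001):
    mix = lam * P + (1 - lam) * Q
    print(lam, kl(mix, Q) / lam**2, chi2 / 2)   # ratio -> chi2/2 as lam -> 0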
4.3* Local behavior of divergence and Fisher information
Consider a parameterized set of distributions {Pθ , θ ∈ Θ} and assume Θ is an open subset of Rd .
Furthermore, suppose that the distributions Pθ are all given in the form
Pθ(dx) = f(x|θ) µ(dx),
where µ is some common dominating measure (e.g. Lebesgue or counting). If for each fixed x the function θ ↦ f(x|θ) is smooth, one can define the Fisher information matrix with respect to the parameter θ as
JF(θ) ≜ E_{X∼Pθ}[ V V^T ],  V ≜ ∇θ log f(X|θ).   (4.7)
Under suitable regularity conditions, Fisher information matrix has several equivalent expressions:
JF(θ) = cov_{X∼Pθ}[ ∇θ log f(X|θ) ]   (4.8)
      = (4 log e) ∫ µ(dx) (∇θ √(f(x|θ))) (∇θ √(f(x|θ)))^T   (4.9)
Significance of Fisher information matrix arises from the fact that it gauges the local behaviour
of divergence for smooth parametric families. Namely, we have (again under suitable technical
conditions):2
D(Pθ0‖Pθ0+ξ) = (1/(2 log e)) ξ^T JF(θ0) ξ + o(‖ξ‖²),   (4.14)
which is obtained by integrating the Taylor expansion:
log f(x|θ0 + ξ) = log f(x|θ0) + ξ^T ∇θ log f(x|θ0) + (1/2) ξ^T Hessθ(log f(x|θ0)) ξ + o(‖ξ‖²).
Property (4.14) is of paramount importance in statistics. We should remember it as: Divergence
is locally quadratic on the parameter space, with Hessian given by the Fisher information matrix.
² To illustrate the subtlety here, consider a scalar location family, i.e. f(x|θ) = f0(x − θ) with f0 some density. In this case the Fisher information JF(θ0) = (log e)² ∫ (f0′)²/f0 does not depend on θ0 and is well-defined even for compactly supported f0, provided f0′ vanishes at the endpoints sufficiently fast. But at the same time the left-hand side of (4.14) is infinite for any ξ > 0. In such cases, a better interpretation for Fisher information is as the coefficient of the expansion D(Pθ0 ‖ (1/2)Pθ0 + (1/2)Pθ0+ξ) = (ξ²/(8 log e)) JF + o(ξ²).
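A numerical check of (4.14) for the Bernoulli family {Bern(θ)} (a sketch in nats, where JF(θ) = 1/(θ(1 − θ))):

import numpy as np

def kl_bern(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta0 = 0.3
J = 1.0 / (theta0 * (1 - theta0))       # Fisher information of Bern(theta)

for xi in (0.1, 0.01, 0.001):
    lhs = kl_bern(theta0, theta0 + xi)
    rhs = 0.5 * J * xi**2
    print(xi, lhs, rhs)                  # lhs/rhs -> 1 as xi -> 0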
Remark 4.4. It can be seen that if one introduces another parametrization θ̃ ∈ Θ̃ by means of a smooth invertible map Θ̃ → Θ, then the Fisher information matrix changes as
J̃F(θ̃) = A^T JF(θ) A,
where A = dθ/dθ̃ is the Jacobian of the map. So we can see that JF transforms similarly to the metric tensor in Riemannian geometry. This idea can be used to define a Riemannian metric on the space of parameters Θ, called the Fisher-Rao metric. This is explored in a field known as information geometry [AN07].
where 1·1^T is the d × d matrix of all ones. For future reference, we also compute the determinant of JF(θ). To that end notice that det(A + xy^T) = det A · det(I + A^{−1}xy^T) = det A · (1 + y^T A^{−1} x), where we used the identity det(I + AB) = det(I + BA). Thus, we have
det JF(θ) = (log e)^{2d} ∏_{x=0}^{d} (1/θx) = (log e)^{2d} · (1/(1 − Σ_{x=1}^{d} θx)) · ∏_{x=1}^{d} (1/θx).   (4.17)
Theorem 4.4 (Saddle point). Let P be a convex set of distributions on X. Suppose there exists PX* ∈ P such that
sup_{PX∈P} I(PX, PY|X) = I(PX*, PY|X) ≜ C,
and let PX* −(PY|X)→ PY*. Then for all PX ∈ P and for all QY, we have
D(PY|X‖PY*|PX) ≤ D(PY|X‖PY*|PX*) ≤ D(PY|X‖QY|PX*).   (4.18)
Note: PX∗ (resp., PY∗ ) is called a capacity-achieving input (resp., output) distribution, or a caid
(resp., the caod ).
Proof. Right inequality: obvious from C = I(PX*, PY|X) = min_{QY} D(PY|X‖QY|PX*).
Left inequality: If C = ∞, then trivial. In the sequel assume that C < ∞, hence I(PX, PY|X) < ∞ for all PX ∈ P. Let PXλ = λPX + λ̄PX* ∈ P by convexity of P, and introduce θ ∼ Bern(λ), so that PXλ|θ=0 = PX*, PXλ|θ=1 = PX, and θ → Xλ → Yλ. Then
dividing both sides by λ, taking the lim inf and using lower semicontinuity of D, we have
where the inequality is by the right part of (4.18) (already shown). Thus, subtracting λ̄C and dividing by λ we get
D(PX,Y‖PX PYλ) ≤ C,
and the proof is completed by taking lim inf_{λ→0} and applying the lower semicontinuity of divergence.
Corollary 4.1. In addition to the assumptions of Theorem 4.4, suppose C < ∞. Then caod PY∗ is
unique. It satisfies the property that for any PY induced by some PX ∈ P (i.e. PY = PY |X ◦ PX ) we
have
D(PY kPY∗ ) ≤ C < ∞ (4.19)
and in particular PY ≪ PY*.
Statement (4.19) follows from the left inequality in (4.18) and “conditioning increases divergence”.
Remark 4.5. • Finiteness of C is necessary. Counterexample: Consider the identity channel
Y = X, where X takes values on integers. Then any distribution with infinite entropy is caid
or caod.
• Non-uniqueness of caid. Unlike the caod, the caid need not be unique. Let Z1 ∼ Bern(1/2). Consider Y1 = X1 ⊕ Z1 and Y2 = X2. Then max_{PX1X2} I(X1, X2; Y1, Y2) = log 2, achieved by PX1X2 = Bern(p) × Bern(1/2) for any p. Note that the caod is unique: PY1Y2* = Bern(1/2) × Bern(1/2).
Suppose we have a bivariate function f. Then we always have the minimax inequality:
sup_x inf_y f(x, y) ≤ inf_y sup_x f(x, y).
1. It turns out the minimax equality is implied by the existence of a saddle point (x*, y*), i.e.,
f(x, y*) ≤ f(x*, y*) ≤ f(x*, y)  ∀x, y.
Furthermore, minimax equality also implies the existence of a saddle point if the inf and sup are achieved for all x, y (cf. [BNO03, Section 2.6]) [straightforward to check; see the proof of the corollary below].
3. The mother result of all this minimax theory is a theorem of von Neumann on bilinear functions: let A and B have finite alphabets and g(a, b) be arbitrary; then, with A ∼ PA independent of B ∼ PB,
max_{PA} min_{PB} E[g(A, B)] = min_{PB} max_{PA} E[g(A, B)].
4. A more general version is: if X and Y are compact convex domains in ℝⁿ, and f(x, y) is continuous in (x, y), concave in x and convex in y, then
max_{x∈X} min_{y∈Y} f(x, y) = min_{y∈Y} max_{x∈X} f(x, y).
Proof. This follows from the saddle point trivially: maximizing/minimizing the leftmost/rightmost sides of (4.18) gives
min_{QY} max_{PX∈P} D(PY|X‖QY|PX) ≤ max_{PX∈P} D(PY|X‖PY*|PX) = D(PY|X‖PY*|PX*)
1. Radius (aka Chebyshev radius) of A: the radius of the smallest ball that covers A,
i.e., rad (A) = inf y∈X supx∈A d(x, y).
3. Note that the radius and the diameter both measure how big/rich a set is.
The next simple corollary shows that capacity is just the radius of the set of distributions {PY |X=x , x ∈
X } when distances are measured by divergence (although, we remind, divergence is not a metric).
Corollary 4.3. For a fixed kernel PY|X, let P = {all distributions on X} with X finite; then C = max_{PX} I(X;Y) = min_{QY} max_{x∈X} D(PY|X=x‖QY).
The last corollary gives a geometric interpretation to capacity: it equals the radius of the smallest
divergence-“ball” that encompasses all distributions {PY |X=x : x ∈ X }. Moreover, PY∗ is a convex
combination of some PY |X=x and it is equidistant to those.
The following is the information-theoretic version of “radius ≤ diameter” (in KL divergence) for arbitrary input space:
Corollary 4.4. Let {PY|X=x : x ∈ X} be a set of distributions. Then
C = sup_{PX} I(X;Y) ≤ inf_Q sup_{x∈X} D(PY|X=x‖Q) ≤ sup_{x,x′∈X} D(PY|X=x‖PY|X=x′),
where the middle quantity is the radius and the right-hand quantity is the diameter. Indeed,
I(X;Y) = inf_Q D(PY|X‖Q|PX) ≤ inf_Q sup_{x∈X} D(PY|X=x‖Q) ≤ inf_{x′∈X} sup_{x∈X} D(PY|X=x‖PY|X=x′).
C = sup_{PX∈P} I(X;Y)
can be a) interpreted as a saddle point; b) written in the minimax form; and c) shown to have a unique caod PY*. This was all done under the extra assumption that the supremum over PX is attainable. It turns out that properties b) and c) can be shown without that extra assumption.
Theorem 4.5 (Kemperman). For any PY |X and a convex set of distributions P such that
Furthermore,
If we remove the constraint E[X⁴] = s the unique caid is PX = N(0, P), see Theorem 4.6. When s ≠ 3P² such a PX is no longer inside the constraint set P. However, for s > 3P² the value
C = (1/2) log(1 + P)
can still be approached. Indeed, we can add a small “bump” to the Gaussian distribution as follows:
where p → 0, px² → 0 but px⁴ → s − 3P² > 0. This shows that for the problem (4.26) with s > 3P² the caid does not exist, while the caod PY* = N(0, 1 + P) exists and is unique, as Theorem 4.5 postulates.
Proof of Theorem 4.5. Let P⁰_{Xn} be a sequence of input distributions achieving C, i.e., I(P⁰_{Xn}, PY|X) → C. Let Pn be the convex hull of {P⁰_{X1}, . . . , P⁰_{Xn}}. Since Pn is a finite-dimensional simplex, the concave function PX ↦ I(PX, PY|X) attains its maximum at some point PXn ∈ Pn, i.e.,
where in (4.28) we applied Theorem 4.4 to (Pn+k, PYn+k). By the Pinsker-Csiszár inequality (1.15) and since In ↗ C, we conclude that the sequence PYn is Cauchy in total variation:
Since the space of probability distributions is complete in total variation, the sequence must have a limit point PYn → PY*. By taking the limit as k → ∞ in (4.29) and applying the lower semi-continuity of divergence (Theorem 3.6) we get
and therefore PYn → PY* in the (stronger) sense of D(PYn‖PY*) → 0. By Theorem 3.3,
To prove that (4.32) holds for arbitrary PX ∈ P, we may repeat the argument above with Pn replaced by P̃n = conv({PX} ∪ Pn), denoting the resulting sequences by P̃Xn, P̃Yn and the limit point by P̃Y*, and obtain:
where (4.34) follows from (4.32) since PXn ∈ P̃n . Hence taking limit as n → ∞ we have P̃Y∗ = PY∗
and therefore (4.32) holds.
Finally, to see (4.23), note that by definition capacity as a max-min is at most the min-max, i.e.,
C = sup_{PX∈P} min_{QY} D(PY|X‖QY|PX) ≤ min_{QY} sup_{PX∈P} D(PY|X‖QY|PX) ≤ sup_{PX∈P} D(PY|X‖PY*|PX) = C.
Corollary 4.5. Let X be countable and P a convex set of distributions on X. If sup_{PX∈P} H(X) < ∞ then
sup_{PX∈P} H(X) = min_{QX} sup_{PX∈P} Σ_x PX(x) log 1/QX(x) < ∞,
and the optimizer QX* exists and is unique. If QX* ∈ P, then it is also the unique maximizer of H(X).
This follows from taking Q(n) = c exp{−λf (n)}. This bound is often tight and achieved by
PX (n) = c exp{−λ∗ f (n)} with λ∗ being the minimizer, known as the Gibbs distribution for the
energy function f .
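A small sketch of the Gibbs principle on a finite alphabet: among pmfs on {0, …, 9} with E[f(X)] equal to a target mean for f(n) = n, the entropy maximizer has the form c·exp(−λ*n); below λ* is found by bisection and the resulting entropy is compared with an arbitrary feasible pmf (the alphabet, f, and target mean are hypothetical illustrations, not from the notes).

import numpy as np

ns = np.arange(10)
target_mean = 3.0

def gibbs(lam):
    w = np.exp(-lam * ns)
    return w / w.sum()

# Bisection on lambda so that the Gibbs pmf has mean = target_mean
# (the mean of gibbs(lam) is decreasing in lam).
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if gibbs(mid) @ ns > target_mean:
        lo = mid
    else:
        hi = mid
p_gibbs = gibbs(0.5 * (lo + hi))

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Any other feasible pmf with the same mean has smaller entropy, e.g.:
p_other = np.zeros(10); p_other[[1, 5]] = [0.5, 0.5]     # mean = 3
print(entropy(p_gibbs), entropy(p_other))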
1. “Gaussian capacity”:
C = I(Xg; Xg + Ng) = (1/2) log(1 + σX²/σN²).
2. For all X independent of Ng with Var X ≤ σX²,
I(X; X + Ng) ≤ I(Xg; Xg + Ng),
with equality iff X has the same distribution as Xg.
3. “Gaussian noise is the worst for Gaussian input”: for all N s.t. E[Xg N] = 0 and EN² ≤ σN²,
I(Xg; Xg + N) ≥ I(Xg; Xg + Ng),
with equality iff N has the same distribution as Ng and is independent of Xg.
Interpretations:
1. For the AWGN channel, Gaussian input is the most favorable. Indeed, immediately from the second statement we have
max_{X : Var X ≤ σX²} I(X; X + Ng) = (1/2) log(1 + σX²/σN²).
2. For Gaussian source, additive Gaussian noise is the worst in the sense that it minimizes the
mutual information provided by the noisy version.
Proof. WLOG, assume all random variables have zero mean. Let Yg = Xg + Ng . Define
    f(x) = D(P_{Y_g|X_g=x} ‖ P_{Y_g}) = D(N(x, σ_N²) ‖ N(0, σ_X² + σ_N²))
         = ½ log(1 + σ_X²/σ_N²) + (log e)/2 · (x² − σ_X²)/(σ_X² + σ_N²),
where the first term equals C.
2. Recall the inf-representation (Corollary 3.1): I(X; Y) = min_Q D(P_{Y|X} ‖ Q | P_X). Then
Furthermore, if I(X; X + N_g) = C, then uniqueness of the caod, cf. Corollary 4.1, implies P_Y = P_{Y_g}. But P_Y = P_X ∗ N(0, σ_N²). Then it must be that X ∼ N(0, σ_X²), simply by considering characteristic functions:
    Ψ_X(t)·e^{−σ_N² t²/2} = e^{−(σ_X²+σ_N²) t²/2}  ⇒  Ψ_X(t) = e^{−σ_X² t²/2}  ⇒  X ∼ N(0, σ_X²).
3. Let Y = X_g + N and let P_{Y|X_g} be the respective kernel. [Note that here we only assume that N is uncorrelated with X_g, i.e., E[N X_g] = 0, not necessarily independent.] Then
where
• (4.36) is Bayes' rule: P_{X_g|Y_g} = P_{X_g} P_{Y_g|X_g} / P_{Y_g};
• (4.39) uses E[N²] ≤ σ_N².
Thus, PXg |Y = PXg |Yg , i.e., Xg is conditionally Gaussian: PXg |Y =y = N (by, c2 ) for some constants
b and c. In other words, under PXg Y , we have
    X_g = bY + cZ,   Z ∼ Gaussian, Z ⊥⊥ Y.
But then Y must be Gaussian itself, by Cramér's theorem or simply by considering characteristic functions:
    Ψ_Y(t)·e^{c t²} = e^{c' t²}  ⇒  Ψ_Y(t) = e^{c'' t²}  ⇒  Y is Gaussian.
Therefore (X_g, Y) must be jointly Gaussian and hence N = Y − X_g is Gaussian. Thus we conclude that it is only possible to attain I(X_g; X_g + N) = C if N is Gaussian with variance σ_N² and independent of X_g.
§ 5. Single-letterization. Probability of error. Entropy rate.
(2) If X_1 ⊥⊥ ... ⊥⊥ X_n, then
    I(X^n; Y^n) ≥ Σ_{i=1}^{n} I(X_i; Y_i),   (5.2)
with equality iff P_{X^n|Y^n} = Π P_{X_i|Y_i} P_{Y^n}-almost surely.¹ Consequently,
    min_{P_{Y^n|X^n}} I(X^n; Y^n) = Σ_{i=1}^{n} min_{P_{Y_i|X_i}} I(X_i; Y_i).
Proof. (1) Use I(X^n; Y^n) − Σ I(X_j; Y_j) = D(P_{Y^n|X^n} ‖ Π P_{Y_i|X_i} | P_{X^n}) − D(P_{Y^n} ‖ Π P_{Y_i}).
(2) Reverse the roles of X and Y: I(X^n; Y^n) − Σ I(X_j; Y_j) = D(P_{X^n|Y^n} ‖ Π P_{X_i|Y_i} | P_{Y^n}) − D(P_{X^n} ‖ Π P_{X_i}).
This type of result is often known as single-letterization in information theory: it tremendously simplifies the optimization by reducing a high-dimensional (multi-letter) problem to a scalar (single-letter) one. For example, in the simplest case where X^n, Y^n are binary vectors, optimizing I(X^n; Y^n) over P_{X^n} and P_{Y^n|X^n} entails optimizing over 2^n-dimensional vectors and 2^n × 2^n matrices, whereas optimizing each I(X_i; Y_i) individually is easy.
Example:
¹That is, if P_{X^n,Y^n} = P_{Y^n} Π P_{X_i|Y_i} as measures.
1. (5.1) fails for non-product channels. Let X_1 ⊥⊥ X_2 ∼ Bern(1/2) on {0, 1} = F_2 and
    Y_1 = X_1 + X_2,   Y_2 = X_1.
Then I(X_1; Y_1) = I(X_2; Y_2) = 0, but I(X²; Y²) = 2 bits.
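A quick numerical check of this example (a sketch, not part of the notes): compute the three mutual informations directly from the joint pmf of (X², Y²).

```python
import numpy as np
from itertools import product

def mutual_information(pxy):
    """I(X;Y) in bits from a joint pmf given as a 2-D array."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])).sum())

# X1, X2 iid Bern(1/2); Y1 = X1 + X2 (mod 2), Y2 = X1.
P_XY = np.zeros((4, 4))   # joint pmf of (X^2, Y^2), index 2*first + second
P_11 = np.zeros((2, 2))   # joint pmf of (X1, Y1)
P_22 = np.zeros((2, 2))   # joint pmf of (X2, Y2)
for x1, x2 in product([0, 1], repeat=2):
    y1, y2 = (x1 + x2) % 2, x1
    P_XY[2 * x1 + x2, 2 * y1 + y2] += 0.25
    P_11[x1, y1] += 0.25
    P_22[x2, y2] += 0.25

print("I(X1;Y1)   =", mutual_information(P_11))   # 0 bits
print("I(X2;Y2)   =", mutual_information(P_22))   # 0 bits
print("I(X^2;Y^2) =", mutual_information(P_XY))   # 2 bits
```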
Given a distribution P_{X_1}···P_{X_n} satisfying the constraint, form the “average of marginals” distribution P̄_X = (1/n) Σ_{k=1}^{n} P_{X_k}, which also satisfies the single-letter constraint E[X²] = (1/n) Σ_{k=1}^{n} E[X_k²] ≤ P. Then, from concavity of I(P_X, P_{Y|X}) in P_X,
    I(P̄_X, P_{Y|X}) ≥ (1/n) Σ_{k=1}^{n} I(P_{X_k}, P_{Y|X}).
So P̄_X gives the same or better MI, which shows that the extremization above ought to have the form nC(P), where C(P) is the single-letter capacity. Now suppose Y^n = X^n + Z_G^n, where Z_G^n ∼ N(0, I_n). Since an isotropic Gaussian is rotationally symmetric, for any orthogonal transformation U ∈ O(n) the additive noise has the same distribution Z_G^n ∼ U Z_G^n, so that P_{UY^n|UX^n} = P_{Y^n|X^n}, and
This uses the same trick, except here the input distribution is automatically invariant under
orthogonal transformations.
5.3 Information measures and probability of error
Let W be a random variable and Ŵ be our prediction. There are three types of problems:
1. Random guessing: W ⊥
⊥ Ŵ .
We want to draw converse statements, e.g., if the uncertainty of W is too high or if the information
provided by the data is too scarce, then it is difficult to guess the value of W .
Theorem 5.2. Let |X| = M < ∞ and P_max ≜ max_{x∈X} P_X(x). Then
    H(X) ≤ F_M(P_max) ≜ P_max log(1/P_max) + (1 − P_max) log((M − 1)/(1 − P_max)),   (5.3)
with equality iff P_X = (P_max, (1 − P_max)/(M − 1), ..., (1 − P_max)/(M − 1)), the last value repeated M − 1 times.
Proof. First proof: Write RHS − LHS as a divergence. Let P = (P_max, P_2, ..., P_M) and introduce Q = (P_max, (1 − P_max)/(M − 1), ..., (1 − P_max)/(M − 1)). Then RHS − LHS = D(P‖Q) ≥ 0, with equality iff P = Q.
Second proof: Given any P = (P_max, P_2, ..., P_M), apply a random permutation π to the last M − 1 atoms to obtain the distribution P_π. Averaging P_π over all permutations π gives Q. Then use concavity of entropy or “conditioning reduces entropy”: H(Q) ≥ H(P_π|π) = H(P).
Third proof: Directly solve the convex optimization max{H(P) : 0 ≤ p_i ≤ P_max, i = 1, ..., M, Σ_i p_i = 1}.
Fourth proof : Data processing inequality. Later.
Note: Similar to the Shannon entropy H, P_max is also a reasonable measure of the randomness of P. In fact, log(1/P_max) is known as the Rényi entropy of order ∞, denoted by H_∞(P). Note that H_∞(P) = log M iff P is uniform; H_∞(P) = 0 iff P is a point mass.
Note: The function F_M on the RHS of (5.3) is concave, with maximum log M attained at the maximizer 1/M, but it is not monotone. However, P_max ≥ 1/M and F_M is decreasing on [1/M, 1]. Therefore (5.3) gives a lower bound on P_max in terms of entropy.
[Figure: plot of F_M(p) on p ∈ [0, 1], starting at log(M − 1) at p = 0, peaking at log M at p = 1/M, and decreasing to 0 at p = 1.]
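The following minimal sketch (my own alphabet size and pmf, base-2 logs) evaluates F_M and inverts it on [1/M, 1] to extract the lower bound on P_max implied by (5.3).

```python
import numpy as np

def F(p, M):
    """F_M(p) = p log2(1/p) + (1-p) log2((M-1)/(1-p)), with F_M(1) = 0."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    inner = (p > 0) & (p < 1)
    out[inner] = (p[inner] * np.log2(1 / p[inner])
                  + (1 - p[inner]) * np.log2((M - 1) / (1 - p[inner])))
    out[p == 0] = np.log2(M - 1)
    return out

def pmax_lower_bound(H, M, tol=1e-12):
    """Smallest P_max compatible with entropy H: bisection on [1/M, 1],
    where F_M is decreasing, so H(X) <= F_M(P_max) forces P_max >= this value."""
    lo, hi = 1.0 / M, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if F([mid], M)[0] > H else (lo, mid)
    return lo

M = 4
P = np.array([0.7, 0.1, 0.1, 0.1])          # equality case of Theorem 5.2
H = -(P * np.log2(P)).sum()
print("H(X)               =", H)
print("F_M(P_max)         =", F([0.7], M)[0])      # equals H(X) here
print("lower bound on Pmax =", pmax_lower_bound(H, M))   # recovers 0.7
```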
Interpretation: Suppose one is trying to guess the value of X without any information. Then the best bet is obviously the most likely outcome (mode), i.e., the maximal probability of success among all estimators is
    max_{X̂ ⊥⊥ X} P[X = X̂] = P_max.   (5.4)
This is not obvious from (5.3) and (5.4), since p ↦ F_M(p) is not monotone. To show (5.5), consider the data processor (X, X̂) ↦ 1{X = X̂}:
    P_{XX̂} = P_X P_X̂  ⇒  P[X = X̂] ≜ P_S,
    Q_{XX̂} = U_X P_X̂  ⇒  Q[X = X̂] = 1/M,
and hence d(P_S ‖ 1/M) ≤ D(P_{XX̂} ‖ Q_{XX̂}) = log M − H(X).
Proof. This is a direct corollary of Theorem 5.2: averaging H(X|Y = y) ≤ FM (P[X = X̂|Y = y])
over PY and use the concavity of FM .
For a standalone proof, apply data processing to PXY X̂ = PX PY |X PX̂|Y vs. QXY X̂ = UX PY PX̂|Y
and the data processor (kernel) (X, Y ) 7→ 1{X6=X̂} (note that PX̂|Y is fixed).
Remark: We can also derive Fano's inequality as follows: let ε = P[X ≠ X̂] and apply the data processing inequality for mutual information.
This minimum will not be zero, since if we force X and Z to agree with some probability, then I(X; Z) cannot be too small. It remains to compute the minimum, which is a nice convex optimization problem. (Hint: look for invariants that the matrix P_{Z|X} must satisfy under permutations (X, Z) ↦ (π(X), π(Z)), then apply the convexity of I(P_X, ·).)
Theorem 5.4 (Fano inequality: general). Let X, Y ∈ X with |X| = M and let Q_{XY} = P_X P_Y. Then
Proof. Apply data processing to P_{XY} and Q_{XY}. Note that if P_X or P_Y is uniform, then Q[X = Y] = 1/M always.
The following result is useful in providing converses for statistics and data transmission.
Corollary 5.1 (Lower bound on average probability of error). Let W → X → Y → Ŵ and let W be uniform on [M] ≜ {1, ..., M}. Then
    P_e ≜ P[W ≠ Ŵ] ≥ 1 − (I(X; Y) + h(P_e))/log M   (5.8)
                    ≥ 1 − (I(X; Y) + log 2)/log M.   (5.9)
Proof. Apply Theorem 5.3 and data processing for mutual information: I(W; Ŵ) ≤ I(X; Y).
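A sketch of how (5.8)–(5.9) are used in practice (the values of M and I(X;Y) below are my own hypothetical numbers): since P_e appears on both sides of (5.8) through h(P_e), the implied bound can be obtained by a fixed-point iteration.

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def fano_lower_bound(I_XY, M, iters=200):
    """Solve the equality version of (5.8), Pe = 1 - (I + h(Pe))/log2 M, by iteration;
    on [0, 1/2] every smaller Pe would violate (5.8), so this is a valid lower bound there."""
    pe = 0.5
    for _ in range(iters):
        pe = min(max(1 - (I_XY + h2(pe)) / np.log2(M), 0.0), 1.0)
    return pe

M, I_XY = 1024, 5.0                       # hypothetical: W uniform on 1024 values, I(X;Y) = 5 bits
print("Fano bound (5.8):", fano_lower_bound(I_XY, M))
print("Fano bound (5.9):", 1 - (I_XY + 1.0) / np.log2(M))   # weaker, h(Pe) <= 1 bit
```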
H(X1 ) < ∞.
Proof.
4. n ↦ (1/n) H(X^n) is a decreasing sequence and lower bounded by zero, hence it has a limit H(X). Moreover, by the chain rule, (1/n) H(X^n) = (1/n) Σ_{i=1}^{n} H(X_i|X^{i−1}). Then H(X_n|X^{n−1}) → H(X). Indeed, from part 1, lim_n H(X_n|X^{n−1}) = H' exists. Next, recall from calculus: if a_n → a, then the Cesàro mean (1/n) Σ_{i=1}^{n} a_i → a as well. Thus, H' = H(X).
1
5. Assuming H(X_1) < ∞, we have from (3.14):
    lim_{n→∞} [H(X_1) − H(X_1|X⁰_{−n})] = lim_{n→∞} I(X_1; X⁰_{−n}) = I(X_1; X⁰_{−∞}) = H(X_1) − H(X_1|X⁰_{−∞}).
2. X − mixed sources: Flip a coin with bias p at time t = 0, if head, let X = Y, if tail, let X = Z.
Then H(X) = pH(Y) + p̄H(Z).
    E[d_H(X^n, Y^n)] ≤ εn.
Before showing our main result, we show that Fano's inequality (Theorem 5.3) can be tensorized:
Proposition 5.1. Let X_k take values on a finite alphabet X. Then
where
    δ = (1/n) E[d_H(X^n, Y^n)] = (1/n) Σ_{j=1}^{n} P[X_j ≠ Y_j].
Proof. For each j ∈ [n] consider X̂j (Y n ) = Yj . Then from (5.6) we get
where we denoted M = |X |. Then, upper-bounding joint entropy by the sum of marginals, cf. (1.1),
and combining with (5.13) we get
    H(X^n|Y^n) ≤ Σ_{j=1}^{n} H(X_j|Y^n)   (5.14)
              ≤ Σ_{j=1}^{n} F_M(P[X_j = Y_j])   (5.15)
              ≤ n F_M( (1/n) Σ_{j=1}^{n} P[X_j = Y_j] ),   (5.16)
where in the last step we used concavity of F_M and Jensen's inequality. Noticing that
    (1/n) Σ_{j=1}^{n} P[X_j = Y_j] = 1 − δ
Corollary 5.2. Consider two processes X and Y with entropy rates H(X) and H(Y). If P[X_j ≠ Y_j] ≤ ε for all j, then
    H(X) − H(Y) ≤ F_M(1 − ε).
To prove this, apply (5.12); for the last statement just recall the expression for F_M.
Example: Gaussian processes. Consider X, N two stationary Gaussian processes, independent of
each other. Assume that their auto-covariance functions are absolutely summable and thus there
exist continuous power spectral density functions fX and fN . Without loss of generality, assume all
means are zero. Let c_X(k) = E[X_1 X_{k+1}]. Then f_X is the Fourier transform of the auto-covariance function c_X, i.e., f_X(ω) = Σ_{k=−∞}^{∞} c_X(k) e^{iωk}. Finally, assume f_N ≥ δ > 0. Then recall from Lecture 2:
    I(X^n; X^n + N^n) = ½ log [det(Σ_{X^n} + Σ_{N^n}) / det Σ_{N^n}]
                      = ½ Σ_{i=1}^{n} log σ_i − ½ Σ_{i=1}^{n} log λ_i,
where σ_j, λ_j are the eigenvalues of the covariance matrices Σ_{Y^n} = Σ_{X^n} + Σ_{N^n} and Σ_{N^n}, which are all Toeplitz matrices, e.g., (Σ_{X^n})_{ij} = E[X_i X_j] = c_X(i − j). By Szegő's theorem (see Section 5.7*):
    (1/n) Σ_{i=1}^{n} log σ_i → (1/2π) ∫_0^{2π} log f_Y(ω) dω.
Note that c_Y(k) = E[(X_1 + N_1)(X_{k+1} + N_{k+1})] = c_X(k) + c_N(k) and hence f_Y = f_X + f_N. Thus, we have
    (1/n) I(X^n; X^n + N^n) → I(X; X + N) = (1/4π) ∫_0^{2π} log[(f_X(ω) + f_N(ω))/f_N(ω)] dω.
Maximizing this over f_X(ω) leads to the famous water-filling solution f*_X(ω) = |T − f_N(ω)|⁺.
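A sketch of the water-filling computation (the noise spectrum and power budget below are my own illustrative choices): the water level T is found by bisection so that the allocated input spectrum (T − f_N(ω))⁺ integrates to the available power.

```python
import numpy as np

omega = np.linspace(0, 2 * np.pi, 4096, endpoint=False)
f_N = 1.0 + 0.8 * np.cos(omega)        # hypothetical noise PSD, bounded away from 0
P_total = 1.0                          # input power constraint (1/2pi) int f_X dw <= P

def allocated_power(T):
    """(1/2pi) * integral of (T - f_N(w))^+ dw, approximated by a grid average."""
    return np.maximum(T - f_N, 0.0).mean()

# allocated_power(T) is increasing in T: bisect for the water level.
lo, hi = f_N.min(), f_N.max() + P_total + 1.0
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if allocated_power(mid) < P_total else (lo, mid)
T = hi
f_X = np.maximum(T - f_N, 0.0)         # water-filling input spectrum

rate = 0.5 * np.log2((f_X + f_N) / f_N).mean()   # (1/4pi) int log2(...) dw, bits/symbol
print("water level T        :", T)
print("allocated power      :", f_X.mean())
print("capacity (bits/symb) :", rate)
```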
where {σn,j , j = 1, . . . , n} are the eigenvalues of the Toeplitz matrix Tn = {a`−m }n`,m=1 .
Proof sketch. The idea is to approximate φ by polynomials; for polynomials the statement can be checked directly. An alternative interpretation of the strategy is the following: roughly speaking, we want to show that the empirical distribution of the eigenvalues, (1/n) Σ_{j=1}^{n} δ_{σ_{n,j}}, converges weakly to the distribution of f(W), where W is uniformly distributed on [0, 2π]. To this end, let us check that all moments converge. In general, convergence of moments does not imply weak convergence, but for distributions on a compact interval it does.
For example, for φ(x) = x² we have
    (1/n) Σ_{j=1}^{n} σ_{n,j}² = (1/n) tr T_n²
                               = (1/n) Σ_{ℓ,m=1}^{n} (T_n)_{ℓ,m} (T_n)_{m,ℓ}
                               = (1/n) Σ_{ℓ,m} a_{ℓ−m} a_{m−ℓ}
                               = (1/n) Σ_{ℓ=−(n−1)}^{n−1} (n − |ℓ|) a_ℓ a_{−ℓ}
                               = Σ_{x∈(−1,1)∩(1/n)Z} (1 − |x|) a_{nx} a_{−nx}.
Substituting a_ℓ = (1/2π) ∫_0^{2π} f(ω) e^{iωℓ} dω, we get
    (1/n) Σ_{j=1}^{n} σ_{n,j}² = (1/(2π)²) ∬ f(ω) f(ω') θ_n(ω − ω') dω dω',   (5.17)
where
    θ_n(u) = Σ_{x∈(−1,1)∩(1/n)Z} (1 − |x|) e^{−inux}
is a Fejér kernel and converges to a δ-function: θ_n(u) → 2πδ(u) (in the sense of convergence of Schwartz distributions). Thus from (5.17) we get
    (1/n) Σ_{j=1}^{n} σ_{n,j}² → (1/(2π)²) ∬ f(ω) f(ω') 2πδ(ω − ω') dω dω' = (1/2π) ∫_0^{2π} f²(ω) dω,
as claimed.
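A numerical sanity check of this moment computation (the spectral density below is my own choice): build T_n from the Fourier coefficients of f and compare (1/n) Σ σ²_{n,j} with (1/2π) ∫ f².

```python
import numpy as np

# Hypothetical spectral density f(w) = 2 + cos(w) + 0.5*cos(2w); its Fourier coefficients:
a = {0: 2.0, 1: 0.5, -1: 0.5, 2: 0.25, -2: 0.25}

def toeplitz_second_moment(n):
    """(1/n) * sum of squared eigenvalues of T_n = {a_{l-m}}."""
    T = np.array([[a.get(l - m, 0.0) for m in range(n)] for l in range(n)])
    return (np.linalg.eigvalsh(T) ** 2).mean()

omega = np.linspace(0, 2 * np.pi, 100000, endpoint=False)
f = 2 + np.cos(omega) + 0.5 * np.cos(2 * omega)
limit = (f ** 2).mean()                   # (1/2pi) * int f(w)^2 dw

for n in (10, 50, 200):
    print(n, toeplitz_second_moment(n))   # approaches the limit below
print("Szego limit:", limit)
```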
§ 6. f -divergences: definition and properties
6.1 f -divergences
In Section 1.6 we introduced the KL divergence that measures the dissimilarity between two
distributions. This turns out to be a special case of the family of f -divergence between probability
distributions, introduced by Csiszár [Csi67]. Roughly speaking, all f -divergences quantify the
difference between a pair of distributions, each with different operational meaning.
Definition 6.1 (f-divergence). Let f : (0, ∞) → R be a convex function that is strictly convex¹ at 1 and satisfies f(1) = 0. Let P and Q be two probability distributions on a measurable space (X, F). The f-divergence between P and Q is defined as
    D_f(P‖Q) ≜ E_Q[ f( (dP/dμ) / (dQ/dμ) ) ],   (6.1)
where μ is any dominating probability measure (e.g., μ = (P + Q)/2) of P and Q, i.e., both P ≪ μ and Q ≪ μ, with the understanding that
• f(0) = f(0+),
• 0·f(0/0) = 0, and
Remark 6.1. It is useful to allow the case when P is not absolutely continuous with respect to Q, in which case P and Q can be “very dissimilar.” For example, we will show later that TV(P, Q) = 1 iff P ⊥ Q. When P ≪ Q, we have
    D_f(P‖Q) ≜ E_Q[ f(dP/dQ) ].   (6.2)
Similar to Definition 1.4, when X is discrete, D_f(P‖Q) = Σ_{x∈X} Q(x) f(P(x)/Q(x)); when X is a Euclidean space, D_f(P‖Q) = ∫_X q(x) f(p(x)/q(x)) dx.
¹By strict convexity at 1, we mean that for all s, t ∈ (0, ∞) and α ∈ (0, 1) such that αs + ᾱt = 1, we have αf(s) + (1 − α)f(t) > f(1).
• Total variation: f(x) = ½|x − 1|,
    TV(P, Q) ≜ ½ E_Q |P/Q − 1| = ½ ∫ |dP − dQ|.
Moreover, TV(·, ·) is a metric on the space of probability distributions, and hence it is a symmetric function of P and Q.
• χ²-divergence: f(x) = (x − 1)²,
    χ²(P‖Q) ≜ E_Q[(P/Q − 1)²] = ∫ (dP − dQ)²/dQ.
Note that we can also choose f(x) = x² − 1. Indeed, different f can lead to the same divergence.
• Squared Hellinger distance: f(x) = (1 − √x)²,
    H²(P, Q) ≜ E_Q[(1 − √(P/Q))²] = ∫ (√dP − √dQ)².
[Diagram: the input P_X is passed through two channels P_{Y|X} and Q_{Y|X}, producing outputs P_Y and Q_Y.]
Then
Df (PY kQY ) ≤ Df PY |X kQY |X |PX .
Note: For the last property, one can view PY and QY as the output distributions after passing PX
through the channel transition matrices PY |X and QY |X respectively. The above relation tells us
that the average f -divergence between the corresponding channel transition rows is at least the
f -divergence between the output distributions.
Proof. • D_f(P‖Q) = E_Q[f(P/Q)] ≥ f(E_Q[P/Q]) = f(1) = 0, where the inequality follows from Jensen's inequality. By strict convexity at 1, equality holds if and only if P = Q.
• For any convex function f on R₊, the map (p, q) ↦ q f(p/q) is convex on R₊² (the perspective of f). Since D_f(P‖Q) = E_Q[f(P/Q)], D_f(P‖Q) is jointly convex.
• This follows directly from the joint convexity of D_f(P‖Q) and Jensen's inequality.
Recall the definition of f -divergences from last time. If a function f : R+ → R satisfies the
following properties:
• f is a convex function.
• f (1) = 0.
• f is strictly convex at x = 1, i.e. for all x, y, α such that αx + ᾱy = 1, the inequality
f (1) < αf (x) + ᾱf (y) is strict.
[Diagram: the inputs P_X and Q_X are passed through the same channel P_{Y|X}, producing outputs P_Y and Q_Y.]
Theorem 6.2 (Data processing inequality). For this setup, D_f(P_X‖Q_X) ≥ D_f(P_Y‖Q_Y).
One interpretation of this result is that processing the observation x makes it more difficult to
determine whether it came from PX or QX .
Proof.
    D_f(P_X‖Q_X) = E_{Q_X}[ f(P_X/Q_X) ] ⁽ᵃ⁾= E_{Q_{XY}}[ f(P_{XY}/Q_{XY}) ] = E_{Q_Y} E_{Q_{X|Y}}[ f(P_{XY}/Q_{XY}) ]
                 ≥ E_{Q_Y}[ f( E_{Q_{X|Y}}[P_{XY}/Q_{XY}] ) ]   (Jensen's inequality)
                 ⁽ᵇ⁾= E_{Q_Y}[ f(P_Y/Q_Y) ] = D_f(P_Y‖Q_Y).
Note that (a) means D_f(P_X‖Q_X) = D_f(P_{XY}‖Q_{XY}); (b) can alternatively be understood by noting that E_Q[P_{XY}/Q_{XY} | Y] is precisely the relative density P_Y/Q_Y, by checking the definition of change of measure, i.e., E_P[g(Y)] = E_Q[g(Y)·P_{XY}/Q_{XY}] = E_Q[g(Y)·E[P_{XY}/Q_{XY} | Y]] for any g.
Remark 6.2. PY |X can be a deterministic map so that Y = f (X). More specifically, if f (X) =
1E (X) for any event E, then Y is Bernoulli with parameter P (E) or Q(E) and the data processing
inequality gives
Df (PX kQX ) ≥ Df (Bern(P (E))kBern(Q(E))). (6.3)
This is how we will prove the converse direction of large deviation (see Lecture 14).
Example: If X = (X1 , X2 ) and f (X) = X1 , then we have
As seen from the proof of Theorem 6.2, this is in fact equivalent to data processing inequality.
Remark 6.3. If Df (P kQ) is an f -divergence, then Df˜(P kQ) with f˜(x) , xf ( x1 ) is also an f -
divergence and Df (P kQ) = Df˜(QkP ). Example: Df (P kQ) = D(P kQ) then Df˜(P kQ) = D(QkP ).
Proof. First we verify that f̃ has all three properties required for D_f̃(·‖·) to be an f-divergence.
• For x, y ∈ R₊ and α ∈ [0, 1], define c = αx + ᾱy, so that (αx)/c + (ᾱy)/c = 1. Observe that
    f̃(αx + ᾱy) = c f(1/c) = c f( (αx/c)·(1/x) + (ᾱy/c)·(1/y) ) ≤ c·(αx/c) f(1/x) + c·(ᾱy/c) f(1/y) = α f̃(x) + ᾱ f̃(y).
• f̃(1) = f(1) = 0.
Finally,
    D_f(P‖Q) = E_Q[ f(P/Q) ] = E_P[ (Q/P) f(P/Q) ] = E_P[ f̃(Q/P) ] = D_f̃(Q‖P).
6.3 Total variation and hypothesis testing
Recall that the choice f(x) = ½|x − 1| gives rise to the total variation distance,
    D_f(P‖Q) = ½ E_Q |P/Q − 1| = ½ ∫ |P − Q|,
where ∫|P − Q| is a short-hand understood in the usual sense, namely, ∫ |dP/dμ − dQ/dμ| dμ, where μ is a dominating measure, e.g., μ = P + Q; the value of the integral does not depend on μ.
We will denote total variation by TV(P, Q).
Theorem 6.3. The following definitions for total variation are equivalent:
1.
    TV(P, Q) = sup_E [P(E) − Q(E)],   (6.4)
where the supremum is over all measurable sets E.
2. 1 − TV(P, Q) is the minimal sum of Type-I and Type-II error probabilities for testing P versus Q, and²
    TV(P, Q) = 1 − ∫ P ∧ Q.   (6.5)
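For finite alphabets both characterizations are easy to evaluate; the following sketch (pmfs drawn at random, my own choice) checks that sup_E [P(E) − Q(E)], 1 − Σ min(P, Q), and the half-L₁ definition agree.

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(6))
Q = rng.dirichlet(np.ones(6))

# (6.4): the optimal E is {x : P(x) > Q(x)}, so sup_E P(E)-Q(E) = sum of positive parts.
tv_sup = np.maximum(P - Q, 0.0).sum()
# (6.5): TV = 1 - sum min(P, Q), i.e. one minus the overlap.
tv_overlap = 1.0 - np.minimum(P, Q).sum()
# Definition via f(x) = |x-1|/2, i.e. half the L1 distance.
tv_l1 = 0.5 * np.abs(P - Q).sum()

print(tv_sup, tv_overlap, tv_l1)   # all three coincide
```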
Remark 6.4 (Variational representation). The equation (6.4) and (6.7) provide sup-representation
of total variation, which will be extended to general f -divergences (later). Note that (6.6) is an
inf-representation of total variation in terms of couplings, meaning total variation is the Wasserstein
distance with respect to Hamming distance. The benefit of variational representations is that
choosing a particular coupling in (6.6) gives an upper bound on TV(P, Q), and choosing a particular
f in (6.7) yields a lower bound.
Remark 6.5 (Operational meaning). In the binary hypothesis test for H0 : X ∼ P or H1 : X ∼ Q,
Theorem 6.3 shows that 1 − TV(P, Q) is the sum of false alarm and missed detection probabilities.
This can be seen either from (6.4) where E is the decision region for deciding P or from (6.5) since
dP
the optimal test (for average probability of error) is the likelihood ratio test dQ > 1. In particular,
• TV(P, Q) = 1 ⇔ P ⊥ Q, the probability of error is zero since essentially P and Q have disjoint
supports.
• TV(P, Q) = 0 ⇔ P = Q and the minimal sum of error probabilities is one, meaning the best
thing to do is to flip a coin.
²Here again ∫ P ∧ Q is a short-hand understood in the usual sense, namely, ∫ (dP/dμ ∧ dQ/dμ) dμ, where μ is any dominating measure.
6.4 Motivating example: Hypothesis testing with multiple
samples
Observation:
1. Different f-divergences have different operational significance. For example, as we saw in Section 6.3, testing two hypotheses boils down to total variation, which determines the fundamental limit (minimum average probability of error). For estimation under quadratic loss, the f-divergence LC(P‖Q) = ∫ (P − Q)²/(P + Q) is useful.
2. Some f -divergence is easier to evaluate than others. For example, for product distributions,
Hellinger distance and χ2 -divergence tensorize in the sense that they are easily expressible
in terms of those of the one-dimensional marginals; however, computing the total variation
between product measures is frequently difficult. Another example is that computing the
χ2 -divergence from a mixture of distributions to a simple distribution is convenient.
Therefore the punchline is that it is often fruitful to bound one f -divergence by another and this
sometimes leads to tight characterizations. In this section we consider a specific useful example
to drive this point home. Then in the next lecture we develop inequalities between f -divergences
systematically.
Consider a binary hypothesis test where data X = (X1 , . . . , Xn ) are i.i.d drawn from either P
or Q and the goal is to test
H0 : X ∼ P ⊗n vs H1 : X ∼ Q⊗n .
As mentioned before, 1 − TV(P ⊗n , Q⊗n ) gives minimal Type-I+II probabilities of error, achieved by
the maximum likelihood test. By the data processing inequality, TV(P ⊗m , Q⊗m ) ≤ TV(P ⊗n , Q⊗n )
for m < n. From this we see that TV(P ⊗n , Q⊗n ) is an increasing sequence in n (and bounded by 1
by definition) and hence converges. One would hope that as n → ∞, TV(P ⊗n , Q⊗n ) converges to 1
and consequently, the probability of error in the hypothesis test converges to zero. It turns out that
if the distributions P, Q are independent of n, then large deviation theory (see Corollary 15.1) gives
TV(P ⊗n , Q⊗n ) = 1 − exp(−nC(P, Q) + o(n)), (6.8)
R α 1−α
where the constant C(P, Q) = − log inf 0≤α≤1 P Q is the Chernoff Information of P, Q. It
⊗n ⊗n
is clear from this that TV(P , Q ) → 1 as n → ∞, and, in fact, exponentially fast.
However, as frequently encountered in high-dimensional statistical problems, if the distributions
P = Pn and Q = Qn depend on n, then the large-deviation approach that leads to (6.8) is no longer
valid. In such a situation, total variation is still relevant for hypothesis testing, but its behavior as
n → ∞ is not obvious nor easy to compute. In this case, understanding how a more computationally
tractable f -divergence is related to total variation may give insight on hypothesis testing without
needing to directly compute the total variation. It turns out Hellinger distance is precisely suited
for this task – see Theorem 6.4 below.
Recall that the squared Hellinger distance, H²(P, Q) = E_Q[(1 − √(P/Q))²], is an f-divergence with f(x) = (1 − √x)², which provides a sandwich bound for total variation:
    0 ≤ ½ H²(P, Q) ≤ TV(P, Q) ≤ H(P, Q)·√(1 − H²(P, Q)/4) ≤ 1.   (6.9)
This can be proved by elementary manipulations; a systematic proof will be explained in the next lecture. Direct consequences of this bound are:
• H 2 (P, Q) = 2, if and only if TV(P, Q) = 1.
Proof. Because the observations X = (X_1, X_2, ..., X_n) are i.i.d., the joint distribution factors:
    H²(P_n^⊗n, Q_n^⊗n) = 2 − 2 E_{Q_n^⊗n}[ √( Π_{i=1}^{n} (P_n/Q_n)(X_i) ) ]
                       = 2 − 2 Π_{i=1}^{n} E_{Q_n}[ √((P_n/Q_n)(X_i)) ]   (by independence)
                       = 2 − 2 ( E_{Q_n}[ √(P_n/Q_n) ] )^n
                       = 2 − 2 ( 1 − ½ H²(P_n, Q_n) )^n.
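A small sketch verifying this tensorization identity on a finite alphabet (the distributions are drawn at random for illustration): H² of the n-fold product, computed directly, matches 2 − 2(1 − H²(P,Q)/2)ⁿ.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(3))
Q = rng.dirichlet(np.ones(3))
n = 4

def hellinger_sq(p, q):
    return float(((np.sqrt(p) - np.sqrt(q)) ** 2).sum())

# Direct computation on the product alphabet of size 3^n.
Pn = np.array([np.prod([P[i] for i in idx]) for idx in product(range(3), repeat=n)])
Qn = np.array([np.prod([Q[i] for i in idx]) for idx in product(range(3), repeat=n)])

direct = hellinger_sq(Pn, Qn)
formula = 2 - 2 * (1 - hellinger_sq(P, Q) / 2) ** n
print(direct, formula)   # equal up to floating-point error
```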
Note that there are other f-divergences that also tensorize, e.g., the χ²-divergence:
    χ²( Π_{i=1}^{n} P_i, Π_{i=1}^{n} Q_i ) = Π_{i=1}^{n} (1 + χ²(P_i, Q_i)) − 1;   (6.11)
however, no sandwich inequality like (6.9) exists between TV and χ², and hence there is no χ²-version of Theorem 6.4. Asserting the non-existence of such inequalities requires understanding the relationship between these two f-divergences (see next lecture).
however, no sandwich inequality like (6.9) exists for TV and χ2 and hence there is no χ2 -version of
Theorem 6.4. Asserting the non-existence of such inequalities requires understanding the relationship
between these two f -divergences (see next lecture).
6.5 Inequalities between f -divergences
We will discuss two methods for finding inequalities between f -divergences.
• ad hoc approach: case-by-case proof using results like Jensen’s inequality, max ≤ mean ≤ min,
Cauchy-Schwarz, etc.
Definition 6.2. The joint range between two f -divergences Df (·k·) and Dg (·k·) is the range of the
mapping (P, Q) 7→ (Df (P kQ), Dg (P kQ)), i.e., the set R ⊂ R+ × R+ where (x, y) ∈ R if there exist
distributions P, Q on some common measurable space such that x = Df (P kQ) and y = Dg (P kQ).
[Figure: an illustrative joint range of two f-divergences, plotted in the (D_f, D_g)-plane.]
The green region in the above figure shows what a joint range between Df (·k·) and Dg (·k·) might
look like. By definition of R, the lower boundary gives the sharpest lower bound of Dg in terms of
Df , namely:
similarly, the upper boundary gives the best upper bound. As will be discussed in the next lecture,
the sandwich bound (6.9) correspond to precisely the lower and upper boundaries of the joint range
of H 2 and TV, therefore not improvable. It is important to note, however, that R may be an
unbounded region and some of the boundaries may not exist, meaning it is impossible to bound one
by the other, such as χ2 versus TV.
To gain some intuition, we start with the ad hoc approach by proving Pinsker’s inequality, which
bounds total variation from above by the KL divergence.
Theorem 6.5 (Pinsker's inequality). D(P‖Q) ≥ 2 TV²(P, Q).
Proof. First we show that, by the data processing inequality, it suffices to prove the result for
Bernoulli distributions. For any event E, let Y = 1{X∈E} which is Bernoulli with parameter P (E)
or Q(E). By data processing inequality, D(P kQ) ≥ d(P (E)kQ(E)). If Pinsker’s inequality is true
for all Bernoulli random variables, we have
    √(½ D(P‖Q)) ≥ TV(Bern(P(E)), Bern(Q(E))) = |P(E) − Q(E)|.
Taking the supremum over E gives √(½ D(P‖Q)) ≥ sup_E |P(E) − Q(E)| = TV(P, Q), in view of Theorem 6.3.
The binary case follows easily from Taylor's theorem (with the integral form of the remainder):
    d(p‖q) = ∫_q^p (p − t)/(t(1 − t)) dt ≥ 4 ∫_q^p (p − t) dt = 2(p − q)².
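A quick numerical check of both steps (a sketch, natural logarithms so that the constant is 2): the binary inequality d(p‖q) ≥ 2(p − q)² over a grid, and the full inequality on random pmfs.

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

# Binary case: d(p||q) >= 2 (p-q)^2 on a grid.
ps = np.linspace(0.01, 0.99, 99)
worst = min(kl(np.array([p, 1 - p]), np.array([q, 1 - q])) - 2 * (p - q) ** 2
            for p in ps for q in ps)
print("min of d(p||q) - 2(p-q)^2 over the grid:", worst)   # non-negative

# General case on random distributions.
rng = np.random.default_rng(3)
for _ in range(5):
    P, Q = rng.dirichlet(np.ones(8)), rng.dirichlet(np.ones(8))
    tv = 0.5 * np.abs(P - Q).sum()
    print(kl(P, Q), ">=", 2 * tv ** 2)
```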
Remark 6.7. Pinsker's inequality is known to be sharp in the sense that the constant “2” in (1.15) is not improvable, i.e., there exist {P_n, Q_n} such that D(P_n‖Q_n)/TV²(P_n, Q_n) → 2 as n → ∞. (Why? Think about the local quadratic behavior in Proposition 4.2.) Nevertheless, this does not mean that (1.15) itself
is not improvable because it might be possible to subtract some higher-order term from the RHS.
This is indeed the case and there are many refinements of Pinsker’s inequality. But what is the best
inequality? Settling this question rests on characterizing the joint range and the lower boundary.
This is the topic of next lecture.
§ 7. Inequalities between f -divergences via joint range
In the last lecture we proved Pinsker's inequality, D(P‖Q) ≥ 2TV²(P, Q), in an ad hoc manner. The downside of ad hoc approaches is that it is hard to tell whether those inequalities can be improved or not. However, the key step in our proof of Pinsker's inequality, the reduction to the case of Bernoulli random variables, is inspiring: is it possible to reduce inequalities between any two f-divergences to the binary case?
The problem boils down to the characterization of the region {(TV(P, Q), D(P‖Q)) : P, Q} ⊆ R², their joint range, whose lower boundary is the function F.
[Figure 7.1: the joint range of (TV(P, Q), D(P‖Q)); its lower boundary gives the best lower bound on D in terms of TV.]
Definition 7.1 (Joint range). Consider two f -divergences Df (P kQ) and Dg (P kQ). Their joint
range is a subset of R2 defined by
The region R seems difficult to characterize since we need to consider P, Q over all measurable
spaces; on the other hand, the region Rk for small k is easy to obtain. The main theorem we will
prove is the following [HV11]:
Theorem 7.1 (Harremoës-Vajda ’11).
R = co(R2 ).
It is easy to obtain a parametric formula of R2 . By Theorem 7.1, the region R is no more than
the convex hull of R2 .
Theorem 7.1 implies that R is a convex set; however, as a warmup, it is instructive to prove con-
vexity of R directly, which simply follows from the arbitrariness of the alphabet size of distributions.
Given any two points (Df (P0 kQ0 ), Dg (P0 kQ0 )) and (Df (P1 kQ1 ), Dg (P1 kQ1 )) in the joint range, it
is easy to construct another pair of distributions (P, Q) by alphabet extension that produces any
convex combination of those two points.
Theorem 7.2. R is convex.
Proof. Given any two pairs of distributions (P0 , Q0 ) and (P1 , Q1 ) on some space X and given any
α ∈ [0, 1], we define another pair of distributions (P, Q) on X × {0, 1} by
P = ᾱ(P0 × δ0 ) + α(P1 × δ1 ),
Q = ᾱ(Q0 × δ0 ) + α(Q1 × δ1 ).
In other words, we construct a random variable Z = (X, B) with B ∼ Bern(α), where PX|B=i = Pi
and QX|B=i = Qi . Then
    D_f(P‖Q) = E_Q[ f(P/Q) ] = E_B E_{Q_{Z|B}}[ f(P/Q) ] = ᾱ D_f(P_0‖Q_0) + α D_f(P_1‖Q_1),
    D_g(P‖Q) = E_Q[ g(P/Q) ] = E_B E_{Q_{Z|B}}[ g(P/Q) ] = ᾱ D_g(P_0‖Q_0) + α D_g(P_1‖Q_1).
Therefore, ᾱ(Df (P0 kQ0 ), Dg (P0 kQ0 )) + α(Df (P1 kQ1 ), Dg (P1 kQ1 )) ∈ R and thus R is convex.
and hence
Rk = co(R2 ), for any k ≥ 3.
7.1.1 Representation of f -divergences
To prove Lemma 7.1 and Lemma 7.2, we first express f -divergences by means of expectation over
the likelihood ratio.
Lemma 7.3. Given two f-divergences D_f(·‖·) and D_g(·‖·), their joint range is
    R = { (E[f(X)] + f̃(0)(1 − E[X]),  E[g(X)] + g̃(0)(1 − E[X])) : X ≥ 0, E[X] ≤ 1 },
    R_k = { (E[f(X)] + f̃(0)(1 − E[X]),  E[g(X)] + g̃(0)(1 − E[X])) : X ≥ 0, E[X] ≤ 1 and X takes at most k − 1 values, or X ≥ 0, E[X] = 1 and X takes at most k values }.
Therefore D_f(P‖Q) = f(0.6) + 0.4·f̃(0) = E[f(X)] + f̃(0)(1 − E[X]), where X is defined on {Q > 0} and has pmf μ(0.6) = 1.
In the other direction, given the above pmf of a non-negative random variable X ∼ μ with E[X] ≤ 1 that takes one value, we let Q(x) = μ(x), let P(x) = xμ(x) on {Q > 0}, and let P have an extra point mass 1 − E[X].
Proof of Lemma 7.3. We first prove it for R. Given any pair of distributions (P, Q) that produces a point of R, let p, q denote the densities of P, Q under some dominating measure μ, respectively. Let
    X = p/q on {q > 0},   μ_X = Q,   (7.1)
so that X ≥ 0 and E[X] = P[q > 0] ≤ 1. Then
    D_f(P‖Q) = ∫_{q>0} f(p/q) dQ + ∫_{q=0} (q/p) f(p/q) dP = ∫_{q>0} f(p/q) dQ + f̃(0)·P[q = 0]
             = E[f(X)] + f̃(0)(1 − E[X]).
Analogously,
    D_g(P‖Q) = E[g(X)] + g̃(0)(1 − E[X]).
In the other direction, given any random variable X ≥ 0 with E[X] ≤ 1 and X ∼ μ, let
Now we consider Rk . Given two probability measures P, Q on [k], the likelihood ratio defined in
(7.1) takes at most k values. If P Q then E[X] = 1; if P 6 Q then X takes at most k − 1 values.
On the other direction, if E[X] = 1 then the construction of P, Q in (7.2) are on the same
support of X; if E[X] < 1 then the support of P is increased by one.
R ⊆ R4 .
Let S , {(x, f (x), g(x)) : x ≥ 0} which is a connected set. For any pair of distributions (P, Q) that
produces a point of R, we construct a random variable X as in (7.1), then (E[X], E[f (X)], E[g(X)]) ∈
co(S). By the Fenchel–Eggleston–Carathéodory theorem,¹ there exist (x_i, f(x_i), g(x_i)) and corresponding weights α_i for i = 1, 2, 3 such that
    (E[X], E[f(X)], E[g(X)]) = Σ_{i=1}^{3} α_i (x_i, f(x_i), g(x_i)).
¹To prove Theorem 7.1, it suffices to invoke the basic Carathéodory theorem, which proves a weaker version of Lemma 7.1, namely R = R₅.
We construct another random variable X' that takes value x_i with probability α_i. Then X' takes 3 values and
    (E[X], E[f(X)], E[g(X)]) = (E[X'], E[f(X')], E[g(X')]).   (7.3)
By Lemma 7.3 and (7.3),
    (D_f(P‖Q), D_g(P‖Q)) = (E[f(X)] + f̃(0)(1 − E[X]),  E[g(X)] + g̃(0)(1 − E[X]))
                         = (E[f(X')] + f̃(0)(1 − E[X']),  E[g(X')] + g̃(0)(1 − E[X'])) ∈ R₄.
We observe from Lemma 7.3 that Df (P kQ) only depends on the distribution of X for some
X ≥ 0 and E[X] ≤ 1. To find a pair of distributions that produce a point in Rk it suffices to find a
random variable X ≥ 0 taking k values with E[X] = 1, or taking k − 1 values with E[X] ≤ 1. In
Example 7.1.1 where (P, Q) produces a point in R3 , we want to show that it also belongs to co(R2 ).
The decomposition of a point in R3 is equivalent to the decomposition of the likelihood ratio X that
µX = αµ1 + ᾱµ2 .
A solution of such a decomposition is μ_X = 0.5μ₁ + 0.5μ₂, where μ₁, μ₂ have the following pmfs:
    μ₁: x = 0.4 w.p. 0.8, x = 3.4 w.p. 0.2;   μ₂: x = 0.4 w.p. 0.9, x = 6.4 w.p. 0.1.
Then by (7.2) we obtain two pairs of distributions
    P₁ = (0.32, 0.68), Q₁ = (0.8, 0.2);   P₂ = (0.36, 0.64), Q₂ = (0.9, 0.1).
We obtain that
    (D_f(P‖Q), D_g(P‖Q)) = 0.5·(D_f(P₁‖Q₁), D_g(P₁‖Q₁)) + 0.5·(D_f(P₂‖Q₂), D_g(P₂‖Q₂)).
Proof of Lemma 7.2. It suffices to prove the first statement, namely, Rk+1 = co(R2 ∪ Rk ) for any
k ≥ 2. By the same argument as in the proof of Theorem 7.2 we have co(Rk ) ⊆ Rk+1 and note
that R2 ∪ Rk = Rk . We only need to prove that
Rk+1 ⊆ co(R2 ∪ Rk ).
Given any pair of distributions (P, Q) that produces a point of (Df (P kQ), Dg (P kQ)) ∈ Rk+1 , we
construct a random variable X as in (7.1) that takes at most k + 1 values. Let µ denote the
distribution of X. We consider two cases that Eµ [X] < 1 and Eµ [X] = 1 separately.
• E_μ[X] < 1. Then X takes at most k values, since otherwise P ≪ Q. Denote the smallest value of X by x; then x < 1. Suppose μ(x) = q; then μ can be represented as
    μ = qδ_x + q̄μ',
    μ = αμ₁ + ᾱμ₂,
– E_{μ'}[X] > 1. Let μ₁ = pδ_x + p̄μ', where p = (E_{μ'}[X] − 1)/(E_{μ'}[X] − x), so that E_{μ₁}[X] = 1. Let α = (E_μ[X] − x)/(1 − x).
• E_μ[X] = 1.² Denote the smallest value of X by x and the largest value by y, respectively; then x ≤ 1, y ≥ 1. Suppose μ(x) = r and μ(y) = s; then μ can be represented as
where μ' is supported on at most k − 1 values of X other than x, y. Let μ₂ = βδ_x + β̄δ_y, where β = (y − 1)/(y − x), so that E_{μ₂}[X] = 1. We need to construct another probability measure μ₁ such that
    μ = αμ₁ + ᾱμ₂,
– E_{μ'}[X] ≤ 1. Let μ₁ = pδ_y + p̄μ', where p = (1 − E_{μ'}[X])/(y − E_{μ'}[X]), so that E_{μ₁}[X] = 1. Let ᾱ = r/β.
– E_{μ'}[X] > 1. Let μ₁ = pδ_x + p̄μ', where p = (E_{μ'}[X] − 1)/(E_{μ'}[X] − x), so that E_{μ₁}[X] = 1. Let ᾱ = s/β̄.
Applying the construction in (7.2) with μ₁ and μ₂, we obtain two pairs of distributions (P₁, Q₁) supported on k values and (P₂, Q₂) supported on two values, respectively. Then
    (D_f(P‖Q), D_g(P‖Q)) = (E_μ[f(X)] + f̃(0)(1 − E_μ[X]),  E_μ[g(X)] + g̃(0)(1 − E_μ[X]))
                         = α·(E_{μ₁}[f(X)] + f̃(0)(1 − E_{μ₁}[X]),  E_{μ₁}[g(X)] + g̃(0)(1 − E_{μ₁}[X]))
                           + ᾱ·(E_{μ₂}[f(X)] + f̃(0)(1 − E_{μ₂}[X]),  E_{μ₂}[g(X)] + g̃(0)(1 − E_{μ₂}[X]))
                         = α·(D_f(P₁‖Q₁), D_g(P₁‖Q₁)) + ᾱ·(D_f(P₂‖Q₂), D_g(P₂‖Q₂)).
Remark 7.1. Theorem 7.1 can be viewed as a consequence of Krein-Milman’s theorem. Consider
the space of {PX : X ≥ 0, E[X] ≤ 1}, which has only two types of extreme points:
1. X = x for 0 ≤ x ≤ 1;
For the first case, let P = Bern(x) and Q = δ1 ; for the second case, let P = Bern(α2 x2 ) and
Q = Bern(α2 ).
7.2 Examples
7.2.1 Hellinger distance versus total variation
The upper and lower bounds we mentioned in the last lecture are the following [Tsy09, Sec. 2.4]:
    ½ H²(P, Q) ≤ TV(P, Q) ≤ H(P, Q)·√(1 − H²(P, Q)/4).   (7.4)
²Many thanks to Pengkun Yang for correcting the error in the original proof.
Their joint range R₂ has the parametric formula
    { (2(1 − √(pq) − √(p̄q̄)), |p − q|) : 0 ≤ p ≤ 1, 0 ≤ q ≤ 1 }
and is the grey region in Fig. 7.2. The joint range R is the convex hull of R₂ (which by itself is non-convex) and is exactly described by (7.4); so (7.4) is not improvable. Indeed, with t ranging from 0 to 1,
[Figure 7.2: the joint range of H²(P, Q) (horizontal axis, 0 to 2) and TV(P, Q) (vertical axis, 0 to 1).]
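A sketch that samples the parametric formula for R₂ and checks that every sampled binary pair respects the two boundary curves of (7.4), i.e., the sandwich bound is tight only at the boundary.

```python
import numpy as np

rng = np.random.default_rng(4)
p = rng.uniform(0, 1, 100000)
q = rng.uniform(0, 1, 100000)

H2 = 2 * (1 - np.sqrt(p * q) - np.sqrt((1 - p) * (1 - q)))   # squared Hellinger, binary pair
TV = np.abs(p - q)

lower = H2 / 2
upper = np.sqrt(H2) * np.sqrt(1 - H2 / 4)
print("lower bound of (7.4) ever violated:", bool((TV < lower - 1e-12).any()))   # False
print("upper bound of (7.4) ever violated:", bool((TV > upper + 1e-12).any()))   # False
```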
There are various kinds of improvements of Pinsker's inequality. Now we know that the best lower bound is the lower boundary of Fig. 7.1, which is exactly the boundary of R₂. Although no closed-form expression is known, a parametric formula for the lower boundary (see Fig. 7.1) is not hard to write down [FHT03, Theorem 1]:
    TV_t = ½ t (1 − (coth(t) − 1/t)²),
    D_t  = −t² csch²(t) + t coth(t) + log(t·csch(t)),   t ≥ 0.   (7.6)
1 + TV(P, Q)
D(P kQ) ≥ TV(P, Q) log .
1 − TV(P, Q)
Consequences:
• TV → 1 ⇒ D → ∞. Thus D = O(1) ⇒ TV is bounded away from one. This is not obtainable
from Pinsker’s inequality.
Also from Fig. 7.1 we know that it is impossible to have an upper bound of D(P kQ) using a function
of TV(P, Q) due to the lack of upper boundary.
For more examples see [Tsy09, Sec. 2.4].
• KL vs Hellinger:
    D(P‖Q) ≥ 2 log( 2 / (2 − H²(P, Q)) ).   (7.7)
There is a partial result in the opposite direction (log-Sobolev inequality for the Bonami-Beckner semigroup):
    D(μ‖π) ≤ [ log(1/π∗ − 1) / (1 − 2π∗) ] · ( H²(μ, π) − H⁴(μ, π)/4 ),   π∗ = min_x π(x).
• KL vs χ²:
    0 ≤ D(P‖Q) ≤ log(1 + χ²(P‖Q)).   (7.8)
(i.e., no lower bound on KL in terms of χ² is possible).
Part II
§ 8. Variable-length Lossless Compression
• Lossless: P(X ≠ X̂) = 0. Variable-length codes, uniquely decodable codes, prefix codes, Huffman codes.
• Almost lossless: P(X ≠ X̂) ≤ ε. Fixed-length codes.
• Lossy: X → W → X̂ s.t. E[(X − X̂)²] ≤ distortion.
    X  →  [compressor f : X → {0, 1}*]  →  [decompressor g : {0, 1}* → X]  →  X
• Since {0, 1}∗ = {∅, 0, 1, 00, 01, . . . } is countable, lossless compression is only possible for
discrete R.V.;
• WLOG, we can relabel X such that X = N = {1, 2, . . . } and order the pmf decreasingly:
PX (i) ≥ PX (i + 1).
• Note that at this point we do not impose any conditions on the codebook (such as prefix
or unique-decodability). This is sometimes called single-shot compression setting. Original
results for this setting can be found in [KV14].
Length function:
l : {0, 1}∗ → N
e.g., l(01001) = 5.
Objectives: Find the best compressor f to minimize
E[l(f (X))]
Then
1. length of codeword:
l(f ∗ (i)) = blog2 ic
2. l(f*(X)) is stochastically the smallest: for any lossless f,
    l(f*(X)) ≤_st l(f(X)),
where the inequality is because f is lossless, and hence |A_k| can be at most the total number of binary strings of length at most k. Then
    P[l(f(X)) ≤ k] = Σ_{x∈A_k} P_X(x) ≤ Σ_{x∈A*_k} P_X(x) = P[l(f*(X)) ≤ k],
since |A_k| ≤ |A*_k| and A*_k contains the 2^{k+1} − 1 most likely symbols.
The following lemma (homework) is useful in bounding the expected code length of f ∗ . It says
if the random variable is integer-valued, then its entropy can be controlled using its mean.
Lemma 8.1. For any Z ∈ N s.t. E[Z] < ∞, H(Z) ≤ E[Z]·h(1/E[Z]), where h(·) is the binary entropy function.
Theorem 8.2 (Optimal average code length: exact expression). Suppose X ∈ N and P_X(1) ≥ P_X(2) ≥ .... Then
    E[l(f*(X))] = Σ_{k=1}^{∞} P[X ≥ 2^k].
Proof. Recall that the expectation of U ∈ Z₊ can be written as E[U] = Σ_{k≥1} P[U ≥ k]. Then, by Theorem 8.1, E[l(f*(X))] = E[⌊log₂ X⌋] = Σ_{k≥1} P[⌊log₂ X⌋ ≥ k] = Σ_{k≥1} P[log₂ X ≥ k].
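As a worked example (a sketch with a geometric source of my own choosing, already sorted decreasingly), the series Σ_k P[X ≥ 2^k] can be compared against a direct evaluation of E[⌊log₂ X⌋] and against H(X).

```python
import numpy as np

r = 0.9
N = 2000                              # truncation; the tail mass beyond N is ~1e-92
i = np.arange(1, N + 1)
p = (1 - r) * r ** (i - 1)            # P_X(i) = (1-r) r^{i-1}, decreasing in i

direct = (p * np.floor(np.log2(i))).sum()            # E[l(f*(X))] = E[floor(log2 X)]
series = sum(r ** (2 ** k - 1) for k in range(1, 30))  # Theorem 8.2: P[X >= m] = r^{m-1}
H = -(p * np.log2(p)).sum()

print("direct  :", direct)
print("series  :", series)    # matches `direct`
print("entropy :", H)         # H - log2(e(H+1)) <= E[l(f*(X))] <= H  (Theorem 8.3)
```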
Note: Theorem 8.3 is the first example of a coding theorem, which relates the fundamental limit
E[l(f ∗ (X))] (operational quantity) to the entropy H(X) (information measure).
LHS:
    H(X) = H(X, L) = H(X|L) + H(L)
         ≤ E[L] + (1 + E[L])·h(1/(1 + E[L]))   (Lemma 8.1)
         = E[L] + log₂(1 + E[L]) + E[L] log₂(1 + 1/E[L])
         ≤ E[L] + log₂(1 + E[L]) + log₂ e   (x log(1 + 1/x) ≤ log e, ∀x > 0)
         ≤ E[L] + log₂(e(1 + H(X)))   (by RHS)
where we have used H(X|L = k) ≤ k bits, since given l(f*(X)) = k, X has at most 2^k choices.
Note: (Memoryless source) If X = S^n is an i.i.d. sequence, then
    nH(S) ≥ E[l(f*(S^n))] ≥ nH(S) − log n + O(1).
For i.i.d. sources, the exact asymptotic behavior is found in [SV11, Theorem 4]:
    E[l(f*(S^n))] = nH(S) − ½ log n + O(1),
unless the source is uniform (in which case it is nH(S) + O(1)).
Theorem 8.3 relates the mean of l(f*(X)) to that of log₂(1/P_X(X)) (entropy). The next result relates their CDFs.
Theorem 8.4 (Code length distribution of f*). ∀τ > 0, k ≥ 0,
    P[log₂(1/P_X(X)) ≤ k] ≤ P[l(f*(X)) ≤ k] ≤ P[log₂(1/P_X(X)) ≤ k + τ] + 2^{−τ+1}.
Proof. LHS (achievability): Use P_X(m) ≤ 1/m. Then, similarly as in Theorem 8.3, L(m) = ⌊log₂ m⌋ ≤ log₂ m ≤ log₂(1/P_X(m)). Hence L(X) ≤ log₂(1/P_X(X)) a.s.
RHS (converse): By truncation,
    P[L ≤ k] = P[L ≤ k, log₂(1/P_X(X)) ≤ k + τ] + P[L ≤ k, log₂(1/P_X(X)) > k + τ]
             ≤ P[log₂(1/P_X(X)) ≤ k + τ] + Σ_{x∈X} P_X(x) 1{l(f*(x)) ≤ k} 1{P_X(x) ≤ 2^{−k−τ}}
             ≤ P[log₂(1/P_X(X)) ≤ k + τ] + (2^{k+1} − 1)·2^{−k−τ}.
So far our discussion applies to an arbitrary random variable X. Next we consider the source as a random process (S₁, S₂, ...) and introduce the blocklength n. We apply our results to X = S^n, that is, by treating the first n symbols as a supersymbol. The following corollary states that the limiting behaviors of l(f*(S^n)) and log(1/P_{S^n}(S^n)) always coincide.
Corollary 8.1. Let (S₁, S₂, ...) be some random process and U be some random variable. Then
    (1/n) log₂(1/P_{S^n}(S^n)) →ᴰ U  ⇔  (1/n) l(f*(S^n)) →ᴰ U,   (8.1)
and
    (1/√n)(log₂(1/P_{S^n}(S^n)) − H(S^n)) →ᴰ V  ⇔  (1/√n)(l(f*(S^n)) − H(S^n)) →ᴰ V.   (8.2)
Proof. The proof is simple logic. First recall: convergence in distribution is equivalent to convergence of the CDF at all continuity points, i.e., U_n →ᴰ U ⇔ P[U_n ≤ u] → P[U ≤ u] for all u at which the CDF of U is continuous (i.e., u is not an atom of U).
To get (8.1), apply Theorem 8.4 with k = un and τ = √n:
    P[(1/n) log₂(1/P_X(X)) ≤ u] ≤ P[(1/n) l(f*(X)) ≤ u] ≤ P[(1/n) log₂(1/P_X(X)) ≤ u + 1/√n] + 2^{−√n+1}.
To get (8.2), apply Theorem 8.4 with k = H(S^n) + √n·u and τ = n^{1/4}:
    P[(1/√n)(log₂(1/P_{S^n}(S^n)) − H(S^n)) ≤ u] ≤ P[(l(f*(S^n)) − H(S^n))/√n ≤ u]
        ≤ P[(1/√n)(log₂(1/P_{S^n}(S^n)) − H(S^n)) ≤ u + n^{−1/4}] + 2^{−n^{1/4}+1}.
Remark 8.2 (Memoryless source). Now let us consider S^n that is i.i.d. Then the important observation is that the log-likelihood becomes an i.i.d. sum:
    log(1/P_{S^n}(S^n)) = Σ_{i=1}^{n} log(1/P_S(S_i)),
with i.i.d. summands.
1. By the Law of Large Numbers (LLN), we know that (1/n) log(1/P_{S^n}(S^n)) →ᴾ E[log(1/P_S(S))] = H(S). Therefore in (8.1) the limiting distribution U is degenerate, U = H(S), and we have (1/n) l(f*(S^n)) →ᴾ H(S). [Note: convergence in distribution to a constant ⇔ convergence in probability to a constant.]
2. By the Central Limit Theorem (CLT), if V(S) ≜ Var[log(1/P_S(S))] < ∞,² then we know that V in (8.2) is Gaussian, i.e.,
    (1/√(nV(S))) (log(1/P_{S^n}(S^n)) − nH(S)) →ᴰ N(0, 1).
Consequently, we have the following Gaussian approximation for the probability law of the optimal code length:
    (1/√(nV(S))) (l(f*(S^n)) − nH(S)) →ᴰ N(0, 1),
or, in shorthand,
    l(f*(S^n)) ∼ nH(S) + √(nV(S))·N(0, 1)   in distribution.
The Gaussian approximation tells us the speed at which (1/n) l(f*(S^n)) approaches the entropy and gives a good approximation at finite n. In the next section we apply our bounds to approximate the distribution of l(f*(S^n)) in a concrete example.
²V(S) is often known as the varentropy of S.
8.1.1 Compressing iid ternary source
Consider a source outputting n ternary letters, each independent and distributed as P_X = [0.445, 0.445, 0.110] (cf. the figures below).
In all cases above, E[ℓ(f*(X))] is close to the midpoint between the two.
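A sketch reproducing the Gaussian approximation used in the plots below (the exact CDF in the figures comes from Theorem 8.2/8.4; here I only Monte-Carlo the information density and overlay N(nH, nV)).

```python
import numpy as np
from math import erf

P = np.array([0.445, 0.445, 0.110])
H = -(P * np.log2(P)).sum()                         # entropy, bits
V = (P * (np.log2(1 / P) - H) ** 2).sum()           # varentropy, bits^2
n = 100

rng = np.random.default_rng(6)
samples = rng.choice(3, size=(100000, n), p=P)
info_density = np.log2(1 / P[samples]).sum(axis=1)  # log2 1/P_{S^n}(S^n)

# By Corollary 8.1 and Remark 8.2, l(f*(S^n)) shares these fluctuations, ~ N(nH, nV).
for x in np.linspace(H - 0.2, H + 0.2, 9):
    empirical = (info_density / n <= x).mean()
    gaussian = 0.5 * (1 + erf((x - H) * np.sqrt(n / (2 * V))))
    print(f"rate {x:.3f}   empirical CDF {empirical:.3f}   Gaussian approx {gaussian:.3f}")
```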
[Figure: optimal compression of the i.i.d. ternary source, n = 20, P_X = [0.445 0.445 0.110]: true CDF and PMF of the rate, together with the lower/upper bounds and the Gaussian approximation.]
[Figure 8.2: optimal compression, n = 100, P_X = [0.445 0.445 0.110]: CDF and PMF; the Gaussian approximation is shifted to the true E[ℓ(f*(X))].]
[Figure: optimal compression, n = 500, P_X = [0.445 0.445 0.110]: CDF and PMF with bounds and Gaussian approximation.]
8.2 Uniquely decodable codes, prefix codes and Huffman codes
[Diagram: Huffman codes as a subclass of prefix codes.]
We have studied f ∗ , which achieves the stochastically smallest code length among all variable-
length compressors. Note that f ∗ is obtained by ordering the pmf and assigning shorter codewords to
more likely symbols. In this section we focus on a specific class of compressors with good algorithmic
properties which lead to low complexity and short delay when decoding from a stream of compressed
bits. This part is more combinatorial in nature.S
We start with a few definitions. Let A+ = n≥1 An denotes all non-empty finite-length strings
consisting of symbols from the alphabet A. Throughout this lecture A is a countable set.
Definition 8.1 (Extension of a code). The (symbol-by-symbol) extension of f : A → {0, 1}∗ is
f : A+ → {0, 1}∗ where f (a1 , . . . , an ) = (f (a1 ), . . . , f (an )) is defined by concatenating the bits.
Definition 8.2 (Uniquely decodable codes). f : A → {0, 1}∗ is uniquely decodable if its extension
f : A+ → {0, 1}∗ is injective.
Definition 8.3 (Prefix codes). f : A → {0, 1}∗ is a prefix code 3 if no codeword is a prefix of another
(e.g., 010 is a prefix of 0101).
Example: A = {a, b, c}.
• f (a) = 0, f (b) = 1, f (c) = 10 – not uniquely decodable, since f (ba) = f (c) = 10.
• f (a) = 0, f (b) = 01, f (c) = 011, f (d) = 0111 – uniquely decodable but not prefix, since as long
as 0 appears, we know that the previous codeword has terminated.
Remark 8.3.
2. Similar to prefix-free codes, one can define suffix-free codes. Those are also uniquely decodable
(one should start decoding in reverse direction).
3. By definition, any uniquely decodable code does not have the empty string as a codeword.
Hence f : X → {0, 1}+ in both Definition 8.2 and Definition 8.3.
4. Unique decodability means that one can decode from a stream of bits without ambiguity, but
one might need to look ahead in order to decide the termination of a codeword. (Think of the
last example). In contrast, prefix codes allow the decoder to decode instantaneously without
looking ahead.
5. Prefix code ↔ binary tree (codewords are leaves) ↔ strategy to ask “yes/no” questions
1. Let f : A → {0, 1}* be uniquely decodable. Set l_a = l(f(a)). Then f satisfies the Kraft inequality
    Σ_{a∈A} 2^{−l_a} ≤ 1.   (8.3)
2. Conversely, for any set of code length {la : a ∈ A} satisfying (8.3), there exists a prefix code
f , such that la = l(f (a)). Moreover, such an f can be computed efficiently.
Note: The consequence of Theorem 8.5 is that as far as compression efficiency is concerned, we can
forget about uniquely decodable codes that are not prefix codes.
Proof. We prove the Kraft inequality for prefix codes and uniquely decodable codes separately.
The purpose for doing a separate proof for prefix codes is to illustrate the powerful technique of
probabilistic method. The idea is from [AS08, Exercise 1.8, p. 12].
Let f be a prefix code. Let us construct a probability space such that the LHS of (8.3) is the
probability of some event, which cannot exceed one. To this end, consider the following scenario:
Generate independent Bern(½) bits. Stop if a codeword has been written; otherwise continue. This process terminates with probability Σ_{a∈A} 2^{−l_a}. The summation makes sense because the events that a given codeword is written are mutually exclusive, thanks to the prefix condition.
Now let f be a uniquely decodable code. The proof uses a generating function as a device for counting. (The analogy in coding theory is the weight enumerator function.) First assume A is finite. Then L = max_{a∈A} l_a is finite. Let G_f(z) = Σ_{a∈A} z^{l_a} = Σ_{l=0}^{L} A_l(f) z^l, where A_l(f) denotes the number of codewords of length l in f. For k ≥ 1, define f^k : A^k → {0,1}⁺ as the symbol-by-symbol extension of f. Then
    G_{f^k}(z) = Σ_{a^k∈A^k} z^{l(f^k(a^k))} = Σ_{a_1} ··· Σ_{a_k} z^{l_{a_1}+···+l_{a_k}} = [G_f(z)]^k = Σ_{l=0}^{kL} A_l(f^k) z^l.
By the unique decodability of f, f^k is lossless. Hence A_l(f^k) ≤ 2^l. Therefore we have [G_f(1/2)]^k = G_{f^k}(1/2) ≤ kL for all k. Then Σ_{a∈A} 2^{−l_a} = G_f(1/2) ≤ (kL)^{1/k} → 1 as k → ∞. If A is countably infinite, then for any finite subset A₀ ⊂ A, repeating the same argument gives Σ_{a∈A₀} 2^{−l_a} ≤ 1. The proof is complete by the arbitrariness of A₀.
Conversely, given a set of code lengths {l_a : a ∈ A} s.t. Σ_{a∈A} 2^{−l_a} ≤ 1, construct a prefix code f as follows: first relabel A to N and assume that 1 ≤ l₁ ≤ l₂ ≤ .... For each i, define
    a_i ≜ Σ_{k=1}^{i−1} 2^{−l_k},
with a₁ = 0. Then a_i < 1 by the Kraft inequality. Thus we define the codeword f(i) ∈ {0, 1}⁺ as the first l_i bits in the binary expansion of a_i. Finally, we prove that f is a prefix code by contradiction: suppose for some j > i, f(i) is a prefix of f(j) (recall l_j ≥ l_i). Then a_j − a_i < 2^{−l_i}, since they agree on the most significant l_i bits. But a_j − a_i = 2^{−l_i} + 2^{−l_{i+1}} + ... ≥ 2^{−l_i}, which is a contradiction.
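The construction in this proof is effectively an algorithm; here is a minimal sketch (the input lengths are my own example): sort the lengths, take the binary expansions of the partial Kraft sums, and read off prefix-free codewords.

```python
def prefix_code_from_lengths(lengths):
    """Given lengths satisfying Kraft, return a prefix-free codeword per length
    (in sorted order), following the proof of Theorem 8.5."""
    assert sum(2.0 ** -l for l in lengths) <= 1.0, "Kraft inequality violated"
    code, a = [], 0.0
    for l in sorted(lengths):
        # first l bits of the binary expansion of a
        word = "".join(str(int(a * 2 ** (j + 1)) % 2) for j in range(l))
        code.append(word)
        a += 2.0 ** -l
    return code

codewords = prefix_code_from_lengths([1, 2, 3, 3])
print(codewords)   # ['0', '10', '110', '111']
# sanity check: no codeword is a prefix of another
assert not any(c1 != c2 and c2.startswith(c1) for c1 in codewords for c2 in codewords)
```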
Open problems:
In view of Theorem 8.5, the optimal average code length among all prefix (or uniquely decodable) codes is given by the following optimization problem:
    L*(X) ≜ min Σ_{a∈A} P_X(a) l_a   (8.4)
    s.t. Σ_{a∈A} 2^{−l_a} ≤ 1,   l_a ∈ N.
This is an integer programming (IP) problem, which, in general, is computationally hard to solve.
It is remarkable that this particular IP can be solved in near-linear time, thanks to the Huffman
algorithm. Before describing the construction of Huffman codes, let us give bounds to L∗ (X) in
terms of entropy:
Theorem 8.6.
    H(X) ≤ L*(X) ≤ H(X) + 1 bit.   (8.5)
Proof. “≤”: Consider the length assignment l_a = ⌈log₂(1/P_X(a))⌉,⁴ which satisfies Kraft since Σ_{a∈A} 2^{−l_a} ≤ Σ_{a∈A} P_X(a) = 1. By Theorem 8.5, there exists a prefix code f such that l(f(a)) = ⌈log₂(1/P_X(a))⌉ and E l(f(X)) ≤ H(X) + 1.
“≥”: We give two proofs for the converse. One of the commonly used ideas in combinatorial optimization is relaxation. Our first idea is to drop the integer constraints in (8.4) and relax it into the following optimization problem, which obviously provides a lower bound:
    min Σ_{a∈A} P_X(a) l_a   (8.6)
    s.t. Σ_{a∈A} 2^{−l_a} ≤ 1.   (8.7)
This is a nice convex programming problem, since the objective function is affine and the feasible set is convex. Solving (8.6) by Lagrange multipliers (exercise!) shows that the minimum equals H(X), achieved at l_a = log₂(1/P_X(a)).
⁴Such a code is called a Shannon code.
Another proof is the following: for any f satisfying the Kraft inequality, define a probability measure Q(a) = 2^{−l_a} / Σ_{a∈A} 2^{−l_a}. Then
    E l(f(X)) − H(X) = D(P‖Q) − log Σ_{a∈A} 2^{−l_a} ≥ 0.
Next we describe the Huffman code, which achieves the optimum in (8.4). In view of the fact
that prefix codes and binary trees are one-to-one, the main idea of Huffman code is to build the
binary tree bottom-up. Given a pmf {P_X(a) : a ∈ A}:
1. Pick the two least probable symbols in the alphabet.
2. Delete the two symbols and add a new symbol (with their combined probability). Add the new symbol as the parent node of the previous two symbols in the binary tree.
The algorithm terminates in |A| − 1 steps. Given the binary tree, the code assignment can be
obtained by assigning 0/1 to the branches. Therefore the time complexity is O(|A|) (sorted pmf) or
O(|A| log |A|) (unsorted pmf).
Example: A = {a, b, c, d, e}, P_X = {0.25, 0.25, 0.2, 0.15, 0.15}.
Huffman tree: the root splits into a node of weight 0.55 (containing a and the merged pair {d, e} of weight 0.3) and a node of weight 0.45 (containing b and c).
Codebook:
    f(a) = 00,  f(b) = 10,  f(c) = 11,  f(d) = 010,  f(e) = 011.
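A compact sketch of the bottom-up construction using a heap (a standard implementation, not the notes' exact pseudocode); it reproduces the code lengths of the example above, although tie-breaking may permute the actual codewords.

```python
import heapq
from itertools import count

def huffman(pmf):
    """pmf: dict symbol -> probability. Returns dict symbol -> binary codeword."""
    tie = count()                                      # tie-breaker: never compare dicts
    heap = [(p, next(tie), {s: ""}) for s, p in pmf.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)                # two least probable nodes
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}   # prepend branch labels
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

pmf = {"a": 0.25, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.15}
code = huffman(pmf)
print(code)
print("average length:", sum(pmf[s] * len(w) for s, w in code.items()))   # 2.3 bits
```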
Theorem 8.7 (Optimality of Huffman codes). The Huffman code achieves the minimal average
code length (8.4) among all prefix (or uniquely decodable) codes.
1. Does not exploit memory. Solution: block Huffman coding. Shannon’s original idea from
1948 paper: in compressing English text, instead of dealing with letters and exploiting the
nonequiprobability of the English alphabet, working with pairs of letters to achieve more
compression (more generally, n-grams). Indeed, compressing the block (S1 , . . . , Sn ) using its
Huffman code achieves H(S1 , . . . , Sn ) within one bit, but the complexity is |A|n !
2. Non-universal (constructing the Huffman code needs to know the source distribution). This
brings us the question: Is it possible to design universal compressor which achieves entropy for
a class of source distributions? And what is the price to pay? See Homework and Lecture 11.
2. Lempel-Ziv algorithm: low-complexity, universal, provably optimal in a very strong sense –
Section 11.7.
H(X) − log2 [e(H(X) + 1)] ≤ E[l(f ∗ (X))] ≤ H(X) ≤ E[l(fHuffman (X))] ≤ H(X) + 1.
§ 9. Fixed-length (almost lossless) compression. Slepian-Wolf.
Note: If we want g ◦ f = 1X , then k ≥ log2 |X |. But, the transmission link is erroneous anyway...
and it turns out that by tolerating a little error probability , we gain a lot in terms of code length!
Indeed, the key idea is to allow errors: instead of insisting on g(f(x)) = x for all x ∈ X, consider only lossless decompression on a subset S ⊂ X:
    g(f(x)) = x if x ∈ S,   g(f(x)) = e if x ∉ S,
where
    f : X → {0, 1}^k,   g : {0, 1}^k → X ∪ {e}.
Optimal codes:
• Variable-length: f ∗ encodes the 2k −1 symbols with the highest probabilities to {φ, 0, 1, 00, . . . , 1k−1 }.
• Fixed-length: The optimal compressor f maps the elements of S into (00 . . . 00), . . . , (11 . . . 10)
and the rest to (11 . . . 11). The decompressor g decodes perfectly except for outputting e upon
receipt of (11 . . . 11).
Note: In Definition 9.1 we require that the errors are always detectable, i.e., g(f(x)) = x or e. Alternatively, we can drop this requirement and allow undetectable errors, in which case we can of course do better since we have more freedom in designing codes. It turns out that we do not gain much by this relaxation. Indeed, if we define
    ε̃*(X, k) = inf{ P[g(f(X)) ≠ X] : f : X → {0, 1}^k, g : {0, 1}^k → X ∪ {e} },
then ε̃*(X, k) = 1 − (sum of the 2^k largest masses of X). This follows immediately from P[g(f(X)) = X] = Σ_{x∈S} P_X(x), where S ≜ {x : g(f(x)) = x} satisfies |S| ≤ 2^k, because f takes no more than 2^k values. Compared to Theorem 9.1, we see that ε*(X, k) and ε̃*(X, k) do not differ much. In particular,
    ε*(X, k + 1) ≤ ε̃*(X, k) ≤ ε*(X, k).
Corollary 9.1 (Shannon). Let S^n be i.i.d. Then
    lim_{n→∞} ε*(S^n, nR) = 0 if R > H(S),  and  = 1 if R < H(S).
Moreover,
    lim_{n→∞} ε*(S^n, nH(S) + √(nV(S))·γ) = 1 − Φ(γ),
where Φ(·) is the CDF of N(0, 1), H(S) = E[log(1/P_S(S))] is the entropy, and V(S) = Var[log(1/P_S(S))] is the varentropy, which is assumed to be finite.
Proof. Construction: use those 2^k − 1 symbols with the highest probabilities.
The analysis is essentially the same as the lower bound in Theorem 8.4 from Lecture 8. Note that the m-th largest mass satisfies P_X(m) ≤ 1/m. Therefore
    ε*(X, k) = Σ_{m≥2^k} P_X(m) = Σ_m 1{m ≥ 2^k} P_X(m) ≤ Σ_m 1{1/P_X(m) ≥ 2^k} P_X(m) = E[ 1{log₂(1/P_X(X)) ≥ k} ].
Theorem 9.4.
    ε*(X, k) ≤ P[log₂(1/P_X(X)) > k − τ] + 2^{−τ},   ∀τ > 0.   (9.2)
Note: In fact, Theorem 9.3 is always stronger than Theorem 9.4. Still, we present the proof of
Theorem 9.4 and the technology behind it – random coding – a powerful technique for proving
existence (achievability) which we heavily rely on in this course. To see that Theorem 9.3 gives
a better bound, note that even the first term in (9.2) exceeds (9.1). Nevertheless, the method of
proof for this weaker bound will be useful for generalizations.
Proof. Construction: random coding (Shannon’s magic). For a given compressor f , the optimal
decompressor which minimizes the error probability is the maximum a posteriori (MAP) decoder,
i.e.,
g ∗ (w) = argmax PX|f (X) (x|w) = argmax PX (x),
x x:f (x)=w
which can be hard to analyze. Instead, let us consider the following (suboptimal) decompressor g:
    g(w) = x  if ∃! x ∈ X s.t. f(x) = w and log₂(1/P_X(x)) ≤ k − τ  (a unique high-probability x is mapped to w);
    g(w) = e  otherwise.
Define J(x, C) to be the set of high-probability inputs whose hashes collide with that of x. Then we have the following estimate for the probability of error:
    P[g(f(X)) = e] = P[ {log₂(1/P_X(X)) > k − τ} ∪ {J(X, C) ≠ ∅} ]
                   ≤ P[log₂(1/P_X(X)) > k − τ] + P[J(X, C) ≠ ∅].
The first term does not depend on the codebook C, while the second term does. The idea now
is to randomize over C and show that when we average over all possible choices of codebook, the
second term is smaller than 2−τ . Therefore there exists at least one codebook that achieves the
desired bound. Specifically, let us consider C which is uniformly distributed over all codebooks and
independently of X. Equivalently, since C can be represented by an |X| × k binary matrix whose rows correspond to codewords, we choose each entry to be an independent fair coin flip.
Averaging the error probability (over C and over X), we have
    E_C[P[J(X, C) ≠ ∅]] = E_{C,X}[ 1{∃ x' ≠ X : log₂(1/P_X(x')) ≤ k − τ, c_{x'} = c_X} ]
        ≤ Σ_{x'≠X} E_{C,X}[ 1{log₂(1/P_X(x')) ≤ k − τ} 1{c_{x'} = c_X} ]   (union bound)
        = 2^{−k} E_X[ Σ_{x'≠X} 1{P_X(x') ≥ 2^{−k+τ}} ]
        ≤ 2^{−k} Σ_{x'∈X} 1{P_X(x') ≥ 2^{−k+τ}}
        ≤ 2^{−k}·2^{k−τ} = 2^{−τ}.
Remark 9.1 (Why the proof works). The compressor f(x) = c_x can be thought of as hashing x ∈ X to a random k-bit string c_x ∈ {0, 1}^k.
Remark 9.4 (AEP for memoryless sources). Consider i.i.d. S^n. By the WLLN,
    (1/n) log(1/P_{S^n}(S^n)) →ᴾ H(S).   (9.3)
In other words, S^n is concentrated on the set T_n^δ, which is exponentially smaller than the whole space. In almost lossless compression we can simply encode this set losslessly. Although this is different from the optimal encoding, Corollary 9.1 indicates that in the large-n limit the optimal compressor is no better.
The property (9.3) is often referred to as the Asymptotic Equipartition Property (AEP), in the sense that the random vector is concentrated on a set wherein each realization is roughly equally likely up to the exponent. Indeed, note that for any s^n ∈ T_n^δ, its likelihood is concentrated around P_{S^n}(s^n) ∈ 2^{−(H(S)±δ)n}; such sequences are called δ-typical.
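A small sketch of the AEP for a Bernoulli(0.11) source (my own choice of n and δ): the δ-typical set captures most of the probability while containing roughly 2^{n(H±δ)} sequences, exponentially fewer than 2^n.

```python
import numpy as np
from math import comb

p, n, delta = 0.11, 1000, 0.1
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# For a Bernoulli source, -(1/n) log2 P(s^n) depends only on the number of ones k:
k = np.arange(n + 1)
info_rate = -(k / n) * np.log2(p) - (1 - k / n) * np.log2(1 - p)
typical = np.abs(info_rate - H) <= delta

count_seq = np.array([comb(n, int(j)) for j in k], dtype=float)
prob_k = count_seq * p ** k * (1 - p) ** (n - k)

print("P[S^n is delta-typical]  :", prob_k[typical].sum())               # close to 1
print("log2 |typical set| / n   :", np.log2(count_seq[typical].sum()) / n)  # within [H-d, H+d]
print("H(S)                     :", H)
```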
Next we still consider fixed-blocklength code and study the fundamental limit of error probability
∗ (X, k)for the following coding paradigms:
• Linear Compression
Our goal is to find compressors with structure. The simplest structure one can think of is probably a linear operation, which is also highly desirable for its simplicity (low complexity). But of course, we have to be on a vector space where linear operations are defined. In this part, we assume X = S^n, where each coordinate takes values in a finite field (Galois field), i.e., S_i ∈ F_q, where q is the cardinality of F_q. This is only possible if q = p^m for some prime p and m ∈ N, so F_q = F_{p^m}.
Definition 9.2 (Galois Field). F is a finite set with operations (+, ·) where
• 0, 1 ∈ F s.t. 0 + a = 1 · a = a.
• distributive: a · (b + c) = (a · b) + (a · c)
Example:
Compression is achieved if k < n, i.e., H is a fat matrix. Of course, we have to tolerate some error (almost lossless); otherwise, lossless compression is only possible with k ≥ n, which is not interesting.
Theorem 9.5 (Achievability). Let X ∈ F_q^n be a random vector. For every τ > 0 there exist a linear compressor H : F_q^n → F_q^k and a decompressor g : F_q^k → F_q^n ∪ {e} such that
    P[g(HX) ≠ X] ≤ P[log_q(1/P_X(X)) > k − τ] + q^{−τ}.
Remark 9.5. Consider the Hamming space q = 2. In comparison with Shannon’s random coding
achievability, which uses k2n bits to construct a completely random codebook, here for linear codes
we need kn bits to randomly generate the matrix H, and the codebook is a k-dimensional linear
subspace of the Hamming space.
Proof. Fix τ. As pointed out in the proof of Shannon's random coding theorem (Theorem 9.4), given the compressor H, the optimal decompressor is the MAP decoder, i.e., g(w) = argmax_{x:Hx=w} P_X(x), which outputs the most likely symbol compatible with the codeword received. Instead, let us consider the following (suboptimal) decoder for its ease of analysis:
\[
g(w) = \begin{cases} x & \exists!\, x \in \mathbb{F}_q^n:\ w = Hx,\ x - \text{h.p.}\\ e & \text{otherwise}\end{cases}
\]
where we used the shorthand
\[
x - \text{h.p. (high probability)} \iff \log_q\frac{1}{P_X(x)} < k-\tau \iff P_X(x) \ge q^{-k+\tau}.
\]
Note that this decoder is the same as in the proof of Theorem 9.4. The proof is also mostly the same, except now hash collisions occur under the linear map H. By the union bound,
\[
\mathbb{P}[g(f(X)) = e] \le \mathbb{P}\Big[\log_q\frac{1}{P_X(X)} > k-\tau\Big] + \mathbb{P}\big[\exists x' - \text{h.p.}:\ x'\neq X,\ Hx' = HX\big]
\]
\[
\text{(union bound)}\qquad \le\ \mathbb{P}\Big[\log_q\frac{1}{P_X(X)} > k-\tau\Big] + \sum_x P_X(x)\sum_{x'-\text{h.p.},\,x'\neq x}\mathbf{1}\{Hx' = Hx\}.
\]
Now we use random coding to average the second term over all possible choices of H. Specifically, choose H as a matrix independent of X where each entry is iid and uniform on F_q. For distinct x and x', the collision probability is
\[
\mathbb{P}_H[Hx' = Hx] = \mathbb{P}_H[Hx_2 = 0] \qquad (x_2 \triangleq x' - x \neq 0)
\]
\[
= \mathbb{P}_H[H_1\cdot x_2 = 0]^k \qquad (\text{iid rows})
\]
where H_1 is the first row of the matrix H, and each row of H is independent. This is the probability that H_1 lies in the orthogonal complement of x_2. On F_q^n, the orthogonal complement of a given non-zero vector has cardinality q^{n-1}. So the probability for the first row to lie in this subspace is q^{n-1}/q^n = 1/q, hence the collision probability is 1/q^k. Averaging over H gives
\[
\mathbb{E}_H \sum_{x'-\text{h.p.},\,x'\neq x} \mathbf{1}\{Hx' = Hx\} = \sum_{x'-\text{h.p.},\,x'\neq x} \mathbb{P}_H[Hx' = Hx] = |\{x': x'-\text{h.p.},\ x'\neq x\}|\, q^{-k} \le q^{k-\tau}\, q^{-k} = q^{-\tau}.
\]
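A small Python sketch of the above random linear compression over F_2 (illustrative only, not from the notes; the brute-force decoder enumerates the whole alphabet, so it is feasible only for tiny n):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)

def linear_compression_error(pmf_bits, k, tau, n_trials=2000):
    """Monte Carlo sketch of random linear compression over F_2.

    pmf_bits : per-coordinate probabilities of a product source on {0,1}^n
    k, tau   : number of output symbols and slack parameter, as in Theorem 9.5
    """
    n = len(pmf_bits)
    alphabet = np.array(list(product([0, 1], repeat=n)))        # all of {0,1}^n
    pmf = np.prod(np.where(alphabet == 1, pmf_bits, 1 - np.array(pmf_bits)), axis=1)
    hp = pmf >= 2.0 ** (-k + tau)                                # "high probability" symbols
    errors = 0
    for _ in range(n_trials):
        H = rng.integers(0, 2, size=(k, n))                      # random binary matrix
        x = alphabet[rng.choice(len(alphabet), p=pmf)]
        w = H @ x % 2                                            # compressed word Hx over F_2
        # decoder: unique high-probability x' with Hx' = w, else error
        matches = np.flatnonzero(hp & np.all(alphabet @ H.T % 2 == w, axis=1))
        if not (len(matches) == 1 and np.array_equal(alphabet[matches[0]], x)):
            errors += 1
    return errors / n_trials

print(linear_compression_error([0.1] * 10, k=7, tau=2))
```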
2. Note that in this case it is not possible to make all errors detectable.
3. Can we loosen the requirement on Fq to instead be a commutative ring? In general, no, since
zero divisors in the commutative ring ruin the key proof ingredient of low collision probability
in the random hashing. E.g., in Z/6Z,
\[
\mathbb{P}\Big[H\begin{pmatrix}1\\0\\\vdots\\0\end{pmatrix} = 0\Big] = 6^{-k}
\quad\text{but}\quad
\mathbb{P}\Big[H\begin{pmatrix}2\\0\\\vdots\\0\end{pmatrix} = 0\Big] = 3^{-k},
\]
since 0 · 2 = 3 · 2 = 0 in Z/6Z.
• f : X × Y → {0, 1}k
Note: The side information Y need not be discrete. The source X is, of course, discrete.
Note that conditioned on Y = y, the problem reduces to compression without side information
where the source X is distributed according to PX|Y =y . Since Y is known to both the compressor
and decompressor, they can use the best code tailored for this distribution. Recall ε*(X, k) defined in Definition 9.1, the optimal probability of error for compressing X using k bits, which can also be denoted by ε*(P_X, k). Then we have the following relationship:
Corollary 9.2. Let (X, Y) = (S^n, T^n), where the pairs (S_i, T_i) are i.i.d. ∼ P_{S,T}. Then
\[
\lim_{n\to\infty} \epsilon^*(S^n|T^n, nR) = \begin{cases} 0 & R > H(S|T)\\ 1 & R < H(S|T)\end{cases}
\]
Proof. Using the converse Theorem 9.2 and achievability Theorem 9.4 (or Theorem 9.3) for compression without side information, we have
\[
\mathbb{P}\Big[\log_2\frac{1}{P_{X|Y}(X|y)} > k+\tau \,\Big|\, Y=y\Big] - 2^{-\tau} \;\le\; \epsilon^*(P_{X|Y=y}, k) \;\le\; \mathbb{P}\Big[\log_2\frac{1}{P_{X|Y}(X|y)} > k \,\Big|\, Y=y\Big].
\]
By taking the average over all y ∼ P_Y, we get the theorem. For the corollary,
\[
\frac{1}{n}\log\frac{1}{P_{S^n|T^n}(S^n|T^n)} = \frac{1}{n}\sum_{i=1}^n \log\frac{1}{P_{S|T}(S_i|T_i)} \xrightarrow{\ P\ } H(S|T).
\]
• f : X → {0, 1}^k
• P[g(f(X), Y) ≠ X] ≤ ε
Now the very surprising result: even without side information at the compressor, we can still compress down to the conditional entropy!
Theorem 9.7 (Slepian-Wolf, '73).
\[
\epsilon^*(X|Y, k) \le \epsilon^*_{SW}(X|Y, k) \le \mathbb{P}\Big[\log_2\frac{1}{P_{X|Y}(X|Y)} \ge k-\tau\Big] + 2^{-\tau}.
\]
Corollary 9.3.
\[
\lim_{n\to\infty} \epsilon^*_{SW}(S^n|T^n, nR) = \begin{cases} 0 & R > H(S|T)\\ 1 & R < H(S|T)\end{cases}
\]
Remark 9.7. Definition 9.4 does not include the zero-undetected-error condition (that is, g(f(x), y) = x or e). In other words, we allow for the possibility of undetected errors. If we were to require this condition, the side-information savings would be mostly gone. Indeed, assuming P_{X,Y}(x, y) > 0 for all (x, y), it is clear that under the zero-undetected-error condition, if f(x_1) = f(x_2) = c then g(c, y) = e for every y. Thus, except for c, all other elements of {0, 1}^k must have unique preimages. Similarly, one can show that the Slepian-Wolf theorem does not hold in the setting of variable-length lossless compression (i.e., the average length is H(X), not H(X|Y)).
Proof. The LHS is obvious, since side information at both the compressor and the decompressor is better than side information at the decompressor only.
For the RHS, first generate a random codebook with iid uniform codewords, C = {c_x ∈ {0, 1}^k : x ∈ X}, independently of (X, Y), then define the compressor and decoder as
\[
f(x) = c_x, \qquad
g(w, y) = \begin{cases} x & \exists!\, x:\ c_x = w,\ x-\text{h.p.}|y\\ e & \text{otherwise}\end{cases}
\]
where we used the shorthand x − h.p.|y ⇔ log_2 1/P_{X|Y}(x|y) < k − τ. The error probability of this scheme, as a function of the codebook C, is
\[
E(C) = \mathbb{P}\Big[\log_2\frac{1}{P_{X|Y}(X|Y)} \ge k-\tau \ \text{ or }\ J(X, C|Y) \neq \emptyset\Big]
\le \mathbb{P}\Big[\log_2\frac{1}{P_{X|Y}(X|Y)} \ge k-\tau\Big] + \mathbb{P}[J(X, C|Y) \neq \emptyset]
\]
\[
= \mathbb{P}\Big[\log_2\frac{1}{P_{X|Y}(X|Y)} \ge k-\tau\Big] + \sum_{x,y} P_{XY}(x,y)\,\mathbf{1}\{J(x, C|y) \neq \emptyset\}.
\]
Averaging over C, the union bound over the at most 2^{k-τ} symbols x' that are h.p.|y and distinct from x gives
\[
\mathbb{E}_C\big[\mathbf{1}\{J(x, C|y) \neq \emptyset\}\big] \le 2^{k-\tau}\,\mathbb{P}[c_{x'} = c_x] = 2^{-\tau}.
\]
Hence the theorem follows as usual from the two terms in the union bound.
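Here is a hedged Monte Carlo sketch of the Slepian-Wolf random-binning scheme just analyzed (Python; the joint pmf and parameters are toy assumptions, not from the notes): the compressor hashes X without seeing Y, and the decoder keeps only candidates that are high-probability given its side information Y.

```python
import numpy as np

rng = np.random.default_rng(3)

def slepian_wolf_error(joint_pmf, k, tau, n_trials=5000):
    """Monte Carlo sketch of Slepian-Wolf random binning.

    joint_pmf : |X| x |Y| array of P_{X,Y}; only X is compressed,
                Y is available at the decompressor only.
    """
    nx, ny = joint_pmf.shape
    p_y = joint_pmf.sum(axis=0)
    p_x_given_y = joint_pmf / p_y                 # columns are P_{X|Y=y}
    flat = joint_pmf.ravel()
    errors = 0
    for _ in range(n_trials):
        codebook = rng.integers(0, 2 ** k, size=nx)   # hash c_x for every x
        idx = rng.choice(nx * ny, p=flat)
        x, y = divmod(idx, ny)
        w = codebook[x]                               # compressor sees only x
        hp = p_x_given_y[:, y] >= 2.0 ** (-k + tau)   # "h.p. given y" candidates
        matches = np.flatnonzero(hp & (codebook == w))
        if not (len(matches) == 1 and matches[0] == x):
            errors += 1
    return errors / n_trials

# toy correlated pair: X = Y with prob 0.9, uniform otherwise, |X| = |Y| = 16
m = 16
J = np.full((m, m), 0.1 / (m * (m - 1)))
np.fill_diagonal(J, 0.9 / m)
print(slepian_wolf_error(J, k=6, tau=2))
```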
[Figure: distributed compression — X is compressed by f_1 into {0,1}^{k_1}, Y is compressed by f_2 into {0,1}^{k_2}, and a joint decompressor g outputs (X̂, Ŷ).]
• (f_1, f_2, g) is a (k_1, k_2, ε)-code if f_1 : X → {0, 1}^{k_1}, f_2 : Y → {0, 1}^{k_2}, g : {0, 1}^{k_1} × {0, 1}^{k_2} → X × Y, s.t. P[(X̂, Ŷ) ≠ (X, Y)] ≤ ε, where (X̂, Ŷ) = g(f_1(X), f_2(Y)).
[Figure: the Slepian-Wolf achievable region in the (R_1, R_2)-plane, bounded by R_1 ≥ H(S|T), R_2 ≥ H(T|S) and a segment joining the corner points (H(S|T), H(T)) and (H(S), H(T|S)).]
Since H(T) − H(T|S) = H(S) − H(S|T) = I(S; T), the slope is −1.
Proof. Converse: Take (R_1, R_2) ∉ R_SW. Then one of three cases must occur:
1. R_1 < H(S|T). Then even if the encoder and decoder had full T^n, we still could not achieve this (from the compression-with-side-information result, Corollary 9.2).
2. R_2 < H(T|S). Symmetric to case 1 with the roles of S and T exchanged.
3. R_1 + R_2 < H(S, T). We cannot compress below the joint entropy of the pair (S, T).
Achievability: First note that we can achieve the two corner points. The point (H(S), H(T|S)) can be approached by almost-losslessly compressing S at its entropy and compressing T with side information S at the decoder. To make this rigorous, let k_1 = n(H(S) + δ) and k_2 = n(H(T|S) + δ). By Corollary 9.1, there exist f_1 : S^n → {0, 1}^{k_1} and g_1 : {0, 1}^{k_1} → S^n s.t. P[g_1(f_1(S^n)) ≠ S^n] ≤ ε_n → 0. By Theorem 9.7, there exist f_2 : T^n → {0, 1}^{k_2} and g_2 : {0, 1}^{k_2} × S^n → T^n s.t. P[g_2(f_2(T^n), S^n) ≠ T^n] ≤ ε_n → 0. Now that S^n is not available at the joint decompressor, feed the S.W. decompressor with g_1(f_1(S^n)) and define the joint decompressor by g(w_1, w_2) = (g_1(w_1), g_2(w_2, g_1(w_1))) (see below):
[Figure: S^n → f_1 → g_1 → Ŝ^n and T^n → f_2 → g_2 → T̂^n, where g_2 also takes the reconstruction Ŝ^n as input.]
\[
\mathbb{P}[g(f_1(S^n), f_2(T^n)) \neq (S^n, T^n)]
= \mathbb{P}[g_1(f_1(S^n)) \neq S^n] + \mathbb{P}[g_2(f_2(T^n), g_1(f_1(S^n))) \neq T^n,\ g_1(f_1(S^n)) = S^n]
\]
\[
\le \mathbb{P}[g_1(f_1(S^n)) \neq S^n] + \mathbb{P}[g_2(f_2(T^n), S^n) \neq T^n] \le 2\epsilon_n \to 0.
\]
(Exercise: write down the details rigorously yourself!) Therefore, by time sharing, all convex combinations of achievable points are also achievable, so the achievable region must be convex.
[Figure: source coding with a helper — X is compressed by f_1 into {0,1}^{k_1}, Y is compressed by f_2 into {0,1}^{k_2}, and the decompressor g outputs only X̂.]
Furthermore, for every such random variable U the rate pair (H(X|U ), I(Y ; U )) is achievable with
vanishing error.
where (9.9) follows from I(W_2, X^{k-1}; Y_k|Y^{k-1}) = I(W_2; Y_k|Y^{k-1}) + I(X^{k-1}; Y_k|W_2, Y^{k-1}) and the fact that (W_2, Y_k) ⫫ X^{k-1} | Y^{k-1}; and (9.10) from Y^{k-1} ⫫ Y_k. Comparing (9.7) and (9.10) we notice that, denoting U_k = (W_2, X^{k-1}, Y^{k-1}), we have
\[
(R_1, R_2) \ge \frac{1}{n}\sum_{k=1}^n \big(H(X_k|U_k),\ I(U_k; Y_k)\big)
\]
and thus (from convexity) the rate pair must belong to the region spanned by all pairs (H(X|U), I(U; Y)).
To show that without loss of generality the auxiliary random variable U can be taken to be
|Y| + 1 valued, one needs to invoke Caratheodory’s theorem on convex hulls. We omit the details.
Finally, showing that for each U the mentioned rate-pair is achievable, we first notice that if there
were side information at the decompressor in the form of the i.i.d. sequence U n correlated to X n ,
then Slepian-Wolf theorem implies that only rate R1 = H(X|U ) would be sufficient to reconstruct
X n . Thus, the question boils down to creating a correlated sequence U n at the decompressor by
using the minimal rate R2 . This is the content of the so called covering lemma, see Theorem 26.5
below: It is sufficient to use rate I(U ; Y ) to do so. We omit further details.
§ 10. Compressing stationary ergodic sources
In this lecture, we shall examine similar results for ergodic processes, and we first state the main theorem as follows:
Theorem 10.1 (Shannon-McMillan). Let {S_1, S_2, ...} be a stationary and ergodic discrete process. Then
\[
\frac{1}{n}\log\frac{1}{P_{S^n}(S^n)} \xrightarrow{\ P\ } \mathcal{H}, \quad\text{also a.s. and in } L_1. \tag{10.3}
\]
Proof. Shannon-McMillan (we only need convergence in probability) + Theorem 8.4 + Theorem 9.1, which tie together the respective CDFs of the random variables l(f*(S^n)) and log 1/P_{S^n}(s^n).
In Lecture 9 we learned the asymptotic equipartition property (AEP) for iid sources. Here we
generalize it to stationary ergodic sources thanks to Shannon-McMillan.
Corollary 10.2 (AEP for stationary ergodic sources). Let {S_1, S_2, ...} be a stationary and ergodic discrete process. For any δ > 0, define the set
\[
T_n^\delta = \Big\{ s^n : \Big|\frac{1}{n}\log\frac{1}{P_{S^n}(s^n)} - \mathcal{H}\Big| \le \delta \Big\}.
\]
Then
1. P[S^n ∈ T_n^δ] → 1 as n → ∞.
Note:
• Convergence almost surely for stationary ergodic processes [Breiman 1956] (Either of the last
two results implies the convergence Theorem 10.1 in probability.)
• For a Markov chain, the existence of typical sequences can be understood by thinking of the Markov process as a sequence of independent decisions regarding which transitions to take. It is then clear that the Markov process's trajectory is simply a transformation of trajectories of an i.i.d. process, hence must similarly concentrate on some typical set.
It is easy to check that all shift-invariant events belong to Ftail . The inclusion is strict, as for
example the event
{∃n : xi = 0, ∀ odd i ≥ n}
is in Ftail but not shift-invariant.
Proposition 10.1 (Poincaré recurrence). Let τ be measure-preserving for (Ω, F, P). Then for any measurable A with P[A] > 0 we have
\[
\mathbb{P}\Big[\bigcup_{k\ge 1}\tau^{-k}A \,\Big|\, A\Big] = \mathbb{P}[\tau^k(\omega)\in A \text{ occurs infinitely often} \,|\, A] = 1.
\]
Proof. Let B = ∪_{k≥1} τ^{-k}A. It is sufficient to show that P[A ∩ B] = P[A], or equivalently
\[
\mathbb{P}[A \cup B] = \mathbb{P}[B]. \tag{10.5}
\]
To that end, notice that τ^{-1}A ∪ τ^{-1}B = B and thus
\[
\mathbb{P}[\tau^{-1}(A\cup B)] = \mathbb{P}[B],
\]
but the left-hand side equals P[A ∪ B] by the measure-preservation of τ, proving (10.5).
Note: Consider τ mapping initial state of the conservative (Hamiltonian) mechanical system to its
state after passage of a given unit of time. It is known that τ preserves Lebesgue measure in phase
space (Liouville’s theorem). Thus Poincare recurrence leads to rather counter-intuitive conclusions.
For example, opening the barrier separating two gases in a cylinder allows them to mix. Poincare
recurrence says that eventually they will return back to the original separated state (with each gas
occupying roughly its half of the cylinder).
Definition 10.3 (Ergodicity). A transformation τ is ergodic if ∀E ∈ Finv we have P[E] = 0 or 1.
A process {Si } is ergodic if all shift invariant events are deterministic, i.e., for any shift invariant
event E, P [S1∞ ∈ E] = 0 or 1.
Example:
• {Sk = k 2 }: ergodic but not stationary
• {Sk = S0 }: stationary but not ergodic (unless S0 is a constant). Note that the singleton set
E = {(s, s, . . .)} is shift invariant and P [S1∞ ∈ E] = P [S0 = s] ∈ (0, 1) – not deterministic.
• {Sk } i.i.d. is stationary and ergodic (by Kolmogorov’s 0-1 law, tail events have no randomness)
• (Sliding-window construction of ergodic processes)
If {S_i} is ergodic, then {X_i = f(S_i, S_{i+1}, ...)} is also ergodic. It is called a B-process if S_i is i.i.d.
Example: S_i ∼ Bern(1/2) i.i.d., X_k = \sum_{n=0}^{\infty} 2^{-n-1} S_{k+n}, so that X_k = 2X_{k-1} mod 1. The marginal distribution of X_i is uniform on [0, 1]. Note that X_k's behavior is completely deterministic: given X_0, all the future X_k's are determined exactly. This example shows that certain deterministic maps exhibit ergodic/chaotic behavior under iterative application: although the trajectory is completely deterministic, its time-averages converge to expectations and in general "look random".
• There are also stronger conditions than ergodicity. Namely, we say that τ is mixing (or strongly mixing) if
\[
\mathbb{P}[A \cap \tau^{-n}B] \to \mathbb{P}[A]\,\mathbb{P}[B], \qquad n\to\infty.
\]
We say that τ is weakly mixing if
\[
\frac{1}{n}\sum_{k=1}^n \big|\mathbb{P}[A \cap \tau^{-k}B] - \mathbb{P}[A]\,\mathbb{P}[B]\big| \to 0.
\]
• {Si }: finite irreducible Markov chain with recurrent states is ergodic (in fact strong mixing),
regardless of initial distribution.
Toy example: kernel P (0|1) = P (1|0) = 1 with initial dist. P (S0 = 0) = 0.5. This process only
has two sample paths: P [S1∞ = (010101 . . .)] = P [S1∞ = (101010 . . .)] = 12 . It is easy to verify
this process is ergodic (in the sense defined above!). Note however, that in Markov-chain
literature a chain is called ergodic if it is irreducible, aperiodic and recurrent. This example
does not satisfy this definition (this clash of terminology is a frequent source of confusion).
• (optional) {Si }: stationary zero-mean Gaussian process with autocovariance function R(n) =
E[S0 Sn∗ ].
\[
\lim_{n\to\infty}\frac{1}{n+1}\sum_{t=0}^n R[t] = 0 \iff \{S_i\}\ \text{ergodic} \iff \{S_i\}\ \text{weakly mixing}
\]
\[
\lim_{n\to\infty} R[n] = 0 \iff \{S_i\}\ \text{mixing}
\]
Intuitively speaking, an ergodic process can have infinite memory in general, but the memory
is weak. Indeed, we see that for a stationary Gaussian process ergodicity means the correlation
dies (in the Cesaro-mean sense).
The spectral measure is defined as the (discrete-time) Fourier transform of the autocovariance sequence {R(n)}, in the sense that there exists a unique probability measure µ on [−1/2, 1/2] such that R(n) = E exp(i2πnX) where X ∼ µ. The spectral criteria can be formulated as follows:
Detailed exposition on stationary Gaussian processes can be found in [Doo53, Theorem 9.3.2,
pp. 474, Theorem 9.7.1, pp. 493–494].1
In the special case where f depends on finitely many coordinates, say, f = f(S_1, ..., S_m), we have
\[
\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^n f(S_k, \ldots, S_{k+m-1}) = \mathbb{E}\, f(S_1, \ldots, S_m) \quad \text{a.s. and in } L_1.
\]
1 Thanks to Prof. Bruce Hajek for the pointer.
Interpretation: time average converges to ensemble average.
Example: Consider f = f (S1 )
• {Si } is iid. Then Theorem 10.2 is SLLN (strong LLN).
• {Si } is such that Si = S1 for all i – non-ergodic. Then Theorem 10.2 fails unless S1 is a
constant.
Definition 10.4. {S_i : i ∈ N} is an m-th order Markov chain if P_{S_{t+1}|S_1^t} = P_{S_{t+1}|S_{t-m+1}^t} for all t ≥ m. It is called time homogeneous if P_{S_{t+1}|S_{t-m+1}^t} = P_{S_{m+1}|S_1^m}.
Remark 10.1. Showing (10.3) for an m-th order time homogeneous Markov chain {S_i} is a direct application of Birkhoff-Khintchine:
\[
\frac{1}{n}\log\frac{1}{P_{S^n}(S^n)} = \frac{1}{n}\sum_{t=1}^n \log\frac{1}{P_{S_t|S^{t-1}}(S_t|S^{t-1})}
= \frac{1}{n}\log\frac{1}{P_{S^m}(S^m)} + \frac{1}{n}\sum_{t=m+1}^n \log\frac{1}{P_{S_t|S_{t-m}^{t-1}}(S_t|S_{t-m}^{t-1})}
\]
\[
= \underbrace{\frac{1}{n}\log\frac{1}{P_{S_1^m}(S_1^m)}}_{\to 0} + \underbrace{\frac{1}{n}\sum_{t=m+1}^n \log\frac{1}{P_{S_{m+1}|S_1^m}(S_t|S_{t-m}^{t-1})}}_{\to H(S_{m+1}|S_1^m)\ \text{by Birkhoff-Khintchine}} \tag{10.6}
\]
where we applied Theorem 10.2 with f(s_1, s_2, \ldots) = \log\frac{1}{P_{S_{m+1}|S_1^m}(s_{m+1}|s_1^m)}.
Now let’s prove (10.3) for a general stationary ergodic process {Si } which might have infinite
memory. The idea is to approximate the distribution of that ergodic process by an m-th order
MC (finite memory) and make use of (10.6); then let m → ∞ to make the approximation accurate
(Markov approximation).
Proof of Theorem 10.1 in L_1. To show that (10.3) converges in L_1, we want to show that
\[
\mathbb{E}\Big|\frac{1}{n}\log\frac{1}{P_{S^n}(S^n)} - \mathcal{H}\Big| \to 0, \qquad n\to\infty.
\]
To this end, fix an m ∈ N. Define the following auxiliary distribution for the process:
\[
Q^{(m)}(S_1^\infty) = P_{S_1^m}(S_1^m)\prod_{t=m+1}^{\infty} P_{S_t|S_{t-m}^{t-1}}(S_t|S_{t-m}^{t-1})
\overset{\text{stat.}}{=} P_{S_1^m}(S_1^m)\prod_{t=m+1}^{\infty} P_{S_{m+1}|S_1^m}(S_t|S_{t-m}^{t-1}).
\]
Note that under Q^{(m)}, {S_i} is an m-th order time-homogeneous Markov chain.
By the triangle inequality,
\[
\mathbb{E}\Big|\frac{1}{n}\log\frac{1}{P_{S^n}(S^n)} - \mathcal{H}\Big|
\le \underbrace{\mathbb{E}\Big|\frac{1}{n}\log\frac{1}{P_{S^n}(S^n)} - \frac{1}{n}\log\frac{1}{Q^{(m)}_{S^n}(S^n)}\Big|}_{\triangleq A}
+ \underbrace{\mathbb{E}\Big|\frac{1}{n}\log\frac{1}{Q^{(m)}_{S^n}(S^n)} - H_m\Big|}_{\triangleq B}
+ \underbrace{|H_m - \mathcal{H}|}_{\triangleq C}
\]
where H_m ≜ H(S_{m+1}|S_1^m).
Now
• For term A,
\[
\mathbb{E}[A] = \mathbb{E}_P\Big|\frac{1}{n}\log\frac{dP_{S^n}}{dQ^{(m)}_{S^n}}\Big| \le \frac{1}{n} D(P_{S^n}\|Q^{(m)}_{S^n}) + \frac{2\log e}{e\, n}
\]
where
\[
\frac{1}{n} D(P_{S^n}\|Q^{(m)}_{S^n}) = \frac{1}{n}\,\mathbb{E}\left[\log\frac{P_{S^n}(S^n)}{P_{S^m}(S^m)\prod_{t=m+1}^n P_{S_{m+1}|S_1^m}(S_t|S_{t-m}^{t-1})}\right]
\]
Lemma 10.1.
\[
\mathbb{E}_P\Big|\log\frac{dP}{dQ}\Big| \le D(P\|Q) + \frac{2\log e}{e}.
\]
Intuitively:
\[
A_n = \frac{1}{n}\sum_{k=1}^n T^k = \frac{1}{n}(I - T^n)(I - T)^{-1}
\]
Then, if f ⊥ ker(I − T ) we should have An f → 0, since only components in the kernel can blow up.
This intuition is formalized in the proof below.
Let’s further decompose f into two parts f = f1 +f2 , where f1 ∈ ker(I −T ) and f2 ∈ ker(I −T )⊥ .
Observations:
• if g ∈ ker(I − T ), g must be a constant function. This is due to the ergodicity. Consider
indicator function 1A , if 1A = 1A ◦ τ = 1τ −1 A , then P[A] = 0 or 1. For a general case, suppose
g = T g and g is not constant, then at least some set {g ∈ (a, b)} will be shift-invariant and
have non-trivial measure, violating ergodicity.
• ker(I − T ) = ker(I − T ∗ ). This is due to the fact that T is unitary:
where in the last step we used the fact that Cauchy-Schwarz (f, g) ≤ kf k · kgk only holds with
equality for g = cf for some constant c.
• ker(I − T )⊥ = ker(I − T ∗ )⊥ = [Im(I − T )], where [Im(I − T )] is an L2 closure.
• g ∈ ker(I − T )⊥ ⇐⇒ E[g] = 0. Indeed, only zero-mean functions are orthogonal to constants.
With these observations, we know that f_1 = m is a constant. Also, f_2 ∈ [Im(I − T)], so we further approximate it by f_2 = f_0 + h_1, where f_0 ∈ Im(I − T), namely f_0 = g − g∘τ for some function g ∈ L_2, and ‖h_1‖_1 ≤ ‖h_1‖_2 < ε. Therefore we have
\[
A_n f_1 = f_1 = \mathbb{E}[f], \qquad
A_n f_0 = \frac{1}{n}(g - g\circ\tau^n) \to 0 \quad\text{a.s. and in } L_1,
\]
since \mathbb{E}\big[\sum_{n\ge 1}\big(\tfrac{g\circ\tau^n}{n}\big)^2\big] = \sum_{n\ge 1}\frac{\mathbb{E}[g^2]}{n^2} < \infty \implies \frac{1}{n}\, g\circ\tau^n \xrightarrow{\text{a.s.}} 0.
The proof is completed by showing
\[
\mathbb{P}\Big[\limsup_n A_n(h + h_1) \ge \delta\Big] \le \frac{2\epsilon}{\delta}. \tag{10.7}
\]
Indeed, then by taking ε → 0 we will have shown
\[
\mathbb{P}\Big[\limsup_n A_n(f) \ge \mathbb{E}[f] + \delta\Big] = 0
\]
as required.
Proof of (10.7) makes use of the Maximal Ergodic Lemma stated as follows:
Theorem 10.3 (Maximal Ergodic Lemma). Let (P, τ) be a probability measure and a measure-preserving transformation. Then for any f ∈ L_1(P) we have
\[
\mathbb{P}\Big[\sup_{n\ge 1} A_n f > a\Big] \le \frac{\mathbb{E}\big[f\,\mathbf{1}\{\sup_{n\ge 1} A_n f > a\}\big]}{a} \le \frac{\|f\|_1}{a}
\]
where A_n f = \frac{1}{n}\sum_{k=0}^{n-1} f\circ\tau^k.
Note: This is a so-called “weak L1 ” estimate for a sublinear operator supn An (·). In fact, this
theorem is exactly equivalent to the following result:
Lemma 10.2 (Estimate for the maximum of averages). Let {Z_n, n = 1, ...} be a stationary process with E[|Z|] < ∞. Then
\[
\mathbb{P}\Big[\sup_{n\ge 1}\frac{|Z_1 + \ldots + Z_n|}{n} > a\Big] \le \frac{\mathbb{E}[|Z|]}{a} \qquad \forall a > 0.
\]
Proof. The argument for this Lemma was originally quite involved, until a dramatically simple proof (below) was found by A. Garcia.
Define
\begin{align}
S_n &= \sum_{k=1}^n Z_k \tag{10.8}\\
L_n &= \max\{0, Z_1, \ldots, Z_1+\cdots+Z_n\} \tag{10.9}\\
M_n &= \max\{0, Z_2, Z_2+Z_3, \ldots, Z_2+\cdots+Z_n\} \tag{10.10}\\
Z^* &= \sup_{n\ge 1}\frac{S_n}{n} \tag{10.11}
\end{align}
from which the Lemma follows by upper-bounding the left-hand side with E[|Z_1|].
In order to show (10.12) we first notice that {L_n > 0} ↗ {Z^* > 0}. Next we notice that
Z1 + Mn = max{S1 , . . . , Sn }
and furthermore
Z1 + Mn = Ln on {Ln > 0}
Thus, we have
Z1 1{Ln >0} = Ln − Mn 1{Ln >0}
where we do not need indicator in the first term since Ln = 0 on {Ln > 0}c . Taking expectation we
get
where we used Mn ≥ 0, the fact that Mn has the same distribution as Ln−1 , and Ln ≥ Ln−1 ,
respectively. Taking limit as n → ∞ in (10.15) we obtain (10.12).
10.4* Sinai’s generator theorem
It turns out there is a way to associate to every probability-preserving transformation (p.p.t.) τ
a number, called Kolmogorov-Sinai entropy. This number is invariant to isomorphisms of p.p.t.’s
(appropriately defined).
Definition 10.5. Fix a probability-preserving transformation τ acting on probability space (Ω, F, P).
Kolmogorov-Sinai entropy of τ is defined as
\[
\mathcal{H}(\tau) \triangleq \sup_{X_0}\ \lim_{n\to\infty}\frac{1}{n} H(X_0, X_0\circ\tau, \ldots, X_0\circ\tau^{n-1}),
\]
where supremum is taken over all finitely-valued random variables X0 : Ω → X and measurable
with respect to F.
Note that every random variable X0 generates a stationary process adapted to τ , that is
Xk , X0 ◦ τ k .
In this way, Kolmogorov-Sinai entropy of τ equals the maximal entropy rate among all stationary
processes adapted to τ . This quantity may be extremely hard to evaluate, however. One help comes
in the form of the famous criterion of Y. Sinai. We need to elaborate on some more concepts before:
P[E∆E 0 ] = 0 .
σ{Y, Y ◦ τ, . . . , Y ◦ τ n , . . .} = F mod P
Theorem 10.4 (Sinai’s generator theorem). Let Y be the generator of a p.p.t. (Ω, F, P, τ ). Let
H(Y) be the entropy rate of the process Y = {Yk = Y ◦ τ k , k = 0, . . .}. If H(Y) is finite, then
H(τ ) = H(Y).
Proof. Notice that since H(Y) is finite, we must have H(Y0n ) < ∞ and thus H(Y ) < ∞. First, we
argue that H(τ ) ≥ H(Y). If Y has finite alphabet, then it is simply from the definition. Otherwise
let Y be Z+ -valued. Define a truncated version Ỹm = min(Y, m), then since Ỹm → Y as m → ∞ we
have from lower semicontinuity of mutual information, cf. (3.13), that
H(Y |Ỹ ) ≤ ,
Then, consider the chain
Thus, entropy rate of Ỹ (which has finite-alphabet) can be made arbitrarily close to the entropy
rate of Y, concluding that H(τ ) ≥ H(Y).
The main part is showing that for any stationary process X adapted to τ the entropy rate is
upper bounded by H(Y). To that end, consider X : Ω → X with finite X and define as usual the
process X = {X ◦ τ k , k = 0, 1, . . .}. By generating property of Y we have that X (perhaps after
modification on a set of measure zero) is a function of Y0∞ . So are all Xk . Thus
where we used the continuity-in-σ-algebra property of mutual information, cf. (3.14). Rewriting the
latter limit differently, we have
lim H(X0 |Y0n ) = 0 .
n→∞
Fix ε > 0 and choose m so that H(X_0|Y_0^m) ≤ ε. Then consider the following chain:
where we used stationarity of (X_k, Y_k) and the fact that H(X_0|Y_0^{n-i}) < ε for i ≤ n − m. After dividing by n and passing to the limit, our argument implies
H(X) ≤ H(Y) + ε.
Notice that since X̃_0^n is a function of Y_0^{n+m} we have
\[
H(\tilde X_0^n) \le H(Y_0^{n+m}).
\]
Dividing by n and passing to the limit we obtain, for the entropy rates,
\[
H(\tilde{\mathbf{X}}) \le H(\mathbf{Y}).
\]
Finally, to relate X̃ to X notice that by construction
\[
\mathbb{P}[\tilde X_j \neq X_j] \le \epsilon.
\]
Since both processes take values on a fixed finite alphabet, from Corollary 5.2 we infer that
\[
|H(\mathbf{X}) - H(\tilde{\mathbf{X}})| \le \epsilon\log|\mathcal{X}| + h(\epsilon).
\]
Altogether, we have shown that
\[
H(\mathbf{X}) \le H(\mathbf{Y}) + \epsilon\log|\mathcal{X}| + h(\epsilon).
\]
Taking ε → 0 we conclude the proof.
Examples:
• Let Ω = [0, 1], F the Borel σ-algebra, P = Leb, and
\[
\tau(\omega) = 2\omega \bmod 1 = \begin{cases} 2\omega, & \omega < 1/2\\ 2\omega - 1, & \omega \ge 1/2\end{cases}
\]
It is easy to show that Y (ω) = 1{ω < 1/2} is a generator and that Y is an i.i.d. Bernoulli(1/2)
process. Thus, we get that Kolmogorov-Sinai entropy is H(τ ) = log 2.
• Let Ω be the unit circle S^1, F the Borel σ-algebra, P the normalized length, and
\[
\tau(\omega) = \omega + \gamma,
\]
i.e., τ is a rotation by the angle γ. (When γ/2π is irrational, this is known to be an ergodic p.p.t.) Here Y = 1{|ω| < 2πε} is a generator for arbitrarily small ε, and hence
\[
\mathcal{H}(\tau) \le H(\mathbf{Y}) \le H(Y_0) = h(\epsilon) \to 0 \quad\text{as } \epsilon\to 0.
\]
This is an example of a zero-entropy p.p.t.
Remark 10.2. Two p.p.t.'s (Ω_1, τ_1, P_1) and (Ω_0, τ_0, P_0) are called isomorphic if there exist f_i : Ω_i → Ω_{1-i} defined P_i-almost everywhere and such that 1) τ_{1-i} ∘ f_i = f_i ∘ τ_i; 2) f_i ∘ f_{1-i} is the identity on Ω_{1-i} (a.e.); 3) P_i[f_{1-i}^{-1}E] = P_{1-i}[E]. It is easy to see that Kolmogorov-Sinai entropies of isomorphic p.p.t.s are equal. This observation was made by Kolmogorov in 1958. It was revolutionary, since it allowed one to show that the p.p.t.s corresponding to shifts of iid Bern(1/2) and iid Bern(1/3) processes are not isomorphic. Before, the only invariants known were those obtained from studying the spectrum of the unitary operator
\[
U_\tau : L_2(\Omega, P) \to L_2(\Omega, P) \tag{10.16}
\]
\[
\phi(x) \mapsto \phi(\tau(x)). \tag{10.17}
\]
However, the spectrum of τ corresponding to any non-constant i.i.d. process consists of the entire unit circle, and thus is unable to distinguish Bern(1/2) from Bern(1/3).²
² To see the statement about the spectrum, let X_i be iid with zero mean and unit variance. Then consider φ(x_1^∞) defined as \frac{1}{\sqrt{m}}\sum_{k=1}^m e^{i\omega k}x_k. This φ has unit energy and as m → ∞ we have ‖U_τ φ − e^{iω}φ‖_{L_2} → 0. Hence every e^{iω} belongs to the spectrum of U_τ.
§ 11. Universal compression
In this lecture we will discuss how to produce compression schemes that do not require a priori knowledge of the distribution. Here, a compressor is a map X^n → {0, 1}^*. Now, however, there is no one fixed probability distribution P_{X^n} on X^n. The plan for this lecture is as follows:
1. We will start by discussing the earliest example of a universal compression algorithm (of Fitingof). It does not talk about probability distributions at all. However, it turns out to be asymptotically optimal simultaneously for all i.i.d. distributions and, with small modifications, for all finite-order Markov chains.
2. The next class of universal compressors is based on assuming that the true distribution P_{X^n} belongs to a given class. These methods proceed by choosing a good model distribution Q_{X^n} serving as the minimax approximation to each distribution in the class. The compression algorithm for the single distribution Q_{X^n} is then designed as in previous chapters.
3. Finally, an entirely different idea is given by algorithms of the Lempel-Ziv type. These automatically adapt to the distribution of the source, without any prior assumptions required.
Throughout this section, instead of describing each compression algorithm, we will merely specify some distribution Q_{X^n} and apply one of the following constructions:
• Sort all x^n in the order of decreasing Q_{X^n}(x^n) and assign values from {0, 1}^* as in Theorem 8.1; this compressor has lengths satisfying
\[
\ell(f(x^n)) \le \log\frac{1}{Q_{X^n}(x^n)}.
\]
• Set lengths to be
\[
\ell(f(x^n)) \triangleq \Big\lceil \log\frac{1}{Q_{X^n}(x^n)} \Big\rceil
\]
and apply Kraft's inequality (Theorem 8.5) to construct a prefix code.
In either case, we may and will always replace lengths with log 1/Q_{X^n}(x^n). Thus, the only job of a universal compression algorithm is to specify Q_{X^n}.
Remark 11.1. Furthermore, if we restrict attention to prefix codes, then any code f : X^n → {0, 1}^* defines a distribution Q_{X^n}(x^n) = 2^{-\ell(f(x^n))} (we assume the code's tree is full). In this way, for prefix-free codes, results on redundancy, stated in terms of optimizing the choice of Q_{X^n}, imply tight converses too. For one-shot codes without prefix constraints the optimal answers are slightly different, however. (For example, the optimal universal code for all i.i.d. sources satisfies E[ℓ(f(X^n))] ≈ H(X^n) + \frac{|\mathcal{X}|-3}{2}\log n, in contrast with \frac{|\mathcal{X}|-1}{2}\log n for prefix-free codes, cf. [BF14, KS14].)
Associate to each x^n an interval I_{x^n} = [F_n(x^n), F_n(x^n) + Q_{X^n}(x^n)). These intervals are disjoint subintervals of [0, 1). Now encode x^n by a largest dyadic interval contained in I_{x^n}. Recall that dyadic intervals are intervals of the type [m2^{-k}, (m+1)2^{-k}), where m is an odd integer. Clearly each dyadic interval can be associated with a binary string in {0, 1}^*. We set f(x^n) to be that string. The resulting code is a prefix code satisfying
\[
\ell(f(x^n)) \le \log_2\frac{1}{Q_{X^n}(x^n)} + 1.
\]
(This is an exercise.)
Observe that
\[
F_n(x^n) = F_{n-1}(x^{n-1}) + Q_{X^{n-1}}(x^{n-1})\sum_{y < x_n} Q_{X_n|X^{n-1}}(y|x^{n-1}),
\]
and thus F_n(x^n) can be computed sequentially if Q_{X^{n-1}} and Q_{X_n|X^{n-1}} are easy to compute.
This method is the method of choice in many modern compression algorithms because it allows one to dynamically incorporate learned information about the stream, in the form of updating Q_{X_n|X^{n-1}} (e.g., if the algorithm detects that an executable file contains a long chunk of English text, it may temporarily switch to a Q_{X_n|X^{n-1}} modeling the English language).
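The following Python sketch (illustrative only; cond_prob is a hypothetical stand-in for the sequential model Q_{X_t|X^{t-1}}) shows the computation of F_n described above and the extraction of a dyadic codeword, using exact rational arithmetic to avoid rounding issues.

```python
import math
from fractions import Fraction

def arithmetic_interval(xs, cond_prob):
    """Compute the interval [F_n(x^n), F_n(x^n) + Q(x^n)) sequentially.

    xs        : sequence of symbols from a finite ordered alphabet {0, 1, ...}
    cond_prob : cond_prob(symbol, prefix) -> Fraction, the model Q_{X_t|X^{t-1}}
    """
    F, Q = Fraction(0), Fraction(1)
    for t, x in enumerate(xs):
        prefix = xs[:t]
        below = sum(cond_prob(y, prefix) for y in range(x))  # mass of smaller symbols
        F += Q * below
        Q *= cond_prob(x, prefix)
    return F, Q

def dyadic_codeword(F, Q):
    """Return a binary string whose dyadic interval lies inside [F, F + Q)."""
    k = 0
    while Fraction(1, 2 ** k) > Q / 2:      # shrink until a dyadic interval surely fits
        k += 1
    m = math.ceil(F * 2 ** k)               # smallest integer with m 2^{-k} >= F
    return format(m, "0{}b".format(k))

# toy model: iid Bern(1/4) over the alphabet {0, 1}
bern = lambda y, prefix: Fraction(1, 4) if y == 1 else Fraction(3, 4)
F, Q = arithmetic_interval([0, 1, 0, 0, 1, 0, 0, 0], bern)
cw = dyadic_codeword(F, Q)
print(cw, len(cw))   # length is within a couple of bits of log2 1/Q(x^n)
```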
type-class containing xn ). From Stirling’s approximation this can be shown to be
Then Fitingof argues that it should be possible to produce a prefix code with
This can be done in many ways. In the spirit of what we will do next, let us define
Again, it can be shown that there exists a code such that lengths
This implies that for every 1-st order stationary Markov chain X1 → X2 → · · · → Xn we have
This can be further continued to define Φ2 (xn ) and build a universal code, asymptotically
optimal for all 2-nd order Markov chains etc.
11.3 Optimal compressors for a class of sources. Redundancy.
So we have seen that we can construct compressor f : X n → {0, 1}∗ that achieves
simultaneously for all i.i.d. sources (or even all r-th order Markov chains). What should we do next?
Krichevsky suggested that the next barrier should be to optimize regret, or redundancy:
Replacing here lengths with log QX1 n we define redundancy of the distribution QX n as
Thus, the question of designing the best universal compressor (in the sense of optimizing worst-case
deviation of the average length from the entropy) becomes the question of finding solution of:
Note that under condition of finiteness of Rn∗ , Theorem 4.5 gives the maximin and capacity
representation
Thus redundancy is simply the capacity of the channel θ → X n . This result, obvious in hindsight,
was rather surprising in the early days of universal compression.
Finding exact QX n -minimizer in (11.6) is a daunting task even for the simple class of all i.i.d.
Bernoulli sources (i.e. Θ = [0, 1], PX n |θ = Bernn (θ)). It turns out, however, that frequently the
approximate minimizer has a rather nice structure: it matches the Jeffreys prior.
Remark 11.2. (Shtarkov, Fitingof and individual sequence approach) There is a connection between
the combinatorial method of Fitingof and the method of optimality for a class. Indeed, following
(S)
Shtarkov we may want to choose distribution QX n so as to minimize the worst-case redundancy for
each realization xn (not average!):
\[
\min_{Q_{X^n}}\ \max_{x^n}\ \sup_{\theta_0}\ \log\frac{P_{X^n|\theta}(x^n|\theta_0)}{Q_{X^n}(x^n)} \tag{11.9}
\]
This leads to Shtarkov's distribution (also known as the normalized maximum likelihood (NML) code):
\[
Q^{(S)}_{X^n}(x^n) = c\,\sup_{\theta_0} P_{X^n|\theta}(x^n|\theta_0), \tag{11.10}
\]
where c is the normalization constant. If the class {P_{X^n|θ}, θ ∈ Θ} is chosen to be all i.i.d. distributions on X, then
\[
Q^{(S)}_{X^n}(x^n) = c\,\exp\{-nH(\hat P_{x^n})\}, \tag{11.11}
\]
and thus compressing w.r.t. Q^{(S)}_{X^n} recovers Fitingof's construction Φ_0 up to O(log n) differences between nH(\hat P_{x^n}) and Φ_0(x^n). If we take P_{X^n|θ} to be all 1-st order Markov chains, then we get the construction Φ_1, etc. Note also that the problem (11.9) can be written as minimization of the regret for each individual sequence (under log-loss, with respect to a parameter class P_{X^n|θ}):
\[
\min_{Q_{X^n}}\ \max_{x^n}\ \left[\log\frac{1}{Q_{X^n}(x^n)} - \inf_{\theta_0}\log\frac{1}{P_{X^n|\theta}(x^n|\theta_0)}\right]. \tag{11.12}
\]
The gospel is that if there is a reason to believe that real-world data x^n is likely to be generated by one of the models P_{X^n|θ}, then using the minimizer of (11.12) will result in a compressor that both learns the right model and compresses with respect to it.
In order to derive the caod Q*_{X^n} we first propose a guess that the caid P_θ in (11.8) is some distribution with a smooth density on Θ (this can only be justified by an a priori belief that the caid in such a natural problem should be something that involves all θ's). Then, we define
\[
Q_{X^n}(x^n) \triangleq \int_\Theta P_{X^n|\theta}(x^n|\theta_0)\, P_\theta(\theta_0)\, d\theta_0. \tag{11.13}
\]
Before proceeding further, we recall the following method of approximating exponential integrals (the Laplace method). Suppose that f(θ) has a unique minimum at an interior point θ̂ of Θ and that the Hessian Hess f is uniformly lower-bounded by a multiple of the identity (in particular, f(θ) is strongly convex). Then taking Taylor expansions of π and f we get
\[
\int_\Theta \pi(\theta)e^{-nf(\theta)}\,d\theta = \int \big(\pi(\hat\theta) + O(\|t\|)\big)\, e^{-n\left(f(\hat\theta) + \frac{1}{2}t^T \mathrm{Hess}f(\hat\theta)\,t + o(\|t\|^2)\right)}\,dt \tag{11.14}
\]
\[
= \pi(\hat\theta)e^{-nf(\hat\theta)}\int_{\mathbb{R}^d} e^{-x^T \mathrm{Hess}f(\hat\theta)\,x}\,\frac{dx}{\sqrt{n^d}}\,\big(1 + O(n^{-1/2})\big) \tag{11.15}
\]
\[
= \pi(\hat\theta)e^{-nf(\hat\theta)}\left(\frac{2\pi}{n}\right)^{\frac{d}{2}}\frac{1}{\sqrt{\det \mathrm{Hess}f(\hat\theta)}}\,\big(1 + O(n^{-1/2})\big). \tag{11.16}
\]
Applying this to (11.13) gives
\[
\log Q_{X^n}(x^n) = -nH(\hat\theta) + \frac{d}{2}\log\frac{2\pi}{n\log e} + \log\frac{P_\theta(\hat\theta)}{\sqrt{\det J_F(\hat\theta)}} + O(n^{-\frac{1}{2}}),
\]
\[
D(P_{X^n|\theta=\theta_0}\|Q_{X^n}) = n\big(\mathbb{E}[H(\hat\theta)] - H(X|\theta=\theta_0)\big) + \frac{d}{2}\log n - \log\frac{P_\theta(\theta_0)}{\sqrt{\det J_F(\theta_0)}} + \mathrm{const} + O(n^{-\frac{1}{2}}), \tag{11.17}
\]
(11.17)
where const is some constant (independent of prior Pθ or θ0 ). The first term is handled by the next
Lemma.
Lemma 11.1. Let X^n be i.i.d. ∼ P on a finite alphabet X and let P̂ be the empirical type of X^n. Then
\[
\mathbb{E}[D(\hat P\|P)] = \frac{|\mathcal{X}|-1}{2n}\log e + o\Big(\frac{1}{n}\Big).
\]
Proof. Notice that √n(P̂ − P) converges in distribution to N(0, Σ), where Σ = diag(P) − PP^T and P is viewed as an |X|-by-1 column vector. Thus, computing the second-order Taylor expansion of D(·‖P), cf. (4.16), we get the result.
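Lemma 11.1 is easy to check numerically; the following sketch (Python, with arbitrary toy parameters, not part of the notes) compares a Monte Carlo estimate of E[D(P̂‖P)] with (|X|−1)/(2n) log e.

```python
import numpy as np

rng = np.random.default_rng(4)

def empirical_kl_mean(p, n, n_trials=20000):
    """Monte Carlo estimate of E[D(P_hat || P)] in bits for iid samples of size n."""
    p = np.asarray(p, dtype=float)
    total = 0.0
    for _ in range(n_trials):
        counts = rng.multinomial(n, p)
        phat = counts / n
        mask = phat > 0
        total += np.sum(phat[mask] * np.log2(phat[mask] / p[mask]))
    return total / n_trials

p = np.array([0.5, 0.3, 0.15, 0.05])
n = 200
print("simulated E[D(Phat||P)]      :", empirical_kl_mean(p, n))
print("(|X|-1)/(2n) log2(e) approx. :", (len(p) - 1) / (2 * n) * np.log2(np.e))
```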
Continuing (11.17) we get in the end
\[
D(P_{X^n|\theta=\theta_0}\|Q_{X^n}) = \frac{d}{2}\log n - \log\frac{P_\theta(\theta_0)}{\sqrt{\det J_F(\theta_0)}} + \mathrm{const} + O(n^{-\frac{1}{2}}) \tag{11.18}
\]
under the assumption of smoothness of the prior P_θ and that θ_0 is not too close to the boundary. Consequently, we can see that in order for the prior P_θ to be the saddle point solution, we should have
\[
P_\theta(\theta_0) \sim \sqrt{\det J_F(\theta_0)},
\]
provided that such a density is normalizable. The prior proportional to the square root of the determinant of the Fisher information matrix is known as the Jeffreys prior. In our case, using the explicit expression for the Fisher information (4.17) we get
\[
P^*_\theta = \mathrm{Beta}(1/2, 1/2, \cdots, 1/2) = c_d \frac{1}{\sqrt{\prod_{j=0}^{d}\theta_j}}, \tag{11.19}
\]
Making the arguments in this subsection rigorous is far from trivial, see [CB90, CB94] for details.
¹ This is obtained from the identity
\[
\int_0^1 \frac{\theta^a(1-\theta)^b}{\sqrt{\theta(1-\theta)}}\,d\theta = \pi\,\frac{1\cdot 3\cdots(2a-1)\cdot 1\cdot 3\cdots(2b-1)}{2^{a+b}(a+b)!}
\]
for integers a, b ≥ 0. This identity can be derived by the change of variable z = θ/(1−θ) and using the standard keyhole contour in the complex plane.
\[
Q^{(KT)}_{X_n|X^{n-1}}(1|x^{n-1}) = \frac{t_1 + \tfrac{1}{2}}{n}, \qquad t_1 = \#\{j \le n-1: x_j = 1\} \tag{11.23}
\]
\[
Q^{(KT)}_{X_n|X^{n-1}}(0|x^{n-1}) = \frac{t_0 + \tfrac{1}{2}}{n}, \qquad t_0 = \#\{j \le n-1: x_j = 0\} \tag{11.24}
\]
This is the famous “add 1/2” rule of Krichevsky and Trofimov. Note that this sequential assignment
is very convenient for use in prediction as well as in implementing an arithmetic coder.
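A minimal Python sketch of the add-1/2 assignment (11.23)-(11.24) (the function name and the comparison at the end are illustrative, not from the notes): it produces the sequential predictive probabilities and the induced codelength, whose gap to n·h(empirical frequency) should be on the order of ½ log n.

```python
import numpy as np

def kt_sequential_probs(bits):
    """Krichevsky-Trofimov 'add 1/2' sequential probability assignment.

    Returns the per-step predictive probabilities Q(x_t | x^{t-1}) and the
    total codelength log2 1/Q(x^n) of the induced joint distribution.
    """
    t1 = 0
    codelength = 0.0
    step_probs = []
    for t, x in enumerate(bits, start=1):
        p_one = (t1 + 0.5) / t            # Q^{KT}(1 | x^{t-1}) as in (11.23)
        q = p_one if x == 1 else 1.0 - p_one
        step_probs.append(q)
        codelength += -np.log2(q)
        t1 += (x == 1)
    return step_probs, codelength

rng = np.random.default_rng(5)
theta, n = 0.2, 4096
x = (rng.random(n) < theta).astype(int)
_, L = kt_sequential_probs(x)
f = x.mean()
H_emp = 0.0 if f in (0, 1) else -(f * np.log2(f) + (1 - f) * np.log2(1 - f))
print("codelength - n h(empirical) =", L - n * H_emp, " (about 0.5*log2 n + O(1))")
```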
Remark 11.4 (Laplace "add 1" rule). A slightly less optimal choice of Q_{X^n} results from the Laplace prior: just take P_θ to be uniform on [0, 1]. Then, in the Bernoulli (d = 1) case we get
\[
Q^{(Lap)}_{X^n}(x^n) = \frac{1}{\binom{n}{w}(n+1)}, \qquad w = \#\{j: x_j = 1\}, \tag{11.25}
\]
\[
Q^{(Lap)}_{X_n|X^{n-1}}(1|x^{n-1}) = \frac{t_1 + 1}{n+1}, \qquad t_1 = \#\{j \le n-1: x_j = 1\}.
\]
We notice two things. First, the distribution (11.25) is exactly the same as Fitingof's (11.5). Second, this distribution "almost" attains the optimal first-order term in (11.20). Indeed, when X^n is iid Bern(θ) we have for the redundancy:
\[
\mathbb{E}\left[\log\frac{1}{Q^{(Lap)}_{X^n}(X^n)}\right] - H(X^n) = \log(n+1) + \mathbb{E}\left[\log\binom{n}{W}\right] - nh(\theta), \qquad W \sim \mathrm{Bino}(n, \theta). \tag{11.26}
\]
From Stirling's expansion we know that as n → ∞ this redundancy evaluates to ½ log n + O(1), uniformly in θ over compact subsets of (0, 1). However, for θ = 0 or θ = 1 the Laplace redundancy (11.26) clearly equals log(n + 1). Thus, the supremum over θ ∈ [0, 1] is achieved close to the endpoints and results in the suboptimal redundancy log n + O(1). The Jeffreys prior (11.21) fixes the problem at the endpoints.
3. After that, nature reveals the next sample x_t and our loss for the t-th prediction is evaluated as
\[
\log\frac{1}{Q_t(x_t|x^{t-1})}.
\]
Goal (informal): Come up with an algorithm to minimize the average per-letter loss:
\[
\ell(\{Q_t\}, x^n) \triangleq \frac{1}{n}\sum_{t=1}^n \log\frac{1}{Q_t(x_t|x^{t-1})}.
\]
Note that to make this goal formal, we need to explain how x^n is generated. Consider first a naive requirement that the worst-case loss is minimized:
\[
\min_{\{Q_t\}}\ \max_{x^n}\ \ell(\{Q_t\}, x^n).
\]
This is clearly trivial. Indeed, at any step t the distribution Q_t must have at least one atom with weight ≤ 1/|X|, and hence for any predictor
\[
\max_{x^n}\ \ell(\{Q_t\}, x^n) \ge \log|\mathcal{X}|,
\]
which is clearly achieved iff Q_t(·) ≡ 1/|X|, i.e., if the predictor makes absolutely no prediction. This is of course very natural: in the absence of any prior information on x^n it is impossible to predict anything.
The exciting idea, originated by Feder, Merhav and Gutman, cf. [FMG92, MF98], is to replace the loss with the regret, i.e., the gap to the best possible static oracle. More exactly, suppose a non-causal oracle can look at the entire string x^n and output a constant Q_t = Q_1. From non-negativity of divergence this non-causal oracle achieves:
\[
\ell_{\mathrm{oracle}}(x^n) = \min_Q \frac{1}{n}\sum_{t=1}^n \log\frac{1}{Q(x_t)} = H(\hat P_{x^n}).
\]
Can a causal (but time-varying) predictor come close to this performance? In other words, we define the regret as
\[
\mathrm{reg}(\{Q_t\}, x^n) \triangleq \ell(\{Q_t\}, x^n) - H(\hat P_{x^n})
\]
and ask to minimize the worst-case regret:
\[
\mathrm{reg}^*_n \triangleq \min_{\{Q_t\}}\ \max_{x^n}\ \mathrm{reg}(\{Q_t\}, x^n). \tag{11.27}
\]
Excitingly, non-trivial predictors emerge as solutions to the above problem, which furthermore do not rely on any assumptions on the prior distribution of x^n.
We next consider the case of X = {0, 1} for simplicity. To solve (11.27), first notice that designing a sequence {Q_t(·|x^{t-1})} is equivalent to defining one joint distribution Q_{X^n} and then factorizing the latter as Q_{X^n}(x^n) = \prod_t Q_t(x_t|x^{t-1}). (At this point, one should recall the worst-case redundancy of Shtarkov, cf. Remark 11.2.) Then the problem (11.27) becomes simply
\[
\mathrm{reg}^*_n = \min_{Q_{X^n}}\ \max_{x^n}\ \frac{1}{n}\log\frac{1}{Q_{X^n}(x^n)} - H(\hat P_{x^n}).
\]
We may lower-bound the max over x^n with the average over X^n ∼ Bern(θ)^n and obtain (also applying Lemma 11.1):
\[
\mathrm{reg}^*_n \ge \frac{1}{n} R^*_n + \frac{|\mathcal{X}|-1}{2n} + o\Big(\frac{1}{n}\Big),
\]
where R^*_n is the universal compression redundancy defined in (11.6), whose asymptotics we derived in (11.20).
On the other hand, taking Q^{(KT)}_{X^n} from Krichevsky-Trofimov (11.22) we find, after some algebra and Stirling's expansions:
\[
\max_{x^n}\left[-\log Q^{(KT)}_{X^n}(x^n) - nH(\hat P_{x^n})\right] = \frac{1}{2}\log n + O(1).
\]
In all, we conclude that
\[
\mathrm{reg}^*_n = \frac{1}{n}\big(R^*_n + O(1)\big) = \frac{|\mathcal{X}|-1}{2}\,\frac{\log n}{n} + O(1/n),
\]
and, remarkably, the regret converges to zero, i.e., the causal predictor can approach the performance of a non-causal oracle. Explicit (asymptotically optimal) sequential prediction rules are given by Krichevsky-Trofimov's "add 1/2" rules (11.24). We note that the resulting rules are also independent of n ("horizon-free"). This is a very desirable property not shared by the sequential predictors (also asymptotically optimal) derived from factorizing Shtarkov's distribution (11.10).
the basis for arithmetic coding. As long as P̂ converges to the actual conditional probability, we will get to the entropy rate H(X_n|X_{n-r}^{n-1}). Note that the Krichevsky-Trofimov assignment (11.24) is clearly learning the distribution too: as n grows, the estimator Q_{X_n|X^{n-1}} converges to the true P_X (provided the sequence is i.i.d.). So in some sense the converse is also true: any good universal compression scheme is inherently learning the true distribution.
The main drawback of the learn-then-compress approach is the following. Once we extend the class of sources to include those with memory, we are invariably led to the problem of learning the joint distribution P_{X_0^{r-1}} of r-blocks. However, the number of samples required to obtain a good estimate of P_{X_0^{r-1}} is exponential in r. Thus learning may proceed rather slowly. The Lempel-Ziv family of algorithms works around this in an ingeniously elegant way:
• First, estimating probabilities of rare substrings takes longest, but it is also the least useful,
as these substrings almost never appear at the input.
• Second, and most crucial, observation is that a great estimate of PX r (xr ) is given by the
reciprocal of the time till the last observation of xr in the incoming stream.
• Third, there is a prefix code² mapping any integer n to a binary string of length roughly log₂ n. Thus, by encoding the pointer to the last observation of x^r via such a code we get a string of length roughly log 1/P_{X^r}(x^r) automatically.
² For this just notice that \sum_{k\ge 1} 2^{-\log_2 k - 2\log_2\log(k+1)} < \infty and use Kraft's inequality.
There are a number of variations of these basic ideas, so we will only attempt to give a rough
explanation of why it works, without analyzing any particular algorithm.
We proceed to formal details. First, we need to establish Kac's lemma.
Lemma 11.2 (Kac). Consider a finite-alphabet stationary ergodic process ..., X_{-1}, X_0, X_1, .... Let L = inf{t > 0 : X_{-t} = X_0} be the time to the last appearance of the symbol X_0 in the sequence X_{-\infty}^{-1}. Then for any u such that P[X_0 = u] > 0 we have
\[
\mathbb{E}[L|X_0 = u] = \frac{1}{\mathbb{P}[X_0 = u]}.
\]
In particular, the mean recurrence time is E[L] = |supp(P_X)|.
Proof. Note that from stationarity the probability P[∃t ≥ k : X_t = u] does not depend on k ∈ Z, and hence
\[
\mathbb{P}[\exists t \ge 0 : X_t = u] = \mathbb{P}[\exists t \in \mathbb{Z} : X_t = u].
\]
However, the last event is shift-invariant and thus must have probability zero or one by the ergodic assumption. But since P[X_0 = u] > 0 it cannot be zero. So we conclude
\[
\mathbb{P}[\exists t \ge 0 : X_t = u] = 1. \tag{11.29}
\]
Next, we have
\begin{align}
\mathbb{E}[L|X_0 = u] &= \sum_{t\ge 1}\mathbb{P}[L \ge t|X_0 = u] \tag{11.30}\\
&= \frac{1}{\mathbb{P}[X_0 = u]}\sum_{t\ge 1}\mathbb{P}[L \ge t, X_0 = u] \tag{11.31}\\
&= \frac{1}{\mathbb{P}[X_0 = u]}\sum_{t\ge 1}\mathbb{P}[X_{-t+1}\neq u, \ldots, X_{-1}\neq u, X_0 = u] \tag{11.32}\\
&= \frac{1}{\mathbb{P}[X_0 = u]}\sum_{t\ge 1}\mathbb{P}[X_0\neq u, \ldots, X_{t-2}\neq u, X_{t-1} = u] \tag{11.33}\\
&= \frac{1}{\mathbb{P}[X_0 = u]}\,\mathbb{P}[\exists t\ge 0: X_t = u] \tag{11.34}\\
&= \frac{1}{\mathbb{P}[X_0 = u]}, \tag{11.35}
\end{align}
where (11.30) is the standard expression for the expectation of a Z+ -valued random variable, (11.33)
is from stationarity, (11.34) is because the events corresponding to different t are disjoint, and (11.35)
is from (11.29).
The following proposition serves to explain the basic principle behind the operation of Lempel-Ziv:
Theorem 11.1. Consider a finite-alphabet stationary ergodic process ..., X_{-1}, X_0, X_1, ... with entropy rate H. Suppose that X_{-\infty}^{-1} is known to the decoder. Then there exists a sequence of prefix codes f_n(x_0^{n-1}, x_{-\infty}^{-1}) with expected length
\[
\frac{1}{n}\,\mathbb{E}\big[\ell\big(f_n(X_0^{n-1}, X_{-\infty}^{-1})\big)\big] \to \mathcal{H}.
\]
Proof. Let L_n be the last occurrence of the block x_0^{n-1} in the string x_{-\infty}^{-1} (recall that the latter is known to the decoder), namely
\[
L_n = \inf\{t > 0: x_{-t}^{-t+n-1} = x_0^{n-1}\}.
\]
By Kac's lemma applied to the process of n-blocks,
\[
\mathbb{E}[L_n|X_0^{n-1} = x_0^{n-1}] = \frac{1}{\mathbb{P}[X_0^{n-1} = x_0^{n-1}]}.
\]
We now encode L_n using the code (11.28). Note that there is a crucial subtlety: even if L_n < n, and thus [-L_n, -L_n + n - 1] and [0, n-1] overlap, the substring x_0^{n-1} can still be decoded from the knowledge of L_n.
We have, by applying Jensen's inequality twice and noticing that \frac{1}{n}H(X_0^{n-1}) \searrow \mathcal{H} and \frac{1}{n}\log H(X_0^{n-1}) \to 0, that
\[
\frac{1}{n}\,\mathbb{E}[\ell(f_{\mathrm{int}}(L_n))] \le \frac{1}{n}\,\mathbb{E}\Big[\log\frac{1}{P_{X_0^{n-1}}(X_0^{n-1})}\Big] + o(1) \to \mathcal{H}.
\]
From Kraft's inequality we know that for any prefix code we must have
\[
\frac{1}{n}\,\mathbb{E}[\ell(f_{\mathrm{int}}(L_n))] \ge \frac{1}{n}H(X_0^{n-1}|X_{-\infty}^{-1}) = \mathcal{H}.
\]
Part III
§ 12. Binary hypothesis testing
H0 : X ∼ P
H1 : X ∼ Q
Where under hypothesis H0 (the null hypothesis) X is distributed according to P , and under H1
(the alternative hypothesis) X is distributed according to Q. A test between two distributions
chooses either H0 or H1 based on an observation of X
• Randomized test: PZ|X : X → {0, 1}, so that PZ|X (0|x) ∈ [0, 1].
Let Z = 0 denote that the test chooses P , and Z = 1 when the test chooses Q.
Remark: This setting is called “testing simple hypothesis against simple hypothesis”. Simple
here refers to the fact that under each hypothesis there is only one distribution that could generate
the data. Composite hypothesis is when X ∼ P and P is only known to belong to some class of
distributions.
So for any test PZ|X there is an associated (α, β). There are a few ways to determine the “best
test”
• Bayesian: Assume prior distributions P[H0 ] = π0 and P[H1 ] = π1 , minimize the expected error
• Minimax: Assume there is a prior distribution but it is unknown, so choose the test that performs the best for the worst-case prior:
\[
P^*_m = \min_{\text{tests}}\ \max_{\pi_0}\ \pi_0\pi_{1|0} + \pi_1\pi_{0|1}
\]
Definition 12.2. Given (P, Q), the region of achievable points for all randomized tests is
\[
\mathcal{R}(P, Q) = \bigcup_{P_{Z|X}}\{(P[Z=0],\ Q[Z=0])\} \subset [0,1]^2. \tag{12.1}
\]
[Figure: the region R(P, Q) in the (α, β)-plane, with its lower boundary β_α(P, Q).]
Remark 12.2. This region encodes a lot of useful information about the relationship between P and Q. For example,¹
\[
P = Q \iff \mathcal{R}(P, Q) = \{(\alpha,\beta): \alpha = \beta\}, \qquad
P \perp Q \iff \mathcal{R}(P, Q) = [0,1]^2.
\]
Moreover, TV(P, Q) = maximal length of a vertical segment intersecting the lower half of R(P, Q) (HW).
Theorem 12.1 (Properties of R(P, Q)).
1. R(P, Q) is a closed convex subset of [0, 1]².
2. R(P, Q) contains the diagonal.
3. R(P, Q) is closed under the map (α, β) ↦ (1 − α, 1 − β).
Proof. 1. For convexity, suppose (α_0, β_0), (α_1, β_1) ∈ R(P, Q); then each specifies a test P_{Z_0|X}, P_{Z_1|X} respectively. Randomizing between these two tests gives the test λP_{Z_0|X} + λ̄P_{Z_1|X} for λ ∈ [0, 1], which achieves the point (λα_0 + λ̄α_1, λβ_0 + λ̄β_1) ∈ R(P, Q).
Closedness will follow from the explicit determination of all boundary points via the Neyman-Pearson Lemma – see Remark 12.3. In more complicated situations (e.g., in testing against a composite hypothesis) simple explicit solutions similar to the Neyman-Pearson Lemma are not available, but closedness of the region can frequently still be argued. The basic reason is that the collection of functions {g : X → [0, 1]} forms a weakly-compact set and hence its image under the linear functional g ↦ (∫g dP, ∫g dQ) is closed.
3. If (α, β) ∈ R(P, Q), then form the test that chooses P whenever P_{Z|X} chooses Q, and chooses Q whenever P_{Z|X} chooses P, which gives (1 − α, 1 − β) ∈ R(P, Q).
The region R(P, Q) consists of the operating points of all randomized tests, which include deterministic tests as special cases. The achievable region of deterministic tests is denoted by
\[
\mathcal{R}_{\mathrm{det}}(P, Q) = \bigcup_E \{(P(E), Q(E))\}. \tag{12.2}
\]
One might wonder about the relationship between these two regions. It turns out that R(P, Q) is given by the closed convex hull of R_det(P, Q).
We first recall a couple of notations:
Consequently, if P and Q are on a finite alphabet X, then R(P, Q) is a polygon with at most 2^{|X|} vertices.
Proof. "⊃": Comparing (12.1) and (12.2), by definition, R(P, Q) ⊃ R_det(P, Q). By Theorem 12.1, R(P, Q) is closed and convex, and we are done with the ⊃ direction.
"⊂": Given any randomized test P_{Z|X}, put g(x) = P_{Z=0|X=x}. Then g is a measurable function. Let's recall the following lemma:
Lemma 12.1 (Area rule). For any random variable U ≥ 0, E[U] = \int_{\mathbb{R}_+} P[U \ge u]\,du.
Proof. By Fubini, E[U] = \mathbb{E}\int_0^U du = \mathbb{E}\int \mathbf{1}\{U \ge u\}\,du = \int \mathbb{E}\,\mathbf{1}\{U \ge u\}\,du.
Thus,
\[
P[Z=0] = \sum_x g(x)P(x) = \mathbb{E}_P[g(X)] = \int_0^1 P[g(X) \ge t]\,dt,
\]
\[
Q[Z=0] = \sum_x g(x)Q(x) = \mathbb{E}_Q[g(X)] = \int_0^1 Q[g(X) \ge t]\,dt,
\]
where we applied the formula E[U] = ∫P[U ≥ t] dt for U ≥ 0. Therefore the point (P[Z = 0], Q[Z = 0]) ∈ R is a mixture of the points (P[g(X) ≥ t], Q[g(X) ≥ t]) ∈ R_det, averaged according to t uniformly distributed on the unit interval. Hence R ⊂ cl(co(R_det)).
The last claim follows because there are at most 2^{|X|} subsets in (12.2).
Example: Testing Bern(p) versus Bern(q), p < 1/2 < q. Using Theorem 12.2, note that there are 2² = 4 events E = ∅, {0}, {1}, {0, 1}. Then
[Figure: R(Bern(p), Bern(q)) is the quadrilateral in the (α, β)-plane with vertices (0, 0), (p, q), (1, 1) and (p̄, q̄).]
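For finite alphabets the region is easy to compute directly from Theorem 12.2, as in the following sketch (Python; not from the notes, with the Bernoulli pair above as the toy choice): enumerate all events, collect the points (P(E), Q(E)), and take their closed convex hull.

```python
from itertools import combinations

def region_vertices(P, Q):
    """Enumerate the deterministic operating points (P(E), Q(E)) over all events E
    for distributions on a small finite alphabet; R(P, Q) is their closed convex hull."""
    m = len(P)
    pts = []
    for r in range(m + 1):
        for E in combinations(range(m), r):
            pts.append((sum(P[i] for i in E), sum(Q[i] for i in E)))
    return pts

p, q = 0.2, 0.8
P = [1 - p, p]      # Bern(p): P(0), P(1)
Q = [1 - q, q]
print(sorted(set(region_vertices(P, Q))))
# -> the four corners (0,0), (p,q), (p_bar,q_bar), (1,1), up to float rounding
```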
Notes:
• The rationale for defining extended values ±∞ of T (x) are the following observations:
This leads to the following useful consequence: For any g ≥ 0 and any τ ∈ R (note: τ = ±∞
is excluded) we have
Below, these and similar inequalities are only checked for the cases of T taking real (not
extended) values, but from this remark it should be clear how to treat the general case.
Theorem 12.3.
Proof. (2)
\[
Q_T(t) = \sum_{\mathcal{X}} Q(x)\mathbf{1}\Big\{\log\frac{P(x)}{Q(x)} = t\Big\} = \sum_{\mathcal{X}} Q(x)\mathbf{1}\{e^t Q(x) = P(x)\}
= \sum_{\mathcal{X}} e^{-t}P(x)\mathbf{1}\Big\{\log\frac{P(x)}{Q(x)} = t\Big\} = e^{-t}P_T(t)
\]
where we used (12.5) to justify the restriction to finite values of T.
(1) To show T is a sufficient statistic, we need to show P_{X|T} = Q_{X|T}. For the discrete case we have:
\[
P_{X|T}(x|t) = \frac{P_X(x)P_{T|X}(t|x)}{P_T(t)} = \frac{P(x)\mathbf{1}\{\frac{P(x)}{Q(x)} = e^t\}}{P_T(t)} = \frac{e^t\,Q(x)\mathbf{1}\{\frac{P(x)}{Q(x)} = e^t\}}{P_T(t)}
= \frac{Q_{XT}(x,t)}{e^{-t}P_T(t)} \overset{(2)}{=} \frac{Q_{XT}}{Q_T} = Q_{X|T}(x|t).
\]
The general argument is done similarly to the proof of (12.6).
From Theorem 12.2 we know that to obtain the achievable region R(P, Q), one can iterate over
all subsets and compute the region Rdet (P, Q) first, then take its closed convex hull. But this is a
dP
formidable task if the alphabet is huge or infinite. But we know that the LLR log dQ is a sufficient
dP
statistic. Next we give bounds to the region R(P, Q) in terms of the statistics of log dQ . As usual,
there are two types of statements:
• Converse (outer bounds): any point in R(P, Q) must satisfy ...
Note: In this case, we do not need P ≪ Q, since ±∞ is a reasonable and meaningful value for the log likelihood ratio.
Lemma 12.3 (Randomized tests). P[Z = 0] − γQ[Z = 0] ≤ P\big[\log\frac{dP}{dQ} > \log\gamma\big].
Proof. Almost identical to the proof of the previous Lemma 12.2:
\[
P[Z=0] - \gamma Q[Z=0] = \sum_x P_{Z|X}(0|x)\big(p(x) - \gamma q(x)\big) \le \sum_x P_{Z|X}(0|x)\big(p(x) - \gamma q(x)\big)\mathbf{1}\{p(x) > \gamma q(x)\}
\]
\[
= P\Big[\log\frac{dP}{dQ} > \log\gamma,\ Z=0\Big] - \gamma\, Q\Big[\log\frac{dP}{dQ} > \log\gamma,\ Z=0\Big] \le P\Big[\log\frac{dP}{dQ} > \log\gamma\Big].
\]
Theorem 12.5 (Strong Converse). For every (α, β) ∈ R(P, Q) and every γ > 0,
\[
\alpha - \gamma\beta \le P\Big[\log\frac{dP}{dQ} > \log\gamma\Big] \tag{12.8}
\]
\[
\beta - \frac{1}{\gamma}\alpha \le Q\Big[\log\frac{dP}{dQ} < \log\gamma\Big] \tag{12.9}
\]
Proof. Apply Lemma 12.3 to (P, Q, γ) and to (Q, P, 1/γ).
Note: Theorem 12.5 provides an outer bound for the region R(P, Q) in terms of half-spaces. To see this, suppose one fixes γ > 0, looks at the line α − γβ = c and slowly increases c from zero; there is going to be a maximal c, say c*, at which point the line touches the lower boundary of the region. Then (12.8) says that c* cannot exceed P[log dP/dQ > log γ]. Hence R must lie to the left of the line. Similarly, (12.9) provides bounds for the upper boundary. Altogether, Theorem 12.5 states that R(P, Q) is contained in the intersection of a collection of half-spaces indexed by γ.
Note: To apply the strong converse Theorem 12.5, we need to know the CDF of the LLR, whereas to apply the weak converse Theorem 12.4 we only need to know the expectation of the LLR, i.e., the divergence.
where the last equality follows from the fact that we are free to choose P_{Z|X}(0|x), and the best choice is obvious:
\[
P_{Z|X}(0|x) = \mathbf{1}\Big\{\log\frac{P(x)}{Q(x)} \ge \log t\Big\}.
\]
Thus, we have shown that all supporting hyperplanes are parameterized by LLR tests. This completely recovers the region R(P, Q) except for the points corresponding to the faces (flat pieces) of the region. To be precise, we state the following result.
Theorem 12.6 (Neyman-Pearson Lemma: "LRT is optimal"). For any α, β_α is attained by the following test:
\[
P_{Z|X}(0|x) = \begin{cases} 1 & \log\frac{dP}{dQ} > \tau\\ \lambda & \log\frac{dP}{dQ} = \tau\\ 0 & \log\frac{dP}{dQ} < \tau\end{cases} \tag{12.10}
\]
Proof of Theorem 12.6. Let t = exp(τ). Given any test P_{Z|X}, let g(x) = P_{Z|X}(0|x) ∈ [0, 1]. We want to show that
\[
\alpha = P[Z=0] = \mathbb{E}_P[g(X)] = P\Big[\frac{dP}{dQ} > t\Big] + \lambda P\Big[\frac{dP}{dQ} = t\Big] \tag{12.11}
\]
\[
\overset{\text{goal}}{\Longrightarrow}\quad \beta = Q[Z=0] = \mathbb{E}_Q[g(X)] \ge Q\Big[\frac{dP}{dQ} > t\Big] + \lambda Q\Big[\frac{dP}{dQ} = t\Big] \tag{12.12}
\]
Using the simple fact that \mathbb{E}_Q[f(X)\mathbf{1}\{\frac{dP}{dQ}\le t\}] \ge t^{-1}\,\mathbb{E}_P[f(X)\mathbf{1}\{\frac{dP}{dQ}\le t\}] for any f ≥ 0 twice, we have
\[
\mathbb{E}_Q[g(X)] \ge \frac{1}{t}\,\mathbb{E}_P\big[g(X)\mathbf{1}\{\tfrac{dP}{dQ}\le t\}\big] + \mathbb{E}_Q\big[g(X)\mathbf{1}\{\tfrac{dP}{dQ}>t\}\big]
\]
\[
\overset{(12.11)}{=} \frac{1}{t}\Big(\mathbb{E}_P\big[(1-g(X))\mathbf{1}\{\tfrac{dP}{dQ}>t\}\big] + \lambda P\big[\tfrac{dP}{dQ}=t\big]\Big) + \mathbb{E}_Q\big[g(X)\mathbf{1}\{\tfrac{dP}{dQ}>t\}\big]
\]
\[
\ge \mathbb{E}_Q\big[(1-g(X))\mathbf{1}\{\tfrac{dP}{dQ}>t\}\big] + \lambda Q\big[\tfrac{dP}{dQ}=t\big] + \mathbb{E}_Q\big[g(X)\mathbf{1}\{\tfrac{dP}{dQ}>t\}\big]
= Q\Big[\frac{dP}{dQ} > t\Big] + \lambda Q\Big[\frac{dP}{dQ} = t\Big].
\]
Remark 12.3. As a consequence of the Neyman-Pearson lemma, all the points on the boundary of the region R(P, Q) are attainable. Therefore, since α ↦ β_α is convex on [0, 1], hence continuous, the region R(P, Q) is a closed convex set. Consequently, the infimum in the definition of β_α is in fact a minimum.
Furthermore, the lower half of the region R(P, Q) is the convex hull of the union of the following two sets:
\[
\Big\{(\alpha, \beta): \alpha = P\big[\log\tfrac{dP}{dQ} > \tau\big],\ \beta = Q\big[\log\tfrac{dP}{dQ} > \tau\big]\Big\}, \qquad \tau \in \mathbb{R}\cup\{\pm\infty\},
\]
and
\[
\Big\{(\alpha, \beta): \alpha = P\big[\log\tfrac{dP}{dQ} \ge \tau\big],\ \beta = Q\big[\log\tfrac{dP}{dQ} \ge \tau\big]\Big\}, \qquad \tau \in \mathbb{R}\cup\{\pm\infty\}.
\]
Therefore it does not lose optimality to restrict our attention to tests of the form 1{log dP/dQ ≥ τ} or 1{log dP/dQ > τ}. The convex combination (randomization) of the above two styles of tests leads to the achievability part of the Neyman-Pearson lemma (Theorem 12.6).
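The boundary described in this remark can be traced numerically. The sketch below (Python; assumed finite alphabet and toy distributions, not from the notes) scans thresholds on the LLR and picks the randomization λ so that the test meets a prescribed α, as in (12.10).

```python
import numpy as np

def neyman_pearson(P, Q, alpha):
    """Sketch of the Neyman-Pearson test (12.10) on a finite alphabet.

    Finds the threshold tau and randomization lambda so that P[Z=0] = alpha,
    and returns (tau, lam, beta) with beta = Q[Z=0], the optimal type-II error.
    """
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    with np.errstate(divide="ignore"):
        llr = np.log(P) - np.log(Q)               # log dP/dQ (may be +/- inf)
    for tau in np.unique(llr)[::-1]:              # scan thresholds from largest down
        p_strict = P[llr > tau].sum()             # mass accepted with probability 1
        p_at = P[llr == tau].sum()                # mass accepted with probability lam
        if p_strict + p_at >= alpha:              # alpha is reachable at this threshold
            lam = 0.0 if p_at == 0 else (alpha - p_strict) / p_at
            beta = Q[llr > tau].sum() + lam * Q[llr == tau].sum()
            return tau, lam, beta
    return -np.inf, 1.0, 1.0                      # alpha = 1: accept everything

P = [0.5, 0.3, 0.15, 0.05]
Q = [0.1, 0.2, 0.3, 0.4]
print(neyman_pearson(P, Q, alpha=0.6))
```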
Remark 12.4. The test (12.10) is related to the LRT² as follows:
[Figure: two plots of the complementary CDF t ↦ P[log dP/dQ > t], one where the level α is attained exactly at some τ (left) and one where α falls inside a jump of the CDF (right).]
1. Left figure: If α = P[log dP/dQ > τ] for some τ, then λ = 0, and (12.10) becomes the LRT Z = 1{log dP/dQ ≤ τ}.
2. Right figure: If α ≠ P[log dP/dQ > τ] for every τ, then we have λ ∈ (0, 1), and (12.10) is equivalent to randomizing over tests: Z = 1{log dP/dQ ≤ τ} with probability λ̄, or Z = 1{log dP/dQ < τ} with probability λ.
Proof.
\[
Q\Big[\log\frac{dP}{dQ} > \tau\Big] = \sum_x Q(x)\mathbf{1}\Big\{\frac{P(x)}{Q(x)} > e^\tau\Big\}
\le \sum_x P(x)e^{-\tau}\mathbf{1}\Big\{\frac{P(x)}{Q(x)} > e^\tau\Big\} = e^{-\tau}P\Big[\log\frac{dP}{dQ} > \tau\Big].
\]
12.6 Asymptotics
Now we have many samples from the underlying distribution
i.i.d.
H0 : X1 , . . . , Xn ∼ P
i.i.d.
H1 : X1 , . . . , Xn ∼ Q
We're interested in the asymptotics of the error probabilities π_{0|1} and π_{1|0}. There are two main asymptotic regimes, in both of which the convergence of the error probabilities to zero is exponential.
1. Stein regime: What is the best exponential rate of convergence for π_{0|1} when π_{1|0} has to be ≤ ε?
\[
\begin{cases} \pi_{1|0} \le \epsilon\\ \pi_{0|1} \to 0 \end{cases}
\]
2
Note that it so happens that in Definition 12.3 the LRT is defined with an ≤ instead of <.
2. Chernoff regime: What is the tradeoff between the exponents of the convergence rates of π_{1|0} and π_{0|1} when we want both errors to go to 0?
\[
\begin{cases} \pi_{1|0} \to 0\\ \pi_{0|1} \to 0 \end{cases}
\]
§ 13. Hypothesis testing asymptotics I
Setup:
H0 : X n ∼ PX n H1 : X n ∼ QX n
test PZ|X n : X n → {0, 1}
specification 1 − α = π1|0 β = π0|1
Note: Motivation of this objective: usually a “miss”(0|1) is much worse than a “false alarm” (1|0).
Definition 13.1 (ε-optimal exponent). V_ε is called the ε-optimal exponent in Stein's regime if V_ε = sup{E : there exists a sequence of tests with α_n ≥ 1 − ε and β_n ≤ 2^{-nE}}, and
\[
V = \lim_{\epsilon\to 0} V_\epsilon.
\]
Theorem 13.1 (Stein's lemma). Let P_{X^n} = P_X^n i.i.d. and Q_{X^n} = Q_X^n i.i.d. Then
\[
V_\epsilon = D(P\|Q), \qquad \forall\epsilon\in(0,1).
\]
Consequently,
V = D(P‖Q).
Example: If it is required that α ≥ 1 − 10^{-3} and β ≤ 10^{-40}, what is the number of samples needed? Stein's lemma provides a rule of thumb: n ≳ \frac{-\log 10^{-40}}{D(P\|Q)} = \frac{40\log 10}{D(P\|Q)}.
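For concreteness, a small computation of this rule of thumb for a pair of Bernoulli hypotheses (a sketch with arbitrary parameters, in bits; not part of the notes):

```python
import numpy as np

def kl_bernoulli(p, q):
    """D(Bern(p) || Bern(q)) in bits."""
    terms = []
    for a, b in ((p, q), (1 - p, 1 - q)):
        if a > 0:
            terms.append(a * np.log2(a / b))
    return sum(terms)

def stein_sample_size(p, q, beta):
    """Rule-of-thumb sample size n ~ log(1/beta) / D(P||Q) from Stein's lemma."""
    return np.log2(1 / beta) / kl_bernoulli(p, q)

# e.g. distinguishing Bern(0.11) from Bern(0.10) with beta <= 10^-40
print(stein_sample_size(0.11, 0.10, 1e-40))
```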
Proof. Denote F = \log\frac{dP}{dQ} and F_n = \log\frac{dP_{X^n}}{dQ_{X^n}}(X^n) = \sum_{i=1}^n \log\frac{dP}{dQ}(X_i) – an iid sum.
Recall the Neyman-Pearson lemma on optimal tests (likelihood ratio test): ∀τ,
then pick n large enough (depending on ε, δ) such that α ≥ 1 − ε; we have the exponent E = D − δ achievable, so V_ε ≥ E. Further letting δ → 0, we have that V_ε ≥ D.
and the second is due to the weak converse Theorem 12.4 proved in the last lecture (data processing inequality for divergence).
For every achievable exponent E < V_ε, by definition, there exists a sequence of tests P_{Z|X^n} such that α_n ≥ 1 − ε and β_n ≤ 2^{-nE}. Plugging this into (13.3) and using h ≤ log 2, we have
Therefore
\[
V_\epsilon \le \frac{D(P\|Q)}{1-\epsilon}.
\]
Notice that this is weaker than what we hoped to prove, but this weak converse result is tight for ε → 0, i.e., for Stein's exponent we do have the desired result V = \lim_{\epsilon\to 0} V_\epsilon = D(P\|Q).
b) (strong converse) In proving the weak converse, we only made use of the expectation of F_n in (13.3); we need to make use of the entire distribution (CDF) in order to obtain stronger results.
Recall the strong converse result which we showed in the last lecture:
Here, suppose there exists a sequence of tests P_{Z|X^n} which achieve α_n ≥ 1 − ε and β_n ≤ 2^{-nE}. Then
Pick log γ = n(D + δ); by (13.1) the RHS goes to 0, and we have
Remark 13.1 (Ergodic). Just like in the last section of data compression, ergodic assumptions on P_{X^n} and Q_{X^n} allow one to show that
\[
V = \lim_{n\to\infty}\frac{1}{n}D(P_{X^n}\|Q_{X^n});
\]
the counterpart of (13.3), which is the key for picking the appropriate τ, is, for an ergodic sequence X^n, the Birkhoff-Khintchine convergence theorem.
Remark 13.2. The theoretical importance of knowing Stein's exponent is that:
Thus knowledge of Stein's exponent V allows one to prove exponential bounds on probabilities of arbitrary sets; the technique is known as "change of measure".
H0 : X n ∼ PXn H1 : X n ∼ QnX
But our objective in this section is to have both types of error probability to vanish exponentially
fast simultaneously. We shall look at the following specification:
Apparently, E0 (resp. E1 ) can be made arbitrarily big at the price of making E1 (resp. E0 ) arbitrarily
small. So the problem boils down to the optimal tradeoff, i.e., what’s the achievable region of
(E0 , E1 )? This problem is solved by [Hoe65, Bla74].
and we are interested in the asymptotics when n → ∞, in which scenario we know (13.1) and (13.2)
occur.
Stein's regime corresponds to the corner points. Indeed, Theorem 13.1 tells us that, fixing α_n = 1 − ε, namely E_0 = 0, picking τ = D(P‖Q) − δ (δ → 0) gives the exponential convergence rate of β_n as E_1 = D(P‖Q). Similarly, exchanging the roles of P and Q, we can achieve the point (E_0, E_1) = (D(Q‖P), 0). More generally, to achieve the optimal tradeoff between the two corner points, we need to introduce a powerful tool – Large Deviation Theory.
Note: Here is a roadmap of the upcoming two lectures:
1. basics of large deviations (ψ_X, ψ_X^*, tilted distribution P_λ)
(E_0(θ) = ψ_P^*(θ), E_1(θ) = ψ_P^*(θ) − θ) characterize the achievable boundary.
what is the rate function E(\gamma) = -\lim_{n\to\infty}\frac{1}{n}\log\mathbb{P}\Big[\frac{\sum_{i=1}^n X_i}{n}\ge\gamma\Big]? (Chernoff's inequality)
To motivate, let us recall the usual Chernoff bound: For iid X^n, for any λ ≥ 0,
\[
\mathbb{P}\Big[\sum_{i=1}^n X_i \ge n\gamma\Big] = \mathbb{P}\Big[\exp\Big(\lambda\sum_{i=1}^n X_i\Big) \ge \exp(n\lambda\gamma)\Big]
\overset{\text{Markov}}{\le} \exp(-n\lambda\gamma)\,\mathbb{E}\Big[\exp\Big(\lambda\sum_{i=1}^n X_i\Big)\Big]
= \exp\{-n\lambda\gamma + n\log\mathbb{E}[\exp(\lambda X)]\}.
\]
Optimizing over λ ≥ 0 gives the non-asymptotic upper bound (concentration inequality) which holds for any n:
\[
\mathbb{P}\Big[\sum_{i=1}^n X_i \ge n\gamma\Big] \le \exp\Big\{-n\sup_{\lambda\ge 0}\big(\lambda\gamma - \underbrace{\log\mathbb{E}[\exp(\lambda X)]}_{\text{log MGF}}\big)\Big\}.
\]
ψX (λ) = log(E[exp(λX)]), λ ∈ R.
1. MGF exists, i.e., ∀λ ∈ R, ψX (λ) < ∞, which, in particular, implies all moments exist.
2. X 6=const.
Example:
• Gaussian: X ∼ N(0, 1) ⇒ ψ_X(λ) = λ²/2.
• Example of an R.V. for which ψ_X(λ) does not exist: X = Z³ with Z Gaussian. Then ψ_X(λ) = ∞ for all λ ≠ 0.
1. ψX is convex;
2. ψX is continuous;
3. The derivative of ψ_X is given by
\[
\psi_X'(\lambda) = \frac{\mathbb{E}[Xe^{\lambda X}]}{\mathbb{E}[e^{\lambda X}]} = e^{-\psi_X(\lambda)}\,\mathbb{E}[Xe^{\lambda X}].
\]
In particular, ψ_X(0) = 0 and ψ_X'(0) = E[X].
4. If a ≤ X ≤ b a.s., then a ≤ ψ_X' ≤ b;
5. Conversely, if
\[
A = \inf_{\lambda\in\mathbb{R}}\psi_X'(\lambda), \qquad B = \sup_{\lambda\in\mathbb{R}}\psi_X'(\lambda),
\]
then A ≤ X ≤ B a.s.;
6. ψ_X is strictly convex, and consequently ψ_X' is strictly increasing;
7. Chernoff bound:
\[
\mathbb{P}(X \ge \gamma) \le \exp(-\lambda\gamma + \psi_X(\lambda)), \qquad \lambda \ge 0.
\]
Remark 13.3. The slope of the log MGF encodes the range of X. Indeed, 4) and 5) of Theorem 13.2 together show that the smallest closed interval containing the support of P_X equals the (closure of the) range of ψ_X'. In other words, A and B coincide with the essential infimum and supremum (min and max of the RV in the probabilistic sense) of X, respectively.
Proof. Note: 1–4 can be proved right now. 7 is the usual Chernoff bound. The proof of 5–6 relies on Theorem 13.4, which can be skipped for now.
where the Lp -norm of RV is defined by kU kp = (E|U |p )1/p . Applying to E[e(θλ1 +θ̄λ2 )X ] with
p = 1/θ, q = 1/θ̄, we get
E[exp((λ1 /p + λ2 /q)X)] ≤ k exp(λ1 X/p)kp k exp(λ2 X/q)kq = E[exp(λ1 X)]θ E[exp(λ2 X)]θ̄ ,
2. By our assumptions on X, the domain of $\psi_X$ is $\mathbb{R}$, and since a convex function must be continuous on the interior of its domain, $\psi_X$ is continuous on $\mathbb{R}$.
3. First, note that
$$e^{|X|} \le e^X + e^{-X}, \qquad |Xe^{\lambda X}| \le e^{|(\lambda+1)X|} \le e^{(\lambda+1)X} + e^{-(\lambda+1)X},$$
so $\mathbb{E}[|Xe^{\lambda X}|]<\infty$ for every λ.
Second, by the existence and continuity of $\mathbb{E}[|Xe^{\lambda X}|]$, the map $u\mapsto\mathbb{E}[|Xe^{uX}|]$ is integrable on $[0,\lambda]$, so we can switch the order of integration and expectation:
$$e^{\psi_X(\lambda)} = \mathbb{E}[e^{\lambda X}] = \mathbb{E}\Big[1+\int_0^\lambda Xe^{uX}\,du\Big] \stackrel{\text{Fubini}}{=} 1+\int_0^\lambda\mathbb{E}[Xe^{uX}]\,du \;\Rightarrow\; \psi_X'(\lambda)\,e^{\psi_X(\lambda)} = \mathbb{E}[Xe^{\lambda X}].$$
4. $a\le X\le b \;\Rightarrow\; \psi_X'(\lambda) = \frac{\mathbb{E}[Xe^{\lambda X}]}{\mathbb{E}[e^{\lambda X}]}\in[a,b]$.
5. Suppose $P_X[X>B]>0$ (for contradiction); then $P_X[X>B+2\epsilon]>0$ for some small $\epsilon>0$. But then $P_\lambda[X\le B+\epsilon]\to 0$ as $\lambda\to\infty$ (see Theorem 13.4.3 below). On the other hand, we know from Theorem 13.4.2 that $\mathbb{E}_{P_\lambda}[X] = \psi_X'(\lambda)\le B$. This is not yet a contradiction, since $P_\lambda$ might still have some very small mass at a very negative value. To show that this cannot happen, we first assume that $B-\epsilon>0$ (otherwise just replace X with $X-2B$). Next note that
for some constant c. Then,
$$\mathbb{E}_{P_\lambda}[|X|\mathbf{1}\{X<B-\epsilon\}] = \mathbb{E}[|X|e^{\lambda X-\psi_X(\lambda)}\mathbf{1}\{X<B-\epsilon\}] \tag{13.6}$$
$$\le \mathbb{E}[|X|e^{\lambda X-c-\lambda(B-\frac{\epsilon}{2})}\mathbf{1}\{X<B-\epsilon\}] \tag{13.7}$$
$$\le \mathbb{E}[|X|e^{\lambda(B-\epsilon)-c-\lambda(B-\frac{\epsilon}{2})}] \tag{13.8}$$
$$= \mathbb{E}[|X|]\,e^{-\lambda\frac{\epsilon}{2}-c} \to 0, \quad \lambda\to\infty \tag{13.9}$$
where the first inequality is from (13.5) and the second from $X<B-\epsilon$. Thus, the first term in (13.4) goes to 0, implying the desired contradiction.
6. Suppose ψX is not strictly convex. Since we know that ψX is convex, then ψX must be
“flat” (affine) near some point, i.e., there exists a small neighborhood of some λ0 such that
ψX (λ0 + u) = ψX (λ0 ) + ur for some r ∈ R. Then ψPλ (u) = ur for all u in small neighborhood
of zero, or equivalently EPλ [eu(X−r) ] = 1 for u small. The following Lemma 13.1 implies
Pλ [X = r] = 1, but then P [X = r] = 1, contradicting the assumption X 6= const.
Proof. Expand in Taylor series around u = 0 to obtain E[S] = 0, E[S 2 ] = 0. Alternatively, we can
extend the argument we gave for differentiating ψX (λ) to show that the function z 7→ E[ezS ] is
holomorphic on the entire complex plane1 . Thus by uniqueness, E[euS ] = 1 for all u.
Definition 13.4 (Rate function). The rate function $\psi_X^*:\mathbb{R}\to\mathbb{R}\cup\{+\infty\}$ is given by the Legendre–Fenchel transform of the log MGF:
$$\psi_X^*(\gamma) = \sup_{\lambda\in\mathbb{R}}\,\lambda\gamma - \psi_X(\lambda) \tag{13.10}$$
Note: The maximization (13.10) is a nice convex optimization problem since ψX is strictly convex,
so we are maximizing a strictly concave function. So we can find the maximum by taking the
∗ is the dual of ψ in the sense of convex
derivative and finding the stationary point. In fact, ψX X
analysis.
∗ ).
Theorem 13.3 (Properties of ψX
1
More precisely, if we only know that E[eλS ] is finite for |λ| ≤ 1 then the function z 7→ E[ezS ] is holomorphic in
the vertical strip {z : |Rez| < 1}.
1. Let $A = \operatorname{essinf} X$ and $B = \operatorname{esssup} X$. Then
$$\psi_X^*(\gamma) = \begin{cases} \lambda\gamma-\psi_X(\lambda)\ \text{for the }\lambda\text{ s.t. }\gamma=\psi_X'(\lambda), & A<\gamma<B,\\[2pt] \log\frac{1}{P(X=\gamma)}, & \gamma=A \text{ or } \gamma=B,\\[2pt] +\infty, & \gamma<A \text{ or } \gamma>B.\end{cases}$$
Proof. By Theorem 13.2.4, since A ≤ X ≤ B a.s., we have A ≤ ψX 0 ≤ B. When γ ∈ (A, B), the
strictly concave function λ 7→ λγ − ψX (λ) has a single stationary point which achieves the unique
maximum. When γ > B (resp. < A), λ 7→ λγ − ψX (λ) increases (resp. decreases) without bounds.
When $\gamma=B$, since $X\le B$ a.s., we have
$$\psi_X^*(B) = \sup_{\lambda\in\mathbb{R}}\big(\lambda B - \log\mathbb{E}[\exp(\lambda X)]\big) = -\log\inf_{\lambda\in\mathbb{R}}\mathbb{E}[\exp(\lambda(X-B))]$$
$$P_\lambda(dx) = \frac{e^{\lambda x}}{\mathbb{E}[e^{\lambda X}]}\,P(dx) = e^{\lambda x-\psi_X(\lambda)}\,P(dx) \tag{13.11}$$
In other words, if P has a pdf p, then the pdf of $P_\lambda$ is given by $p_\lambda(x) = e^{\lambda x-\psi_X(\lambda)}p(x)$.
Note: The set of distributions {Pλ : λ ∈ R} parametrized by λ is called a standard exponential
family, a very useful model in statistics. See [Bro86, p. 13].
Example:
• Gaussian: $P=\mathcal{N}(0,1)$ with density $p(x)=\frac{1}{\sqrt{2\pi}}\exp(-x^2/2)$. Then $P_\lambda$ has density $\frac{\exp(\lambda x)}{\exp(\lambda^2/2)}\cdot\frac{1}{\sqrt{2\pi}}\exp(-x^2/2) = \frac{1}{\sqrt{2\pi}}\exp(-(x-\lambda)^2/2)$. Hence $P_\lambda=\mathcal{N}(\lambda,1)$.
• Binary: P is uniform on $\{\pm1\}$. Then $P_\lambda(1) = \frac{e^\lambda}{e^\lambda+e^{-\lambda}}$, which puts more (resp. less) mass on 1 if $\lambda>0$ (resp. $<0$). Moreover, $P_\lambda\xrightarrow{D}\delta_1$ if $\lambda\to\infty$ and $P_\lambda\xrightarrow{D}\delta_{-1}$ if $\lambda\to-\infty$.
• Uniform: P is uniform on [0,1]. Then $P_\lambda$ is also supported on [0,1] with pdf $p_\lambda(x) = \frac{\lambda\exp(\lambda x)}{e^\lambda-1}$. Therefore as λ increases, $P_\lambda$ becomes increasingly concentrated near 1, and $P_\lambda\to\delta_1$ as $\lambda\to\infty$. Similarly, $P_\lambda\to\delta_0$ as $\lambda\to-\infty$.
So we see that Pλ shifts the mean of P to the right (resp. left) when λ > 0 (resp. < 0). Indeed, this
is a general property of tilting.
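As a quick illustration of tilting, the following sketch (not from the notes; the pmf on {0,1,2,3} is an arbitrary example) computes $P_\lambda$ from (13.11) for a few values of λ, showing the mean moving right as λ grows, and numerically checks the log-MGF identity $\psi_{P_\lambda}(u)=\psi_X(\lambda+u)-\psi_X(\lambda)$ stated in Theorem 13.4 below.

```python
import numpy as np

# a small sketch: tilting a distribution P on a finite alphabet along X(x) = x
xs = np.array([0.0, 1.0, 2.0, 3.0])
P  = np.array([0.4, 0.3, 0.2, 0.1])          # assumed example pmf

def psi(lmbda):
    """log MGF psi_X(lambda) under P."""
    return np.log(np.sum(P * np.exp(lmbda * xs)))

def tilt(lmbda):
    """P_lambda(x) = exp(lambda*x - psi(lambda)) * P(x), cf. (13.11)."""
    return P * np.exp(lmbda * xs - psi(lmbda))

for lam in (-2.0, 0.0, 1.0, 3.0):
    Pl = tilt(lam)
    print(f"lambda={lam:+.1f}  mean={np.dot(xs, Pl):.3f}  P_lambda={np.round(Pl, 3)}")

# check psi_{P_lambda}(u) = psi_X(lambda + u) - psi_X(lambda)
lam, u = 1.0, 0.7
lhs = np.log(np.sum(tilt(lam) * np.exp(u * xs)))
print(np.isclose(lhs, psi(lam + u) - psi(lam)))   # True
```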
Theorem 13.4 (Properties of $P_\lambda$).
1. Log MGF: $\psi_{P_\lambda}(u) = \psi_X(\lambda+u)-\psi_X(\lambda)$.
3. Therefore if $X_\lambda\sim P_\lambda$, then $X_\lambda\xrightarrow{D}\operatorname{essinf}X = A$ as $\lambda\to-\infty$ and $X_\lambda\xrightarrow{D}\operatorname{esssup}X = B$ as $\lambda\to\infty$, where the last inequality is due to the usual Chernoff bound (Theorem 13.2.7): $P[X>b]\le\exp(-\lambda b+\psi_X(\lambda))$.
§ 14. Information projection and Large deviation
Proof. First note that if the events have zero probability, then both sides coincide with infinity. Indeed, if $\mathbb{P}\big[\frac{1}{n}\sum_{k=1}^n X_k>\gamma\big]=0$, then $P[X>\gamma]=0$. Then $\mathbb{E}_Q[X]>\gamma \Rightarrow Q[X>\gamma]>0 \Rightarrow Q\not\ll P \Rightarrow D(Q\|P)=\infty$, and hence (14.1) holds trivially. The case for (14.2) is similar.
In the sequel we assume both probabilities are nonzero. We start by proving (14.1). Set $P[E_n] = \mathbb{P}\big[\frac{1}{n}\sum_{k=1}^n X_k>\gamma\big]$.
Lower Bound on $P[E_n]$: Fix a Q such that $\mathbb{E}_Q[X]>\gamma$ and let $X^n$ be i.i.d. Then by the WLLN,
$$Q[E_n] = Q\Big[\sum_{k=1}^n X_k>n\gamma\Big] = 1-o(1).$$
And a lower bound for the binary divergence is
$$d(Q[E_n]\,\|\,P[E_n]) \ge -h(Q[E_n]) + Q[E_n]\log\frac{1}{P[E_n]}$$
Upper Bound on $P[E_n]$: The key observation is that for any X and any event E with $P_X(E)>0$, the probability can be expressed via the divergence between the conditional and unconditional distributions as $\log\frac{1}{P_X(E)} = D(P_{X|X\in E}\|P_X)$. Define $\tilde P_{X^n} = P_{X^n\mid\sum X_i>n\gamma}$, under which $\sum X_i>n\gamma$ holds a.s. Then
$$\log\frac{1}{P[E_n]} = D(\tilde P_{X^n}\|P_{X^n}) \ge \inf_{Q_{X^n}:\ \mathbb{E}_Q[\sum X_i]>n\gamma} D(Q_{X^n}\|P_{X^n}) \tag{14.4}$$
We now show that the last problem "single-letterizes", i.e., reduces to n = 1. Consider the following two steps:
$$D(Q_{X^n}\|P_{X^n}) \ge \sum_{j=1}^n D(Q_{X_j}\|P) \tag{14.5}$$
$$\ge n\,D(\bar Q\|P), \qquad \bar Q \triangleq \frac{1}{n}\sum_{j=1}^n Q_{X_j}, \tag{14.6}$$
where the first step follows from Corollary 2.1 after noticing that $P_{X^n}=P^n$, and the second step is by convexity of divergence (Theorem 4.1). From this argument we conclude that
$$\inf_{Q_{X^n}:\ \mathbb{E}_Q[\sum X_i]>n\gamma} D(Q_{X^n}\|P_{X^n}) = n\cdot\inf_{Q:\ \mathbb{E}_Q[X]>\gamma} D(Q\|P) \tag{14.7}$$
$$\inf_{Q_{X^n}:\ \mathbb{E}_Q[\sum X_i]\ge n\gamma} D(Q_{X^n}\|P_{X^n}) = n\cdot\inf_{Q:\ \mathbb{E}_Q[X]\ge\gamma} D(Q\|P) \tag{14.8}$$
In particular, (14.4) and (14.7) imply the required lower bound in (14.1).
Next we prove (14.2). First, notice that the lower bound argument (14.4) applies equally well,
so that for each n we have
$$\frac{1}{n}\log\frac{1}{\mathbb{P}\big[\frac{1}{n}\sum_{k=1}^n X_k\ge\gamma\big]} \ge \inf_{Q:\ \mathbb{E}_Q[X]\ge\gamma} D(Q\|P).$$
• Case I: $P[X>\gamma]=0$. If $P[X\ge\gamma]=0$, then both sides of (14.2) are $+\infty$. If $P[X=\gamma]>0$, then $P[\sum X_k\ge n\gamma] = P[X_1=\dots=X_n=\gamma] = P[X=\gamma]^n$. For the right-hand side, since $D(Q\|P)<\infty \Rightarrow Q\ll P \Rightarrow Q(X\le\gamma)=1$, the only possibility for $\mathbb{E}_Q[X]\ge\gamma$ is that $Q(X=\gamma)=1$, i.e., $Q=\delta_\gamma$. Then $\inf_{\mathbb{E}_Q[X]\ge\gamma} D(Q\|P) = \log\frac{1}{P(X=\gamma)}$.
• Case II: $P[X>\gamma]>0$. Since $\mathbb{P}[\sum X_k\ge n\gamma]\ge\mathbb{P}[\sum X_k>n\gamma]$, from (14.1) we know that
$$\limsup_{n\to\infty}\frac{1}{n}\log\frac{1}{\mathbb{P}\big[\frac{1}{n}\sum_{k=1}^n X_k\ge\gamma\big]} \le \inf_{Q:\ \mathbb{E}_Q[X]>\gamma} D(Q\|P). \tag{14.9}$$
Indeed, let $\tilde P = P_{X|X>\gamma}$, which is well defined since $P[X>\gamma]>0$. For any Q such that $\mathbb{E}_Q[X]\ge\gamma$, the mixture $\tilde Q = \bar\epsilon Q + \epsilon\tilde P$ satisfies $\mathbb{E}_{\tilde Q}[X]>\gamma$. Then by convexity, $D(\tilde Q\|P) \le \bar\epsilon D(Q\|P) + \epsilon D(\tilde P\|P) = \bar\epsilon D(Q\|P) + \epsilon\log\frac{1}{P[X>\gamma]}$. Sending $\epsilon\to 0$, we conclude the proof of (14.9).
Denote the minimizing distribution Q by $Q^*$. The next result shows that, intuitively, the "line" between P and the optimal $Q^*$ is "orthogonal" to E.
[Figure: the convex set E of distributions on $\mathcal{X}$, the point P outside it, and its projection $Q^*\in E$.]
Theorem 14.2. Suppose $\exists Q^*\in E$ such that $D(Q^*\|P) = \min_{Q\in E} D(Q\|P)$; then $\forall Q\in E$
$$D(Q\|P) \ge D(Q\|Q^*) + D(Q^*\|P)$$
Proof. If D(QkP ) = ∞, then we’re done, so we can assume that D(QkP ) < ∞, which also implies
that D(Q∗ kP ) < ∞. For θ ∈ [0, 1], form the convex combination Q(θ) = θ̄Q∗ + θQ ∈ E. Since Q∗ is
the minimizer of D(QkP ), then1
$$0 \le \frac{\partial}{\partial\theta}\Big|_{\theta=0} D(Q^{(\theta)}\|P) = D(Q\|P) - D(Q\|Q^*) - D(Q^*\|P)$$
and we’re done.
1
This can be found by taking the derivative and matching terms (Exercise). Be careful with exchanging derivatives
and integrals. Need to use dominated convergence theorem similar as in the “local behavior of divergence” in
Proposition 4.1.
Remark 14.1. If we view the picture above in the Euclidean setting, the “triangle” formed by P ,
Q∗ and Q (for Q∗ , Q in a convex set, P outside the set) is always obtuse, and is a right triangle
only when the convex set has a “flat face”. In this sense, the divergence is similar to the squared
Euclidean distance, and the above theorem is sometimes known as a “Pythagorean” theorem.
The interesting set of Q’s that we will focus next is the “half-space” of distributions E = {Q :
EQ [X] ≥ γ}, where X : Ω → R is some fixed function (random variable). This is justified by
relation (to be established) with the large deviation exponent in Theorem 14.1. First, we solve this
I-projection problem explicitly.
Theorem 14.3. Given a distribution P on Ω and $X:\Omega\to\mathbb{R}$, let
$$A = \inf\psi_X' = \operatorname{essinf} X = \sup\{a : X\ge a\ P\text{-a.s.}\} \tag{14.10}$$
$$B = \sup\psi_X' = \operatorname{esssup} X = \inf\{b : X\le b\ P\text{-a.s.}\} \tag{14.11}$$
2. Whenever the minimum is finite, the minimizing distribution is unique and equal to the tilting of P along X, namely²
$$dP_\lambda = \exp\{\lambda X - \psi_X(\lambda)\}\cdot dP \tag{14.14}$$
Here λ is chosen to shift the mean from $\mathbb{E}_P[X]$ to γ; in particular $\lambda\ge 0$. Moreover, $\psi_X^*(\gamma) = \lambda\gamma - \psi_X(\lambda)$. Take any Q such that $\mathbb{E}_Q[X]\ge\gamma$; then
$$D(Q\|P) = \mathbb{E}_Q\Big[\log\frac{dQ\,dP_\lambda}{dP\,dP_\lambda}\Big] \tag{14.15}$$
$$= D(Q\|P_\lambda) + \mathbb{E}_Q\Big[\log\frac{dP_\lambda}{dP}\Big] = D(Q\|P_\lambda) + \mathbb{E}_Q[\lambda X - \psi_X(\lambda)]$$
$$\ge D(Q\|P_\lambda) + \lambda\gamma - \psi_X(\lambda) = D(Q\|P_\lambda) + \psi_X^*(\gamma) \ge \psi_X^*(\gamma), \tag{14.16}$$
where the last inequality holds with equality if and only if Q = Pλ . In addition, this shows the
minimizer is unique, proving the second claim. Note that even in the corner case of γ = B (assuming
P (X = B) > 0) the minimizer is a point mass Q = δB , which is also a tilted measure (P∞ ), since
Pλ → δB as λ → ∞, cf. Theorem 13.4.3.
Another version of the solution, given by expression (14.13), follows from Theorem 13.3.
For the third claim, notice that there is nothing to prove for γ < EP [X], while for γ ≥ EP [X]
we have just shown
∗
ψX (γ) = min D(QkP )
Q:EQ [X]≥γ
Corollary 14.1. For every Q with $\mathbb{E}_Q[X]\in(A,B)$, there exists a unique $\lambda\in\mathbb{R}$ such that the tilted distribution $P_\lambda$ satisfies $\mathbb{E}_{P_\lambda}[X]=\mathbb{E}_Q[X]$ and $D(P_\lambda\|P)\le D(Q\|P)$, and furthermore the gap in the last inequality equals $D(Q\|P_\lambda) = D(Q\|P) - D(P_\lambda\|P)$.
Proof. Proceed as in the proof of Theorem 14.3, and find the unique λ s.t. $\mathbb{E}_{P_\lambda}[X] = \psi_X'(\lambda) = \mathbb{E}_Q[X]$. Then $D(P_\lambda\|P) = \psi_X^*(\mathbb{E}_Q[X]) = \lambda\mathbb{E}_Q[X] - \psi_X(\lambda)$. Repeating the steps (14.15)–(14.16) we obtain $D(Q\|P) = D(Q\|P_\lambda) + D(P_\lambda\|P)$.
Remark: For any Q, this allows us to find a tilted measure Pλ that has the same mean yet
smaller (or equal) divergence.
[Figure: the space of distributions on $\mathbb{R}$, showing the one-parameter (exponential) family $\{P_\lambda\}$ starting at P (λ = 0), the slices $\{Q:\mathbb{E}_Q[X]=\gamma\}$ for γ ranging from A to B, the regions $Q\not\ll P$ at the extremes, and the projection $Q^*=P_\lambda$ at distance $D(P_\lambda\|P)=\psi^*(\gamma)$ from P.]
• Each set {Q : EQ [X] = γ} corresponds to a slice. As γ varies from A to B, the curves fill the
entire space except for the corner regions.
• As γ varies, the Pλ ’s trace out a curve via ψ ∗ (γ) = D(Pλ kP ). This set of distributions is
called a one parameter family, or exponential family.
Key Point: The one-parameter family curve intersects each γ-slice $E=\{Q:\mathbb{E}_Q[X]=\gamma\}$ "orthogonally" at the minimizing $Q^*\in E$, and the distance from P to $Q^*$ is given by $\psi^*(\gamma)$. To see this, note that applying Theorem 14.2 to the convex set E gives us $D(Q\|P)\ge D(Q\|Q^*)+D(Q^*\|P)$. Now thanks to Corollary 14.1, we in fact have equality $D(Q\|P) = D(Q\|Q^*)+D(Q^*\|P)$ and $Q^*=P_\lambda$ for some tilted measure.
Chernoff bound revisited: The proof of upper bound in Theorem 14.1 is via the definition of
information projection. Theorem 14.3 shows that the value of the information projection coincides
with the rate function (conjugate of log-MGF). This shows the optimality of the Chernoff bound
(recall Theorem 13.2.7). Indeed, we directly verify this for completeness: for all $\lambda\ge 0$,
$$\mathbb{P}\Big[\sum_{k=1}^n X_k\ge n\gamma\Big] \le e^{-n\gamma\lambda}\,(\mathbb{E}_P[e^{\lambda X}])^n = e^{-n(\lambda\gamma-\psi_X(\lambda))},$$
where we used the i.i.d.-ness of the $X_k$'s in the expectation. Optimizing over $\lambda\ge 0$ to get the best upper bound:
$$\sup_{\lambda\ge 0}\,\lambda\gamma-\psi_X(\lambda) = \sup_{\lambda\in\mathbb{R}}\,\lambda\gamma-\psi_X(\lambda) = \psi_X^*(\gamma),$$
where the first equality follows since $\gamma\ge\mathbb{E}_P[X]$, and therefore $\lambda\mapsto\lambda\gamma-\psi_X(\lambda)$ is increasing for $\lambda\le 0$. Hence
$$\mathbb{P}\Big[\sum_{k=1}^n X_k\ge n\gamma\Big] \le e^{-n\psi_X^*(\gamma)}. \tag{14.17}$$
Remark: The Chernoff bound is tight precisely because, from information projection, the lower
bound showed that the best change of measure is to change to the tilted measure Pλ .
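The identity behind this remark can be checked numerically. The sketch below (an illustration, not part of the notes; P = Bern(0.3) and γ = 0.5 are arbitrary choices) minimizes $D(Q\|P)$ over the half-space $\{Q:\mathbb{E}_Q[X]\ge\gamma\}$ by brute force and compares it with the rate function $\psi_X^*(\gamma)$ obtained as a Legendre–Fenchel transform; the two agree, as Theorem 14.3 predicts.

```python
import numpy as np

p, gamma = 0.3, 0.5                     # assumed example: P = Bern(p), half-space E_Q[X] >= gamma

def D(q):
    """D(Bern(q) || Bern(p))."""
    return q*np.log(q/p) + (1-q)*np.log((1-q)/(1-p))

def psi(lmbda):
    """log MGF of X ~ Bern(p)."""
    return np.log(1 - p + p*np.exp(lmbda))

# information projection: minimize D(Q||P) over Q = Bern(q) with E_Q[X] = q >= gamma
qs = np.linspace(gamma, 1 - 1e-9, 100_000)
proj_value = np.min(D(qs))

# rate function: psi_X^*(gamma) = sup_lambda (lambda*gamma - psi_X(lambda)), by grid search
lams = np.linspace(-30, 30, 300_001)
rate = np.max(lams*gamma - psi(lams))

print(proj_value, rate)                 # both equal the binary divergence d(gamma||p)
```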
14.4 Generalization: Sanov’s theorem
Theorem 14.4 (Sanov's Theorem). Consider observing n samples $X_1,\dots,X_n\sim$ i.i.d. P. Let $\hat P$ be the empirical distribution, i.e., $\hat P = \frac{1}{n}\sum_{j=1}^n\delta_{X_j}$. Let E be a convex set of distributions. Then under regularity conditions,¹
$$\mathbb{P}[\hat P\in E] = \exp\Big\{-n\min_{Q\in E} D(Q\|P) + o(n)\Big\}.$$
Note: Examples of regularity conditions: space X is finite and E is closed with non-empty interior;
space X is Polish and the set E is weakly closed and has non-empty interior.
Proof sketch. The lower bound is proved as in Theorem 14.1: Just take an arbitrary Q ∈ E and
apply a suitable version of WLLN to conclude Qn [P̂ ∈ E] = 1 + o(1).
For the upper bound we can again adapt the proof from Theorem 14.1. Alternatively, we can
write the convex set E as an intersection of half spaces. Then we’ve already solved the problem
for half-spaces {Q : EQ [X] ≥ γ}. The general case follows by the following consequence of
Theorem 14.2: if Q∗ is projection of P onto E1 and Q∗∗ is projection of Q∗ on E2 , then Q∗∗ is also
projection of P onto E1 ∩ E2 :
$$D(Q^{**}\|P) = \min_{Q\in E_1\cap E_2} D(Q\|P) \;\Leftarrow\; \begin{cases} D(Q^*\|P) = \min_{Q\in E_1} D(Q\|P)\\ D(Q^{**}\|Q^*) = \min_{Q\in E_2} D(Q\|Q^*)\end{cases}$$
and from here proceed by tilting from Q∗ to Q∗∗ and note that D(Q∗ kP ) + D(Q∗∗ kQ∗ ) = D(Q∗∗ kP ).
Remark: Sanov’s theorem tells us the probability that, after observing n iid samples of a
distribution, the empirical distribution is still far away from the true distribution, is exponentially
small.
§ 15. Hypothesis testing asymptotics II
Setup:
$$H_0: X^n\sim P^n \qquad H_1: X^n\sim Q^n \quad\text{(i.i.d.)}$$
test $P_{Z|X^n}: \mathcal{X}^n\to\{0,1\}$
specification: $1-\alpha = \pi_{1|0}^{(n)} \le 2^{-nE_0}$, $\quad\beta = \pi_{0|1}^{(n)} \le 2^{-nE_1}$
Bounds:
• achievability (Neyman–Pearson): $\alpha = 1-\pi_{1|0} = P[F_n>\tau]$, $\beta = \pi_{0|1} = Q[F_n>\tau]$
$$\psi_P^*(\theta) = \sup_{\lambda\in\mathbb{R}}\,\theta\lambda - \psi_P(\lambda)$$
Note that since $\psi_P(0)=\psi_P(1)=0$, from convexity $\psi_P(\lambda)$ is finite on $0\le\lambda\le 1$. Furthermore, assuming $P\ll Q$ and $Q\ll P$, we also have that $\lambda\mapsto\psi_P(\lambda)$ is continuous everywhere on [0,1] (on (0,1) it follows from convexity, but for the boundary points we need more detailed arguments). Consequently, all the results in this section apply under just the conditions $P\ll Q$ and $Q\ll P$. However, since in the previous lecture we were assuming that the log-MGF exists for all λ, we will only present proofs under this extra assumption.
Theorem 15.1. Let $P\ll Q$ and $Q\ll P$. Then the curve $(E_0(\theta),E_1(\theta)) = (\psi_P^*(\theta),\ \psi_P^*(\theta)-\theta)$, parametrized by $-D(P\|Q)\le\theta\le D(Q\|P)$, characterizes the best exponents on the boundary of the achievable region of $(E_0,E_1)$.
Note: The geometric interpretation of the above theorem is shown in the following picture, which rely
on the properties of ψP (λ) and ψP∗ (θ). Note that ψP (0) = ψP (1) = 0. Moreover, by Theorem 13.3
∗ ), θ 7→ E (θ) is increasing, θ 7→ E (θ) is decreasing.
(Properties of ψX 0 1
which generalizes the Kullback–Leibler divergence since $D_\lambda(P\|Q)\to D(P\|Q)$ as $\lambda\to 1$. Note that $\psi_P(\lambda) = (\lambda-1)D_\lambda(Q\|P) = -\lambda D_{1-\lambda}(P\|Q)$. This provides another explanation of why $\psi_P$ is negative between 0 and 1, and the slopes at the endpoints are $\psi_P'(0) = -D(P\|Q)$ and $\psi_P'(1) = D(Q\|P)$.
Corollary 15.1 (Bayesian criterion). Fix a prior $(\pi_0,\pi_1)$ such that $\pi_0+\pi_1=1$ and $0<\pi_0<1$. Denote the optimal Bayesian (average) error probability by $P_e^*(n) = \min_{P_{Z|X^n}}\{\pi_0\pi_{1|0}+\pi_1\pi_{0|1}\}$, with exponent
$$E \triangleq \lim_{n\to\infty}\frac{1}{n}\log\frac{1}{P_e^*(n)}.$$
Then
$$E = \max_\theta\min(E_0(\theta),E_1(\theta)) = \psi_P^*(0) = -\inf_{\lambda\in[0,1]}\psi_P(\lambda),$$
regardless of the prior, and $\psi_P^*(0)\triangleq C(P,Q)$ is called the Chernoff exponent.
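A minimal sketch of computing the Chernoff exponent for two concrete pmfs (the three-letter P and Q below are arbitrary examples, not from the notes). It uses the representation $\psi_P(\lambda) = \log\sum_x P(x)^{1-\lambda}Q(x)^\lambda$, which is consistent with $\psi_P(0)=\psi_P(1)=0$ and with the per-letter expression appearing in Remark 15.2 below.

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])           # assumed example pmfs on a 3-letter alphabet
Q = np.array([0.2, 0.2, 0.6])

def psi_P(lmbda):
    """psi_P(lambda) = log sum_x P(x)^(1-lambda) Q(x)^lambda; psi_P(0) = psi_P(1) = 0."""
    return np.log(np.sum(P**(1 - lmbda) * Q**lmbda))

lams = np.linspace(0.0, 1.0, 10_001)
vals = np.array([psi_P(l) for l in lams])
chernoff_exponent = -vals.min()          # C(P,Q) = -inf_{lambda in [0,1]} psi_P(lambda)
lam_star = lams[vals.argmin()]

print(f"C(P,Q) = {chernoff_exponent:.4f}, attained at lambda* = {lam_star:.3f}")

# sanity check: swapping the roles of P and Q gives the same exponent
psi_Q = lambda l: np.log(np.sum(Q**(1 - l) * P**l))
print(np.isclose(chernoff_exponent, -min(psi_Q(l) for l in lams)))
```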
Remark 15.2 (Bhattacharyya distance). There is an important special case in which Chernoff
exponent simplifies. Instead of i.i.d. observations, consider independent, but not identically
distributed observations. Namely, suppose that two hypotheses correspond to two different strings
xn and x̃n over a finite alphabet X . The hypothesis tester observes Y n = (Yj , j = 1, . . . n) obtained
by applying one of the strings to the input of the memoryless channel PY |X (the alphabet Y does
not need to be finite, but we assume this below). Extending Corollary it can be shown, that in this
case optimal probability Pe∗ (xn , x̃n ) has (Chernoff) exponent1
$$E = -\inf_{\lambda\in[0,1]}\frac{1}{n}\sum_{t=1}^n\log\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x_t)^\lambda\,P_{Y|X}(y|\tilde x_t)^{1-\lambda}.$$
If |X | = 2 and if compositions of xn and x̃n are equal (!), the expression is invariant under λ ↔ (1−λ)
and thus from convexity in λ we infer that $\lambda=\frac{1}{2}$ is optimal, yielding $E = \frac{1}{n}d_B(x^n,\tilde x^n)$, where
$$d_B(x^n,\tilde x^n) = -\sum_{t=1}^n\log\sum_{y\in\mathcal{Y}}\sqrt{P_{Y|X}(y|x_t)\,P_{Y|X}(y|\tilde x_t)}$$
is known as Bhattacharyya distance between codewords xn and x̃n . Without the two assumptions
stated, dB (·, ·) does not necessarily give the optimal error exponent. We do, however, always have
the bounds
$$\frac{1}{4}\,2^{-2d_B(x^n,\tilde x^n)} \le P_e^*(x^n,\tilde x^n) \le 2^{-d_B(x^n,\tilde x^n)}$$
with the upper bound being the more tight the more joint composition of (xn , x̃n ) resembles that of
(x̃n , xn ).
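A small numerical sketch of the Bhattacharyya distance (assumed setup, not from the notes: $P_{Y|X}$ = BSC(0.11) and two length-8 binary codewords of equal composition, so this is exactly the special case described in the remark). It evaluates $d_B$ in bits and the resulting two-sided bound on $P_e^*$.

```python
import numpy as np

delta = 0.11                                   # assumed example channel: BSC(delta)
PYX = np.array([[1 - delta, delta],            # PYX[x, y] = P_{Y|X}(y|x)
                [delta, 1 - delta]])

x  = np.array([0, 0, 1, 1, 0, 1, 0, 1])        # two hypothetical codewords,
xt = np.array([1, 0, 1, 0, 1, 0, 0, 1])        # chosen with equal compositions

# d_B(x, x~) = -sum_t log2 sum_y sqrt(P(y|x_t) P(y|x~_t)); log base 2 to match the 2^{-d_B} bounds
per_letter = np.sqrt(PYX[x] * PYX[xt]).sum(axis=1)
dB = -np.log2(per_letter).sum()

print(f"d_B = {dB:.4f} bits")
print(f"bounds: {0.25 * 2**(-2*dB):.3e} <= Pe* <= {2**(-dB):.3e}")
```

Only the positions where the two codewords differ contribute to $d_B$; positions where they agree contribute zero, as the formula shows.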
Proof of Theorem 15.1. The idea is to apply large deviation theory to the i.i.d. sum $\sum_{k=1}^n T_k$. Specifically, let's rewrite the bounds in terms of T:
Achievability: Using the Neyman–Pearson test, for fixed $\tau = -n\theta$, apply the large deviation theorem:
$$1-\alpha = \pi_{1|0}^{(n)} = P\Big[\sum_{k=1}^n T_k\ge n\theta\Big] = 2^{-n\psi_P^*(\theta)+o(n)}, \quad\text{for } \theta\ge\mathbb{E}_P T = -D(P\|Q)$$
$$\beta = \pi_{0|1}^{(n)} = Q\Big[\sum_{k=1}^n T_k<n\theta\Big] = 2^{-n\psi_Q^*(\theta)+o(n)}, \quad\text{for } \theta\le\mathbb{E}_Q T = D(Q\|P)$$
1
In short, the tilting parameter λ does not need to change between coordinates t corresponding to different values
of (xt , x̃t ).
Notice that by the definition of T we have
where the distribution Pλ 2 is tilting of P along T , cf. (14.14), which moves from P0 = P to
P1 = Q as λ ranges from 0 to 1:
Proof. The first part is verified trivially. Indeed, if we fix λ and let θ(λ) , EPλ [T ], then from (13.13)
we have
D(Pλ kP ) = ψP∗ (θ) ,
whereas
dPλ dPλ dP
D(Pλ kQ) = EPλ log = EPλ log = D(Pλ kP ) − EPλ [T ] = ψP∗ (θ) − θ .
dQ dP dQ
Also from (13.12) we know that as λ ranges in [0, 1] the mean θ = EPλ [T ] ranges from −D(P kQ) to
D(QkP ).
To prove the second claim (15.4), the key observation is the following: Since Q is itself a tilting
of P along T (with λ = 1), the following two families of distributions
are in fact the same family with Qλ0 = Pλ0 +1 .
Now, suppose that Q∗ achieves the minimum in (15.4) and that Q∗ 6= Q, Q∗ 6= P (these cases
should be verified separately). Note that we have not shown that this minimum is achieved, but it
will be clear that our argument can be extended to the case of when Q0n is a sequence achieving the
infimum. Then, on one hand, obviously
D(Q∗ kP ) ≤ D(QkP ) .
Therefore,
$$\mathbb{E}_{Q^*}[T] = \mathbb{E}_{Q^*}\Big[\log\frac{dQ^*}{dP}\,\frac{dQ}{dQ^*}\Big] = D(Q^*\|P) - D(Q^*\|Q) \in [-D(P\|Q),\,D(Q\|P)]. \tag{15.5}$$
Next, we have from Corollary 14.1 that there exists a unique Pλ with the following three properties:3
EPλ [T ] = EQ∗ [T ]
D(Pλ kP ) ≤ D(Q∗ kP )
D(Pλ kQ) ≤ D(Q∗ kQ)
Thus, we immediately conclude that minimization in (15.4) can be restricted to Q∗ belonging to the
family of tilted distributions {Pλ , λ ∈ R}. Furthermore, from (15.5) we also conclude that λ ∈ [0, 1].
Hence, characterization of E1∗ (E0 ) given by (15.3) coincides with the one given by (15.4).
3
Small subtlety: In Corollary 14.1 we ask EQ∗ [T ] ∈ (A, B). But A, B – the essential range of T – depend on the
distribution under which the essential range is computed, cf. (14.10). Fortunately, we have Q P and P Q, so
essential range is the same under both P and Q. And furthermore (15.5) implies that EQ∗ [T ] ∈ (A, B).
Note: The geometric interpretation of (15.4) is as follows: As λ increases from 0 to 1, or equivalently, θ increases from $-D(P\|Q)$ to $D(Q\|P)$, the optimal distribution traverses down the curve. This curve is in essence a geodesic connecting P to Q, and the exponents $E_0$, $E_1$ measure distances to P and Q. It may initially sound strange that the sum of distances to the endpoints varies along the geodesic, but it is a natural phenomenon: just consider the unit circle with the metric induced by the ambient Euclidean metric. Then if p and q are two antipodal points, the distances from an intermediate point to the endpoints do not sum up to $d(p,q)=2$.
Different realizations of Xk are informative to different levels, the total “information” we receive
follows a random process. Therefore, instead of fixing the sample size n, we can make n a stopping
time τ , which gives a “better” (E0 , E1 ) tradeoff. Solution is the concept of sequential test:
• Informally: Sequential test Z at each step declares either “H0 ”, “H1 ” or “give me one more
sample”.
The easiest way to see why sequential tests may be dramatically superior to fixed-sample-size
tests is the following example: Consider P = 12 δ0 + 12 δ1 and Q = 12 δ0 + 12 δ−1 . Since P 6⊥ Q, we
also have P n 6⊥ Qn . Consequently, no finite-sample-size test can achieve zero error under both
hypotheses. However, an obvious sequential test (wait for the first appearance of ±1) achieves zero
error probability with finite average number of samples (2) under both hypotheses. This advantage
is also very clear in the achievable error exponents as the next figure shows.
for large l0 , l1 , then the following inequality for the exponents holds
E0 E1 ≤ D(P kQ)D(QkP ).
4
This assumption is satisfied for discrete distributions on finite spaces.
with the optimal boundary achieved by the sequential probability ratio test SPRT(A, B) (A, B are large positive numbers) defined as follows:
$$\tau = \inf\{n: S_n\ge B \text{ or } S_n\le -A\}, \qquad Z = \begin{cases} 0, & \text{if } S_\tau\ge B\\ 1, & \text{if } S_\tau<-A\end{cases}$$
where
$$S_n = \sum_{k=1}^n\log\frac{P(X_k)}{Q(X_k)}$$
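Here is a minimal simulation sketch of SPRT(A, B) (not from the notes; P = Bern(0.7), Q = Bern(0.3) and A = B = 10 nats are arbitrary choices). It estimates the error probability $\pi_{1|0}$ and the expected sample size $\mathbb{E}_P[\tau]$ under $H_0$, and compares them with the bound $e^{-A}$ and the approximation $B/D(P\|Q)$ used in the achievability argument below.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, A, B = 0.7, 0.3, 10.0, 10.0          # assumed example distributions and thresholds

def llr(x):
    """one-step log likelihood ratio log P(x)/Q(x) for Bernoulli observations."""
    return np.log(p / q) if x == 1 else np.log((1 - p) / (1 - q))

def sprt(stream):
    """Run SPRT(A, B) on an iterator of observations; return (decision, stopping time)."""
    S, n = 0.0, 0
    for x in stream:
        n += 1
        S += llr(x)
        if S >= B:
            return 0, n        # decide H0 (data ~ P)
        if S <= -A:
            return 1, n        # decide H1 (data ~ Q)

def stream(bias):
    while True:
        yield rng.binomial(1, bias)

# simulate under H0: estimate pi_{1|0} and the average sample size l0 = E_P[tau]
runs = [sprt(stream(p)) for _ in range(5000)]
decisions, taus = zip(*runs)
DPQ = p*np.log(p/q) + (1-p)*np.log((1-p)/(1-q))   # D(P||Q)
print(f"pi_1|0 ~= {np.mean(np.array(decisions) == 1):.2e}  (bound e^-A = {np.exp(-A):.2e})")
print(f"E_P[tau] ~= {np.mean(taus):.1f}  (approx B/D(P||Q) = {B/DPQ:.1f})")
```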
$$\mathbb{E}_Q[S_\tau] = -\,\mathbb{E}_Q[\tau]\,D(Q\|P).$$
$$\tilde M_n \triangleq M_{\min(\tau,n)}$$
To that end, consider the decomposition
$$\mathbf{1}_E = \sum_{n\ge 0}\mathbf{1}_{E\cap\{\tau=n\}}.$$
By the monotone convergence theorem applied to (15.10), it is sufficient to verify that for every n
$$\mathbb{E}_P[\mathbf{1}_{E\cap\{\tau=n\}}] = \mathbb{E}_Q[\exp\{S_\tau\}\mathbf{1}_{E\cap\{\tau=n\}}]. \tag{15.11}$$
This, however, follows from the fact that $E\cap\{\tau=n\}\in\mathcal{F}_n$ and $\frac{dP|_{\mathcal{F}_n}}{dQ|_{\mathcal{F}_n}} = \exp\{S_n\}$ by the very definition of $S_n$.
We now proceed to the proof. For achievability we apply (15.10) to infer
π1|0 = P[Sτ ≤ −A]
= EQ [exp{Sτ }1{Sτ ≤ −A}]
≤ e−A
Next, we denote $\tau_0 = \inf\{n: S_n\ge B\}$ and observe that $\tau\le\tau_0$, whereas the expectation of $\tau_0$ we estimate from (15.8):
$$\mathbb{E}_P[\tau]\,D(P\|Q) \le \mathbb{E}_P[\tau_0]\,D(P\|Q) = \mathbb{E}_P[S_{\tau_0}] \le B + c_0,$$
where in the last step we used the boundedness assumption to infer $S_{\tau_0}\le B+c_0$. Thus
$$l_0 = \mathbb{E}_P[\tau] \le \mathbb{E}_P[\tau_0] \le \frac{B+c_0}{D(P\|Q)} \approx \frac{B}{D(P\|Q)} \quad\text{for large } B.$$
Similarly we can show $\pi_{0|1}\le e^{-B}$ and $l_1\le\frac{A}{D(Q\|P)}$ for large A. Taking $B = l_0 D(P\|Q)$ and $A = l_1 D(Q\|P)$ shows the achievability.
Converse: Assume $(E_0,E_1)$ is achievable for large $l_0,l_1$ and apply the data processing inequality for divergence:
$$d(\mathbb{P}(Z=1)\,\|\,\mathbb{Q}(Z=1)) \le D(\mathbb{P}\|\mathbb{Q})\big|_{\mathcal{F}_\tau} = \mathbb{E}_P[S_\tau] = \mathbb{E}_P[\tau]\,D(P\|Q) = l_0\,D(P\|Q), \quad\text{from (15.8)}.$$
Notice that for $l_0E_0$ and $l_1E_1$ large, we have $d(\mathbb{P}(Z=1)\|\mathbb{Q}(Z=1)) \approx l_1E_1$; therefore $l_1E_1 \lesssim l_0 D(P\|Q)$. Similarly we can show that $l_0E_0 \lesssim l_1 D(Q\|P)$. Finally, we have
$$E_0E_1 \le D(P\|Q)\,D(Q\|P), \quad\text{as } l_0,l_1\to\infty.$$
Part IV
Channel coding
§ 16. Channel coding
Figure 16.1: When X = Y, the decoding regions can be pictured as a partition of the space, each
containing one codeword.
Note: The underlying probability space for channel coding problems will always be
$$W \xrightarrow{\ f\ } X \xrightarrow{\ P_{Y|X}\ } Y \xrightarrow{\ g\ } \hat W$$
1
For randomized encoder/decoders, we identify f and g as probability transition kernels PX|W and PŴ |Y .
When the source alphabet is [M], the joint distribution is given by:
$$\text{(general)}\quad P_{WXY\hat W}(m,a,b,\hat m) = \frac{1}{M}P_{X|W}(a|m)\,P_{Y|X}(b|a)\,P_{\hat W|Y}(\hat m|b)$$
$$\text{(deterministic } f,g\text{)}\quad P_{WXY\hat W}(m,c_m,b,\hat m) = \frac{1}{M}P_{Y|X}(b|c_m)\,\mathbf{1}\{b\in D_{\hat m}\}$$
Throughout the notes, these quantities will be called:
• W – original message
• X – channel input
• Y – channel output
• Ŵ – decoded message
3. Bit error rate: in the special case when $M=2^k$, we identify W with a k-bit string $S^k\in\mathbb{F}_2^k$. Then the bit error rate is $P_b\triangleq\frac{1}{k}\sum_{j=1}^k\mathbb{P}[S_j\ne\hat S_j]$, which is the average fraction of errors in a k-bit block. It is also convenient to introduce in this case the Hamming distance
$$d_H(S^k,\hat S^k)\triangleq\#\{i: S_i\ne\hat S_i\}.$$
Then the bit error rate becomes the normalized expected Hamming distance:
$$P_b = \frac{1}{k}\mathbb{E}[d_H(S^k,\hat S^k)].$$
To distinguish the bit error rate Pb from the previously defined Pe (resp. Pe,max ), we will also
call the latter the average (resp. max) block error rate.
The most typical metric is average probability of error, but the others will be used occasionally in
the course as well. By definition, Pe ≤ Pe,max . Therefore the maximum error probability is a more
stringent criterion which offers uniform protection for all codewords.
Remark: log2 M ∗ gives the maximum number of bits that we can pump through a channel PY |X
while still guaranteeing the error probability (in the appropriate sense) is at most .
Example: The random transformation BSC(n, δ) (binary symmetric channel) is defined as
$$\mathcal{X} = \{0,1\}^n, \qquad \mathcal{Y} = \{0,1\}^n, \qquad Y^n = X^n\oplus Z^n,$$
where $Z^n\stackrel{\text{i.i.d.}}{\sim}\mathrm{Bern}(\delta)$. Pictorially, the BSC(n, δ) channel takes a binary sequence of length n and flips each bit independently with probability δ.
Question: When δ = 0.11, n = 1000, what is the maximum number of bits you can send with
Pe ≤ 10−3 ?
Ideas:
0. Can one send 1000 bits with $P_e\le 10^{-3}$? No: the probability that at least one bit is flipped is $P_e = 1-(1-\delta)^n\approx 1$. This implies that uncoded transmission does not meet our objective and coding is necessary – tradeoff: reduce the number of bits to send, increase the probability of success.
1. Repetition coding: repeat each of the k information bits l times and decode each bit by majority vote (a numerical check appears after this list). The probability of error of this scheme is $P_e\approx k\,\mathbb{P}[\mathrm{Binom}(l,\delta)>l/2]$ with $kl\le n = 1000$, which for $P_e\le 10^{-3}$ gives l = 21, k = 47 bits.
3. Shannon’s theorem (to be shown soon) tells us that over memoryless channel of blocklength n
the fundamental limit satisfies
log M ∗ = nC + o(n) (16.1)
as $n\to\infty$ and for arbitrary $\epsilon\in(0,1)$. Here $C = \max_{P_{X_1}} I(X_1;Y_1)$ is the capacity of the single-letter channel. In our case we have
$$\max_{P_X} I(X;Y) = \max_{P_X} I(X;X+Z) = \log 2 - h(\delta) \approx \frac{1}{2}\ \text{bit}.$$
So Shannon’s expansion (16.1) can be used to predict (non-rigorously, of course) that it should
be possible to send around 500 bits reliably. As it turns out, for this blocklength this is not
quite possible.
4. Even though calculating log M ∗ is not computationally feasible (searching over all codebooks
is doubly exponential in block length n), we can find bounds on log M ∗ that are easy to
compute. We will show later in the course that in fact, for BSC(1000, .11)
5. The first codes to approach the bounds on log M ∗ are called Turbo codes (after the turbocharger
engine, where the exhaust is fed back in to power the engine). This class of codes is known as
sparse graph codes, of which LDPC codes are particularly well studied. As a rule of thumb,
these codes typically approach 80 . . . 90% of log M ∗ when n ≈ 103 . . . 104 .
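A quick numerical check of the repetition-code numbers quoted in item 1 above (l = 21, k = 47): the sketch below evaluates the same union bound over the k bits used in the text, so it is an illustration of that estimate rather than an exact error analysis.

```python
from math import comb

delta, Pe_target = 0.11, 1e-3
l, k = 21, 47                       # repetition factor and message size quoted in item 1

# probability that one repeated bit is decoded wrongly by an l-fold majority vote
p_bit = sum(comb(l, j) * delta**j * (1 - delta)**(l - j) for j in range(l // 2 + 1, l + 1))
Pe = k * p_bit                      # union bound over the k information bits

print(f"per-bit error = {p_bit:.2e}, Pe <~ {Pe:.2e} (target {Pe_target:.0e}), uses k*l = {k*l} <= 1000")
```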
16.2.1 Determinism
1. Given any encoder f : [M ] → X , the decoder that minimizes Pe is the Maximum A Posteriori
(MAP) decoder, or equivalently, the Maximal Likelihood (ML) decoder, since the codewords
are equiprobable (W is uniform):
$$g^*(y) = \operatorname*{argmax}_{m\in[M]}\,P[W=m\mid Y=y] = \operatorname*{argmax}_{m\in[M]}\,P[Y=y\mid W=m]$$
Each u in the expectation gives a deterministic encoder, hence there is a deterministic encoder
that is at least as good as the average of the collection, i.e., ∃u0 s.t. Pe (u0 ) ≤ P[W 6= Ŵ ]
Remark: If instead we use maximal probability of error as our performance criterion, then
these results don’t hold; randomized encoders and decoders may improve performance. Example:
consider M = 2 and we are back to the binary hypotheses testing setup. The optimal decoder (test)
that minimizes the maximal Type-I and II error probability, i.e., max{1 − α, β}, is not deterministic,
if max{1 − α, β} is not achieved at a vertex of the region R(P, Q).
where the first inequality is obvious and the second follow from the union bound. Taking expectation
of the above expression gives the theorem.
Theorem 16.2 (Assouad). If M = 2k then
Pb ≥ min{P[Ŵ = c0 |W = c] : c, c0 ∈ Fk2 , dH (c, c0 ) = 1} .
Proof. Let $e_i$ be the length-k vector that is 1 in the i-th position and zero everywhere else. Then
$$\sum_{i=1}^k\mathbf{1}\{S_i\ne\hat S_i\} \ge \sum_{i=1}^k\mathbf{1}\{S^k = \hat S^k + e_i\}$$
Next, notice
dH (x, y) ≥ r1{dH (x, y) = r}
and take the expectation with x ∼ A, y ∼ B.
Remark: In statistics, Assouad’s Lemma is a useful tool for obtaining lower bounds on the
minimax risk of an estimator. Say the data X is distributed according to Pθ parameterized by θ ∈ Rk
and let $\hat\theta = \hat\theta(X)$ be an estimator for θ. The goal is to minimize the maximal risk $\sup_{\theta\in\Theta}\mathbb{E}_\theta[\|\theta-\hat\theta\|_1]$. A lower bound (Bayesian) to this worst-case risk is the average risk $\mathbb{E}[\|\theta-\hat\theta\|_1]$, where θ is distributed according to any prior. Consider θ uniformly distributed on the hypercube $\{0,\epsilon\}^k$ with side length ε embedded in the space of parameters. Then
$$\inf_{\hat\theta}\sup_{\theta\in\{0,\epsilon\}^k}\mathbb{E}[\|\theta-\hat\theta\|_1] \ge \frac{k\epsilon}{4}\min_{d_H(\theta,\theta')=1}(1-\mathrm{TV}(P_\theta,P_{\theta'})). \tag{16.4}$$
This can be proved using ideas similar to Theorem 16.2. WLOG, assume that ε = 1.
$$\mathbb{E}[\|\theta-\hat\theta\|_1] \stackrel{(a)}{\ge} \frac{1}{2}\mathbb{E}[\|\theta-\hat\theta_{\rm dis}\|_1] = \frac{1}{2}\mathbb{E}[d_H(\theta,\hat\theta_{\rm dis})] \ge \frac{1}{2}\sum_{i=1}^k\min_{\hat\theta_i=\hat\theta_i(X)}\mathbb{P}[\theta_i\ne\hat\theta_i] \stackrel{(b)}{=} \frac{1}{4}\sum_{i=1}^k(1-\mathrm{TV}(P_{X|\theta_i=0},P_{X|\theta_i=1})) \stackrel{(c)}{\ge} \frac{k}{4}\min_{d_H(\theta,\theta')=1}(1-\mathrm{TV}(P_\theta,P_{\theta'})).$$
Here $\hat\theta_{\rm dis}$ is the discretized version of $\hat\theta$, i.e., the closest point on the hypercube to $\hat\theta$, and so (a) follows from $|\theta_i-\hat\theta_i|\ge\frac{1}{2}\mathbf{1}\{|\theta_i-\hat\theta_i|>1/2\} = \frac{1}{2}\mathbf{1}\{\theta_i\ne\hat\theta_{{\rm dis},i}\}$, (b) follows from optimal binary hypothesis testing for $\theta_i$ given X, and (c) follows from the convexity of TV:
$$\mathrm{TV}(P_{X|\theta_i=0},P_{X|\theta_i=1}) = \mathrm{TV}\Big(\tfrac{1}{2^{k-1}}\sum_{\theta:\theta_i=0}P_{X|\theta},\ \tfrac{1}{2^{k-1}}\sum_{\theta:\theta_i=1}P_{X|\theta}\Big) \le \tfrac{1}{2^{k-1}}\sum_{\theta:\theta_i=0}\mathrm{TV}(P_{X|\theta},P_{X|\theta\oplus e_i}) \le \max_{d_H(\theta,\theta')=1}\mathrm{TV}(P_\theta,P_{\theta'}).$$
Alternatively, (c) also follows by providing the extra information $\theta_{\setminus i}$ and allowing $\hat\theta_i=\hat\theta_i(X,\theta_{\setminus i})$ in the second line.
2. When $M=2^k$:
$$\log M \le \frac{\sup_X I(X;Y)}{\log 2 - h(P_b)}$$
2. Now $S^k\to X\to Y\to\hat S^k$. Recall from Theorem 5.1 that for i.i.d. $S^k$, $\sum I(S_i;\hat S_i)\le I(S^k;\hat S^k)$. This gives us
$$\sup_X I(X;Y) \ge I(X;Y) \ge \sum_{i=1}^k I(S_i;\hat S_i) \ge \sum_{i=1}^k d\Big(\mathbb{P}[S_i=\hat S_i]\,\Big\|\,\frac{1}{2}\Big) \ge k\,d\Big(\frac{1}{k}\sum_{i=1}^k\mathbb{P}[S_i=\hat S_i]\,\Big\|\,\frac{1}{2}\Big) = k\,d\Big(1-P_b\,\Big\|\,\frac{1}{2}\Big) = k(\log 2 - h(P_b)),$$
where the second line used Fano’s inequality (Theorem 5.4) for binary random variable (or
divergence data processing), and the third line used the convexity of divergence.
$$i(X;Y) = \log\frac{P_{Y|X}(Y|X)}{P_Y(Y)},$$
§ 17. Channel coding: achievability bounds
Notation: in the following proofs, we shall make use of the independent pairs $(X,Y)\perp\!\!\!\perp(\bar X,\bar Y)$, where $(\bar X,\bar Y)$ is an independent copy of $(X,Y)$.
Notice a subtle difference between (17.1) and (17.2) for the continuous case: In (17.2) the Radon-
Nikodym derivative is only defined up to sets of measure zero, therefore whenever PX (x) = 0 the
value of PY (i(x, Y ) > t) is undefined. This problem does not occur with definition (17.1), and that
is why we prefer it. In any case, for discrete X , Y, or under other regularity conditions, all the
definitions are equivalent.
Proposition 17.1 (Properties of information density).
2. If there is a bijective transformation (X, Y ) → (A, B), then almost surely iPXY (X; Y ) =
iPAB (A; B) and in particular, distributions of i(X; Y ) and i(A; B) coincide.
3. (Conditioning and unconditioning trick) Suppose that f (y) = 0 and g(x) = 0 whenever
i(x; y) = −∞, then1
Proof. The proof is simply a change of measure. For example, to see (17.3), note
$$\mathbb{E}f(Y) = \sum_{y\in\mathcal{Y}}P_Y(y)f(y) = \sum_{y\in\mathcal{Y}}P_{Y|X}(y|x)\,\frac{P_Y(y)}{P_{Y|X}(y|x)}\,f(y);$$
notice that by the assumption on f (·), the summation is valid even if for some y we have that
PY |X (y|x) = 0. Similarly, E[f (x, Y )] = E[exp{−i(x; Y )}f (x, Y )|X = x]. Integrating over x ∼ PX
gives (17.5). The rest are by interchanging X and Y .
Corollary 17.1.
Remark 17.4. We have used this trick before: for any probability measure P and any measure Q,
$$Q\Big[\log\frac{dP}{dQ}\ge t\Big] \le \exp(-t). \tag{17.9}$$
for example, in hypothesis testing (Corollary 12.1). In data compression, we frequently used the fact
that |{x : log PX (x) ≥ t}| ≤ exp(−t), which is also of the form (17.9) with Q = counting measure.
¹Note that (17.3) holds when i(x; y) is defined as $i = \log\frac{dP_{Y|X}}{dP_Y}$, and (17.4) holds when i(x; y) is defined as $i = \log\frac{dP_{X|Y}}{dP_X}$. (17.5) and (17.6) hold under either of the definitions. Since in the following we shall only make use of (17.3) and (17.5), this is another reason we adopted definition (17.1).
17.2 Shannon’s achievability bound
Theorem 17.1 (Shannon’s achievability bound). For a given PY |X , ∀PX , ∀τ > 0, ∃(M, )-code
with
Proof. Recall that for a given codebook {c1 , . . . , cM }, the optimal decoder is MAP, or equivalently,
ML, since the codewords are equiprobable:
$$g^*(y) = \operatorname*{argmax}_{m\in[M]}\,P_{Y|X}(y|c_m).$$
The step of maximizing the likelihood can make analyzing the error probability difficult. Similar to
what we did in almost loss compression (e.g., Theorem 7.4), the magic in showing the following
two achievability bounds is to consider a suboptimal decoder. In Shannon’s bound, we consider a
threshold-based suboptimal decoder g(y) as follows:
$$g(y) = \begin{cases} m, & \exists!\,c_m \text{ s.t. } i(c_m;y)\ge\log M+\tau\\ e, & \text{otherwise.}\end{cases}$$
Interpretation
i(cm ; y) ≥ log M + τ ⇔ PX|Y (cm |y) ≥ M exp(τ )PX (cm ),
i.e., the likelihood of cm being the transmitted codeword conditioned on receiving y exceeds some
threshold.
For a given codebook (c1 , . . . , cM ), the error probability is:
where W is uniform on [M ].
We generate the codebook (c1 , . . . , cM ) randomly with cm ∼ PX i.i.d. for m ∈ [M ]. By symmetry,
the error probability averaging over all possible codebooks is given by:
$$\mathbb{E}[P_e(c_1,\dots,c_M)] = \mathbb{E}[P_e(c_1,\dots,c_M)\mid W=1]$$
$$= \mathbb{P}[\{i(c_1;Y)\le\log M+\tau\}\cup\{\exists m\ne 1,\ i(c_m;Y)>\log M+\tau\}\mid W=1]$$
$$\le \mathbb{P}[i(c_1;Y)\le\log M+\tau\mid W=1] + \sum_{m=2}^M\mathbb{P}[i(c_m;Y)>\log M+\tau\mid W=1] \quad\text{(union bound)}$$
$$= \mathbb{P}[i(X;Y)\le\log M+\tau] + (M-1)\,\mathbb{P}[i(\bar X;Y)>\log M+\tau] \quad\text{(random codebook)}$$
$$\le \mathbb{P}[i(X;Y)\le\log M+\tau] + (M-1)\exp(-(\log M+\tau)) \quad\text{(by Corollary 17.1)}$$
$$\le \mathbb{P}[i(X;Y)\le\log M+\tau] + \exp(-\tau).$$
Finally, since the error probability averaged over the random codebook satisfies the upper bound,
there must exist some code allocation whose error probability is no larger than the bound.
Remark 17.5 (Typicality).
• The property of a pair (x, y) satisfying the condition {i(x; y) ≥ γ} can be interpreted as “joint
typicality”. Such version of joint typicality is useful when random coding is done in product
spaces with cj ∼ PXn (i.e. coordinates of the codeword are iid).
• A popular alternative to the definition of typicality is to require that the empirical joint
distribution is close to the true joint distribution, i.e., P̂xn ,yn ≈ PXY , where
$$\hat P_{x^n,y^n}(a,b) = \frac{1}{n}\,\#\{j: x_j=a,\ y_j=b\}.$$
This definition is natural for cases when random coding is done with cj ∼ uniform on the set
{xn : P̂xn ≈ PX } (type class).
$$\mathbb{P}[\hat W\ne j\mid W=j] = \mathbb{P}[i(c_j;Y)\le\gamma\mid W=j] + \mathbb{P}[i(c_j;Y)>\gamma,\ \exists k\in[j-1] \text{ s.t. } i(c_k;Y)>\gamma\mid W=j]$$
$$\le \mathbb{P}[i(c_j;Y)\le\gamma\mid W=j] + \sum_{k=1}^{j-1}\mathbb{P}[i(c_k;Y)>\gamma\mid W=j].$$
Averaging over the randomly generated codebook, the expected error probability is upper bounded
by:
$$\mathbb{E}[P_e(c_1,\dots,c_M)] = \frac{1}{M}\sum_{j=1}^M\mathbb{P}[\hat W\ne j\mid W=j]$$
$$\le \frac{1}{M}\sum_{j=1}^M\Big(\mathbb{P}[i(X;Y)\le\gamma] + \sum_{k=1}^{j-1}\mathbb{P}[i(\bar X;Y)>\gamma]\Big)$$
$$= \mathbb{P}[i(X;Y)\le\gamma] + \frac{M-1}{2}\,\mathbb{P}[i(\bar X;Y)>\gamma]$$
$$= \mathbb{P}[i(X;Y)\le\gamma] + \frac{M-1}{2}\,\mathbb{E}[\exp(-i(X;Y))\mathbf{1}\{i(X;Y)>\gamma\}] \quad\text{(by (17.3))}$$
$$= \mathbb{E}\Big[\mathbf{1}\{i(X;Y)\le\gamma\} + \frac{M-1}{2}\exp(-i(X;Y))\mathbf{1}\{i(X;Y)>\gamma\}\Big]$$
$$= \mathbb{E}\Big[\min\Big(1,\frac{M-1}{2}\exp(-i(X;Y))\Big)\Big] \qquad (\gamma=\log\tfrac{M-1}{2}\text{ minimizes the upper bound})$$
$$= \mathbb{E}\Big[\exp\Big\{-\Big(i(X;Y)-\log\frac{M-1}{2}\Big)^+\Big\}\Big].$$
To optimize over γ, note the simple observation that $U\mathbf{1}_E + V\mathbf{1}_{E^c}\ge\min\{U,V\}$, with equality iff $U\ge V$ on E. Therefore for any x, y, $\mathbf{1}[i(x;y)\le\gamma] + \frac{M-1}{2}e^{-i(x;y)}\mathbf{1}[i(x;y)>\gamma] \ge \min\big(1,\frac{M-1}{2}e^{-i(x;y)}\big)$, achieved by $\gamma=\log\frac{M-1}{2}$ regardless of x, y.
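To see what the DT bound gives numerically, the sketch below evaluates (17.11) for the BSC(1000, 0.11) example of §16 with the uniform input; the message-set size $2^{400}$ is an arbitrary illustrative choice, not a claim from the notes. With uniform $X^n$, the information density depends only on the number k of flipped bits, so the expectation reduces to a sum over a Binomial pmf.

```python
import numpy as np
from scipy.stats import binom

n, delta = 1000, 0.11
log2M = 400                                      # assumed rate: M = 2^400 messages

# Under uniform X^n on {0,1}^n: i(X^n;Y^n) = n + k*log2(delta) + (n-k)*log2(1-delta) bits,
# where k = d_H(X^n, Y^n) ~ Binom(n, delta).
k = np.arange(n + 1)
i_bits = n + k * np.log2(delta) + (n - k) * np.log2(1 - delta)
pk = binom.pmf(k, n, delta)

# DT bound (17.11): eps <= E[ 2^{-(i - log2((M-1)/2))^+} ]   (everything measured in bits)
thresh = log2M - 1                               # log2((M-1)/2) ~= log2(M) - 1
eps_DT = np.sum(pk * np.exp2(-np.maximum(i_bits - thresh, 0.0)))
print(f"DT bound: with 2^{log2M} messages over BSC({n}, {delta}), eps <= {eps_DT:.2e}")
```

The resulting bound (on the order of 10⁻³–10⁻⁴ for this choice) is consistent with the earlier back-of-the-envelope discussion that roughly 400–500 bits is the right scale for this blocklength.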
Note: Dependence-testing: The RHS of (17.11) is equivalent to the minimum error probability of the following Bayesian hypothesis testing problem:
$$H_0: X,Y\sim P_{X,Y} \quad\text{versus}\quad H_1: X,Y\sim P_XP_Y, \qquad \text{prior prob.: } \pi_0 = \frac{2}{M+1},\ \pi_1 = \frac{M-1}{M+1}.$$
Note that $(X,Y)\sim P_{X,Y}$ and $(\bar X,Y)\sim P_XP_Y$, where X is the sent codeword and $\bar X$ is the unsent codeword. As we know from binary hypothesis testing, the best threshold for the LRT to minimize the weighted probability of error is $\log\frac{\pi_1}{\pi_0}$.
Note: In Theorem 17.2 we have avoided minimizing over τ in Shannon’s bound (17.10) to get the
minimum upper bound in Theorem 17.1. Moreover, DT bound is stronger than the best Shannon’s
bound (with optimized τ ).
Note: Similar to the random coding achievability bound of almost lossless compression (Theorem
7.4), in Theorem 17.1 and Theorem 17.2 we only need the random codewords to be pairwise
independent.
Remark 17.6 (Comparison with Shannon's bound). We can also interpret (17.12) as saying that for fixed M there exists an $(M,\epsilon)_{\max}$-code whose maximal error probability is bounded as
$$\epsilon \le \mathbb{P}[i(X;Y)<\log\gamma] + \frac{M}{\gamma}.$$
Taking $\log\gamma = \log M+\tau$ gives a bound of exactly the same form as (17.10). However, the
two are proved in seemingly quite different ways: Shannon’s bound is by random coding, while
Feinstein’s bound is by greedily selecting the codewords. Nevertheless, Feinstein’s bound is stronger
in the sense that it concerns about the maximal error probability instead of the average.
Proof. Recall the goal is to find codewords $c_1,\dots,c_M\in\mathcal{X}$ and disjoint subsets (decoding regions) $D_1,\dots,D_M\subset\mathcal{Y}$ s.t.
$$P_{Y|X}(D_i|c_i) \ge 1-\epsilon, \quad \forall i\in[M].$$
The idea is to construct a codebook of size M in a greedy way.
For every $x\in\mathcal{X}$, associate to it a preliminary decoding region defined as follows:
$$E_x \triangleq \{y: i(x;y)\ge\log\gamma\}.$$
Notice that the preliminary decoding regions $\{E_x\}$ may be overlapping, and we will trim them into final decoding regions $\{D_x\}$, which will be disjoint.
We can assume that $\mathbb{P}[i(X;Y)<\log\gamma]\le\epsilon$, for otherwise the R.H.S. of (17.12) is negative and there is nothing to prove. We first claim that there exists some c such that $P_Y[E_c|X=c]\ge 1-\epsilon$. We show this by contradiction: assume that $\forall c\in\mathcal{X}$, $\mathbb{P}[i(c;Y)\ge\log\gamma|X=c]<1-\epsilon$; then averaging over $c\sim P_X$ we get $\mathbb{P}[i(X;Y)\ge\log\gamma]<1-\epsilon$, which is a contradiction.
Then we construct the codebook in the following greedy way:
1. Pick $c_1$ to be any codeword such that $P_Y[E_{c_1}|X=c_1]\ge 1-\epsilon$, and set $D_1=E_{c_1}$;
2. Pick $c_2$ to be any codeword such that $P_Y[E_{c_2}\setminus D_1|X=c_2]\ge 1-\epsilon$, and set $D_2=E_{c_2}\setminus D_1$;
...
M. Pick $c_M$ to be any codeword such that $P_Y[E_{c_M}\setminus\cup_{j=1}^{M-1}D_j|X=c_M]\ge 1-\epsilon$, and set $D_M=E_{c_M}\setminus\cup_{j=1}^{M-1}D_j$. We stop when no more codewords can be found, i.e., M is determined by the stopping condition:
$$\forall c\in\mathcal{X},\quad P_Y[E_c\setminus\cup_{j=1}^M D_j\mid X=c] < 1-\epsilon.$$
§ 18. Linear codes. Channel capacity
This time we shall use a shortcut to prove the above bounds, in which case $P_e = P_{e,\max}$.
$$c\in\mathcal{C} \;\Leftrightarrow\; c\in\text{row span of } G \;\Leftrightarrow\; c\in\ker H, \text{ for some } H\in\mathbb{F}_q^{(n-k)\times n} \text{ s.t. } HG^T = 0.$$
Note: For linear codes, the codebook is a k-dimensional linear subspace of Fnq (ImG or KerH). The
matrix H is called a parity check matrix.
Example: (Hamming code) The [7,4,3]₂ Hamming code over $\mathbb{F}_2$ is a linear code with $G=[I;P]$, and $H=[-P^T;I]$ is a parity check matrix; one standard choice is
$$G = \begin{pmatrix} 1&0&0&0&1&1&0\\ 0&1&0&0&1&0&1\\ 0&0&1&0&0&1&1\\ 0&0&0&1&1&1&1\end{pmatrix}, \qquad H = \begin{pmatrix} 1&1&0&1&1&0&0\\ 1&0&1&1&0&1&0\\ 0&1&1&1&0&0&1\end{pmatrix}.$$
The seven bits $x_1,\dots,x_7$ can be pictured in a three-circle Venn diagram, one circle per parity check.
Note:
• Parity check: all four bits in the same circle sum up to zero.
Linear codes are almost always examined with channels of additive noise, a precise definition of
which is given below:
Definition 18.2 (Additive noise). PY |X is additive-noise over Fnq if
PY |X (y|x) = PZ n (y − x) ⇔ Y = X + Z n where Z n ⊥
⊥X
Now: Given a linear code and an additive-noise channel PY |X , what can we say about the
decoder?
Theorem 18.1. Any [k, n]Fq linear code over an additive-noise PY |X has a maximum likelihood
decoder g : Fnq → Fnq such that:
1. $g(y) = y - g_{\rm synd}(Hy^T)$, i.e., the decoder is a function of the "syndrome" $Hy^T$ only;
3. $P_{e,\max} = P_e$,
where $g_{\rm synd}:\mathbb{F}_q^{n-k}\to\mathbb{F}_q^n$, defined by $g_{\rm synd}(s) = \operatorname*{argmax}_{z:\,Hz^T=s}P_Z(z)$, is called the "syndrome decoder", which decodes the most likely realization of the noise.
Proof. 1. The maximum likelihood decoder for a linear code is $g(y) = \operatorname*{argmax}_{c\in\mathcal{C}}P_Z(y-c) = y - \operatorname*{argmax}_{z:\,Hz^T=Hy^T}P_Z(z) = y - g_{\rm synd}(Hy^T)$.
3. For any u,
$$\mathbb{P}[\hat W\ne u\mid W=u] = \mathbb{P}[g(c_u+Z)\ne c_u] = \mathbb{P}[c_u+Z-g_{\rm synd}(Hc_u^T+HZ^T)\ne c_u] = \mathbb{P}[g_{\rm synd}(HZ^T)\ne Z].$$
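To make the syndrome decoder concrete, here is a small sketch for the [7,4,3] Hamming code with the matrices chosen in the example above, over a BSC with δ < 1/2 (so the most likely noise realization for a given syndrome is the one of minimum weight). It builds the syndrome table, encodes a message, flips one bit, and verifies that decoding recovers the codeword.

```python
import numpy as np
import itertools

# one standard choice of G = [I | P], H = [P^T | I] for the [7,4,3] Hamming code (see example above)
P = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]])
G = np.hstack([np.eye(4, dtype=int), P])
H = np.hstack([P.T, np.eye(3, dtype=int)])       # H G^T = 0 over F_2

# syndrome table: for each syndrome s, the minimum-weight z with H z^T = s
# (= most likely noise under a BSC with delta < 1/2), cf. g_synd in Theorem 18.1
synd_table = {}
for z in itertools.product([0, 1], repeat=7):
    z = np.array(z)
    s = tuple(H @ z % 2)
    if s not in synd_table or z.sum() < synd_table[s].sum():
        synd_table[s] = z

def encode(u):                  # u in F_2^4
    return u @ G % 2

def decode(y):                  # g(y) = y - g_synd(H y^T); minus = plus over F_2
    return (y + synd_table[tuple(H @ y % 2)]) % 2

u = np.array([1, 0, 1, 1])
c = encode(u)
y = c.copy(); y[5] ^= 1         # introduce a single bit flip
print("single error corrected:", np.array_equal(decode(y), c))   # True, since d_min = 3
```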
Remark 18.2. The analogy between Theorem 17.2 and Theorem 18.2 is the same as Theorem 9.4
and Theorem 9.5 (full random coding vs random linear codes).
Proof. Recall that in proving Shannon's achievability bound or the DT bound (Theorems 17.1 and 17.2), we selected the codewords $c_1,\dots,c_M$ i.i.d. $\sim P_X$ and showed that
$$\mathbb{E}[P_e(c_1,\dots,c_M)] \le \mathbb{P}[i(X;Y)\le\gamma] + \frac{M-1}{2}\,\mathbb{P}[i(\bar X;Y)\ge\gamma].$$
As noted after the proof of the DT bound, we only need the random codewords to be pairwise
independent. Here we will adopt a similar approach. Note that M = q k .
Let's first do a quick check of the capacity-achieving input distribution for $P_{Y|X}$ with additive noise over $\mathbb{F}_q^n$:
$$\max_{P_X}I(X;Y) = \max_{P_X}H(Y)-H(Y|X) = \max_{P_X}H(Y)-H(Z^n) = n\log q - H(Z^n) \;\Rightarrow\; P_X^*\ \text{uniform on } \mathbb{F}_q^n.$$
Step 1: Generate the codewords as $c_u = uG+h$, $\forall u\in\mathbb{F}_q^k$, where G and h are drawn from the new ensemble: the $k\times n$ entries of G and the $1\times n$ entries of h are i.i.d. uniform over $\mathbb{F}_q$. We add the dithering to eliminate the special role that the all-zero codeword plays (since it is contained in any linear codebook).
Step 2: The codewords are uniform and pairwise independent: $c_u\sim$ uniform on $\mathbb{F}_q^n$, and for $u'\ne u$,
$$c_{u'} = u'G+h = uG+h+(u'-u)G = c_u+(u'-u)G.$$
Step 3: Repeating the same argument as in the proof of the DT bound for symmetric and pairwise independent codewords, we have
$$\mathbb{E}[P_e(c_1,\dots,c_M)] \le \mathbb{P}[i(X;Y)\le\gamma] + \frac{M-1}{2}\,\mathbb{P}[i(\bar X;Y)\ge\gamma]$$
$$\Rightarrow\; P_e \le \mathbb{E}\Big[\exp\Big\{-\Big(i(X;Y)-\log\frac{M-1}{2}\Big)^+\Big\}\Big] = \mathbb{E}\Big[q^{-\big(i(X;Y)-\log_q\frac{q^k-1}{2}\big)^+}\Big] \le \mathbb{E}\Big[q^{-\big(i(X;Y)-k\big)^+}\Big],$$
where we used M = q k and picked the base of log to be q.
Step 4: compute i(X;Y):
$$i(a;b) = \log_q\frac{P_{Z^n}(b-a)}{q^{-n}} = n - \log_q\frac{1}{P_{Z^n}(b-a)},$$
therefore
$$P_e \le \mathbb{E}\Big[q^{-\big(n-k-\log_q\frac{1}{P_{Z^n}(Z^n)}\big)^+}\Big] \tag{18.2}$$
Step 5: Kill h. We claim that there exists a linear code without dithering such that (18.2) is
satisfied. Indeed shifting a codebook has no impact on its performance. We modify the
coding scheme with G, h which achieves the bound in the following way: modify the
decoder input Y 0 = Y − h, then when cu is sent, the additive noise PY 0 |X becomes then
Y 0 = uG + h + Z n − h = uG + Z n , which is equivalent to that the linear code generated by
G is used. Note that this is possible thanks to the additivity of the noisy channel.
Remark 18.3. • The ensemble cu = uG + h has the pairwise independence property. The joint
entropy H(c1 , . . . , cM ) = H(G) + H(h) = (nk + n) log q is significantly smaller than Shannon’s
“fully random” ensemble we used in the previous lecture. Recall that in that ensemble each cj
was selected independently uniform over Fnq , implying H(c1 , . . . , cM ) = q k n log q. Question:
where minimum is over all distributions with P [ci = a, cj = b] = q −2n when i 6= j (pairwise
independent, uniform codewords). Note that H(c1 , . . . , cM ) ≥ H(c1 , c2 ) = 2n log q. Similarly,
we may ask for (ci , cj ) to be uniform over all pairs of distinct elements. In this case Wozencraft
ensemble for the case of n = 2k achieves H(c1 , . . . , cqk ) ≈ 2n log q.
• With some non-zero probability G may fail to be full rank [Exercise: Find P [rank(G) < k] as a
function of n, k, q!]. In such a case, there are two identical codewords and hence Pe,max ≥ 1/2.
There are two alternative ensembles of codes which do not contain such degenerate codebooks:
Analysis of random coding over such an ensemble is similar, except that this time $(X,\bar X)$ have distribution
$$P_{X,\bar X} = \frac{1}{q^{2n}-q^n}\,\mathbf{1}\{X\ne\bar X\},$$
uniform on all pairs of distinct codewords, and not pairwise independent.
18.2 Channels and channel capacity
Basic question of data transmission: How many bits can one transmit reliably if one is allowed to
use the channel n times?
• input alphabet A
• output alphabet B
Note: we do not insist on PY n |X n to have any relation for different n, but it is common that
the conditional distribution of the first k letters of the n-th transformation is in fact a function of
only the first k letters of the input and this function equals PY k |X k – the k-th transformation. Such
channels, in particular, are non-anticipatory: channel outputs are causal functions of channel inputs.
Channel characteristics:
Pyn |xn = PZ n (y n − xn ) ⇔ Y n = X n + Z n .
• DMC (discrete memoryless stationary channel): A DMC can be specified in two ways:
Example:
Definition 18.4 (Fundamental Limits). For any channel, $M^*(n,\epsilon) = \max\{M:\exists\,(n,M,\epsilon)\text{-code}\}$.
Definition 18.5 (Channel capacity). The ε-capacity $C_\epsilon$ and the Shannon capacity C are
$$C_\epsilon \triangleq \liminf_{n\to\infty}\frac{1}{n}\log M^*(n,\epsilon), \qquad C = \lim_{\epsilon\to 0^+}C_\epsilon.$$
Remark 18.4.
• This operational definition of the capacity represents the maximum achievable rate at which one can communicate through a channel with probability of error less than ε. In other words, for any R < C, there exists an $(n,\exp(nR),\epsilon_n)$-code such that $\epsilon_n\to 0$.
• Typically, the ε-capacity behaves like the plot below on the left-hand side, where $C_0$ is called the zero-error capacity, which represents the maximal achievable rate with no error. Often $C_0=0$, meaning that without tolerating any error, no information can be transmitted. If $C_\epsilon$ is constant for all ε (see the plot on the right-hand side), then we say that the strong converse holds (more on this later).
[Figure: typical behavior of $C_\epsilon$ as a function of ε. Left: $C_\epsilon$ starts at the zero-error capacity $C_0$ at ε = 0 and increases with ε. Right: $C_\epsilon$ is constant on (0, 1), i.e., the strong converse holds.]
Proposition 18.2 (Equivalent definitions of $C_\epsilon$ and C).
Proof. This trivially follows from applying the definition of $M^*(n,\epsilon)$ (DIY).
Question: Why do we define the capacities $C_\epsilon$ and C with respect to the average probability of error, rather than $C_\epsilon^{(\max)}$ and $C^{(\max)}$ defined with respect to the maximal probability of error? It turns out that these two definitions are equivalent, as the next theorem shows.
Theorem 18.3. For all $\tau\in(0,1)$,
$$\tau\,M^*(n,\epsilon(1-\tau)) \le M^*_{\max}(n,\epsilon) \le M^*(n,\epsilon).$$
Proof. The second inequality is obvious, since any code that achieves maximal error probability ε also achieves average error probability ε.
For the first inequality, take an $(n,M,\epsilon(1-\tau))$-code, and define the error probability for the j-th codeword as
$$\lambda_j \triangleq \mathbb{P}[\hat W\ne j\mid W=j].$$
Then
$$M\epsilon(1-\tau) \ge \sum_j\lambda_j = \sum_j\lambda_j\mathbf{1}\{\lambda_j\le\epsilon\} + \sum_j\lambda_j\mathbf{1}\{\lambda_j>\epsilon\} \ge \epsilon\,|\{j:\lambda_j>\epsilon\}|.$$
Hence $|\{j:\lambda_j>\epsilon\}|\le(1-\tau)M$. [Note that this is exactly Markov's inequality!] Now by removing those codewords¹ whose $\lambda_j$ exceeds ε, we can extract an $(n,\tau M,\epsilon)_{\max}$-code. Finally, take $M=M^*(n,\epsilon(1-\tau))$ to finish the proof.
Corollary 18.1 (Capacity under maximal probability of error). $C_\epsilon^{(\max)} = C_\epsilon$ for all ε > 0 such that $C_\epsilon = C_{\epsilon-}$. In particular, $C^{(\max)} = C$.²
Proof. Using the definition of $M^*$ and the previous theorem, for any fixed τ > 0,
$$C_\epsilon \ge C_\epsilon^{(\max)} \ge \liminf_{n\to\infty}\frac{1}{n}\log\big(\tau M^*(n,\epsilon(1-\tau))\big) \ge C_{\epsilon(1-\tau)}.$$
Sending $\tau\to 0$ yields $C_\epsilon \ge C_\epsilon^{(\max)} \ge C_{\epsilon-}$.
Remark: This quantity is not the same as the Shannon capacity, and has no direct operational
interpretation as a quantity related to coding. Rather, it is best to think of this only as taking the
n-th random transformation in the channel, maximizing over input distributions, then normalizing
and looking at the limit of this sequence.
Next we give coding theorems to relate information capacity (information measures) to
Shannon capacity (operational quantity).
Theorem 18.4 (Upper Bound for $C_\epsilon$). For any channel, $\forall\epsilon\in[0,1)$, $C_\epsilon\le\frac{C_i}{1-\epsilon}$ and $C\le C_i$.
Remark 18.5. The above result, known as Shannon’s Noisy Channel Theorem, is perhaps
the most significant result in information theory. For communications engineers, the major surprise
was that C > 0, i.e. communication over a channel is possible with strictly positive rate for any
arbitrarily small probability of error. This result influenced the evolution of communication systems
to block architectures that used bits as a universal currency for data, along with encoding and
decoding procedures.
Before giving the proof of Theorem 18.5, we show the second equality in (18.5). Notice that
Ci for stationary memoryless channels is easy to compute: Rather than solving an optimization
problem for each n and taking the limit of n → ∞, computing Ci boils down to maximizing mutual
information for n = 1. This type of result is known as “single-letterization” in information theory.
Proposition 18.3 (Memoryless input is optimal for memoryless channels).
Ci = sup I(X; Y ).
PX
Proof. Recall that for product kernels $P_{Y^n|X^n} = \prod P_{Y_i|X_i}$, we have $I(X^n;Y^n)\le\sum_{k=1}^n I(X_k;Y_k)$, with equality when the $X_i$'s are independent. Then
$$C_i = \liminf_{n\to\infty}\frac{1}{n}\sup_{P_{X^n}}I(X^n;Y^n) = \liminf_{n\to\infty}\sup_{P_X}I(X;Y) = \sup_{P_X}I(X;Y).$$
Proof of Theorem 18.5. Fix any $P_X$ and let $P_{X^n}=P_X^n$ (i.i.d. product). Recall Shannon's or Feinstein's achievability bound (Theorem 17.1 or 17.3): for any n, M and any τ > 0, there exists an $(n,M,\epsilon_n)$-code s.t.
$$\epsilon_n \le \mathbb{P}[i(X^n;Y^n)\le\log M+\tau] + \exp(-\tau).$$
Here the information density is
$$i(X^n;Y^n) = \log\frac{dP_{Y^n|X^n}}{dP_{Y^n}}(Y^n|X^n) = \sum_{k=1}^n\log\frac{dP_{Y|X}}{dP_Y}(Y_k|X_k) = \sum_{k=1}^n i(X_k;Y_k),$$
which is a sum of i.i.d. r.v.'s with mean I(X;Y). Set $\log M = n(I(X;Y)-2\delta)$ for δ > 0, and taking $\tau=\delta n$ in Shannon's bound, we have
$$\epsilon_n \le \mathbb{P}\Big[\sum_{k=1}^n i(X_k;Y_k)\le nI(X;Y)-\delta n\Big] + \exp(-\delta n) \xrightarrow{n\to\infty} 0,$$
2. The speed of achieving capacity: Suppose we want to achieve 90% of the capacity, we want
to know how long do we need wait? The blocklength is a good proxy for delay. In other
words, we want to know how fast the gap to capacity vanish as blocklength grows. Shannon’s
theorem shows that the gap C − n1 log M ∗ (n, ) = o(1). Next theorem shows that under proper
conditions, the o(1) term is in fact O( √1n ).
The main tool in the proof of Theorem 18.5 is the WLLN. The lower bound C ≥ Ci in
Theorem 18.5 shows that log M ∗ (n, ) ≥ nC + o(n) (since normalizing by n and taking the liminf
must result in something ≥ C). If instead we do a more refined analysis using the CLT, we find
Theorem 18.7. For any stationary memoryless channel with $C = \max_{P_X}I(X;Y)$ (i.e. $\exists P_X^* = \operatorname{argmax}_{P_X}I(X;Y)$) such that $V = \operatorname{Var}[i(X^*;Y^*)]<\infty$, we have
$$\log M^*(n,\epsilon) \ge nC - \sqrt{nV}\,Q^{-1}(\epsilon) + o(\sqrt n),$$
where Q(·) is the complementary Gaussian CDF and $Q^{-1}(\cdot)$ is its functional inverse.
Proof. Writing the little-o notation in terms of liminf, our goal is
$$\liminf_{n\to\infty}\frac{\log M^*(n,\epsilon)-nC}{\sqrt{nV}} \ge -Q^{-1}(\epsilon) = \Phi^{-1}(\epsilon),$$
where Φ(t) is the standard normal CDF.
Recall Feinstein's bound; choosing $\log M = nC+\sqrt{nV}\,t$ and applying the CLT to the i.i.d. sum $\sum_k i(X_k^*;Y_k^*)$ gives
$$\liminf_{n\to\infty}\frac{\log M^*(n,\epsilon)-nC}{\sqrt{nV}} \ge t \quad \forall t \text{ s.t. } \Phi(t)<\epsilon.$$
Taking $t\nearrow\Phi^{-1}(\epsilon)$ and writing the liminf in little-o form completes the proof:
$$\log M^*(n,\epsilon) \ge nC - \sqrt{nV}\,Q^{-1}(\epsilon) + o(\sqrt n).$$
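Applying this normal approximation to the BSC(1000, 0.11) example of §16 gives a concrete estimate of how many bits are achievable at ε = 10⁻³. The sketch below is an illustration (not from the notes): it computes C and V directly from the two values the per-letter information density takes under the uniform input, and evaluates $nC - \sqrt{nV}\,Q^{-1}(\epsilon)$ using scipy's inverse survival function.

```python
import numpy as np
from scipy.stats import norm

n, delta, eps = 1000, 0.11, 1e-3

# For the BSC with the capacity-achieving uniform input, the per-letter information density
# takes two values (in bits): 1 + log2(1-delta) w.p. 1-delta, and 1 + log2(delta) w.p. delta.
vals  = np.array([1 + np.log2(1 - delta), 1 + np.log2(delta)])
probs = np.array([1 - delta, delta])
C = np.dot(probs, vals)                         # = 1 - h(delta)
V = np.dot(probs, (vals - C) ** 2)              # Var[i(X*;Y*)]

approx = n * C - np.sqrt(n * V) * norm.isf(eps)  # nC - sqrt(nV) * Q^{-1}(eps)
print(f"C = {C:.4f} bit, V = {V:.4f} bit^2")
print(f"normal approximation: log2 M*(n={n}, eps={eps}) >~ {approx:.0f} bits")
```

The result is a few hundred bits (noticeably below nC ≈ 500), which matches the earlier observation that at this blocklength one cannot quite reach the capacity prediction.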
18.4 Examples of DMC
Binary symmetric channel
[Figure: the BSC(δ) transition diagram (each bit flipped with probability δ) and the plot of the capacity C = 1 − h(δ) as a function of δ: 1 bit at δ ∈ {0, 1} and 0 at δ = 1/2.]
$$Y = X+Z, \qquad Z\sim\mathrm{Bern}(\delta)\perp\!\!\!\perp X$$
Capacity of the BSC:
$$C = \sup_{P_X}I(X;Y) = 1-h(\delta)$$
Proof. I(X; X + Z) = H(X + Z) − H(X + Z|X) = H(X + Z) − H(Z) ≤ 1 − h(δ), with equality iff
X ∼ Bern(1/2).
Note: More generally, for all additive-noise channel over a finite abelian group G, C = supPX I(X; X+
Z) = log |G| − H(Z), achieved by uniform X.
Binary erasure channel
[Figure: the BEC(δ) transition diagram — each input bit is erased with probability δ.]
Capacity of the BEC:
$$C = \sup_{P_X}I(X;Y) = 1-\delta\ \text{bits}$$
Note: Without evaluating Shannon's formula, it is clear that $C\le 1-\delta$, because a δ-fraction of the message is lost; i.e., even if the encoder knew a priori where the erasures were going to occur, the rate still could not exceed $1-\delta$.
18.5* Information Stability
We saw that C = Ci for stationary memoryless channels, but what other channels does this hold
for? And what about non-stationary channels? To answer this question, we introduce the notion of
information stability.
Definition 18.7. A channel is called information stable if there exists a sequence of input distribu-
tion {PX n , n = 1, 2, . . .} such that
$$\frac{1}{n}\,i(X^n;Y^n) \xrightarrow{P} C_i.$$
For example, we can pick PX n = (PX∗ )n for stationary memoryless channels. Therefore stationary
memoryless channels are information stable.
The purpose for defining information stability is the following theorem.
Theorem 18.8. For an information stable channel, C = Ci .
Proof. Like the stationary, memoryless case, the upper bound comes from the general converse Theo-
rem 16.4, and the lower bound uses a similar strategy as Theorem 18.5, except utilizing the definition
of information stability in place of WLLN.
The next theorem gives conditions to check for information stability in memoryless channels
which are not necessarily stationary.
Theorem 18.9. A memoryless channel is information stable if there exists {Xk∗ : k ≥ 1} such that
both of the following hold:
$$\frac{1}{n}\sum_{k=1}^n I(X_k^*;Y_k^*) \to C_i \tag{18.6}$$
$$\sum_{n=1}^\infty\frac{1}{n^2}\operatorname{Var}[i(X_n^*;Y_n^*)] < \infty. \tag{18.7}$$
where convergence to 0 follows from Kronecker lemma (Lemma 18.1 to follow) applied with
bn = n2 , xn = Var[i(Xn∗ ; Yn∗ )]/n2 .
The second part follows from the first. Indeed, notice that
n
1X
Ci = lim inf sup I(Xk ; Yk ) .
n→∞ n P Xk
k=1
Now select PXk∗ such that
I(Xk∗ ; Yk∗ ) ≥ sup I(Xk ; Yk ) − 2−k .
P Xk
(Note that each supPX I(Xk ; Yk ) ≤ log min{|A|, |B|} < ∞.) Then, we have
k
n
X n
X
I(Xk∗ ; Yk∗ ) ≥ sup I(Xk ; Yk ) − 1 ,
k=1 k=1 PXk
and hence normalizing by n we get (18.6). We next show that for any joint distribution PX,Y we
have
Var[i(X; Y )] ≤ 2 log2 (min(|A|, |B|)) . (18.9)
The argument is symmetric in X and Y , so assume for concreteness that |B| < ∞. Then
E[i2 (X; Y )]
Z X
2 2
, dPX (x) PY |X (y|x) log PY |X (y|x) + log PY (y) − 2 log PY |X (y|x) · log PY (y)
A y∈B
Z X
≤ dPX (x) PY |X (y|x) log2 PY |X (y|x) + log2 PY (y) (18.10)
A y∈B
Z X X
= dPX (x) PY |X (y|x) log2 PY |X (y|x) + PY (y) log2 PY (y)
A y∈B y∈B
Z
≤ dPX (x)g(|B|) + g(|B|) (18.11)
A
=2g(|B|) ,
where (18.10) is because 2 log PY |X (y|x) · log PY (y) is always non-negative, and (18.11) follows
because each term in square-brackets can be upper-bounded using the following optimization
problem:
X n
g(n) , sup aj log2 aj . (18.12)
P n
aj ≥0: j=1 aj =1 j=1
Since the x log2 x has unbounded derivative at the origin, the solution of (18.12) is always in the
interior of [0, 1]n . Then it is straightforward to show that for n > e the solution is actually aj = n1 .
For n = 2 it can be found directly that g(2) = 0.5629 log2 2 < log2 2. In any case,
Finally, because of the symmetry, a similar argument can be made with |B| replaced by |A|.
Proof. Since bn ’s are strictly increasing, we can split up the summation and bound them from above
n
X m
X n
X
bk xk ≤ bm xk + bk xk
k=1 k=1 k=m+1
Since this holds for any m, we can make the last term arbitrarily small.
Important example: For jointly Gaussian (X, Y ) we always have bounded variance:
cov[X, Y ]
Var[i(X; Y )] = ρ2 (X, Y ) log2 e ≤ log2 e , ρ(X, Y ) = p . (18.13)
Var[X] Var[Y ]
Indeed, first notice that we can always represent Y = X̃ + Z with X̃ = aX ⊥⊥ Z. On the other
hand, we have
log e x̃2 + 2x̃z σ2 2
i(x̃; y) = − 2 2z , z , y − x̃ .
2 σY2 σY σZ
From here by using Var[·] = Var[E[·|X̃]] + Var[·|X̃] we need to compute two terms separately:
2
X̃ 2 − σX̃
log e 2
σZ
E[i(X̃; Y )|X̃] = 2 ,
2 σY
and hence
2 log2 e 4
Var[E[i(X̃; Y )|X̃]] = σX̃ .
4σY4
On the other hand,
2 log2 e 2 2 4
Var[i(X̃; Y )|X̃] = [4σX̃ σZ + 2σX̃ ].
4σY4
Putting it all together we get (18.13). Inequality (18.13) justifies information stability of all sorts of
Gaussian channels (memoryless and with memory), as we will see shortly.
§ 19. Channels with input constraints. Gaussian channels.
[Figure: the input space $\mathcal{A}^n$ with the constraint set $F_n\subset\mathcal{A}^n$ of admissible codewords (dots) inside it.]
Definition 19.2 (Separable cost constraint). A channel with separable cost constraint is specified
as follows:
1. A, B: input/output spaces
2. PY n |X n : An → B n , n = 1, 2, . . .
3. Cost function c : A → R̄
Input constraint: the average per-letter cost of a codeword $x^n$ must satisfy (with slight abuse of notation)
$$c(x^n) = \frac{1}{n}\sum_{k=1}^n c(x_k) \le P.$$
Example: $\mathcal{A}=\mathcal{B}=\mathbb{R}$
• Average power constraint (separable):
$$\frac{1}{n}\sum_{i=1}^n|x_i|^2 \le P \quad\Leftrightarrow\quad \|x^n\|_2 \le \sqrt{nP}.$$
Definition 19.3. Some basic definitions in parallel with the channel capacity without input
constraint.
• Information capacity:
$$C_i(P) = \liminf_{n\to\infty}\frac{1}{n}\sup_{P_{X^n}:\,\mathbb{E}[\sum_{k=1}^n c(X_k)]\le nP}I(X^n;Y^n)$$
• Information stability: the channel is information stable if for all (admissible) P there exists a sequence of channel input distributions $P_{X^n}$ such that the following two properties hold:
$$\frac{1}{n}\,i_{P_{X^n,Y^n}}(X^n;Y^n)\xrightarrow{P}C_i(P) \tag{19.1}$$
$$\mathbb{P}[c(X^n)>P+\delta]\to 0 \quad \forall\delta>0. \tag{19.2}$$
Note: These are the usual definitions, except that in Ci (P ), we are Pn permitted to maximize
I(X ; Y ) using input distributions from the constraint set {PX n : E[ k=1 c(Xk )] ≤ nP } instead
n n
2. One of the following is true: f (P ) is continuous and finite on (P0 , ∞), or f = ∞ on (P0 , ∞).
Proof. In the first part all statements are obvious, except for concavity, which follows from the
concavity of PX 7→ I(X; Y ). For any PXi such that E [c(Xi )] ≤ Pi , i = 0, 1, let X ∼ λ̄PX0 + λPX1 .
Then E [c(X)] ≤ λ̄P0 + λP1 and I(X; Y ) ≥ λ̄I(X0 ; Y0 ) + λI(X1 ; Y1 ). Hence f (λ̄P0 + λP1 ) ≥
λ̄f (P0 ) + λf (P1 ). The second claim follows from concavity of f (·).
To extend these results to Ci (P ) observe that for every n
1
P 7→ sup I(X n ; Y n )
n PX n :E[c(X n )]≤P
is concave. Then taking lim inf n→∞ the same holds for Ci (P ).
An immediate consequence is that memoryless input is optimal for memoryless channel with
separable cost, which gives us the single-letter formula of the information capacity:
Corollary 19.1 (Single-letterization). Information capacity of stationary memoryless channel with
separable cost:
Ci (P ) = f (P ) = sup I(X; Y ).
E[c(X)]≤P
Proof. Ci (P ) ≥ f (P ) is obvious by using PX n = (PX )n . For “≤”, use the concavity of f (·), we have
that for any PX n ,
Xn n
X X n
1
I(X n ; Y n ) ≤ I(Xj ; Yj ) ≤ f (E[c(Xj )])≤nf E[c(Xj )] ≤ nf (P ).
n
j=1 j=1 j=1
?
19.2 Capacity under input constraint C(P ) = Ci (P )
Theorem 19.1 (General weak converse).
Ci (P )
C (P ) ≤
1−
Proof. The argument is the same as before: Take any (n, M, , P )-code, W → X n → Y n → Ŵ .
Apply Fano’s inequality, we have
203
Note: when F = X , it reduces to the original Feinstein’s Lemma.
Proof. Similar to the proof of the original Feinstein’s Lemma, define the preliminary decoding
regions Ec = {y : i(c; y) ≥ log γ} for all c ∈ X . Sequentially pick codewords {c1 , . . . , cM } from
the set F and the final decoding region {D1 , . . . , DM } where Dj , Ecj \ ∪j−1
k=1 Dk . The stopping
criterion is that M is maximal, i.e.,
∀x0 ∈ F, PY [Ex0 \ ∪M
j=1 Dj X = x0 ] < 1 −
⇔ ∀x0 ∈ X , PY [Ex \ ∪M c
j=1 Dj X = x0 ] < (1 − )1[x0 ∈ F ] + 1[x0 ∈ F ]
0
From here, we can complete the proof by following the same steps as in the proof of Feinstein’s
lemma (Theorem 17.3).
Theorem 19.3 (Achievability). For any information stable channel with input constraints and
P > P0 we have
C(P ) ≥ Ci (P ) (19.3)
Proof. Let us consider a special case of the stationary memoryless channel (the proof for general
information stable channel follows similarly). So we assume PY n |X n = (PY |X )n .
Fix n ≥ 1. Since the channel is stationary memoryless, we have PY n |X n = (PY |X )n . Choose a
PX such that E[c(X)] < P , Pick log M = n(I(X; P Y ) − 2δ) and log γ = n(I(X; Y ) − δ).
With the input constraint set Fn = {xn : n1 c(xk ) ≤ P }, and iid input distribution PX n = PXn ,
we apply the extended Feinstein’s Lemma, there exists an (n, M, n , P )max -code with the encoder
satisfying input constraint F and the error probability
P
Also, since E[c(X)] < P , by WLLN, we have PX n (Fn ) = P ( n1 c(xk ) ≤ P ) → 1.
n (1 + o(1)) ≤ o(1)
⇒ n → 0 as n → ∞
⇒ ∀, ∃n0 , s.t. ∀n ≥ n0 , ∃(n, M, n , P )max -code, with n ≤
Therefore
1
C (P ) ≥ log M = I(X; Y ) − 2δ, ∀δ > 0, ∀PX s.t. E[c(X)] < P
n
⇒ C (P ) ≥ sup lim (I(X; Y ) − 2δ)
PX :E[c(X)]<P δ→0
⇒ C (P ) ≥ sup I(X; Y ) = Ci (P −) = Ci (P )
PX :E[c(X)]<P
where the last equality is from the continuity of Ci on (P0 , ∞) by Proposition 19.1. Notice
that for general information stable channel, we just need to use the definition to show that
P (i(X n ; Y n ) ≤ n(Ci − δ)) → 0, and all the rest follows.
204
Theorem 19.4 (Shannon capacity). For an information stable channel with cost constraint and
for any admissible constraint P we have
C(P ) = Ci (P ).
Proof. The case of P = P0 is treated in the homework. So assume P > P0 . Theorem 19.1
shows C (P ) ≤ C1−
i (P )
, thus C(P ) ≤ Ci (P ). On the other hand, from Theorem 19.3 we have
C(P ) ≥ Ci (P ).
Note: In homework, you will show that C(P0 ) = Ci (P0 ) also holds, even though Ci (P ) may be
discontinuous at P0 .
19.3 Applications
19.3.1 Stationary AWGN channel
Z ∼ N (0, σ 2 )
X Y
+
Definition 19.5 (AWGN). The additive Gaussian noise (AWGN) channel is a stationary memoryless
additive-noise channel with separable cost constraint: A = B = R, c(x) = x2 , PY |X is given by
⊥ X, and average power constraint EX 2 ≤ P .
Y = X + Z, where Z ∼ N (0, σ 2 ) ⊥
In other words, Y = X + Z , where Z n ∼ N (0, In ).
n n n
Gaussian
Note: Here “white” = uncorrelated = independent.
Note: Complex AWGN channel is similarly defined: A = B = C, c(x) = |x|2 , and Z n ∼ CN (0, In )
Theorem 19.5. For stationary (C)-AWGN channel, the channel capacity is equal to information
capacity, and is given by:
1 P
C(P ) = Ci (P ) = log 1 + 2 for AWGN
2 σ
P
C(P ) = Ci (P ) = log 1 + 2 for C-AWGN
σ
Then use Theorem 4.6 (Gaussian saddle point) to conclude X ∼ N (0, P ) (or CN (0, P )) is the
unique caid.
205
Note: Since Z n ∼ N (0, σ 2√ ), then with high probability,
kZ n k2 concentrates around nσ 2 . Similarly, duethe power
constraint and the fact that Z n ⊥
⊥ X n , we have E kY n k2 = c3
E kY n k2 + E kZ n k2 ≤ n(P + σ 2 ) and the received vector c4
p
p n
√ c2
Y n lies in an `2 -ball of radius approximately n(P + 2
√ σ ). nσ 2
(P
c1
+
Since the noise can at most perturb the codeword by √nσ 2
σ
2
)
in Euclidean distance, if we 2
p can pack M balls of radius nσ c5
c8
2
into the `2 -ball of radius n(P + σ ) centered at the origin,
···
then this gives a good codebook and decision regions. The
packing number is related to the volume ratio. Note that c6
c7
the volume of an `2 -ball of radius r in Rn is given by cn rn
+σ 2 ))n/2 n/2 cM
for some constant cn . Then cn (n(P cn (nσ 2 )n/2
= 1 + σP2 .
Take the log and divide by n, we get 21 log 1 + σP2 .
Why the above is not a proof for either achievability or converse?
Theorem 19.5 applies to Gaussian noise. What if the noise is non-Gaussian and how sensitive is
the capacity formula 12 log(1 + SNR) to the Gaussian assumption? Recall the Gaussian saddlepoint
result we have studied in Lecture 4 where we showed that for the same variance, Gaussian noise
is the worst which shows that the capacity of any non-Gaussian noise is at least 12 log(1 + SNR).
Conversely, it turns out the increase of the capacity can be controlled by how non-Gaussian the
noise is (in terms of KL divergence). The following result is due to Ihara.
Theorem 19.6 (Additive Non-Gaussian noise). Let Z be a real-valued random variable independent
of X and EZ 2 < ∞. Let σ 2 = Var Z. Then
1 P 1 P
log 1 + 2 ≤ sup I(X; X + Z) ≤ log 1 + 2 + D(PZ kN (EZ, σ 2 )).
2 σ PX :EX 2 ≤P 2 σ
Proof. Homework.
Note: The quantity D(PZ kN (EZ, σ 2 )) is sometimes called the non-Gaussianness of Z, where
N (EZ, σ 2 ) is a Gaussian with the same mean and variance as Z. So if Z has a non-Gaussian density,
say, Z is uniform on [0, 1], then the capacity can only differ by a constant compared to AWGN,
which still scales as 12 log SNR in the high-SNR regime. On the other hand, if Z is discrete, then
D(PZ kN (EZ, σ 2 )) = ∞ and indeed in this case one can show that the capacity is infinite because
the noise is “too weak”.
206
Theorem 19.7 (Waterfilling). The capacity of L-parallel AWGN channel is given by
L
1X + T
C = log
2
j=1
σj2
Proof.
Ci (P ) = sup I(X L ; Y L )
E[Xi2 ]≤P
P
PX L :
L
X
≤ sup sup I(Xk ; Yk )
Pk ≤P,Pk ≥0 k=1 E[Xk2 ]≤Pk
P
L
X 1 Pk
= sup log(1 + )
P
Pk ≤P,Pk ≥0 k=1 2 σk2
with equality if Xk ∼ N (0, Pk ) are independent. So the question boils down to the last
Pmaximization
problem – power allocation: Denote the Lagragian multipliers P1 for the constraint Pk ≤ PPby λ
Pk
and for the constraint Pk ≥ 0 by µk . We want to solve max 2 log(1 + σ2 ) − µk Pk + λ(P − Pk ).
k
First-order condition on Pk gives that
1 1
2 = λ − µk , µ k P k = 0
2 σ k + Pk
Note: The figure illustrates the power allocation via water-filling. In this particular case, the second
branch is too noisy (σ2 too big) such that it is better be discarded, i.e., the assigned power is zero.
Note: [Significance of the waterfilling theorem] In the high SNR regime, the capacity for 1 AWGN
channel is approximately 12 log P , while the capacity for L parallel AWGN channel is approximately
207
L
2 log( PL ) ≈ L2 log P for large P . This L-fold increase in capacity at high SNR regime leads to the
powerful technique of spatial multiplexing in MIMO.
Also notice that this gain does not come from multipath diversity. Consider the scheme that a
single stream of data is sent through every parallel channel simultaneously, with multipath diversity,
the effective noise level is reduced to L1 , and the capacity is approximately log(LP ), which is much
smaller than L2 log( PL ) for P large.
then the capacity of the non-stationary AWGN channel is given by the parameterized form: C(T ) =
C̃i (T ) with input power constraint P̃ (T ).
Proof. Fix T > 0. Then it is clear from the waterfilling solution that
n
X 1 T
sup I(X n ; Y n ) = log+ , (19.4)
j=1
2 σj2
Now, by assumption, the LHS of (19.5) converges to P̃ (T ). Thus, we have that for every δ > 0
Ci (P̃ (T ) − δ) ≤ C̃i (T ) (19.6)
Ci (P̃ (T ) + δ) ≥ C̃i (T ) (19.7)
Taking δ → 0 and invoking continuity of P 7→ Ci (P ), we get that the information capacity satisfies
Ci (P̃ (T )) = C̃i (T ) .
The channel is information stable. Indeed, from (18.13)
log2 e Pj log2 e
Var(i(Xj ; Yj )) = ≤
2 Pj + σj2 2
and thus
n
X 1
Var(i(Xj ; Yj )) < ∞ .
n2
j=1
From here information stability follows via Theorem 18.9.
Note: Non-stationary AWGN is primarily interesting due to its relationship to the stationary
Additive Colored Gaussian noise channel in the following discussion.
208
19.5* Stationary Additive Colored Gaussian noise channel
Definition 19.8 (Additive colored Gaussian noise channel ). An Additive Colored Gaussian noise
channel is defined as follows: A = B = R, c(x) = x2 , PYj |Xj : Yj = Xj + Zj , where Zj is a stationary
Gaussian process with spectral density fZ (ω) > 0, ω ∈ [−π, π].
Theorem 19.9. The capacity of stationary ACGN channel is given by the parameterized form:
Z 2π
1 1 T
C(T ) = log+ dω
2π 0 2 fZ (ω)
Z 2π +
1
P (T ) = T − fZ (ω) dω
2π 0
e
Finally since U is unitary, C = C.
Note: Noise is born white, the colored noise is essentially due to some filtering.
209
19.6* Additive White Gaussian Noise channel with Intersymbol
Interference
Definition 19.9 (AWGN with ISI). An AWGN channel with ISI is defined as follows: A = B = R,
c(x) = x2 , and the channel law PY n |X n is given by
n
X
Yk = hk−j Xj + Zk , k = 1, . . . , n
j=1
where Zk ∼ N (0, 1) is white Gaussian noise, {hk , k = −∞, . . . , ∞} are coefficients of a discrete-time
channel filter.
Theorem 19.10. Suppose that the sequence {hk } is an inverse Fourier transform of a frequency
response H(ω):
Z 2π
1
hk = eiωk H(ω)dω .
2π 0
Assume also that H(ω) is a continuous function on [0, 2π]. Then the capacity of the AWGN channel
with ISI is given by
Z 2π
1 1
C(T ) = log+ (T |H(ω)|2 )dω
2π 0 2
Z 2π +
1 1
P (T ) = T− dω
2π 0 |H(ω)|
2
1
Proof. (Sketch) At the decoder apply the inverse filter with frequency response ω 7→ H(ω) . The
equivalent channel then becomes a stationary colored-noise Gaussian channel:
Ỹj = Xj + Z̃j ,
where Z̃j is a stationary Gaussian process with spectral density
1
fZ̃ (ω) = .
|H(ω)|2
Then apply Theorem 19.9 to the resulting channel.
Remark: to make the above argument rigorous one must carefully analyze the non-zero error
introduced by truncating the deconvolution filter to finite n.
Capacity achieving input distribution PX∗ is discrete, with finitely many atoms on [−A, A]. Moreover,
2
the convergence speed of limA→∞ C(A, P ) = 12 log(1 + P ) is of the order e−O(A ) .
For details, see [Smi71] and [PW14, Section III].
210
19.8* Gaussian channels with fading
Fading channels are often used to model the urban signal propagation with multipath or shadowing.
The received signal Yi is modeled to be affected by multiplicative fading coefficient Hi and additive
noise Zi :
Yi = Hi Xi + Zi , Zi ∼ N (0, 1)
In the coherent case (also known as CSIR – for channel state information at the receiver), the
receiver has access to the channel state information of Hi , i.e. the channel output is effectively
(Yi , Hi ). Whenever Hj is a stationary ergodic process, we have the channel capacity given by:
1
C(P ) = E 2
log(1 + P |H| )
2
and the capacity achieving input distribution is the usual PX = N (0, P ). Note that the capacity
C(P ) is in the order of log P and we call the channel “energy efficient”.
In the non-coherent case where the receiver does not have the information of Hi , no simple
expression for the channel capacity is known. It is known, however, that the capacity achieving
input distribution is discrete [AFTS01], and the capacity scales as [TE97, LM03]
With introduction of multiple antenna channels, there are endless variations, theoretical open
problems and practically unresolved issues in the topic of fading channels. We recommend consulting
the textbook [TV05] for details.
211
§ 20. Lattice codes (by O. Ordentlich)
212
where ties are broken in a systematic manner. We define the modulo operation w.r.t. a lattice Λ as
The basic Voronoi region of Λ, denoted by V, is the set of all points in Rn which are quantized
to the zero vector. The systematic tie-breaking in (20.2) ensures that
[
· (V + t) = Rn ,
t∈Λ
S
where · denotes disjoint union. Thus, V is a fundamental cell of Λ.
Definition 20.2. A measurable set S ∈ Rn is called a fundamental cell of Λ if
[
· (S + t) = Rn .
t∈Λ
At , S ∩ (t + V); Dt , V ∩ (t + S).
Note that
Dt = [(−t + V) ∩ S] + t
= A−t + t.
Thus
X X X
Vol(S) = Vol(At ) = Vol(A−t + t) = Vol(Dt ) = Vol(V).
t∈Λ t∈Λ t∈Λ
Moreover
[ [ [
S = · At = · A−t = · Dt − t,
t∈Λ t∈Λ t∈Λ
and therefore
[
[S] mod Λ = · Dt = V.
t∈Λ
213
Corollary 20.1. If S is a fundamental cell of a lattice Λ with generating matrix G, then Vol(S) =
| det(G)|. In Particular, Vol(V) = | det(G)|.
Proof. Let P = G · [0, 1)n and note that it is a fundamental cell of Λ as Rn = Zn + [0, 1)n . The
claim now follows from Proposition 20.1 since Vol(P) = | det(G)| · Vol([0, 1)n ) = | det(G)|.
Definition 20.3 (Lattice decoder). A lattice decoder w.r.t. the lattice Λ returns for every y ∈ Rn
the point QΛ (y).
Remark 20.1. Recall that for linear codes, the ML decoder merely consisted of mapping syndromes
to shifts. Similarly, it can be shown that a lattice decoder can be expressed as
QΛ (y) = y − gsynd [G−1 y] mod 1 , (20.3)
for some gsynd : [0, 1)n 7→ Rn , where the mod 1 operation above is to be understood as componentwise
modulo reduction. Thus, a lattice decoder is indeed much more “structured” than ML decoder for a
random code.
Note that for an additive channel Y = X + Z, if X ∈ Λ we have that
An alternative interpretation of this property, is that for a sequence Λ(n) that is good for coding,
for any 0 < δ < 1 holds
Vol B (1 − δ)reff (Λ(n) ) ∩ V (n)
lim = 1.
n→∞ Vol B (1 − δ)reff (Λ(n) )
Roughly speaking, the Voronoi region of a lattice that is good for coding is as resilient to semi
norm-ergodic noise as a ball with the same volume.
214
reff
(b)
(a)
Figure 20.1: (a) shows a lattice in R2 , and (b) shows its Voronoi region and the corresponding
effective ball.
satisfies the power constraint. Moreover [Loe97], there exist a shift v such that
Vol (S)
|C| ≥
Vol(V)
√ !n
nSNR
=
reff (Λ)
n
= 2 2 (log(SNR)−log(1+δ)) .
To see this, let V ∼ Uniform(V), and write the expected size of |C| as
X
E|C| = E 1((t + V) ∈ S)
t∈Λ
Z X
1
= 1((t + v) ∈ S)dv
Vol(V) v∈V t∈Λ
Z
1
= 1(x ∈ S)dx
Vol(V) x∈Rn
Vol(S)
= . (20.6)
Vol(V)
215
For decoding, we will simply apply the lattice decoder QΛ (Y − v) on the shifted output. Since
Y − v = t + Z for some t ∈ Λ, the error probability is
Pe = Pr (QΛ (Y − v) 6= t) = Pr(Z ∈
/ V).
r2 (Λ)
Since Λ is good for coding and effn = (1 + δ) > n1 EkZk2 , the error probability of this scheme over
an additive semi norm-ergodic noise channel will vanish with n. Taking δ → 0 we see that any rate
R < 21 log(SNR) can be achieved reliably. Note that for this coding scheme (encoder+decoder) the
average error probability and the maximal error probability are the same.
The construction above gets us close to the AWGN channel capacity. We note that a possible
reason for the loss of +1 in the achievable rate is the suboptimality of the lattice decoder for the
codebook C. The lattice decoder assumes all points of Λ were equally likely to be transmitted.
However, in C only lattice points inside the ball can be transmitted. Indeed, it was shown [UR98]
that if one replaces the lattice decoder with a decoder that takes
T the√ shaping region into account,
there exist lattices and shifts for which the codebook (v + Λ) B( nSNR) is capacity achieving.
The main drawback of this approach is that the decoder no longer exploits the full structure of
the lattice, so the advantages of using a lattice code w.r.t. some typical member of the Gaussian
i.i.d. ensemble are not so clear anymore.
A nested lattice code (sometimes also called “Voronoi constellation”) based on the nested lattice
pair Λc ⊂ Λf is defined as [CS83, FJ89, EZ04]
L , Λf ∩ V c . (20.8)
Proposition 20.2.
Vol(Vc )
|L| = .
Vol(Vf )
1
Thus, the codebook L has rate R = n log |L| = log Γ(Λf , Λc ).
Let
[
S , · (t + Vf ),
t∈L
216
and note that
[
Rn = · (b + Vf )
b∈Λf
[ [
= · · (a + t + Vf )
a∈Λc t∈L
!!
[ [
= · a+ · (t + Vf )
a∈Λc t∈L
[
= · (a + S) .
a∈Λc
We will use the codebook L with a standard lattice decoder, ignoring the fact that only points
in Vc were transmitted. Therefore, the resilience to noise will be dictated mainly by Λf . The role of
the coarse lattice Λc is to perform shaping. In order to maximize the rate of the codebook L without
violating the power constraint, we would like Vc to have the maximal possible volume, under the
constraint that the average power of a transmitted codeword is no more than nSNR.
The average transmission power of the codebook L is related to a quantity called the second
moment of a lattice. Let U ∼ Uniform(V). The second moment of Λ is defined as σ 2 (Λ) , n1 EkUk2 .
Let W ∼ Uniform(B(reff (Λ)). By the isoperimetric inequality [Zam14]
1 r2 (Λ)
σ 2 (Λ) ≥ EkWk2 = eff .
n n+2
A lattice Λ exhibits a good tradeoff between average power and volume if its second moment is close
to that of B(reff (Λ).
Definition 20.5 (Goodness for MSE quantization). A sequence of lattices Λ(n) with growing
dimension, is called good for MSE quantization if
nσ 2 Λ(n)
lim = 1.
n→∞ r 2 Λ(n)
eff
Remark 20.2. Note that both “goodness for coding” and “goodness for quantization” are scale
invariant properties: if Λ satisfy them, so does αΛ for any α ∈ R.
Theorem 20.1 ([OE15]). If Λ is good for MSE quantization and U ∼ Uniform(V), then U is semi
norm-ergodic. Furthermore, if Z is semi norm-ergodic and statistically independent of U, then for
any α, β ∈ R the random vector αU + βZ is semi norm-ergodic.
Theorem 20.2 ([ELZ05, OE15]). For any finite nesting ratio Γ(Λf , Λc ), there exist a nested lattice
pair Λc ⊂ Λf where the coarse lattice Λc is good for MSE quantization and the fine lattice Λf is
good for coding.
217
Figure 20.2: An example of a nested lattice code. The points and Voronoi region of Λc are plotted
in blue, and the points of the fine lattice in black.
Z
L X LY α L Yeff t̂
t mod-Λ QΛf (·) mod-Λ
−
We now describe the Mod-Λ coding scheme introduced by Erez and Zamir [EZ04]. Let Λc ⊂ Λf
be a nested lattice pair, where the coarse lattice is good for MSE quantization and has σ 2 (Λc ) =
SNR(1 − ), whereas the fine lattice is good for coding and has reff2 (Λ ) = n SNR (1 + ). The rate
f 1+SNR
is therefore
1 Vol(Vc )
R = log
n Vol(Vf )
2
1 reff (Λc )
= log 2 (Λ )
2 reff f
!
1 SNR(1 − )
→ log SNR
(20.9)
2 1+SNR (1 + )
1
→ log (1 + SNR) ,
2
r2 (Λ )
where in (20.9) we have used the goodness of Λc for MSE quantization, that implies effn c → σ 2 (Λc ).
The scheme also uses common randomness, namely a dither vector U ∼ Uniform(Vc ) statistically
independent of everything, known to both the transmitter and the receiver. In order to transmit a
message w ∈ [1, . . . , 2nR ] the encoder maps it to the corresponding point t = t(w) ∈ L and transmits
X = [t + U] mod Λ. (20.10)
Lemma 20.1 (Crypto Lemma). Let Λ be a lattice in Rn , let U ∼ Uniform(V) and let V be a
random vector in Rn , statistically independent of U. The random vector X = [V + U] mod Λ is
uniformly distributed over V and statistically independent of V.
218
Proof. For any v ∈ Rn the set v + V is a fundamental cell of Λ. Thus, by Proposition 20.1 we have
that [v + V] mod Λ = V and Vol(v + V) = Vol(V). Thus, for any v ∈ Rn
X|V = v ∼ [v + U] mod Λ ∼ Uniform(V).
The Crypto Lemma ensures that n1 EkXk2 = (1 − )SNR, but our power constraint was kXk2 ≤
nSNR. Since X is uniformly distributed over Vc and Λc is good for MSE quantization, Theorem 20.1
implies that kXk2 ≤ nSNR with high probability. Thus, whenever the power constraint is violated
we can just transmit 0 instead of X, and this will have a negligible effect on the error probability of
the scheme.
The receiver scales its observation by a factor α > 0 to be specified later, subtracts the dither U
and reduces the result modulo the coarse lattice
Yeff = [αY − U] mod Λc
= [X − U + (α − 1)X + αZ] mod Λc
= [t + (α − 1)X + αZ] mod Λc (20.11)
= [t + Zeff ] mod Λc , (20.12)
where we have used the modulo distributive law in (20.11), and
Zeff = (α − 1)X + αZ (20.13)
is effective noise, that is statistically independent of t, with effective variance
2 1
σeff (α) , EkZeff k2 < α2 + (1 − α)2 SNR. (20.14)
n
Since Z is semi norm-ergodic, and X is uniformly distributed over the Voronoi region of a lattice that
is good for MSE quantization, Theorem 20.1 implies that Zeff is semi norm-ergodic with effective
2 (α). Setting α = SNR/(1 + SNR), such as to minimize the upper bound on σ 2 (α)
variance σeff eff
2 < SNR/(1 + SNR).
results in effective variance σeff
The receiver next computes
t̂ = [QΛf (Yeff )] mod Λc
= QΛf (t + Zeff ) mod Λc , (20.15)
and outputs the message corresponding to t̂ as its estimate. Since Λf is good for coding, Zeff is
semi norm-ergodic, and
2 (Λ )
reff f SNR 2
= (1 + ) > σeff ,
n 1 + SNR
we have that Pr(t̂ 6= t) → 0 as the lattice dimension tends to infinity. Thus, we have proved the
following.
Theorem 20.3. There exist a coding scheme based on a nested lattice pair, that reliably achieves
any rate below 12 log(1 + SNR) with lattice decoding for all additive semi norm-ergodic channels. In
particular, if the additive noise is AWGN, this scheme is capacity achieving.
Remark 20.3. In the Mod-Λ scheme the error probability does not depend on the chosen message,
such that Pe,max = Pe,avg . However, this required common randomness in the form of the dither U.
By a standard averaging argument it follows that there exist some fixed shift u that achieves the
same, or better, Pe,avg . However, for a fixed shift the error probability is no longer independent of
the chosen message.
219
20.4 Dirty Paper Coding
Assume now that the channel is
Y = X + S + Z,
where Z is a unit variance semi norm-ergodic noise, X is subject to the same power constraint
kXk2 ≤ nSNR as before, and S is some arbitrary interference vector, known to the transmitter but
not to the receiver.
Naively, one can think that the encoder can handle the interference S just by subtracting it
from the transmitted codeword. However, if the codebook is designed to exactly meet the power
constraint, after subtracting S the power constraint will be violated. Moreover, if kSk2 > nSNR,
this approach is just not feasible.
Using the Mod-Λ scheme, S can be cancelled out with no cost in performance. Specifically,
instead of transmitting X = [t + U] mod Λc , the transmitted signal in the presence of known
interference will be
X = [t + U − αS] mod Λc .
Clearly, the power constraint is not violated as X ∼ Uniform(Vc ) due to the Crypto Lemma (now,
U should also be independent of S). The decoder is exactly the same as in the Mod-Λ scheme with
no interference. It is easy to verify that the interference is completely cancelled out, and any rate
below 21 log(1 + SNR) can still be achieved.
Remark 20.4. When Z is Gaussian and S is Gaussian there is a scheme based on random codes
that can reliably achieve 12 log(1 + SNR). For arbitrary S, to date, only lattice based coding schemes
are known to achieve the interference free capacity. There are many more scenarios where lattice
codes can reliably achieve better rates than the best known random coding schemes.
Note that any point in Λ(F) can be decomposed as x = p−1 c + a for some c ∈ C(F) (where we
identify the elements of Zp with the integers [0, 1, . . . , p−1]) and a ∈ Zn . Thus, for any x1 , x2 ∈ Λ(F)
220
we have
x1 + x2 = p−1 (c1 + c2 ) + a1 + a2
= p−1 ([c1 + c2 ] mod p + pa) + a1 + a2
= p−1 c̃ + ã
∈ Λ(F)
where c̃ = [c1 + c2 ] mod p ∈ C(F) due to the linearity of C(F), and a and ã are some vectors in
Zn . It can be verified similarly that for any x ∈ Λ(F) it holds that −x ∈ Λ(F), and that if all
codewords in C(F) are distinct, then Λ(F) has a finite minimum distance. Thus, Λ(F) is indeed
a lattice. Moreover, if F is full-rank over Zp , then the number of distinct codewords in C(F) is
pk . Consequently, the number of lattice points in every integer shift of the unit cube is pk , so the
corresponding Voronoi region must satisfy Vol(V) = p−k .
Similarly, we can construct a nested lattice pair from a linear code. Let 0 ≤ k 0 < k and let F0 be
the sub-matrix obtained by taking only the first k 0 rows of F. The matrix F0 generates a linear
code C 0 (F0 ) that is nested in C(F), i.e., C 0 (F0 ) ⊂ C(F). Consequently we have that Λ(F0 ) ⊂ Λ(F),
and the nesting ratio is
k−k0
Γ(Λ(F), Λ(F0 )) = p n .
An advantage of this nested lattice construction for Voronoi constellations is that there is a very
simple mapping between messages and codewords in L = Λf ∩ Vc . Namely, we can index our set
0 0 0
of 2nR = pk−k messages by all vectors in Zk−k
p . Then, for each message vector w ∈ Zk−k p , the
corresponding codeword in L = Λ(F) ∩ V(Λ(F0 )) is obtained by constructing the vector
w̃T = [0 · · 0} wT ] ∈ Zkp ,
| ·{z (20.16)
k0 zeros
and taking t = t(w) = [w̃T F] mod p mod Λ(F0 ). Also, in order to specify the codebook L, only
the (finite field) generating matrix F is needed.
If we take the elements of F to be i.i.d. and uniform over Zp , we get a random ensemble of
nested lattice codes. It can be shown that if p grows fast enough with the dimension n (taking
p = O(n(1+)/2 ) suffices) almost all pairs in the ensemble have the property that both the fine and
coarse lattice are good for both coding and for MSE quantization [OE15].
Disclaimer: This text is a very brief and non-exhaustive survey of the applications of lattices
in information theory. For a comprehensive treatment, see [Zam14].
221
§ 21. Channel coding: energy-per-bit, continuous-time channels
In the last lecture, we analyzed the maximum number of information bits (M ∗ (n, , P )) that can be
pumped through for given n time use of the channel under the energy constraint P . Today we shall
study the counterpart of it: without any time constraint, in order to send k information bits, what
is the minimum energy needed? (E ∗ (k, ))
Definition 21.1 ( (E, 2k , ) code). For a channel W → X ∞ → Y ∞ → Ŵ , where Y ∞ = X ∞ + Z ∞ ,
a (E, 2k , ) code is a pair of encoder-decoder:
f : [2k ] → R∞ , g : R∞ → [2k ],
such that 1). ∀m, kf (m)k22 ≤ E,
2). P [g(f (W ) + Z ∞ ) 6= W ] ≤ .
Note: Operational meaning of lim→0 E ∗ (k, ): it suggests the smallest battery one needs in order
to send k bits without any time constraints, below that level reliable communication is impossible.
Theorem 21.1 ((Eb /N0 )min = −1.6dB).
E ∗ (k, ) N0 1
lim lim sup = , = −1.6dB (21.2)
→0 k→∞ k log2 e log2 e
Proof.
222
1. (“≥” converse part)
1
−h() + k ≤ d((1 − )k ) (Fano)
M
≤ I(W ; Ŵ ) (data processing for divergence)
≤ I(X ∞ ; Y ∞ ) (data processing for M.I.)
X∞
≤ I(Xi ; Yi ) ( lim I(X n ; U ) = I(X ∞ ; U ))
n→∞
i=1
X∞
1 EXi2
≤ log(1 + ) (Gaussian)
2 N0 /2
i=1
∞
log e X EXi2
≤ (linearization)
2 N0 /2
i=1
E
≤ log e
N0
E ∗ (k, ) N0 h()
⇒ ≥ ( − ).
k log e k
E ∗ (kn , ) nP
lim sup ≤ lim sup ∗
n→∞ kn n→∞ log Mmax (n, , P )
P
= 1 ∗ (n, , P )
lim inf n→∞ n log Mmax
P
= 1 P
2 log(1 + N0 /2 )
Note: [Remark] In order to send information reliably at Eb /N0 = −1.6dB, infinitely many time
slots are needed, and the information rate (spectral efficiency) is thus 0. In order to have non-zero
spectral efficiency, one necessarily has to step back from −1.6 dB.
223
Note: [PPM code] The following code, pulse-position modulation (PPM), is very efficient in terms
of Eb /N0 .
√
PPM encoder: ∀m, f (m) = (0, 0, . . . , E
|{z} ,...) (21.3)
m-th location
It is not hard to derive an upper bound on the probability of error that this code achieves [PPV11,
Theorem 2]: " ( ! )#
r
2E
≤ E min M Q + Z ,1 , Z ∼ N (0, 1) .
N0
In fact,
√ the code√can be further slightly optimized by subtracting the common center of gravity
(2−k E, . . . , 2−k E . . .) and rescaling each codeword to satisfy the power constraint. The resulting
constellation (simplex code) is conjectured to be non-asymptotic optimum in terms of Eb /N0 for
small (“simplex conjecture”).
21.2 What is N0 ?
In the above discussion, we have assumed Zi ∼ N (0, N0 /2), but how do we determine N0 ?
In reality the signals are continuous time (CT) process, the continuous time AWGN channel for
the RF signals is modeled as:
where noise N (t) (added at the receiver antenna) is a real stationary ergodic process and is assumed
to be “white Gaussian noise” with single-sided PSD N0 . Figure 21.1 at the end illustrates the
communication architecture. In the following discussion, we shall find the equivalent discrete
time (DT) AWGN model for the continuous time (CT) AWGN model in (21.4), and identify the
relationship between N0 in the DT model and N (t) in the CT model.
• observations:
224
Estimate the average power dissipation at the resistor:
Z T
1 ergodic (*)
lim Ft2 dt = E[F 2 ] = N0 B
T →∞ T t=0
If for some constant N0 , (*) holds for any narrow band with center frequency fc and bandwidth B,
then N (t) is called a “white noise” with one-sided PSD N0 .
Typically, white noise comes from thermal noise at the receiver antenna. Thus:
N0 ≈ kT (21.5)
where k = 1.38 × 10−23 is the Boltzmann constant, and T is the absolute temperature. The unit of
N0 is (W att/Hz = J).
An intuitive explanation to (21.5) is as follows: the thermal energy carried by each microscopic
degree of freedom (dof) is approximately kT 2 ; for bandwidth B and duration T , there are in total
2BT dof; by “white noise” definition we have the total energy of the noise to be:
kT
N0 BT = 2BT ⇒ N0 = kT.
2
Denote the set of all real finite energy signals f (t) by L2 (R), it is a vector space with the inner
product of two signals f (t), g(t) defined by
Z ∞
< f, g >= f (t)g(t)dt.
t=−∞
Definition 21.3 (WhiteR noise). N (t) isR a white noise with two-sided PSD being constant N0 /2 if
∞ ∞
∀f, g ∈ L2 (R) such that −∞ f 2 (t)dt = −∞ g 2 (t)dt = 1, we have that
1.
Z ∞
N0
< f, N >, f (t)N (t)dt ∼ N (0, ). (21.6)
−∞ 2
2. The joint distribution of (< f, N >, < g, N >) is jointly Gaussian with covariance equal to
inner product < f, g >.
Note: By this definition, N (t) is not a stochastic process, rather it is a collection of linear mappings
that map any f ∈ L2 (R) to a Gaussian random variable.
225
Note: Informally, we write:
N0
N (t) is white noise with one-sided PSD N0 (or two-sided PSD N0 /2) ⇐⇒ E[N (t)N (s)] = δ(t − s)
2
(21.7)
Note: The concept of one-sided PSD arises when N (t) is necessarily real, since in that case power
spectrum density is symmetric around 0, and thus to get the noise power in band [a, b] one can get
Z b Z b Z −a
noise power = Fone-sided (f )df = + Ftwo-sided (f )df ,
a a −b
where ωc = 2πfc . The LP F2 with high cutoff frequency ∼ 34 fc serves to kill the high frequency
√
component after demodulation, and the amplifier of magnitude 2 serves to preserve the total
energy of the signal, so that in the absence of noise we have that YB (t) = XB (t). Therefore,
e (t) ∼ C
YB (t) = XB (t) + N
e (t) is a complex Gaussian white noise and
where N
e (t)N
E[N e (s)∗ ] = N0 δ(t − s).
226
where the additive noise Zi is given by:
Z ∞
Zi = Ne (t)sincB (t − i )dt ∼ i.i.d CN (0, N0 ). (by (21.6))
t=−∞ B
if we focus on the real part of all signals, it is consistent with the real AWGN channel model in
(21.1).
Finally, the energy of the signal is preserved:
∞
X
|Xi |2 = kXB (t)k22 = kX(t)k22 .
i=−∞
Note: [Punchline]
Then
1 ∗ P
lim lim inf log MCT (T, , P ) = B log(1 + ), (21.8)
→0 n→∞ T N0 B
Proof. Consider the DT equivalent C-AWGN channel of this CT model, we have that
1 ∗ 1 ∗
log MCT (T, , P ) = log MC−AWGN (BT, , P/B)
T T
This is because:
227
• The power constraint in the DT model changed because for blocklength BT we have
BT
X
|Xi |2 = kX(t)k22 ≤ P T ,
i=1
P
thus per-letter power constraint is B.
Calculate the rate of the equivalent DT AWGN channel and we are done.
Note the above “theorem” is not rigorous, since conditions 1 and 2 are mutually exclusive:
any time limited non-trivial signal cannot be band limited. Rigorously, one should relax 2 by
constraining the signal to have a vanishing out-of-band energy as T → ∞. Rigorous approach to
this question lead to the theory of prolate spheroidal functions.
228
Thus,
1 P P
lim lim inf log2 M ∗ (T, , P ) = E ∗ (k,)
= log2 e ,
→0 n→∞ T lim→0 lim supk→∞ N0
k
where the last step is by Theorem 21.1.
229
230
3. Let C(P ) be the capacity-cost function of the channel (in the usual sense of capacity, as
defined in (19.1). Assuming P0 = 0 and C(0) = 0 it is not hard to show that:
C(P ) C(P ) d
Cpuc = sup = lim = C(P ) .
P P P →0 P dP P =0
4. The surprising discovery of Verdú is that one can avoid computing C(P ) and derive the
Cpuc directly. This is a significant help, as for many practical channels C(P ) is unknown.
Additionally, this gives a yet another fundamental meaning to KL-divergence.
Q
Theorem 21.4. For a stationary memoryless channel PY ∞ |X ∞ = PY |X with P0 = c(x0 ) = 0
(i.e. there is a symbol of zero cost), we have
Proof. Let
D(PY |X=x kPY |X=x0 )
CV = sup .
x6=x0 c(x)
QY ∞ |X ∞ = QY ∞ , PY∞|X=x0 .
231
Then, as usual we have from the data-processing for divergence
1
(1 − ) log M + h() ≤ d(1 − k ) (21.11)
M
≤ D(PW,X ∞ ,Y ∞ ,Ŵ kQW,X ∞ ,Y ∞ ,Ŵ ) (21.12)
= D(PY ∞ |X ∞ kQY ∞ |PX ∞ ) (21.13)
"∞ #
X
=E d(Xt ) , (21.14)
t=1
where the last step is by the cost constraint (21.10). Thus, dividing by E and taking limits we get
Cpuc ≤ CV .
Achievability: We generalize the PPM code (21.3). For each x1 ∈ X and n ∈ Z+ we define the
encoder f as follows:
··· (21.17)
f (M ) = ( x0 , . . . , x0 , x1 , x1 , . . . , x1 ) (21.18)
| {z } | {z }
n(M −1)-times n-times
Now, by Stein’s lemma there exists a subset S ⊂ Y n with the property that
Yn ∈S =⇒ Ŵ = 1 (21.21)
2n
Yn+1 ∈S =⇒ Ŵ = 2 (21.22)
··· (21.23)
From the union bound we find that the overall probability of error is bounded by
232
At the same time the total cost of each codeword is given by nc(x1 ). Thus, taking n → ∞ and after
straightforward manipulations, we conclude that
D(PY |X=x1 kPY |X=x0 )
Cpuc ≥ .
c(x1 )
This holds for any symbol x1 ∈ X , and so we are free to take supremum over x1 to obtain Cpuc ≥ CV ,
as required.
i.e. the total energy expended by each codeword must be almost entirely concentrated in very large
spikes. Such a coding method is called “flash signalling”. Thus, we can see that unlike non-fading
AWGN (for which due to rotational symmetry all codewords can be made “mellow”), the only hope
of achieving full Cpuc in the presence of fading is by signalling in huge bursts of energy.
This effect manifests itself in the speed of convergence to Cpuc with increasing constellation sizes.
∗
Namely, the energy-per-bit E (k,) k behaves as
r
E ∗ (k, ) const −1
= (−1.59 dB) + Q () (AWGN) (21.27)
k k
r
E ∗ (k, ) 3 log k
= (−1.59 dB) + (Q−1 ())2 (fading) (21.28)
k k
Fig. 21.2 shows numerical details.
233
14
12
10
Achievability
8
Converse
dB
2
fading+CSIR, non-fading AWGN
0
−1.59 dB
−2
100 101 102 103 104 105 106 107 108
Information bits, k
Figure 21.2: Comparing the energy-per-bit required to send a packet of k-bits for different channel
E ∗ (k,)
models (curves represent upper and lower bounds on the unknown optimal value k ). As a
comparison: to get to −1.5 dB one has to code over 6 · 104 data bits when the channel is non-fading
AWGN or fading AWGN with Hj known perfectly at the receiver. For fading AWGN without
knowledge of Hj (noCSI), one has to code over at least 7 · 107 data bits to get to the same −1.5 dB.
Plot generated via [Spe15].
234
§ 22. Advanced channel coding. Source-Channel separation.
Topics: Strong Converse, Channel Dispersion, Joint Source Channel Coding (JSCC)
Proof. We will give a sketch of the proof. Take an (n, M, )-code for channel PY |X . The main trick
is to consider an auxiliary channel QY |X which is easier to analyze.
PY n |X n
W Xn Yn Ŵ
QY n |X n
235
Sketch 1: Here, we take QY n |X n = (PY∗ )n , where PY∗ is the capacity-achieving output distri-
bution (caod) of the channel PY |X .1 Note that for communication purposes, QY n |X n is a useless
channel; it ignores the input and randomly picks a member of the output space according to (PY∗ )n ,
so that X n and Y n are decoupled (independent). Consider the probability of error under each
channel:
1
Q[Ŵ = W ] = (Blindly guessing the sent codeword)
M
P[Ŵ = W ] = 1 −
Since the random variable 1{Ŵ =W } has a huge mass under P and small mass under Q, this looks
like a great binary hypothesis test to distinguish the two distributions, PW X n Y n Ŵ and QW X n Y n Ŵ .
Since any hypothesis test can’t beat the optimal Neyman-Pearson test, we get the upper bound
1
β1− (PW X n Y n Ŵ , QW X n Y n Ŵ ) ≤ (22.1)
M
(Recall that βα (P, Q) = inf P [E]≥α Q[E]). Since the likelihood ratio is a sufficient statistic for this
hypothesis test, we can test only between
From the Neyman Pearson test, the optimal HT takes the form
PX n Y n PX n Y n
βα (PX n Y n , PX n (PY∗ )n ) = Q log ≥ γ where α = P log ≥ γ
| {z } | {z } PX n (PY∗ )n PX n (PY∗ )n
P Q
236
So under each hypothesis P and Q, the difference Y n − X n takes the form
1
Q : Y n − X n ∼ Bern( )n
2
P : Y − X ∼ Bern(δ)n
n n
Putting this all together, we see that any (n, M, ) code for the BSC satisfies
1 1 1
2−nd(δk 2 )+o(n) ≤ =⇒ log M ≤ nd(δk ) + o(n)
M 2
Since this is satisfied for all codes, it is also satisfied for the optimal code, so we get the converse
bound
1 1
lim inf log M ∗ (n, ) ≤ d(δk ) = log 2 − h(δ)
n→∞ n 2
For a general channel, this computation can be much more difficult. The expression for β in this
case is
∗ 1
β1− (PX n PY n |X n , PX n (PY∗ )n ) = 2−nD(PY |X kPY |P̄X )+o(n) ≤ (22.3)
M
where P̄X is unknown (depending on the code).
Explanation of (22.3): A statistician observes sequences of (X n , Y n ):
Xn = [ 0 1 2 0 0 1 2 2 ]
Yn =[a b b a c c a b]
On the marked three blocks, test between iid samples of PY |X=0 vs PY∗ , which has exponent
D(PY |X=0 kPY∗ ). Thus, intuitively averaging over the composition of the codeword we get that the
exponent of β is given by (22.3).
Recall that from the saddle point characterization of capacity (Theorem 4.4) for any distribution
P̄X we have
D(PY |X kPY∗ |P̄X ) ≤ C . (22.4)
Thus from (22.3) and (22.1):
Sketch 2: (More formal) Again, we will choose a dummy auxiliary channel QY n |X n = (QY )n .
However, choice of QY will depend on one of the two cases:
237
1. If |B| < ∞ we take QY = PY∗ (the caod) and note that from (18.12) we have
X
PY |X (y|x0 ) log2 PY |X (y|x0 ) ≤ log2 |B| ∀x0 ∈ A
y
and since miny PY∗ (y) > 0 (without loss of generality), we conclude that for any distribution
of X on A we have
PY |X (Y |X)
Var log |X ≤ K < ∞ ∀PX . (22.5)
QY (Y )
By simple counting it is clear that from any (n, M, ) code, it is possible to select an (n, M 0 , )
subcode, such that a) all codeword have the same composition P0 ; and b) M 0 > nM |A| . Note
that, log M = log M 0 + O(log n) and thus we may replace M with M 0 and focus on the analysis
of the chosen subcode. Then we set QY = PY |X ◦ P0 . In this case, from (18.9) we have
PY |X (Y |X)
Var log |X ≤ K < ∞ X ∼ P0 . (22.7)
QY (Y )
Furthermore, we also have
PY |X (Y |X)
E log |X = D(PY |X kQY |P0 ) = I(X; Y ) ≤ C X ∼ P0 . (22.8)
QY (Y )
238
From here, we apply Chebyshev inequality and (22.5) or (22.7) to get
√ K 02
P Sn ≤ n E[Sn |X n ] + K 0 n|X n ≥ 1 − .
K
K 02
If we set K 0 so large that 1 − K > 2 then overall we get that
√
log β1− (PX n Y n , PX n (QY )n ) ≥ −nC − K 0 n − log .
In the homework, we will explore in detail proofs of the strong converse for the BSC and the AWGN
channel.
239
It is easy to see that PY∗ = Unif[0, 1] is the capacity-achieving output distribution and
sup I(X; Y ) = C .
PX
Thus by Theorem 18.6 the capacity of the corresponding stationary memoryless channel is C. We
next show that nevertheless the -capacity can be strictly greater than C.
Indeed, fix blocklength n and consider a single letter distribution PX assigning equal weights
to all atoms (j, m) with m = exp{2nC}. It can be shown that in this case, the distribution of a
single-letter information density is given by
(
1
2nC, w.p. 2n
i(X; Y ) ≈ 1
0, w.p.1 − 2n
Therefore, from Theorem 17.1 we get that for > 1 − e−1/2 there exist (n, M, )-codes with
log M ≥ 2nC .
In particular,
C ≥ 2C ∀ > 1 − e−1/2
1. DMC
The following expansion holds for a fixed 0 < < 1/2 and n → ∞
√
log M ∗ (n, ) = nC − nV Q−1 () + O(log n)
where Q−1 is the inverse of the complementary standard normal CDF, the channel capacity is
C = I(X ∗ ; Y ∗ ) = E[i(X ∗ ; Y ∗ )], and the channel dispersion2 is V = Var[i(X ∗ ; Y ∗ )|X ∗ ].
2
There could be multiple capacity-achieving input distributions, in which case PX ∗ should be chosen as the one
that minimizes Var[i(X ∗ ; Y ∗ )|X ∗ ]. See [PPV10a] for more details.
240
√
Proof. For achievability, we have shown (Theorem 18.7) that log M ∗ (n, ) ≥ nC − nV Q−1 () by
refining the proof of the noisy channel coding theorem using the CLT.
The converse statement is log M ∗ ≤ − log β1− (PX n Y n , PX n (PY∗ )n ). For the BSC, we showed
that the RHS of the previous expression is
1 1 √ √
− log β1− (Bern(δ)n , Bern( )n ) = nd(δk ) + nV Q−1 () + o( n)
2 2
(see homework) where the dispersion is
" #
Bern(δ)
V = VarZ∼Bern(δ) log (Z) .
Bern( 12 )
Remark: This expansion only applies for certain channels (as described in the theorem). If,
for example, Var[i(X; Y )] = ∞, then the theorem need not hold and there are other stable (non-
Gaussian) distributions that we might converge to instead. Also notice that for DMC without cost
constraint
Var[i(X ∗ ; Y ∗ )|X ∗ ] = Var[i(X ∗ ; Y ∗ )]
since (capacity saddle point!) E[i(X ∗ ; Y ∗ )|X ∗ = x] = C for PX ∗ -almost all x.
22.3.1 Applications
As stated earlier, direct computation of M ∗ (n, ) by exhaustive search doubly exponential in
complexity, and thus is infeasible in most cases. However, we can get an easily computable
approximation using the channel dispersion via
√
log M ∗ (n, ) ≈ nC − nV Q−1 ()
Consider a BEC (n = 500, δ = 1/2) as an example of using this approximation. For this channel,
the capacity and dispersion are
C =1−δ
V = δ δ̄
Where δ̄ = 1 − δ. Using these values, our approximation for this BEC becomes
√ p
log M ∗ (500, 10−3 ) ≈ nC − nV Q−1 () = nδ̄ − nδ δ̄Q−1 (10−3 ) ≈ 215.5 bits
In the homework, for the BEC(500, 1/2) we obtained bounds 213 ≤ log M ∗ (500, 10−3 ) ≤ 217, so
this approximation falls in the middle of these bounds.
Examples of Channel Dispersion
241
For a few common channels, the dispersions are
Punchline: Although the only machinery needed for this approximation is the CLT, the results
produced are incredibly useful. Even though log M ∗ is nearly impossible to compute on its own, by
only finding C and V we are able to get a good approximation that is easily computable.
10−4 10−4
P∗ SNR P∗ SNR
After inspecting these plots, one may believe that the k1 → n1 code is better, since it requires
a smaller SNR to achieve the same error probability. However, there are many factors, such as
blocklength, rate, etc. that don’t appear on these plots. To get a fair comparison, we can use the
notion of normalized rate. To each (n, 2k , )-code, define
k k
Rnorm = ∗ ≈ p
log2 MAW GN (n, , P ) nC(P ) − nV (P )Q−1 ()
Take = 10−4 , and P (SNR) according to the water fall plot corresponding to Pe = 10−4 , and we
can compare codes directly (see Fig. 22.1). This normalized rate gives another motivation for the
expansion given in Theorem 22.2.
242
Normalized rates of code families over AWGN, Pe=0.0001
1
0.95
0.9
Galileo HGA
Turbo R=1/2
0.75 Cassini/Pathfinder
Galileo LGA
Hermitian curve [64,32] (SDD)
0.7 Reed−Solomon (SDD)
BCH (Koetter−Vardy)
Polar+CRC R=1/2 (List dec.)
0.65 ME LDPC R=1/2 (BP)
0.6
0.55
0.5 2 3 4 5
10 10 10 10
Blocklength, n
Normalized rates of code families over BIAWGN, Pe=0.0001
1
0.95
0.9
Turbo R=1/3
Turbo R=1/6
Turbo R=1/4
0.85
Voyager
Normalized rate
Galileo HGA
Turbo R=1/2
Cassini/Pathfinder
0.8
Galileo LGA
BCH (Koetter−Vardy)
Polar+CRC R=1/2 (L=32)
Polar+CRC R=1/2 (L=256)
0.75
Huawei NB−LDPC
Huawei hybrid−Cyclic
ME LDPC R=1/2 (BP)
0.7
0.65
0.6 2 3 4 5
10 10 10 10
Blocklength, n
Figure 22.1: Normalized rates for various codes. Plots generated via [Spe15].
243
Sk Encoder Xn Yn Decoder Ŝ k
Source (JSCC) Channel (JSCC)
• Goal: P[S k 6= Ŝ k ] ≤
• Encoder: f : Ak → X n
• Decoder: g : Y n → Ak
• Fundamental Limit (Optimal probability of error): ∗JSCC (k, n) = inf f,g P[S k 6= Ŝ k ]
k
where the rate is R = n (symbol per channel use).
Note: In channel coding we are interested in transmitting M messages and all messages are born
equal. Here we want to convey the source realizations which might not be equiprobable (has
redundancy). Indeed, if S k is uniformly distributed on, say, {0, 1}k , then we are back to the channel
coding setup with M = 2k under average probability of error, and ∗JSCC (k, n) coincides with
∗ (n, 2k ) defined in Section 22.1.
Note: Here, we look for a clever scheme to directly encode k symbols from A into a length n channel
input such that we achieve a small probability of error over the channel. This feels like a mix of two
problems we’ve seen: compressing a source and coding over a channel. The following theorem shows
that compressing and channel coding separately is optimal. This is a relief, since it implies that we
do not need to develop any new theory or architectures to solve the Joint Source Channel Coding
problem. As far as the leading term in the asymptotics is concerned, the following two-stage scheme
is optimal: First use the optimal compressor to eliminate all the redundancy in the source, then use
the optimal channel code to add redundancy to combat the noise in the data transmission.
Theorem 22.3. Let the source {Sk } be stationary memoryless on a finite alphabet with entropy H.
Let the channel be stationary memoryless with finite capacity C. Then
(
→ 0 R < C/H
∗JSCC (nR, n) n → ∞.
6→ 0 R > C/H
Note: Interpretation: Each source symbol has information content (entropy) H bits. Each channel
use can convey C bits. Therefore to reliably transmit k symbols over n channel uses, we need
kH ≤ nC.
Proof. Achievability. The idea is to separately compress our source and code it for transmission.
Since this is a feasible way to solve the JSCC problem, it gives an achievability bound. This
separated architecture is
f1 f2 PY n |X n g2 g1
S k −→ W −→ X n −→ Y n −→ Ŵ −→ Ŝ k
Where we use the optimal compressor (f1 , g1 ) and optimal channel code (maximum probability of
error) (f2 , g2 ). Let W denote the output of the compressor which takes at most Mk values. Then
1
(From optimal compressor) log Mk > H + δ =⇒ P[Ŝ k 6= S k (W )] ≤ ∀k ≥ k0
k
1
(From optimal channel code) log Mk < C − δ =⇒ P[Ŵ 6= m|W = m] ≤ ∀m, ∀k ≥ k0
n
244
Using both of these,
And therefore if R(H + δ) < C − δ, then ∗ → 0. By the arbitrariness of δ > 0, we conclude the
weak converse for any R > C/H.
Converse: channel-substitution proof. Let QS k Ŝ k = US k PŜ k where US k is the uniform
distribution. Using data processing
1
D(PS k Ŝ k kQS k Ŝ k ) = D(PS k kUS k ) + D(PŜ|S k kPŜ |PS k ) ≥ d(1 − k )
|A|k
Rearranging this gives
1
I(S k ; Ŝ k ) ≥ d(1 − k ) − D(PS k kUS k )
|A|k
log |A| + H(S k ) − k log |A|
≥ − log 2 + k¯
= H(S k ) − log 2 − k log |A|
Which follows from expanding out the terms. Now, normalizing and taking the sup of both sides
gives
1 1 k
sup I(X n ; Y n ) ≥ H(S k ) − log |A| + o(1)
n Xn n n
letting R = k/n, this shows
RH − C
C ≥ RH − R log |A| =⇒ ≥ >0
R log |A|
where the last expression is positive when R > C/H.
Converse: usual proof. Any JSCC encoder/decoder induces a Markov chain
S k → X n → Y n → Ŝ k .
On the other hand, since P[S k 6= Ŝ k ] ≤ n , Fano’s inequality (Theorem 5.3) yields
245
§ 23. Channel coding with feedback
f1 : [M ] → A
f2 : [M ] × B → A
..
.
fn : [M ] × B n−1 → A
• Decoder:
g : B n → [M ]
246
Here the symbol transmitted at time t depends on both the message and the history of received
symbols:
Xt = ft (W, Y1t−1 )
Hence the probability space is as follows:
W ∼ uniform on [M ]
PY |X
X1 = f1 (W ) −→ Y1
.. −→ Ŵ = g(Y n )
.
PY |X
Xn = fn (W, Y1n−1 ) −→ Yn
Cf b = C = Ci = sup I(X; Y )
PX
2. At (t+1)-th step, having knowledge of πt all messages are partitioned into classes Pa , according
to the values ft+1 (·, Y t ):
Pa , {j ∈ [M ] : ft+1 (j, Y t ) = a} a ∈ A.
Then transmitter, possessing the knowledge of the true message W , selects a letter Xt+1 =
ft+1 (W, Y t ).
3. Channel perturbs Xt+1 into Yt+1 and both parties compute the updated posterior:
Notice that (this is the crucial part!) the random multiplier satisfies:
XX PY |X (y|a)
E[log Bt+1 (W )|Y t ] = πt (Pa ) log P = I(π̃t , PY |X ) (23.1)
a∈A y∈B a∈A πt (Pa )a
247
The goal of the code designer is to come up with such a partitioning {Pa , a ∈ A} that the speed
of growth of πt (W ) is maximal. Now, analyzing the speed of growth of a random-multiplicative
process is best done by taking logs:
t
X
log πt (j) = log Bs + log π0 (j) .
s=1
Intutively, we expect that the process log πt (W ) resembles a random walk starting from − log M and
having a positive drift. Thus to estimate the time it takes for this process to reach value 0 we need
to estimate the upward drift. Appealing to intuition and the law of large numbers we approximate
t
X
log πt (W ) − log π0 (W ) ≈ E[log Bs ] .
s=1
Finally, from (23.1) we conclude that the best idea is to select partitioning at each step in such a
way that π̃t ≈ PX∗ (caid) and this obtains
log πt (W ) ≈ tC − log M ,
implying that the transmission terminates in time ≈ logCM . The important lesson here is the
following: The optimal transmission scheme should map messages to channel inputs in such a way
that the induced input distribution PXt+1 |Y t is approximately equal to the one maximizing I(X; Y ).
This idea is called posterior matching and explored in detail in [SF11].1
Converse: we are left to show that Cf b ≤ Ci .
Recall the key in proving weak converse for channel coding without feedback: Fano’s inequality
plus the graphical model
W → X n → Y n → Ŵ . (23.2)
Then
−h() + ¯ log M ≤ I(W ; Ŵ ) ≤ I(X n ; Y n ) ≤ nCi .
With feedback the probabilistic picture becomes more complicated as the following figure shows
for n = 3 (dependence introduced by the extra squiggly arrows):
X1 Y1 X1 Y1
W X2 Y2 Ŵ W X2 Y2 Ŵ
X1 Y1 X1 Y1
without feedback with feedback
1
Note that the magic of Shannon’s theorem is that this optimal partitioning can also be done blindly, i.e. it is
∗
possible to preselect partitions Pa in a way that is independent of πt (but dependent on t) and so that πt (Pa ) ≈ PX (a)
with overwhelming probability and for all t ∈ [n].
248
So, while the Markov chain realtion in (23.2) is still true, the input-output relation is no longer
memoryless2
n
Y
PY n |X n (y n |xn ) 6= PY |X (yj |xj ) (!)
j=1
There is still a large degree of independence in the channel, though. Namely, we have
(Y t−1 , W ) →Xt → Yt , t = 1, . . . , n (23.3)
W → Y n → Ŵ (23.4)
Then
−h() + ¯ log M ≤ I(W ; Ŵ ) (Fano)
n
≤ I(W ; Y ) (Data processing applied to (23.4))
Xn
= I(W ; Yt |Y t−1 ) (Chain rule)
t=1
Xn
≤ I(W, Y t−1 ; Yt ) (I(W ; Yt |Y t−1 ) = I(W, Y t−1 ; Yt ) − I(Y t−1 ; Yt ))
t=1
Xn
≤ I(Xt ; Yt ) (Data processing applied to (23.3))
t=1
≤ nCt
The following result (without proof) suggests that feedback does not even improve the speed of
approaching capacity either (under fixed-length block coding) and can at most improve smallish
log n terms:
Theorem 23.2 (Dispersion with feedback). For weakly input-symmetric DMC (e.g. additive noise,
BSC, BEC) we have: √
log Mf∗b (n, ) = nC − nV Q−1 () + O(log n)
(The meaning of this is that for such channels feedback can at most improve smallish log n
terms.)
249
⊥ W , so that Q[W = Ŵ ] =
a) Under Q, W ⊥ 1
M while P[W = Ŵ ] ≥ 1 − .
b) The two graphical models give the factorization:
thus D(P kQ) = I(X n ; Y n ) measures the information flow through the links X n → Y n .
n
1 dpi mem−less,stat X
−h() + ¯ log M = d(1 − k ) ≤ D(P kQ) = I(X n ; Y n ) = I(X; Y ) ≤ nCi
M
i=1
(23.7)
2. Notice that when feedback is present, X n → Y n is not memoryless due to the transmission
protocol, let’s unfold the probability space over time to see the dependence. As an example,
the graphical model for n = 3 is given below:
If we define Q similarly as in the case without feedback, we will encounter a problem at the
second
Pn last inequality in (23.7), as with feedback I(X n ; Y n ) can be significantly larger than
n n
i=1 I(X; Y ). Consider the example where X2 = Y1 , we have I(X ; Y ) = +∞ independent
of I(X; Y ).
We also make the observation that if Q is defined in (23.6), D(P kQ) = I(X n ; Y n ) measures
the information flow through all the 6→ and links. This motivates us to find a proper Q
such that D(P kQ) only captures the information flow through all the 6→ links {Xi → Yi : i =
1, . . . , n}, so that D(P kQ) closely relates to nCi , while still guarantees that W ⊥
⊥ W , so that
Q[W 6= Ŵ ] = M . 1
3. Formally, we shall restrict QW,X n ,Y n ,Ŵ ∈ Q, where Q is the set of distributions that can be
factorized as follows:
QW,X n ,Y n ,Ŵ = QW QX1 |W QY1 QX2 |W,Y1 QY2 |Y1 · · · QXn |W,Y n−1 QYn |Y n−1 QŴ |Y n (23.8)
PW,X n ,Y n ,Ŵ = PW PX1 |W PY1 |X1 PX2 |W,Y1 PY2 |X2 · · · PXn |W,Y n−1 PYn |Xn PŴ |Y n (23.9)
250
⊥ W under Q: W and Ŵ are d-separated by X n .
Verify that W ⊥
Notice that in the graphical models, when removing 9 we also added the directional links
between the Yi ’s, these links serve to maximally preserve the dependence relationships between
variables when 9 are removed, so that Q is the “closest” to P while W ⊥ ⊥ W is satisfied.
1
Now we have that for Q ∈ Q, d(1 − k M ) ≤ D(P kQ), in order to obtain the least upper
bound, in Lemma 23.1 we shall show that:
n
X
inf D(PW,X n ,Y n ,Ŵ kQW,X n ,Y n ,Ŵ ) = I(Xt ; Yt |Y t−1 )
Q∈Q
t=1
n
X
= EY t−1 [I(PXt |Y t−1 , PY |X )]
t=1
Xn
≤ I(EY t−1 [PXt |Y t−1 ], PY |X ) (concavity of I(·, PY |X ))
t=1
Xn
= I(PXt , PY |X )
t=1
≤nCi .
nC + h() C
−h() + ¯ log M ≤ nCi ⇒ log M ≤ ⇒ Cf b, ≤ ⇒ Cf b ≤ C.
1− 1−
4. Notice that the above proof is also valid even when cost constraint is present.
Lemma 23.1.
n
X
inf D(PW,X n ,Y n ,Ŵ kQW,X n ,Y n ,Ŵ ) = I(Xt ; Yt |Y t−1 ) (23.10)
Q∈Q
t=1
~ n ; Y n ), directed information)
(, I(X
Proof. By chain rule, we can show that the minimizer Q ∈ Q must satisfy the following equalities:
QX,W = PX,W ,
QXt |W,Y t−1 = PXt |W,Y t−1 , (check!)
QŴ |Y n = PW |Y n .
and therefore
= D(PY1 |X1 kQY1 |X1 ) + D(PY2 |X2 ,Y1 kQY2 |Y1 |X2 , Y1 ) + · · · + D(PYn |Xn ,Y n−1 kQYn |Y n−1 |Xn , Y n−1 )
= I(X1 ; Y1 ) + I(X2 ; Y2 |Y1 ) + · · · + I(Xn ; Yn |Y n−1 )
251
23.3 When is feedback really useful?
Theorems 23.1 and 23.2 state that feedback does not improve communication rate neither asymptot-
ically nor for moderate blocklengths. In this section, we shall examine three cases where feedback
turns out to be very useful.
1
Cf b,0 = max min log (23.11)
PX y∈B PX (Sy )
where
Sy = {x ∈ A : PY |X (y|x) > 0}
denotes the set of input symbols that can lead to the output symbol y.
Note: For stationary memoryless channel,
def. def. Thm 23.1 Shannon
C0 ≤ Cf b,0 ≤ Cf b = lim Cf b, = C = lim C = Ci = sup I(X; Y )
→0 →0 PX
All capacity quantities above are defined with (fixed-length) block codes.
Note:
1. In DMC for both zero-error capacities (C0 and Cf b,0 ) only the support of the transition
matrix PY |X , i.e., whether PY |X (b|a) > 0 or not, matters. The value of PY |X (b|a) > 0
is irrelevant. That is, C0 and Cf b,0 are functions of a bipartite graph between input and
output alphabets. Furthermore, the C0 (but not Cf b,0 !) is a function of the confusability
graph – a simple undirected graph on A with a 6= a0 connected by an edge iff ∃b ∈ B
s.t. PY |X (b|a)PY |X (b|a0 ) > 0.
2. That Cf b,0 is not a function of the confusability graph alone is easily seen from comparing the
polygon channel (next remark) with L = 3 (for which Cf b,0 = log 23 ) and the useless channel
with A = {1, 2, 3} and B = {1} (for which Cf b,0 = 0). Clearly in both cases confusability
graph is the same – a triangle.
3. Usually C0 is very hard to compute, but Cf b,0 can be obtained in closed form as in (23.11).
Example: (Polygon channel)
1
5
4
3
252
• Zero-error capacity C0 :
– L = 3: C0 = 0
– L = 5: C0 = 12 log 5 (Shannon ’56-Lovasz ’79).
Achievability:
a) blocklength one: {1, 3}, rate = 1 bit.
b) blocklength two: {(1, 1), (2, 3), (3, 5), (4, 2), (5, 4)}, rate = 12 log 5 bit – optimal!
– L = 7: 3/5 log 7 ≤ C0 ≤ log 3.32 (Exact value unknown to this day)
– Even L = 2k: C0 = log L2 for all k (Why? Homework.).
– Odd L = 2k + 1: C0 = log L2 + o(1) as k → ∞ (Bohman ’03)
• Zero-error capacity with feedback (proof: exercise!)
L
Cf b,0 = log , ∀L,
2
which can be strictly bigger than C0 .
4. Notice that Cf b,0 is not necessarily equal to Cf b = lim→0 Cf b, = C. Here is an example when
Example:
Then
C0 = log 2
2
Cf b,0 = max − log max( δ, 1 − δ) (PX∗ = (δ/3, δ/3, δ/3, δ̄))
δ 3
5 3
= log > C0 (δ ∗ = )
2 5
On the other hand, Shannon capacity C = Cf b can be made arbitrarily close to log 4 by
picking the cross-over probability arbitrarily close to zero, while the confusability graph stays
the same.
Proof of Theorem 23.3. 1. Fix any (n, M, 0)-code. Denote the confusability set of all possible
messages that could have produced the received signal y t = (y1 , . . . , yt ) for all t = 0, 1, . . . , n
by:
= 0 ⇔ ∀y n ∈ B n , |En (y n )| = 1 or 0. (23.12)
253
2. The key quantities in the proof are defined as follows:
θf b = min max PX (Sy ),
PX y∈B
4. “≥” (achievability)
Let’s construct a code that achieves (M, n, 0).
254
The above example with |A| = 3 illustrates that the encoder f1 partitions the space of all
messages to 3 groups. The encoder f1 at the first stage encodes the groups of messages into
a1 , a2 , a3 correspondingly. When channel outputs y1 and assume that Sy1 = {a1 , a2 }, then the
decoder can eliminate a total number of M PX∗ (a3 ) candidate messages in this round. The
“confusability set” only contains the remaining M PX∗ (Sy1 ) messages. By definition of PX∗ we
know that M PX∗ (Sy1 ) ≤ M θf b . In the second round, f2 partitions the remaining messages
into three groups, send the group index and repeat.
By similar arguments, each interaction reduces the uncertainty by a factor of at least θf b .
After n iterations, the size of “confusability set” is upper bounded by M θfnb , if M θfnb ≤ 1,3
then zero error probability is achieved. This is guaranteed by choosing log M = −n log θf b .
Therefore we have shown that −n log θf b bits can be reliably delivered with n + O(1) channel
uses with feedback, thus
Cf b,0 ≥ − log θf b
lC
log MV∗ LF (l, ) = + O(log l)
1−
Example: For BSC(0.11), without feedback, n = 3000 is needed to achieve 90% of capacity C,
while with VLF code l = En = 200 is enough to achieve that.
Yk = Xk + Zk , Zk ∼ N (0, σ 2 ) i.i.d.
E[Xk2 ] ≤ P, power constraint in expectation
3 ∗
Some rounding-off errors need to be corrected in a few final steps (because PX may not be closely approximable
when very few messages are remaining). This does not change the asymptotics though.
255
Note:
Pn If we insist the codeword satisfies power constraint almost surely instead on average, i.e.,
2
k=1 Xk ≤ nP a.s., then the scheme below does not work!
256
Therefore, with Elias’ scheme of sending A ∼ N (0, Var A), after the n-th use of the AWGN(P )
channel with feedback,
P n
Var Nn = Var(Ân − A) = 2−2nC Var A = Var A,
P + σ2
which says that the reduction of uncertainty in the estimation is exponential fast in n.
Schalkwijk-Kailath Elias’ scheme can also be used to send digital data. Let W ∼ uniform on
2 2k
M -PAM constellation in ∈ [−1, 1], i.e., {−1, −1 + M , · · · , −1 + M , · · · , 1}. In the very first step W
is sent (after scaling to satisfy the power constraint):
√
X0 = P W, Y0 = X0 + Z0
Since Y0 and X0 are both known at the encoder, it can compute Z0 . Hence, to describe W it is
sufficient for the encoder to describe the noise realization Z0 . This is done by employing the Elias’
scheme (n − 1 times). After n − 1 channel uses, and the MSE estimation, the equivalent channel
output:
Finally, the decoder quantizes Ye0 to the nearest PAM point. Notice that
" √ # √ !
1 P 2 (n−1)C P
≤ P |Ze0 | > =P 2 −(n−1)C
|Z| > = 2Q
2M 2M 2M
√
P
⇒ log M ≥ (n − 1)C + log − log Q−1 ( )
2 2
= nC + O(1).
Hence if the rate is strictly less than capacity, the error probability decays doubly exponentially
√
fast as n increases. More importantly, we gained an n term in terms of log M , since for the case
without feedback we have
√
log M ∗ (n, ) = nC − nV Q−1 () + O(log n) .
Example: = 1⇒ channel capacity C = 0.5 bit per channel use. To achieve error probability
P(n−1)C (n−1)C
−3
10 , 2Q 2
2M ≈ 10−3 , so e 2M ≈ 3, and lognM ≈ n−1 log 8
n C − n . Notice that the capacity is
achieved to within 99% in as few as n = 50 channel uses, whereas the best possible block codes
without feedback require n ≈ 2800 to achieve 90% of capacity.
Take-away message:
Feedback is best harnessed with adaptive strategies. Although it does not increase capacity
under block coding, feedback greatly boosts reliability as well as reduces coding complexity.
257
§ 24. Capacity-achieving codes via Forney concatenation
Shannon’s Noisy Channel Theorem assures us the existence of capacity-achieving codes. However,
exhaustive search for the code has double-exponential complexity: Search over all codebook of size
2nR over all possible |X |n codewords.
Plan for today: Constructive version of Shannon’s Noisy Channel Theorem. The goal is to show
that for BSC, it is possible to achieve capacity in polynomial time. Note that we need to consider
three aspects of complexity
• Encoding
• Decoding
Shannon’s theorem shows that for stationary memoryless channels, n , ∗ (n, exp(nR)) → 0 for any
R < C = supX I(X; Y ). Now we want to know how fast it goes to zero as n → ∞. It turns out
the speed is exponential, i.e., n ≈ exp(−nE(R)) for some error exponent E(R) as a function R,
which is also known as the reliability function of the channel. Determining E(R) is one of the most
long-standing open problems in information theory. What we know are
• Lower bound on E(R) (achievability): Gallager’s random coding bound (which analyzes the
ML decoder, instead of the suboptimal decoder as in Shannon’s random coding bound or DT
bound).
It turns out there exists a number Rcrit ∈ (0, C), called the critical rate, such that the lower and
upper bounds meet for all R ∈ (Rcrit , C), where we obtain the value of E(R). For R ∈ (0, Rcrit ), we
do not even know the existence of the exponent!
Deriving these bounds is outside the scope of this lecture. Instead, we only need the positivity
of error exponent, i.e., for any R < C, E(R) > 0. On the other hand, it is easy to see that
E(C−) = 0 as a consequence of weak converse. Since as the rate approaches capacity from below,
the communication becomes less reliable. The next theorem is a simple application of large deviation.
258
Theorem 24.1. For any DMC, for any R < C = supX I(X; Y ),
Proof. Fix R < C so that C − R > 0. Let PX∗ be the capacity-achieving input distribution, i.e.,
C = I(X ∗ ; Y ∗ ). Recall Shannon’s random coding bound (DT/Feinstein work as well):
As usual, we apply this bound with iid PX n = (PX∗ )n , log M = nR and τ = n(C−R)2 , to conclude the
achievability of
1 n n C +R n(C − R)
n ≤ P i(X ; Y ) ≤ + exp − .
n 2 2
P
Since i(X n ; Y n ) = i(Xk ; Yk ) is an iid sum, and Ei(X; Y ) = C > (C + R)/2, the first term is
upper bounded by exp(−nψT∗ ( R+C 2 )) where T = i(X; Y ). The proof is complete since n is smaller
than the sum of two exponentially small terms.
Note: Better bound can be obtained using DT bound. But to get the best lower bound on E(R)
we know (Gallager’s random coding bound), we have to analyze the ML decoder.
m ≤ exp(−mE(R)) = n−bE(R)
where E(R) > 0. Apply this code to each m-bit sub-block and apply ML decoding to each block.
n
The encoding/decoding complexity is at most m exp(O(m)) = nO(1) . To analyze the probability of
error, use union bound:
n
Pe ≤ m ≤ n−bE(R)+1 ≤ n−α ,
m
α+1
if we choose b ≥ E(R) .
Remark 24.1. The final question boils down to how to find the shorter code of blocklength m in
poly(n)-time. This will be done if we can show that we can find good code (satisfying the Shannon
random coding bound) for BSC of blocklenth m in exponential time. To this end, let us go through
the following strategies:
1. Exhaustive search: A codebook is a subset of cardinality 2Rm out of 2m possible codewords.
m
Total number of codebooks: 22Rm = exp(Ω(m2Rm )) = exp(Ω(nc log n)). The search space is
too big.
2. Linear codes: In Lecture 18 we have shown that for additive-noise channels on finite fields we
can focus on linear codes. For BSC, each linear code is parameterized by a generator matrix,
2
with Rm2 entries. Then there are a total of 2Rm = nΩ(log n) – still superpolynomial and we
cannot afford the search over all linear codes.
259
3. Toeplitz generator matrices: In homework we see that it does not lose generality to focus on
linear codes with Toeplitz generator matrices, i.e., G such that Gij = Gi−1,j−1 for all i, j > 1.
Toeplitz matrices are determined by diagonals. So there are at most 22m = nO(1) and we can
find the optimal one by exhaustive search in poly(n)-time.
Since the channel is additive-noise, linear codes + syndrome decoder leads to the same maximal
probability of error as average (Lecture 18).
The concatenated code C : {0, 1}kK → {0, 1}nN works as follows (Fig. 24.1):
1. Collect the kK message bits into K symbols in the alphabet B, apply Cout componentwise to
get a vector in B N
2. Map each symbol in B into k bits and apply Cin componentwise to get an nN -bit codeword.
kK
The rate of the concatenated code is the product of the rates of the inner and outer codes: R = nN.
Cin Din
Cin Din
kK k n k kK
Cout Cin Din Dout
Cin Din
Cin Din
Figure 24.1: Concatenated code, where there are N inner encoder-decoder pairs.
• Use a Reed-Solomon code as the outer code which can correct a constant fraction of errors.
260
Reed-Solomon (RS) codes are linear codes from FK q → Fq where the block length N = q − 1
N
and the message length is K. Similar to P the Reed-Muller code, the RS code treats the input
(a0 , a1 , . . . , aK−1 ) as a polynomial p(x) = K−1
i=0 ai z i over F of degree at most K − 1, and encodes it
q
by its values at all non-zero elements. Therefore the RS codeword is a vector (p(α) : α ∈ Fq \{0}) ∈
FNq . Therefore the generator matrix of RS code is a Vandermonde matrix.
The RS code has the following advantages:
2. The encoding and decoding (e.g., Berlekamp-Massey decoding algorithm) can be implemented
in poly(N ) time.
In fact, as we will see later, any efficient code which can correct a constant fraction of errors will
suffice as the outer code for our purpose.
Now we show that we can achieve any rate below capacity and exponentially small probability
of error in polynomial time: Fix η, > 0 arbitrary.
• Inner code: Let k = (1 − h(δ) − η)n. By Theorem 24.1, there exists a Cin : {0, 1}k → {0, 1}n ,
which is a linear (n, 2k , n )-code and maximal error probability n ≤ 2−nE(η) . By Remark 24.1,
Cin can be chosen to be a linear code with Toeplitz generator matrix, which can be found in
2n time. The inner decoder is ML, which we can afford since n is small.
• Outer code: We pick the RS code with field size q = 2k with blocklength N = 2k − 1. Pick
the number of message bits to be K = (1 − )N . Then we have Cout : FK
2k
→ FN
2k
.
Then we obtain a concatenated code C : {0, 1}kK → {0, 1}nN with blocklength L = nN = n2Cn for
some constant C and rate R = (1 − )(1 − h(δ) − η). It is clear that the code can be constructed in
2n = poly(L) time and all encoding/decoding operations are poly(L) time.
Now we analyze the probability of error: Let us conditioned on the message bits (input to Cout ).
Since the outer code can correct N
2 errors, an error happens only if the number of erroneous inner
N
encoder-decoder pairs exceeds 2 . Since the channel is memoryless, each of the N pairs makes an
error independently1 with probability at most n . Therefore the number of errors is stochastically
smaller than Bin(N, n ), and we can upper bound the total probability of error using Chernoff
bound:
N
Pe ≤ P Bin(N, n ) ≥ ≤ exp (−N d(/2kn )) = exp (−Ω(N log N )) = exp(−Ω(L)).
2
where we have used = Ω(1) and hence n ≤ exp(−Ω(n)) and d(/2kn ) ≥ 2 log 2n = Ω(n) =
Ω(log N ).
Note: For more details see the excellent exposition by Spielman [Spi97]. For modern constructions
using sparse graph codes which achieve the same goal in linear time, see, e.g., [Spi96].
1
Here controlling the maximal error probability of inner code is the key. If we only have average error probability,
then given a uniform distributed input to the RS code, the output symbols (which are the inputs to the inner encoders)
need not be independent, and Chernoff bound is not necessarily applicable.
261
Part V
262
§ 25. Rate-distortion theory
1. Lossless data compression: Given a discrete ergodic source S k , we know how to encode to
pure bits W ∈ [2k ].
2. Binary HT: Given two distribution P and Q, we know how to distinguish them optimally.
Next topic, lossy data compression: Given X, find a k-bit representation W , X → W → X̂,
such that X̂ is a good reconstruction of X.
Real-world examples: codecs consist of a compressor and a decompressor
• Image: JPEG...
• Video: MPEG...
Domain Range
Continuous Analog
Sampling time Quantization
Signal
Discrete Digital
time
263
25.1.1 Scalar Uniform Quantization
The idea of qunatizing an inherently continuous-valued signal was most explicitly expounded in the
patenting of Pulse-Coded Modulation (PCM) by A. Reeves, cf. [Ree65] for some interesting historical
notes. His argument was that unlike AM and FM modulation, quantized (digital) signals could be
sent over long routes without the detrimental accumulation of noise. Some initial theoretical analysis
of the PCM was undertaken in 1947 by Oliver, Pierce, and Shannon (same Shannon), cf. [OPS48].
For a random variable X ∈ [−A/2, A/2] ⊂ R, the scalar uniform quantizer qU (X) with N
quantization points partitions the interval [−A/2, A/2] uniformly
N equally spaced points
−A A
2 2
where D denotes the average distortion. Often R = log2 N is used instead of N , so that we think
about the number of bits we can use for quantization instead of the number of points. To analyze
this scalar uniform quantizer, we’ll look at the high-rate regime (R 1). The key idea in the high
rate regime is that (assuming a smooth density PX ), each quantization interval ∆j looks nearly flat,
so conditioned on ∆j , the distribution is accurately approximately by a uniform distribution.
Nearly flat for
large partition
∆j
Let cj be the j-th quantization point, and ∆j be the j-th quantization interval. Here we have
N
X
DU (R) = E|X − qU (X)| = 2
E[|X − cj |2 |X ∈ ∆j ]P[X ∈ ∆j ] (25.1)
j=1
N
X |∆j |2
(high rate approximation) ≈ P[X ∈ ∆j ] (25.2)
12
j=1
A 2
(N ) A2 −2R
= = 2 , (25.3)
12 12
where we used the fact that the variance of Uniform[−a, a] = a2 /3.
How much do we gain per bit?
Var(X)
10 log10 SN R = 10 log10
E|X − qU (X)|2
12V ar(X)
= 10 log10 + (20 log10 2)R
A2
= constant + (6.02dB)R
264
For example, when X is uniform on [− A2 , A2 ], the constant is 0. Every engineer knows the rule
of thumb “6dB per bit”; adding one more quantization bit gets you 6 dB improvement in SNR.
However, here we can see that this rule of thumb is valid only in the high rate regime. (Consequently,
widely articulated claims such as “16-bit PCM (CD-quality) provides 96 dB of SNR” should be
taken with a grain of salt.)
Note: The above deals with X with a bounded support. When X is unbounded, a wise thing to do
is to allocate the quantization points to the range of values that are more likely and saturate the
large values at the dynamic range of the quantizer. Then there are two contributions, known as
the granular distortion and overload distortion. This leads us to the question: Perhaps uniform
quantization is not optimal?
Often the way such quantizers are implemented is to take a monotone transformation of the source
f (X), perform uniform quantization, then take the inverse function:
f
X U
q qU (25.4)
X̂ qU (U )
f −1
i.e., q(X) = f −1 (qU (f (X))). The function f is usually called the compander (compressor+expander).
One of the choice of f is the CDF of X, which maps X into uniform on [0, 1]. In fact, this compander
architecture is optimal in the high-rate regime (fine quantization) but the optimal f is not the CDF
(!). We defer this discussion till Section 25.1.4.
In terms of practical considerations, for example, the human ear can detect sounds with volume
as small as 0 dB, and a painful, ear-damaging sound occurs around 140 dB. Achieveing this is
possible because the human ear inherently uses logarithmic companding function. Furthermore,
many natural signals (such as differences of consecutive samples in speech or music (but not samples
themselves!)) have an approximately Laplace distribution. Due to these two factors, a very popular
and sensible choice for f is the µ-companding function
265
which compresses the dynamic range, uses more bits for smaller |X|’s, e.g. |X|’s in the range of
human hearing, and less quantization bits outside this region. This results in the so-called µ-law
which is used in the digital telecommunication systems in the US, while in Europe a slightly different
compander called the A-law is used.
Intuitively, we would think that the optimal quantization regions should be contiguous; otherwise,
given a point cj , our reconstruction error will be larger. Therefore in one dimension quantizers are
piecewise constant:
2. Draw the Voronoi regions around the chosen quantization points (aka minimum distance
tessellation, or set of points closest to cj ), which forms a partition of the space.
3. Update the quantization points by the centroids (E[X|X ∈ D]) of each Voronoi region.
4. Repeat.
b b
b b
b b
b b
b b
Lloyd’s clever observation is that the centroid of each Voronoi region is (in general) different than
the original quantization points. Therefore, iterating through this procedure gives the Centroidal
Voronoi Tessellation (CVT - which are very beautiful objects in their own right), which can be
viewed as the fixed point of this iterative mapping. The following theorem gives the results about
Lloyd’s algorithm
Theorem 25.1 (Lloyd).
266
2. The optimal quantization strategy is always a CVT.
3. CVT’s need not be unique, and the algorithm may converge to non-global optima.
Remark 25.1. The third point tells us that Lloyd’s algorithm isn’t always guaranteed to give the
optimal quantization strategy.1 One sufficient condition for uniqueness of a CVT is the log-concavity
of the density of X [Fleischer ’64]. Thus, for Gaussian PX , Lloyd’s algorithm outputs the optimal
quantizer, but even for Gaussian, if N > 3, optimal quantization points are not known in closed
form! So it’s hard to say too much about optimal quantizers. Because of this, we next look for an
approximation in the regime of huge number of points.
Remark 25.2 (k-means). A popular clustering method called k-means is the following: Given n
data points x1 , . . . , xn ∈ Rd , the goal is to find k centers µ1 , . . . , µk ∈ Rd to minimize the objective
function
Xn
min kxi − µj k2 .
j∈[k]
i=1
This is equivalent to solving the optimal vector quantization problem analogous to (25.5):
min EkX − q(X)k2
q:|Im(q)|≤k
P
where X is distributed according to the empirical distribution over the dataset, namely, n1 ni=1 δxi .
Solving the k-means problem is NP-hard in the worst case, and Lloyd’s algorithm is commonly used
heuristic.
267
To find the optimal density λ that gives the best
R 1/3 R reconstruction
R 2/3 (minimum MSE)
R −2whenR X1/3has
−2
density p, we use Hölder’s inequality: p ≤ ( pλ ) ( λ) . Therefore pλ ≥ ( p )3 ,
1/3
1/3
with equality iff pλ−2 ∝ λ. Hence the optimizer is λ∗ (x) = Rp (x) . Therefore when N = 2R ,3
p1/3 dx
Z 3
1
Dscalar (R) ≈ 2−2R p 1/3
(x)dx
12
So our optimal quantizer density in the high rate regime is proportional to the cubic root of the
density of our source. This approximation is called the Panter-Dite approximation. For example,
• When X ∈ [− A2 , A2 ], using Hölder’s inequality again h1, p1/3 i ≤ k1k 3 kp1/3 k3 = A2/3 , we have
2
1
Dscalar (R) ≤ 2−2R A2 = DU (R)
12
where the RHS is the uniform quantization error given in (25.1). Therefore as long as the
source distribution is not uniform, there is strict improvement. For uniform distribution,
uniform quantization is, unsurprisingly, optimal.
• When X ∼ N (0, σ 2 ), this gives
√
2 −2R π 3
Dscalar (R) ≈ σ 2 (25.6)
2
Note: In fact, in scalar case the optimal non-uniform quantizer can be realized using the compander
architecture (25.4) that we discussed in Section 25.1.2: As an exercise, use Taylor expansion to
analyze the quantization error of (25.4) when N → ∞. The optimal compander f : R → [0, 1] turns
t
p1/3 (t)dt
R
out to be f (x) = R−∞
∞ [Bennett ’48, Smith ’57].
1/3 (t)dt
−∞ p
268
Now, from Jensen’s inequality we have
Z Z
1 1 1 22h(X)
pX (x) 2 dx ≥ exp{−2 pX (x) log Λ(x) dx} ≈ 2−2H(q(X)) ,
12 Λ (x) 12 12
concluding that uniform quantizer is asymptotically optimal.
Furthermore, it turns out that for any source, even the optimal vector quantizers (to be considered
2h(X)
next) can not achieve distortion better that 2−2R 2 2πe – i.e. the maximal improvement they can
gain (on any iid source!) is 1.53 dB (or 0.255 bit/sample). This is one reason why scalar uniform
quantizers followed by lossless compression is an overwhelmingly popular solution in practice.
which, compared to (25.6), saves 4.35 dB (or 0.72 bit/sample). This should be rather surprising, so
let’s reiterate: even when X1 , . . . , Xn are iid, we can get better performance by quantizing the Xi ’s
jointly. One instance of this surprising effect is the following:
Hamming Game Given 100 unbiased bits, we want to look at them and scribble something
down on a piece of paper that can store 50 bits at most. Later we will be asked to guess the
original 100 bits, with the goal of maximizing the number of correctly guessed bits. What is the best
strategy? Intuitively, it seems the optimal strategy would be to store half of the bits then randomly
guess on the rest, which gives 25% BER. However, as we will show in the next few lectures, the
optimal strategy amazingly achieves a BER of 11%. How is this possible? After all we are guessing
independent bits and the utility function (BER) treats all bits equally. Some intuitive explanation:
1. Applying scalar quantization componentwise results in quantization region that are hypercubes,
which might not be efficient for covering.
2. Concentration of measures removes many source realizations that are highly unlikely. For
example, if we think about quantizing a single Gaussian X, then we need to cover large portion
of R in order to cover the cases of significant deviations of X from 0. However, when we are
quantizing many (X1 , . . . , Xn ) together, the law of large numbers makes sure that many Xj ’s
cannot conspire together and all produce large values. Indeed, (X1 , . . . , Xn ) concentrates near
a sphere. Thus, we may exclude large portions of the space Rn from consideration.
• X ∈ X : continuous source
• W : discrete data
• X̂ ∈ X̂ : reconstruction
269
A distortion metric is a function d : X × X̂ → R ∪ {+∞} (loss function). There are various
formulations of the lossy compression problem:
3. Variable length, max distortion: W ∈ {0, 1}∗ , d(X, X̂) ≤ D a.s., minimize E[length(W )] or
H(X̂) = H(W ).
Note: In this course we focus on lossy compression with fixed length and average distortion. The
difference between average distortion and excess distortion is analogous to average risk bound and
high-probability bound in statistics/machine learning.
Definition 25.1. Rate-distortion problem is characterized by a pair of alphabets A, Â, a single-
letter distortion function d(·, ·) : A × Â → R ∪ {+∞} and a source – a sequence of A-valued r.v.’s
(S1 , S2 , . . .). A separable distortion metric is defined for n-letter vectors by averaging the single-letter
distortions:
1X
d(an , ân ) , d(ai , âi )
n
An (n, M, D)-code is
• Encoder f : An → [M ]
• Decoder g : [M ] → Ân
Fundamental limit:
Now that we have the definition, we give the (surprisingly simple) general converse
Theorem 25.2 (General Converse). For all lossy codes X → W → X̂ such that E[d(X, X̂)] ≤ D,
we have
where W ∈ [M ].
Proof.
where the last inequality follows from the fact that PX̂|X is a feasible solution (by assumption).
1. ϕX is convex, non-increasing.
270
2. ϕX continuous on (D0 , ∞), where D0 = inf{D : ϕX (D) < ∞}.
3. If
(
D0 x=y
d(x, y) =
> D0 x 6= y
4. Let
Dmax = inf Ed(X, x̂).
x̂∈X̂
Then ϕX (D) = 0 for all D > Dmax . If D0 > Dmax then also ϕX (Dmax ) = 0.
Note: If Dmax = Ed(X, x̂) for some x̂, then x̂ is the “default” reconstruction of X, i.e., the best
estimate when we have no information about X. Therefore D ≥ Dmax can be achieved for free.
This is the reason for the notation Dmax despite that it is defined as an infimum.
Example: (Gaussian with MSE distortion) For X ∼ N (0, σ 2 ) and d(x, y) = (x − y)2 , we have
2
ϕX (D) = 12 log+ σD . In this case D0 = 0 which is not attained; Dmax = σ 2 and if D ≥ σ 2 , we can
simply output X̂ = 0 as the reconstruction which requires zero bits.
Proof.
4. For any D > Dmax we can set X̂ = x̂ deterministically. Thus I(X; x̂) = 0. The second claim
follows from continuity.
In channel coding, we looked at the capacity and the information capacity. We define the
Information Rate-Distortion function in an analogous way here, which by itself is not an operational
quantity.
Definition 25.2. The Information Rate-Distortion function for a source is
1
Ri (D) = lim sup ϕS n (D) where ϕS n (D) = inf I(S n ; Ŝ n )
n→∞ n PŜ n |S n :E[d(S n ,Ŝ n )]≤D
271
3. If
(
D0 x=y
d(x, y) =
> D0 x 6= y
Proof. Properties 1-4 follow directly from corresponding properties of φS n in Theorem 25.3 and
property 5 will be established in the next section.
i.e. that our separable distortion metric d doesn’t grow too fast. Note that (by Minkowski’s
inequality) for stationary memoryless sources we have a single-letter bound:
Remark 25.3. Theorem is only useful for p > 1, since for p = 1 the right-hand side of (25.11) does
not converge to 0 as → 0.
Proof. We transform the first code into the second by adding one codeword:
(
f (x) d(x, g(f (x))) ≤ D
f 0 (x) =
M + 1 o/w
(
g(j) j ≤ M
g 0 (j) =
x̂0 j =M +1
272
Then
273
§ 26. Rate distortion: achievability bounds
26.1 Recap
Recall from the last lecture:
1
R(D) = lim sup log M ∗ (n, D), (rate distortion function)
n→∞ n
1
Ri (D) = lim sup ϕS n (D), (information rate distortion function)
n→∞ n
and
Also, we showed the general converse: For any (M, D)-code X → W → X̂ we have
log M ≥ ϕX (D)
=⇒ log M ∗ (n, D) ≥ ϕS n (D)
=⇒ R(D) ≥ Ri (D)
In this lecture, we will prove the achievability bound and establish the identity R(D) = Ri (D)
for stationary memoryless sources.
First we show that Ri (D) can be easily calculated for memoryless source without going through
the multi-letter optimization problem. This is the counterpart of Corollary 19.1 for channel capacity
(with separable cost function).
Theorem 26.1 (Single-letterization). For stationary memoryless source S n and separable distortion
d,
Ri (D) = ϕS (D)
Proof. By definition we have that ϕS n (D) ≤ nϕS (D) by choosing a product channel: PŜ n |S n =
(PŜ|S )n . Thus Ri (D) ≤ ϕS (D).
274
For the converse, take any PŜ n |S n such that the constraint E[d(S n , Ŝ n )] ≤ D is satisfied, we have
n
X
I(S n ; Ŝ n ) ≥ I(Sj , Ŝj ) (S n independent)
j=1
n
X
≥ ϕS (E[d(Sj , Ŝj )])
j=1
n
X
1
≥ nϕS E[d(Sj , Ŝj )] (convexity of ϕS )
n
j=1
Then
R(D) = Ri (D) = inf I(S; Ŝ). (26.1)
PŜ|S :E[d(S,Ŝ)]≤D
Remark 26.1. • Note that Dmax < ∞ does not imply that d(·, ·) only takes values in R,
i.e. theorem permits d(a, â) = ∞.
• It should be remarked that when Dmax = ∞ (e.g. S ∼ Cauchy) typically R(D) = ∞. Indeed,
suppose that d(·, ·) is a metric (i.e. finite valued and satisfies triangle inequality). Then, for
any x0 ∈ An we have
d(X, X̂) ≥ d(X, x0 ) − d(x0 , X̂) .
Thus, for any finite codebook {c1 , . . . , cM } we have maxj d(x0 , cj ) < ∞ and therefore
E[d(X, X̂)] ≥ E[d(X, x0 )] − max d(x0 , cj ) = ∞ .
j
So that R(D) = ∞ for any finite D. This observation, however, should not be interpreted as
absolute impossibility of compression for such sources. It is just not possible with fixed-rate
codes. As an example, for quadratic distortion and Cauchy-distributed S, Dmax = ∞ since S
has infinite second-order moments. But it is easy to see that Ri (D) < ∞ for any D ∈ (0, ∞).
In fact, in this case Ri (D) is a hyperbola-like curve that never touches either axis. A non-
trivial compression can be attained with compressors S n → W of bounded entropy H(W )
(but unbounded alphabet of W ). Indeed if we take W to be a ∆-quantized version of S and
notice that differential entropy of S is finite, we get from (25.7) that Ri (∆) ≤ H(W ) < ∞.
Interesting question: Is H(W ) = nRi (D) + o(n) attainable?
275
• Techniques in proving (26.1) for memoryless sources can be applied to prove it for “stationary
ergodic” sources with changes similar to those we have discussed in lossless compression
(Lecture 10).
Before giving a formal proof, we illustrate the intuition non-rigorously.
26.2.1 Intuition
Try to throw in M points C = {c1 , . . . , cM } ∈ Ân which are drawn i.i.d. according to a product
distribution QnŜ where QŜ is some distribution on Â. Examine the simple encoder-decoder pair:
The basic idea is the following: Since the codewords are generated independently of the source,
the probability that a given codeword offers good reconstruction is (exponentially) small, say, .
However, since we have many codewords, the chance that there exists a good one can be of high
probability. More precisely, the probability that no good codeword exist is (1 − )M , which can be
very close to zero as long as M 1 .
To explain the intuition further, let us consider the excess distortion of this code: P[d(S n , Ŝ n ) >
D]. Define
Psuccess , P[∃c ∈ C, s.t. d(S n , c) ≤ D]
Then
Thus we conclude that ∀QŜ , ∀δ > 0 we can pick M = 2n(E(QŜ )+δ) and the above code will have
arbitrarily small excess distortion:
276
We optimize QŜ to get the smallest possible M :
= ϕS (D)
Note:
• This theorem says that from an arbitrary PY |X such that Ed(X, Y ) ≤ D, we can extract a
good code with average distortion D plus some extra terms which will vanish in the asymptotic
regime.
• The proof uses the random coding argument. The role of the deterministic y0 is a “fail-safe”
codeword (think of y0 as the default reconstruction with Dmax = E[d(X, y0 )]). We add y0 to
the random codebook for “damage control”, to hedge against the (highly unlikely and unlucky)
event that we end up with a terrible codebook.
Proof. Similar to the previous intuitive argument, we apply random coding and generate the
codewords randomly and independently of the source:
i.i.d.
C = {c1 , . . . , cM } ∼ PY ⊥
⊥X
and add the “fail-safe” codeword cM +1 = y0 . We adopt the same encoder-decoder pair (26.2) –
(26.3) and let X̂ = g(f (X)). Then by definition,
To simplify notation, let Y be an independent copy of Y (similar to the idea of introducing unsent
codeword X in channel coding – see Lecture 17):
PX,Y,Y = PX,Y PY
277
where PRY = PY . Recall the formula for computing the expectation of a random variable U ∈ [0, a]:
a
E[U ] = 0 P[U ≥ u]du. Then the average distortion is
where
• (26.18) uses the following trick in dealing with (1 − δ)M for δ 1 and M 1. First, recall
the standard rule of thumb: (
0, n n 1
(1 − n )n ≈
1, n n 1
In order to obtain firm bounds of similar flavor, consider
union bound
1 − δM ≤ (1 − δ)M ≤ e−δM (log(1 − δ) ≤ −δ)
≤ e−M/γ (γδ ∧ 1) + |1 − γδ|+ (∀γ > 0)
−M/γ +
≤e + |1 − γδ|
• (26.19) is simply change of measure using i(x; y) = log P PY (y) (i.e., conditioning-unconditioning
Y |X (y|x)
trick for information density, cf. Proposition 17.1.
278
• (26.20):
P[d(X, X̂) > D] ≤ e−M/γ + P[{d(X, Y ) > D} ∪ {i(X; Y ) > log γ}]
Proof. Proceed exactly as in the proof of Theorem 26.3, replace (26.12) by P[d(X, X̂) > D] = P[∀j ∈
[M ], d(X, cj ) > D] = EX [(1 − P[d(X, Y ) ≤ D|X])M ], and continue similarly.
Finally, we are able to prove Theorem 26.2 rigorously by applying Theorem 26.3 to iid sources
X = S n and n → ∞:
Proof of Theorem 26.2. Our goal is the achievability: R(D) ≤ Ri (D) = ϕS (D).
WLOG we can assume that Dmax = E[d(S, ŝ0 )] achieved at some fixed ŝ0 – this is our default
reconstruction; otherwise just take any other fixed sequence so that the expectation is finite. The
default reconstruction for S n is ŝn0 = (ŝ0 , . . . , ŝ0 ) and E[d(S n , ŝn0 )] = Dmax < ∞ since the distortion
is separable.
Fix some small δ > 0. Take any PŜ|S such that E[d(S, Ŝ)] ≤ D − δ. Apply Theorem 26.3 to
(X, Y ) = (S n , Ŝ n ) with
PX = PS n
PY |X = PŜ n |S n = (PŜ|S )n
log M = n(I(S; Ŝ) + 2δ)
log γ = n(I(S; Ŝ) + δ)
n
1X
d(X, Y ) = d(Sj , Ŝj )
n
j=1
y0 = ŝn0
279
we conclude that there exists a compressor f : An → [M + 1] and g : [M + 1] → Ân , such that
E[d(S n , g(f (S n )))] ≤ E[d(S n , Ŝ n )] + E[d(S n , ŝn0 )]e−M/γ + E[d(S n , ŝn0 )1{i(S n ;Ŝ n )>log γ } ]
≤ D − δ + Dmax e− exp(nδ) + E[d(S n , ŝn0 )1En ], (26.21)
| {z } | {z }
→0 →0 (later)
where
1 X
n
WLLN
En = {i(S n ; Ŝ n ) > log γ} = i(Sj ; Ŝj ) > I(S; Ŝ) + δ ====⇒ P[En ] → 0
n
j=1
If we can show the expectation in (26.21) vanishes, then there exists an (n, M, D)-code with:
M = 2n(I(S;Ŝ)+2δ) , D = D − δ + o(1) ≤ D.
It remains to show the expectation in (26.21) vanishes. This is a simple consequence of the
L
uniform integrability of the sequence {d(S n , ŝn0 )}. (Indeed, any sequence Vn →1 V is uniformly
integrable.) If you do not know what uniform integrability is, here is a self-contained proof.
Lemma 26.1. For any positive random variable U , define g(δ) = supH:P[H]≤δ E[U 1H ]. Then1
δ→0
EU < ∞ ⇒ g(δ) −−−→ 0.
b→∞
Proof. For any b > 0, E[U 1H ] ≤ E[U 1{U >b} ] + bδ, where E[U 1{U >b} ] −−−→ 0 by dominated
√
convergence theorem. Then the proof is completed by setting b = 1/ δ.
P
Now d(S n , ŝn0 ) = n1 P of U . Since E[U ] = Dmax < ∞ by assumption,
Uj , where Uj are iid copies
applying Lemma 26.1 yields E[d(S n , ŝn0 )1En ] = n1 E[Uj 1En ] ≤ g(P[En ]) → 0, since P[En ] → 0. We
are done proving the theorem.
Note: It seems that in Section 26.2.1 and in Theorem 26.2 we applied different relaxations in
showing the lower bound, how come they turn out to yield the same tight asymptotic result?
This is because the key to both proofs is to estimate the exponent (large deviations) of the
underlined probabilities in (26.7) and (26.17), respectively. To get the right exponent, as we know,
the key is to apply tilting (change of measure) to the distribution solving the information projection
problem (26.9). In the case, when PY = (QŜ )n = (PŜ )n is chosen as the solution to rate-distortion
optimization inf I(S; Ŝ), the resulting tilting is precisely given by 2−i(X;Y ) .
1
In fact, ⇒ is ⇔.
280
26.3* Covering lemma
Goal:
In other words:
1. The minimal rate will depend (although it is not obvious) on whether the encoder An → W
knows about the test that the tester is running (or equivalently whether he knows the function
f (·, ·)).
P
2. If the function is known to be of the form f (An , B n ) = nj=1 f1 (Aj , Bj ), then evidently the
job of the encoder is the following: For any realization of the sequence An , we need to generate
a sequence B n such that joint composition (empirical distribution) is very close to PA,B .
3. If R = H(A), we can compress An and send it to “B side”, who can reconstruct An perfectly
and use that information to produce B n through PB n |An .
281
Our previous argument turns out to give a sharp answer for the case when encoder is aware of
the tester’s algorithm. Here is a precise result:
Theorem 26.5 (Covering Lemma). ∀PA,B and R > I(A; B), let C = {c1 , . . . , cM } where each
codeword cj is i.i.d. drawn from distribution PBn . ∀ > 0, for M ≥ 2n(I(A;B)+) we have that:
Stronger form: ∀F
Proof. Following similar arguments of the proof for Theorem 26.3, we have
Note: [Intuition] To generate B n , there are around 2nH(B) high probability sequences; for each An
sequence, there are around 2nH(B|A) B n sequences that have the same joint distribution, therefore, it
nH(B)
is sufficient to describe the class of B n for each An sequence, and there are around 22nH(B|A) = 2nI(A;B)
classes.
Although Covering Lemma is a powerful tool, it does not imply that the constructed joint
distribution QAn B n can fool any permutation invariant tester. In other words, it is not guaranteed
that
n
sup |QAn ,B n (F ) − PA,B (F )| → 0 .
F ⊂A ×B ,permut.invar.
n n
Indeed, a sufficient statistic for a permutation invariant tester is a joint type P̂An ,c . Our code
satisfies P̂An ,c ≈ PA,B , but it might happen that P̂An ,c although close to PA,B still takes highly
unlikely values (for example, if we restrict all c to have the same composition P0 , the tester can
√
easily detect the problem since PBn -measure of all strings of composition P0 cannot exceed O(1/ n)).
Formally, to fool permutation invariant tester we need to have small total variation between the
distribution on the joint types under P and Q. (It is natural to conjecture that rate R = I(A; B)
should be sufficient to achieve this requirement, though).
A related question is about the minimal possible rate (i.e. cardinality of W ∈ [2nR ]) required to
have small total variation:
n
TV(QAn ,B n , PAB )≤ (26.22)
Note that condition (26.22) guarantees that any tester (permutation invariant or not) is fooled to
believe he sees the truly iid (An , B n ). The minimal required rate turns out to be (Cuff’2012):
R= min I(A, B; U )
A→U →B
a quantity known as Wyner’s common information C(A; B). Showing that Wyner’s common
n (in TV) we have
information is a lower-bound is not hard. Indeed, since QAn ,B n ≈ PAB
I(QAt−1 ,B t−1 , QAt Bt |At−1 ,B t−1 ) ≈ I(PAt−1 ,B t−1 , PAt Bt |At−1 ,B t−1 ) = 0
282
(Here one needs to use finiteness of the alphabet of A and B and the bounds relating H(P ) − H(Q)
with TV(P, Q)). We have (under Q!)
where in the last step we used the crucial observation that under Q there is a Markov chain
At → W → B t
and that Wyner’s common information PA,B 7→ C(A; B) should be continuous in the total variation
distance on PA,B . Showing achievability is a little more involved.
283
§ 27. Evaluating R(D). Lossy Source-Channel separation.
Last time: For stationary memoryless (iid) sources and separable distortion, under the assumption
that Dmax < ∞.
R(D) = Ri (D) = inf I(S; Ŝ).
PŜ|S :Ed(S,Ŝ)≤D
Proof. Since Dmax = p, in the sequel we can assume D < p for otherwise there is nothing to show.
(Achievability) We’re free to choose any PŜ|S , so choose S = Ŝ + Z, where Ŝ ∼ Ber(p0 ) ⊥
⊥Z∼
0
Ber(D), and p is such that
p0 ∗ D = p0 (1 − D) + (1 − p0 )D = p,
p−D
i.e., p0 = 1−2D . In other words, the backward channel PS|Ŝ is BSC(D). This induces some forward
channel PŜ|S . Then,
Since one such PŜ|S exists, we have the upper bound R(D) ≤ h(p) − h(D).
(Converse) First proof: For any PŜ|S such that P [S 6= Ŝ] ≤ D ≤ p ≤ 12 ,
284
Second proof: Here is a more general strategy. Denote the random transformation from the
Ŝ|S with EQ [d(S, Ŝ)] ≤ D
∗ . Now we need to show that there is no better Q
achievability proof by PŜ|S
and a smaller mutual information. Then consider the chain:
Since the upper and lower bound agree, we have R(D) = |h(p) − h(D)|+ .
For example, when p = 1/2, D = .11, then R(D) = 1/2 bit. In the Hamming game where we
compressed 100 bits down to 50, we indeed can do this while achieving 11% average distortion,
compared to the naive scheme of storing half the string and guessing on the other half, which
achieves 25% average distortion.
Interpretation: By WLLN, the distribution PSn = Ber(p)n concentrates near the Hamming
sphere of radius np as n grows large. The above result about Hamming sources tells us that the
optimal reconstruction points are from PŜn = Ber(p0 )n where p0 < p if p < 1/2 and p0 = 1/2 if
p = 1/2, which concentrates on a smaller sphere of radius np0 (note the reconstruction points are
some exponentially small subset of this sphere).
S(0, np)
S(0, np′ )
Hamming Spheres
It is interesting to note that none of the reconstruction points are the same as any of the typical
source realizations.
285
(Achievability) Choose S = Ŝ + Z , where Ŝ ∼ N (0, σ 2 − D) ⊥
⊥ Z ∼ N (0, D). In other words,
the backward channel PS|Ŝ is AWGN with noise power D. Since everything is jointly Gaussian, the
2 2
forward channel can be easily found to be PŜ|S = N ( σ σ−D
2 S,
σ −D
σ2
D). Then
1 σ2 1 σ2
I(S; Ŝ) = log =⇒ R(D) ≤ log
2 D 2 D
(Converse) Let PŜ|S be any conditional distribution such that EP |S − Ŝ|2 ≤ D. Denote the
∗ . We use the same trick as before
forward channel in the achievability by PŜ|S
" ∗ #
PS| Ŝ
∗
I(PS , PŜ|S ) = D(PS|Ŝ kPS|Ŝ |PŜ ) + EP log
PS
" ∗ #
PS| Ŝ
≥ EP log
PS
(S−Ŝ)2
1 −
√ e 2D
= EP log 2πD S2
√ 1 e− 2σ2
2πσ 2
" #
1 σ 2 log e S 2 |S − Ŝ|2
= log + EP −
2 D 2 σ2 D
1 σ2
≥
log .
2 D
Again, the upper and lower bounds agree.
The interpretation in the Gaussian case is very similar√to the case of the Hamming source. As n
grows large, our source distribution concentrates on S(0, pnσ 2 ) (n-sphere in Euclidean space rather
than Hamming), and our reconstruction points on S(0, n(σ 2 − D)). So again the picture is two
nested sphere.
How sensitive is the rate-distortion formula to the Gaussianity assumption of the source? The
following is a counterpart of Theorem 19.6 for channel capacity:
Theorem 27.1. Assume that ES = 0 and Var S = σ 2 . Let the distortion metric be quadratic:
d(s, ŝ) = (s − ŝ)2 . Then
1 σ2 1 σ2
log+ − D(PS kN (0, σ 2 )) ≤ R(D) = inf I(S; Ŝ) ≤ log+ .
2 D PŜ|S :E(Ŝ−S)2 ≤D 2 D
Note: This result is in exact parallel to what we proved in Theorem 19.6 for additive-noise channel
capacity:
1 P 1 P
log 1 + 2 ≤ sup I(X; X + Z) ≤ log 1 + 2 + D(PZ kN (0, σ 2 )).
2 σ PX :EX 2 ≤P 2 σ
286
Proof. Again, assume D < Dmax = σ 2 . Let SG ∼ N (0, σ 2 ).
2 σ 2 −D
∗
“≤”: Use the same PŜ|S = N ( σ σ−D
2 S, σ2
D) in the achievability proof of Gaussian rate-
distortion function:
∗
R(D) ≤ I(PS , PŜ|S )
σ2 − D σ2 − D
= I(S; S + W) W ∼ N (0, D)
σ2 σ2
σ2 − D
≤ I(SG ; SG + W ) by Gaussian saddle point (Theorem 4.6)
σ2
1 σ2
= log .
2 D
“≥”: For any PŜ|S such that E(Ŝ − S)2 ≤ D. Let PS|
∗ = N (Ŝ, D) denote AWGN with noise
Ŝ
power D. Then
Remark: The theory of quantization and the rate distortion theory at large have played a
significant role in pure mathematics. For instance, Hilbert’s thirteenth problem was partially solved
by Arnold and Kolmogorov after they realized that they could classify spaces of functions looking
at the optimal quantizer for such functions.
In fact there are also iterative algorithms (Blahut-Arimoto) that computes R(D). However, for
the peace of mind it is good to know there are some general reasons why tricks like we used in
Hamming/Gaussian actually are guaranteed to work.
287
Theorem 27.2. 1. Suppose PY ∗ and PX|Y ∗ PX are found with the property that E[d(X, Y ∗ )] ≤ D
and for any PXY with E[d(X, Y )] ≤ D we have
dPX|Y ∗
E log (X|Y ) ≥ I(X; Y ∗ ) . (27.4)
dPX
• E[d(X, Y )] ≤ D and
• PY PY ∗ and
• I(X; Y ) < ∞
1. The first part is a sufficient condition for optimality of a given PXY ∗ . The second part gives a
necessary condition that is convenient to narrow down the search. Indeed, typically the set of
PXY satisfying those conditions is rich enough to infer from (27.4):
dPX|Y ∗
log (x|y) = R(D) − θ[d(x, y) − D] ,
dPX
for a positive θ > 0.
2. Note that the second part is not valid without assuming PY PY ∗ . A counterexample to
this and various other erroneous (but frequently encountered) generalizations is the following:
A = {0, 1}, PX = Bern(1/2), Â = {0, 1, 00 , 10 } and
The R(D) = |1 − h(D)|+ , but there are a bunch of non-equivalent optimal PY |X , PX|Y and
PY ’s.
Proof. First part is just a repetition of the proofs above, so we focus on part 2. Suppose there exists
a counterexample PXY achieving
dPX|Y ∗
I1 = E log (X|Y ) < I ∗ = R(D) .
dPX
and thus
D(PX|Y kPX|Y ∗ |PY ) < ∞ . (27.5)
Before going to the actual proof, we describe the principal idea. For every λ we can define a joint
distribution
PX,Yλ = λPX,Y + (1 − λ)PX,Y ∗ .
288
Then, we can compute
PX|Yλ PX|Yλ PX|Y ∗
I(X; Yλ ) = E log (X|Yλ ) = E log (27.6)
PX PX|Y ∗ PX
PX|Y ∗ (X|Yλ )
= D(PX|Yλ kPX|Y ∗ |PYλ ) + E (27.7)
PX
= D(PX|Yλ kPX|Y ∗ |PYλ ) + λI1 + (1 − λ)I∗ . (27.8)
From here we will conclude, similar to Prop. 4.1, that the first term is o(λ) and thus for sufficiently
small λ we should have I(X; Yλ ) < R(D), contradicting optimality of coupling PX,Y ∗ .
We proceed to details. For every λ ∈ [0, 1] define
dPY
ρ1 (y) , (y) (27.9)
dPY ∗
λρ1 (y)
λ(y) , (27.10)
λρ1 (y) + λ̄
(λ)
PX|Y =y = λ(y)PX|Y =y + λ̄(y)PX|Y ∗ =y (27.11)
dPYλ = λdPY + λ̄dPY ∗ = (λρ1 (y) + λ̄)dPY ∗ (27.12)
D(y) = D(PX|Y =y kPX|Y ∗ =y ) (27.13)
(λ)
Dλ (y) = D(PX|Y =y kPX|Y ∗ =y ) . (27.14)
Notice:
On {ρ1 = 0} : λ(y) = D(y) = Dλ (y) = 0
and otherwise λ(y) > 0. By convexity of divergence
Dλ (y) ≤ λ(y)D(y)
and therefore
1
Dλ (y)1{ρ1 (y) > 0} ≤ D(y)1{ρ1 (y) > 0} .
λ(y)
Notice that by (27.5) the function ρ1 (y)D(y) is non-negative and PY ∗ -integrable. Then, applying
dominated convergence theorem we get
Z Z
1 1
lim dPY ∗ Dλ (y)ρ1 (y) = dPY ∗ ρ1 (y) lim Dλ (y) = 0 (27.15)
λ→0 {ρ1 >0} λ(y) {ρ1 >0} λ→0 λ(y)
289
where in the penultimate step we used Dλ (y) = 0 on {ρ1 = 0}. Hence, (27.15) shows
(λ)
D(PX|Y kPX|Y ∗ |PYλ ) = o(λ) , λ → 0.
Finally, since
(λ)
PX|Y ◦ PYλ = PX ,
we have
(λ) dPX|Y ∗ dPX|Y ∗ ∗
I(X; Yλ ) = D(PX|Y kPX|Y ∗ |PYλ ) + λ E log (X|Y ) + λ̄ E log (X|Y ) (27.19)
dPX dPX
= I ∗ + λ(I1 − I ∗ ) + o(λ) , (27.20)
f ch. g
S k −→ X n −→ Y n −→ Ŝ k
Sk Xn Yn Ŝ k
Source JSCC enc Channel JSCC dec
k
R= n
290
27.3.1 Converse
The converse for the JSCC is quite simple. Note that since there is no under consideration, the
strong converse is the same as the weak converse. The proof architecture is identical to the weak
converse of lossless JSCC which uses Fano’s inequality.
Theorem 27.3 (Converse). For any source such that
1
Ri (D) = lim inf I(S k ; Ŝ k )
k→∞ k PŜ k |S k :E[d(S k ,Ŝ k )]≤D
we have
Ri (D)
ρ∗ (D) ≥
Ci
Remark: The requirement of this theorem on the source isn’t too stringent; the limit expression
for Ri (D) typically exists for stationary sources (like for the entropy rate)
Proof. Take a (k, n, D)-code S k → X n → Y n → Ŝ k . Then
Which follows from data processing and taking inf/sup. Normalizing by 1/k and taking the liminf
as n → ∞
1
(LHS) lim inf sup I(X n ; Y n ) = Ci
n→∞ n PX n
1
(RHS) lim inf inf I(S kn ; Ŝ kn ) = Ri (D)
n→∞ kn PŜ kn |S kn
Note: Clearly the assumptions in Theorem 27.3 are satisfied for memoryless sources. If the source
S is iid Bern(1/2) with Hamming distortion, then Theorem 27.3 coincides with the weak converse
for channel coding under bit error rate in Theorem 16.4:
nC
k≤
1 − h(pb )
which we proved using ad hoc techniques. In the case of channel with cost constraints, e.g., the
AWGN channel with C(SNR) = 21 log(1 + SNR), we have
C(SNR)
pb ≥ h−1 1 −
R
This is often referred to as the Shannon limit in plots comparing the bit-error rate of practical codes.
See, e.g., Fig. 2 from [RSU01] for BIAWGN (binary-input) channel. This is erroneous, since the
pb above refers to the bit-error of data bits (or systematic bits), not all of the codeword bits. The
latter quantity is what typically called BER in the coding-theoretic literature.
291
27.3.2 Achievability via separation
The proof strategy is similar to the lossless JSCC: We construct a separated lossy compression
and channel coding scheme using our tools from those areas, i.e., let the JSCC encoder to be
the concatenation of a loss compressor and a channel encoder, and the JSCC decoder to be the
concatenation of a channel decoder followed by a loss compressor, then show that this separated
construction is optimal.
Theorem 27.4. For any stationary memoryless source (PS , A, Â, d) satisfying assumption A1
(below), and for any stationary memoryless channel PY |X ,
R(D)
ρ∗ (D) =
C
Note: The assumption on the source is to control the distortion incurred by the channel decoder
making an error. Although we know that this is a low-probability event, without any assumption
on the distortion metric, we cannot say much about its contribution to the end-to-end average
distortion. This will not be a problem if the distortion metric is bounded (for which Assumption A1
is satisfied of course). Note that we do not have this nuisance in the lossless JSCC because we at
most suffer the channel error probability (union bound).
The assumption is rather technical which can be skipped in the first reading. Note that it is
trivially satisfied by bounded distortion (e.g., Hamming), and can be shown to hold for Gaussian
source and MSE distortion.
Proof. The converse direction follows from the previous theorem. For the other direction, we
constructed a separated compression / channel coding scheme. Take
S k −→ W −→ X n −→ Y n −→ Ŵ −→ Ŝ k
Note that here we need a maximum probability of error code since when we concatenate these
two schemes, W at the input of the channel is the output of the source compressor, which is not
guaranteed to be uniform. Now that we have a scheme, we must analyze the average distortion to
show that it meets the end-to-end distortion constraint. We start by splitting the expression into
two cases
By assumption on our lossy code, we know that the first term is ≤ D. In the second term, we know
that the probability of the event {W =6 Ŵ } is small by assumption on our channel code, but we
cannot say anything about E[d(S , Ŝ (Ŵ ))] unless, for example, d is bounded. But by Lemma 27.1
k k
(2) d(ak0 , Ŝ k ) ≤ L for all quantization outputs Ŝ k , where ak0 = (a0 , . . . , a0 ) is some fixed string of
length k from the Assumption A1 below.
292
The second bullet says that all points in the reconstruction space are “close” to some fixed string.
Now we can deal with the troublesome term
E[d(S k , Ŝ k (Ŵ ))1{W 6= Ŵ }] ≤ E[1{W 6= Ŵ }λ(d(S k , âk0 ) + d(ak0 , Ŝ k ))]
(by point (2) above) ≤ λE[1{W 6= Ŵ }d(S k , âk0 )] + λE[1{W 6= Ŵ }L]
≤ λo(1) + λL → 0 as → 0
where in the last step we applied the same uniform integrability argument that showed vanishing of
the expectation in (26.21) before.
In all, our scheme meets the average distortion constraint. Hence we conclude that for ∀ρ >
R(D)
C , ∃ sequence of (k, n, D + o(1))-codes.
a â
b b
a0 â0
A Â
But the assumption isn’t easy to verify, or clear which sources satisfy the assumptions. Because of
this, we now give a few sufficient conditions for Assumption A1 to hold.
Trivial Condition: If the distortion function is bounded, then the assumption A1 holds
automatically. In other words, if we have a discrete source with finite alphabet |A|, |Â| < ∞ and
a finite distortion function d(a, â) < ∞, then A1 holds. More generally, we have the following
criterion.
Theorem 27.5 (Criterion for satisfying A1). If A = Â and d(a, â) = ρq (a, â) for some metric ρ
with q ≥ 1, and Dmax , inf â0 E[d(S, â0 )] < ∞, then A1 holds.
Proof. Take a0 = â0 that achieves finite Dp (in fact, any points can serve as centers in a metric
space). Then
q
1 1 1
( ρ(a, â))q ≤ ρ(a, a0 ) + ρ(a0 , â)
2 2 2
1 1
(Jensen’s) ≤ ρq (a, a0 ) + ρq (a0 , â)
2 2
293
And thus d(a, â) ≤ 2q−1 (d(a, a0 ) + d(a0 , â)). Taking λ = 2q−1 verifies (1) and (2) in A1. To verify
(3), we can use this generalized triangle inequality for our source
So we see that metrics raised to powers (e.g. squared Euclidean norm) satisfy the condition A1.
The lemma used in Theorem 27.4 is now given.
Lemma 27.1. Fix a source satisfying A1 and an arbitrary PŜ|S . Let R > I(S; Ŝ), L > max{E[d(a0 , Ŝ)], d(a0 , â0 )}
and D > E[d(S, Ŝ)]. Then, there exists a (k, 2kR , D)-code such that for every reconstruction point
x̂ ∈ Âk we have d(ak0 , x̂) ≤ L.
Note that this is NOT a separable distortion metric. Also note that without any change in d1 -
distortion we can remove all (if any) reconstruction points x̂ with d(ak0 , x̂) > L. Furthermore, from
the WLLN we have for any D > D0 > E[d(S, Ŝ 0 )]
as k → ∞ (since E[d(S, Ŝ)] < D0 and E[a0 , Ŝ] < L). Thus, overall we get M = 2kR reconstruction
points (c1 , . . . , cM ) such that
P[ min d(S k , cj ) > D0 ] → 0
j∈[M ]
where the last estimate follows from uniform integrability as in the vanishing of expectation in (26.21).
Thus, for sufficiently large n the expected distortion is ≤ D, as required.
To summarize the results in this section, under stationarity and memorylessness assumptions on
the source and the channel, we have shown that the following separately-designed scheme achieves
the optimal rate for lossy JSCC: First compress the data, then encode it using your favorite channel
code, then decompress at the receiver.
294
27.4 What is lacking in classical lossy compression?
Examples of some issues with the classical compression theory:
• compression: we can apply the standard results in compression of a text file, but it is extremely
difficult for image files due to the strong spatial correlation. For example, the first sentence
and the last in Tolstoy’s novel are pretty uncorrelated. But the regions in the upper-left
and bottom-right corners of one image can be strongly correlated. Thus for practicing the
lossy compression of videos and images the key problem is that of coming up with a good
“whitening” basis.
• JSCC: Asymptotically the separation principle sounds good, but the separated systems can
be very unstable - no graceful degradation. Consider the following example of JSCC.
Example: Source = Bern( 12 ), channel = BSC(δ).
R(D)
1. separate compressor and channel encoder designed for C(δ) =1
2. a simple JSCC:
ρ = 1, Xi = Si
295
Part VI
Advanced topics
296
§ 28. Applications to statistical decision theory
In this lecture we discuss applications of information theory to statistical decision theory. Although
this lecture only focuses on statistical lower bound (converse result), let us remark in passing that
the impact of information theory on statistics is far from being only on proving impossibility results.
Many procedures are based on or inspired by information-theoretic ideas, e.g., those based on
metric entropy, pairwise comparison, maximum likelihood estimator and analysis, minimum distance
estimator (Wolfowitz), maximum entropy estimators, EM algorithm, minimum description length
(MDL) principle, etc.
We discuss two methods: LeCam-Fano (hypothesis testing) method and the rate-distortion
(mutual information) method.
We begin with the decision-theoretic setup of statistical estimation. The general paradigm is
the following:
θ
|{z} → |{z}
X → |{z} θ̂
parameter data estimator
• Parameter space: Θ 3 θ
• Estimator: θ̂ = θ̂(X)
The goal is make random variable `(θ, θ̂) small either in probability or in expectation, uniformly
over the unknown parameter θ. To this end, we define the minimax risk
297
The Bayes risk for a prior π is the minimum that the average risk can achieve, i.e.
R∗ ≥ Rπ∗ (28.1)
and in fact R∗ = supπ∈M(Θ) Rπ∗ whenever the minimax theorem holds, where M(Θ) denotes the
collection of all probability distributions on Θ. In other words, solving the minimax problem can be
done by finding the least-favorable (Bayesian) prior. Almost all of the minimax lower bounds boil
down to bounding from below the Bayes risk for some prior. When this prior is uniform on just two
points, the method is known under a special name of (two-point) LeCam or LeCam-Fano method.
Note also that when `(θ, θ̂) = kθ − θ̂k22 is the quadratic `2 risk, the optimal estimator achieving Rπ∗
is easy to describe: θ̂∗ = E[θ|X]. This fact, however, is of limited value, since typically conditional
expectation is very hard to analyze.
• Fundamental limit:
R∗ (n) , sup inf E[(θ̂(X n ) − θ)2 |θ = θ0 ]
θ0 ∈[0,1] θ̂
where JF (θ0 ) = Var[ ∂ ln f∂θ(X|θ) |θ = θ0 ] is the Fisher information (4.7). In our case, for any unbiased
estimator (i.e. b(θ) = 0) we have
θ0 (1 − θ0 )
E[(θ̂ − θ)2 |θ = θ0 ] ≥ ,
n
298
and we can see from (28.2) that θ̂emp is optimal in the class of unbiased estimators.
Can biased estimators do better? The answer is yes. Consider
1 − n X 1 1
θ̂bias = (Xi − ) + ,
n 2 2
i
where choice of n > 0 “shrinks” the estimator towards 12 and regulates the bias-variance tradeoff.
1
In particular, setting n = √n+1 achieves the minimax risk
1
sup E[(θ̂bias − θ)2 |θ = θ0 ] = √ , (28.4)
θ0 4( n + 1)2
which is better than the empirical mean (28.2), but only slightly.
How do we show that arbitrary biased estimators can not do significantly better? This is where
LeCam-Fano method comes handy. Suppose some estimator θ̂ achieves
W → θ → X n → θ̂ → Ŵ
• W ∼ Bern(1/2)
• X n is i.i.d. Bern(θ)
The idea here is that we use our high-quality estimator to distinguish between two hypotheses
θ = 1/2 ± κ∆n . Notice that for probability of error we have:
E[(θ̂ − θ)2 ] 1
P[W 6= Ŵ ] = P[θ̂ > 1/2|θ = 1/2 − κ∆n ] ≤ 2 2
≤ 2
κ ∆n κ
where the last steps are by Chebyshev and (28.5), respectively. Thus, from Fano’s inequality
Theorem 5.3 we have
1
I(W ; Ŵ ) ≥ 1 − 2 log 2 − h(κ−2 ) .
κ
On the other hand, from data-processing and golden formula we have
As ∆n → 0 we have
h(1/2 − κ∆n ) = log 2 − 2 log e · (κ∆n )2 + o(∆2n ) .
299
So altogether, we get that for every fixed κ we have
1
1 − 2 log 2 − h(κ−2 ) ≤ 2n log e · (κ∆n )2 + o(n∆2n ) .
κ
In particular, by optimizing over κ we get that for some constant c ≈ 0.015 > 0 we have
c
∆2n ≥ + o(1/n) .
n
Together with (28.2), we have
0.015 1
+ o(1/n) ≤ R∗ (n) ≤ ,
n 4n
and thus the empirical-mean estimator is rate-optimal.
We mention that for this particular problem (estimating mean of Bernoulli samples) the minimax
risk is known exactly:
1
R∗ (n) = √ (28.6)
4(1 + n)2
but obtaining this requires different methods.1 In fact, even showing R∗ (n) = 4n 1
+ o(1/n) requires
careful priors on θ (unlike the simple two-point prior we used above).2
We demonstrated here the essense of the Fano method of proving lower (impossibility) bounds
in statistical decision theory. Namely, given an estimation task we select a prior, uniform on finitely
many θ’s, which on one hand yields a rather small information I(θ; X) and on the other hand has
sufficiently separated points which thus should be distinguishable by a good estimator. For more
see [Yu97].
A natural (and very useful) generalization is to consider non-discrete prior Pθ , and use the
following natural chain of inequalities
where
f (Pθ , R) , inf{I(θ; θ̂) : Pθ̂|θ s.t. E[`(θ, θ̂)] ≤ R}
is the rate-distortion function. This method we discuss next.
1
The easiest way to get this is to apply (28.1). . Fortunately, in this case if π is the β-distribution, computation of
conditional expectation can be performed in closed form, and optimizing parameters of the β-distribution one recovers
a lower bound that together with (28.4) establishes (28.6). Note that the resulting worst-case π is not uniform, and in
fact β → ∞ (i.e. π concentrates in a small region around θ = 1/2).
2
It follows from the following Bayesian Cramer-Rao lower bound [GL95] : For any estimator θ̂ and for any prior
π(θ)dθ with smooth density π we have
(log e)2
Eθ∼π [(θ̂(X) − θ)2 ] ≥ ,
E[JF (θ)] + JF (π)
R 0 (θ))2
where JF (θ) is as in (28.3), JF (π) , (log e)2 (ππ(θ) dθ. Then taking π supported on a n−1/4 -neighborhood
surrounding a given point θ0 we get that E[JF (θ)] = θ0 (1−θ0 ) + o(n) and JF (π) = o(n), yielding
n
θ0 (1 − θ0 )
R∗ (n) ≥ + o(1/n) .
n
This is a rather general phenomenon: Under regularity assumptions in any iid estimation problem θ → X n → θ̂ with
quadratic loss we have
1
R∗ (n) = + o(1/n) .
inf θ JF (θ)
300
28.2 Mutual information method
The main workhorse will be
2. Rate-distortion theory
To illustrate the mutual information method and its execution in various problems, we will discuss
three vignettes:
• Denoise a vector;
• Community detection.
Here’s the main idea of the mutual information method. Fix some prior π and we turn to lower
bound Rπ∗ . The unknown θ is distributed according to π. Let θ̂ be a Bayes optimal estimator that
achieves the Bayes risk Rπ∗ .
The mutual information method consists of applying the data processing inequality to the
Markov chain θ → X → θ̂:
dpi
inf I(θ; θ̂) ≤ I(θ, θ̂) ≤ I(θ; X). (28.7)
∗
Pθ̃|θ :E`(θ,θ̃)≤Rπ
Note that
• The leftmost quantity can be interpreted as the minimum amount of information required for
an estimation task, which is reminiscent of rate-distortion function.
• The rightmost quantity can be interpreted as the amount of information provided by the
data about the parameter. Sometimes it suffices to further upper-bound it by capacity of the
channel θ 7→ X:
I(θ; X) ≤ sup I(θ; X). (28.8)
π∈M(Θ)
• This chain of inequalities is reminiscent of how we prove the converse in joint-source channel
coding (Section 27.3), with the capacity-like upper bound and rate-distortion-like lower bound.
301
28.2.1 Denoising (Gaussian location model)
The setting is the following: given n noisy observations of a high-dimensional vector θ ∈ Rp ,
i.i.d.
Xi ∼ N (θ, Ip ), i = 1, . . . , n (28.9)
The loss is simply the quadratic error: `(θ, θ̂) = kθ − θ̂k22 . Next we show that
p
R∗ = , ∀p, n. (28.10)
n
P
Upper bound. Consider the estimator X̄ = n1 ni=1 Xi . Then X̄ ∼ N (θ, n1 Ip ) and clearly
EkX̄ − θk22 = p/n.
Lower bound. Consider a Gaussian prior θ ∼ N (0, σ 2 Ip ). Instead of evaluating the exact Bayes
risk (MMSE) which is a simple exercise, let’s implement the mutual information method (28.7).
Given any estimator θ̂. Let D = Ekθ̂ − θk22 . Then
p σ2 suff stat p σ2
log = inf I(θ; θ̂) ≤ I(θ, θ̂) ≤ I(θ; X) = I(θ; X̄) = log 1 + .
2 D/p Pθ̃|θ :Ekθ−θ̃k22 ≤D 2 1/n
where the left inequality follows from the Gaussian rate-distortion function (27.3) and the single-
letterization result (Theorem 26.1) that reduces p dimensions to one dimension. Putting everything
together we have
pσ 2
R∗ ≥ Rπ∗ ≥ .
1 + nσ 2
Optimizing over σ 2 (by sending it to ∞), we have R∗ ≥ p/n.
302
Lower bound. In view of the oracle lower bound, it suffices to consider k = O(p). Next we
assume k ≤ p/16. Consider a p-dimensional Hamming sphere of radius k, i.e.
where wH (b)
qis the Hamming weights of b. Let b be drawn uniformly from the set B and θ = τ b,
where τ = k
100 log kp . Thus, we have the following Markov chain which represents our problem
model,
b → θ → X → θ̂ → b̂.
τ 2k
Note that the channel θ → X is just p uses of the AWGN channel, with power p , and thus
by Theorem 4.6 and single-letterization (Theorem 5.1) we have
p τ 2k log e
I(θ; θ̂) ≤ I(θ; X) ≤ log 1 + ≤ sup kθk22 = ckτ 2 ,
2 p θ∈G 2
for some c > 0. We note that related techniques have been used in proving lower bound for stable
recovery in noiseless compressed sensing [PW12].
To give a lower bound for I(θ; θ̂), consider
Thus,
τ 2 dH (b, b̂) = kτ b̂ − θk22 ≤ 4kθ − θ̂k22 ,
where dH denotes the Hamming distance between b and b̂. Suppose that Ekθ̂ − θk22 = τ 2 k. Then
we have EdH (b, b̂) ≤ 4k. Our goal is to show that is at least a small constant by the mutual
information method. First,
Before we bound the RHS, let’s first guess its behavior. Note that it is the rate-distortion function
of the random vector b, which is uniform over B, the Hamming sphere of radius k, and each entry is
Bern(k/p). Had the entries been iid, then rate-distortion theory ((27.1) and Theorem 26.1) would
yield that the RHS is simply p(h(k/p) − h(4k/p)). Next, following the proof of (27.1), we show
that this behavior is indeed correct:
303
The solution is W = Bern(m/p)⊗p . One way to get this is to write H(W ) = p log 2−D(PW kBern(1/2)⊗p )
and apply Theorem 14.3 with X = wH (W ), to get that optimal PW (w) ∼ exp{cwH (w)}. In the
end we get Combine this with the previous bound, we get
p 4k
I(b̂; b) ≥ log − ph( ).
k p
On the other hand, we have
p
I(b̂; b) ≤ I(θ; Y ) ≤ cτ 2 = c0 k log .
k
p
Note that h(α) −α log α for α < 41 . WLOG, since k ≤ 16 , we have ≥ c0 for some universal
constant c0 . Therefore
p
R∗ ≥ τ 2 k & k log .
k
Combining with the result in the oracle lower bound, we have the desired.
p
R∗ & k + k log
k
or for general n ≥ 1
k ep
R∗ & log .
n k
∗
Remark 28.1. Let Rk,p = R∗ . For the case k = o(p), the sharp asymptotics is
∗ p
Rk,p ≥ (2 + op (1))k log .
k
To prove this result, we need to first show that for the case k = 1,
∗
R1,p ≥ (2 + op (1)) log p.
Next, show that for any k, the minimax risk is lower bounded by the Bayesian risk with the block
prior. The block prior is that we divide the p-coordinate into k blocks, and pick one coordinate
from each p/k-coordinate uniformly. With this prior, one can show
∗ ∗ p
Rk,p ≥ kR1,p/k = (2 + op (1))k log .
k
304
Given the network size n and the community size k, there exists a sharp condition on the edge
density (p, q) that says the community needs to be sufficient denser than the outside. It turns
out this is precisely described by the binary divergence d(pkq). Under the assumption that p/q is
bounded, e.g., p = 2q, the information-theoretic necessary condition is
kd(pkq)
k · d(pkq) → ∞ and lim inf ≥ 2. (28.14)
n→∞ log nk
This condition is tight in the sense that if in the above “≥” is replaced by “>”, then there exists an
estimator (e.g., the maximal likelihood estimator) that achieves (28.13).
Next we only prove the necessity of the second condition in (28.14), again using the mutual
information method. Let ξ and ξˆ be the indicator vector of the community C and the estimator
Ĉ, respectively. Thus ξ = (ξ1 , . . . , ξn ) is uniformly drawn from the set {x ∈ {0, 1}n : wH (x) = k}.
Therefore ξi ’s are individually Bern(k/n). Let E[dH (ξ, ξ)] ˆ = n k, where n → 0 by assumption.
Consider the following chain of inequalities, which lower bounds the amount of information required
for a distortion level n :
dpi
ˆ ξ) ≥
I(A; ξ) ≥ I(ξ; min ˜ ξ) ≥ H(ξ) −
I(ξ; max H(ξ˜ ⊕ ξ)
E[d(ξ̃,ξ)]≤n k E[d(ξ̃,ξ)]≤n k
(28.12) n n k n
= log − nh ≥ k log (1 + o(1)),
k n k
k
where the last step follows from the bound nk ≥ nk , the assumption k/n is bounded away from
one, and the bound h(p) ≤ −p log p + p for p ∈ [0, 1].
On the other hand, to bound the mutual information, we use the golden formula Corollary 3.1
and choose a simple reference Q:
305
§ 29. Multiple-access channel
Note: In network community, people are mostly interested in channel access control mechanisms
that help to detect or avoid data packet collisions so that the channel is shared among multiple
users.
The key to achieve this is to use coding so that collisions are resolvable.
In the following discussion we shall focus on the case with two users. This is without loss of
much generality, as all the results can easily be extended to N users.
Definition 29.1.
• Multiple-access channel: {PY n |An ,B n : An × B n → Y n , n = 1, 2, . . . }.
• a (n, M1 , M2 , ) code is specified by
f1 : [M1 ] → An , f2 : [M2 ] → B n
g : Y n → [M1 ] × [M2 ]
1
Note that there is a lot of research on how to achieve just these 37%. Indeed, ALOHA in a nutshell simply
postulates that every time a user has a packet to transmit, he should attempt transmission in each time slot with
probability p, independently. The optimal setting of p is the inverse of the number of actively trying users. Thus, it is
non-trivial how to learn the dynamically changing number of active users without requiring a central authority. This
is how ideas such as exponential backoff etc arise.
306
W1 , W2 ∼ uniform, and the codes achieves
[
P[{W1 6= Ŵ1 } {W2 6= Ŵ2 }] ≤
• Asymptotics:
C = cl lim inf R∗ (n, )
n→∞
lim inf An = {ω : ω ∈ An , ∀n ≥ n0 }
n
lim sup An = {ω : ω infinitely occur}
n
• \
C = lim C = C
>0
where co is the set operator of constructing the convex hull followed by taking the closure, and
P enta(·, ·) is defined as follows:
0 ≤ R1 ≤ I(A; Y |B)
P enta(PA , PB ) = (R1 , R2 ) : 0 ≤ R2 ≤ I(B; Y |A)
R1 + R2 ≤ I(A, B; Y )
0 ≤ R1 ≤ I(A; Y |B, U )
P enta(PA|U , PB|U |PU ) = (R1 , R2 ) : 0 ≤ R2 ≤ I(B; Y |A, U )
R1 + R2 ≤ I(A, B; Y |U )
307
Note: the two forms in (29.1) and (29.2) are equivalent without cost constraints. In the case when
constraints such as Ec1 (A) ≤ P1 and Ec2 (B) ≤ P2 are present, only the second expression yields
the true capacity region.
c1 , . . . , cM1 ∈ A, d1 , . . . , dM2 ∈ B
where the codes are drawn i.i.d from distributions: c1 , . . . , cM1 ∼ i.i.d. PA , d1 , . . . , dM2 ∼ i.i.d. PB .
The decoder operates in the following way: report (m, m0 ) if it is the unique pair that satisfies:
308
h [ [ i
⇒ Pe ≤ P {i12 (A, B; Y ) ≤ log γ12 } {i1 (A; Y |B) ≤ log γ1 } {i2 (B; Y |A) ≤ log γ2 } + P[E12 ] + P[E1 ] + P[E2 ]
where
Note: [Intuition] Consider the decoding step when a random codebook is used. We observe Y
and need to solve an M -ary hypothesis testing problem: Which of {PY |A=cm ,B=dm0 }m,m0 ∈[M1 ]×[M2 ]
produced the sample Y ?
Recall that in P2P channel coding, we had a similar problem and the M-ary hypothesis testing
problem was converted to M binary testing problems:
X 1
PY |X=cj vs PY−j , P ≈ PY
M − 1 Y |X=ci
i6=j
i.e. distinguish cj (hypothesis H0 ) against the average distribution induced by all other codewords
(hypothesis H1 ), which for a random coding ensemble cj ∼ PX is very well approximated by
PY = PY |X ◦ PX . The optimal test for this problem is roughly
PY |X=c
& log(M − 1) =⇒ decide PY |X=cj (29.4)
PY
1
since the prior for H0 is M , while the prior for H1 is MM−1 .
The proof above followed the same idea except that this time because of the two-dimensional
grid structure:
309
there are in fact binary HT of three kinds
1 X X
(P 12) ∼ test PY |A=cm ,B=dm0 vs PY |A=ci ,B=dj ≈ PY
(M1 − 1)(M2 − 1) 0i6=m j6=m
1 X
(P 1) ∼ test PY |A=cm ,B=dm0 vs PY |A=ci ,B=dm0 ≈ PY |B=dm0
M1 − 1
i6=m
1 X
(P 2) ∼ test PY |A=cm ,B=dm0 vs PY |A=cm ,B=dj ≈ PY |A=cm
M2 − 1 0 j6=m
And analogously to (29.4) the optimal tests are given by comparing the respective information
densities with log M1 M2 , log M1 and log M2 .
Another observation following from the proof is that the following decoder would also achieve
exactly the same performance:
• Step 1: rule out all cells (i, j) with i12 (ci , dj ; Y ) . log M1 M2 .
• Step 2: If the messages remaining are NOT all in one row or one column, then FAIL.
• Step 3a: If the messages remaining are all in one column m0 then declare Ŵ2 = m0 . Rule out
all entries in that column with i1 (ci ; Y |dm0 ) . log M1 . If more than one entry remains, FAIL.
Otherwise declare the unique remaining entry m as Ŵ1 = m.
• Step 3b: Similarly with column replaced by row, i1 with i2 and log M1 with log M2 .
The importance of this observation is that in the regime when RHS of (29.3) is small, the
decoder always finds it possible to basically decode one message, “subtract” its influence and then
decode the other message. Which of the possibilities 3a/3b appears more often depends on the
operating point (R1 , R2 ) inside C.
A + B , {(a + b) : a ∈ A, b ∈ B}
310
2. Achievability
STP: ∀PA , PB , ∀(R1 , R2 ) ∈ P enta(PA , PB ), ∃(n, 2nR1 , 2nR2 , )code.
Apply Lemma 29.1 with:
by WLLN, the first part goes to zero, and for any (R1 , R2 ) such that R1 < I(A; Y |B) − δ
and R2 < I(B; Y |A) − δ and R1 + R2 < I(A, B; Y ) − δ, the second part goes to zero as well.
Therefore, if (R1 , R2 ) ∈ interior of the pentagon, there exists a (M1 , M2 , = o(1)) code.
3. Weak converse
1
Q[W1 = Ŵ1 , W2 = Ŵ2 ] = , P[W1 = Ŵ1 , W2 = Ŵ2 ] ≥ 1 −
M1 M2
d-proc:
1
d(1 − k ) ≤ inf D(P kQ) = I(An , B n ; Y n )
M1 M2 Q∈(∗)
1
⇒R1 + R2 ≤ I(An , B n ; Y n ) + o(1)
n
311
To get separate bounds, we apply the same trick to evaluate the information flow from the
link between A → Y and B → Y separately:
1
Q1 [W2 = Ŵ2 ] = , P[W2 = Ŵ2 ] ≥ 1 −
M2
d-proc:
1
d(1 − k ) ≤ inf D(P kQ1 ) = I(B n ; Y n |An )
M2 Q1 ∈(∗1)
1
⇒R2 ≤ I(B n ; Y n |An ) + o(1)
n
similarly we can show that
1
R2 ≤ I(An ; Y n |B n ) + o(1)
n
P
For memoryless channels, we know that n1 I(An , B n ; Y n ) ≤ n1 k I(Ak , Bk ; Yk ). Similarly,
since given B n the channel An → Y n is still memoryless we have
n
X n
X
I(An ; Y n |B n ) ≤ I(Ak ; Yk |B n ) = I(Ak ; Yk |Bk )
k=1 k=1
therefore
h1 X i
(R1 , R2 ) ∈ P entak
n
k
[
⇒C ∈ co P enta
PA ,PB
312
§ 30. Examples of MACs. Maximal Pe and zero-error capacity.
30.1 Recap
Last time we defined the multiple access channel as the sequence of random transformations
{PY n |An B n : An × B n → Y n , n = 1, 2, . . .}
were co denotes the convex hull of the sets Penta, and Penta is
R1 ≤ I(A; Y |B)
Penta(PA , PB ) = R2 ≤ I(B; Y |A)
R1 + R2 ≤ I(A, B; Y )
Note that the union of Pentas need not look like a Penta region itself, as we will see in a later
example.
Where in this case the last constraint is not applicable; it does not restrict the capacity region.
313
R2
A PY |A YA C2
B PY |B YB R1
C1
Hence our capacity region is a rectangle bounded by the individual capacities of each channel.
Proof. Since the max above is achieved by an extreme point on one of the Penta regions, we can
drop the convex closure operation to get
[ [
max{R1 + R2 : (R1 , R2 ) ∈ co Penta(PA , PB )} = max{R1 + R2 : (R1 , R2 ) ∈ Penta(PA , PB )}
max {R1 + R2 : (R1 , R2 ) ∈ Penta(PA , PB )} ≤ max I(A, B; Y )
PA ,PB PA ,PB
Where the last step follows from the definition of Penta. Now we need to show that the constraint
on R1 + R2 in Penta is active at at least one point, so we need to show that I(A, B; Y ) ≤
I(A; Y |B) + I(B; Y |A) when A ⊥⊥ B, which follows from applying Kolmogorov identities
Y =A+B+Z mod 2
Z ∼ Ber(δ)
A, B ∈ {0, 1}
Since the output Y can only be 0 or 1, the capacity of this channel can be no larger than 1 bit. If
B doesn’t transmit at all, then A can achieve capacity 1 − h(δ) (and B can achieve capacity when
A doesn’t transmit), so that R1 , R2 ≤ 1 − h(δ). By time sharing we can obtain any point between
these two. This gives an inner bound on the capacity region. For an outer bound, we use Theorem
30.1, which gives
Hence R1 + R2 ≤ 1 − h(δ), so by this outer bound, we can do no better than time sharing between
the two individual channel capacity points.
314
R2
A 1 − h(δ)
Y
B
R1
1 − h(δ)
Remark: Even though this channel seems so simple, there are still hidden things about it, which
we’ll see later.
B
R1
1
Now we can ask: how do we achieve the corner points of the region, e.g. R1 = 1/2 and R2 = 1?
The answer gives insights into how to code for this channel. Take the greedy codebook B = {0, 1}n
(the entire space), then the channel A → Y is a DMC:
1
2 0
0 1
2
1
1
1 2
1 2
2
315
Which we recognize as a BEC(1/2) (no preference to either −1 or 1), which has capacity 1/2. How
do we decode? The idea is successive cancellation, where first we decode A, then remove  from Y ,
then decode B.
Yn Dec Ân
An BEC(1/2)
Bn B̂ n
Using this strategy, we can use a single user code for the BEC (an object we understand well) to
attain capacity.
Proof. Using the assumption that A = g(Y ) and expanding the mutual information
I(A; Y |B) + I(B; Y |A) = H(A) − H(Y |A) − H(Y |A, B) = H(A, Y ) − H(Y |A, B)
= H(Y ) − H(Y |A, B) = I(A, B; Y )
Therefore the R1 + R2 constraint is not active, so our region is a rectangle.
By symmetry, we take PB = Ber(1/2). When PA = Ber(p), the output has H(Y ) = p + h(p).
Using the above fact, the capacity region for the Multiplier MAC is
(
[ R1 ≤ H(A) = h(p)
C = co
R2 ≤ H(Y ) − H(A) = p
We can view this as the graph of the binary entropy function on its side, parametrized by p:
R1
1
1/2
R2
1
To achieve the extreme point (1, 1/2) of this region, we can use the same scheme as for the Adder
MAC: take the codebook of A to be {0, 1}n , then B sees a BEC(1/2). Again, successive cancellation
decoding can be used.
For future reference we note:
316
Lemma 30.1. The full capacity region of multiplier MAC is achieved with zero error.
Proof. For a given codebook D of user B the number of messages that user A can send equals the
total number of erasure patters that codebook D can tolerate with vanishing probability of error.
Fix rate R2 < 1 and let D be a row-span of a random linear nR2 × n binary matrix. Then randomly
erase each column with probability 1 − R2 − . Since on average there will be n(R2 + ) columns
left, the resulting matrix is still full-rank and the decoding is possible. In other words,
P[D is decodable, # of erasures ≈ n(1 − R2 − )] → 1 .
Hence, by counting the total number of erasures, for a random linear code we have
E[# of decodable erasure patterns for D] ≈ 2nh(1−R2 −)+o(n) .
And result follows by selecting a random element of the D-ensemble and then taking the codebook
of user A to be the set of decodable erasure patterns for a selected D.
e1 0 b b
1
{0, 1, 2, 3} ∋ A Erasure Y b
1 b
1 b b
2
2 b b
2 2 b
b e2
{−, +} ∋ B 3 b b
3 3 b
Here, B is received perfectly, We can use the fact above to see that the capacity region is
(
R1 ≤ 32
C=
R2 ≤ 1
For future reference we note the following:
Lemma 30.2. The zero-error capacity of the contraction MAC satisfies
R1 ≤ h(1/3) + (2/3 − p) log 2 , (30.1)
R2 ≤ h(p) (30.2)
for some p ∈ [0, 1/2]. In particular, the point R1 = 23 , R2 = 1 is not achievable with zero error.
Proof. Let C and D denote the zero-error codebooks of two users. Then for each string bn ∈ {+, −}n
denote
Ubn = {an : aj ∈ {0, 1} if bj = +, aj ∈ {2, 3} if bj = −} .
Then clearly for each bn we have
n ,D)
|Ubn | ≤ 2d(b ,
where d(bn , D) denotes the minimum Hamming distance from string bn to the set D. Then,
X n
|C| ≤ 2d(b ,D) (30.3)
bn
n
X
= 2j |{bn : d(bn , D) = j}| (30.4)
j=0
317
For a given cardinality |D| the set that maximizes the above sum is the Hamming ball. Hence,
R2 = h(p) + O(1) implies
Y =A+B+Z
Z ∼ N (0, 1)
E[A2 ] ≤ P1 , E[B 2 ] ≤ P2
A Y
B R1
1
2 log(1 + P1 )
Where the region is Penta(N (0, P1 ), N (0, P2 )). How do we achieve the rates in this region? We’ll
look at a few schemes.
1. TDMA: A and B switch off between transmitting at full rate and not transmitting at all. This
achieves any rate pair in the form
1 1
R1 = λ log(1 + P1 ), R2 = λ̄ log(1 + P2 )
2 2
Which is the dotted line on the plot above. Clearly, there are much better rates to be gained
by smarter schemes.
2. FDMA (OFDM): Dividing users into different frequency bands rather than time windows
gives an enormous advantage. Using frequency division, we can attain rates
1 P1 1 P2
R1 = λ log 1 + , R2 = λ̄ log 1 +
2 λ 2 λ̄
318
In fact, these rates touch the boundary of the capacity region at its intersection with the
R1 = R2 line. The optimal rate occurs when the power at each transmitter makes the noise
look white:
P1 P2 P1
= =⇒ λ∗ =
λ λ̄ P1 + P2
While this touches the capacity region at one point, it doesn’t quite reach the corner points.
Note, however, that practical systems (e.g. cellular networks) typically employ power control
that ensures received powers Pi of all users are roughly equal. In this case (i.e. when P1 = P2 )
the point where FDMA touches the capacity boundary is at a very desirable location of
symmetric rate R1 = R2 . This is one of the reasons why modern standards (e.g. LTE 4G) do
not employ any specialized MAC-codes and use OFDM together with good single-user codes.
3. Rate Splitting/Successive Cancellation: To reach the corner points, we can use successive
cancellation, similar to the decoding schemes in the Adder and Multiplier MACs. We can use
rates:
1
R2 = log(1 + P2 )
2
1 1 P1
R1 = (log(1 + P1 + P2 ) − log(1 + P2 )) = log 1 +
2 2 1 + P2
The second expression suggests that A transmits at a rate for an AWGN channel that has
power constraint P1 and noise 1 + P2 , i.e. the power used by B looks like noise to A.
Z
Y
A Dec A Â
B Dec B B̂
Theorem 30.2. There exists a successive-cancellation code (i.e. (E1 , E2 , D1 , D2 )) that achieves
the corner points of the Gaussian MAC capacity region.
Proof. Random coding: B n ∼ N (0, P2 )n . Since An now sees noise 1 + P2 , there exists a code
for A with rate R1 = 12 log(1 + P1 /(1 + P2 )).
This scheme (unlike the above two) can tolerate frame un-synchronization between the two
transmitters. This is because any chunk of length n has distribution N (0, P2 )n . It has
generalizations to non-corner points and to arbitrary number of users. See [RU96] for details.
319
Proof. The key observation for deterministic MAC is that C (max) = C0 (zero error capacity)
when ≤ 1/2. This is because when any two strings can be confused, the maximum probability
of error
• Contraction MAC: C0 6= C
• Multiplier MAC: C0 = C
• Adder MAC: C0 6= C. For this channel, no one yet can show that C0,sum < 3/2. The
idea is combinatorial in nature: produce two sets (Sidon sets) such that all pairwise sums
between the two do not overlap.
2. Separation does not hold: In the point to point channel, through joint source channel coding
we saw that an optimal architecture is to do source coding then channel coding separately.
This doesn’t hold for the MAC. Take as a simple example the Adder MAC with a correlated
source and bandwidth expansion factor ρ = 1. Let the source (S, T ) have joint distribution
1/3 1/3
PST =
0 1/3
We encode S n to channel input An and T n to channel input B n . The simplest possible scheme
is to not encoder at all; simply take Sj = Aj and Tj = Bj . Take the decoder
Ŝ T̂
Yj = 0 =⇒ 0 0
Yj = 1 =⇒ 0 1
Yj = 2 =⇒ 1 1
Which gives P[Ŝ n = S n , T̂ n = T n ] = 1, since we are able to take advantage of the zero entry
in joint distribution of our correlated source.
Can we achieve this with a separated source? Amazingly, even though the above scheme is so
simple, we can’t! The compressors in the separated architecture operate in the Slepian-Wolf
region (see Theorem 9.7)
R1 ≥ H(S|T )
R2 ≥ H(T |S)
R1 + R2 ≥ H(S, T ) = log 3
Hence the sum rate for compression must be ≥ log 3, while the sum rate for the Adder MAC
must be ≤ 3/2, so these two regions do not overlap, hence we can not operate at a bandwidth
expansion factor of 1 for this source and channel.
320
R2
Slepian Wolf
1 log 3
1/2
3/2
Adder MAC
R1
1/2 1
3. Linear codes beat generic ones: Consider a BSC-MAC and suppose that two users A and
B have independet k-bit messages W1 , W2 ∈ Fk2 . Suppose the receiver is only interested in
estimating W1 + W2 . What is the largest ratio k/n? Clearly, separation can achieve
1
k/n ≈ (log 2 − h(δ))
2
by simply creating a scheme in which both W1 and W2 are estimated and then their sum is
computed.
A more clever solution is however to encode
An = G · W 1 ,
B n = G · W2 ,
Y n = An + B n + Z n = G(W1 + W2 ) + Z n .
where G is a generating matrix of a good k-to-n linear code. Then, provided that
the sum W1 + W2 is decodable (see Theorem 18.2). Hence even for a simple BSC-MAC there
exist clever ways to exceed MAC capacity for certain scenarios. Note that this “distributed
computation” can also be viewed as lossy source coding with a distortion metric that is only
sensitive to discrepancy between W1 + W2 and Ŵ1 + Ŵ2 .
4. Dispersion is unknown: We have seen that for the point-to-point channel, not only we know
the capacity, but the next-order terms (see Theorem 22.2). For the MAC-channel only the
capacity is known. In fact, let us define
∗
Rsum (n, ) , sup{R1 + R2 : (R1 , R2 ) ∈ R∗ (n, )} .
321
So it is not even known if sum-rate approaches sum-capacity from above or from below as
n → ∞! What is even more surprising, is that the dependence of the residual term on is not
clear at all. In fact, despite the decades of attempts, even for = 0 the best known bound to
date is just the Fano’s inequality(!)
∗ 3
Rsum (n, 0) ≤ .
2
322
§ 31. Random number generators
Let’s play the following game: Given a stream of Bern(p) bits, with unknown p, we want to turn
them into pure random bits, i.e., independent fair coin flips Bern(1/2). Our goal is to find a universal
way to extract the most number of bits.
In 1951 von Neumann [vN51] proposed the following scheme: Divide the stream into pairs of
bits, output 0 if 10, output 1 if 01, otherwise do nothing and move to the next pair. Since both 01
and 10 occur with probability pq (where q = 1 − p throughout this lecture), regardless of the value
of p, we obtain fair coin flips at the output. To measure the efficiency of von Neumann’s scheme,
note that, on average, we have 2n bits in and 2pqn bits out. So the efficiency (rate) is pq. The
question is: Can we do better?
Several variations:
2. Exact v.s. approximately fair coin flips: in terms of total variation or Kullback-Leibler
divergence
31.1 Setup
Recall from Lecture 8 that {0, 1}∗ = ∪k≥0 {0, 1}k = {∅, 0, 1, 00, 01, . . . } denotes the set of all finite-
length binary strings, where ∅ denotes the empty string. For any x ∈ {0, 1}∗ , let l(x) denote the
length of x.
Let’s first introduce the definition of random number generator formally. If the input vector is
X, denote the output (variable-length) vector by Y ∈ {0, 1}∗ . Then the desired property of Y is
the following: Conditioned on the length of Y being k, Y is uniformly distributed on {0, 1}k .
Definition 31.1 (Extractor). We say Ψ : {0, 1}∗ → {0, 1}∗ is an extractor if
The rate of Ψ is
E[l(Ψ(X n ))] i.i.d.
rΨ (p) = lim sup , X n ∼ Bern(p).
n→∞ n
Note that the von Neumann scheme above defines a valid extractor ΨvN (with ΨvN (x2n+1 ) =
ΨvN (x2n )), whose rate is rvN (p) = pq. Clearly this is wasteful, because even if the input bits are
already fair, we only get 25% in return.
323
31.2 Converse
No extractor has a rate higher than the binary entropy function. The proof is simply data processing
inequality for entropy and the converse holds even if the extractor is allowed to be non-universal
(depending on p).
Theorem 31.1. For any extractor Ψ and any p ∈ (0, 1),
1 1
rΨ (p) ≥ h(p) = p log2 + q log2 .
p q
Proof. Let L = Ψ(X n ). Then
nh(p) = H(X n ) ≥ H(Ψ(X n )) = H(Ψ(X n )|L) + H(L) ≥ H(Ψ(X n )|L) = E [L] bits,
where the step follows from the assumption on Ψ that Ψ(X n ) is uniform over {0, 1}k conditioned
on L = k.
The rate of von Neumann extractor and the entropy bound are plotted below. Next we present
two extractors, due to Elias [Eli72] and Peres [Per92] respectively, that attain the binary entropy
function. (More precisely, both construct a sequence of extractors whose rate approaches the entropy
bound).
rate
1 bit
rvN
p
0 1 1
2
324
Proof. We defined f by partitioning [M ] into subsets whose cardinalities are powers of two, and
assign elements P in each subset to binary strings of that length. Formally, denote the binary expansion
of M by M = ni=0 mi 2i , where the most significant bit mn = 1 and n = blog2 M c + 1. Those
non-zero mi ’s defines a partition [M ] = ∪tj=0 Mj , where |Mi | = 2ij . Map the elements of Mj to
{0, 1}ij . Finally, notice that uniform distribution conditioned on any subset is still uniform.
To prove the bound on the expected length, the upper bound follows from the same entropy
argument log2 M = H(U ) ≥ H(f (U )) ≥ H(f (U )|l(f (U ))) = E[l(f (U ))], and the lower bound
follows from
n n n
1 X 1 X 2n X i−n 2n+1
E[l(f (U ))] = i
mi 2 · i = n − i
mi 2 (n − i) ≥ n − 2 (n − i) ≥ n − ≥ n − 4,
M M M M
i=0 i=0 i=0
Elias’ extractor Let w(xn ) define the Hamming weight (number of ones) of a binary string. Let
Tk = {xn ∈ {0, 1}n : w(xn ) = k} define the Hamming sphere of radius k. For each 0 ≤ k ≤ n, we
apply the function f from Lemma 31.1 to each Tk . This defines a mapping ΨE : {0, 1}∗ → {0, 1}∗
and then we extend it to ΨE : {0, 1}n → {0, 1}∗ by applying the mapping per n-bit block and
discard the last incomplete block. Then it is clear that the rate is given by n1 E[l(ΨE (X n ))]. By
Lemma 31.1, we have
n n
E log − 4 ≤ E[l(ΨE (X ))] ≤ E log
n
w(X n ) w(X n )
Using Stirling’s approximation (see, e.g., [Ash65, Lemma 4.7.1]), we have
2nh(p) n 2nh(p)
√ ≤ ≤√ (31.1)
8npq k 2πnpq
where p = 1 − q = k/n ∈ (0, 1). Since w(X n ) ∼ Bin(n, p), we have
E[l(ΨE (X n ))] = nh(p) + O(log n).
Therefore the extraction rate approaches the optimum h(p) as n → ∞.
325
Peres’ extractor For each t ∈ N, recursively define an extractor Ψt as follows:
• Set Ψ1 to be von Neumann’s extractor ΨvN , i.e., Ψ1 (x2n+1 ) = Ψ1 (x2n ) = y k .
• Define Ψt by Ψt (x2n ) = Ψt (x2n+1 ) = (Ψ1 (x2n ), Ψt−1 (un ), Ψt−1 (v n−k )).
Example: Input x = 100111010011 of length 2n = 12. Output recursively:
y u v
z }| { z }| { z }| {
(011) (110100) (101)
(1)(010)(10)(0)
(1)(0)
Next we (a) verify Ψt is a valid extractor; (b) evaluate its efficiency (rate). Note that the bits
that enter into the iteration are no longer i.i.d. To compute the rate of Ψt , it is convenient to
introduce the notion of exchangeability. We say X n are exchangeable if the joint distribution is
invariant under permutation, that is, PX1 ,...,Xn = PXπ(1) ,...,Xπ(n) for any permutation π on [n]. In
particular, if Xi ’s are binary, then X n are exchangeable if and only if the joint distribution only
depends on the Hamming weight, i.e., PX n =xn = p(w(xn )). Examples: X n is iid Bern(p); X n is
uniform over the Hamming sphere Tk .
As an example, if X 2n are i.i.d. Bern(p), then conditioned on L = k, V n−k is iid Bern(p2 /(p2 +q 2 )),
since L ∼ Binom(n, 2pq) and
pk+2m q n−k−2m
P[Y k = y, U n = u, V n−k = v|L = k] = n
2 2 n−k (2pq)k
k (p + q )
−1
n p2 m q 2 n−k−m
= 2−k · · 2
k p + q2 p2 + q 2
= P[Y k = y|L = k]P[U n = u|L = k]P[V n−k = v|L = k],
where m = w(v). In general, when X 2n are only exchangeable, we have the following:
Lemma 31.2 (Ψt preserves exchangebility). Let X 2n be exchangeable and L = Ψ1 (X 2n ). Then con-
i.i.d.
ditioned on L = k, Y k , U n and V n−k are independent and exchangeable. Furthermore, Y k ∼ Bern( 12 )
and U n is uniform over Tk .
Proof. If suffices to show that ∀y, y 0 ∈ {0, 1}k , u, u0 ∈ Tk and v, v 0 ∈ {0, 1}n−k such that w(v) = w(v 0 ),
we have
Note that the string X 2n and the triple (Y k , U n , V n−k ) are in one-to-one correspondence of each
other. Indeed, to reconstruct X 2n , simply read the k distinct pairs from Y and fill them according
to the locations of ones in U and fill the remaining equal pairs from V . Finally, note that u, y, v and
u0 , y 0 , v 0 correspond to two input strings x and x0 of identical Hamming weight (w(x) = k + 2w(v))
and hence of identical probability due to the exchangeability of X 2n . [Examples: (y, u, v) =
(01, 1100, 01) ⇒ x = (10010011), (y, u, v) = (11, 1010, 10) ⇒ x0 = (01110100).]
Computing the marginals, we conclude that both Y k and U n are uniform over their respective
support set.
i.i.d.
Lemma 31.3 (Ψt is an extractor). Let X 2n be exchangeable. Then Ψt (X 2n ) ∼ Bern(1/2) condi-
tioned on l(Ψt (X 2n )) = m.
326
Proof. Note that Ψt (X 2n ) ∈ {0, 1}∗ . It is equivalent to show that for all sm ∈ {0, 1}m ,
Proceed by induction on t. The base case of t = 1 follows from Lemma 31.2 (the distribution of the
Y part). Assume Ψt−1 is an extractor. Recall that Ψt (X 2n ) = (Ψ1 (X 2n ), Ψt−1 (U n ), Ψt−1 (V n−k ))
and write the length as L = L1 + L2 + L3 , where L2 ⊥ ⊥ L3 |L1 by Lemma 31.2. Then
P[Ψt (X 2n ) = sm ]
Xm
= P[Ψt (X 2n ) = sm |L1 = k]P[L1 = k]
k=0
Xm m−k
X
Lemma 31.2
= P[L1 = k]P[Y k = sk |L1 = k]P[Ψt−1 (U n ) = sk+r
k+1 |L1 = k]P[Ψt−1 (V
n−k
) = sm
k+r+1 |L1 = k]
k=0 r=0
Xm m−k
X
induction
= P[L1 = k]2−k 2−r P[L2 = r|L1 = k]2−(m−k−r) P[L3 = m − k − r|L1 = k]
k=0 r=0
= 2−m P[L = m].
i.i.d.
Next we compute the rate of Ψt . Let X 2n ∼ Bern(p). Then by SLLN, 2n 1
l(Ψ1 (X 2n )) , L2nn
1 a.s.
converges a.s. to pq. Assume, again by induction, that 2n l(Ψt−1 (X 2n ))−−→rt−1 (p), with r1 (p) = pq.
Then
1 Ln 1 1
l(Ψt (X 2n )) = + l(Ψt−1 (U n )) + l(Ψt−1 (V n−Ln )).
2n 2n 2n 2n
i.i.d. i.i.d. a.s.
Note that U n ∼ Bern(2pq), V n−Ln |Ln ∼ Bern(p2 /(p2 + q 2 )) and Ln −−→∞. Then the induction
a.s. a.s.
hypothesis implies that n1 l(Ψt−1 (U n ))−−→rt−1 (2pq) and 2(n−L 1
n)
l(Ψt−1 (V n−Ln ))−−→rt−1 (p2 /(p2 +
q 2 )). We obtain the recursion:
1 p2 + q 2 p2
rt (p) = pq + rt−1 (2pq) + rt−1 , (T rt−1 )(p), (31.2)
2 2 p2 + q 2
where the operator T maps a continuous function on [0, 1] to another. Furthermore, T is monotone
in the senes that f ≤ g pointwise then T f ≤ T g. Then it can be shown that rt converges
monotonically from below to the fixed-point of T , which turns out to be exactly the binary
entropy function h. Instead of directly verifying T h = h, next we give a simple proof: Consider
i.i.d.
X1 , X2 ∼ Bern(p). Then 2h(p) = H(X1 , X2 ) = H(X1 ⊕ X2 , X1 ) = H(X1 ⊕ X2 ) + H(X1 |X1 ⊕ X2 ) =
2
h(p2 + q 2 ) + 2pqh( 21 ) + (p2 + q 2 )h( p2p+q2 ).
The convergence of rt to h are shown in Fig. 31.1.
327
1.0
0.8
0.6
0.4
0.2
Figure 31.1: Rate function rt for t = 1, 4, 10 versus the binary entropy function.
The above result deals with what f (p) can be simulated in principle. What type of computational
devices are needed for such as task? Note that since r1 (p) is quadratic in p, all rate functions rt
that arise from the iteration (31.2) are rational functions (ratios of polynomials), converging to
the binary entropy function as Fig. 31.1 shows. It turns out that for any rational function f that
satisfies 0 < f < 1 on (0, 1), we can generate independent Bern(f (p)) from Bern(p) using either of
the following schemes with finite memory [MP05]:
1. Finite-state machine (FSM): initial state (red), intermediate states (white) and final states
(blue, output 0 or 1 then reset to initial state).
2. Block simulation: let A0 , A1 be disjoint subsets of {0, 1}k . For each k-bit segment, output 0 if
falling in A0 or 1 if falling in A1 . If neither, discard and move to the next segment. The block
size is at most the degree of the denominator polynomial of f .
328
Goal Block simulation FSM
1
1
0
1 0
0
0 0
1
f (p) = 2pq A0 = 00, 11; A1 = 01, 10 0 1
0
1 1
0 0
0
0
1 1
p3
f (p) = p3 +q 3
A0 = 000; A1 = 111
0 0
1
1
1 1
Exercise: How to generate f (p) = 1/3?
It turns out that the only type of f that can be simulated using either FSM or block simulation
√
is rational function. For f (p) = p, which satisfies Keane-O’Brien’s characterization, it cannot be
simulated by FSM or block simulation, but it can be simulated by pushdown automata (PDA),
which are FSM operating with a stack (infinite memory) [MP05].
What is the optimal Bernoulli factory with the best rate is unclear. Clearly, a converse is the
h(p)
entropy bound h(f (p)) , which can be trivial (bigger than one).
• To generate P = [1/2, 1/4, 1/4] on {a, b, c}, use the finite tree: E[L] = 1.5.
329
0 1
a
1 1
b c
• To generate P = [1/3, 2/3] on {a, b} (note that 2/3 = 0.1010 . . . , 1/3 = 0.0101 . . .), use the
infinite tree: E[L] = 2 (geometric distribution)
0 1
a
0 1
b
0 1
a ..
.
330
§ 32. Entropy method in combinatorics and geometry
A commonly used method in combinatorics for bounding the number of certain objects from above
involves a smart application of Shannon entropy. Usually the method proceeds as follows: in order
to count the cardinality of a given set C, we draw an element uniformly at random from C, whose
entropy is given by log |C|. To bound |C| from above, we describe this random object by a random
vector X = (X1 , . . . , Xn ), e.g., an indicator vector, then proceed to compute or upper-bound the
joint entropy H(X1 , . . . , Xn ) using methods we learned in Part I.
Notably, three methods of increasing precision are as follows:
• Marginal bound:
n
X
H(X1 , . . . , Xn ) ≤ H(Xi )
i=1
Next, we give three applications using the above three methods, respectively, in increasing difficulties:
• Enumerating binary vectors of a given average weights
• Counting triangles and other subgraphs
• Brégman’s theorem
Finally, to demonstrate how entropy method can also be used for questions in Euclidean spaces,
we prove the Loomis-Whitney and Bollobás-Thomason theorems based on analogous properties of
differential entropy.
where w(x) is the Hamming weight (number of 1’s) of x ∈ {0, 1}n . Then |C| ≤ 2nh(p) .
331
Remark 32.1. This result holds even if p > 1/2.
where pi = P [Xi = 1] is the fraction of vertices whose i-th bit is 1. Note that
n
1X
p= pi ,
n
i=1
since we can either first average over vectors in C or first average across different bits. By Jensen’s
inequality and the fact that x 7→ h(x) is concave,
n n
!
X 1X
h(pi ) ≤ nh pi = nh(p).
n
i=1 i=1
Theorem 32.1.
k
X n
≤ 2nh(k/n) , k ≤ n/2.
j
j=0
Proof. We take C = {x ∈ {0, 1}n : w(x) ≤ k} and invoke the previous lemma, which says that
k
X n
= |C| ≤ 2nh(p) ≤ 2nh(k/n) ,
j
j=0
where the last inequality follows from the fact that x 7→ h(x) is increasing for x ≤ 1/2.
Remark 32.2. Alternatively, we can prove the theorem using the large-deviation bound in Part III.
By the Chernoff bound on the binomial tail (see Example after Theorem 14.1),
LHS k 1 RHS
n
= P(Bin(n, 1/2) ≤ k) ≤ 2−nd( n k 2 ) = 2−n(1−h(k/n)) = n .
2 2
N( , ) = 4, N( , ) = 4.
1
To be precise, here N (H, G) is the injective homomorphism number, that is, the number of injective mappings
ϕ : V (H) → V (G) which preserve edges, i.e., (ϕ(v)ϕ(w)) ∈ E(G) whenever (vw) ∈ E(H). Note that if H is a clique,
every homomorphism is automatically injective.
332
If we know G has m edges, what is the maximal number of H that are contained in G? To study
this quantity, let’s define
N (H, m) = max N (H, G).
G:|E(G)|≤m
Theorem 32.2.
∗ (H) ∗ (H)
c0 (H)mρ ≤ N (H, m) ≤ c1 (H)mρ . (32.4)
For example, for triangles we have ρ∗ (K3 ) = 3/2 and Theorem 32.2 is consistent with (32.1).
Proof. Upper bound: Let |V (H)| = n and let w∗ (e) be the solution for ρ∗ (H). For any G with m
edges, draw a subgraph of G, uniformly at random from all those that are isomorphic to H, and
label its vertices by X = (X1 , . . . , Xn ). Now define a random 2-subset S of [n] by sampling an
∗ (e)
edge e from E(H) with probability ρw∗ (H) . By the definition of ρ∗ (H) we have for any i ∈ [n] that
P[i ∈ S] ≥ ρ∗ (H)
1
. We are now ready to apply Shearer’s Theorem 1.4:
1
log(N (H, G)n!) = H(X)
ρ∗ (H)
≤H(XS |S) ≤ log(2m) ,
2
If the “∈ [0, 1]” constraints in (32.3) and (32.5) are replaced by “∈ {0, 1}”, we obtain the covering number ρ(H)
and the independence number α(H) of H, respectively.
333
where the last bound is as before: if S = {v, w} then XS = (Xv , Xw ) takes one of 2m values. Overall,
1 ∗
we get N (H, G) ≥ n! (2m)ρ (H) .
Lower bound: It amounts to construct a graph G with m edges for which N (H, G) ≥
∗
c(H)|e(G)|ρ (H) . Consider the dual LP of (32.3)
X
∗
α (H) = max ψ(v) : ψ(v) + ψ(w) ≤ 1, ∀(vw) ∈ E, ψ(v) ∈ [0, 1] (32.5)
ψ
v∈V (H)
i.e., the fractional packing number. By the duality theorem of LP, we have α∗ (H) = ρ∗ (H). The
graph G is constructed as follows: for each vertex v of H, replicate it for m(v) times. For each edge
e = (vw) of H, replace it by a complete bipartite graph Km(v),m(w) . Then the total number of edges
of G is X
|E(G)| = m(v)m(w).
(vw)∈E(H)
Q
Furthermore, N (G, H) ≥ v∈V (H) m(v). To minimize the exponent log N (G,H)
log |E(G)| , fix a large number
M and let m(v) = M ψ(v) , where ψ is the maximizer in (32.5). Then
X
|E(G)| ≤ 4M ψ(v)+ψ(w) ≤ 4M |E(H)|
(vw)∈E(H)
Y ∗ (H)
N (G, H) ≥ M ψ(v) = M α
v∈V (H)
where Sn denotes the group of all permutations of n letters. For a bipartite graph G with n vertices
on the left and right respectively, the number of perfect matchings in G is given by perm(A), where
A is the adjacency matrix.
Example:
perm = 1, perm
=2
where di is the degree of left vertex i (i.e. sum of the ith row).
334
Example: Consider G = Kn,n . Then perm(G) = n!, and the RHS is [(n!)1/n ]n = n! as well. More
generally, if G consists of n/d copies of Kd,d , then Bregman’s bound is tight and perm = (d!)n/d .
First attempt: We select a perfect matching uniformly at random which matches the ith left
vertex to the Xi th right one. Let X = (X1 , . . . , Xn ). Then
n
X n
X
log perm(A) = H(X) = H(X1 , . . . , Xn ) ≤ H(Xi ) ≤ log(di ).
i=1 i=1
Q
Hence perm(A) ≤ i di . This is much worse than Brégman’s bound since by Stirling’s formula
n n
!
Y 1 Y P
(di !) di ∼ di e− di .
i=1 i=1
P
where di is the total number of edges.
Second attempt: The hope is to the chain rule to expand the joint entropy and bound the
conditional entropy more carefully. Let’s write
n
X n
X
H(X1 , . . . , Xn ) = H(Xi |X1 , . . . , Xi−1 ) ≤ E[log Ni ].
i=1 i=1
where Ni , as a random variable, denotes the number of possible values Xi can take conditioned
on X1 , . . . , Xi−1 , i.e., how many possible matchings for left vertex i given the outcome of where
1, . . . , i − 1 are matched to. However, it is hard to proceed from this point as we only know the
degree information, not the graph itself. In fact, since we do not know the relative positions of the
vertices, why should we order like this? The key idea is to label the vertices randomly, apply chain
rule in this random order and average.
To this end, pick π uniformly at random from Sn and independent of X. Then
where Nk denotes the number of possible matchings for vertex k given the outcome of {Xj : π −1 (j) <
π −1 (k)} are matched to and the expectation is with respect to (X, π). The key observation is:
Lemma 32.2. Nk is uniformly distributed on [dk ].
Example: Consider the graph G on the right. Consider k = 1. Then dk = 2. 1 1
Depending on the random ordering, if π = 1 ∗ ∗, then Nk = 2
w.p. 1/3; if π = ∗ ∗ 1, then Nk = 1 w.p. 1/3; if π = 213, then 2 2
Nk = 2 w.p. 1/3; if π = 312, then Nk = 1 w.p. 1/3. Combining
everything, indeed Nk is equally likely to be 1 or 2. 3 3
335
Thus,
dk
1 X 1
E(X,π) log Nk = log i = log(di !) di
dk
i=1
and hence
n
X n
Y
1 1
log perm(A) ≤ log(di !) di
= log (di !) di .
k=1 i=1
Finally, we prove Lemma 32.2:
Proof. Note that Xi = σ(i) for some random permutation σ. Let T = ∂(k) be the neighbors of k.
Then
Nk = |T \{σ(j) : π −1 (j) < π −1 (k)}|
which is a function of (σ, π). In fact, conditioned on any realization of σ, Nk is uniform over [dk ].
To see this, note that σ −1 (T ) is a fixed subset of [n] of cardinality dk , and k ∈ σ −1 (T ). On the
other hand, S , {j : π −1 (j) < π −1 (k)} is a uniformly random subset of [n]\{k}. Then
Nk = |σ −1 (T )\S| = 1 + |σ −1 (T )\{k} ∩ S|,
| {z }
uniform({0,...,dk −1})
336
Corollary 32.1 (Loomis-Whitney). Let K be a compact subset of Rn and let Kj c denote the
projection of K onto coordinates in [n] \ j. Then
n
Y 1
Leb{K} ≤ Leb{Kj c } n−1 . (32.6)
j=1
The meaning of Loomis-Whitney inequality is best understood by introducing the average width
Leb{K}
of K in direction j: wj , Leb{Kjc }
. Then (32.6) is equivalent to
n
Y
Leb{K} ≥ wj ,
j=1
i.e. that volume of K is greater than volume of the rectangle of average widths.
337
Bibliography
[AFTS01] I.C. Abou-Faycal, M.D. Trott, and S. Shamai. The capacity of discrete-time memoryless
rayleigh-fading channels. IEEE Transaction Information Theory, 47(4):1290 – 1301, 2001.
[Ahl82] Rudolf Ahlswede. An elementary proof of the strong converse theorem for the multiple-
access channel. J. Combinatorics, Information and System Sciences, 7(3), 1982.
[Alo81] Noga Alon. On the number of subgraphs of prescribed type of graphs with a given
number of edges. Israel J. Math., 38(1-2):116–130, 1981.
[AN07] Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191.
American Mathematical Soc., 2007.
[AS08] Noga Alon and Joel H. Spencer. The Probabilistic Method. John Wiley & Sons, 3rd
edition, 2008.
[Ash65] Robert B. Ash. Information Theory. Dover Publications Inc., New York, NY, 1965.
[BF14] Ahmad Beirami and Faramarz Fekri. Fundamental limits of universal lossless one-to-one
compression of parametric sources. In Information Theory Workshop (ITW), 2014 IEEE,
pages 212–216. IEEE, 2014.
[Bla74] R. E. Blahut. Hypothesis testing and information theory. IEEE Trans. Inf. Theory,
20(4):405–417, 1974.
[BNO03] Dimitri P Bertsekas, Angelia Nedić, and Asuman E. Ozdaglar. Convex analysis and
optimization. Athena Scientific, Belmont, MA, USA, 2003.
[Boh38] H. F. Bohnenblust. Convex regions and projections in Minkowski spaces. Ann. Math.,
39(2):301–308, 1938.
[Bre73] Lev M Bregman. Some properties of nonnegative matrices and their permanents. Soviet
Math. Dokl., 14(4):945–949, 1973.
[CB94] Bertrand S Clarke and Andrew R Barron. Jeffreys’ prior is asymptotically least favorable
under entropy risk. Journal of Statistical planning and Inference, 41(1):37–60, 1994.
[Ç11] Erhan Çinlar. Probability and Stochastics. Springer, New York, 2011.
338
[Cho56] Noam Chomsky. Three models for the description of language. IRE Trans. Inform. Th.,
2(3):113–124, 1956.
[CK81a] I. Csiszár and J. Körner. Graph decomposition: a new key to coding theorems. IEEE
Trans. Inf. Theory, 27(1):5–12, 1981.
[CK81b] I. Csiszár and J. Körner. Information Theory: Coding Theorems for Discrete Memoryless
Systems. Academic, New York, 1981.
[CS83] J. Conway and N. Sloane. A fast encoding method for lattice codes and quantizers. IEEE
Transactions on Information Theory, 29(6):820–824, Nov 1983.
[CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory, 2nd Ed. Wiley-
Interscience, New York, NY, USA, 2006.
[Eli55] Peter Elias. Coding for noisy channels. IRE Convention Record, 3:37–46, 1955.
[Eli72] P. Elias. The efficient construction of an unbiased random sequence. Annals of Mathe-
matical Statistics, 43(3):865–870, 1972.
[ELZ05] Uri Erez, Simon Litsyn, and Ram Zamir. Lattices which are good for (almost) everything.
IEEE Transactions on Information Theory, 51(10):3401–3416, Oct. 2005.
[EZ04] U. Erez and R. Zamir. Achieving 12 log(1 + SNR) on the AWGN channel with lattice
encoding and decoding. IEEE Trans. Inf. Theory, IT-50:2293–2314, Oct. 2004.
[FJ89] G.D. Forney Jr. Multidimensional constellations. II. Voronoi constellations. IEEE
Journal on Selected Areas in Communications, 7(6):941–958, Aug 1989.
[FK98] Ehud Friedgut and Jeff Kahn. On the number of copies of one hypergraph in another.
Israel J. Math., 105:251–256, 1998.
[FMG92] Meir Feder, Neri Merhav, and Michael Gutman. Universal prediction of individual
sequences. IEEE Trans. Inf. Theory, 38(4):1258–1270, 1992.
[Gil10] Gustavo L Gilardoni. On pinsker’s and vajda’s type inequalities for csiszár’s-divergences.
Information Theory, IEEE Transactions on, 56(11):5377–5386, 2010.
[GL95] Richard D Gill and Boris Y Levit. Applications of the van Trees inequality: a Bayesian
Cramér-Rao bound. Bernoulli, pages 59–79, 1995.
339
[Hoe65] Wassily Hoeffding. Asymptotically optimal tests for multinomial distributions. The
Annals of Mathematical Statistics, pages 369–401, 1965.
[HV11] P. Harremoës and I. Vajda. On pairs of f -divergences and their joint range. IEEE Trans.
Inf. Theory, 57(6):3230–3235, Jun. 2011.
[KO94] M.S. Keane and G.L. O’Brien. A Bernoulli factory. ACM Transactions on Modeling and
Computer Simulation, 4(2):213–219, 1994.
[Kos63] VN Koshelev. Quantization with minimal entropy. Probl. Pered. Inform, 14:151–156,
1963.
[KS14] Oliver Kosut and Lalitha Sankar. Asymptotics and non-asymptotics for universal fixed-
to-variable source coding. arXiv preprint arXiv:1412.4444, 2014.
[KV14] Ioannis Kontoyiannis and Sergio Verdú. Optimal lossless data compression: Non-
asymptotics and asymptotics. IEEE Trans. Inf. Theory, 60(2):777–795, 2014.
[LC86] Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer-Verlag, New
York, NY, 1986.
[LM03] Amos Lapidoth and Stefan M Moser. Capacity bounds via duality with applications to
multiple-antenna systems on flat-fading channels. IEEE Transactions on Information
Theory, 49(10):2426–2467, 2003.
[Loe97] Hans-Andrea Loeliger. Averaging bounds for lattices and linear codes. IEEE Transactions
on Information Theory, 43(6):1767–1773, Nov. 1997.
[Mas74] James Massey. On the fractional weight of distinct binary n-tuples (corresp.). IEEE
Transactions on Information Theory, 20(1):131–131, 1974.
[MF98] Neri Merhav and Meir Feder. Universal prediction. IEEE Trans. Inf. Theory, 44(6):2124–
2147, 1998.
[MP05] Elchanan Mossel and Yuval Peres. New coins from old: computing with unknown bias.
Combinatorica, 25(6):707–724, 2005.
[MT10] Mokshay Madiman and Prasad Tetali. Information inequalities for joint distributions,
with interpretations and applications. IEEE Trans. Inf. Theory, 56(6):2699–2713, 2010.
[OE15] O. Ordentlich and U. Erez. A simple proof for the existence of “good” pairs of nested
lattices. IEEE Transactions on Information Theory, Submitted Aug. 2015.
[OPS48] BM Oliver, JR Pierce, and CE Shannon. The philosophy of pcm. Proceedings of the IRE,
36(11):1324–1331, 1948.
[Per92] Yuval Peres. Iterating von Neumann’s procedure for extracting random bits. Annals of
Statistics, 20(1):590–597, 1992.
[PPV10a] Y. Polyanskiy, H. V. Poor, and S. Verdú. Channel coding rate in the finite blocklength
regime. IEEE Trans. Inf. Theory, 56(5):2307–2359, May 2010.
[PPV10b] Y. Polyanskiy, H. V. Poor, and S. Verdú. Feedback in the non-asymptotic regime. IEEE
Trans. Inf. Theory, April 2010. submitted for publication.
340
[PPV11] Y. Polyanskiy, H. V. Poor, and S. Verdú. Minimum energy to send k bits with and
without feedback. IEEE Trans. Inf. Theory, 57(8):4880–4902, August 2011.
[PW14] Y. Polyanskiy and Y. Wu. Peak-to-average power ratio of good codes for Gaussian
channel. IEEE Trans. Inf. Theory, 60(12):7655–7660, December 2014.
[PW17] Yury Polyanskiy and Yihong Wu. Strong data-processing inequalities for channels and
Bayesian networks. In Eric Carlen, Mokshay Madiman, and Elisabeth M. Werner, editors,
Convexity and Concentration. The IMA Volumes in Mathematics and its Applications,
vol 161, pages 211–249. Springer, New York, NY, 2017.
[Ree65] Alec H Reeves. The past present and future of PCM. IEEE Spectrum, 2(5):58–62, 1965.
[RSU01] Thomas J. Richardson, Mohammad Amin Shokrollahi, and Rüdiger L. Urbanke. Design
of capacity-approaching irregular low-density parity-check codes. IEEE Transactions on
Information Theory, 47(2):619–637, 2001.
[RU96] Bixio Rimoldi and Rüdiger Urbanke. A rate-splitting approach to the gaussian multiple-
access channel. Information Theory, IEEE Transactions on, 42(2):364–375, 1996.
[SF11] Ofer Shayevitz and Meir Feder. Optimal feedback communication via posterior matching.
IEEE Trans. Inf. Theory, 57(3):1186–1222, 2011.
[Sha48] C. E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:379–423
and 623–656, July/October 1948.
[Sio58] Maurice Sion. On general minimax theorems. Pacific J. Math, 8(1):171–176, 1958.
[Spi96] Daniel A. Spielman. Linear-time encodable and decodable error-correcting codes. IEEE
Transactions on Information Theory, 42(6):1723–1731, 1996.
[SV11] Wojciech Szpankowski and Sergio Verdú. Minimum expected length of fixed-to-variable
lossless compression without prefix constraints. IEEE Trans. Inf. Theory, 57(7):4017–4025,
2011.
341
[TE97] Giorgio Taricco and Michele Elia. Capacity of fading channel with no side information.
Electronics Letters, 33(16):1368–1370, 1997.
[Top00] F. Topsøe. Some inequalities for information divergence and related measures of discrim-
ination. IEEE Transactions on Information Theory, 46(4):1602–1609, 2000.
[TV05] David Tse and Pramod Viswanath. Fundamentals of wireless communication. Cambridge
University Press, 2005.
[UR98] R. Urbanke and B. Rimoldi. Lattice codes can achieve capacity on the AWGN channel.
IEEE Transactions on Information Theory, 44(1):273–278, 1998.
[Ver07] S. Verdú. EE528–Information Theory, Lecture Notes. Princeton Univ., Princeton, NJ,
2007.
[vN51] J. von Neumann. Various techniques used in connection with random digits. Monte
Carlo Method, National Bureau of Standards, Applied Math Series, (12):36–38, 1951.
[Yek04] Sergey Yekhanin. Improved upper bound for the redundancy of fix-free codes. IEEE
Trans. Inf. Theory, 50(11):2815–2818, 2004.
[Yu97] Bin Yu. Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam: Research Papers
in Probability and Statistics, pages 423–435, 1997.
[Zam14] Ram Zamir. Lattice Coding for Signals and Networks. Cambridge University Press,
Cambridge, 2014.
[ZY98] Zhen Zhang and Raymond W Yeung. On characterization of entropy function via
information inequalities. IEEE Trans. Inf. Theory, 44(4):1440–1452, 1998.
342