0% found this document useful (0 votes)
15 views

Information Theory For Single-User Systems With Arbitrary Statistical Memory

This document provides a 3-paragraph summary of a technical research paper on information theory for single-user systems with arbitrary statistical memory. The paper introduces generalized information measures that can be used to analyze discrete-time single-user stochastic systems that are not necessarily stationary, ergodic or information stable. These measures include the information spectrum, quantiles, and extensions of entropy, mutual information and divergence. The paper will cover topics such as lossless and lossy data compression theorems, measures of randomness and resolvability, channel coding theorems, and multi-terminal information theory using these generalized information measures. The paper is intended as a sequel to the authors' previous textbook on introductory information theory. It

Uploaded by

Tao-Wei Huang
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views

Information Theory For Single-User Systems With Arbitrary Statistical Memory

This document provides a 3-paragraph summary of a technical research paper on information theory for single-user systems with arbitrary statistical memory. The paper introduces generalized information measures that can be used to analyze discrete-time single-user stochastic systems that are not necessarily stationary, ergodic or information stable. These measures include the information spectrum, quantiles, and extensions of entropy, mutual information and divergence. The paper will cover topics such as lossless and lossy data compression theorems, measures of randomness and resolvability, channel coding theorems, and multi-terminal information theory using these generalized information measures. The paper is intended as a sequel to the authors' previous textbook on introductory information theory. It

Uploaded by

Tao-Wei Huang
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 111

Information Theory for Single-User Systems

with Arbitrary Statistical Memory

by

Fady Alajaji† and Po-Ning Chen‡


Department of Mathematics & Statistics,
Queen’s University, Kingston, ON K7L 3N6, Canada
Email: [email protected]


Department of Electrical & Computer Engineering
National Chiao Tung University
1001, Ta Hsueh Road
Hsin Chu, Taiwan 300
Republic of China
Email: [email protected]

February 21, 2019



c Copyright by
Fady Alajaji† and Po-Ning Chen‡
February 21, 2019
Preface

The reliable transmission of information bearing signals over a noisy commu-


nication channel is at the heart of what we call communication. Information
theory—founded by Claude E. Shannon in 1948—provides a mathematical frame-
work for the theory of communication; it describes the fundamental limits to how
efficiently one can encode information and still be able to recover it with negli-
gible loss.
This book is a sequel to our previous introductory textbook on information
theory [2]. It covers advanced topics concerning the information theoretic limits
of discrete-time single-user stochastic systems with arbitrary statistical memory
(i.e., systems that are not necessarily stationary, ergodic or information stable).
The topics are studied using the information spectrum methods, and extensions
thereof, established by T.S. Han in his pioneering textbook [25]. What follows
is a tentative list of topics to be covered.
1. General information measure: Information spectrum and Quantile and
their properties.
2. Advanced topics of losslesss data compression: Fixed-length lossless data
compression theorem for arbitrary channels, variable-length lossless data
compression theorem for arbitrary channels.
3. Measure of randomness and resolvability: Resolvability and source coding,
approximation of output statistics for arbitrary channels.
4. Advanced topics of channel coding: Channel capacity for arbitrary single-
user channel, optimistic Shannon coding theorem, strong capacity, ε-capacity.
5. Advanced topics of lossy data compression
6. Hypothesis testing: Error exponent and divergence, large deviations the-
ory, Berry-Esseen theorem.
7. Channel reliability: Random coding exponent, expurgated exponent, par-
titioning exponent, sphere-packing exponent, the asymptotic largest min-
imum distance of block codes, Elias bound, Varshamov-Gilbert bound,
Bhattacharyya distance.

ii
8. Information theory of networks: Distributed detection, data compression
over distributed source, capacity of multiple access channels, degraded
broadcast channel, Gaussian multiple terminal channels.
The mathematical background on which these topics are based can be found
in Appendices A and B of [2].

Notation
For a discrete random variable X with alphabet X , we use PX to denote its
distribution. For convenience, we will use interchangeably the two expressions
for the probability of the elementary event [X = x]: Pr[X = x] and PX (x).
Similarly, the probability of a set characterized by an inequality, such as f (x) <
a, will be expressed by either
PX {x ∈ X : f (x) < a}
or
Pr [f (X) < a] .
In the second expression, we view f (X) as a new random variable defined through
X and a function f (·).
Obviously, the above expressions can be applied to any legitimate function
f (·) defined over X , including any probability function PX̂ (·) (or log PX̂ (x)) of a
random variable X̂. Therefore, the next two expressions denote the probability
of f (x) = PX̂ (x) < a evaluated under distribution PX :
PX {x ∈ X : f (x) < a} = PX {x ∈ X : PX̂ (x) < a}
and
Pr [f (X) < a] = Pr [PX̂ (X) < a] .
As a result, if we write
 
PX̂,Ŷ (x, y)
PX,Y (x, y) ∈ X × Y : log <a
PX̂ (x)PŶ (y)
 
PX̂,Ŷ (X, Y )
= Pr log <a ,
PX̂ (X)PŶ (Y )
we mean that we have defined a new function
PX̂,Ŷ (x, y)
f (x, y) := log
PX̂ (x)PŶ (y)
in terms of the joint distribution PX̂,Ŷ and its two marginal distributions, and
that we are interested in the probability of f (X, Y ) < a where X and Y have
joint distribution PX,Y .

iii
Notes to readers
In this book, all the assumptions, claims, conjectures, corollaries, definitions, ex-
amples, exercises, lemmas, observations, properties, and theorems are numbered
under the same counter for ease of their searching. For example, the lemma that
immediately follows Theorem 2.1 will be numbered as Lemma 2.2, instead of
Lemma 2.1.

iv
Acknowledgements

Thanks are given to our families for their full support during the period of
our writing the book.

v
Table of Contents

Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Notes to readers . . . . . . . . . . . . . . . . . . . . . . . . iv

Chapter Page

List of Tables viii

List of Figures ix

1 Generalized Information Measures for Arbitrary Systems with


Memory 1
1.1 Spectrums and quantiles . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Properties of quantiles . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Generalized information measures . . . . . . . . . . . . . . . . . . 8
1.4 Properties of generalized information measures . . . . . . . . . . . 11
1.5 Examples for the computation of general information measures . . 17

2 General Data Compression Theorems 23


2.1 Fixed-length data compression codes for arbitrary sources . . . . . 24
2.2 Generalized AEP theorem . . . . . . . . . . . . . . . . . . . . . . 30

3 Measure of Randomness for Stochastic Processes 34


3.1 Motivation for resolvability : measure of randomness of random
variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Notations and definitions regarding to resolvability . . . . . . . . 35
3.3 Operational meanings of resolvability and mean-resolvability . . . 39
3.4 Resolvability, mean-resolvability and source coding . . . . . . . . 47

4 Channel Coding Theorems and Approximations of Output Sta-


tistics for Arbitrary Channels 57
4.1 General channel models . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Channel coding and Feinstein’s lemma . . . . . . . . . . . . . . . 59
4.3 Error bounds for multihypothesis testing . . . . . . . . . . . . . . 63

vi
4.4 Capacity formulas for general channels . . . . . . . . . . . . . . . 70
4.5 Examples for the general capacity formulas . . . . . . . . . . . . . 73
4.6 Capacity and resolvability for channels . . . . . . . . . . . . . . . 77

5 Optimistic Shannon Coding Theorems for Arbitrary Single-User


Systems 82
5.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2 Optimistic source coding theorems . . . . . . . . . . . . . . . . . 83
5.3 Optimistic channel coding theorems . . . . . . . . . . . . . . . . . 88
5.4 Examples for the computations of capacity and strong capacity . . 91
5.4.1 Information stable channels . . . . . . . . . . . . . . . . . 91
5.4.2 Information unstable channels . . . . . . . . . . . . . . . . 92

vii
List of Tables

Number Page

1.1 Generalized entropy measures where δ ∈ [0, 1]. . . . . . . . . . . 10


1.2 Generalized mutual information measures where δ ∈ [0, 1]. . . . . 11
1.3 Generalized divergence measures where δ ∈ [0, 1]. . . . . . . . . . 12

viii
List of Figures

Number Page

1.1 The asymptotic CDFs (spectrums) of a sequence of random vari-


ables {An }∞n=1 and their quantiles: ū(·) = sup-spectrum of {An },
u(·) = inf-spectrum of {An }, U δ = quantile of ū(·), Ūδ = quantile
¯ ¯
of u(·), U = limδ↓0 U δ = U 0 , U 1− = limξ↑1 U ξ , Ū = limδ↑1 Ūδ . . . . 6
¯ ¯ ¯ ¯ ¯ ¯
1.2 The spectrum h(θ) for Example 1.8. . . . . . . . . . . . . . . . . . 20
¯
1.3 The spectrum ī(θ) for Example 1.8. . . . . . . . . . . . . . . . . . 20
1.4 The limiting spectrums of (1/n)hZ n (Z n ) for Example 1.9 . . . . . 21
1.5 The possible limiting spectrums of (1/n)iX n ,Y n (X n ; Y n ) for Ex-
ample 1.9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.1 Behavior of the probability of block decoding error as blocklength


n goes to infinity for an arbitrary source X. . . . . . . . . . . . . 30
2.2 Illustration of generalized AEP Theorem. Fn (δ; ε) := Tn [H̄ε (X) +
δ] \ Tn [H̄ε (X) − δ] is the dashed region. . . . . . . . . . . . . . . . 32

3.1 Source generator: {Xt }0<t<1 is an independent random process


with PXt (0) = t and PXt (1) = 1 − t, and is also independent of
the selector Z, where Xt is outputted if Z = t. Source generator
of each time instance is independent temporally. . . . . . . . . . . 54
3.2 The ultimate CDF of −(1/n) log PX n (X n ): Pr{hb (Z) ≤ t}. . . . . 56

4.1 The ultimate CDFs of −(1/n) log PN n (N n ). . . . . . . . . . . . . 75


4.2 The ultimate CDF of the normalized information density for Ex-
ample 4.18-Case B). . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3 The communication system. . . . . . . . . . . . . . . . . . . . . . 79
4.4 The simulated communication system. . . . . . . . . . . . . . . . 79

ix
Chapter 1

Generalized Information Measures for


Arbitrary Systems with Memory

The entropy of a discrete random variable X [2], defined by



H(X) := − PX (x) log PX (x) = EX [− log PX (X)] nats,
x∈X

is a measure of the average amount of uncertainty in X. An extened notion


of entropy for a sequence of random variables X1 , X2 , . . . , Xn , . . . is the entropy
rate, which is given by
1 1
lim H(X n ) = lim E [− log PX n (X n )] ,
n→∞ n n→∞ n

assuming the limit exists. The above quantities have an operational significance
established via Shannon’s source coding theorems when the stochastic systems
under consideration satisfy certain regularity conditions, such as stationarity and
ergodicity [24, 42]. However, in more complicated situations such as when the
systems have a time-varying nature and are non-stationary, these information
rates are no longer valid and lose their operational significance. This results in
the need to establish new entropy measures which appropriately characterize the
operational limits of arbitrary stochastic systems.
Let us begin with the model of an arbitrary system with memory. In gen-
eral, there are two indices for random variables or observations: a time in-
dex and a space index. When a sequence of random variables is denoted by
X1 , X2 , . . . , Xn , . . ., the subscript i of Xi can be treated as either a time index
or a space index, but not both. Hence, when a sequence of random variables is
a function of both time and space, the notation of X1 , X2 , . . . , Xn , . . ., is by no
means sufficient; and therefore, a new model for a general time-varying source,
such as
(n) (n) (n)
X1 , X2 , . . . , Xt , . . . ,

1
where t is the time index and n is the space or position index (or vice versa),
becomes significant.
When block-wise (fixed-length) compression of such source (with blocklength
n) is considered, the same question as for the compression of an i.i.d. source
arises:
what is the minimum compression rate (say in bits per
source sample) for which the probability of error can be (1.0.1)
made arbitrarily small as the blocklength goes to infinity?
To answer this question, information theorists have to find a sequence of data
compression codes for each blocklength n and investigate if the decompression
error probability goes to zero as n approaches infinity. However, unlike the simple
source models such as discrete memorylessness [2], the source being arbitrary
may exhibit distinct statistics for each blocklength n; e.g., for
(1)
n = 1 : X1
(2) (2)
n = 2 : X1 , X2
(3) (3) (3)
n = 3 : X1 , X2 , X3
(4) (4) (4) (4)
n = 4 : X1 , X2 , X3 , X4 (1.0.2)
..
.
(4) (1) (2) (3)
the statistics of X1 could be different from X1 , X1 and X1 (i.e., the source
statistics are not necessarily consistent). Since the model in question (1.0.1) is
general, and the system statistics can be arbitrarily defined, it is therefore named
an arbitrary system with memory.
The triangular array of random variables in (1.0.2) is denoted by a boldface
letter as
X := {X n }∞
n=1 ,

where X n := (X1 , X2 , . . . , Xn ); for convenience, we also write


(n) (n) (n)

   ∞
X := X n = X1 , X2 , . . . , Xn(n)
(n) (n)
.
n=1

In this chapter, we will first introduce a new concept on defining information


measures for arbitrary systems and then, analyze in detail their algebraic prop-
erties. In the next chapter, we will utilize the new measures to establish general
source coding theorems for arbitrary finite-alphabet sources.

1.1 Spectrums and quantiles

Definition 1.1 (Inf/sup-spectrum) If {An }∞ n=1 is a sequence of random vari-


ables, then its inf-spectrum u(·) and its sup-spectrum ū(·) are respectively defined
¯

2
by
u(θ) := lim inf Pr{An ≤ θ}
¯ n→∞
and
ū(θ) := lim sup Pr{An ≤ θ},
n→∞
respectively, where θ ∈ R.

In other words, u(·) and ū(·) are respectively the liminf and the limsup of
¯
the cumulative distribution function (CDF) of An . Note that by definition, the
CDF of An — Pr{An ≤ θ} — is non-decreasing and right-continuous. However,
for u(·) and ū(·), only the non-decreasing property remains.1
¯
Definition 1.2 (Quantile of inf/sup-spectrum) For any 0 ≤ δ ≤ 1, the
quantile U δ of the sup-spectrum ū(·) and the quantile Ūδ of the inf-spectrum u(·)
¯ ¯
are defined by2
U δ := sup{θ : ū(θ) ≤ δ}
¯
and
Ūδ := sup{θ : u(θ) ≤ δ},
¯
respectively.
It follows from the above definitions that U δ and Ūδ are right-continuous and
¯
non-decreasing in δ. Note that the supremum of an empty set is defined to be
−∞.

Observation 1.3 Generally speaking, one can define “quantile” in four different
ways. For example, for the quantile of the inf-spectrum, we can define:
Ūδ := sup{θ : lim inf Pr[An ≤ θ] ≤ δ}
n→∞
Ūδ− := sup{θ : lim inf Pr[An ≤ θ] < δ}
n→∞
Ūδ+ := sup{θ : lim inf Pr[An < θ] ≤ δ}
n→∞
Ūδ+− := sup{θ : lim inf Pr[An < θ] < δ}.
n→∞

1
It is pertinent to also point out that even if we do not require right-continuity as a funda-
mental property of a CDF, the spectrums u(·) and ū(·) are not necessarily legitimate CDFs of
¯
(conventional real-valued) random variables since there might exist cases where the “probabi-
lity mass escapes to infinity” (cf. [3, pp. 346]). A necessary and sufficient condition for u(·)
¯
and ū(·) to be conventional CDFs (without requiring right-continuity) is that the sequence
of distribution functions of An is tight [3, pp. 346]. Tightness is actually guaranteed if the
alphabet of An is finite.
2
Note that the usual definition of the quantile function φ(δ) of a non-decreasing function
F (·) is slightly different from our definition [3, pp. 190], where φ(δ) := sup{θ : F (θ) < δ}.
Remark that if F (·) is strictly increasing, then the quantile is nothing but the inverse of F (·):
φ(δ) = F −1 (δ).

3
The general relations between these four quantities are as follows:
Ūδ− = Ūδ+− ≤ Ūδ = Ūδ+ .
Obviously, Ūδ− ≤ Ūδ ≤ Ūδ+ and Ūδ− ≤ Ūδ+− by their definitions. It remains to
show that Ūδ+− ≤ Ūδ , that Ūδ+ ≤ Ūδ and that Ūδ+− ≤ Ūδ− .
Suppose Ūδ+− > Ūδ + γ for some γ > 0. Then by definition of Ūδ+− ,
lim inf Pr[An < Ūδ + γ] < δ,
n→∞

which implies that


lim inf Pr[An ≤ Ūδ + γ/2] ≤ lim inf Pr[An < Ūδ + γ] < δ ≤ δ
n→∞ n→∞

and violates the definition of Ūδ . This completes the proof of Ūδ+− ≤ Ūδ . To prove
that Ūδ+ ≤ Ūδ , note that from the definition of Ūδ+ , we have that for any ε > 0,
lim inf n→∞ Pr[An < Ūδ+ − ε] ≤ δ and hence lim inf n→∞ Pr[An ≤ Ūδ+ − 2ε] ≤ δ,
implying that Ūδ+ − 2ε ≤ Ūδ . The latter yields that Ūδ+ ≤ inf ε>0 [Ūδ+ − 2ε] = Ūδ ,
which completes the proof. Proving that Ūδ+− ≤ Ūδ− follows a similar argument.
It is worth noting that Ūδ− = limξ↑δ Ūξ . Their equality can be proved by
first observing that Ūδ− ≥ limξ↑δ Ūξ by their definitions, and then assuming that
γ := Ūδ− − limξ↑δ Ūξ > 0. Then Ūξ < Ūξ + γ/2 ≤ (Ūδ− ) − γ/2 implies that
u((Ūδ− ) − γ/2) > ξ for ξ arbitrarily close to δ from below, which in turn implies
¯
u((Ūδ− ) − γ/2) ≥ δ, contradicting to the definition of Ūδ− . Throughout, we will
¯
interchangeably use Ūδ− and limξ↑δ Ūδ for convenience.
The final note in this observation is that Ūδ+− and Ūδ+ will not be used in
defining our general information measures. They are introduced only for math-
ematical interests. 2

Based on the above definitions, the liminf in probability U of {An }∞n=1 [26],
¯
which is defined as the largest extended real number such that for all ξ > 0,
lim Pr[An ≤ U − ξ] = 0,
n→∞ ¯
satisfies3
U = lim U δ = U 0 .
¯ δ↓0 ¯ ¯
3
It is obvious from their definitions that
lim U δ ≥ U 0 ≥ U .
δ↓0 ¯ ¯ ¯
The equality of limδ↓0 U δ and U can be proved by contradiction by first assuming
¯ ¯
γ := lim U δ − U > 0.
δ↓0 ¯ ¯
Then ū(U + γ/2) ≤ δ for arbitrarily small δ > 0, which immediately implies ū(U + γ/2) = 0,
¯ ¯
contradicting to the definition of U .
¯

4
Also, the limsup in probability Ū (cf. [26]), defined as the smallest extended real
number such that for all ξ > 0,

lim Pr[An ≥ Ū + ξ] = 0,
n→∞

is exactly4
Ū = lim Ūδ = sup{θ : u(θ) < 1}.
δ↑1 ¯
It readily follows from the above definitions that

U ≤ U δ ≤ Ūδ ≤ Ū
¯ ¯
for δ ∈ [0, 1).
Remark that U δ and Ūδ always exist. Furthermore, if U δ = Ūδ for all δ
¯ ¯
in [0, 1], the sequence of random variables An converges in distribution to a
random variable A, provided the sequence of An is tight. Finally, for a better
understanding of the quantities defined above, we depict them in Figure 1.1.

1.2 Properties of quantiles

Lemma 1.4 Consider two random sequences, {An }∞ ∞


n=1 and {Bn }n=1 . Let ū(·)
and u(·) be respectively the sup- and inf-spectrums of {An }∞ n=1 . Similarly, let
¯
v̄(·) and v(·) denote respectively the sup- and inf-spectrums of {Bn }∞ n=1 . Define

U δ and Ūδ be the quantiles of the sup- and inf-spectrums of {An }n=1; also define
¯
Vδ and V̄δ be the quantiles of the sup- and inf-spectrums of {Bn }∞ n=1 .
Now let (u + v)(·) and (u + v)(·) denote the sup- and inf-spectrums of sum
sequence {An + Bn }∞ n=1 , i.e.,

(u + v)(θ) := lim sup Pr{An + Bn ≤ θ},


n→∞

4
Since 1 = limn→∞ Pr{An < Ū + ξ} ≤ limn→∞ Pr{An ≤ Ū + ξ} = u(Ū + ξ), it is straight-
¯
forward that
Ū ≥ sup{θ : u(θ) < 1} = lim Ūδ .
¯ δ↑1

The equality of Ū and limδ↑1 Ūδ can be proved by contraction by first assuming that

γ := Ū − lim Ūδ > 0.


δ↑1

Then 1 ≥ u(Ū −γ/2) > δ for δ arbitrarily close to 1, which implies u(Ū −γ/2) = 1. Accordingly,
¯ ¯
by
1 ≥ lim inf Pr{An < Ū − γ/4} ≥ lim inf Pr{An ≤ Ū − γ/2} = u(Ū − γ/2) = 1,
n→∞ n→∞ ¯
we obtain the desired contradiction.

5
6

ū(·) u(·)
¯
δ

0 -
U Ū0 Uδ Ūδ U 1− Ū
¯ ¯ ¯
Figure 1.1: The asymptotic CDFs (spectrums) of a sequence of random
variables {An }∞n=1 and their quantiles: ū(·) = sup-spectrum of {An },
u(·) = inf-spectrum of {An }, U δ = quantile of ū(·), Ūδ = quantile of
¯ ¯
u(·), U = limδ↓0 U δ = U 0 , U 1− = limξ↑1 U ξ , Ū = limδ↑1 Ūδ .
¯ ¯ ¯ ¯ ¯ ¯

and
(u + v)(θ) := lim inf Pr{An + Bn ≤ θ}.
n→∞

Again, define (U + V )δ and (U + V )δ be the quantiles with respect to (u + v)(·)


and (u + v)(·).
Then the following statements hold.

1. U δ and Ūδ are both non-decreasing and right-continuous functions of δ for


¯
δ ∈ [0, 1].

2. limδ↓0 U δ = U 0 and limδ↓0 Ūδ = Ū0 .


¯ ¯
3. For δ ≥ 0, γ ≥ 0, and δ + γ ≤ 1,

(U + V )δ+γ ≥ U δ + Vγ , (1.2.1)
¯
and
(U + V )δ+γ ≥ U δ + V̄γ . (1.2.2)
¯
4. For δ ≥ 0, γ ≥ 0, and δ + γ ≤ 1,

(U + V )δ ≤ U δ+γ + V̄(1−γ) , (1.2.3)


¯

6
and
(U + V )δ ≤ Ūδ+γ + V̄(1−γ) . (1.2.4)

Proof: The proof of Property 1 follows directly from the definitions of U δ and
¯
Ūδ and the fact that the inf-spectrum and the sup-spectrum are non-decreasing
in δ.
The proof of Property 2 can be proved by contradiction as follows. Suppose
limδ↓0 U δ > U 0 + ε for some ε > 0. Then for any δ > 0,
¯ ¯
ū(U 0 + ε/2) ≤ δ.
¯
Since the above inequality holds for every δ > 0, and ū(·) is a non-negative
function, we obtain ū(U 0 + ε/2) = 0, which contradicts to the definition of U 0 .
¯ ¯
We can prove limδ↓0 Ūδ = Ū0 in a similar fashion.
To show (1.2.1), we observe that for α > 0,

lim sup Pr{An + Bn ≤ U δ + Vγ − 2α}


n→∞ ¯
≤ lim sup (Pr{An ≤ U δ − α} + Pr{Bn ≤ Vγ − α})
n→∞ ¯
≤ lim sup Pr{An ≤ U δ − α} + lim sup Pr{Bn ≤ Vγ − α}
n→∞ ¯ n→∞
≤ δ + γ,

which, by definition of (U + V )δ+γ , yields

(U + V )δ+γ ≥ U δ + Vγ − 2α.
¯
The proof is completed by noting that α can be made arbitrarily small.
Similarly, we note that for α > 0,
lim inf Pr{An + Bn ≤ U δ + V̄γ − 2α}
n→∞ ¯
≤ lim inf Pr{An ≤ U δ − α} + Pr{Bn ≤ V̄γ − α}
n→∞ ¯
≤ lim sup Pr{An ≤ U δ − α} + lim inf Pr{Bn ≤ V̄γ − α}
n→∞ ¯ n→∞
≤ δ + γ,

which, by definition of (U + V )δ+γ and for arbitrarily small α, proves (1.2.2).


To show (1.2.3), we first observe that (1.2.3) trivially holds when γ = 1 (and
δ = 0). It remains to prove its validity under γ < 1. Remark from (1.2.1) that

(U + V )δ + (−V )γ ≤ (U + V − V )δ+γ = U δ+γ .


¯
Hence,
(U + V )δ ≤ U δ+γ − (−V )γ .
¯

7
(Note that the case γ = 1 is not allowed here because it results in U 1 = (−V )1 =
¯
∞, and the subtraction between two infinite terms is undefined. That is why we
need to exclude the case of γ = 1 for the subsequent proof.) The proof is then
completed by showing that
−(−V )γ ≤ V̄(1−γ) . (1.2.5)
By definition,
(−v)(θ) := lim sup Pr {−Bn ≤ θ}
n→∞
= 1 − lim inf Pr {Bn < −θ} .
n→∞
Then
V̄(1−γ) := sup{θ : v(θ) ≤ 1 − γ}
= sup{θ : lim inf Pr[Bn ≤ θ] ≤ 1 − γ}
n→∞
≥ sup{θ : lim inf Pr[Bn < θ] < 1 − γ} (cf. footnote 1)
n→∞
= sup{−θ : lim inf Pr[Bn < −θ] < 1 − γ}
n→∞
= sup{−θ : 1 − lim sup Pr[−Bn ≤ θ] < 1 − γ}
n→∞
= sup{−θ : lim sup Pr[−Bn ≤ θ] > γ}
n→∞
= − inf{θ : lim sup Pr[−Bn ≤ θ] > γ}
n→∞
= − sup{θ : lim sup Pr[−Bn ≤ θ] ≤ γ}
n→∞
= − sup{θ : (−v)(θ) ≤ γ}
= −(−V )γ .
Finally, to show (1.2.4), we again note that it is trivially true for γ = 1. We then
observe from (1.2.2) that (U + V )δ + (−V )γ ≤ (U + V − V )δ+γ = Ūδ+γ . Hence,
(U + V )δ ≤ Ūδ+γ − (−V )γ .
Using (1.2.5), we have the desired result. 2

1.3 Generalized information measures

In Definitions 1.1 and 1.2, if we let the random variable An equal the normalized
entropy density5
1 1
hX n (X n ) := − log PX n (X n )
n n
5
The random variable hX n (X n ) := − log PX n (X n ) is named the entropy density. Hence,
the normalized entropy density is equal to hX n (X n ) divided or normalized by the blocklength
n.

8
of an arbitrary source
   ∞
n (n) (n)
X = X = X1 , X2 , . . . , Xn
(n)
,
n=1

we obtain two generalized (spectral) entropy measures for X:

1
δ-inf-entropy rate Hδ (X) = quantile of the sup-spectrum of hX n (X n )
n
1
δ-sup-entropy rate H̄δ (X) = quantile of the inf-spectrum of hX n (X n ).
n

Note that the inf-entropy-rate H(X) and the sup-entropy-rate H̄(X) introduced
in [25, 26] are special cases of the δ-inf/sup-entropy rate measures:

H(X) = H0 (X), and H̄(X) = lim H̄δ (X).


δ↑1

Conceptually, we may imagine that if the random variable (1/n)h(X n ) ex-


hibits a limiting distribution, then the sup-entropy rate is the right-margin of
the support of that limiting distribution. For example, suppose that the limiting
distribution of (1/n)hX n (X n ) is positive over (−2, 2), and zero, otherwise. Then
H̄(X) = 2. Similarly, the inf-entropy rate is the left margin of the support
of the limiting random variable lim supn→∞ (1/n)hX n (X n ), which is −2 for the
same example.
Analogously, for an arbitrary channel
   ∞
n (n)
W = (Y |X) = W = W1 , . . . , Wn (n)
,
n=1

or more specifically, PW = PY |X = {PY n |X n }∞


n=1 with input X and output Y , if
we replace An in Definitions 1.1 and 1.2 by the channel’s normalized information
density

1 1 1 PX n ,Y n (X n , Y n )
iX n W n (X n ; Y n ) = iX n ,Y n (X n ; Y n ) := log ,
n n n PX n (X n )PY n (Y n )

we get the δ-inf/sup-information rates, denoted respectively by I δ (X; Y ) and


¯
I¯δ (X; Y ), as:

1
I δ (X; Y ) = quantile of the sup-spectrum of iX n W n (X n ; Y n )
¯ n
1
I¯δ (X; Y ) = quantile of the inf-spectrum of iX n W n (X n ; Y n ).
n

9
Entropy Measures

system arbitrary source X


1 1
norm. entropy density hX n (X n ) := − log PX n (X n )
n n 
1 n
entropy sup-spectrum h̄(θ) := lim sup Pr hX n (X ) ≤ θ
n→∞ n
 
1 n
entropy inf-spectrum h(θ) := lim inf Pr hX n (X ) ≤ θ
¯ n→∞ n
δ-inf-entropy rate Hδ (X) := sup{θ : h̄(θ) ≤ δ}

δ-sup-entropy rate H̄δ (X) := sup{θ : h(θ) ≤ δ}


¯
sup-entropy rate H̄(X) := limδ↑1 H̄δ (X)

inf-entropy rate H(X) := H0 (X)

Table 1.1: Generalized entropy measures where δ ∈ [0, 1].

Similarly, for a simple hypothesis testing system with arbitrary observation


statistics for each hypothesis,

H 0 : PX
H1 : PX̂

we can replace An in Definitions 1.1 and 1.2 by the normalized log-likelihood ratio

1 1 PX n (X n )
dX n (X n X̂ n ) := log
n n PX̂ n (X n )
to obtain the δ-inf/sup-divergence rates, denoted respectively by Dδ (XX̂) and
D̄δ (XX̂), as

1
Dδ (XX̂) = quantile of the sup-spectrum of dX n (X n X̂ n )
n
1
D̄δ (XX̂) = quantile of the inf-spectrum of dX n (X n X̂ n ).
n
The above information measure definitions are summarized in Tables 1.1–1.3.

10
Mutual Information Measures
arbitrary channel PW = PY |X with input
system
X and output Y
1
norm. information density iX n W n (X n ; Y n )
n
1 PX n ,Y n (X n , Y n )
:= log
n PX n (X n ) × PY n (Y n )
 
1 n n
information sup-spectrum ī(θ) := lim sup Pr iX n W n (X ; Y ) ≤ θ
n→∞ n
 
1 n n
information inf-spectrum i(θ) := lim inf Pr iX n W n (X ; Y ) ≤ θ
n→∞ n
δ-inf-information rate I δ (X; Y ) := sup{θ : ī(θ) ≤ δ}
¯
δ-sup-information rate I¯δ (X; Y ) := sup{θ : i(θ) ≤ δ}

sup-information rate ¯
I(X; Y ) := limδ↑1 I¯δ (X; Y )

inf-information rate I (X; Y ) := I 0 (X; Y )


¯ ¯
Table 1.2: Generalized mutual information measures where δ ∈ [0, 1].

1.4 Properties of generalized information measures

In this section, we present the properties of the general (spectral) information


measures defined in the previous section. We begin with the generalization
of the simple property between mutual information and entropy: I(X; Y ) =
H(Y ) − H(Y |X).
By taking δ = 0 and letting γ ↓ 0 in (1.2.1) and (1.2.3), we obtain
(U + V ) ≥ U 0 + lim Vγ ≥ U + V
¯ γ↓0 ¯
and
(U + V ) ≤ lim U γ + lim V̄(1−γ) = U + V̄ ,
γ↓0 ¯ γ↓0 ¯
which mean that the liminf in probability of a sequence of random variables
An + Bn is upper bounded by the liminf in probability of An plus the limsup in
probability of Bn , and is lower bounded by the sum of the liminfs in probability
of An and Bn . This fact is used in [42] to show that
I (X; Y ) +H(Y |X) ≤H(Y ) ≤ I (X; Y ) + H̄(Y |X),
¯ ¯

11
Divergence Measures

system arbitrary sources X and X̂


1 1 PX n (X n )
norm. log-likelihood ratio dX n (X n X̂ n ) := log
n n PX̂ n (X n )
 
¯ 1 n n
divergence sup-spectrum d(θ) := lim sup Pr dX n (X X̂ ) ≤ θ
n→∞ n
 
1 n n
divergence inf-spectrum d(θ) := lim inf Pr dX n (X X̂ ) ≤ θ
n→∞ n
δ-inf-divergence rate ¯ ≤ δ}
Dδ (XX̂) := sup{θ : d(θ)

δ-sup-divergence rate D̄δ (XX̂) := sup{θ : d(θ) ≤ δ}

sup-divergence rate D̄(XX̂) := limδ↑1 D̄δ (XX̂)

inf-divergence rate D(XX̂) := D0 (XX̂)

Table 1.3: Generalized divergence measures where δ ∈ [0, 1].

or equivalently,

H(Y ) − H̄(Y |X) ≤ I (X; Y ) ≤H(Y ) −H(Y |X).


¯
Other properties of the generalized information measures are summarized in
the next lemma.

Lemma 1.5 For a source X with finite alphabet X and arbitrary sources Y
and Z, the following properties hold.

1. H̄δ (X) ≥ 0 for δ ∈ [0, 1].


(This property also applies to Hδ (X), I¯δ (X; Y ), I δ (X; Y ), D̄δ (XX̂),
¯
and Dδ (XX̂).)
2. I δ (X; Y ) = I δ (Y ; X) and I¯δ (X; Y ) = I¯δ (Y ; X) for δ ∈ [0, 1].
¯ ¯
3. For 0 ≤ δ < 1, 0 ≤ γ < 1 and δ + γ ≤ 1,

I δ (X; Y ) ≤Hδ+γ (Y ) −Hγ (Y |X), (1.4.1)


¯
I δ (X; Y ) ≤ H̄δ+γ (Y ) − H̄γ (Y |X), (1.4.2)
¯

12
I¯γ (X; Y ) ≤ H̄δ+γ (Y ) −Hδ (Y |X), (1.4.3)
I δ+γ (X; Y ) ≥Hδ (Y ) − H̄(1−γ) (Y |X), (1.4.4)
¯
and
I¯δ+γ (X; Y ) ≥ H̄δ (Y ) − H̄(1−γ) (Y |X). (1.4.5)
(Note that the case of (δ, γ) = (1, 0) holds for (1.4.1) and (1.4.2), and the
case of (δ, γ) = (0, 1) holds for (1.4.3), (1.4.4) and (1.4.5).)
(n)
4. 0 ≤Hδ (X) ≤ H̄δ (X) ≤ log |X | for δ ∈ [0, 1), where each Xi takes values
in X for i = 1, . . . , n and n = 1, 2, . . ..

5. I δ (X, Y ; Z) ≥ I δ (X; Z) for δ ∈ [0, 1].


¯ ¯
Proof: Property 1 holds because
 
1 n
Pr − log PX n (X ) < 0 = 0,
n
   
1 PX n (X n ) n n 1 PX n (xn )
Pr log < −ν = PX n x ∈ X : log < −ν
n PX̂ n (X n ) n PX̂ n (xn )

= PX n (xn )
xn ∈X n : PX n (xn )<PX̂ n (xn )e−nν }

≤ PX̂ n (xn )e−nν
xn ∈X n : PX n (xn )<PX̂ n (xn )enν

≤ e−nν · PX̂ n (xn )
xn ∈X n : PX n (xn )<PX̂ n (xn )enν

≤ e−νn , (1.4.6)

and, by following the same procedure as (1.4.6),


 
1 PX n ,Y n (X n , Y n )
Pr log < −ν ≤ e−νn .
n PX n (X n )PY n (Y n )

Property 2 is an immediate consequence of the definition.


To show the inequalities in Property 3, we first remark that
1 1 1
hY n (Y n ) = iX n ,Y n (X n ; Y n ) + hX n ,Y n (Y n |X n ),
n n n
where
1 1
hX n ,Y n (Y n |X n ) := − log PY n |X n (Y n |X n ).
n n

13
With this fact and for 0 ≤ δ < 1, 0 < γ < 1 and δ + γ ≤ 1, (1.4.1) follows
directly from (1.2.1); (1.4.2) and (1.4.3) follow from (1.2.2); (1.4.4) follows from
(1.2.3); and (1.4.5) follows from (1.2.4). (Note that in (1.4.1), if δ = 0 and γ = 1,
then the right-hand-side would be a difference between two infinite terms, which
is undefined; hence, such case is therefore excluded by the condition that γ < 1.
For the same reason, we also exclude the cases of γ = 0 and δ = 1.) Now when
0 ≤ δ < 1 and γ = 0, we can confirm the validity of (1.4.1), (1.4.2) and (1.4.3),
again, by (1.2.1) and (1.2.2), and also examine the validity of (1.4.4) and (1.4.5)
by directly taking these values inside. The validity of (1.4.1) and (1.4.2) for
(δ, γ) = (1, 0), and the validity of (1.4.3), (1.4.4) and (1.4.5) for (δ, γ) = (0, 1)
can be checked by directly replacing δ and γ with the respective numbers.
Property 4 follows from the facts that H̄δ (·) is non-decreasing in δ, H̄δ (X) ≤
H̄(X), and H̄(X) ≤ log |X |. The last inequality can be proved as follows.
 
1 n
Pr hX n (X ) ≤ log |X | + ν
n
 
n n 1 PX n (X n )
= 1 − PX n x ∈ X : log < −ν
n 1/|X |n
≥ 1 − e−nν ,
where the last step can be obtained by using the same procedure as (1.4.6).
Therefore, h(log |X |+ν) = 1 for any ν > 0, which indicates that H̄(X) ≤ log |X |.
¯
Property 5 can be proved using the fact that
1 1 1
iX n ,Y n ,Z n (X n , Y n ; Z n ) = iX n ,Z n (X n ; Z n ) + iX n ,Y n ,Z n (Y n ; Z n |X n ).
n n n
By applying (1.2.1) with γ = 0, and observing I (Y ; Z|X) ≥ 0, we obtain the
¯
desired result. 2

Lemma 1.6 (Data processing lemma) Fix δ ∈ [0, 1]. Consider arbitrary
sources X 1 , X 2 and X 3 such that for every n, X1n and X3n are conditionally
independent given X2n . Then
I δ (X 1 ; X 3 ) ≤ I δ (X 1 ; X 2 ).
¯ ¯
Proof: By Property 5 of Lemma 1.5, we get
I δ (X 1 ; X 3 ) ≤ I δ (X 1 ; X 2 , X 3 ) = I δ (X 1 ; X 2 ),
¯ ¯ ¯
where the equality holds because
1 PX1n ,X2n ,X3n (xn1 , xn2 , xn3 ) 1 PX1n ,X2n (xn1 , xn2 )
log = log .
n PX1n (xn1 )PX2n ,X3n (xn2 , xn3 ) n PX1n (xn1 )PX2n (xn2 )
2

14
Lemma 1.7 (Optimality of independent inputs) Fix δ ∈ [0, 1). Consider
a channel with finite input and output alphabets, X nand Y, respectively, and
n n n n
with distribution PW (y |x ) = PY |X (y |x ) = i=1 PYi |Xi (yi |xi ) for all n,
n n n

xn ∈ X n and y n ∈ Y n . For any input X and its corresponding output Y ,

I δ (X; Y ) ≤ I δ (X̄; Ȳ ) = I (X̄; Ȳ ),


¯ ¯ ¯
where Ȳ is the output due to X̄, which is annindependent process with the same
n
first-order statistics as X, i.e., PX n (x ) = i=1 PXi (xi ).

Proof: First, we observe that

1 PW n (y n |xn ) 1 PY n (y n ) 1 PW n (y n |xn )
log + log = log .
n PY n (y n ) n PY n (y n ) n PY n (y n )

In other words,

1 PX n W n (xn , y n) 1 PY n (y n ) 1 PX n W n (xn , y n )
log + log = log ,
n PX n (xn )PY n (y n ) n PY n (y n ) n PX n (xn )PY n (y n )

where PX n W n := PX n ,Y n = PY n |X n PX n is the joint input-output distribution


resulting from sending input X n over the channel W n and receiving Y n as output
(a similar definition holds for PX n W n ). By evaluating the above terms under
PX n W n = PX n ,Y n and defining
 
n n n n 1 PX n W n (xn , y n )
z̄(θ) := lim sup PX n W n (x , y ) ∈ X × Y : log ≤θ
n→∞ n PX n (xn )PY n (y n )

and
Zδ (X̄; Ȳ ) := sup{θ : z̄(θ) ≤ δ},
we obtain from (1.2.1) (with γ = 0) that

Zδ (X̄; Ȳ ) ≥ I δ (X; Y ) + D(Y Ȳ ) ≥ I δ (X; Y ),


¯ ¯
since D(Y Ȳ ) ≥ 0 by Property 1 of Lemma 1.5.
Now since X̄ is a source with independent random variables and with iden-
tical marginal (first-order) statistics as X, we have that the induced Ȳ is also

15
an independent process and has the same marginal statistics6 as Y . Hence,
 n 
1    
 PX i Wi (Xi , Yi ) PX i Wi (Xi , Yi ) 
Pr  log − EXi Wi log >γ
n P X (X i )P Y (Y i ) P X (X i )P Y (Y i ) 
i=1 i i i i
 n 
1    
 PXi Wi (Xi , Yi ) PXi Wi (Xi , Yi ) 
= Pr  log − EXi Wi log >γ
n PXi (Xi )PYi (Yi ) PXi (Xi )PYi (Yi ) 
i=1
→ 0,

for any γ > 0, where the convergence to zero follows from Chebyshev’s inequality
and the finiteness of the channel alphabets (or more directly, the finiteness of
individual variances). Consequently, z̄(θ) = 1 for
 
1 1
n n
PXi Wi (Xi , Yi )
θ > lim inf EXi Wi log = lim inf I(Xi ; Yi );
n→∞ n PXi (Xi )PYi (Yi ) n→∞ n
i=1 i=1
n
and z̄(θ) = 0 for θ < lim inf n→∞ (1/n) i=1 I(Xi ; Yi ), which implies

1
n
Z(X̄; Ȳ ) = Zδ (X̄; Ȳ ) = lim inf I(Xi ; Yi )
n→∞ n
i=1

for any δ ∈ [0, 1). Similarly, we can show that

1
n
I (X̄; Ȳ ) = I δ (X̄; Ȳ ) = lim inf I(Xi ; Yi )
¯ ¯ n→∞ n
i=1

6
The claim can be justified as follows. First, for y1 ∈ Y,
 
PY1 (y1 ) = PX n (xn )PW n (y n |xn )
y2n ∈Y n−1 xn ∈X n
  
= PX1 (x1 )PW1 (y1 |x1 ) PX2n |X1 (xn2 |x1 )PW2n (y2n |xn2 )
x1 ∈X y2n ∈Y n−1 xn
2 ∈X
n−1

  
= PX1 (x1 )PW1 (y1 |x1 ) PX2n W2n |X1 (xn2 , y2n |x1 )
x1 ∈X y2n ∈Y n−1 xn
2 ∈X
n−1


= PX1 (x1 )PW1 (y1 |x1 ),
x1 ∈X

where xn2 := (x2 , · · · , xn ), y2n := (y2 , · · · , yn ) and W2n is the channel law between input and
output tuples from time 2 to n. It can be similarly shown that for any time index 1 ≤ i ≤ n,

PYi (yi ) = PXi (xi )PWi (yi |xi )
xi ∈X
n
for xi ∈ X and yi ∈ Y. Hence, for a channel with PW n (y n |xn ) = i=1 PWi (yi |xi ), the output
marginal only depends on the respective input marginal.

16
for any δ ∈ [0, 1). Accordingly,

I (X̄; Ȳ ) = Z(X̄; Ȳ ) ≥ I δ (X; Y ).


¯ ¯
2

1.5 Examples for the computation of general information


measures

Consider a channel with binary input and output alphabets; i.e., X = Y = {0, 1},
and let every channel output be given by
(n) (n) (n)
Yi = Xi ⊕ Zi

where ⊕ represents addition modulo-2, and Z is an arbitrary binary noise pro-


cess, independent of input X. Assume that X is a Bernoulli uniform input,
i.e., an i.i.d. random process with uniform marginal distribution. Then it can
be readily seen that the resultant Y is also Bernoulli uniform, no matter what
distribution Z has.
To compute I ε (X; Y ) for ε ∈ [0, 1) (I 1 (X; Y ) = ∞ is known), we use the
¯ ¯
results of Property 3 in Lemma 1.5:

I ε (X; Y ) ≥H0 (Y ) − H̄(1−ε) (Y |X), (1.5.1)


¯
and
I ε (X; Y ) ≤ H̄ε+γ (Y ) − H̄γ (Y |X). (1.5.2)
¯
where 0 ≤ ε < 1, 0 ≤ γ < 1 and ε + γ ≤ 1. Note that the lower bound in (1.5.1)
and the upper bound in (1.5.2) are respectively equal to −∞ and ∞ for ε = 0
and ε + γ = 1, which become trivial bounds; hence, we further restrict ε > 0
and ε + γ < 1, respectively, for the lower and upper bounds.
Thus for ε ∈ [0, 1),

I ε (X; Y ) ≤ inf H̄ε+γ (Y ) − H̄γ (Y |X) .


¯ 0≤γ<1−ε

Given that the channel noise is additive and independent from the channel input,
H̄γ (Y |X) = H̄γ (Z), which is independent of X. Hence,

I ε (X; Y ) ≤ inf H̄ε+γ (Y ) − H̄γ (Z)


¯ 0≤γ<1−ε

≤ inf log(2) − H̄γ (Z) ,


0≤γ<1−ε

17
where the last step follows from Property 4 of Lemma 1.5. Since log(2) − H̄γ (Z)
is non-increasing in γ,

I ε (X; Y ) ≤ log(2) − lim H̄γ (Z).


¯ γ↑(1−ε)

On the other hand, we can derive the lower bound to I ε (X; Y ) in (1.5.1) by
¯
the fact that Y is Bernoulli uniform. We thus obtain that for ε ∈ (0, 1],

I ε (X; Y ) ≥ log(2) − H̄(1−ε) (Z),


¯
and

I 0 (X; Y ) = lim I ε (X; Y ) ≥ log(2) − lim H̄γ (Z) = log(2) − H̄(Z).


¯ ε↓0 ¯ γ↑1

To summarize,

log(2) − H̄(1−ε) (Z) ≤ I ε (X; Y ) ≤ log(2) − lim H̄γ (Z) for ε ∈ (0, 1)
¯ γ↑(1−ε)

and
I (X; Y ) = I 0 (X; Y ) = log(2) − H̄(Z).
¯ ¯
An alternative method to compute I ε (X; Y ) is to derive its corresponding
¯
sup-spectrum in terms of the inf-spectrum of the noise process. Under the equally
likely Bernoulli input X, we can write
 
1 PY n |X n (Y n |X n )
ī(θ) := lim sup Pr log ≤θ
n→∞ n PY n (Y n )
 
1 n 1 n
= lim sup Pr log PZ n (Z ) − log PY n (Y ) ≤ θ
n→∞ n n
 
1 n
= lim sup Pr log PZ n (Z ) ≤ θ − log(2)
n→∞ n
 
1 n
= lim sup Pr − log PZ n (Z ) ≥ log(2) − θ
n→∞ n
 
1 n
= 1 − lim inf Pr − log PZ n (Z ) < log(2) − θ .
n→∞ n

18
Hence, for ε ∈ (0, 1),

I ε (X; Y ) = sup {θ : ī(θ) ≤ ε}


¯    
1 n
= sup θ : 1 − lim inf Pr − log PZ n (Z ) < log(2) − θ ≤ ε
n→∞ n
   
1 n
= sup θ : lim inf Pr − log PZ n (Z ) < log(2) − θ ≥ 1 − ε
n→∞ n
   
1 n
= sup (log(2) − β) : lim inf Pr − log PZ n (Z ) < β ≥ 1 − ε
n→∞ n
   
1 n
= log(2) + sup −β : lim inf Pr − log PZ n (Z ) < β ≥ 1 − ε
n→∞ n
   
1 n
= log(2) − inf β : lim inf Pr − log PZ n (Z ) < β ≥ 1 − ε
n→∞ n
   
1 n
= log(2) − sup β : lim inf Pr − log PZ n (Z ) < β < 1 − ε
n→∞ n
   
1 n
≤ log(2) − sup β : lim inf Pr − log PZ n (Z ) ≤ β < 1 − ε
n→∞ n
= log(2) − lim H̄δ (Z).
δ↑(1−ε)

Also, for ε ∈ (0, 1),


   
1 PX n ,Y n (X n , Y n )
I ε (X; Y ) ≥ sup θ : lim sup Pr log < θ < ε (1.5.3)
¯ n→∞ n PX n (X n )PY n (Y n )
   
1 n
= log(2) − sup β : lim inf Pr − log PZ n (Z ) ≤ β ≤ 1 − ε
n→∞ n
= log(2) − H̄(1−ε) (Z),

where (1.5.3) follows from the fact described in Footnote 1.3. Therefore,

log(2) − H̄(1−ε) (Z) ≤ I ε (X; Y ) ≤ log(2) − lim H̄γ (Z) for ε ∈ (0, 1).
¯ γ↑(1−ε)

By taking ε ↓ 0, we obtain

I (X; Y ) = I 0 (X; Y ) = log(2) − H̄(Z).


¯ ¯
Based on this result, we can now compute I ε (X; Y ) for some specific exam-
¯
ples.

Example 1.8 Let Z be an all-zero sequence with probability β and Bernoulli(p)


with probability 1 − β, where Bernoulli(p) represents a binary Bernoulli process

19
6

1 t

β t d

0 d -
0 hb (p)

Figure 1.2: The spectrum h(θ) for Example 1.8.


¯
6

1 t

1−β t d

0 d -
1 − hb (p) 1

Figure 1.3: The spectrum ī(θ) for Example 1.8.

with any of its random variables equaling one with probability p. Then as
n → ∞, the sequence of random variables (1/n)hZ n (Z n ) converges to 0 and hb (p)
with respective masses β and 1 − β, where hb (p) := −p log p − (1 − p) log(1 − p) is
the binary entropy function. The resulting h(θ) is depicted in Figure 1.2. From
¯
(1.5.3), we obtain ī(θ) as shown in Figure 1.3.
Therefore,

1 − hb (p), if 0 < ε < 1 − β;
I ε (X; Y ) =
¯ 1, if 1 − β ≤ ε < 1.

Example 1.9 If    ∞
Z = Z n = Z1 , . . . , Zn(n)
(n)
n=1
is a non-stationary binary independent sequence with
 
(n) (n)
Pr Zi = 0 = 1 − Pr Zi = 1 = pi ,

then by the uniform boundedness (in i) of the variance of random variable

20
 
(n)
− log PZ (n) Zi , namely
i

     2 
(n) (n)
Var − log PZ (n) Zi ≤ E log PZ (n) Zi
i i
 
≤ sup pi (log pi )2 + (1 − pi )(log(1 − pi ))2
0<pi <1
< log(2),

we have (by Chebyshev’s inequality) that as n → ∞,


  
 1 n  
 1 
Pr − log PZ n (Z n ) −
(n)
H Zi  > γ → 0,
 n n i=1 

for any γ > 0. Therefore, H̄(1−ε) (Z) is equal to

1   (n)  1
n n
H̄(1−ε) (Z) = H̄(Z) = lim sup H Zi = lim sup hb (pi )
n→∞ n i=1 n→∞ n
i=1

for ε ∈ (0, 1], and infinity for ε = 0, where hb (pi ) = −pi log(pi )−(1−pi ) log(1−pi ).
Consequently,

 n
1 − H̄(Z) = 1 − lim sup 1 hb (pi ), for ε ∈ [0, 1),
I ε (X; Y ) = n→∞ n
¯ 

i=1
∞, for ε = 1.

This result is illustrated in Figures 1.4 and 1.5.

clustering points

···

-
H(Z) H̄(Z)

Figure 1.4: The limiting spectrums of (1/n)hZ n (Z n ) for Example 1.9

21
clustering points

···

-
log(2) − H̄(Z) log(2) −H(Z)

Figure 1.5: The possible limiting spectrums of (1/n)iX n ,Y n (X n ; Y n ) for


Example 1.9.

22
Chapter 2

General Data Compression Theorems

It is known that the entropy rate


1
lim H(X n )
n→∞ n

is the minimum data compression rate (nats per source symbol) for arbitrarily
small data compression error for block coding of the stationary-ergodic source
[2]. For a more complicated situations where the sources become non-stationary,
the quantity limn→∞ (1/n)H(X n ) may not exist, and can no longer be used to
characterize the source compression. This results in the need to establish a
new entropy measure which appropriately characterizes the operational limits of
arbitrary stochastic systems, which was done in the previous chapter.
The role of a source code is to represent the output of a source efficiently.
Specifically, a source code design is to minimize the source description rate of
the code subject to a fidelity criterion constraint. One commonly used fidelity
criterion constraint is to place an upper bound on the probability of decoding
error Pe . If Pe is made arbitrarily small, we obtain a traditional (almost) error-
free source coding system.1 Lossy data compression codes are a larger class
of codes in the sense that the fidelity criterion used in the coding scheme is a
general distortion measure. In this chapter, we only demonstrate the bounded-
error data compression theorems for arbitrary (not necessarily stationary ergodic,
information stable, etc.) sources. The general lossy data compression theorems
will be introduced in subsequent chapters.
1
Recall that only for variable-length codes, a complete error-free data compression is re-
quired. A lossless data compression block codes only dictates that the compression error can
be made arbitrarily small, or asymptotically error-free.

23
2.1 Fixed-length data compression codes for arbitrary
sources

Equipped with the general information measures, we herein demonstrate a gen-


eralized Asymptotic Equipartition Property (AEP) Theorem and establish ex-
pressions for the minimum ε-achievable (fixed-length) coding rate of an arbitrary
source X.
Here, we have made an implicit assumption in the following derivation, which
is the source alphabet X is finite.2

Definition 2.1 (cf. Definition 3.2 and its associated footnote in [2]) An
(n, M) block code for data compression is a set

∼Cn := {c1 , c2 , . . . , cM }

consisting of M sourcewords3 of blocklength n (and a binary-indexing codeword


for each sourceword ci ); each sourceword represents a group of source symbols
of length n.

Definition 2.2 Fix ε ∈ [0, 1]. R is an ε-achievable data compression rate for
a source X if there exists a sequence of block data compression codes {∼Cn =
(n, Mn )}∞
n=1 with
1
lim sup log Mn ≤ R,
n→∞ n

and
lim sup Pe (∼Cn ) ≤ ε,
n→∞
n
where Pe (∼Cn ) := Pr (X ∈
/ ∼Cn ) is the probability of decoding error.
The infimum of all ε-achievable data compression rate for X is denoted by
Tε (X).
2
Actually, the theorems introduced also apply for sources with countable alphabets. We
assume finite alphabets in order to avoid uninteresting cases (such as H̄ε (X) = ∞) that might
arise with countable alphabets.
3
In [2, Def. 3.2], the (n, M ) block data compression code is defined by M codewords,
where each codeword represents a group of sourcewords of length n. However, we can actually
pick up one source symbol from each group, and equivalently define the code using these M
representative sourcewords. Later, it will be shown that this viewpoint facilitates the proving
of the general source coding theorem.

24
Lemma 2.3 (Lemma 1.5 in [25]) Fix a positive integer n. There exists an
(n, Mn ) source block code ∼Cn for PX n such that its error probability satisfies
 
1 n 1
Pe (∼Cn ) ≤ Pr hX n (X ) > log Mn .
n n

Proof: Observe that



1 ≥ PX n (xn )
{xn ∈X n : (1/n)hX n (xn )≤(1/n) log Mn }
 1

Mn
{xn ∈X n : (1/n)hX n (xn )≤(1/n) log Mn }
 
 n 1 1  1
≥ {x ∈ X : hX n (x ) ≤ log Mn }
n n
.
n n Mn

Therefore, |{xn ∈ X n : (1/n)hX n (xn ) ≤ (1/n) log Mn }| ≤ Mn . We can then


choose a code
 
n n 1 n 1
∼Cn ⊃ x ∈ X : hX n (x ) ≤ log Mn
n n

with |∼Cn | = Mn and


 
1 n 1
Pe (∼Cn ) = 1 − PX n {∼Cn } ≤ Pr hX n (X ) > log Mn .
n n
2

Lemma 2.4 (Lemma 1.6 in [25]) Every (n, Mn ) source block code ∼Cn for
PX n satisfies
 
1 n 1
Pe (∼Cn ) ≥ Pr hX n (X ) > log Mn + γ − exp{−nγ},
n n

for every γ > 0.

Proof: It suffices to prove that


 
n 1 n 1
1 − Pe (∼Cn ) = Pr {X ∈ ∼Cn } < Pr hX n (X ) ≤ log Mn + γ + exp{−nγ}.
n n

25
Clearly,
 
n n 1 n 1
Pr {X ∈ ∼Cn } = Pr X ∈ ∼Cn and hX n (X ) ≤ log Mn + γ
n n
 
n 1 n 1
+ Pr X ∈ ∼Cn and hX n (X ) > log Mn + γ
n n
 
1 1
≤ Pr hX n (X n ) ≤ log Mn + γ
n n
 
n 1 n 1
+ Pr X ∈ ∼Cn and hX n (X ) > log Mn + γ
n n
 
1 1
= Pr hX n (X n ) ≤ log Mn + γ
n n
  
n 1 n 1
+ PX n (x ) · 1 hX n (x ) > log Mn + γ
n n
x ∈∼
n Cn
 
1 n 1
= Pr hX n (X ) ≤ log Mn + γ
n n
  
n n 1
+ PX n (x ) · 1 PX n (x ) < exp{−nγ}
Mn
xn ∈ ∼
C
 n 
1 n 1 1
< Pr hX n (X ) ≤ log Mn + γ + |∼Cn | exp{−nγ}
n n Mn
 
1 n 1
= Pr hX n (X ) ≤ log Mn + γ + exp{−nγ}.
n n
2
We now apply Lemmas 2.3 and 2.4 to prove a general source coding theorems
for block codes.

Theorem 2.5 (general source coding theorem) For any source X,



 lim H̄δ (X), for ε ∈ [0, 1);
Tε (X) = δ↑(1−ε)
0, for ε = 1.

Proof: The case of ε = 1 follows directly from its definition; hence, the proof
only focuses on the case of ε ∈ [0, 1).

1. Forward part (achievability): Tε (X) ≤ limδ↑(1−ε) H̄δ (X)


We need to prove the existence of a sequence of block codes {∼Cn =
(n, Mn )}n≥1 such that for every γ > 0,
1
lim sup log Mn ≤ lim H̄δ (X) + γ and lim sup Pe (∼Cn ) ≤ ε.
n→∞ n δ↑(1−ε) n→∞

26
Lemma 2.3 ensures the existence (for any γ > 0) of a source block code
∼Cn = (n, Mn = exp{n(limδ↑(1−ε) H̄δ (X) + γ)} ) with error probability
 
1 n 1
Pe (∼Cn ) ≤ Pr hX n (X ) > log Mn
n n
 
1 n
≤ Pr hX n (X ) > lim H̄δ (X) + γ .
n δ↑(1−ε)

Therefore,
 
1 n
lim sup Pe (∼Cn ) ≤ lim sup Pr hX n (X ) > lim H̄δ (X) + γ
n→∞ n→∞ n δ↑(1−ε)
 
1 n
= 1 − lim inf Pr hX n (X ) ≤ lim H̄δ (X) + γ
n→∞ n δ↑(1−ε)

≤ 1 − (1 − ε) = ε,
where the last inequality follows from
   
1 n
lim H̄δ (X) = sup θ : lim inf Pr hX n (X ) ≤ θ < 1 − ε . (2.1.1)
δ↑(1−ε) n→∞ n

2. Converse part: Tε (X) ≥ limδ↑(1−ε) H̄δ (X)


Assume without loss of generality that limδ↑(1−ε) H̄δ (X) > 0. We will prove
the converse by contradiction. Suppose that Tε (X) < limδ↑(1−ε) H̄δ (X).
Then (∃ γ > 0) Tε (X) < limδ↑(1−ε) H̄δ (X) − 4γ. By definition of Tε (X),
there exists a sequence of codes ∼Cn = (n, Mn ) such that
 
1
lim sup log Mn ≤ lim H̄δ (X) − 4γ + γ
n→∞ n δ↑(1−ε)

< lim H̄δ (X) − 2γ (2.1.2)


δ↑(1−ε)

and
lim sup Pe (∼Cn ) ≤ ε. (2.1.3)
n→∞
(2.1.2) implies that
1
log Mn ≤ lim H̄δ (X) − 2γ
n δ↑(1−ε)

for all sufficiently large n. Hence, for those n satisfying the above inequality
and also by Lemma 2.4,
 
1 1
Pe (∼Cn ) ≥ Pr hX n (X ) > log Mn + γ − e−nγ
n
n n
   
1 n
≥ Pr hX n (X ) > lim H̄δ (X) − 2γ + γ − e−nγ .
n δ↑(1−ε)

27
Therefore,
 
1
lim sup Pe (∼Cn ) ≥ 1 − lim inf Pr hX n (X n ) ≤ lim H̄δ (X) − γ
n→∞ n→∞ n δ↑(1−ε)

> 1 − (1 − ε) = ε,
where the last inequality follows from (2.1.1). Thus, a contradiction to
(2.1.3) is obtained. 2

A few remarks are made based on the previous theorem.

• Note that as ε = 0, limδ↑(1−ε) H̄δ (X) = H̄(X). Hence, the above theorem
generalizes the block source coding theorem in [26], which states that the
minimum achievable fixed-length source coding rate of any finite-alphabet
source is H̄(X).
• Consider the special case where −(1/n) log PX n (X n ) converges in proba-
bility to a constant H, which holds for all information stable sources.4 In
this case, both the inf- and sup-spectrums of X degenerate to a unit step
function: 
1, if θ > H;
u(θ) =
0, if θ < H,
where H is the source entropy rate. Thus, H̄ε (X) = H for all ε ∈ [0, 1).
Hence, the general source coding theorem reduces to the conventional
source coding theorem.
• More generally, if −(1/n) log PX n (X n ) converges in probability to a random
variable Z whose cumulative distribution function (CDF) is FZ (·), then the
minimum achievable data compression rate subject to decoding error being
no greater than ε is
lim H̄δ (X) = sup {R : FZ (R) < 1 − ε} .
δ↑(1−ε)

Therefore, the relationship between the code rate and the ultimate optimal
error probability is also clearly defined. We further explore the case in the
next example.
4 (n) (n)
A source X = {X n = (X1 , . . . , Xn )}∞ n
n=1 is said to be information stable if H(X ) =
n
E [− log PX n (x )] > 0 for all n, and
  
 − log PX n (xn ) 
lim Pr  − 1  > ε = 0,

n→∞ H(X n )
for every ε > 0. By the definition, any stationary-ergodic source with finite n-fold entropy is
information stable; hence, it can be viewed a generalized source model for stationary-ergodic
sources.

28
Example 2.6 Consider a binary source X with each X n is Bernoulli(Θ)
distributed, where Θ is a random variable defined over (0, 1). This is a sta-
tionary but non-ergodic source [1]. We can view the source as a mixture of
Bernoulli(θ) processes where the parameter θ ∈ Θ = (0, 1), and has distri-
bution PΘ [1, Corollary 1]. Therefore, it can be shown by ergodic decom-
position theorem (which states that any stationary source can be viewed
as a mixture of stationary-ergodic sources) that −(1/n) log PX n (X n ) con-
verges in probability to a random variable Z = hb (Θ) [1], where hb (x) :=
−x log2 (x) − (1 − x) log2 (1 − x) is the binary entropy function. Conse-
quently, the CDF of Z is FZ (z) = Pr{hb (Θ) ≤ z}; and the minimum
achievable fixed-length source coding rate with compression error being no
larger than ε is
sup{R : FZ (R) < 1 − ε} = sup{R : Pr{hb (Θ) ≤ R} < 1 − ε}.

• From the above example, or from Theorem 2.5, it shows that the strong
converse theorem (which states that codes with rate below entropy rate
will ultimately have decompression error approaching one, cf. [2, Thm. 3.6])
does not hold in general. However, one can always claim the weak converse
statement for arbitrary sources.

Theorem 2.7 (weak converse theorem) For any block code sequence
of ultimate rate R < H̄(X), the probability of block decoding failure Pe
cannot be made arbitrarily small. In other words, there exists ε > 0 such
that Pe is lower bounded by ε infinitely often in blocklength n.

The possible behavior of the probability of block decompression error of


an arbitrary source is depicted in Figure 2.1. As shown in the figure,
there exist two bounds, denoted by H̄(X) and H̄0 (X), where H̄(X) is
the tight bound for lossless data compression rate. In other words, it is
possible to find a sequence of block codes with compression rate larger than
H̄(X) and the probability of decoding error is asymptotically zero. When
the data compression rate lies between H̄(X) and H̄0 (X), the minimum
probability of decoding error achievable is bounded below by a positive
absolute constant in (0, 1) infinitely often in blocklength n. In the case that
the data compression rate is less than H̄0 (X), the probability of decoding
error of all codes will eventually go to 1 (for n infinitely often). This
fact tells the block code designer that all codes with long blocklength are
bad when data compression rate is smaller than H̄0 (X). From the strong
converse theorem, the two bounds in Figure 2.1 coincide for memoryless
sources. In fact, these two bounds coincide even for stationary-ergodic
sources.

29
n (i.o.) n→∞
Pe −
−−→1 Pe is lower Pe −
−−→0
bounded (i.o. in n) - R
H̄0 (X) H̄(X)

Figure 2.1: Behavior of the probability of block decoding error as block-


length n goes to infinity for an arbitrary source X.

We close this section by remarking that the definition that we adopt for the
ε-achievable data compression rate is slightly different from, but equivalent to,
the one used in [26, Def. 8]. The definition in [26] also brings the same result,
which was separately proved by Steinberg and Verdú as a direct consequence of
Theorem 10(a) (or Corollary 3) in [36]. To be precise, they showed that Tε (X),
denoted by Te (ε, X) in [36], is equal to R̄v (2ε) (cf. Def. 17 in [36]). By a simple
derivation, we obtain:
Te (ε, X) = R̄v (2ε)
   
1 n
= inf θ : lim sup PX n − log PX n (X ) > θ ≤ ε
n→∞ n
   
1 n
= inf θ : lim inf PX n − log PX n (X ) ≤ θ ≥ 1 − ε
n→∞ n
   
1 n
= sup θ : lim inf PX n − log PX n (X ) ≤ θ < 1 − ε
n→∞ n
= lim H̄δ (X).
δ↑(1−ε)

Note that Theorem 10(a) in [36] is a lossless data compression theorem for ar-
bitrary sources, which the authors show as a by-product of their results on
finite-precision resolvability theory. Specifically, they proved T0 (X) = S(X)
[26, Thm. 1] and S(X) = H̄(X) [26, Thm. 3], where S(X) is the resolvability5
of an arbitrary source X. Here, we establish Theorem 2.5 in a different and
more direct way.

2.2 Generalized AEP theorem

For discrete memoryless sources, the data compression theorem is proved by


choosing the codebook ∼Cn to be the weakly δ-typical set and applying the Asymp-
5
The resolvability, which is a measure of randomness for random variables, will be intro-
duced in subsequent chapters.

30
totic Equipartition Property (AEP) which states that (1/n)hX n (X n ) converges
to H(X) with probability one (and hence in probability). The AEP – which
implies that the probability of the typical set is close to one for sufficiently large
n – also holds for stationary-ergodic sources. It is however invalid for more gen-
eral sources – e.g., non-stationary, non-ergodic sources. We herein demonstrate
a generalized AEP theorem.

Theorem 2.8 (generalized asymptotic equipartition property for arbi-


trary sources) Fix ε ∈ [0, 1). Given an arbitrary source X, define
 
n n 1 n
Tn [R] := x ∈ X : − log PX n (x ) ≤ R .
n

Then for any δ > 0, the following statements hold.

1.
lim inf Pr Tn [H̄ε (X) − δ] ≤ ε (2.2.1)
n→∞

2.
lim inf Pr Tn [H̄ε (X) + δ] > ε (2.2.2)
n→∞

3. The number of elements in

Fn (δ; ε) := Tn [H̄ε (X) + δ] \ Tn [H̄ε (X) − δ],

denoted by |Fn (δ; ε)|, satisfies

|Fn (δ; ε)| ≤ exp n(H̄ε (X) + δ) , (2.2.3)

where the operation A \ B between two sets A and B is defined by A \ B :=


A ∩ Bc with Bc denoting the complement set of B.

4. There exists ρ = ρ(δ) > 0 and a subsequence {nj }∞


j=1 such that

|Fn (δ; ε)| > ρ · exp nj (H̄ε (X) − δ) . (2.2.4)

Proof: (2.2.1) and (2.2.2) follow from the definitions. For (2.2.3), we have

1 ≥ PX n (xn )
xn ∈Fn (δ;ε)

≥ exp −n (H̄ε (X) + δ)
xn ∈Fn (δ;ε)

= |Fn (δ; ε)| exp −n (H̄ε (X) + δ) .

31
Fn (δ; ε)

Tn [H̄ε (X) + δ] Tn [H̄ε (X) − δ]

Figure 2.2: Illustration of generalized AEP Theorem. Fn (δ; ε) :=


Tn [H̄ε (X) + δ] \ Tn [H̄ε (X) − δ] is the dashed region.

It remains to show (2.2.4). (2.2.2) implies that there exist ρ = ρ(δ) > 0 and
N1 such that for all n > N1 ,
Pr Tn [H̄ε (X) + δ] > ε + 2ρ.
Furthermore, (2.2.1) implies that for the previously chosen ρ, there exists a
subsequence {nj }∞
j=1 such that

Pr Tnj [H̄ε (X) − δ] < ε + ρ.

Therefore, for all nj > N1 ,



ρ < Pr Tnj [H̄ε (X) + δ] \ Tnj [H̄ε (X) − δ]
 
 
< Tnj [H̄ε (X) + δ] \ Tnj [H̄ε (X) − δ] exp −nj (H̄ε (X) − δ)
 
  
= Fnj (δ; ε) exp −nj (H̄ε (X) − δ) .

The desired subsequence {nj }∞ 


j=1 is then defined as n1 is the first nj > N1 , and

n2 is the second nj > N1 , etc. 2
With the illustration depicted in Figure 2.2, we can clearly deduce that The-
orem 2.8 is indeed a generalized version of the AEP since:

• The set
$$
F_n(\delta;\varepsilon) := T_n[\bar H_\varepsilon(X) + \delta] \setminus T_n[\bar H_\varepsilon(X) - \delta]
= \left\{ x^n \in \mathcal{X}^n : \left| -\frac{1}{n}\log P_{X^n}(x^n) - \bar H_\varepsilon(X) \right| \le \delta \right\}
$$
is nothing but the weakly δ-typical set.
• (2.2.1) and (2.2.2) imply that qn := Pr{Fn (δ; ε)} > 0 infinitely often in n.

• (2.2.3) and (2.2.4) imply that the number of sequences in Fn (δ; ε) (the
dashed region) is approximately equal to exp{nH̄ε (X)}, and the proba-
bility of each sequence in Fn (δ; ε) can be estimated by qn · exp{−nH̄ε (X)}.

• In particular, if X is a stationary-ergodic source, then H̄ε (X) is indepen-
dent of ε ∈ [0, 1) and H̄ε (X) = H̲ε (X) = H for all ε ∈ [0, 1), where H is
the source entropy rate
$$H = \lim_{n\to\infty} \frac{1}{n} E\left[-\log P_{X^n}(X^n)\right].$$

In this case, (2.2.1)-(2.2.2) and the fact that H̄ε (X) = Hε (X) for all ε ∈
[0, 1) imply that the probability of the typical set Fn (δ; ε) is close to one
(for n sufficiently large), and (2.2.3) and (2.2.4) imply that there are about
enH typical sequences of length n, each with probability about e−nH . Hence
we obtain the conventional AEP.

• The general source coding theorem can also be proved in terms of the
generalized AEP theorem. For more detail, readers can refer to [11].
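To make the conventional AEP recovered in the fourth bullet above concrete, the following short C program is an illustrative sketch only (the parameter values p = 0.3 and n = 10000 are arbitrary choices, not taken from the text): it samples an i.i.d. Bernoulli(p) source and prints the normalized self-information −(1/n) log2 PX n (X n ), which concentrates around hb (p), so the spectrum degenerates to a single point.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Empirical check of the AEP for an i.i.d. Bernoulli(p) source:
 * -(1/n) log2 P_{X^n}(X^n) should concentrate around h_b(p).      */
int main(void)
{
    const double p = 0.3;          /* P{X = 1}; arbitrary choice     */
    const int    n = 10000;        /* blocklength; arbitrary choice  */
    const double hb = -p * log2(p) - (1 - p) * log2(1 - p);

    srand(2019);
    for (int trial = 0; trial < 5; trial++) {
        double self_info = 0.0;    /* -log2 P_{X^n}(x^n), accumulated letter by letter */
        for (int i = 0; i < n; i++) {
            int x = ((double)rand() / RAND_MAX) < p;   /* one source letter */
            self_info += -log2(x ? p : 1 - p);
        }
        printf("trial %d: -(1/n) log2 P = %.4f bits (h_b(p) = %.4f)\n",
               trial, self_info / n, hb);
    }
    return 0;
}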

Chapter 3

Measure of Randomness for Stochastic


Processes

In the previous chapter, it was shown that the sup-entropy rate is the minimum
lossless data compression rate achievable by block codes. Finding an optimal block
code is therefore a well-defined task: for any source with a well-formulated statistical
model, the sup-entropy rate can be computed and used as a criterion to evaluate
the optimality of a designed block code.
In [26], Verdú and Han found that, besides being the minimum lossless data
compression rate, the sup-entropy rate has another operational meaning, called
resolvability. In this chapter, we explore this concept in detail.

3.1 Motivation for resolvability : measure of randomness


of random variables

In simulations of statistical communication systems, the generation of random vari-
ables by a computer algorithm is essential. The computer usually has access to a
basic random experiment (through a pre-defined Application Programming Inter-
face) that generates equally likely random values, such as rand( ), which returns a
real number uniformly distributed over (0, 1). Conceptually, random variables with
complex models are more difficult to generate by computer than random variables
with simple models. The question is how to quantify the “complexity” of generating
random variables by computer. One way to define such a “complexity” measure is:

Definition 3.1 The complexity of generating a random variable is defined as
the number of random bits that the most efficient algorithm requires in order to
generate the random variable by a computer that has access to an equally likely
random experiment.

To understand the above definition quantitatively, a simple example is de-


monstrated below.

Example 3.2 Consider the generation of the random variable with probability
masses PX (−1) = 1/4, PX (0) = 1/2, and PX (1) = 1/4. An algorithm is written
as:

Flip-a-fair-coin; \\ one random bit


If “Head”, then output 0;
else
{
Flip-a-fair-coin; \\ one random bit
If “Head”, then output −1;
else output 1;
}

On average, the above algorithm requires 1.5 coin flips, and in the worst case,
2 coin flips are necessary. Therefore, the complexity measure can take two
fundamental forms: worst-case or average-case over the range of outcomes of
the random variable. Note that we did not show in the above example that
the algorithm is the most efficient one in the sense of using the minimum number
of random bits; however, it is indeed optimal because it achieves the corresponding
lower bound. Later, we will show that the lower bound on the average number of
random bits required to generate a random variable is its entropy, which is exactly
1.5 bits in the above example. As for the worst-case bound, a new quantity, the
resolution, will be introduced; the above algorithm also achieves the lower bound on
the worst-case complexity, which is the resolution of the random variable.
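The figures quoted in Example 3.2 (masses 1/4, 1/2, 1/4; an average of 1.5 flips; a worst case of 2 flips) can be verified with the short C simulation below, given as an illustration only; the number of trials is an arbitrary choice.

#include <stdio.h>
#include <stdlib.h>

static int flip(void) { return rand() & 1; }   /* one fair random bit: 1 = Head */

int main(void)
{
    long count[3] = {0, 0, 0};     /* counts for the outcomes -1, 0, +1 */
    long flips = 0, worst = 0;
    const long trials = 1000000;

    srand(2019);
    for (long t = 0; t < trials; t++) {
        long used = 1;
        int out;
        if (flip()) out = 0;                       /* "Head" on the first flip   */
        else { used = 2; out = flip() ? -1 : 1; }  /* second flip decides -1 / 1 */
        count[out + 1]++;
        flips += used;
        if (used > worst) worst = used;
    }
    printf("P(-1)=%.3f  P(0)=%.3f  P(1)=%.3f\n",
           (double)count[0] / trials, (double)count[1] / trials,
           (double)count[2] / trials);
    printf("average flips = %.3f (entropy = 1.5 bits), worst case = %ld\n",
           (double)flips / trials, worst);
    return 0;
}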

3.2 Notation and definitions regarding resolvability

Definition 3.3 (M-type) For any positive integer M, a probability distribu-
tion P is said to be M-type if
$$P(\omega) \in \left\{0, \frac{1}{M}, \frac{2}{M}, \ldots, 1\right\} \quad \text{for all } \omega \in \Omega.$$

Definition 3.4 (resolution of a random variable) The resolution1 R(X) of
a random variable X is the minimum log(M) such that PX is M-type. If PX is
not M-type for any integer M, then R(X) = ∞.
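For a distribution with rational masses pi = ai /bi , PX is M-type exactly when every M pi is an integer, so the smallest such M is the least common multiple of the reduced denominators. The sketch below (an illustrative routine, not part of the text) computes R(X) in this way for the distribution of Example 3.2, for which R(X) = 2 bits — matching the worst-case complexity of 2 coin flips observed there.

#include <stdio.h>
#include <math.h>

static long gcd(long a, long b) { return b ? gcd(b, a % b) : a; }
static long lcm(long a, long b) { return a / gcd(a, b) * b; }

/* Resolution R(X) = log2(min M such that P_X is M-type), for rational masses. */
static double resolution(const long num[], const long den[], int k)
{
    long M = 1;
    for (int i = 0; i < k; i++)
        M = lcm(M, den[i] / gcd(num[i], den[i]));  /* the reduced denominator must divide M */
    return log2((double)M);
}

int main(void)
{
    /* P_X(-1) = 1/4, P_X(0) = 1/2, P_X(1) = 1/4 as in Example 3.2 */
    long num[] = {1, 1, 1}, den[] = {4, 2, 4};
    printf("R(X) = %.1f bits\n", resolution(num, den, 3));   /* prints 2.0 */
    return 0;
}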

As revealed previously, a random source needs to be resolved (meaning, it can
be generated by a computer algorithm with access to equally likely random
experiments). As anticipated, a random variable with finite resolution is resolv-
able by computer algorithms. Yet, it is possible that the resolution of a random
variable is infinite. A quick example is the random variable X with distribution
PX (0) = 1/π and PX (1) = 1 − 1/π. (PX is not M-type for any finite M.) In
such a case, one can alternatively choose another computer-resolvable random
variable, which resembles the true source within some acceptable range, to
simulate the original one.
One criterion that can be used as a measure of resemblance of two random
variables is the variational distance. For the same example as in the above
paragraph, choose a random variable X̃ with distribution PX̃ (0) = 1/3 and
PX̃ (1) = 2/3. Then ‖X − X̃‖ ≈ 0.03, and X̃ is 3-type, which is computer-
resolvable. A program that generates the 3-type X̃ is as follows (in C language).

#include <stdio.h>
#include <stdlib.h>

static int flip_fair_coin(void) { return rand() & 1; }   /* one random bit: 1 = "Head" */

/* Outputs 0 with probability 1/3 and 1 with probability 2/3 (the 3-type X~). */
int generate_X_tilde(void)
{
    int even = 0;                       /* parity of the number of Tails seen so far */
    while (1) {
        if (flip_fair_coin())           /* "Head": stop and output                   */
            return even ? 0 : 1;
        even = !even;                   /* "Tail": toggle the parity and flip again  */
    }
}

Then, by denoting H = Head and T = Tail, the probability of outputting 1 equals
the probability of observing H, TTH, TTTTH, . . ., which is 1/2 + 1/2³ + 1/2⁵ + · · · = 2/3.
The average complexity of this algorithm is 1 · (1/2) + 2 · (1/4) + 3 · (1/8) + · · · = 2 bits, but
its worst-case complexity is infinite.
We proceed to give the definitions of variational distance and ε-resolution as
well as its extension for the generation of a sequence of random variables.
1
If the base of the logarithmic operation is 2, the resolution is measured in bits; however,
if natural logarithm is taken, nats becomes the basic measurement unit of resolution.

Definition 3.5 (variational distance) The variational distance (or ℓ1 dis-
tance) between two distributions P and Q defined on a common measurable space
(Ω, F) is
$$\|P - Q\| := \sum_{\omega\in\Omega} |P(\omega) - Q(\omega)|.$$
(Note that an alternative way to formulate the variational distance is
$$\|P - Q\| = 2 \cdot \sup_{E\in\mathcal{F}} |P(E) - Q(E)| = 2 \sum_{\omega\in\Omega:\, P(\omega)\ge Q(\omega)} [P(\omega) - Q(\omega)].$$
These definitions are equivalent.)
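As a quick numerical check of the two equivalent expressions in Definition 3.5 (and of the value ≈ 0.03 quoted earlier for the 1/π example), the following illustrative C routine computes both forms:

#include <stdio.h>
#include <math.h>

/* Variational distance computed two ways:
 *   (a) sum over the alphabet of |P - Q|
 *   (b) 2 * sum over the points where P >= Q of (P - Q)                */
int main(void)
{
    const double pi = acos(-1.0);
    /* The example preceding Definition 3.5: P_X(0)=1/pi, P_X(1)=1-1/pi
       versus the 3-type approximation (1/3, 2/3).                       */
    double P[2] = {1.0 / pi, 1.0 - 1.0 / pi};
    double Q[2] = {1.0 / 3.0, 2.0 / 3.0};

    double a = 0.0, b = 0.0;
    for (int x = 0; x < 2; x++) {
        a += fabs(P[x] - Q[x]);
        if (P[x] >= Q[x]) b += 2.0 * (P[x] - Q[x]);
    }
    printf("sum |P-Q|         = %.4f\n", a);   /* about 0.0301 */
    printf("2*sum_{P>=Q}(P-Q) = %.4f\n", b);   /* same value   */
    return 0;
}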

Definition 3.6 (ε-achievable resolution) Fix ε ≥ 0. R is an ε-achievable
resolution for input X if for all γ > 0, there exists X̃ satisfying
$$R(\tilde X) < R + \gamma \quad \text{and} \quad \|X - \tilde X\| < \varepsilon.$$

The ε-achievable resolution reveals the possibility of choosing another
computer-resolvable random variable whose variational distance to the true
source is within an acceptable range, ε.
Next we define the ε-achievable resolution rate for a sequence of random
variables, which is an extension of the ε-achievable resolution defined for a single
random variable. This extension is analogous to extending the entropy of a single
source to the entropy rate of a random source sequence.

Definition 3.7 (ε-achievable resolution rate) Fix ε ≥ 0 and input X. R is
an ε-achievable resolution rate2 for input X if for every γ > 0, there exists X̃
satisfying
$$\frac{1}{n} R(\tilde X^n) < R + \gamma \quad \text{and} \quad \|X^n - \tilde X^n\| < \varepsilon,$$
for all sufficiently large n.

Definition 3.8 (ε-resolvability for X) Fix ε > 0. The ε-resolvability for in-
put X, denoted by Sε (X), is the minimum ε-achievable resolution rate of the
same input, i.e.,
$$
S_\varepsilon(X) := \min\Big\{ R : (\forall\, \gamma > 0)(\exists\, \tilde X \text{ and } N)(\forall\, n > N)\ \
\frac{1}{n} R(\tilde X^n) < R + \gamma \ \text{ and } \ \|X^n - \tilde X^n\| < \varepsilon \Big\}.
$$
2
Note that our definition of resolution rate is different from its original form (cf. Definition
7 in [26] and the statements following Definition 7 of the same paper for its modified Definition
for specific input X), which involves an arbitrary channel model W . Readers may treat our
definition as a special case of theirs over identity channel.

Here, we define Sε (X) using the “minimum” instead of the more general “in-
fimum” operation simply because Sε (X) indeed belongs to the set over which the
minimum is taken,3 i.e.,
$$
S_\varepsilon(X) \in \mathcal{R}_\varepsilon(X) := \Big\{ R : (\forall\, \gamma > 0)(\exists\, \tilde X \text{ and } N)(\forall\, n > N)\ \
\frac{1}{n} R(\tilde X^n) < R + \gamma \ \text{ and } \ \|X^n - \tilde X^n\| < \varepsilon \Big\}.
$$
A similar convention will be applied throughout the rest of this chapter.

Definition 3.9 (resolvability for X) The resolvability for input X, denoted
by S(X), is
$$S(X) := \lim_{\varepsilon\downarrow 0} S_\varepsilon(X).$$

From its definition, the ε-resolvability is obviously non-increasing in ε.
Hence, the resolvability can also be defined using the supremum operation as
$$S(X) := \sup_{\varepsilon > 0} S_\varepsilon(X).$$

The resolvability is pertinent to the worst-case complexity measure for ran-
dom variables (cf. Example 3.2 and the discussion following it). Using the en-
tropy function in place of the resolution, information theorists also define the
ε-mean-resolvability and mean-resolvability for input X, which characterize the
average-case complexity of random variables.

Definition 3.10 (ε-mean-achievable resolution rate) Fix ε ≥ 0. R is an
ε-mean-achievable resolution rate for input X if for all γ > 0, there exists X̃
satisfying
$$\frac{1}{n} H(\tilde X^n) < R + \gamma \quad \text{and} \quad \|X^n - \tilde X^n\| < \varepsilon,$$
for all sufficiently large n.
3 By its definition, Rε (X) is either (Sε (X), ∞) or [Sε (X), ∞). Suppose Rε (X) =
(Sε (X), ∞). Since Sε (X) ∉ Rε (X), there exists γ0 > 0 satisfying
$$(\forall\, \tilde X \text{ and } N)(\exists\, n > N)\quad \frac{1}{n} R(\tilde X^n) \ge S_\varepsilon(X) + \gamma_0 \ \text{ or } \ \|X^n - \tilde X^n\| \ge \varepsilon. \qquad (3.2.1)$$
Let R0 := Sε (X) + γ0/2 and note that R0 is contained in Rε (X); hence, for γ = γ0/2,
$$(\exists\, \tilde X^0 \text{ and } N_0)(\forall\, n > N_0)\quad \frac{1}{n} R(\tilde X^{0n}) < R_0 + \gamma = S_\varepsilon(X) + \gamma_0 \ \text{ and } \ \|X^n - \tilde X^{0n}\| < \varepsilon.$$
The existence of X̃⁰ and N0 then contradicts (3.2.1).
Definition 3.11 (ε-mean-resolvability for X) Fix ε > 0. The ε-mean-re-
solvability for input X, denoted by S̄ε (X), is the minimum ε-mean-achievable
resolution rate for the same input, i.e.,
$$
\bar S_\varepsilon(X) := \min\Big\{ R : (\forall\, \gamma > 0)(\exists\, \tilde X \text{ and } N)(\forall\, n > N)\ \
\frac{1}{n} H(\tilde X^n) < R + \gamma \ \text{ and } \ \|X^n - \tilde X^n\| < \varepsilon \Big\}.
$$

Definition 3.12 (mean-resolvability for X) The mean-resolvability for in-
put X, denoted by S̄(X), is
$$\bar S(X) := \lim_{\varepsilon\downarrow 0} \bar S_\varepsilon(X) = \sup_{\varepsilon > 0} \bar S_\varepsilon(X).$$

The only difference between resolvability and mean-resolvability is that the
former employs the resolution function, while the latter replaces it by the entropy
function. Since the entropy is the minimum average codeword length of uniquely
decodable codes, one interpretation of mean-resolvability is that the new random
variable X̃ n can be resolved by realizing the optimal variable-length code for it.
One can think of the probability mass of each outcome of X̃ n as being approximately
2^{−ℓ}, where ℓ is the codeword length assigned by the optimal lossless variable-length
code for X n (see also [2, Sect. 3.3.3(B)]). Such a probability mass can actually be
generated by flipping a fair coin ℓ times, and the average number of fair coin flips
spent on this outcome is then ℓ · 2^{−ℓ}. As one may expect, the mean-resolvability
is shown to be the average complexity of a random variable. A concrete sketch of
this coin-flipping construction for a dyadic distribution is given below.
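The following C sketch makes the coin-flipping interpretation concrete for a dyadic distribution; it is an illustration only (the distribution P = (1/2, 1/4, 1/8, 1/8) and its prefix code {0, 10, 110, 111} are arbitrary choices, not taken from the text). Generating the source by walking the code tree with fair coin flips consumes on average Σ ℓ 2^{−ℓ} = H(X) = 1.75 flips.

#include <stdio.h>
#include <stdlib.h>

static int flip(void) { return rand() & 1; }       /* one fair random bit */

/* Generate a symbol with P = (1/2, 1/4, 1/8, 1/8) by walking the prefix
 * code {0, 10, 110, 111}; *flips counts the random bits consumed.       */
static int generate(long *flips)
{
    (*flips)++; if (flip() == 0) return 0;         /* codeword "0"   */
    (*flips)++; if (flip() == 0) return 1;         /* codeword "10"  */
    (*flips)++; return flip() == 0 ? 2 : 3;        /* "110" or "111" */
}

int main(void)
{
    const long trials = 1000000;
    long flips = 0, count[4] = {0, 0, 0, 0};

    srand(2019);
    for (long t = 0; t < trials; t++) count[generate(&flips)]++;

    printf("empirical masses:");
    for (int s = 0; s < 4; s++) printf(" %.3f", (double)count[s] / trials);
    printf("\naverage flips = %.3f (H(X) = 1.75 bits)\n", (double)flips / trials);
    return 0;
}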

3.3 Operational meanings of resolvability and mean-resolvability

The operational meanings of the resolution and of the entropy (a new operational
meaning for entropy other than the one from the source coding theorem) follow
from the next theorem.

Theorem 3.13 For a single random variable X,

1. the worst-case complexity is lower-bounded by its resolution R(X) [26];

2. the average-case complexity is lower-bounded by its entropy H(X), and is
upper-bounded by the entropy H(X) plus 2 bits [28].

Next, we reveal the operational meanings for resolvability and mean-resolva-
bility in source coding. We begin with some lemmas that are useful in charac-
terizing the resolvability.

Lemma 3.14 (bound on variational distance) For every µ > 0,
$$\|P - Q\| \le 2\mu + 2 \cdot P\left\{ x \in \mathcal{X} : \log \frac{P(x)}{Q(x)} > \mu \right\}.$$

Proof:
$$
\begin{aligned}
\|P - Q\| &= 2 \sum_{x\in\mathcal{X}:\, P(x)\ge Q(x)} [P(x) - Q(x)]
 = 2 \sum_{x\in\mathcal{X}:\, \log[P(x)/Q(x)]\ge 0} [P(x) - Q(x)] \\
&= 2\left( \sum_{x\in\mathcal{X}:\, \log[P(x)/Q(x)]>\mu} [P(x) - Q(x)]
      + \sum_{x\in\mathcal{X}:\, \mu\ge\log[P(x)/Q(x)]\ge 0} [P(x) - Q(x)] \right) \\
&\le 2\left( \sum_{x\in\mathcal{X}:\, \log[P(x)/Q(x)]>\mu} P(x)
      + \sum_{x\in\mathcal{X}:\, \mu\ge\log[P(x)/Q(x)]\ge 0} P(x)\left[1 - \frac{Q(x)}{P(x)}\right] \right) \\
&\le 2\left( P\left\{x\in\mathcal{X}: \log\frac{P(x)}{Q(x)} > \mu\right\}
      + \sum_{x\in\mathcal{X}:\, \mu\ge\log[P(x)/Q(x)]\ge 0} P(x)\log\frac{P(x)}{Q(x)} \right)
      \quad \text{(by the fundamental inequality)} \\
&\le 2\left( P\left\{x\in\mathcal{X}: \log\frac{P(x)}{Q(x)} > \mu\right\}
      + \sum_{x\in\mathcal{X}:\, \mu\ge\log[P(x)/Q(x)]\ge 0} P(x)\cdot\mu \right) \\
&= 2\left( P\left\{x\in\mathcal{X}: \log\frac{P(x)}{Q(x)} > \mu\right\}
      + \mu\, P\left\{x\in\mathcal{X}: \mu \ge \log\frac{P(x)}{Q(x)} \ge 0\right\} \right) \\
&\le 2\left( P\left\{x\in\mathcal{X}: \log\frac{P(x)}{Q(x)} > \mu\right\} + \mu \right). \qquad\square
\end{aligned}
$$

Lemma 3.15 For every n,
$$P_{\tilde X^n}\left\{ x^n \in \mathcal{X}^n : -\frac{1}{n}\log P_{\tilde X^n}(x^n) \le \frac{1}{n} R(\tilde X^n) \right\} = 1.$$

Proof: By definition of R(X̃ n ),
$$P_{\tilde X^n}(x^n) \ge \exp\{-R(\tilde X^n)\}$$
for all x n ∈ X n with PX̃ n (x n ) > 0. Hence, for all such x n ,
$$-\frac{1}{n}\log P_{\tilde X^n}(x^n) \le \frac{1}{n} R(\tilde X^n).$$
The lemma then holds. □

Theorem 3.16 The resolvability for input X is equal to its sup-entropy rate,
i.e.,
S(X) = H̄(X).

Proof:

1. S(X) ≥ H̄(X).
It suffices to show that S(X) < H̄(X) contradicts to Lemma 3.15.
Suppose S(X) < H̄(X). Then there exists δ > 0 such that

S(X) + δ < H̄(X).

Let  
n n1 n
D0 := x ∈ X : − log PX n (x ) ≥ S(X) + δ .
n

By definition of H̄(X),4
lim sup PX n (D0 ) > 0.
n→∞

Therefore, there exists α > 0 such that


lim sup PX n (D0 ) > α,
n→∞

which immediately implies


PX n (D0 ) > α
infinitely often in n.
Select 0 < ε < min{α2 , 1} and observe that Sε (X) ≤ S(X), we can choose
 n to satisfy
X
1  n ) < S(X) + δ  n < ε
R(X and X n − X (3.3.1)
n 2
for sufficiently large n. Define
D1 := {xn ∈ X n : PX n (xn ) > 0
  √
and PX n (xn ) − PX n (xn ) ≤ ε · PX n (xn ) .
Then
PX n (D1c ) = PX n {xn ∈ X n : PX n (xn ) = 0
  √
or PX n (xn ) − PX n (xn ) > ε · PX n (xn )
≤ PX n {xn ∈ X n : PX n (xn ) = 0}
  √
+PX n xn ∈ X n : PX n (xn ) − PX n (xn ) > ε · PX n (xn )
  √
= PX n xn ∈ X n : PX n (xn ) − PX n (xn ) > ε · PX n (xn )

= PX n (xn )

xn ∈X n : PX n (xn )<(1/ ε)|PX n (xn )−PX
 n (x )|
n

 1
≤ √ |PX n (xn ) − PX n (xn )|
xn ∈X n
ε
ε √
≤ √ = ε.
ε
Consider that
PX n (D1 ∩ D0 ) ≥ PX n (D0 ) − PX n (D1c )

≥ α − ε > 0, (3.3.2)
4
lim inf n→∞ PX (D0c ) ≤ h(S(X)+δ) < 1 because if h(S(X)+δ) ≥ 1, then H̄(X) ≤ S(X)+δ.
¯ ¯

which holds infinitely often in n; and every xn0 in D1 ∩ D0 satisfies

PX n (xn0 ) ≤ (1 + ε)PX n (xn0 ) (since xn0 ∈ D1 )
and
1 1 1 1
− log PX n (xn0 ) ≥ − log PX n (xn0 ) + log √
n n n 1+ ε
1 1
≥ (S(X) + δ) + log √ (since xn0 ∈ D0 )
n 1+ ε
δ
≥ S(X) + ,
2

for n > (2/δ) log(1 + ε). Therefore, for those n that (3.3.2) holds,
 
n n 1 n 1  n
PX n x ∈ X : − log PX n (x ) > R(X )
n n
 
n n 1 n δ
≥ PX n x ∈ X : − log PX n (x ) > S(X) + (From (3.3.1))
n 2
 
n n 1 n
≥ PX n x ∈ X : − log PX n (x ) ≥ S(X) + δ
n
$ %& '
=D0
≥ PX n (D1 ∩ D0 )

≥ (1 − ε)PX n (D1 ∩ D0 ) (By definition of D1 )
> 0,
which contradicts to the result of Lemma 3.15.
2. S(X) ≤ H̄(X).
It suffices to show the existence of X̃ for arbitrary γ > 0 such that
 n = 0
lim X n − X
n→∞

 n is an M-type distribution with


and X
( γ
)
M = en(H̄(X)+ 2 ) ,
2
which ensures that for n > γ
log(2),
γ γ
M < en(H̄(X)+ 2 ) + 1 < 2en(H̄(X)+ 2 ) < en(H̄(X)+γ) .

n = X
Let X  n (X n ) be uniformly distributed over a set

G := {Uj ∈ X n : j = 1, . . . , M}

which drawn randomly (independently) according to PX n . Define for µ > 0,
 
n n 1 n γ µ
D := x ∈ X : − log PX n (x ) > H̄(X) + + .
n 2 n
For each G chosen, we obtain from Lemma 3.14 that
 n
X n − X
 
n n PX n (xn )
≤ 2µ + 2 · PX n x ∈ X : log >µ
PX n (xn )
 
n 1/M
= 2µ + 2 · PX n x ∈ G : log > µ (since PX n (G c ) = 0)
PX n (xn )
 ( ) µ
n 1 n 1 n(H̄(X)+ γ2 )
= 2µ + 2 · PX n x ∈ G : − log PX n (x ) > log e +
n n n
 
1 γ µ
≤ 2µ + PX n xn ∈ G : − log PX n (xn ) > H̄(X) + +
n 2 n
= 2µ + PX n (G ∩ D)
1
= 2µ + |G ∩ D| .
M
Since G is chosen randomly, we can take the expectation values (with
respect to the random G) of the above inequality to obtain:
 
EG X n − X n  ≤ 2µ + 1 EG [|G ∩ D|] .
M
Observe that each Uj is either in D or not in D. From the i.i.d. assumption
of {Uj }M
j=1 , we can then evaluate EG [|G ∩ D|] by
5

EG [|G ∩ D|] = MPXMn [D] + (M − 1)PXMn−1[D]PX n [D c ]


+ · · · + PX n [D]PXMn−1[D c ]
= MPX n [D].
Hence,
 
n  n
lim sup EG X − X  ≤ 2µ + lim sup PX n [D] = 2µ,
n→∞ n→∞

which implies  
n  n
lim sup EG X − X  = 0 (3.3.3)
n→∞
since µ can be chosen arbitrarily small. (3.3.3) therefore guarantees the
existence of the desired X̃. 2
5
Readers may imagine that there are cases: where |G ∩ D| = M , |G ∩ D| = M − 1, . . .,
M M−1
|G ∩D| = 1 and |G ∩D| = 0, respectively with drawing probability PX n (D), PX n (D)PX n (Dc ),
M−1 c M c
. . ., PX (D)PX n (D ) and PX n (D ).
n

The next two lemmas are useful in characterizing mean-resolvability.

Lemma 3.17 With 0 < a, b ≤ 1,
$$
\left| a\log\frac{1}{a} - b\log\frac{1}{b} \right| \le
\begin{cases}
|a-b|\cdot\log\dfrac{1}{|a-b|}, & |a-b| < \tfrac{1}{2}; \\[1ex]
(1-|a-b|)\cdot\log\dfrac{1}{1-|a-b|}, & \tfrac{1}{2} \le |a-b| < 1.
\end{cases}
$$

Proof: Without loss of generality, assume a = t + τ and b = t with 0 < t ≤


t + τ < 1 and τ < 1.
Subject to f (t) := t log( 1t ), we have that for 0 < t ≤ 1 − τ ,

∂[f (t + τ ) − f (t)] t
= log ≤0
∂t t+τ
Hence,
sup [f (t + τ ) − f (t)] = f (τ ) − f (0) = f (τ )
0<t≤1−τ

and
sup [f (t) − f (t + τ )] = f (1 − τ ) − f (0) = f (1 − τ ).
0<t≤1−τ

Thus

|f (a) − f (b)| = |f (t + τ ) − f (t)|


≤ max{f (τ ), f (1 − τ )}
= max{f (|a − b|), f (1 − |a − b|)}
 1

|a − b| · log , |a − b| < 12 ;
|a − b|
= 1

(1 − |a − b|) · log , 1 ≤ |a − b| < 1.
(1 − |a − b|) 2
2

Lemma 3.18 (variational distance and entropy difference [16, p. 33])
$$|H(X^n) - H(\tilde X^n)| \le \|X^n - \tilde X^n\| \cdot \log\frac{|\mathcal{X}|^n}{\|X^n - \tilde X^n\|},$$
provided ‖X n − X̃ n ‖ ≤ 1/2.

Proof:
|H(X n ) − H(X n )|
 
   
 n 1 n 1 
=  PX (x ) log
n − PX̃ n (x ) log 
 n n n
PX (x ) xn ∈X n
n PX̃ n (x ) 
n
x ∈X
 
  1 1 
≤ PX n (xn ) log − P (xn
) log 
 PX n (xn ) X̃ n
P n (xn ) 
xn ∈X n X̃
 1
≤ |PX n (xn ) − PX̃ n (xn )| · log (3.3.4)
xn ∈X n
|PX n
n (x ) − PX̃ n (xn )|
 

   xn ∈X n 1
n n
≤ |PX n (x ) − PX̃ n (x )| log   
n n
xn ∈X n |PX n (x ) − PX̃ n (x )|
xn ∈X n
(3.3.5)
n
|X |
= X n − X̃ n  log ,
X n − X̃ n 
 n ≤
where (3.3.4) follows from X n − X 1
and Lemma 3.17, and (3.3.5) uses
2
the log-sum inequality. 2
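Lemma 3.18 (here spot-checked for n = 1) can be verified numerically; the C sketch below uses an arbitrary pair of distributions on a ternary alphabet and is only an illustration of the bound, not part of the text.

#include <stdio.h>
#include <math.h>

static double entropy(const double *p, int k)      /* entropy in nats */
{
    double h = 0.0;
    for (int i = 0; i < k; i++)
        if (p[i] > 0.0) h -= p[i] * log(p[i]);
    return h;
}

int main(void)
{
    /* Two arbitrary distributions on an alphabet of size 3. */
    double P[3] = {0.50, 0.30, 0.20};
    double Q[3] = {0.45, 0.35, 0.20};

    double d = 0.0;                                 /* variational distance */
    for (int i = 0; i < 3; i++) d += fabs(P[i] - Q[i]);

    double lhs = fabs(entropy(P, 3) - entropy(Q, 3));
    double rhs = d * log(3.0 / d);                  /* valid here since d <= 1/2 */
    printf("|H(P)-H(Q)| = %.4f <= %.4f = d*log(|X|/d)\n", lhs, rhs);
    return 0;
}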

Theorem 3.19 For any X,
$$\bar S(X) = \limsup_{n\to\infty} \frac{1}{n} H(X^n).$$

Proof:
1. S̄(X) ≤ lim supn→∞ (1/n)H(X n).
It suffices to prove that S̄ε (X) ≤ lim supn→∞ (1/n)H(X n) for every ε > 0.
This is equivalent to show that for all γ > 0, there exists X̃ such that
1  n ) < lim sup 1 H(X n ) + γ
H(X
n n→∞ n

and
 n < ε
X n − X
for sufficiently large n. This can be trivially achieved by letting X̃ = X,
since for sufficiently many n,
1 1
H(X n ) < lim sup H(X n ) + γ
n n→∞ n

and
X n − X n  = 0.

2. S̄(X) ≥ lim supn→∞ (1/n)H(X n).
Observe that S̄(X) ≥ S̄ε (X) for any 0 < ε < e−1 ≈ 0.36788. Then for any
 n such that
γ > 0 and all sufficiently large n, there exists X
1  n ) < S̄(X) + γ
H(X (3.3.6)
n
and
 n  < ε.
X n − X

From Lemma 3.18 that states

 n )| ≤ X n − X
 n  · log |X |n |X |n
|H(X n ) − H(X ≤ ε log ,
X n − X n ε

where the last inequality holds because t log(1/t) is increasing for 0 < t <
e−1 , we obtain
 n ) ≥ H(X n ) − ε log |X |n + ε log ε.
H(X

which, together with (3.3.6), implies that


1
lim sup H(X n ) − ε log |X | < S̄(X) + γ.
n→∞ n

Since ε and γ can be taken arbitrarily small, we have


1
S̄(X) ≥ lim sup H(X n ).
n→∞ n

3.4 Resolvability, mean-resolvability and source coding

In the previous chapter, we proved that the lossless data compression rate
for block codes is lower bounded by H̄(X). We also showed in Section 3.3 that
H̄(X) is the resolvability of the source X. We can therefore conclude that the
resolvability is equal to the minimum lossless data compression rate for block
codes. We will justify their equivalence directly in this section.
As explained in the AEP theorem for memoryless sources, the set Fn (δ) contains
approximately 2^{nH(X)} elements, and the probability of the source sequences
falling outside Fn (δ) eventually goes to 0. Therefore, we can binary-index the
source sequences in Fn (δ) by
$$\log_2 2^{nH(X)} = nH(X) \ \text{bits},$$
and encode the source sequences outside Fn (δ) by a unique default binary code-
word, which results in an asymptotically vanishing probability of decoding error.
This is indeed the main idea behind Shannon’s source coding theorem for block codes.
By further exploring the above concept, we find that the key is actually the
existence of a set An = {xn1 , xn2 , . . . , xnM } with M ≈ 2^{nH(X)} and PX n (Acn ) → 0.
Thus, if we can find such a typical set, Shannon’s source coding theorem for block
codes can be generalized to more general sources, such as non-stationary sources.
Furthermore, extensions of the theorem to codes of some specific types become
feasible.

Definition 3.20 (minimum ε-source compression rate for fixed-length
codes) R is an achievable ε-source compression rate for fixed-length codes if there
exists a sequence of sets {An }∞n=1 with An ⊂ X n such that
$$\limsup_{n\to\infty} \frac{1}{n}\log|A_n| \le R \quad \text{and} \quad \limsup_{n\to\infty} P_{X^n}[A_n^c] \le \varepsilon.$$
Tε (X) is the minimum of all such rates.

Note that the definition of Tε (X) is equivalent to the one in Definition 2.2.

Definition 3.21 (minimum source compression rate for fixed-length


codes) T (X) represents the minimum source compression rate for fixed-length
codes, which is defined as:
T (X) := lim Tε (X).
ε↓0

Definition 3.22 (minimum source compression rate for variable-length
codes) R is an achievable source compression rate for variable-length codes if
there exists a sequence of error-free prefix codes {∼Cn }∞n=1 such that
$$\limsup_{n\to\infty} \frac{1}{n}\bar\ell_n \le R,$$
where ℓ̄n is the average codeword length of ∼Cn . T̄ (X) is the minimum of all
such rates.

Recall that for a single source, the measure of its uncertainty is the entropy.
Although the entropy can also be used to characterize the overall uncertainty of
a random sequence X, source coding is concerned more with the “average”
entropy of it. So far, we have seen four expressions of “average” entropy:
$$
\begin{aligned}
\limsup_{n\to\infty} \frac{1}{n} H(X^n) &:= \limsup_{n\to\infty} \frac{1}{n} \sum_{x^n\in\mathcal{X}^n} -P_{X^n}(x^n)\log P_{X^n}(x^n); \\
\liminf_{n\to\infty} \frac{1}{n} H(X^n) &:= \liminf_{n\to\infty} \frac{1}{n} \sum_{x^n\in\mathcal{X}^n} -P_{X^n}(x^n)\log P_{X^n}(x^n); \\
\bar H(X) &:= \inf_{\beta\in\mathbb{R}}\left\{ \beta : \limsup_{n\to\infty} P_{X^n}\!\left[ -\frac{1}{n}\log P_{X^n}(X^n) > \beta \right] = 0 \right\}; \\
\underline{H}(X) &:= \sup_{\alpha\in\mathbb{R}}\left\{ \alpha : \limsup_{n\to\infty} P_{X^n}\!\left[ -\frac{1}{n}\log P_{X^n}(X^n) < \alpha \right] = 0 \right\}.
\end{aligned}
$$
If
$$\lim_{n\to\infty} \frac{1}{n} H(X^n) = \limsup_{n\to\infty} \frac{1}{n} H(X^n) = \liminf_{n\to\infty} \frac{1}{n} H(X^n),$$
then limn→∞ (1/n)H(X n ) is called the entropy rate of the source. H̄(X) and
H̲(X) are called the sup-entropy rate and inf-entropy rate, which were introduced
in Section 1.3.
Next we will prove that T (X) = S(X) = H̄(X) and T̄ (X) = S̄(X) =
lim supn→∞ (1/n)H(X n ) for a source X. The operational characterizations of
lim inf n→∞ (1/n)H(X n ) and H̲(X) will be introduced in Chapter 5.

Theorem 3.23 (equality of resolvability and minimum source coding


rate for fixed-length codes)

T (X) = S(X) = H̄(X).

Proof: Equality of S(X) and H̄(X) is already given in Theorem 3.16. Also,
T (X) = H̄(X) can be obtained from Theorem 2.5 by letting ε = 0. Here, we
provide an alternative proof for T (X) = S(X).

1. T (X) ≤ S(X).
If we can show that, for any ε fixed, Tε (X) ≤ S2ε (X), then the proof is
completed. This claim is proved as follows.
• By definition of S2ε (X), we know that for any γ > 0, there exists X̃
and N such that for n > N,
1  n ) < S2ε (X) + γ  n  < 2ε.
R(X and X n − X
n
 n ) < S2ε (X) + γ,
• Let An := xn : PX n (xn ) > 0 . Since (1/n)R(X

 n )} < exp{n(S2ε (X) + γ)}.


|An | ≤ exp{R(X

Therefore,
1
lim sup log |An | ≤ S2ε (X) + γ.
n→∞ n

• Also,
 n  = 2 sup |PX n (E) − P  n (E)|
2ε > X n − X X
E⊂X n
≥ 2|PX n (Acn ) − PX n (Acn )|
= 2PX n (Acn ), (sincePX n (Acn ) = 0).

Hence, lim supn→∞ PX n (Acn ) ≤ ε.


• Since S2ε (X) + γ is just one of the rates that satisfy the condition of
the minimum ε-source compression rate, and Tε (X) is the smallest
one of such rates,

Tε (X) ≤ S2ε (X) + γ for any γ > 0.

2. T (X) ≥ S(X).
Similarly, if we can show that, for any ε fixed, Tε (X) ≥ S3ε (X), then the
proof is completed. This claim can be proved as follows.

• Fix α > 0. By definition of Tε (X), we know that for any γ > 0, there
exists N and a sequence of sets {An }∞
n=1 such that for n > N,

1
log |An | < Tε (X) + γ and PX n (Acn ) < ε + α.
n
• Choose Mn to satisfy

exp{n(Tε (X) + 2γ)} ≤ Mn ≤ exp{n(Tε (X) + 3γ)}. (3.4.1)

Also select one element xn0 from Acn . Define a new random variable
 n as follows:
X

 0, if xn ∈ {xn0 } ∪ An ;
n n
PX n (x ) = k(x )
 , if xn ∈ {xn0 } ∪ An ,
Mn
where  n
 Mn PX
n (x ), if xn ∈ An ;
k(xn ) := k(xn ), if xn = xn0 .
 Mn −
xn ∈An

 n satisfies the next four properties:


It can then be easily verified that X
(a) X n is Mn -type;
(b) PX n (xn0 ) ≤ PX n (Acn ) < ε + α, since xn0 ∈ Acn ;

(c) for all xn ∈ An ,
  n
P  n (xn ) − PX n (xn ) = PX n (xn ) − Mn PX n (x ) ≤ 1 .
X
Mn Mn
(d) PX n (An ) + PX n (xn0 ) = 1.
• Consequently,
1  n ) ≤ Tε (X) + 3γ, (by (3.4.1))
R(X
n
and
    
 n =
X n − X P  n (xn ) − PX n (xn ) + P  n (xn ) − PX n (xn )
X X 0 0
xn ∈An
  
+ P  n (xn ) − PX n (xn )
X
xn ∈Acn \{xn
0}
  
≤ P  n (xn ) − PX n (xn ) + P  n (xn ) + PX n (xn )
X X 0 0
xn ∈An
  
+ P  n (xn ) − PX n (xn )
X
xn ∈Acn \{xn
0}
 1
≤ + PX n (xn0 ) + PX n (xn0 )
xn ∈An
M n

+ PX n (xn )
xn ∈Acn \{xn
0}

|An | 
= + PX n (xn0 ) + PX n (xn )
Mn xn ∈Ac n

exp{n(Tε (X) + γ)}


≤ + (ε + α) + PX n (Acn )
exp{n(Tε (X) + 2γ)}
≤ e−nγ + (ε + α) + (ε + α)
≤ 3(ε + α), for n ≥ − log(ε + α)/γ.

• Since Tε (X) is just one of the rates that satisfy the condition of 3(ε +
α)-resolvability, and S3(ε+α) (X) is the smallest one of such quantities,
S3(ε+α) (X) ≤ Tε (X).
The proof is completed by noting that α can be made arbitrarily
small. 2

This theorem tells us that the minimum source compression ratio for fixed-
length code is the resolvability, which in turn is equal to the sup-entropy rate.

Theorem 3.24 (equality of mean-resolvability and minimum source
coding rate for variable-length codes)
1
T̄ (X) = S̄(X) = lim sup H(X n ).
n→∞ n

Proof: Equality of S̄(X) and lim supn→∞ (1/n)H(X n ) is already given in The-
orem 3.19.

1. S̄(X) ≤ T̄ (X).
Definition 3.22 states that there exists, for all γ > 0 and all sufficiently
large n, an error-free variable-length code whose average codeword length
n satisfies
1
n < T̄ (X) + γ.
n
Moreover, the fundamental source coding lower bound for a uniquely de-
codable code (cf. [2, Thm. 3.22]) is

H(X n ) ≤ n.

 n  = 0 and
Thus, by letting X̃ = X, we obtain X n − X
1  n ) = 1 H(X n ) ≤ 1
H(X n < T̄ (X) + γ,
n n n
which concludes that T̄ (X) is an ε-achievable mean-resolution rate of X
for any ε > 0, i.e.,
S̄(X) = lim S̄ε (X) ≤ T̄ (X).
ε→0

2. T̄ (X) ≤ S̄(X).
Observe that S̄ε (X) ≤ S̄(X) for 0 < ε < e−1 ≈ 0.36788. Hence, by taking
γ satisfying ε log |X | < γ < 2ε log |X | and for all sufficiently large n, there
exists X n such that
1  n ) < S̄(X) + γ
H(X
n
and
X n − X  n  < ε. (3.4.2)
On the other hand, Theorem 3.27 in [2] proves the existence of an error-free
prefix code for X n with average codeword length n satisfies

n ≤ H(X n ) + log(2) (nats).

From Lemma 3.18 that states

 n )| ≤ X n − X
 n  · log |X |n |X |n
|H(X n ) − H(X ≤ ε log ,
X n − X n ε

where the last inequality holds because t log(1/t) is increasing for 0 < t <
e−1 , we obtain
1 1 1
n ≤ H(X n ) + log(2)
n n n
1  n ) + ε log |X | − 1 ε log(ε) + 1 log(2)
≤ H(X
n n n
1 1
≤ S̄(X) + γ + ε log |X | − ε log(ε) + log(2)
n n
≤ S̄(X) + 2γ,
if n > (log(2) − ε log(ε))/(γ − ε log |X |). Since γ can be made arbitrarily
small, S̄(X) is an achievable source compression rate for variable-length
codes; and hence,
T̄ (X) ≤ S̄(X).
2

Again, the above theorem tells us that the minimum source compression rate
for variable-length codes is the mean-resolvability, and the mean-resolvability is
exactly lim supn→∞ (1/n)H(X n ).
Note that lim supn→∞ (1/n)H(X n ) ≤ H̄(X), which follows straightforwardly
from the fact that the mean of the random variable −(1/n) log PX n (X n ) cannot
asymptotically exceed the right endpoint H̄(X) of its limiting spectrum. Also note
that for stationary-ergodic sources, all of these quantities are equal, i.e.,
$$T(X) = S(X) = \bar H(X) = \bar T(X) = \bar S(X) = \limsup_{n\to\infty} \frac{1}{n} H(X^n).$$

We end this chapter by computing these quantities for a specific example.

Example 3.25 Consider a binary random source X1 , X2 , . . . where {Xi }∞


i=1 are
independent random variables with individual distribution
PXi (0) = Zi and PXi (1) = 1 − Zi ,
where {Zi }∞
i=1 are pair-wise independent with common uniform marginal distri-
bution over (0, 1).
You may imagine that the source is formed by selecting from infinitely many
binary number generators as shown in Figure 3.1. The selecting process {Zi }∞
i=1
is independent for each time instance.

Figure 3.1: Source generator: {Xt }0<t<1 is an independent random
process with PXt (0) = t and PXt (1) = 1 − t, and is also independent
of the selector Z, where Xt is output if Z = t. The source generator at
each time instant is independent temporally.

It can be shown that such a source is not stationary. Nevertheless, by an
argument similar to that of the AEP theorem, we can show that
$$-\frac{\log P_X(X_1) + \log P_X(X_2) + \cdots + \log P_X(X_n)}{n} \to h_b(Z) \quad \text{in probability},$$
where hb (a) := −a log2 (a) − (1 − a) log2 (1 − a) is the binary entropy function.
To compute the ultimate average entropy rate in terms of the random variable
hb (Z), we require that
$$-\frac{\log P_X(X_1) + \log P_X(X_2) + \cdots + \log P_X(X_n)}{n} \to h_b(Z) \quad \text{in mean},$$
which is a stronger result than convergence in probability. By the fundamen-
tal properties of convergence, convergence in probability implies convergence in
mean provided the sequence of random variables is uniformly integrable, which

is true for $-\frac{1}{n}\sum_{i=1}^n \log P_X(X_i)$ since
$$
\begin{aligned}
\sup_{n>0} E\left[ \left| \frac{1}{n}\sum_{i=1}^n \log P_X(X_i) \right| \right]
&\le \sup_{n>0} \frac{1}{n}\sum_{i=1}^n E\left[ |\log P_X(X_i)| \right] \\
&= \sup_{n>0} E\left[ |\log P_X(X)| \right], \quad \text{because of the i.i.d.\ of } \{X_i\}_{i=1}^n \\
&= E\left[ |\log P_X(X)| \right]
 = E\Big[ E\big[ |\log P_X(X)| \,\big|\, Z \big] \Big] \\
&= \int_0^1 E\big[ |\log P_X(X)| \,\big|\, Z = z \big]\, dz \\
&= \int_0^1 \big[ z\,|\log(z)| + (1-z)\,|\log(1-z)| \big]\, dz \\
&\le \int_0^1 \log(2)\, dz = \log(2).
\end{aligned}
$$

We therefore have
$$
\left| \frac{1}{n} H(X^n) - E[h_b(Z)] \right|
= \left| E\left[ -\frac{1}{n}\log P_{X^n}(X^n) \right] - E[h_b(Z)] \right|
\le E\left[ \left| -\frac{1}{n}\log P_{X^n}(X^n) - h_b(Z) \right| \right] \to 0 \ \text{ as } n\to\infty.
$$

Consequently,
$$
\begin{aligned}
\limsup_{n\to\infty} \frac{1}{n} H(X^n) = E[h_b(Z)]
&= -\int_0^1 \big[ z\log(z) + (1-z)\log(1-z) \big]\, dz \\
&= -\int_0^1 2z\log(z)\, dz
 = -2\left[ \frac{z^2}{2}\log(z) - \frac{z^2}{4} \right]_0^1 \\
&= \frac{1}{2} \ \text{nats, or } \ \frac{1}{2\log(2)} \approx 0.72135 \ \text{bits}.
\end{aligned}
$$

However, it can be shown that the ultimate cumulative distribution function of


−(1/n) log PX n (X n ) is Pr[hb (Z) ≤ t] for t ∈ [0, log(2)] (cf. Figure 3.2).

Figure 3.2: The ultimate CDF of −(1/n) log PX n (X n ): Pr{hb (Z) ≤ t},
supported on [0, log(2)] nats.

The sup-entropy rate of X is log(2) nats or 1 bit (the right endpoint of
the ultimate CDF of −(1/n) log PX n (X n )). Hence, for this non-stationary source,
the minimum compression rates for fixed-length codes and variable-length codes
are different: they are 1 bit and 0.72135 bits per source letter, respectively.
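The numbers in Example 3.25 can be illustrated by simulation. The C sketch below adopts the reading that reproduces Figure 3.2 — a single selector value z is drawn uniformly for each realization of the source, after which the letters are conditionally i.i.d. with P{Xi = 0} = z — so that PX n (x n ) = ∫₀¹ z^{n0} (1 − z)^{n1} dz = n0! n1!/(n + 1)!, which the code evaluates via lgamma. Under this reading, the normalized self-information lands near hb (z) (spread over [0, 1] bit), while its average over realizations approaches 1/(2 log 2) ≈ 0.72135 bits. All parameter values below are arbitrary choices for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double unif(void) { return (rand() + 1.0) / (RAND_MAX + 2.0); }  /* uniform on (0,1) */

int main(void)
{
    const int n = 100000, trials = 200;
    double sum = 0.0;

    srand(2019);
    for (int t = 0; t < trials; t++) {
        double z = unif();                  /* selector value: P{X_i = 0} = z          */
        long n0 = 0;
        for (int i = 0; i < n; i++)
            if (unif() < z) n0++;           /* number of zeros in the realization x^n  */
        long n1 = n - n0;
        /* -(1/n) log2 P_{X^n}(x^n), with P_{X^n}(x^n) = n0! n1! / (n+1)!              */
        double info = -(lgamma(n0 + 1.0) + lgamma(n1 + 1.0) - lgamma(n + 2.0))
                      / (n * log(2.0));
        if (t < 3)
            printf("z = %.3f : %.4f bits  (h_b(z) = %.4f bits)\n", z, info,
                   -(z * log2(z) + (1 - z) * log2(1 - z)));
        sum += info;
    }
    printf("average over %d realizations = %.4f bits  (E[h_b(Z)] = 0.7213 bits)\n",
           trials, sum / trials);
    return 0;
}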

Chapter 4

Channel Coding Theorems and


Approximations of Output Statistics for
Arbitrary Channels

Shannon’s channel capacity [2] is usually derived under the assumption that the
channel is memoryless. With moderate modification of the proof, this result
can be extended to stationary-ergodic channels for which the capacity formula
becomes the maximization of the mutual information rate:
$$\lim_{n\to\infty} \sup_{X^n} \frac{1}{n} I(X^n; Y^n).$$

Yet, for more general channels, such as non-stationary or non-ergodic channels,


a more general expression for channel capacity needs to be derived. This general
formula will be introduced and established in this chapter.

4.1 General channel models

The channel transition probability in its most general form is denoted by {W n =


PY n |X n }∞
n=1 , which is abbreviated by W for convenience. Similarly, the input
and output random processes are respectively denoted by X and Y . Throughout
the text, we denote for convenience PX n ,Y n = PX n W n , where Y n is the output of
channel W n = PY n |X n under input X n . Please refer also to Section 1.3 for the
description of general channels.
Now, similar to the definitions of sup- and inf-entropy rates for sequence of

sources, the sup- and inf-(mutual-)information rates are respectively defined by1
$$\bar I(X; Y) := \sup\{\theta : \underline{i}(\theta) < 1\}$$
and
$$\underline{I}(X; Y) := \sup\{\theta : \bar i(\theta) \le 0\},$$
where
$$\underline{i}(\theta) := \liminf_{n\to\infty} \Pr\left[ \frac{1}{n} i_{X^n W^n}(X^n; Y^n) \le \theta \right]$$
is the inf-spectrum of the normalized information density,
$$\bar i(\theta) := \limsup_{n\to\infty} \Pr\left[ \frac{1}{n} i_{X^n W^n}(X^n; Y^n) \le \theta \right]$$
is the sup-spectrum of the normalized information density, and
$$i_{X^n W^n}(x^n; y^n) := \log \frac{P_{Y^n|X^n}(y^n|x^n)}{P_{Y^n}(y^n)}$$
is the information density.
In 1994, Verdú and Han [42] showed that the channel capacity in its most
general form is
$$C := \sup_{X} \underline{I}(X; Y).$$
In their proof, they showed the achievability part via Feinstein’s lemma for the
channel coding average error probability. More importantly, they provided a new
converse based on an error lower bound for multihypothesis testing, which can
be seen as a natural counterpart to the error upper bound afforded by Feinstein’s
lemma. In this chapter, we do not present Verdú and Han’s original proof of
the converse theorem. Instead, we will first derive and illustrate in Section 4.3 a
general lower bound on the minimum error probability of multihypothesis testing
[14].2 We then use a special case of the bound, which yields the so-called Poor-
Verdú bound [34], to complete the proof of the converse theorem.
1 In the paper of Verdú and Han [42], these two quantities are defined by
$$\bar I(X; Y) := \inf_{\beta\in\mathbb{R}}\left\{ \beta : (\forall\, \gamma > 0)\ \limsup_{n\to\infty} P_{X^n W^n}\!\left[ \frac{1}{n} i_{X^n W^n}(X^n; Y^n) > \beta + \gamma \right] = 0 \right\}$$
and
$$\underline{I}(X; Y) := \sup_{\alpha\in\mathbb{R}}\left\{ \alpha : (\forall\, \gamma > 0)\ \limsup_{n\to\infty} P_{X^n W^n}\!\left[ \frac{1}{n} i_{X^n W^n}(X^n; Y^n) < \alpha + \gamma \right] = 0 \right\}.$$
The above definitions are in fact equivalent to ours.
2
We refer the reader to [40] for other tight characterizations of the error probability of
multihypothesis testing.

4.2 Channel coding and Feinstein’s lemma

Definition 4.1 (fixed-length data transmission code) An (n, M) fixed-


length data transmission code for channel input alphabet X n and output alpha-
bet Y n consists of:

1. M messages intended for transmission;


2. an encoding function
f : {1, 2, . . . , M} → X n ;

3. a decoding function
g : Y n → {1, 2, . . . , M},
which is (usually) a deterministic rule that assigns a guess to each possible
received vector.

The channel inputs in {xn ∈ X n : xn = f (m) for some 1 ≤ m ≤ M} are the


codewords of the data transmission code.

Definition 4.2 (average probability of error) The average probability of


error for a ∼Cn = (n, M) code with encoder f (·) and decoder g(·) transmitted
over channel W n = PY n |X n is defined as

Pe (∼Cn ) = (1/M) Σ_{i=1}^{M} λi ,
where
λi := Σ_{y^n ∈ Y^n : g(y^n) ≠ i} PY n |X n (y^n | f(i)).

We assume that the message set (of size M) is governed by a uniform distri-
bution. Thus, under the average probability of error criterion, all codewords are
treated equally (having a uniform prior distribution).
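To make Definitions 4.1 and 4.2 concrete, the C sketch below evaluates Pe exactly for a toy ∼Cn = (3, 2) repetition code over a memoryless binary symmetric channel with crossover probability p, decoded by majority vote; the channel and code are illustrative choices, not taken from the text, and the exact answer is 3p²(1 − p) + p³.

#include <stdio.h>

/* Exact average error probability of the (3,2) repetition code over a
 * BSC(p) with majority-vote decoding, by enumerating all 8 outputs.    */
int main(void)
{
    const double p = 0.1;                    /* crossover probability        */
    const int codeword[2][3] = {{0,0,0}, {1,1,1}};
    double pe = 0.0;

    for (int m = 0; m < 2; m++) {            /* transmitted message          */
        double lambda = 0.0;                 /* error prob. given message m  */
        for (int y = 0; y < 8; y++) {        /* all output triples           */
            int bit[3] = {y & 1, (y >> 1) & 1, (y >> 2) & 1};
            double prob = 1.0;
            int ones = 0;
            for (int i = 0; i < 3; i++) {
                prob *= (bit[i] == codeword[m][i]) ? 1 - p : p;
                ones += bit[i];
            }
            int decoded = (ones >= 2);       /* majority vote                */
            if (decoded != m) lambda += prob;
        }
        pe += lambda / 2.0;                  /* uniform prior on messages    */
    }
    printf("Pe = %.6f (formula: %.6f)\n", pe, 3*p*p*(1-p) + p*p*p);
    return 0;
}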

Definition 4.3 (channel capacity C) The channel capacity C is the supre-


mum of all the rates R for which there exists a sequence of ∼Cn = (n, Mn )
channel block codes such that
1
lim inf log Mn ≥ R,
n→∞ n

and
lim sup Pe (∼Cn ) = 0.
n→∞

Lemma 4.4 (Feinstein’s Lemma) Fix a positive integer n. For every γ > 0 and
input distribution PX n on X n , there exists an (n, M) block code ∼Cn for the transi-
tion probability W n = PY n |X n whose average error probability Pe (∼Cn ) satisfies
$$P_e({\sim}C_n) \le \Pr\left[ \frac{1}{n} i_{X^n W^n}(X^n; Y^n) < \frac{1}{n}\log M + \gamma \right] + e^{-n\gamma}.$$

Proof:

Step 1: Notations. Define


 
n n n n 1 n n 1
G := (x , y ) ∈ X × Y : iX n W n (x ; y ) ≥ log M + γ .
n n

Let ν := e−nγ + PX n W n (G c ). Feinstein’s Lemma obviously holds if ν ≥ 1,


because then
 
1 1
Pe (∼Cn ) ≤ 1 ≤ ν := Pr iX n W n (X ; Y ) < log M + γ + e−nγ .
n n
n n

So we assume ν < 1, which immediately results in

PX n W n (G c ) < ν < 1,

or equivalently,
PX n W n (G) > 1 − ν > 0. (4.2.1)
Therefore, denoting

A := {xn ∈ X n : PY n |X n (Gxn |xn ) > 1 − ν}

with Gxn := {y n ∈ Y n : (xn , y n ) ∈ G}, we have

PX n (A) > 0,

because if PX n (A) = 0,

(∀ xn with PX n (xn ) > 0) PY n |X n (Gxn |xn ) ≤ 1 − ν



⇒ PX n (xn )PY n |X n (Gxn |xn ) = PX n W n (G) ≤ 1 − ν,
xn ∈X n

and a contradiction to (4.2.1) is obtained.

Step 2: Encoder. Choose an xn1 in A (Recall that PX n (A) > 0.) Define Γ1 =
Gxn1 . (Then PY n |X n (Γ1 |xn1 ) > 1 − ν.)
Next choose, if possible, a point xn2 ∈ X n without replacement (i.e., xn2 can
be identical to xn1 ) for which

PY n |X n Gxn2 − Γ1  xn2 > 1 − ν,

and define Γ2 := Gxn2 − Γ1 .


Continue in the following way as for codeword i: choose xni to satisfy
-  /
. 
i−1
PY n |X n Gxni − Γj  xni > 1 − ν,

j=1

0i−1
and define Γi := Gxni − j=1 Γj .
Repeat the above codeword selecting procedure until either M codewords
are selected or all the points in A are exhausted.

Step 3: Decoder. Define the decoding rule as



n i, if y n ∈ Γi
φ(y ) =
arbitrary, otherwise.

Step 4: Probability of error. For all selected codewords, the error probabi-
lity given codeword i is transmitted, λe|i , satisfies

λe|i ≤ PY n |X n (Γci |xni ) < ν.

(Note that (∀ i) PY n |X n (Γi |xni ) ≥ 1−ν by Step 2.) Therefore, if we can show
that the above codeword selecting procedures will not terminate before M,
then
1 
M
Pe (∼Cn ) = λe|i < ν.
M i=1

Step 5: Claim. The codeword selecting procedure in Step 2 will not terminate
before M.
Proof: We will prove it by contradiction.
Suppose the above procedure terminates before M, say at N < M. Define
the set
.
N
F := Γi ∈ Y n .
i=1

Consider the probability
PX n W n (G) = PX n W n [G ∩ (X n × F )] + PX n W n [G ∩ (X n × F c )]. (4.2.2)

Since for any y n ∈ Gxni ,


PY n |X n (y n |xni )
PY n (y n ) ≤ ,
M · enγ
we have
PY n (Γi ) ≤ PY n (Gxni )
1 −nγ
≤ e PY n |X n (Gxni |xni )
M
1 −nγ
≤ e .
M
So the first term of the right hand side in (4.2.2) can be upper bounded by
PX n W n [G ∩ (X n × F )] ≤ PX n W n (X n × F )
= PY n (F )
N
= PY n (Γi )
i=1
1 −nγ N −nγ
≤ N× e = e .
M M
As for the second term of the right hand side in (4.2.2), we can upper
bound it by

PX n W n [G ∩ (X n × F c )] = PX n (xn )PY n |X n (Gxn ∩ F c |xn )
xn ∈X n
-  /
 .
N 

= PX n (xn )PY n |X n Gxn − Γi  xn

xn ∈X n i=1

≤ PX n (xn ) · (1 − ν) ≤ 1 − ν,
xn ∈X n

where the last step follows since for all xn ∈ X n ,


-  /
.N 

PY n |X n Gxn − Γi  xn ≤ 1 − ν.

i=1

(Because otherwise we could find the (N + 1)-th codeword.)


Consequently, PX n W n (G) ≤ (N/M)e−nγ + 1 − ν. By definition of G, we
obtain
N −nγ
PX n W n (G) = 1 − ν + e−nγ ≤ e + 1 − ν,
M
which implies N ≥ M, resulting in a contradiction. 2

4.3 Error bounds for multihypothesis testing

We next introduce the generalized Poor-Verdú bound parameterized by θ ≥ 1.


Note that when θ = 1, this bound reduces to the original Poor-Verdú bound in
[34].

Lemma 4.5 (generalized Poor-Verdú bound [14]) Suppose X and Y are
random variables, where X takes values in a discrete (i.e., finite or countably
infinite) alphabet X = {x1 , x2 , x3 , . . .} and Y takes values in an arbitrary
alphabet Y. The minimum probability of error Pe in estimating X from Y
satisfies
$$P_e \ge (1-\alpha)\cdot P_{X,Y}\left\{ (x,y)\in\mathcal{X}\times\mathcal{Y} : P^{(\theta)}_{X|Y}(x|y) \le \alpha \right\} \qquad (4.3.1)$$
for each α ∈ [0, 1] and θ ≥ 1, where for each y ∈ Y,
$$P^{(\theta)}_{X|Y}(x|y) := \frac{\big(P_{X|Y}(x|y)\big)^\theta}{\sum_{x'\in\mathcal{X}} \big(P_{X|Y}(x'|y)\big)^\theta}, \qquad x\in\mathcal{X}, \qquad (4.3.2)$$
is the tilted distribution of PX|Y (·|y) with parameter θ.

Proof: Fix θ ≥ 1. We only provide the proof for 0 < α < 1 since the lower
bound trivially holds when α = 0 and α = 1.
It is known that the estimate e(Y ) of X from observing Y that minimizes
the error probability is the maximum a posteriori (MAP) estimate given by3
e(Y ) = arg max PX|Y (x|Y ). (4.3.3)
x∈X

Therefore, the error probability incurred in testing among the values of X is


given by
1 − Pe = Pr{X = e(Y )}
,   
= PX|Y (x|y) dPY (y)
Y {x : x=e(y)}
,  
= max PX|Y (x|y) dPY (y)
Y x∈X
,  
= max fx (y) dPY (y)
Y x∈X
 
= E max fx (Y ) ,
x∈X

3
Since randomization among those x’s that achieve maxx∈X PX|Y (x|y) results in the same
optimal error probability, we assume without loss of generality that e(y) is a deterministic
mapping that selects the maximizing x ∈ X of lowest index.

where fx (y) := PX|Y (x|y). Note that fx (y) satisfies
 
fx (y) = PX|Y (x|y) = 1.
x∈X x∈X

For a fixed y ∈ Y, let hj (y) be the j-th element in the set


{fx1 (y), fx2 (y), fx3 (y), . . .}
such that its elements are listed in non-increasing order; i.e.,
h1 (y) ≥ h2 (y) ≥ h3 (y) ≥ · · ·
and
{h1 (y), h2(y), h3(y), . . .} = {fx1 (y), fx2 (y), fx3 (y), . . .}.
Then
1 − Pe = E[h1 (Y )]. (4.3.4)
(θ) (θ)
Furthermore, for each hj (y) above, define hj (y) such that hj (y) is the respec-
tive element for hj (y), satisfying
(θ) (θ)
hj (y) = fxj (y) = PX|Y (xj |y) ⇔ hj (y) = PX|Y (xj |y).

Since h1 (y) is the largest among {hj (y)}j≥1, we note that


hθ1 (y) 1
h1 (y) = 
(θ)
θ
= 
j≥1 hj (y) 1 + j≥2[hj (y)/h1(y)]θ
is non-decreasing in θ for each y; this implies that
(θ)
h1 (y) ≥ h1 (y) for θ ≥ 1 and y ∈ Y. (4.3.5)

For any α ∈ (0, 1), we can write



(θ)
PX,Y (x, y) ∈ X × Y : PX|Y (x|y) > α
, 
(θ)
= PX|Y x ∈ X : PX|Y (x|y) > α dPY (y)
Y
, - ∞  
/
(θ)
= hj (y) · 1 hj (y) > α dPY (y)
Y j=1
,  
(θ)
≥ h1 (y) · 1 h1 (y) > α dPY (y)
,Y
≥ h1 (y) · 1(h1 (y) > α)dPY (y)
Y
= E[h1 (Y ) · 1(h1 (Y ) > α)], (4.3.6)

where 1(·) is the indicator function4 and the second inequality follows from
(4.3.5).
To complete the proof, we next relate E[h1 (Y )·1(h1 (Y ) > α)] with E[h1 (Y )],
which is exactly 1 − Pe . For any α ∈ (0, 1) and any random variable U with
Pr{0 ≤ U ≤ 1} = 1, the following inequality holds with probability one:

U ≤ α + (1 − α) · U · 1(U > α).

This can be easily proved by upper-bounding U in terms of α when 0 ≤ U ≤ α,


and α + (1 − α)U, otherwise. Thus

E[U] ≤ α + (1 − α)E[U · 1(U > α)].

Applying the above inequality to (4.3.6) by setting U = h1 (Y ), we obtain



(θ)
(1 − α)PX,Y (x, y) ∈ X × Y : PX|Y (x|y) > α
≥ E[h1 (Y )] − α
= (1 − Pe ) − α
= (1 − α) − Pe ,

where the first equality follows from (4.3.4). This completes the proof. 2
We next show that if the MAP estimate e(Y ) of X from Y is almost surely
unique in (4.3.3), then the bound of Lemma 4.5, without the (1 − α) factor, is
tight in the limit of θ going to infinity.

Lemma 4.6 Consider two random variables X and Y , where X has a finite or
countably infinite alphabet X = {x1 , x2 , x3 , . . .} and Y has an arbitrary alphabet
Y. Assume that
PX|Y (e(y)|y) > max PX|Y (x|y) (4.3.7)
x∈X :x=e(y)

holds almost surely in PY , where e(y) is the MAP estimate from y as defined in
(4.3.3); in other words, the MAP estimate is almost surely unique in PY . Then,
the error probability in the MAP estimation of X from Y satisfies

(θ)
Pe = lim PX,Y (x, y) ∈ X × Y : PX|Y (x|y) ≤ α (4.3.8)
θ→∞

(θ)
for each α ∈ (0, 1), where the tilted distribution PX|Y (·|y) is given in (4.3.2) for
y ∈ Y.
4 (θ)
I.e., if h1 (y) > α is true, 1(·) = 1; else, it is zero.

(θ)
Proof: It can be easily verified from the definitions of hj (·) and hj (·) that the
following two limits hold for each y ∈ Y:
1 1
(θ)
lim h1 (y) = lim  θ
= ,
θ→∞ θ→∞ 1+ j≥2 [hj (y)/h1 (y)] (y)
where
(y) := max{j ∈ N : hj (y) = h1 (y)} (4.3.9)
and N := {1, 2, 3, . . .} is the set of positive integers, and
 
(θ)
lim hj (y) · 1 hj (y) > α
θ→∞
  
hj (y) · 1 (y)
1
>α for j = 1, · · · , (y);
= (4.3.10)
0 for j > (y).

As a result, we obtain that for any α ∈ (0, 1),



(θ)
lim PX,Y (x, y) ∈ X × Y : PX|Y (x|y) > α
θ→∞
, - ∞  
/
(θ)
= lim hj (y) · 1 hj (y) > α dPY (y)
θ→∞ Y j=1
, - /

∞  
(θ)
= lim hj (y) · 1 hj (y) > α dPY (y) (4.3.11)
Y θ→∞ j=1
 
, (y)  
 1
= h1 (y) · 1 > α  dPY (y), (4.3.12)
Y j=1
(y)
,  
1
= h1 (y) · (y) · 1 > α dPY (y),
Y (y)
where (4.3.11) holds by the dominated convergence theorem since
 
∞    ∞
 (θ) 
 hj (y) · 1 hj (y) > α  ≤ hj (y) = 1,
 
j=1 j=1

and (4.3.12) holds since the limit (in θ) of


 
(θ)
aθ,j := hj (y) · 1 hj (y) > α

exists for every j = 1, 2, · · · by (4.3.10), hence implying that



∞ 

lim aθ,j = lim aθ,j .
θ→∞ θ→∞
j=1 j=1

Now the condition in (4.3.7) is equivalent to
Pr[ (Y ) = 1] := PY {y ∈ Y : (y) = 1} = 1; (4.3.13)
thus,

(θ)
lim PX,Y (x, y) ∈ X × Y : PX|Y (x|y) > α
θ→∞
,
= h1 (y) · 1 (1 > α) dPY (y)
Y
= E[h1 (Y )] = 1 − Pe , (4.3.14)
where (4.3.14) follows from (4.3.4). This immediately yields that for 0 < α < 1,

(θ)
Pe = 1 − lim PX,Y (x, y) ∈ X × Y : PX|Y (x|y) > α
θ→∞

(θ)
= lim PX,Y (x, y) ∈ X × Y : PX|Y (x|y) ≤ α .
θ→∞
2
The following two examples illustrate that the condition specified in (4.3.7)
holds for certain situations and hence the generalized Poor-Verdú bound can be
made arbitrarily close to the minimum probability of error Pe by adjusting θ and
α.
Example 4.7 (binary erasure channel) Suppose that X and Y are respec-
tively the channel input and channel output of a binary erasure channel (BEC)
with erasure probability ε, where X = {0, 1} and Y = {0, 1, E}, and
$$P_{Y|X}(y|x) = \begin{cases} 1-\varepsilon, & y = x \in \{0,1\}; \\ \varepsilon, & y = E. \end{cases}$$
Let PX (0) = 1 − p = 1 − PX (1) with 0 < p < 1/2. Then, the MAP estimate of
X from Y is given by
$$e(y) = \begin{cases} y & \text{if } y \in \{0,1\}, \\ 0 & \text{if } y = E, \end{cases}$$
and the resulting minimum probability of error is Pe = εp.
Calculating bound (4.3.1) of Lemma 4.5 yields
$$
(1-\alpha)\, P_{X,Y}\left\{ (x,y)\in\mathcal{X}\times\mathcal{Y} : P^{(\theta)}_{X|Y}(x|y) \le \alpha \right\}
= \begin{cases}
0, & 0 \le \alpha < \dfrac{p^\theta}{p^\theta + (1-p)^\theta}; \\[2ex]
\varepsilon p\,(1-\alpha), & \dfrac{p^\theta}{p^\theta + (1-p)^\theta} \le \alpha < \dfrac{(1-p)^\theta}{p^\theta + (1-p)^\theta}; \\[2ex]
\varepsilon\,(1-\alpha), & \dfrac{(1-p)^\theta}{p^\theta + (1-p)^\theta} \le \alpha < 1.
\end{cases}
\qquad (4.3.15)
$$
Thus, taking θ ↑ ∞ and then α ↓ 0 in (4.3.15) results in the exact error proba-
bility εp. Note that in this example, the original Poor-Verdú bound (i.e., with
θ = 1) also achieves the minimum probability of error εp by choosing α = 1 − p;
however this maximizing choice of α = 1 − p for the original bound is a function
of system’s statistics (here, the input distribution p) which may be undesirable.
On the other hand, the generalized bound (4.3.1) can herein achieve its peak by
systematically taking θ ↑ ∞ and then letting α ↓ 0.
Furthermore, since ℓ(y) = 1 for every y ∈ {0, 1, E}, we have that (4.3.7)
holds; hence, by Lemma 4.6, (4.3.8) yields that for 0 < α < 1,
$$P_e = \lim_{\theta\to\infty} P_{X,Y}\left\{ (x,y)\in\mathcal{X}\times\mathcal{Y} : P^{(\theta)}_{X|Y}(x|y) \le \alpha \right\} = \varepsilon p,$$
where the last equality follows directly from (4.3.15) without the (1 − α) factor.
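The piecewise expression (4.3.15) is simple enough to evaluate directly. The C sketch below (with the illustrative values p = 0.2, ε = 0.3, α = 10⁻³, none taken from the text) shows the bound reaching Pe = εp once θ is large enough at a fixed small α.

#include <stdio.h>
#include <math.h>

/* Bound (4.3.15) for the BEC example: P_X(0) = 1-p, erasure probability eps. */
static double bound(double p, double eps, double theta, double alpha)
{
    double lo = pow(p, theta) / (pow(p, theta) + pow(1 - p, theta));
    double hi = pow(1 - p, theta) / (pow(p, theta) + pow(1 - p, theta));
    if (alpha < lo) return 0.0;
    if (alpha < hi) return eps * p * (1 - alpha);
    return eps * (1 - alpha);
}

int main(void)
{
    const double p = 0.2, eps = 0.3;         /* illustrative values; Pe = eps*p */
    const double alpha = 1e-3;
    for (double theta = 1; theta <= 64; theta *= 4)
        printf("theta = %4.0f : bound = %.6f (Pe = %.6f)\n",
               theta, bound(p, eps, theta, alpha), eps * p);
    return 0;
}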

Example 4.8 (binary-input additive Gaussian noise channel) We herein


consider an example with a continuous observation alphabet Y = R, where R is
the set of real numbers. Specifically, let the observation be given by Y = X + N,
where X is uniformly distributed over X = {−1, +1} and N is a zero-mean
Gaussian random variable with variance σ 2 . Assuming that X and N are inde-
pendent of each other, then for x ∈ {−1, +1} and y ∈ R,
$$P_{X|Y}(x|y) = \frac{1}{1 + \exp\{-2xy/\sigma^2\}}, \qquad (4.3.16)$$
which directly yields a MAP estimate of X from Y given by
$$e(y) = \begin{cases} +1, & y > 0; \\ -1, & y < 0; \\ \text{arbitrary}, & y = 0, \end{cases}$$
with a resulting error probability of Pe = Φ(−1/σ), where
$$\Phi(z) := \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$$
is the cdf of the standard (zero-mean, unit-variance) Gaussian distribution.


Furthermore, since x ∈ {−1, +1}, we have
$$P^{(\theta)}_{X|Y}(x|y) = \frac{1}{1 + \exp\{-2xy/(\sigma^2/\theta)\}},$$
and the generalized Poor-Verdú bound (4.3.1) yields
$$
\begin{aligned}
P_e &\ge (1-\alpha)\, P_{X,Y}\left\{ (x,y)\in\mathcal{X}\times\mathcal{Y} : P^{(\theta)}_{X|Y}(x|y) \le \alpha \right\} \\
&= (1-\alpha) \int_{-\infty}^{-\frac{\sigma^2}{2\theta}\log\left(\frac{1}{\alpha}-1\right)-1} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{t^2}{2\sigma^2} \right\} dt \\
&= (1-\alpha)\,\Phi\!\left( -\frac{\sigma}{2\theta}\log\left(\frac{1}{\alpha}-1\right) - \frac{1}{\sigma} \right). \qquad (4.3.17)
\end{aligned}
$$

Now taking the limits θ ↑ ∞ followed by α ↓ 0 for the right-hand side


term in (4.3.17) yields exactly Φ (−1/σ) = Pe ; hence the generalized Poor-Verdú
bound (4.3.1) is asymptotically tight as a consequence of the validity of condition
(4.3.7).
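Expression (4.3.17) can likewise be evaluated numerically. The C sketch below uses Φ(z) = erfc(−z/√2)/2 and the illustrative values σ = 1 and α = 10⁻³ (arbitrary choices), showing the bound climbing toward Pe = Φ(−1/σ) ≈ 0.1587 as θ increases.

#include <stdio.h>
#include <math.h>

static double Phi(double z) { return 0.5 * erfc(-z / sqrt(2.0)); }   /* standard normal CDF */

int main(void)
{
    const double sigma = 1.0, alpha = 1e-3;
    const double pe = Phi(-1.0 / sigma);     /* exact MAP error probability */

    for (double theta = 1; theta <= 1000; theta *= 10) {
        double arg = -sigma / (2.0 * theta) * log(1.0 / alpha - 1.0) - 1.0 / sigma;
        double bnd = (1.0 - alpha) * Phi(arg);                        /* bound (4.3.17) */
        printf("theta = %6.0f : bound = %.6f (Pe = %.6f)\n", theta, bnd, pe);
    }
    return 0;
}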

We close this section by applying the Poor-Verdú bound to the decoding of


an (n, M) block code used over the channel W n .

Corollary 4.9 Every ∼Cn = (n, M) code satisfies
$$P_e({\sim}C_n) \ge \left(1 - e^{-n\gamma}\right) \Pr\left[ \frac{1}{n} i_{X^n W^n}(X^n; Y^n) \le \frac{1}{n}\log M - \gamma \right]$$
for every γ > 0, where X n places probability mass 1/M on each codeword, and
Pe (∼Cn ) denotes the error probability of the code.

Proof: Taking α = e−nγ and θ = 1 in Lemma 4.5, and replacing X and Y in


Lemma 4.5 by its n-fold counterparts, i.e., X n and Y n , we obtain
 
Pe (∼Cn ) ≥ 1 − e−nγ PX n W n (xn , y n) ∈ X n × Y n : PX n |Y n (xn |y n) ≤ e−nγ
 
−nγ n n n n PX n |Y n (xn |y n) e−nγ
= 1−e PX n W n (x , y ) ∈ X × Y : ≤
1/M 1/M
 n n 
−nγ n n n n PX n |Y n (x |y ) e−nγ
= 1−e PX n W n (x , y ) ∈ X × Y : ≤
PX n (xn ) 1/M
−nγ n n n n
= 1−e PX n W n [(x , y ) ∈ X × Y :

1 PX n |Y n (xn |y n ) 1
log ≤ log M − γ
n PX n (xn ) n
 
−nγ 1 n n 1
= 1−e Pr iX n W n (X ; Y ) ≤ log M − γ .
n n
2

4.4 Capacity formulas for general channels

In this section, the general formulas for several operational notions of channel
capacity are derived. The units of these capacity quantities are in nats/channel
use (assuming that the natural logarithm is used).

Definition 4.10 (ε-achievable rate) Fix ε ∈ [0, 1]. R ≥ 0 is an ε-achievable


rate if there exists a sequence of ∼Cn = (n, Mn ) channel block codes such that
1
lim inf log Mn ≥ R
n→∞ n
and
lim sup Pe (∼Cn ) ≤ ε.
n→∞

Definition 4.11 (ε-capacity Cε ) Fix ε ∈ [0, 1]. The supremum of ε-achievable


rates is called the ε-capacity, Cε .

It is straightforward from the definition that Cε is non-decreasing in ε, and


C1 = log |X |.

Observation 4.12 (capacity C) Note that the channel capacity C is equal to the
supremum of the rates that are ε-achievable for all ε ∈ [0, 1]:5
$$C = \inf_{0\le\varepsilon\le 1} C_\varepsilon = \lim_{\varepsilon\downarrow 0} C_\varepsilon = C_0.$$

Definition 4.13 (strong capacity CSC ) Define the strong converse capacity
(or strong capacity) CSC as the infimum of the rates R such that for all ∼Cn =
(n, Mn ) channel block codes with
1
lim inf log Mn ≥ R,
n→∞ n
we have
lim inf Pe (∼Cn ) = 1.
n→∞

5
The claim of C0 = limε↓0 Cε can be proved by contradiction as follows.
Suppose C0 + 2γ < limε↓0 Cε for some γ > 0. For any positive integer j, and by definition
of C1/j , there exist Nj and a sequence of block codes ∼Cn = (n, Mn ) such that for n > Nj ,
(1/n) log Mn > C1/j − γ ≥ limε↓0 Cε − γ > C0 + γ and Pe ( ∼Cn ) < 2/j. Construct a sequence
of block codes ∼ Cn = (n, M̃n ) as: ∼
Cn = ∼Cn , if max1≤i≤j−1 Ni ≤ n < max1≤i≤j Ni . Then
lim supn→∞ (1/n) log M̃n ≥ C0 + γ and lim inf n→∞ Pe ( ∼ Cn ) = 0, which contradicts to the
definition of C0 .

Based on these definitions, general formulas for the above capacity notions
are established as follows.

Theorem 4.14 (ε-capacity) For 0 < ε < 1, the ε-capacity Cε for arbitrary
channels satisfies
$$C_\varepsilon = \sup_{X} \underline{I}_\varepsilon(X; Y).$$

Proof:
1. Cε ≥ supX I ε (X; Y ).
¯
Fix input X. It suffices to show the existence of ∼Cn = (n, Mn ) data
transmission code with rate
1 γ
I ε (X; Y ) − γ < log Mn < I ε (X; Y ) −
¯ n ¯ 2
and probability of decoding error satisfying
lim sup Pe (∼Cn ) ≤ ε
n→∞

for every γ > 0. (Because if such code exists, then lim inf n→∞ (1/n) log Mn ≥
I ε (X; Y ) − γ, which implies Cε ≥ I ε (X; Y ) − γ for arbitrarily small γ.)
¯ ¯
From Lemma 4.4, there exists an ∼Cn = (n, Mn ) code whose error probabi-
lity satisfies
 
1 n n 1 γ
Pe (∼Cn ) < Pr iX n W n (X ; Y ) < log Mn + + e−nγ/4
n n 4
  
1 n n γ γ
≤ Pr iX n W n (X ; Y ) < I ε (X; Y ) − + + e−nγ/4
n ¯ 2 4
 
1 n n γ
≤ Pr iX n W n (X ; Y ) < I ε (X; Y ) − + e−nγ/4 .
n ¯ 4
Since
   
1 n n
I ε (X; Y ) := sup R : lim sup Pr iW n W n (X ; Y ) ≤ R ≤ ε ,
¯ n→∞ n
we obtain
 
1 n n γ
lim sup Pr iX n W n (X ; Y ) < lim I ε (X; Y ) − ≤ ε.
n→∞ n δ↑ε ¯ 4
Hence, the proof of the direct part is completed by noting that
 
1 n n γ
lim sup Pe (∼Cn ) ≤ lim sup Pr iX n W n (X ; Y ) < I ε (X; Y ) −
n→∞ n→∞ n ¯ 4
+ lim sup e−nγ/4 .
n→∞
≤ ε.

2. Cε ≤ supX I ε (X; Y ).
¯
Suppose that there exists a sequence of ∼Cn = (n, Mn ) codes with rate
strictly larger than supX I ε (X; Y ) and lim supn→∞ Pe (∼Cn ) ≤ ε. Let the
¯
ultimate code rate for this code be supX I ε (X; Y ) + 3ρ for some ρ > 0.
¯
Then for sufficiently large n,
1
log Mn > sup I ε (X; Y ) + 2ρ.
n X ¯

Since the above inequality holds for every X, it certainly holds if taking
input X̂ n which places probability mass 1/Mn on each codeword, i.e.,
1
log Mn > I ε (X̂; Ŷ ) + 2ρ, (4.4.1)
n ¯
where Ŷ is the channel output due to channel input X̂. Then from Corol-
lary 4.9, the error probability of the code satisfies
 
−nρ 1 n n 1
Pe (∼Cn ) ≥ 1 − e Pr iX̂ n W n (X̂ ; Ŷ ) ≤ log Mn − ρ
n n
 
−nρ 1 n n
≥ 1−e Pr iX̂ n W n (X̂ ; Ŷ ) ≤ I ε (X̂; Ŷ ) + ρ ,
n ¯

where the last inequality follows from (4.4.1). Taking the limsup of both
sides, we have

ε ≥ lim sup Pe (∼Cn )


n→∞
 
1 n n
≥ lim sup Pr iX̂ n W n (X̂ ; Ŷ ) ≤ I ε (X̂; Ŷ ) + ρ > ε,
n→∞ n ¯
and a desired contradiction is obtained. 2

Theorem 4.15 (general channel capacity) The channel capacity C for an ar-
bitrary channel satisfies
$$C = \sup_{X} \underline{I}(X; Y).$$

Proof: Observe that

C = C0 = lim Cε = lim sup I ε (X; Y ).


ε↓0 ε↓0 X ¯

Hence, from Theorem 4.14, we note that

C = lim sup I ε (X; Y ) ≥ lim sup I (X; Y ) = sup I (X; Y ).


ε↓0 X ¯ ε↓0 X ¯ X ¯

It remains to show that C ≤ supX I (X; Y ).
¯
Suppose that there exists a sequence of ∼Cn = (n, Mn ) codes with rate strictly
larger than supX I (X; Y ) and error probability tends to 0 as n → ∞. Let the
¯
ultimate code rate for this code be supX I (X; Y ) + 3ρ for some ρ > 0. Then for
¯
sufficiently large n,
1
log Mn > sup I (X; Y ) + 2ρ.
n X ¯
Since the above inequality holds for every X, it certainly holds if taking input
X̂ n which places probability mass 1/Mn on each codeword, i.e.,
1
log Mn > I (X̂; Ŷ ) + 2ρ, (4.4.2)
n ¯
where Ŷ is the channel output due to channel input X̂. Then from Corollary
4.9, the error probability of the code satisfies
 
−nρ 1 n n 1
Pe (∼Cn ) ≥ 1 − e Pr iX̂ n W n (X̂ ; Ŷ ) ≤ log Mn − ρ
n n
 
−nρ 1 n n
≥ 1−e Pr iX̂ n W n (X̂ ; Ŷ ) ≤ I (X̂; Ŷ ) + ρ , (4.4.3)
n ¯
where the last inequality follows from (4.4.2). Since, by assumption, Pe (∼Cn )
vanishes as n → ∞ but (4.4.3) cannot vanish by definition of I (X̂; Ŷ ), we
¯
obtain the desired contradiction. 2
We close this section by providing the general formula for strong capacity for
which the proof follows similar steps as the previous two theorems and hence we
omit it. Note that in the general formula for strong capacity, the sup-information
rate is used as opposed to the inf-information rate formula for channel capacity.

Theorem 4.16 (general strong capacity)
$$C_{SC} = \sup_{X} \bar I(X; Y).$$

4.5 Examples for the general capacity formulas

With the general capacity formulas shown in the previous section, we can now
compute them for some non-stationary or non-ergodic channels, and analyze
their properties.

Example 4.17 (capacity) Let the input and output alphabets be {0, 1}, and
let every output Yi be given by:
Yi = Xi ⊕ Ni ,

where “⊕” represents modulo-2 addition operation. Assume the input process
X and the noise process N are independent.
A general relation between the inf-information rate and the inf/sup-entropy rates
can be derived from (1.4.2) and (1.4.4) as follows:
$$\underline{H}(Y) - \bar H(Y|X) \le \underline{I}(X; Y) \le \bar H(Y) - \bar H(Y|X).$$
Since N n is completely determined from Y n under the knowledge of X n ,
$$\bar H(Y|X) = \bar H(N).$$
Indeed, this channel is symmetric [2]; thus a uniform input yields a uniform
output (i.e., a Bernoulli process with parameter 1/2), and H̲(Y) = H̄(Y) =
log(2) nats. We thus have
$$C = \log(2) - \bar H(N).$$
We next compute the channel capacity for the following two noise cases.

Case A) If N is a non-stationary binary independent sequence with
$$\Pr\{N_i = 1\} = p_i,$$
then by the uniform boundedness (in i) of the variance of the random variable
− log PNi (Ni ), namely,
$$
\mathrm{Var}[-\log P_{N_i}(N_i)] \le E[(\log P_{N_i}(N_i))^2]
\le \sup_{0<p_i<1}\left[ p_i(\log p_i)^2 + (1-p_i)(\log(1-p_i))^2 \right] \le \log(2),
$$
we have by Chebyshev’s inequality that, as n → ∞,
$$\Pr\left\{ \left| -\frac{1}{n}\log P_{N^n}(N^n) - \frac{1}{n}\sum_{i=1}^{n} H(N_i) \right| > \gamma \right\} \to 0,$$
for any γ > 0. Therefore,
$$\bar H(N) = \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} h_b(p_i),$$
where hb (p) := −p log(p) − (1 − p) log(1 − p) is the binary entropy function.
Consequently,
$$C = \log(2) - \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} h_b(p_i).$$
This result is illustrated in Figure 4.1.

Figure 4.1: The ultimate CDFs of −(1/n) log PN n (N n ); their cluster
points lie between H̲(N) and H̄(N).
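For a concrete instance of Case A, suppose the crossover probabilities alternate between 0.1 and 0.4 (values chosen purely for illustration, not from the text). Then lim sup (1/n) Σ hb (pi ) is simply the average of the two binary entropies, and the capacity follows at once; the C sketch below computes it in bits.

#include <stdio.h>
#include <math.h>

static double hb(double p) { return -p * log2(p) - (1 - p) * log2(1 - p); }

int main(void)
{
    /* Case A with a period-2 crossover pattern alternating between 0.1 and 0.4. */
    const double p[2] = {0.1, 0.4};
    const int n = 1000000;                    /* large n approximates the lim sup */
    double avg = 0.0;

    for (int i = 1; i <= n; i++) avg += hb(p[i % 2]);
    avg /= n;

    printf("(1/n) sum h_b(p_i) = %.4f bits\n", avg);
    printf("C = 1 - %.4f = %.4f bits/channel use\n", avg, 1.0 - avg);
    return 0;
}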

Case B) If N has the same distribution as the source process in Example 3.25,
then H̄(N ) = log(2) nats, which yields a zero channel capacity.

Example 4.18 (strong capacity) Consider the same additive noise chan-
nel as in Example 4.17. Under a uniform input (in this case,
PX n (xn ) = PY n (y n ) = 2−n ), we have
$$
\begin{aligned}
\Pr\left[ \frac{1}{n}\log\frac{P_{Y^n|X^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \le \theta \right]
&= \Pr\left[ \frac{1}{n}\log P_{N^n}(N^n) - \frac{1}{n}\log P_{Y^n}(Y^n) \le \theta \right] \\
&= \Pr\left[ \frac{1}{n}\log P_{N^n}(N^n) \le \theta - \log(2) \right] \\
&= \Pr\left[ -\frac{1}{n}\log P_{N^n}(N^n) \ge \log(2) - \theta \right] \\
&= 1 - \Pr\left[ -\frac{1}{n}\log P_{N^n}(N^n) < \log(2) - \theta \right]. \qquad (4.5.1)
\end{aligned}
$$
We again consider the two cases in Example 4.17.

Case A) From (4.5.1), we directly have that
$$C_{SC} = \log(2) - \liminf_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} h_b(p_i).$$

Case B) From (4.5.1) and also from Figure 4.2, we obtain that
CSC = log(2).

In Figure 4.2, we plot the ultimate CDF of the channel’s normalized infor-
mation density for Case B. Recall that this limiting CDF is the spectrum of the
normalized information density.

[Figure 4.2: The ultimate CDF of the normalized information density for Example 4.18, Case B); it increases from 0 to 1 over the interval $[0, \log 2]$.]

Figure 4.2 indicates that the channel capacity is 0 and that the strong capacity is $\log 2$; the operational meanings of these two extreme values are thus determined. One may then naturally ask: what is the operational meaning of the function values between 0 and $\log 2$? The answer comes from Theorem 4.14, namely the ε-capacity. Indeed, in practice it may not be easy to design a block code that transmits information with (asymptotically) no error over a very noisy channel at a rate equal to the channel capacity. However, if we allow some errors during transmission, say an error probability bounded above by 0.001, we have a better chance of constructing a practical block code.

Example 4.19 (ε-capacity) Consider the channel in Case B of Example 4.17. Let the spectrum of the normalized information density be $i(\theta)$. Then the ε-capacity of this channel is given by the inverse of $i(\cdot)$, i.e.,
\[
C_\varepsilon = i^{-1}(\varepsilon).
\]
Note that the capacity can be written as
\[
C = \lim_{\varepsilon\downarrow 0} C_\varepsilon.
\]

In general, the strong capacity satisfies
\[
C_{SC} \ge \lim_{\varepsilon\uparrow 1} C_\varepsilon,
\]
since the strong capacity dictates the stronger condition $\liminf_{n\to\infty} P_e(\mathcal{C}_n) = 1$ (for codes with rates above $C_{SC}$), as opposed to the condition $\limsup_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon$ required for the ε-capacity. The above inequality, however, holds with equality in this example.
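The following sketch illustrates the relation $C_\varepsilon = i^{-1}(\varepsilon)$ numerically. The linear spectrum used here is only an assumed stand-in sharing the qualitative features of Figure 4.2 ($i(0) = 0$ and $i(\log 2) = 1$); it is not the exact spectrum of Case B.

# Minimal sketch: eps-capacity as the inverse of an assumed limiting spectrum i(theta).
import numpy as np

def i_spectrum(theta):
    return np.clip(theta / np.log(2), 0.0, 1.0)      # assumed continuous limiting CDF

def C_eps(eps, grid=np.linspace(0, np.log(2), 10001)):
    """eps-capacity = sup{theta : i(theta) <= eps}, i.e. i^{-1}(eps) here."""
    return grid[i_spectrum(grid) <= eps].max()

for eps in (0.001, 0.1, 0.5, 0.999):
    print(f"eps = {eps:5.3f}  ->  C_eps = {C_eps(eps):.4f} nats")
print("C  = lim_{eps->0} C_eps  ~", C_eps(1e-6))      # ~ 0, the capacity
print("sup_eps C_eps            ~", C_eps(1 - 1e-6))  # ~ log 2, the strong capacity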

4.6 Capacity and resolvability for channels

The channel capacity of a discrete memoryless channel is known to be
\[
C = \max_{P_X} I(P_X, Q_{Y|X}).
\]
Let $P_{\bar{X}}$ denote the optimizer of the above maximization. Then
\[
C = \max_{P_X} I(P_X, Q_{Y|X}) = I(P_{\bar{X}}, Q_{Y|X}).
\]
Here, the performance of the code is measured by the average error probability, namely
\[
P_e(\mathcal{C}_n) = \frac{1}{M}\sum_{i=1}^{M} P_e(\mathcal{C}_n\,|\,x_i^n),
\]
if the codebook is $\mathcal{C}_n := \{x_1^n, x_2^n, \ldots, x_M^n\}$. By the random coding argument, a deterministic good code with arbitrarily small error probability and with rate less than the channel capacity must exist. One can then ask: what is the relationship between a good code and the optimizer $P_{\bar{X}}$? It is widely believed that if the code is good (with rate close to capacity and low error probability), then the output statistics $P_{Y^n}$ due to the equally-likely code must approximate the output distribution $P_{\bar{Y}^n}$ due to the input distribution achieving the channel capacity.
This fact is actually reflected in the next theorem.

Theorem 4.20 ([26]) For any channel $W^n = (Y^n|X^n)$ with finite input alphabet and capacity $C$ that satisfies the strong converse (i.e., $C = C_{SC}$), the following statement holds.
Fix $\gamma > 0$ and a sequence of block codes $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^{\infty}$ with
\[
\frac{1}{n}\log M_n \ge C - \gamma/2
\]
and vanishing error probability (i.e., the error probability approaches zero as the blocklength $n$ tends to infinity). Then
\[
\frac{1}{n}\|Y^n - \bar{Y}^n\| \le \gamma \quad\text{for all sufficiently large } n,
\]
where $Y^n$ is the output due to the block code and $\bar{Y}^n$ is the output due to the input $\bar{X}^n$ that satisfies
\[
I(\bar{X}^n; \bar{Y}^n) = \max_{X^n} I(X^n; Y^n).
\]
To be specific,
\[
P_{Y^n}(y^n) = \sum_{x^n \in \mathcal{C}_n} P_{X^n}(x^n) P_{W^n}(y^n|x^n) = \frac{1}{M_n}\sum_{x^n \in \mathcal{C}_n} P_{W^n}(y^n|x^n)
\]
and
\[
P_{\bar{Y}^n}(y^n) = \sum_{x^n \in \mathcal{X}^n} P_{\bar{X}^n}(x^n) P_{W^n}(y^n|x^n).
\]

Note that the above theorem holds for arbitrary channels, not restricted to
only discrete memoryless channels.
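As a small numerical illustration of the quantity bounded in Theorem 4.20, the sketch below (a hypothetical two-codeword code over a BSC(0.1), not an example from the text) computes the variational distance, taken here as the sum of absolute differences, between the output distribution induced by an equally-likely codebook and the one induced by the capacity-achieving uniform input.

# Minimal sketch comparing code-induced and capacity-achieving output statistics (n = 3).
import itertools
import numpy as np

n, p = 3, 0.1
inputs = list(itertools.product([0, 1], repeat=n))

def W(y, x):                                   # memoryless BSC(p) transition probability
    d = sum(a != b for a, b in zip(x, y))
    return (p ** d) * ((1 - p) ** (n - d))

def output_dist(input_dist):                   # P_{Y^n}(y) = sum_x P_{X^n}(x) W(y|x)
    return {y: sum(input_dist[x] * W(y, x) for x in inputs) for y in inputs}

code = [(0, 0, 0), (1, 1, 1)]                  # hypothetical (n = 3, M = 2) codebook
P_code = {x: (1 / len(code) if x in code else 0.0) for x in inputs}
P_bar  = {x: 2.0 ** (-n) for x in inputs}      # capacity-achieving input for the BSC

PY_code, PY_bar = output_dist(P_code), output_dist(P_bar)
vd = sum(abs(PY_code[y] - PY_bar[y]) for y in inputs)
print("variational distance ||Y^n - Ybar^n|| =", vd)
print("normalized distance (1/n)||.||        =", vd / n)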
One may wonder whether a result in the spirit of the above theorem can be proved for the input statistics rather than the output statistics. The answer is negative. Hence, the statement that the input statistics of any good code must approximate those that maximize the mutual information is erroneously taken for granted. (However, we do not rule out the possibility of the existence of good codes that approximate those that maximize the mutual information.) To see this, simply consider the normalized entropy of $\bar{X}^n$ versus that of $\hat{X}^n$ (which is uniformly distributed over the codewords) for discrete memoryless channels:
\[
\frac{1}{n}H(\bar{X}^n) - \frac{1}{n}H(\hat{X}^n)
= \frac{1}{n}H(\bar{X}^n|\bar{Y}^n) + \frac{1}{n}I(\bar{X}^n;\bar{Y}^n) - \frac{1}{n}\log M_n
= H(\bar{X}|\bar{Y}) + I(\bar{X};\bar{Y}) - \frac{1}{n}\log M_n
= H(\bar{X}|\bar{Y}) + C - \frac{1}{n}\log M_n.
\]
A good code with vanishing error probability exists for $(1/n)\log M_n$ arbitrarily close to $C$; hence, we can find a good code sequence satisfying
\[
\lim_{n\to\infty}\left[\frac{1}{n}H(\bar{X}^n) - \frac{1}{n}H(\hat{X}^n)\right] = H(\bar{X}|\bar{Y}).
\]
The term $H(\bar{X}|\bar{Y})$ is in general positive; a quick example is the BSC with crossover probability $p$, for which $\bar{X}$ and $\bar{Y}$ are both uniform and
\[
H(\bar{X}|\bar{Y}) = H(\bar{X}) - I(\bar{X};\bar{Y}) = H(\bar{X}) - H(\bar{Y}) + H(\bar{Y}|\bar{X}) = H(\bar{Y}|\bar{X}) = -p\log(p) - (1-p)\log(1-p).
\]
Consequently,
the two input distributions do not necessarily resemble each other.

[Figure 4.3: The communication system: the true source $(\ldots, X_3, X_2, X_1)$ is fed into the true channel $P_{Y^n|X^n}$, producing the true output $(\ldots, Y_3, Y_2, Y_1)$.]

[Figure 4.4: The simulated communication system: a computer-generated source $(\ldots, \tilde{X}_3, \tilde{X}_2, \tilde{X}_1)$ is fed into the true channel $P_{Y^n|X^n}$, producing the corresponding output $(\ldots, \tilde{Y}_3, \tilde{Y}_2, \tilde{Y}_1)$.]
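A minimal numerical check (in nats, with an arbitrarily chosen crossover probability) of the entropy gap just derived: for a BSC with uniform input, $H(\bar{X}|\bar{Y}) = H(\bar{Y}|\bar{X}) = h_b(p) > 0$.

import numpy as np

def hb(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

p = 0.11                                    # an arbitrary crossover probability
# joint distribution of (X, Y) for a uniform input over {0,1} and a BSC(p)
joint = np.array([[0.5 * (1 - p), 0.5 * p],
                  [0.5 * p, 0.5 * (1 - p)]])
PY = joint.sum(axis=0)
H_X_given_Y = -np.sum(joint * np.log(joint / PY))   # H(X|Y) = -sum_{x,y} P(x,y) log P(x|y)
print("H(X|Y)  =", H_X_given_Y)
print("h_b(p)  =", hb(p))                   # the two agree, and both are strictly positive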


The previous discussion motivates the search for an input distribution that is equally distributed over a subset of the input alphabet and that generates output statistics close to the output due to the input maximizing the mutual information. Since such approximations are usually performed by computers, it is natural to connect the approximation of input and output statistics with the concept of resolvability.
In a data transmission system as shown in Figure 4.3, suppose that the source,
channel and output are respectively denoted by

X n := (X1 , . . . , Xn ),

W n := (W1 , . . . , Wn ),
and
Y n := (Y1 , . . . , Yn ),
where $W_i$ has distribution $P_{Y_i|X_i}$. In order to simulate the behavior of the channel, a computer-generated input may be necessary, as shown in Figure 4.4. As stated in Chapter 3, such a computer-generated input is based on an algorithm formed from a few basic uniform random experiments, and hence has finite resolution. Our goal is to find a good computer-generated input $\tilde{X}^n$ such that the corresponding output $\tilde{Y}^n$ is very close to the true output $Y^n$.

Definition 4.21 (ε-resolvability for input X and channel W) Fix $\varepsilon > 0$, and suppose that the (true) input random variable and (true) channel statistics are $X$ and $W = (Y|X)$, respectively. Then the ε-resolvability $S_\varepsilon(X, W)$ for input $X$ and channel $W$ is defined by
\[
S_\varepsilon(X, W) := \min\Big\{ R : (\forall\,\gamma>0)(\exists\, \tilde{X} \text{ and } N)(\forall\, n > N)\ \ \frac{1}{n}R(\tilde{X}^n) < R + \gamma \ \text{ and } \ \|Y^n - \tilde{Y}^n\| < \varepsilon \Big\},
\]
where $P_{\tilde{Y}^n} = P_{\tilde{X}^n} P_{W^n}$. (The definitions of the resolution $R(\cdot)$ and the variational distance $\|\cdot - \cdot\|$ are given in Definitions 3.4 and 3.5.)

Note that if we take the channel W n to be an identity channel for all n,


namely X n = Y n and PY n |X n (y n |xn ) is either 1 or 0, then the ε-resolvability for
input X and channel W is reduced to source ε-resolvability for X:

Sε (X, W Identity ) = Sε (X).

Similar reductions can be applied to all the following definitions.
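Before stating the remaining definitions, here is a minimal sketch (hypothetical numbers, not from the text) of the quantity appearing in Definition 4.21: a finite-resolution input $\tilde{X}^n$, uniform over an $M$-type approximation of the true input, is passed through the channel and the resulting output is compared with the true output in variational distance (sum of absolute differences).

# Minimal sketch: output approximation by a finite-resolution (M-type) input, n = 2, BSC(0.1).
import itertools
import numpy as np

n, p = 2, 0.1
alphabet = list(itertools.product([0, 1], repeat=n))

def W(y, x):                                          # BSC(p) used n times
    d = sum(a != b for a, b in zip(x, y))
    return (p ** d) * ((1 - p) ** (n - d))

def push(PX):                                         # output distribution P_X P_W
    return {y: sum(PX[x] * W(y, x) for x in alphabet) for y in alphabet}

PX_true = {x: (0.3 ** sum(x)) * (0.7 ** (n - sum(x))) for x in alphabet}   # i.i.d. Bern(0.3)
M = 10                                                # resolution of roughly log M random bits
counts = {x: round(PX_true[x] * M) for x in alphabet} # a simple M-type quantization
PX_tilde = {x: counts[x] / sum(counts.values()) for x in alphabet}

vd = sum(abs(push(PX_true)[y] - push(PX_tilde)[y]) for y in alphabet)
print("||Y^n - Ytilde^n|| =", vd)                     # small distance <=> good approximation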

Definition 4.22 (ε-mean-resolvability for input X and channel W) Fix $\varepsilon > 0$, and suppose that the (true) input random variable and (true) channel statistics are respectively $X$ and $W$. Then the ε-mean-resolvability $\bar{S}_\varepsilon(X, W)$ for input $X$ and channel $W$ is defined by
\[
\bar{S}_\varepsilon(X, W) := \min\Big\{ R : (\forall\,\gamma>0)(\exists\, \tilde{X} \text{ and } N)(\forall\, n > N)\ \ \frac{1}{n}H(\tilde{X}^n) < R + \gamma \ \text{ and } \ \|Y^n - \tilde{Y}^n\| < \varepsilon \Big\},
\]
where $P_{Y^n} = P_{X^n} P_{W^n}$ and $P_{\tilde{Y}^n} = P_{\tilde{X}^n} P_{W^n}$.

Definition 4.23 (resolvability and mean-resolvability for input X and channel W) The resolvability and mean-resolvability for input $X$ and channel $W$ are defined respectively as
\[
S(X, W) := \sup_{\varepsilon>0} S_\varepsilon(X, W) \quad\text{and}\quad \bar{S}(X, W) := \sup_{\varepsilon>0} \bar{S}_\varepsilon(X, W).
\]

Definition 4.24 (resolvability and mean-resolvability for channel W) The resolvability and mean-resolvability for channel $W$ are defined respectively as
\[
S(W) := \sup_{X} S(X, W) \quad\text{and}\quad \bar{S}(W) := \sup_{X} \bar{S}(X, W).
\]

As an extension of Chapter 3, the above definitions lead to the following theorem.

Theorem 4.25 ([26])
\[
S(W) = C_{SC} = \sup_{X}\,\overline{I}(X;Y)
\]
and
\[
\bar{S}(W) = C = \sup_{X}\,\underline{I}(X;Y).
\]

It is thus a reasonable inference that if no computer algorithm can produce the desired output statistics using the specified number of random nats, then all codes of that rate must be bad codes.

Chapter 5

Optimistic Shannon Coding Theorems


for Arbitrary Single-User Systems

As seen in Chapters 2 and 4, the conventional definitions of the source coding


rate and channel capacity require the existence of reliable codes for all suffi-
ciently large blocklengths. Alternatively, if it is required that good codes exist
for infinitely many blocklengths, then optimistic definitions of source coding rate
and channel capacity are obtained.
In this chapter, formulas for the optimistic minimum achievable fixed-length
source coding rate and the minimum ε-achievable source coding rate for arbi-
trary finite-alphabet sources are established. The expressions for the optimistic
capacity and the optimistic ε-capacity of arbitrary single-user channels are also
provided. The expressions of the optimistic source coding rate and capacity are
examined for the class of information stable sources and channels, respectively.
Finally, examples for the computation of optimistic capacity are presented.

5.1 Motivations

The conventional definition of the minimum achievable fixed-length source cod-


ing rate T (X) (or T0 (X)) for a source X (cf. Definition 2.2) requires the exis-
tence of reliable source codes for all sufficiently large blocklengths. Alternatively,
if it is required that reliable codes exist for infinitely many blocklengths, a new,
more optimistic definition of source coding rate (denoted by $\bar{T}(X)$) is obtained
[41]. Similarly, the optimistic capacity C̄ is defined by requiring the existence of
reliable channel codes for infinitely many blocklengths, as opposed to the defini-
tion of the conventional channel capacity C (see Definition 4.3 or [42, Definition
1]).
The concepts of optimistic source coding rate and capacity were investigated by Verdú et al. for arbitrary (not necessarily stationary, ergodic, information stable, etc.) sources and single-user channels in [41, 42]. More specifically, they establish an additional operational characterization of the optimistic minimum achievable source coding rate ($\bar{T}(X)$ for source $X$) by demonstrating that, for a given source, the classical statement of the source-channel separation theorem¹ holds for every channel if $\bar{T}(X) = T(X)$ [41]. In a dual fashion, they also show that for channels with $\bar{C} = C$, the classical separation theorem holds for every source. They also conjecture that $\bar{T}(X)$ and $\bar{C}$ do not admit a simple expression.
In this chapter, we demonstrate that $\bar{T}(X)$ and $\bar{C}$ do indeed have a general
formula. The key to these results is the application of the generalized sup-
information rate introduced in Chapter 1 to the existing proofs by Verdú and
Han [42] of the direct and converse parts of the conventional coding theorems.
We also provide a general expression for the optimistic minimum ε-achievable
source coding rate and the optimistic ε-capacity.

5.2 Optimistic source coding theorems

In this section, we provide the optimistic source coding theorems. They are
shown based on two new bounds due to Han [25] on the error probability of a
source code as a function of its size. Interestingly, these bounds constitute the
natural counterparts of the upper bound provided by Feinstein’s Lemma (see
Lemma 4.4) and the Verdú-Han lower bound [42] to the error probability of a
channel code. Furthermore, we show that for information stable sources, the
formula for $\bar{T}(X)$ reduces to
\[
\bar{T}(X) = \liminf_{n\to\infty}\frac{1}{n}H(X^n).
\]
This is in contrast to the expression for $T(X)$, which is known to be
\[
T(X) = \limsup_{n\to\infty}\frac{1}{n}H(X^n).
\]

The above result leads us to observe that for sources that are both stationary and
information stable, the classical separation theorem is valid for every channel.
In [41], Vembu et al. characterize the sources for which the classical separation
theorem holds for every channel. They demonstrate that for a given source X,
¹By the “classical statement of the source-channel separation theorem,” we mean the following. Given a source X with (conventional) source coding rate T(X) and a channel W with capacity C, X can be reliably transmitted over W if T(X) < C. Conversely, if T(X) > C, then X cannot be reliably transmitted over W. By reliable transmissibility of the source over the channel, we mean that there exists a sequence of joint source-channel codes such that the decoding error probability vanishes as the blocklength n → ∞ (cf. [41]).

the separation theorem holds for every channel if its optimistic minimum achievable source coding rate $\bar{T}(X)$ coincides with its conventional (or pessimistic) minimum achievable source coding rate $T(X)$; i.e., if $\bar{T}(X) = T(X)$.
We herein establish a general formula for $\bar{T}(X)$. We prove that for any source $X$,
\[
\bar{T}(X) = \lim_{\delta\uparrow 1} H_\delta(X) = H_{1^-}(X).
\]

We also provide the general expression for the optimistic minimum ε-achievable
source coding rate. We show these results based on two new bounds due to Han
(one upper bound and one lower bound) on the error probability of a source code
[25, Chapter 1]. The upper bound (i.e., Lemma 2.3) consists of the counterpart
of Feinstein’s Lemma for channel codes, while the lower bound (i.e., Lemma 2.4)
consists of the counterpart of the Verdú-Han lower bound on the error probability
of a channel code ([42, Theorem 4]). As in the case of the channel coding bounds,
both source coding bounds (Lemmas 2.3 and 2.4) hold for arbitrary sources and
for arbitrary fixed blocklength.

Definition 5.1 An (n, M) fixed-length source code for X n is a collection of M


n-tuples ∼Cn = {cn1 , . . . , cnM }. The error probability of the code is
\[
P_e(\mathcal{C}_n) := \Pr\left[X^n \notin \mathcal{C}_n\right].
\]

Definition 5.2 (optimistic ε-achievable source coding rate) Fix 0 < ε <
1. R ≥ 0 is an optimistic ε-achievable rate if, for every γ > 0, there exists a
sequence of (n, Mn ) fixed-length source codes ∼Cn such that
\[
\limsup_{n\to\infty}\frac{1}{n}\log M_n \le R
\quad\text{and}\quad
\liminf_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon.
\]
The infimum of all optimistic ε-achievable source coding rates for source $X$ is denoted by $\bar{T}_\varepsilon(X)$. Also define $\bar{T}(X) := \sup_{0<\varepsilon<1}\bar{T}_\varepsilon(X) = \lim_{\varepsilon\downarrow 0}\bar{T}_\varepsilon(X) = \bar{T}_0(X)$ as the optimistic source coding rate.

We can then use Lemmas 2.3 and 2.4 (in a similar fashion to the general
source coding theorem in Theorem 2.5) to prove the general optimistic (fixed-
length) source coding theorems.

Theorem 5.3 (optimistic minimum ε-achievable source coding rate formula) For any source $X$,
\[
\bar{T}_\varepsilon(X) = \begin{cases} \displaystyle\lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X), & \text{for } \varepsilon \in [0,1); \\[1ex] 0, & \text{for } \varepsilon = 1. \end{cases}
\]

Proof: The case of ε = 1 follows directly from its definition; hence, the proof
only focuses on the case of ε ∈ [0, 1).

1. Forward part (achievability): $\bar{T}_\varepsilon(X) \le \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X)$
We need to prove the existence of a sequence of block codes {∼Cn =
(n, Mn )}n≥1 such that for every γ > 0,
\[
\limsup_{n\to\infty}\frac{1}{n}\log M_n \le \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) + \gamma
\quad\text{and}\quad
\liminf_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon.
\]
Lemma 2.3 ensures the existence (for any $\gamma > 0$) of a source block code $\mathcal{C}_n = \big(n, M_n = \big\lceil \exp\{n(\lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) + \gamma)\}\big\rceil\big)$ with error probability
\[
P_e(\mathcal{C}_n) \le \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \frac{1}{n}\log M_n\right]
\le \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) + \gamma\right].
\]
Therefore,
\[
\liminf_{n\to\infty} P_e(\mathcal{C}_n)
\le \liminf_{n\to\infty}\Pr\left[\frac{1}{n}h_{X^n}(X^n) > \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) + \gamma\right]
= 1 - \limsup_{n\to\infty}\Pr\left[\frac{1}{n}h_{X^n}(X^n) \le \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) + \gamma\right]
\le 1 - (1-\varepsilon) = \varepsilon,
\]
where the last inequality follows from
\[
\lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) = \sup\left\{\theta : \limsup_{n\to\infty}\Pr\left[\frac{1}{n}h_{X^n}(X^n) \le \theta\right] < 1-\varepsilon\right\}. \tag{5.2.1}
\]

2. Converse part: $\bar{T}_\varepsilon(X) \ge \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X)$
Assume without loss of generality that $\lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) > 0$. We will prove the converse by contradiction. Suppose that $\bar{T}_\varepsilon(X) < \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X)$. Then $(\exists\,\gamma>0)$ $\bar{T}_\varepsilon(X) < \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) - 4\gamma$. By definition of $\bar{T}_\varepsilon(X)$, there exists a sequence of codes $\mathcal{C}_n = (n, M_n)$ such that
\[
\limsup_{n\to\infty}\frac{1}{n}\log M_n \le \left(\lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) - 4\gamma\right) + \gamma
< \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) - 2\gamma \tag{5.2.2}
\]
and
\[
\liminf_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon. \tag{5.2.3}
\]

(5.2.2) implies that
\[
\frac{1}{n}\log M_n \le \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) - 2\gamma
\]
for all sufficiently large $n$. Hence, for those $n$ satisfying the above inequality and also by Lemma 2.4,
\[
P_e(\mathcal{C}_n) \ge \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \frac{1}{n}\log M_n + \gamma\right] - e^{-n\gamma}
\ge \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \left(\lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) - 2\gamma\right) + \gamma\right] - e^{-n\gamma}.
\]
Therefore,
\[
\liminf_{n\to\infty} P_e(\mathcal{C}_n) \ge 1 - \limsup_{n\to\infty}\Pr\left[\frac{1}{n}h_{X^n}(X^n) \le \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) - \gamma\right]
> 1 - (1-\varepsilon) = \varepsilon,
\]

where the last inequality follows from (5.2.1). Thus, a contradiction to (5.2.3) is obtained. □
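The quantile on the right-hand side of (5.2.1) can be read off an empirical entropy spectrum. The Monte Carlo sketch below uses a hypothetical mixture of two i.i.d. Bernoulli components (natural logarithms); for such a convergent spectrum the same quantile gives both the conventional and the optimistic quantities, so the sketch is only meant to illustrate the mechanics of (5.2.1).

# Monte Carlo sketch: empirical entropy spectrum of a hypothetical mixed source.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 2000, 20000
w, p1, p2 = 0.5, 0.05, 0.40          # mixture weight and component biases (assumed)

def hb(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

# draw which component generated each block, then the number of ones in the block
which = rng.random(trials) < w
ones = np.where(which, rng.binomial(n, p1, trials), rng.binomial(n, p2, trials))

# exact normalized entropy density (1/n) h_{X^n}(x^n) under the mixture distribution
logP1 = ones * np.log(p1) + (n - ones) * np.log(1 - p1)
logP2 = ones * np.log(p2) + (n - ones) * np.log(1 - p2)
h = -np.logaddexp(np.log(w) + logP1, np.log(1 - w) + logP2) / n

eps = 0.3
print("empirical (1-eps)-quantile of the spectrum:", np.quantile(h, 1 - eps))
print("component entropies h_b(p1), h_b(p2)      :", hb(p1), hb(p2))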

We conclude this section by examining the expression of $\bar{T}(X)$ for information stable sources. It is already known (cf. for example [41]) that for an information stable source $X$,
\[
T(X) = \limsup_{n\to\infty}\frac{1}{n}H(X^n).
\]

We herein prove a parallel expression for $\bar{T}(X)$.

Definition 5.4 (information stable sources [41]) A source X is said to be


information stable if $H(X^n) > 0$ for $n$ sufficiently large, and $h_{X^n}(X^n)/H(X^n)$ converges in probability to one as $n \to \infty$, i.e.,
\[
\limsup_{n\to\infty}\Pr\left\{\left|\frac{h_{X^n}(X^n)}{H(X^n)} - 1\right| > \gamma\right\} = 0 \qquad \forall\,\gamma > 0,
\]
where $H(X^n) = E[h_{X^n}(X^n)]$ is the entropy of $X^n$.

Lemma 5.5 Every information stable source $X$ satisfies
\[
\bar{T}(X) = \liminf_{n\to\infty}\frac{1}{n}H(X^n).
\]

Proof:

1. $\bar{T}(X) \ge \liminf_{n\to\infty}(1/n)H(X^n)$
Fix ε > 0 arbitrarily small. Using the fact that hX n (X n ) is a non-negative
bounded random variable for finite alphabet, we can write the normalized
block entropy as
\[
\frac{1}{n}H(X^n) = E\left[\frac{1}{n}h_{X^n}(X^n)\right]
= E\left[\frac{1}{n}h_{X^n}(X^n)\,\mathbf{1}\!\left\{0 \le \frac{1}{n}h_{X^n}(X^n) \le \lim_{\delta\uparrow 1}H_\delta(X) + \varepsilon\right\}\right]
+ E\left[\frac{1}{n}h_{X^n}(X^n)\,\mathbf{1}\!\left\{\frac{1}{n}h_{X^n}(X^n) > \lim_{\delta\uparrow 1}H_\delta(X) + \varepsilon\right\}\right]. \tag{5.2.4}
\]
From the definition of $\lim_{\delta\uparrow 1}H_\delta(X)$, it directly follows that the first term on the right-hand side of (5.2.4) is upper bounded by $\lim_{\delta\uparrow 1}H_\delta(X) + \varepsilon$, and that the liminf of the second term is zero. Thus
\[
\bar{T}(X) = \lim_{\delta\uparrow 1}H_\delta(X) \ge \liminf_{n\to\infty}\frac{1}{n}H(X^n).
\]

2. $\bar{T}(X) \le \liminf_{n\to\infty}(1/n)H(X^n)$
Fix $\varepsilon > 0$. For infinitely many $n$,
\[
\Pr\left[\frac{h_{X^n}(X^n)}{H(X^n)} - 1 > \varepsilon\right]
= \Pr\left[\frac{1}{n}h_{X^n}(X^n) > (1+\varepsilon)\,\frac{1}{n}H(X^n)\right]
\ge \Pr\left[\frac{1}{n}h_{X^n}(X^n) > (1+\varepsilon)\left(\liminf_{n\to\infty}\frac{1}{n}H(X^n) + \varepsilon\right)\right].
\]
Since $X$ is information stable, we obtain that
\[
\liminf_{n\to\infty}\Pr\left[\frac{1}{n}h_{X^n}(X^n) > (1+\varepsilon)\left(\liminf_{n\to\infty}\frac{1}{n}H(X^n) + \varepsilon\right)\right]
\le \lim_{n\to\infty}\Pr\left[\frac{h_{X^n}(X^n)}{H(X^n)} - 1 > \varepsilon\right] = 0.
\]
By definition of $\lim_{\delta\uparrow 1}H_\delta(X)$, the above implies that
\[
\bar{T}(X) = \lim_{\delta\uparrow 1}H_\delta(X) \le (1+\varepsilon)\left(\liminf_{n\to\infty}\frac{1}{n}H(X^n) + \varepsilon\right).
\]
The proof is completed by noting that $\varepsilon$ can be made arbitrarily small. □

It is worth pointing out that if the source $X$ is both information stable and stationary, the above lemma yields
\[
\bar{T}(X) = T(X) = \lim_{n\to\infty}\frac{1}{n}H(X^n).
\]
This implies that, given a stationary and information stable source $X$, the classical separation theorem holds for every channel.
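The following sketch (with an assumed block-alternating bias pattern, natural logarithms) illustrates Definition 5.4 and Lemma 5.5 together: the source is information stable, yet $(1/n)H(X^n)$ keeps oscillating, so $\bar{T}(X) = \liminf_n (1/n)H(X^n)$ and $T(X) = \limsup_n (1/n)H(X^n)$ differ.

# Minimal sketch: an information stable but non-stationary independent-bit source.
import numpy as np

rng = np.random.default_rng(1)

def hb(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

# biases over doubling blocks: 0.05, 0.45, 0.05, ...  (an assumed pattern)
p = np.concatenate([np.full(2**k, 0.05 if k % 2 else 0.45) for k in range(1, 14)])
n = len(p)
Hn = np.cumsum(hb(p))                      # H(X^n) = sum_i h_b(p_i) for independent bits
avg = Hn / np.arange(1, n + 1)
tail = avg[n // 2:]
print("approx liminf (1/n)H(X^n):", tail.min(), "  approx limsup:", tail.max())

# information-stability check at blocklength n: h_{X^n}(X^n)/H(X^n) concentrates near 1
x = (rng.random((500, n)) < p).astype(float)
h = -(x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1)
print("P(|h/H - 1| > 0.05) ~", np.mean(np.abs(h / Hn[-1] - 1) > 0.05))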

5.3 Optimistic channel coding theorems

In this section, we state without proof the general expressions for the optimistic ε-capacity² ($\bar{C}_\varepsilon$) and for the optimistic capacity ($\bar{C}$) of arbitrary single-user channels. The proofs of these expressions are straightforward once the right definition (of $\bar{I}_\varepsilon(X;Y)$) is made. They employ Feinstein’s Lemma and the Poor-Verdú bound, and follow the same arguments used in Theorems 4.14 and 4.15 to show the general expressions of the conventional ε-capacity
\[
C_\varepsilon = \sup_{X}\,\underline{I}_\varepsilon(X;Y),
\]
and of the conventional channel capacity
\[
C = \sup_{X}\,\underline{I}_0(X;Y) = \sup_{X}\,\underline{I}(X;Y).
\]

We close this section by proving the formula of C̄ for information stable channels.

Definition 5.6 (optimistic ε-achievable rate) Fix 0 < ε < 1. R ≥ 0 is an


optimistic ε-achievable rate if there exists a sequence of ∼Cn = (n, Mn ) channel
block codes such that
\[
\liminf_{n\to\infty}\frac{1}{n}\log M_n \ge R
\quad\text{and}\quad
\liminf_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon.
\]

Definition 5.7 (optimistic ε-capacity C̄ε ) Fix 0 < ε < 1. The supremum of
optimistic ε-achievable rates is called the optimistic ε-capacity, C̄ε .

It is straightforward from the definition that C̄ε is non-decreasing in ε, and


C̄1 = log |X |.

2
Note that the expression of C̄ε was also separately obtained in [37, Theorem 7].

Definition 5.8 (optimistic capacity C̄) The optimistic channel capacity C̄
is defined as the supremum of the rates that are ε-achievable for all ε ∈ [0, 1]. It
follows immediately from the definition that C̄ = inf 0≤ε≤1 C̄ε = limε↓0 C̄ε = C̄0
and that C̄ is the supremum of all the rates R for which there exists a sequence
of ∼Cn = (n, Mn ) channel block codes such that
\[
\liminf_{n\to\infty}\frac{1}{n}\log M_n \ge R
\quad\text{and}\quad
\liminf_{n\to\infty} P_e(\mathcal{C}_n) = 0.
\]

Theorem 5.9 (optimistic ε-capacity formula) Fix 0 < ε < 1. The opti-
mistic ε-capacity $\bar{C}_\varepsilon$ satisfies
\[
\bar{C}_\varepsilon = \sup_{X}\,\bar{I}_\varepsilon(X;Y). \tag{5.3.1}
\]

Theorem 5.10 (optimistic capacity formula) The optimistic capacity C̄


satisfies
\[
\bar{C} = \sup_{X}\,\bar{I}_0(X;Y).
\]

We next investigate the expression of C̄ for information stable channels.


The expression for the capacity of information stable channels is already known
(cf. for example [41])
\[
C = \liminf_{n\to\infty}\,\sup_{X^n}\,\frac{1}{n}I(X^n;Y^n).
\]

We prove a dual formula for C̄.

Definition 5.11 (Information stable channels [18, 32]) A channel W is


said to be information stable if there exists an input process X such that 0 <
$C_n := \sup_{X^n}(1/n)I(X^n;Y^n) < \infty$ for $n$ sufficiently large, and
\[
\limsup_{n\to\infty}\Pr\left\{\left|\frac{i_{X^n W^n}(X^n;Y^n)}{n\,C_n} - 1\right| > \gamma\right\} = 0
\]

for every γ > 0.

Lemma 5.12 Every information stable channel W satisfies


\[
\bar{C} = \limsup_{n\to\infty}\,\sup_{X^n}\,\frac{1}{n}I(X^n;Y^n).
\]

Proof:
1. C̄ ≤ lim supn→∞ supX n (1/n)I(X n ; Y n )
By using a similar argument as in the proof of [42, Theorem 8, property
h)], we have
\[
\bar{I}_0(X;Y) \le \limsup_{n\to\infty}\,\sup_{X^n}\,\frac{1}{n}I(X^n;Y^n).
\]
Hence,
\[
\bar{C} = \sup_{X}\,\bar{I}_0(X;Y) \le \limsup_{n\to\infty}\,\sup_{X^n}\,\frac{1}{n}I(X^n;Y^n).
\]

2. C̄ ≥ lim supn→∞ supX n (1/n)I(X n ; Y n )


Suppose X̃ is the input process that makes the channel information stable.
Fix $\varepsilon > 0$. Then for infinitely many $n$,
\[
P_{\tilde{X}^n W^n}\left[\frac{1}{n}\, i_{\tilde{X}^n W^n}(\tilde{X}^n; Y^n) \le (1-\varepsilon)\left(\limsup_{n\to\infty} C_n - \varepsilon\right)\right]
\le P_{\tilde{X}^n W^n}\left[\frac{i_{\tilde{X}^n W^n}(\tilde{X}^n; Y^n)}{n} < (1-\varepsilon)\, C_n\right]
= P_{\tilde{X}^n W^n}\left[\frac{i_{\tilde{X}^n W^n}(\tilde{X}^n; Y^n)}{n\, C_n} - 1 < -\varepsilon\right].
\]
Since the channel is information stable, we get that
\[
\liminf_{n\to\infty} P_{\tilde{X}^n W^n}\left[\frac{1}{n}\, i_{\tilde{X}^n W^n}(\tilde{X}^n; Y^n) \le (1-\varepsilon)\left(\limsup_{n\to\infty} C_n - \varepsilon\right)\right] = 0.
\]
By the definition of $\bar{C}$, the above immediately implies that
\[
\bar{C} = \sup_{X}\,\bar{I}_0(X;Y) \ge \bar{I}_0(\tilde{X};Y) \ge (1-\varepsilon)\left(\limsup_{n\to\infty} C_n - \varepsilon\right).
\]

The proof is completed by noting that $\varepsilon$ can be made arbitrarily small. □

Observations:
• It is known that for discrete memoryless channels, the optimistic capacity
C̄ is equal to the (conventional) capacity C [42, 16]. The same result
holds for modulo-q additive noise channels with stationary ergodic noise.
However, in general, $\bar{C} \ge C$ since $\bar{I}_0(X;Y) \ge \underline{I}(X;Y)$ [10, 11].

• Remark that Theorem 11 in [41] holds if, and only if,
\[
\sup_{X}\,\underline{I}(X;Y) = \sup_{X}\,\bar{I}_0(X;Y).
\]

Furthermore, note that, if C̄ = C and there exists an input distribution


PX̂ that achieves C, then PX̂ also achieves C̄.

5.4 Examples for the computation of capacity and strong capacity

We provide four examples to illustrate the computation of C and C̄. The first two
examples present information stable channels for which C̄ > C. The third exam-
ple shows an information unstable channel for which C̄ = C. These examples indicate that information stability is neither necessary nor sufficient to ensure that C̄ = C, and thereby the validity of the classical source-channel separation theorem. The last example illustrates the situation where $0 < C < \bar{C} < C_{SC} < \log_2|\mathcal{Y}|$, where $C_{SC}$ denotes the strong capacity. We assume in this section that all logarithms are in base 2, so that C and C̄ are measured in bits.

5.4.1 Information stable channels


Example 5.13 Consider a nonstationary channel W such that at odd time instances n = 1, 3, · · · , $W^n$ is the product of the transition distributions of a binary symmetric channel with crossover probability 1/8 (BSC(1/8)), and at even time instances n = 2, 4, 6, · · · , $W^n$ is the product of the distributions of a BSC(1/4). It can be easily verified that this channel is information stable. Since the channel is symmetric, a Bernoulli(1/2) input achieves $C_n = \sup_{X^n}(1/n)I(X^n;Y^n)$; thus
\[
C_n = \begin{cases} 1 - h_b(1/8), & \text{for } n \text{ odd};\\ 1 - h_b(1/4), & \text{for } n \text{ even}, \end{cases}
\]
where hb (a) := −a log2 a − (1 − a) log2 (1 − a) is the binary entropy function.


Therefore, C = lim inf n→∞ Cn = 1 − hb (1/4) and C̄ = lim supn→∞ Cn = 1 −
hb (1/8) > C.
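A two-line numerical check of the values in Example 5.13 (logarithms in base 2, as assumed in this section).

import numpy as np

def hb2(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

print("C    = 1 - h_b(1/4) =", 1 - hb2(0.25))   # ~ 0.1887 bits
print("Cbar = 1 - h_b(1/8) =", 1 - hb2(0.125))  # ~ 0.4564 bits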

Example 5.14 Here we use the information stable channel provided in [41,
Section III] to show that C̄ > C. Let N be the set of all positive integers.
Define the set J as
\[
\mathcal{J} := \{n \in \mathcal{N} : 2^{2i+1} \le n < 2^{2i+2},\ i = 0, 1, 2, \ldots\}
= \{2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 32, 33, \cdots, 63, 128, 129, \cdots, 255, \cdots\}.
\]
Consider the following nonstationary symmetric channel $W$. At times $n \in \mathcal{J}$, $W_n$ is a BSC(0), whereas at times $n \notin \mathcal{J}$, $W_n$ is a BSC(1/2). Put $W^n = W_1 \times W_2 \times \cdots \times W_n$. Here again $C_n$ is achieved by a Bernoulli(1/2) input $\hat{X}^n$. Since the set $\mathcal{J}$ is deterministic and is known to both transmitter and receiver, we then obtain
\[
C_n = \frac{1}{n}\sum_{i=1}^{n} I(\hat{X}_i; Y_i) = \frac{1}{n}\left[J(n)\cdot 1 + (n - J(n))\cdot 0\right] = \frac{J(n)}{n},
\]
where $J(n) := |\mathcal{J} \cap \{1, 2, \cdots, n\}|$. It can be shown that
\[
\frac{J(n)}{n} = \begin{cases} \displaystyle 1 - \frac{2}{3}\cdot\frac{2^{\lfloor \log_2 n\rfloor}}{n} + \frac{1}{3n}, & \text{for } \lfloor\log_2 n\rfloor \text{ odd};\\[2ex] \displaystyle \frac{2}{3}\cdot\frac{2^{\lfloor \log_2 n\rfloor}}{n} - \frac{2}{3n}, & \text{for } \lfloor\log_2 n\rfloor \text{ even}. \end{cases}
\]
Consequently, C = lim inf n→∞ Cn = 1/3 and C̄ = lim supn→∞ Cn = 2/3.
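The behaviour of $C_n = J(n)/n$ can also be verified directly from the definition of the set $\mathcal{J}$, as in the following minimal sketch.

# Minimal sketch: compute J(n)/n from the definition of J and observe its limit points.
def in_J(k):
    # k is in J iff 2^(2i+1) <= k < 2^(2i+2) for some i >= 0
    e = k.bit_length() - 1          # largest exponent e with 2^e <= k
    return e % 2 == 1

J_count, nmax, records = 0, 2**16, []
for n in range(1, nmax + 1):
    J_count += in_J(n)
    records.append(J_count / n)

print("min of J(n)/n over the last half:", min(records[nmax // 2:]))   # ~ 1/3
print("max of J(n)/n over the last half:", max(records[nmax // 2:]))   # ~ 2/3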

5.4.2 Information unstable channels


Example 5.15 (Polya-contagion channel) Consider a discrete additive cha-
nnel with binary input and output alphabet {0, 1} described by
Y i = Xi ⊕ Z i , i = 1, 2, · · · ,
where Xi , Yi and Zi are respectively the i-th input, i-th output and i-th noise,
and ⊕ represents modulo-2 addition. Suppose that the input process is indepen-
dent of the noise process. Also assume that the noise sequence {Zn }n≥1 is drawn
according to the Polya contagion urn scheme [1, 33] as follows: an urn originally
contains R red balls and B black balls with R < B; the noise just makes succes-
sive draws from the urn; after each draw, it returns to the urn 1 + ∆ balls of the
same color as was just drawn (∆ > 0). The noise sequence {Zi } corresponds to
the outcomes of the draws from the Polya urn: Zi = 1 if i-th ball drawn is red
and Zi = 0, otherwise. Let ρ := R/(R + B) and δ := ∆/(R + B). It is shown in
[1] that the noise process {Zi } is stationary and nonergodic; thus the channel is
information unstable. We then obtain³
\[
C_\varepsilon = 1 - \lim_{\delta\uparrow(1-\varepsilon)} \overline{H}_\delta(X),
\]

3
Some work adopts slightly different definitions of Cε and C̄ε from ours. For example, the ε-
achievable rate in [42] and the optimistic ε-achievable rate in [13] are defined via arbitrary γ > 0
and a sequence of ∼Cn = (n, Mn ) block codes such that i) (1/n) log Mn > R − γ for sufficiently
large n and ii) Pe ( ∼Cn ) ≤ ε for sufficiently large n under ε-achievability and Pe ( ∼Cn ) ≤ ε for
infinitely many n under optimistic ε-achievability. We, however, define the two quantities
without the auxiliary (arbitrarily small) γ but dictate the existence of a sequence of ∼Cn =
(n, Mn ) block codes such that i) lim inf n→∞ (1/n) log Mn ≥ R and ii) lim supn→∞ Pe ( ∼Cn ) ≤ ε
for ε-achievability and lim inf n→∞ Pe ( ∼Cn ) ≤ ε for optimistic ε-achievability (cf. Definitions 4.10
and 5.6). Notably, by adopting the definitions in [42] and [13], one can only obtain (cf. [11,
Part I])
\[
1 - \overline{H}_{1-\varepsilon}(X) \le C_\varepsilon \le 1 - \lim_{\delta\uparrow(1-\varepsilon)} \overline{H}_\delta(X),
\]
and
\[
1 - H_{1-\varepsilon}(X) \le \bar{C}_\varepsilon \le 1 - \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X).
\]

Our definitions accordingly provide simpler equality formulas for ε-capacity and optimistic
ε-capacity.

and
\[
\bar{C}_\varepsilon = 1 - \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X).
\]

It is shown [1] that −(1/n) log PX n (X n ) converges in distribution to the contin-


uous random variable V := hb (U), where U is beta-distributed with parameters
$(\rho/\delta, (1-\rho)/\delta)$, and $h_b(\cdot)$ is the binary entropy function. Thus
\[
\overline{H}_{1-\varepsilon}(X) = \lim_{\delta\uparrow(1-\varepsilon)} \overline{H}_\delta(X) = H_{1-\varepsilon}(X) = \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) = F_V^{-1}(1-\varepsilon),
\]

where FV (a) := Pr{V ≤ a} is the cumulative distribution function of V , and


FV−1 (·) is its inverse [1]. Consequently,

\[
C_\varepsilon = \bar{C}_\varepsilon = 1 - F_V^{-1}(1-\varepsilon),
\]
and
\[
C = \bar{C} = \lim_{\varepsilon\downarrow 0}\left[1 - F_V^{-1}(1-\varepsilon)\right] = 0.
\]
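A Monte Carlo sketch of Example 5.15 (base-2 logarithms): $V = h_b(U)$ with $U$ beta-distributed, and $C_\varepsilon = \bar{C}_\varepsilon = 1 - F_V^{-1}(1-\varepsilon)$. The urn parameters below ($\rho = 0.4$, $\delta = 0.2$, i.e., beta shape parameters 2 and 3) are an arbitrary choice satisfying $R < B$.

# Monte Carlo sketch of the Polya-contagion channel capacities under assumed urn parameters.
import numpy as np

rng = np.random.default_rng(2)
rho, delta = 0.4, 0.2                       # rho = R/(R+B) < 1/2, delta = Delta/(R+B)

def hb2(u):
    return -u * np.log2(u) - (1 - u) * np.log2(1 - u)

U = rng.beta(rho / delta, (1 - rho) / delta, size=200000)
V = hb2(U)                                  # limiting law of -(1/n) log2 P_{Z^n}(Z^n)

for eps in (0.01, 0.1, 0.3):
    C_eps = 1 - np.quantile(V, 1 - eps)     # = 1 - F_V^{-1}(1 - eps)
    print(f"eps = {eps:4.2f}  ->  C_eps = Cbar_eps ~ {C_eps:.4f} bits")
print("capacity C = lim_{eps->0} C_eps ~", 1 - np.quantile(V, 1.0))  # ~ 0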

Example 5.16 Let W̃1 , W̃2 , . . . consist of the channel in Example 5.14, and let
Ŵ1 , Ŵ2 , . . . consist of the channel in Example 5.15. Define a new channel W as
follows:
W2i = W̃i and W2i−1 = Ŵi for i = 1, 2, · · · .
As in the previous examples, the channel is symmetric, and a Bernoulli(1/2)
input maximizes the inf/sup-information rates. Therefore for a Bernoulli(1/2)
input X, we have
\[
\Pr\left[\frac{1}{n}\log\frac{P_{W^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \le \theta\right]
= \begin{cases}
\Pr\left[\dfrac{1}{2i}\left(\log\dfrac{P_{\tilde{W}^i}(Y^i|X^i)}{P_{Y^i}(Y^i)} + \log\dfrac{P_{\hat{W}^i}(Y^i|X^i)}{P_{Y^i}(Y^i)}\right) \le \theta\right], & \text{if } n = 2i;\\[2.5ex]
\Pr\left[\dfrac{1}{2i+1}\left(\log\dfrac{P_{\tilde{W}^i}(Y^i|X^i)}{P_{Y^i}(Y^i)} + \log\dfrac{P_{\hat{W}^{i+1}}(Y^{i+1}|X^{i+1})}{P_{Y^{i+1}}(Y^{i+1})}\right) \le \theta\right], & \text{if } n = 2i+1;
\end{cases}
\]
\[
= \begin{cases}
1 - \Pr\left[-\dfrac{1}{i}\log P_{Z^i}(Z^i) < 1 - 2\theta + \dfrac{1}{i}J(i)\right], & \text{if } n = 2i;\\[2.5ex]
1 - \Pr\left[-\dfrac{1}{i+1}\log P_{Z^{i+1}}(Z^{i+1}) < 1 - \left(2 - \dfrac{1}{i+1}\right)\theta + \dfrac{1}{i+1}J(i)\right], & \text{if } n = 2i+1.
\end{cases}
\]
The fact that −(1/i) log[PZ i (Z i )] converges in distribution to the continuous
random variable V := hb (U), where U is beta-distributed with parameters
(ρ/δ, (1 − ρ)/δ), and the fact that
\[
\liminf_{n\to\infty}\frac{1}{n}J(n) = \frac{1}{3} \quad\text{and}\quad \limsup_{n\to\infty}\frac{1}{n}J(n) = \frac{2}{3}
\]
imply that
\[
i(\theta) := \liminf_{n\to\infty}\Pr\left[\frac{1}{n}\log\frac{P_{W^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \le \theta\right] = 1 - F_V\!\left(\frac{5}{3} - 2\theta\right),
\]
and
\[
\bar{i}(\theta) := \limsup_{n\to\infty}\Pr\left[\frac{1}{n}\log\frac{P_{W^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \le \theta\right] = 1 - F_V\!\left(\frac{4}{3} - 2\theta\right).
\]

Consequently,
\[
\bar{C}_\varepsilon = \frac{5}{6} - \frac{1}{2}F_V^{-1}(1-\varepsilon) \quad\text{and}\quad C_\varepsilon = \frac{2}{3} - \frac{1}{2}F_V^{-1}(1-\varepsilon).
\]
Thus
\[
0 < C = \frac{1}{6} < \bar{C} = \frac{1}{3} < C_{SC} = \frac{5}{6} < \log_2|\mathcal{Y}| = 1.
\]

Bibliography

[1] F. Alajaji and T. Fuja, “A communication channel modeled on contagion,”


IEEE Trans. Inf. Theory, vol. 40, no. 6, pp. 2035-2041, Nov. 1994.

[2] F. Alajaji and P.-N. Chen. An Introduction to Single-User Information The-


ory, Springer, 2018.

[3] P. Billingsley. Probability and Measure, 2nd edition, Wiley, New York, 1995.

[4] R. E. Blahut. Principles and Practice of Information Theory, Addison Wes-


ley, Massachusetts, 1988.

[5] V. Blinovsky. Asymptotic Combinatorial Coding Theory. Kluwer Academic,


1997.

[6] J. A. Bucklew. Large Deviation Techniques in Decision, Simulation, and


Estimation, Wiley, New York, 1990.

[7] P.-N. Chen, “General formulas for the Neyman-Pearson type-II error ex-
ponent subject to fixed and exponential type-I error bound,” IEEE Trans.
Inf. Theory, vol. 42, no. 1, pp. 316-323, Jan 1996.

[8] P.-N. Chen, “Generalization of Gärtner-Ellis theorem,” IEEE Trans. Inf.


Theory, vol. 46, no. 7, pp. 2752-2760, Nov. 2000.

[9] P.-N. Chen and F. Alajaji, “Strong converse, feedback capacity and hy-
pothesis testing,” Proc. Conf. Inf. Sciences Systems, Johns Hopkins Univ.,
Baltimore, Mar. 1995.

[10] P.-N. Chen and F. Alajaji, “Generalization of information measures,” Proc.


Int. Symp. Inf. Theory & Applications, Victoria, Canada, Sep. 1996.

[11] P.-N. Chen and F. Alajaji, “Generalized source coding theorems and hy-
pothesis testing (Parts I and II),” J. Chin. Inst. Eng., vol. 21, no. 3, pp. 283-
303, May 1998.

[12] P.-N. Chen and F. Alajaji, “On the optimistic capacity of arbitrary chan-
nels,” in Proc. IEEE Int. Symp. Inf. Theory, Cambridge, MA, Aug. 1998.

[13] P.-N. Chen and F. Alajaji, “Optimistic Shannon coding theorems for arbi-
trary single-user systems,” IEEE Trans. Inf. Theory, vol. 45, no. 7, pp. 2623-
2629, Nov. 1999.

[14] P.-N. Chen and F. Alajaji, “A generalized Poor-Verdú error bound for mul-
tihypothesis testing,” IEEE Trans. Inf. Theory, vol. 58, no. 1, pp. 311-316,
Jan. 2012.

[15] P.-N. Chen and A. Papamarcou, “New asymptotic results in parallel dis-
tributed detection,” IEEE Trans. Inf. Theory, vol. 39, no. 6, pp. 1847-1863,
Nov. 1993.

[16] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete
Memoryless Systems, Academic Press, New York, 1981.

[17] J.-D. Deuschel and D. W. Stroock, Large Deviations, Academic Press, San
Diego, 1989.

[18] R. L. Dobrushin, “General formulation of Shannon’s basic theorems of infor-


mation theory,” AMS Translations, vol. 33, pp. 323-438, AMS, Providence,
RI, 1963.

[19] T. Ericson and V. A. Zinoviev, “An improvement of the Gilbert bound for
constant weight codes,” IEEE Trans. Inf. Theory, vol. 33, no. 5, pp. 721-723,
Sep. 1987.

[20] W. Feller, An Introduction to Probability Theory and its Applications, 2nd


edition, Wiley, New York, 1970.

[21] J. A. Fill and M. J. Wichura, “The convergence rate for the strong law
of large numbers: General lattice distributions,” Probab. Th. Rel. Fields,
vol. 81, pp. 189-212, 1989.

[22] R. G. Gallager. Information Theory and Reliable Communications, Wiley,


1968.

[23] G. van der Geer and J. H. van Lint. Introduction to Coding Theory and
Algebraic Geometry. Birkhauser, Basel, 1988.

[24] R. M. Gray, Entropy and Information Theory, Springer-Verlag, New York,


1990.

[25] T. S. Han, Information-Spectrum Methods in Information Theory, Springer,
2003.
[26] T. S. Han and S. Verdú, “Approximation theory of output statistics,” IEEE
Trans. Inf. Theory, vol. 39, no. 3, pp. 752-772, May 1993.
[27] A. N. Kolmogorov and S. V. Fomin. Introductory Real Analysis. Dover Pub-
lications, New York, 1970.
[28] D. E. Knuth and A. C. Yao, “The complexity of random number gener-
ation,” in Proc. Symp. Algorithms and Complexity: New Directions and
Recent Results, Academic Press, New York, 1976.
[29] J. H. van Lint, Introduction to Coding Theory. 2nd edition, Springer-Verlag,
New York, 1992.
[30] S. N. Litsyn and M. A. Tsfasman, “A note on lower bounds,” IEEE Trans.
Inf. Theory, vol. 32, no. 5, pp. 705-706, Sep. 1986.
[31] J. K. Omura, “On general Gilbert bounds,” IEEE Trans. Inf. Theory,
vol. 19, no. 5, pp. 661-666, Sep. 1973.
[32] M. S. Pinsker, Information and Information Stability of Random Variables
and Processes, Holden-Day, 1964.
[33] G. Pólya, “Sur quelques points de la théorie des probabilités,” Ann. Inst. H. Poincaré, vol. 1, pp. 117-161, 1931.
[34] H. V. Poor and S. Verdú, “A lower bound on the probability of error in
multihypothesis testing,” IEEE Trans. Inf. Theory, vol. 41, no. 6, pp. 1992-
1994, Nov. 1995.
[35] H. L. Royden. Real Analysis, 3rd edition, Macmillan, New York, 1988.
[36] Y. Steinberg and S. Verdú, “Simulation of random processes and rate-
distortion theory,” IEEE Trans. Inf. Theory, vol. 42, no. 1, pp. 63-86,
Jan. 1996.
[37] Y. Steinberg, “New converses in the theory of identification via channels,”
IEEE Trans. Inf. Theory, vol. 44, no. 3, pp. 984-998, May 1998.
[38] M. A. Tsfasman and S. G. Vladut. Algebraic-Geometric Codes, Kluwer Aca-
demic, Netherlands, 1991.
[39] J. N. Tsitsiklis, “Decentralized detection by a large number of sensors,”
Mathematics of Control, Signals and Systems, vol. 1, no. 2, pp. 167-182,
1988.

[40] G. Vazquez-Vilar, A. Tauste Campo, A. Guillen i Fabregas, and A. Mar-
tinez, “Bayesian M-ary hypothesis testing: The meta-converse and Verdú-
Han bounds are tight,” IEEE Trans. Inf. Theory, vol. 62, no. 5, pp. 2324-
2333, May 2016.

[41] S. Vembu, S. Verdú and Y. Steinberg, “The source-channel separation theo-


rem revisited,” IEEE Trans. Inf. Theory, vol. 41, no. 1, pp. 44-54, Jan. 1995.

[42] S. Verdú and T. S. Han, “A general formula for channel capacity,” IEEE
Trans. Inf. Theory, vol. 40, no. 4, pp. 1147-1157, Jul. 1994.

[43] S. G. Vladut, “An exhaustion bound for algebraic-geometric modular


codes,” Probl. Inf. Transm., vol. 23, pp. 22-34, 1987.

[44] V. A. Zinoviev and S. N. Litsyn, “Codes that exceed the Gilbert bound,”
Probl. Inf. Transm., vol. 21, no. 1, pp. 105-108, 1985.

Index

M-type, 35
  Algorithm for 3-type, 36
  Algorithm for 4-type, 35
δ-inf-divergence rate, 10
δ-inf-entropy rate, 9
δ-inf-information rate, 9
δ-sup-divergence rate, 10
δ-sup-entropy rate, 9
δ-sup-information rate, 9
ε-achievable data compression rate, 24
ε-achievable rate, 70
  optimistic, 88
ε-achievable resolution, 37
ε-achievable resolution rate, 37
ε-achievable source coding rate
  optimistic, 84
ε-capacity, 70
  Example, 76
  optimistic, 88, 89
  theorem, 71
ε-mean-achievable resolution rate, 38
ε-mean-resolvability, 39
  for input and channel, 80
ε-resolvability, 37
  for input and channel, 79
ε-source compression rate
  fixed-length codes, 48
AEP, 24
  generalized, 30
arbitrary statistics system, 1
arbitrary system with memory, 2
average probability of error, 59
average-case complexity for generating a random variable, 35
capacity, 59
  ε-capacity, 70
  definition, 70
  Example, 73
  for arbitrary channels, 58
  general formula, 72
  optimistic, 89
  strong, 70, 73
channel capacity, 59
channels
  general models, 57
data processing lemma, 14
data transmission code
  fixed length, 59
entropy, 1
  entropy rate, 1
entropy rate, 1
  inf-entropy rate, 9
  sup-entropy rate, 9
Feinstein’s lemma, 60
fixed-length
  data transmission code, 59
general lossless data compression theorem, 23
general optimistic source coding theorem, 84
general source coding theorem, 26
  optimistic, 84
generalized AEP theorem, 31
generalized divergence measure, 12
generalized entropy measure, 10
generalized information measure, 1, 8
  δ-inf-divergence rate, 10
  δ-inf-entropy rate, 9
  δ-inf-information rate, 9
  δ-sup-divergence rate, 10
  δ-sup-entropy rate, 9
  δ-sup-information rate, 9
  examples, 17
  normalized entropy density, 8
  normalized information density, 9
  normalized log-likelihood ratio, 10
  properties, 11
generalized mutual information measure, 11
inf-entropy rate, 9
inf-information rate, 58
inf-spectrum, 58
information density, 58
information stable channels, 89
information stable sources, 28, 86
information unstable channels, 92
liminf in probability, 4
limsup in probability, 5
mean-resolvability, 39
  equivalence to capacity, 81
  for channel, 80
  for input and channel, 80
  operational meanings, 39
  variable-length codes minimum source coding rate, 51
normalized entropy density, 8
normalized information density, 9
normalized log-likelihood ratio, 10
optimality of independent inputs for δ-inf-information rate, 14
optimistic ε-achievable rate, 88
optimistic ε-achievable source coding rate, 84
optimistic ε-capacity, 88, 89
optimistic capacity, 89
  information stable channel, 89
optimistic source coding rate, 84
  information stable source, 86
Polya-contagion channel, 92
Poor-Verdú lemma
  example, 67, 68
  generalized, 63
probability of an event evaluated under a distribution, iii
quantile, 2, 3, 5
  properties, 5
resolution, 35
  ε-achievable resolution rate, 37
  ε-mean-achievable resolution rate, 38
resolvability, 30, 34, 38
  ε-mean-resolvability, 39
  ε-resolvability, 37
  equality to sup-entropy rate, 41
  equivalence to strong capacity, 81
  for channel, 80
  for input and channel, 80
  mean-resolvability, 39
  minimum source coding rate for fixed-length codes, 49
  operational meanings, 39
source coding, 47
source compression rate
  ε-source compression rate, 48
  fixed-length codes, 48
  variable-length codes, 48
source-channel separation theorem, 83
spectrum, 2
  inf-spectrum, 2, 58
  sup-spectrum, 2, 58
stationary-ergodic source, 29, 33
strong capacity, 70, 73
  Example, 75
strong converse theorem, 29
sup-entropy rate, 9
sup-information rate, 58
sup-spectrum, 58
variational distance, 36
  bound for entropy difference, 45
  bound for log-likelihood ratio spectrum, 40
weak converse theorem, 29
weakly δ-typical set, 30, 32
worst-case complexity for generating a random variable, 35
