0% found this document useful (0 votes)
15 views

Information Theory For Single-User Systems With Arbitrary Statistical Memory

This document provides a 3-paragraph summary of a technical research paper on information theory for single-user systems with arbitrary statistical memory. The paper introduces generalized information measures that can be used to analyze discrete-time single-user stochastic systems that are not necessarily stationary, ergodic or information stable. These measures include the information spectrum, quantiles, and extensions of entropy, mutual information and divergence. The paper will cover topics such as lossless and lossy data compression theorems, measures of randomness and resolvability, channel coding theorems, and multi-terminal information theory using these generalized information measures. The paper is intended as a sequel to the authors' previous textbook on introductory information theory. It

Uploaded by

Tao-Wei Huang
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views

Information Theory For Single-User Systems With Arbitrary Statistical Memory

This document provides a 3-paragraph summary of a technical research paper on information theory for single-user systems with arbitrary statistical memory. The paper introduces generalized information measures that can be used to analyze discrete-time single-user stochastic systems that are not necessarily stationary, ergodic or information stable. These measures include the information spectrum, quantiles, and extensions of entropy, mutual information and divergence. The paper will cover topics such as lossless and lossy data compression theorems, measures of randomness and resolvability, channel coding theorems, and multi-terminal information theory using these generalized information measures. The paper is intended as a sequel to the authors' previous textbook on introductory information theory. It

Uploaded by

Tao-Wei Huang
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 111

Information Theory for Single-User Systems

with Arbitrary Statistical Memory

by

Fady Alajaji† and Po-Ning Chen‡


Department of Mathematics & Statistics,
Queen’s University, Kingston, ON K7L 3N6, Canada
Email: [email protected]


Department of Electrical & Computer Engineering
National Chiao Tung University
1001, Ta Hsueh Road
Hsin Chu, Taiwan 300
Republic of China
Email: [email protected]

February 21, 2019



c Copyright by
Fady Alajaji† and Po-Ning Chen‡
February 21, 2019
Preface

The reliable transmission of information bearing signals over a noisy commu-


nication channel is at the heart of what we call communication. Information
theory—founded by Claude E. Shannon in 1948—provides a mathematical frame-
work for the theory of communication; it describes the fundamental limits to how
efficiently one can encode information and still be able to recover it with negli-
gible loss.
This book is a sequel to our previous introductory textbook on information
theory [2]. It covers advanced topics concerning the information theoretic limits
of discrete-time single-user stochastic systems with arbitrary statistical memory
(i.e., systems that are not necessarily stationary, ergodic or information stable).
The topics are studied using the information spectrum methods, and extensions
thereof, established by T.S. Han in his pioneering textbook [25]. What follows
is a tentative list of topics to be covered.
1. General information measure: Information spectrum and Quantile and
their properties.
2. Advanced topics of losslesss data compression: Fixed-length lossless data
compression theorem for arbitrary channels, variable-length lossless data
compression theorem for arbitrary channels.
3. Measure of randomness and resolvability: Resolvability and source coding,
approximation of output statistics for arbitrary channels.
4. Advanced topics of channel coding: Channel capacity for arbitrary single-
user channel, optimistic Shannon coding theorem, strong capacity, ε-capacity.
5. Advanced topics of lossy data compression
6. Hypothesis testing: Error exponent and divergence, large deviations the-
ory, Berry-Esseen theorem.
7. Channel reliability: Random coding exponent, expurgated exponent, par-
titioning exponent, sphere-packing exponent, the asymptotic largest min-
imum distance of block codes, Elias bound, Varshamov-Gilbert bound,
Bhattacharyya distance.

ii
8. Information theory of networks: Distributed detection, data compression
over distributed source, capacity of multiple access channels, degraded
broadcast channel, Gaussian multiple terminal channels.
The mathematical background on which these topics are based can be found
in Appendices A and B of [2].

Notation
For a discrete random variable X with alphabet X , we use PX to denote its
distribution. For convenience, we will use interchangeably the two expressions
for the probability of the elementary event [X = x]: Pr[X = x] and PX (x).
Similarly, the probability of a set characterized by an inequality, such as f (x) <
a, will be expressed by either
PX {x ∈ X : f (x) < a}
or
Pr [f (X) < a] .
In the second expression, we view f (X) as a new random variable defined through
X and a function f (·).
Obviously, the above expressions can be applied to any legitimate function
f (·) defined over X , including any probability function PX̂ (·) (or log PX̂ (x)) of a
random variable X̂. Therefore, the next two expressions denote the probability
of f (x) = PX̂ (x) < a evaluated under distribution PX :
PX {x ∈ X : f (x) < a} = PX {x ∈ X : PX̂ (x) < a}
and
Pr [f (X) < a] = Pr [PX̂ (X) < a] .
As a result, if we write
 
PX̂,Ŷ (x, y)
PX,Y (x, y) ∈ X × Y : log <a
PX̂ (x)PŶ (y)
 
PX̂,Ŷ (X, Y )
= Pr log <a ,
PX̂ (X)PŶ (Y )
we mean that we have defined a new function
PX̂,Ŷ (x, y)
f (x, y) := log
PX̂ (x)PŶ (y)
in terms of the joint distribution PX̂,Ŷ and its two marginal distributions, and
that we are interested in the probability of f (X, Y ) < a where X and Y have
joint distribution PX,Y .

iii
Notes to readers
In this book, all the assumptions, claims, conjectures, corollaries, definitions, ex-
amples, exercises, lemmas, observations, properties, and theorems are numbered
under the same counter for ease of their searching. For example, the lemma that
immediately follows Theorem 2.1 will be numbered as Lemma 2.2, instead of
Lemma 2.1.

iv
Acknowledgements

Thanks are given to our families for their full support during the period of
our writing the book.

v
Table of Contents

Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Notes to readers . . . . . . . . . . . . . . . . . . . . . . . . iv

Chapter Page

List of Tables viii

List of Figures ix

1 Generalized Information Measures for Arbitrary Systems with


Memory 1
1.1 Spectrums and quantiles . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Properties of quantiles . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Generalized information measures . . . . . . . . . . . . . . . . . . 8
1.4 Properties of generalized information measures . . . . . . . . . . . 11
1.5 Examples for the computation of general information measures . . 17

2 General Data Compression Theorems 23


2.1 Fixed-length data compression codes for arbitrary sources . . . . . 24
2.2 Generalized AEP theorem . . . . . . . . . . . . . . . . . . . . . . 30

3 Measure of Randomness for Stochastic Processes 34


3.1 Motivation for resolvability : measure of randomness of random
variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Notations and definitions regarding to resolvability . . . . . . . . 35
3.3 Operational meanings of resolvability and mean-resolvability . . . 39
3.4 Resolvability, mean-resolvability and source coding . . . . . . . . 47

4 Channel Coding Theorems and Approximations of Output Sta-


tistics for Arbitrary Channels 57
4.1 General channel models . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Channel coding and Feinstein’s lemma . . . . . . . . . . . . . . . 59
4.3 Error bounds for multihypothesis testing . . . . . . . . . . . . . . 63

vi
4.4 Capacity formulas for general channels . . . . . . . . . . . . . . . 70
4.5 Examples for the general capacity formulas . . . . . . . . . . . . . 73
4.6 Capacity and resolvability for channels . . . . . . . . . . . . . . . 77

5 Optimistic Shannon Coding Theorems for Arbitrary Single-User


Systems 82
5.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2 Optimistic source coding theorems . . . . . . . . . . . . . . . . . 83
5.3 Optimistic channel coding theorems . . . . . . . . . . . . . . . . . 88
5.4 Examples for the computations of capacity and strong capacity . . 91
5.4.1 Information stable channels . . . . . . . . . . . . . . . . . 91
5.4.2 Information unstable channels . . . . . . . . . . . . . . . . 92

vii
List of Tables

Number Page

1.1 Generalized entropy measures where δ ∈ [0, 1]. . . . . . . . . . . 10


1.2 Generalized mutual information measures where δ ∈ [0, 1]. . . . . 11
1.3 Generalized divergence measures where δ ∈ [0, 1]. . . . . . . . . . 12

viii
List of Figures

Number Page

1.1 The asymptotic CDFs (spectrums) of a sequence of random vari-


ables {An }∞n=1 and their quantiles: ū(·) = sup-spectrum of {An },
u(·) = inf-spectrum of {An }, U δ = quantile of ū(·), Ūδ = quantile
¯ ¯
of u(·), U = limδ↓0 U δ = U 0 , U 1− = limξ↑1 U ξ , Ū = limδ↑1 Ūδ . . . . 6
¯ ¯ ¯ ¯ ¯ ¯
1.2 The spectrum h(θ) for Example 1.8. . . . . . . . . . . . . . . . . . 20
¯
1.3 The spectrum ī(θ) for Example 1.8. . . . . . . . . . . . . . . . . . 20
1.4 The limiting spectrums of (1/n)hZ n (Z n ) for Example 1.9 . . . . . 21
1.5 The possible limiting spectrums of (1/n)iX n ,Y n (X n ; Y n ) for Ex-
ample 1.9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.1 Behavior of the probability of block decoding error as blocklength


n goes to infinity for an arbitrary source X. . . . . . . . . . . . . 30
2.2 Illustration of generalized AEP Theorem. Fn (δ; ε) := Tn [H̄ε (X) +
δ] \ Tn [H̄ε (X) − δ] is the dashed region. . . . . . . . . . . . . . . . 32

3.1 Source generator: {Xt }0<t<1 is an independent random process


with PXt (0) = t and PXt (1) = 1 − t, and is also independent of
the selector Z, where Xt is outputted if Z = t. Source generator
of each time instance is independent temporally. . . . . . . . . . . 54
3.2 The ultimate CDF of −(1/n) log PX n (X n ): Pr{hb (Z) ≤ t}. . . . . 56

4.1 The ultimate CDFs of −(1/n) log PN n (N n ). . . . . . . . . . . . . 75


4.2 The ultimate CDF of the normalized information density for Ex-
ample 4.18-Case B). . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3 The communication system. . . . . . . . . . . . . . . . . . . . . . 79
4.4 The simulated communication system. . . . . . . . . . . . . . . . 79

ix
Chapter 1

Generalized Information Measures for


Arbitrary Systems with Memory

The entropy of a discrete random variable X [2], defined by



H(X) := − PX (x) log PX (x) = EX [− log PX (X)] nats,
x∈X

is a measure of the average amount of uncertainty in X. An extened notion


of entropy for a sequence of random variables X1 , X2 , . . . , Xn , . . . is the entropy
rate, which is given by
1 1
lim H(X n ) = lim E [− log PX n (X n )] ,
n→∞ n n→∞ n

assuming the limit exists. The above quantities have an operational significance
established via Shannon’s source coding theorems when the stochastic systems
under consideration satisfy certain regularity conditions, such as stationarity and
ergodicity [24, 42]. However, in more complicated situations such as when the
systems have a time-varying nature and are non-stationary, these information
rates are no longer valid and lose their operational significance. This results in
the need to establish new entropy measures which appropriately characterize the
operational limits of arbitrary stochastic systems.
Let us begin with the model of an arbitrary system with memory. In gen-
eral, there are two indices for random variables or observations: a time in-
dex and a space index. When a sequence of random variables is denoted by
X1 , X2 , . . . , Xn , . . ., the subscript i of Xi can be treated as either a time index
or a space index, but not both. Hence, when a sequence of random variables is
a function of both time and space, the notation of X1 , X2 , . . . , Xn , . . ., is by no
means sufficient; and therefore, a new model for a general time-varying source,
such as
(n) (n) (n)
X1 , X2 , . . . , Xt , . . . ,

1
where t is the time index and n is the space or position index (or vice versa),
becomes significant.
When block-wise (fixed-length) compression of such source (with blocklength
n) is considered, the same question as for the compression of an i.i.d. source
arises:
what is the minimum compression rate (say in bits per
source sample) for which the probability of error can be (1.0.1)
made arbitrarily small as the blocklength goes to infinity?
To answer this question, information theorists have to find a sequence of data
compression codes for each blocklength n and investigate if the decompression
error probability goes to zero as n approaches infinity. However, unlike the simple
source models such as discrete memorylessness [2], the source being arbitrary
may exhibit distinct statistics for each blocklength n; e.g., for
(1)
n = 1 : X1
(2) (2)
n = 2 : X1 , X2
(3) (3) (3)
n = 3 : X1 , X2 , X3
(4) (4) (4) (4)
n = 4 : X1 , X2 , X3 , X4 (1.0.2)
..
.
(4) (1) (2) (3)
the statistics of X1 could be different from X1 , X1 and X1 (i.e., the source
statistics are not necessarily consistent). Since the model in question (1.0.1) is
general, and the system statistics can be arbitrarily defined, it is therefore named
an arbitrary system with memory.
The triangular array of random variables in (1.0.2) is denoted by a boldface
letter as
X := {X n }∞
n=1 ,

where X n := (X1 , X2 , . . . , Xn ); for convenience, we also write


(n) (n) (n)

   ∞
X := X n = X1 , X2 , . . . , Xn(n)
(n) (n)
.
n=1

In this chapter, we will first introduce a new concept on defining information


measures for arbitrary systems and then, analyze in detail their algebraic prop-
erties. In the next chapter, we will utilize the new measures to establish general
source coding theorems for arbitrary finite-alphabet sources.

1.1 Spectrums and quantiles

Definition 1.1 (Inf/sup-spectrum) If {An }∞ n=1 is a sequence of random vari-


ables, then its inf-spectrum u(·) and its sup-spectrum ū(·) are respectively defined
¯

2
by
u(θ) := lim inf Pr{An ≤ θ}
¯ n→∞
and
ū(θ) := lim sup Pr{An ≤ θ},
n→∞
respectively, where θ ∈ R.

In other words, u(·) and ū(·) are respectively the liminf and the limsup of
¯
the cumulative distribution function (CDF) of An . Note that by definition, the
CDF of An — Pr{An ≤ θ} — is non-decreasing and right-continuous. However,
for u(·) and ū(·), only the non-decreasing property remains.1
¯
Definition 1.2 (Quantile of inf/sup-spectrum) For any 0 ≤ δ ≤ 1, the
quantile U δ of the sup-spectrum ū(·) and the quantile Ūδ of the inf-spectrum u(·)
¯ ¯
are defined by2
U δ := sup{θ : ū(θ) ≤ δ}
¯
and
Ūδ := sup{θ : u(θ) ≤ δ},
¯
respectively.
It follows from the above definitions that U δ and Ūδ are right-continuous and
¯
non-decreasing in δ. Note that the supremum of an empty set is defined to be
−∞.

Observation 1.3 Generally speaking, one can define “quantile” in four different
ways. For example, for the quantile of the inf-spectrum, we can define:
Ūδ := sup{θ : lim inf Pr[An ≤ θ] ≤ δ}
n→∞
Ūδ− := sup{θ : lim inf Pr[An ≤ θ] < δ}
n→∞
Ūδ+ := sup{θ : lim inf Pr[An < θ] ≤ δ}
n→∞
Ūδ+− := sup{θ : lim inf Pr[An < θ] < δ}.
n→∞

1
It is pertinent to also point out that even if we do not require right-continuity as a funda-
mental property of a CDF, the spectrums u(·) and ū(·) are not necessarily legitimate CDFs of
¯
(conventional real-valued) random variables since there might exist cases where the “probabi-
lity mass escapes to infinity” (cf. [3, pp. 346]). A necessary and sufficient condition for u(·)
¯
and ū(·) to be conventional CDFs (without requiring right-continuity) is that the sequence
of distribution functions of An is tight [3, pp. 346]. Tightness is actually guaranteed if the
alphabet of An is finite.
2
Note that the usual definition of the quantile function φ(δ) of a non-decreasing function
F (·) is slightly different from our definition [3, pp. 190], where φ(δ) := sup{θ : F (θ) < δ}.
Remark that if F (·) is strictly increasing, then the quantile is nothing but the inverse of F (·):
φ(δ) = F −1 (δ).

3
The general relations between these four quantities are as follows:
Ūδ− = Ūδ+− ≤ Ūδ = Ūδ+ .
Obviously, Ūδ− ≤ Ūδ ≤ Ūδ+ and Ūδ− ≤ Ūδ+− by their definitions. It remains to
show that Ūδ+− ≤ Ūδ , that Ūδ+ ≤ Ūδ and that Ūδ+− ≤ Ūδ− .
Suppose Ūδ+− > Ūδ + γ for some γ > 0. Then by definition of Ūδ+− ,
lim inf Pr[An < Ūδ + γ] < δ,
n→∞

which implies that


lim inf Pr[An ≤ Ūδ + γ/2] ≤ lim inf Pr[An < Ūδ + γ] < δ ≤ δ
n→∞ n→∞

and violates the definition of Ūδ . This completes the proof of Ūδ+− ≤ Ūδ . To prove
that Ūδ+ ≤ Ūδ , note that from the definition of Ūδ+ , we have that for any ε > 0,
lim inf n→∞ Pr[An < Ūδ+ − ε] ≤ δ and hence lim inf n→∞ Pr[An ≤ Ūδ+ − 2ε] ≤ δ,
implying that Ūδ+ − 2ε ≤ Ūδ . The latter yields that Ūδ+ ≤ inf ε>0 [Ūδ+ − 2ε] = Ūδ ,
which completes the proof. Proving that Ūδ+− ≤ Ūδ− follows a similar argument.
It is worth noting that Ūδ− = limξ↑δ Ūξ . Their equality can be proved by
first observing that Ūδ− ≥ limξ↑δ Ūξ by their definitions, and then assuming that
γ := Ūδ− − limξ↑δ Ūξ > 0. Then Ūξ < Ūξ + γ/2 ≤ (Ūδ− ) − γ/2 implies that
u((Ūδ− ) − γ/2) > ξ for ξ arbitrarily close to δ from below, which in turn implies
¯
u((Ūδ− ) − γ/2) ≥ δ, contradicting to the definition of Ūδ− . Throughout, we will
¯
interchangeably use Ūδ− and limξ↑δ Ūδ for convenience.
The final note in this observation is that Ūδ+− and Ūδ+ will not be used in
defining our general information measures. They are introduced only for math-
ematical interests. 2

Based on the above definitions, the liminf in probability U of {An }∞n=1 [26],
¯
which is defined as the largest extended real number such that for all ξ > 0,
lim Pr[An ≤ U − ξ] = 0,
n→∞ ¯
satisfies3
U = lim U δ = U 0 .
¯ δ↓0 ¯ ¯
3
It is obvious from their definitions that
lim U δ ≥ U 0 ≥ U .
δ↓0 ¯ ¯ ¯
The equality of limδ↓0 U δ and U can be proved by contradiction by first assuming
¯ ¯
γ := lim U δ − U > 0.
δ↓0 ¯ ¯
Then ū(U + γ/2) ≤ δ for arbitrarily small δ > 0, which immediately implies ū(U + γ/2) = 0,
¯ ¯
contradicting to the definition of U .
¯

4
Also, the limsup in probability Ū (cf. [26]), defined as the smallest extended real
number such that for all ξ > 0,

lim Pr[An ≥ Ū + ξ] = 0,
n→∞

is exactly4
Ū = lim Ūδ = sup{θ : u(θ) < 1}.
δ↑1 ¯
It readily follows from the above definitions that

U ≤ U δ ≤ Ūδ ≤ Ū
¯ ¯
for δ ∈ [0, 1).
Remark that U δ and Ūδ always exist. Furthermore, if U δ = Ūδ for all δ
¯ ¯
in [0, 1], the sequence of random variables An converges in distribution to a
random variable A, provided the sequence of An is tight. Finally, for a better
understanding of the quantities defined above, we depict them in Figure 1.1.

1.2 Properties of quantiles

Lemma 1.4 Consider two random sequences, {An }∞ ∞


n=1 and {Bn }n=1 . Let ū(·)
and u(·) be respectively the sup- and inf-spectrums of {An }∞ n=1 . Similarly, let
¯
v̄(·) and v(·) denote respectively the sup- and inf-spectrums of {Bn }∞ n=1 . Define

U δ and Ūδ be the quantiles of the sup- and inf-spectrums of {An }n=1; also define
¯
Vδ and V̄δ be the quantiles of the sup- and inf-spectrums of {Bn }∞ n=1 .
Now let (u + v)(·) and (u + v)(·) denote the sup- and inf-spectrums of sum
sequence {An + Bn }∞ n=1 , i.e.,

(u + v)(θ) := lim sup Pr{An + Bn ≤ θ},


n→∞

4
Since 1 = limn→∞ Pr{An < Ū + ξ} ≤ limn→∞ Pr{An ≤ Ū + ξ} = u(Ū + ξ), it is straight-
¯
forward that
Ū ≥ sup{θ : u(θ) < 1} = lim Ūδ .
¯ δ↑1

The equality of Ū and limδ↑1 Ūδ can be proved by contraction by first assuming that

γ := Ū − lim Ūδ > 0.


δ↑1

Then 1 ≥ u(Ū −γ/2) > δ for δ arbitrarily close to 1, which implies u(Ū −γ/2) = 1. Accordingly,
¯ ¯
by
1 ≥ lim inf Pr{An < Ū − γ/4} ≥ lim inf Pr{An ≤ Ū − γ/2} = u(Ū − γ/2) = 1,
n→∞ n→∞ ¯
we obtain the desired contradiction.

5
6

ū(·) u(·)
¯
δ

0 -
U Ū0 Uδ Ūδ U 1− Ū
¯ ¯ ¯
Figure 1.1: The asymptotic CDFs (spectrums) of a sequence of random
variables {An }∞n=1 and their quantiles: ū(·) = sup-spectrum of {An },
u(·) = inf-spectrum of {An }, U δ = quantile of ū(·), Ūδ = quantile of
¯ ¯
u(·), U = limδ↓0 U δ = U 0 , U 1− = limξ↑1 U ξ , Ū = limδ↑1 Ūδ .
¯ ¯ ¯ ¯ ¯ ¯

and
(u + v)(θ) := lim inf Pr{An + Bn ≤ θ}.
n→∞

Again, define (U + V )δ and (U + V )δ be the quantiles with respect to (u + v)(·)


and (u + v)(·).
Then the following statements hold.

1. U δ and Ūδ are both non-decreasing and right-continuous functions of δ for


¯
δ ∈ [0, 1].

2. limδ↓0 U δ = U 0 and limδ↓0 Ūδ = Ū0 .


¯ ¯
3. For δ ≥ 0, γ ≥ 0, and δ + γ ≤ 1,

(U + V )δ+γ ≥ U δ + Vγ , (1.2.1)
¯
and
(U + V )δ+γ ≥ U δ + V̄γ . (1.2.2)
¯
4. For δ ≥ 0, γ ≥ 0, and δ + γ ≤ 1,

(U + V )δ ≤ U δ+γ + V̄(1−γ) , (1.2.3)


¯

6
and
(U + V )δ ≤ Ūδ+γ + V̄(1−γ) . (1.2.4)

Proof: The proof of Property 1 follows directly from the definitions of U δ and
¯
Ūδ and the fact that the inf-spectrum and the sup-spectrum are non-decreasing
in δ.
The proof of Property 2 can be proved by contradiction as follows. Suppose
limδ↓0 U δ > U 0 + ε for some ε > 0. Then for any δ > 0,
¯ ¯
ū(U 0 + ε/2) ≤ δ.
¯
Since the above inequality holds for every δ > 0, and ū(·) is a non-negative
function, we obtain ū(U 0 + ε/2) = 0, which contradicts to the definition of U 0 .
¯ ¯
We can prove limδ↓0 Ūδ = Ū0 in a similar fashion.
To show (1.2.1), we observe that for α > 0,

lim sup Pr{An + Bn ≤ U δ + Vγ − 2α}


n→∞ ¯
≤ lim sup (Pr{An ≤ U δ − α} + Pr{Bn ≤ Vγ − α})
n→∞ ¯
≤ lim sup Pr{An ≤ U δ − α} + lim sup Pr{Bn ≤ Vγ − α}
n→∞ ¯ n→∞
≤ δ + γ,

which, by definition of (U + V )δ+γ , yields

(U + V )δ+γ ≥ U δ + Vγ − 2α.
¯
The proof is completed by noting that α can be made arbitrarily small.
Similarly, we note that for α > 0,
lim inf Pr{An + Bn ≤ U δ + V̄γ − 2α}
n→∞ ¯
≤ lim inf Pr{An ≤ U δ − α} + Pr{Bn ≤ V̄γ − α}
n→∞ ¯
≤ lim sup Pr{An ≤ U δ − α} + lim inf Pr{Bn ≤ V̄γ − α}
n→∞ ¯ n→∞
≤ δ + γ,

which, by definition of (U + V )δ+γ and for arbitrarily small α, proves (1.2.2).


To show (1.2.3), we first observe that (1.2.3) trivially holds when γ = 1 (and
δ = 0). It remains to prove its validity under γ < 1. Remark from (1.2.1) that

(U + V )δ + (−V )γ ≤ (U + V − V )δ+γ = U δ+γ .


¯
Hence,
(U + V )δ ≤ U δ+γ − (−V )γ .
¯

7
(Note that the case γ = 1 is not allowed here because it results in U 1 = (−V )1 =
¯
∞, and the subtraction between two infinite terms is undefined. That is why we
need to exclude the case of γ = 1 for the subsequent proof.) The proof is then
completed by showing that
−(−V )γ ≤ V̄(1−γ) . (1.2.5)
By definition,
(−v)(θ) := lim sup Pr {−Bn ≤ θ}
n→∞
= 1 − lim inf Pr {Bn < −θ} .
n→∞
Then
V̄(1−γ) := sup{θ : v(θ) ≤ 1 − γ}
= sup{θ : lim inf Pr[Bn ≤ θ] ≤ 1 − γ}
n→∞
≥ sup{θ : lim inf Pr[Bn < θ] < 1 − γ} (cf. footnote 1)
n→∞
= sup{−θ : lim inf Pr[Bn < −θ] < 1 − γ}
n→∞
= sup{−θ : 1 − lim sup Pr[−Bn ≤ θ] < 1 − γ}
n→∞
= sup{−θ : lim sup Pr[−Bn ≤ θ] > γ}
n→∞
= − inf{θ : lim sup Pr[−Bn ≤ θ] > γ}
n→∞
= − sup{θ : lim sup Pr[−Bn ≤ θ] ≤ γ}
n→∞
= − sup{θ : (−v)(θ) ≤ γ}
= −(−V )γ .
Finally, to show (1.2.4), we again note that it is trivially true for γ = 1. We then
observe from (1.2.2) that (U + V )δ + (−V )γ ≤ (U + V − V )δ+γ = Ūδ+γ . Hence,
(U + V )δ ≤ Ūδ+γ − (−V )γ .
Using (1.2.5), we have the desired result. 2

1.3 Generalized information measures

In Definitions 1.1 and 1.2, if we let the random variable An equal the normalized
entropy density5
1 1
hX n (X n ) := − log PX n (X n )
n n
5
The random variable hX n (X n ) := − log PX n (X n ) is named the entropy density. Hence,
the normalized entropy density is equal to hX n (X n ) divided or normalized by the blocklength
n.

8
of an arbitrary source
   ∞
n (n) (n)
X = X = X1 , X2 , . . . , Xn
(n)
,
n=1

we obtain two generalized (spectral) entropy measures for X:

1
δ-inf-entropy rate Hδ (X) = quantile of the sup-spectrum of hX n (X n )
n
1
δ-sup-entropy rate H̄δ (X) = quantile of the inf-spectrum of hX n (X n ).
n

Note that the inf-entropy-rate H(X) and the sup-entropy-rate H̄(X) introduced
in [25, 26] are special cases of the δ-inf/sup-entropy rate measures:

H(X) = H0 (X), and H̄(X) = lim H̄δ (X).


δ↑1

Conceptually, we may imagine that if the random variable (1/n)h(X n ) ex-


hibits a limiting distribution, then the sup-entropy rate is the right-margin of
the support of that limiting distribution. For example, suppose that the limiting
distribution of (1/n)hX n (X n ) is positive over (−2, 2), and zero, otherwise. Then
H̄(X) = 2. Similarly, the inf-entropy rate is the left margin of the support
of the limiting random variable lim supn→∞ (1/n)hX n (X n ), which is −2 for the
same example.
Analogously, for an arbitrary channel
   ∞
n (n)
W = (Y |X) = W = W1 , . . . , Wn (n)
,
n=1

or more specifically, PW = PY |X = {PY n |X n }∞


n=1 with input X and output Y , if
we replace An in Definitions 1.1 and 1.2 by the channel’s normalized information
density

1 1 1 PX n ,Y n (X n , Y n )
iX n W n (X n ; Y n ) = iX n ,Y n (X n ; Y n ) := log ,
n n n PX n (X n )PY n (Y n )

we get the δ-inf/sup-information rates, denoted respectively by I δ (X; Y ) and


¯
I¯δ (X; Y ), as:

1
I δ (X; Y ) = quantile of the sup-spectrum of iX n W n (X n ; Y n )
¯ n
1
I¯δ (X; Y ) = quantile of the inf-spectrum of iX n W n (X n ; Y n ).
n

9
Entropy Measures

system arbitrary source X


1 1
norm. entropy density hX n (X n ) := − log PX n (X n )
n n 
1 n
entropy sup-spectrum h̄(θ) := lim sup Pr hX n (X ) ≤ θ
n→∞ n
 
1 n
entropy inf-spectrum h(θ) := lim inf Pr hX n (X ) ≤ θ
¯ n→∞ n
δ-inf-entropy rate Hδ (X) := sup{θ : h̄(θ) ≤ δ}

δ-sup-entropy rate H̄δ (X) := sup{θ : h(θ) ≤ δ}


¯
sup-entropy rate H̄(X) := limδ↑1 H̄δ (X)

inf-entropy rate H(X) := H0 (X)

Table 1.1: Generalized entropy measures where δ ∈ [0, 1].

Similarly, for a simple hypothesis testing system with arbitrary observation


statistics for each hypothesis,

H 0 : PX
H1 : PX̂

we can replace An in Definitions 1.1 and 1.2 by the normalized log-likelihood ratio

1 1 PX n (X n )
dX n (X n X̂ n ) := log
n n PX̂ n (X n )
to obtain the δ-inf/sup-divergence rates, denoted respectively by Dδ (XX̂) and
D̄δ (XX̂), as

1
Dδ (XX̂) = quantile of the sup-spectrum of dX n (X n X̂ n )
n
1
D̄δ (XX̂) = quantile of the inf-spectrum of dX n (X n X̂ n ).
n
The above information measure definitions are summarized in Tables 1.1–1.3.

10
Mutual Information Measures
arbitrary channel PW = PY |X with input
system
X and output Y
1
norm. information density iX n W n (X n ; Y n )
n
1 PX n ,Y n (X n , Y n )
:= log
n PX n (X n ) × PY n (Y n )
 
1 n n
information sup-spectrum ī(θ) := lim sup Pr iX n W n (X ; Y ) ≤ θ
n→∞ n
 
1 n n
information inf-spectrum i(θ) := lim inf Pr iX n W n (X ; Y ) ≤ θ
n→∞ n
δ-inf-information rate I δ (X; Y ) := sup{θ : ī(θ) ≤ δ}
¯
δ-sup-information rate I¯δ (X; Y ) := sup{θ : i(θ) ≤ δ}

sup-information rate ¯
I(X; Y ) := limδ↑1 I¯δ (X; Y )

inf-information rate I (X; Y ) := I 0 (X; Y )


¯ ¯
Table 1.2: Generalized mutual information measures where δ ∈ [0, 1].

1.4 Properties of generalized information measures

In this section, we present the properties of the general (spectral) information


measures defined in the previous section. We begin with the generalization
of the simple property between mutual information and entropy: I(X; Y ) =
H(Y ) − H(Y |X).
By taking δ = 0 and letting γ ↓ 0 in (1.2.1) and (1.2.3), we obtain
(U + V ) ≥ U 0 + lim Vγ ≥ U + V
¯ γ↓0 ¯
and
(U + V ) ≤ lim U γ + lim V̄(1−γ) = U + V̄ ,
γ↓0 ¯ γ↓0 ¯
which mean that the liminf in probability of a sequence of random variables
An + Bn is upper bounded by the liminf in probability of An plus the limsup in
probability of Bn , and is lower bounded by the sum of the liminfs in probability
of An and Bn . This fact is used in [42] to show that
I (X; Y ) +H(Y |X) ≤H(Y ) ≤ I (X; Y ) + H̄(Y |X),
¯ ¯

11
Divergence Measures

system arbitrary sources X and X̂


1 1 PX n (X n )
norm. log-likelihood ratio dX n (X n X̂ n ) := log
n n PX̂ n (X n )
 
¯ 1 n n
divergence sup-spectrum d(θ) := lim sup Pr dX n (X X̂ ) ≤ θ
n→∞ n
 
1 n n
divergence inf-spectrum d(θ) := lim inf Pr dX n (X X̂ ) ≤ θ
n→∞ n
δ-inf-divergence rate ¯ ≤ δ}
Dδ (XX̂) := sup{θ : d(θ)

δ-sup-divergence rate D̄δ (XX̂) := sup{θ : d(θ) ≤ δ}

sup-divergence rate D̄(XX̂) := limδ↑1 D̄δ (XX̂)

inf-divergence rate D(XX̂) := D0 (XX̂)

Table 1.3: Generalized divergence measures where δ ∈ [0, 1].

or equivalently,

H(Y ) − H̄(Y |X) ≤ I (X; Y ) ≤H(Y ) −H(Y |X).


¯
Other properties of the generalized information measures are summarized in
the next lemma.

Lemma 1.5 For a source X with finite alphabet X and arbitrary sources Y
and Z, the following properties hold.

1. H̄δ (X) ≥ 0 for δ ∈ [0, 1].


(This property also applies to Hδ (X), I¯δ (X; Y ), I δ (X; Y ), D̄δ (XX̂),
¯
and Dδ (XX̂).)
2. I δ (X; Y ) = I δ (Y ; X) and I¯δ (X; Y ) = I¯δ (Y ; X) for δ ∈ [0, 1].
¯ ¯
3. For 0 ≤ δ < 1, 0 ≤ γ < 1 and δ + γ ≤ 1,

I δ (X; Y ) ≤Hδ+γ (Y ) −Hγ (Y |X), (1.4.1)


¯
I δ (X; Y ) ≤ H̄δ+γ (Y ) − H̄γ (Y |X), (1.4.2)
¯

12
I¯γ (X; Y ) ≤ H̄δ+γ (Y ) −Hδ (Y |X), (1.4.3)
I δ+γ (X; Y ) ≥Hδ (Y ) − H̄(1−γ) (Y |X), (1.4.4)
¯
and
I¯δ+γ (X; Y ) ≥ H̄δ (Y ) − H̄(1−γ) (Y |X). (1.4.5)
(Note that the case of (δ, γ) = (1, 0) holds for (1.4.1) and (1.4.2), and the
case of (δ, γ) = (0, 1) holds for (1.4.3), (1.4.4) and (1.4.5).)
(n)
4. 0 ≤Hδ (X) ≤ H̄δ (X) ≤ log |X | for δ ∈ [0, 1), where each Xi takes values
in X for i = 1, . . . , n and n = 1, 2, . . ..

5. I δ (X, Y ; Z) ≥ I δ (X; Z) for δ ∈ [0, 1].


¯ ¯
Proof: Property 1 holds because
 
1 n
Pr − log PX n (X ) < 0 = 0,
n
   
1 PX n (X n ) n n 1 PX n (xn )
Pr log < −ν = PX n x ∈ X : log < −ν
n PX̂ n (X n ) n PX̂ n (xn )

= PX n (xn )
xn ∈X n : PX n (xn )<PX̂ n (xn )e−nν }

≤ PX̂ n (xn )e−nν
xn ∈X n : PX n (xn )<PX̂ n (xn )enν

≤ e−nν · PX̂ n (xn )
xn ∈X n : PX n (xn )<PX̂ n (xn )enν

≤ e−νn , (1.4.6)

and, by following the same procedure as (1.4.6),


 
1 PX n ,Y n (X n , Y n )
Pr log < −ν ≤ e−νn .
n PX n (X n )PY n (Y n )

Property 2 is an immediate consequence of the definition.


To show the inequalities in Property 3, we first remark that
1 1 1
hY n (Y n ) = iX n ,Y n (X n ; Y n ) + hX n ,Y n (Y n |X n ),
n n n
where
1 1
hX n ,Y n (Y n |X n ) := − log PY n |X n (Y n |X n ).
n n

13
With this fact and for 0 ≤ δ < 1, 0 < γ < 1 and δ + γ ≤ 1, (1.4.1) follows
directly from (1.2.1); (1.4.2) and (1.4.3) follow from (1.2.2); (1.4.4) follows from
(1.2.3); and (1.4.5) follows from (1.2.4). (Note that in (1.4.1), if δ = 0 and γ = 1,
then the right-hand-side would be a difference between two infinite terms, which
is undefined; hence, such case is therefore excluded by the condition that γ < 1.
For the same reason, we also exclude the cases of γ = 0 and δ = 1.) Now when
0 ≤ δ < 1 and γ = 0, we can confirm the validity of (1.4.1), (1.4.2) and (1.4.3),
again, by (1.2.1) and (1.2.2), and also examine the validity of (1.4.4) and (1.4.5)
by directly taking these values inside. The validity of (1.4.1) and (1.4.2) for
(δ, γ) = (1, 0), and the validity of (1.4.3), (1.4.4) and (1.4.5) for (δ, γ) = (0, 1)
can be checked by directly replacing δ and γ with the respective numbers.
Property 4 follows from the facts that H̄δ (·) is non-decreasing in δ, H̄δ (X) ≤
H̄(X), and H̄(X) ≤ log |X |. The last inequality can be proved as follows.
 
1 n
Pr hX n (X ) ≤ log |X | + ν
n
 
n n 1 PX n (X n )
= 1 − PX n x ∈ X : log < −ν
n 1/|X |n
≥ 1 − e−nν ,
where the last step can be obtained by using the same procedure as (1.4.6).
Therefore, h(log |X |+ν) = 1 for any ν > 0, which indicates that H̄(X) ≤ log |X |.
¯
Property 5 can be proved using the fact that
1 1 1
iX n ,Y n ,Z n (X n , Y n ; Z n ) = iX n ,Z n (X n ; Z n ) + iX n ,Y n ,Z n (Y n ; Z n |X n ).
n n n
By applying (1.2.1) with γ = 0, and observing I (Y ; Z|X) ≥ 0, we obtain the
¯
desired result. 2

Lemma 1.6 (Data processing lemma) Fix δ ∈ [0, 1]. Consider arbitrary
sources X 1 , X 2 and X 3 such that for every n, X1n and X3n are conditionally
independent given X2n . Then
I δ (X 1 ; X 3 ) ≤ I δ (X 1 ; X 2 ).
¯ ¯
Proof: By Property 5 of Lemma 1.5, we get
I δ (X 1 ; X 3 ) ≤ I δ (X 1 ; X 2 , X 3 ) = I δ (X 1 ; X 2 ),
¯ ¯ ¯
where the equality holds because
1 PX1n ,X2n ,X3n (xn1 , xn2 , xn3 ) 1 PX1n ,X2n (xn1 , xn2 )
log = log .
n PX1n (xn1 )PX2n ,X3n (xn2 , xn3 ) n PX1n (xn1 )PX2n (xn2 )
2

14
Lemma 1.7 (Optimality of independent inputs) Fix δ ∈ [0, 1). Consider
a channel with finite input and output alphabets, X nand Y, respectively, and
n n n n
with distribution PW (y |x ) = PY |X (y |x ) = i=1 PYi |Xi (yi |xi ) for all n,
n n n

xn ∈ X n and y n ∈ Y n . For any input X and its corresponding output Y ,

I δ (X; Y ) ≤ I δ (X̄; Ȳ ) = I (X̄; Ȳ ),


¯ ¯ ¯
where Ȳ is the output due to X̄, which is annindependent process with the same
n
first-order statistics as X, i.e., PX n (x ) = i=1 PXi (xi ).

Proof: First, we observe that

1 PW n (y n |xn ) 1 PY n (y n ) 1 PW n (y n |xn )
log + log = log .
n PY n (y n ) n PY n (y n ) n PY n (y n )

In other words,

1 PX n W n (xn , y n) 1 PY n (y n ) 1 PX n W n (xn , y n )
log + log = log ,
n PX n (xn )PY n (y n ) n PY n (y n ) n PX n (xn )PY n (y n )

where PX n W n := PX n ,Y n = PY n |X n PX n is the joint input-output distribution


resulting from sending input X n over the channel W n and receiving Y n as output
(a similar definition holds for PX n W n ). By evaluating the above terms under
PX n W n = PX n ,Y n and defining
 
n n n n 1 PX n W n (xn , y n )
z̄(θ) := lim sup PX n W n (x , y ) ∈ X × Y : log ≤θ
n→∞ n PX n (xn )PY n (y n )

and
Zδ (X̄; Ȳ ) := sup{θ : z̄(θ) ≤ δ},
we obtain from (1.2.1) (with γ = 0) that

Zδ (X̄; Ȳ ) ≥ I δ (X; Y ) + D(Y Ȳ ) ≥ I δ (X; Y ),


¯ ¯
since D(Y Ȳ ) ≥ 0 by Property 1 of Lemma 1.5.
Now since X̄ is a source with independent random variables and with iden-
tical marginal (first-order) statistics as X, we have that the induced Ȳ is also

15
an independent process and has the same marginal statistics6 as Y . Hence,
 n 
1    
 PX i Wi (Xi , Yi ) PX i Wi (Xi , Yi ) 
Pr  log − EXi Wi log >γ
n P X (X i )P Y (Y i ) P X (X i )P Y (Y i ) 
i=1 i i i i
 n 
1    
 PXi Wi (Xi , Yi ) PXi Wi (Xi , Yi ) 
= Pr  log − EXi Wi log >γ
n PXi (Xi )PYi (Yi ) PXi (Xi )PYi (Yi ) 
i=1
→ 0,

for any γ > 0, where the convergence to zero follows from Chebyshev’s inequality
and the finiteness of the channel alphabets (or more directly, the finiteness of
individual variances). Consequently, z̄(θ) = 1 for
 
1 1
n n
PXi Wi (Xi , Yi )
θ > lim inf EXi Wi log = lim inf I(Xi ; Yi );
n→∞ n PXi (Xi )PYi (Yi ) n→∞ n
i=1 i=1
n
and z̄(θ) = 0 for θ < lim inf n→∞ (1/n) i=1 I(Xi ; Yi ), which implies

1
n
Z(X̄; Ȳ ) = Zδ (X̄; Ȳ ) = lim inf I(Xi ; Yi )
n→∞ n
i=1

for any δ ∈ [0, 1). Similarly, we can show that

1
n
I (X̄; Ȳ ) = I δ (X̄; Ȳ ) = lim inf I(Xi ; Yi )
¯ ¯ n→∞ n
i=1

6
The claim can be justified as follows. First, for y1 ∈ Y,
 
PY1 (y1 ) = PX n (xn )PW n (y n |xn )
y2n ∈Y n−1 xn ∈X n
  
= PX1 (x1 )PW1 (y1 |x1 ) PX2n |X1 (xn2 |x1 )PW2n (y2n |xn2 )
x1 ∈X y2n ∈Y n−1 xn
2 ∈X
n−1

  
= PX1 (x1 )PW1 (y1 |x1 ) PX2n W2n |X1 (xn2 , y2n |x1 )
x1 ∈X y2n ∈Y n−1 xn
2 ∈X
n−1


= PX1 (x1 )PW1 (y1 |x1 ),
x1 ∈X

where xn2 := (x2 , · · · , xn ), y2n := (y2 , · · · , yn ) and W2n is the channel law between input and
output tuples from time 2 to n. It can be similarly shown that for any time index 1 ≤ i ≤ n,

PYi (yi ) = PXi (xi )PWi (yi |xi )
xi ∈X
n
for xi ∈ X and yi ∈ Y. Hence, for a channel with PW n (y n |xn ) = i=1 PWi (yi |xi ), the output
marginal only depends on the respective input marginal.

16
for any δ ∈ [0, 1). Accordingly,

I (X̄; Ȳ ) = Z(X̄; Ȳ ) ≥ I δ (X; Y ).


¯ ¯
2

1.5 Examples for the computation of general information


measures

Consider a channel with binary input and output alphabets; i.e., X = Y = {0, 1},
and let every channel output be given by
(n) (n) (n)
Yi = Xi ⊕ Zi

where ⊕ represents addition modulo-2, and Z is an arbitrary binary noise pro-


cess, independent of input X. Assume that X is a Bernoulli uniform input,
i.e., an i.i.d. random process with uniform marginal distribution. Then it can
be readily seen that the resultant Y is also Bernoulli uniform, no matter what
distribution Z has.
To compute I ε (X; Y ) for ε ∈ [0, 1) (I 1 (X; Y ) = ∞ is known), we use the
¯ ¯
results of Property 3 in Lemma 1.5:

I ε (X; Y ) ≥H0 (Y ) − H̄(1−ε) (Y |X), (1.5.1)


¯
and
I ε (X; Y ) ≤ H̄ε+γ (Y ) − H̄γ (Y |X). (1.5.2)
¯
where 0 ≤ ε < 1, 0 ≤ γ < 1 and ε + γ ≤ 1. Note that the lower bound in (1.5.1)
and the upper bound in (1.5.2) are respectively equal to −∞ and ∞ for ε = 0
and ε + γ = 1, which become trivial bounds; hence, we further restrict ε > 0
and ε + γ < 1, respectively, for the lower and upper bounds.
Thus for ε ∈ [0, 1),

I ε (X; Y ) ≤ inf H̄ε+γ (Y ) − H̄γ (Y |X) .


¯ 0≤γ<1−ε

Given that the channel noise is additive and independent from the channel input,
H̄γ (Y |X) = H̄γ (Z), which is independent of X. Hence,

I ε (X; Y ) ≤ inf H̄ε+γ (Y ) − H̄γ (Z)


¯ 0≤γ<1−ε

≤ inf log(2) − H̄γ (Z) ,


0≤γ<1−ε

17
where the last step follows from Property 4 of Lemma 1.5. Since log(2) − H̄γ (Z)
is non-increasing in γ,

I ε (X; Y ) ≤ log(2) − lim H̄γ (Z).


¯ γ↑(1−ε)

On the other hand, we can derive the lower bound to I ε (X; Y ) in (1.5.1) by
¯
the fact that Y is Bernoulli uniform. We thus obtain that for ε ∈ (0, 1],

I ε (X; Y ) ≥ log(2) − H̄(1−ε) (Z),


¯
and

I 0 (X; Y ) = lim I ε (X; Y ) ≥ log(2) − lim H̄γ (Z) = log(2) − H̄(Z).


¯ ε↓0 ¯ γ↑1

To summarize,

log(2) − H̄(1−ε) (Z) ≤ I ε (X; Y ) ≤ log(2) − lim H̄γ (Z) for ε ∈ (0, 1)
¯ γ↑(1−ε)

and
I (X; Y ) = I 0 (X; Y ) = log(2) − H̄(Z).
¯ ¯
An alternative method to compute I ε (X; Y ) is to derive its corresponding
¯
sup-spectrum in terms of the inf-spectrum of the noise process. Under the equally
likely Bernoulli input X, we can write
 
1 PY n |X n (Y n |X n )
ī(θ) := lim sup Pr log ≤θ
n→∞ n PY n (Y n )
 
1 n 1 n
= lim sup Pr log PZ n (Z ) − log PY n (Y ) ≤ θ
n→∞ n n
 
1 n
= lim sup Pr log PZ n (Z ) ≤ θ − log(2)
n→∞ n
 
1 n
= lim sup Pr − log PZ n (Z ) ≥ log(2) − θ
n→∞ n
 
1 n
= 1 − lim inf Pr − log PZ n (Z ) < log(2) − θ .
n→∞ n

18
Hence, for ε ∈ (0, 1),

I ε (X; Y ) = sup {θ : ī(θ) ≤ ε}


¯    
1 n
= sup θ : 1 − lim inf Pr − log PZ n (Z ) < log(2) − θ ≤ ε
n→∞ n
   
1 n
= sup θ : lim inf Pr − log PZ n (Z ) < log(2) − θ ≥ 1 − ε
n→∞ n
   
1 n
= sup (log(2) − β) : lim inf Pr − log PZ n (Z ) < β ≥ 1 − ε
n→∞ n
   
1 n
= log(2) + sup −β : lim inf Pr − log PZ n (Z ) < β ≥ 1 − ε
n→∞ n
   
1 n
= log(2) − inf β : lim inf Pr − log PZ n (Z ) < β ≥ 1 − ε
n→∞ n
   
1 n
= log(2) − sup β : lim inf Pr − log PZ n (Z ) < β < 1 − ε
n→∞ n
   
1 n
≤ log(2) − sup β : lim inf Pr − log PZ n (Z ) ≤ β < 1 − ε
n→∞ n
= log(2) − lim H̄δ (Z).
δ↑(1−ε)

Also, for ε ∈ (0, 1),


   
1 PX n ,Y n (X n , Y n )
I ε (X; Y ) ≥ sup θ : lim sup Pr log < θ < ε (1.5.3)
¯ n→∞ n PX n (X n )PY n (Y n )
   
1 n
= log(2) − sup β : lim inf Pr − log PZ n (Z ) ≤ β ≤ 1 − ε
n→∞ n
= log(2) − H̄(1−ε) (Z),

where (1.5.3) follows from the fact described in Footnote 1.3. Therefore,

log(2) − H̄(1−ε) (Z) ≤ I ε (X; Y ) ≤ log(2) − lim H̄γ (Z) for ε ∈ (0, 1).
¯ γ↑(1−ε)

By taking ε ↓ 0, we obtain

I (X; Y ) = I 0 (X; Y ) = log(2) − H̄(Z).


¯ ¯
Based on this result, we can now compute I ε (X; Y ) for some specific exam-
¯
ples.

Example 1.8 Let Z be an all-zero sequence with probability β and Bernoulli(p)


with probability 1 − β, where Bernoulli(p) represents a binary Bernoulli process

19
6

1 t

β t d

0 d -
0 hb (p)

Figure 1.2: The spectrum h(θ) for Example 1.8.


¯
6

1 t

1−β t d

0 d -
1 − hb (p) 1

Figure 1.3: The spectrum ī(θ) for Example 1.8.

with any of its random variables equaling one with probability p. Then as
n → ∞, the sequence of random variables (1/n)hZ n (Z n ) converges to 0 and hb (p)
with respective masses β and 1 − β, where hb (p) := −p log p − (1 − p) log(1 − p) is
the binary entropy function. The resulting h(θ) is depicted in Figure 1.2. From
¯
(1.5.3), we obtain ī(θ) as shown in Figure 1.3.
Therefore,

1 − hb (p), if 0 < ε < 1 − β;
I ε (X; Y ) =
¯ 1, if 1 − β ≤ ε < 1.

Example 1.9 If    ∞
Z = Z n = Z1 , . . . , Zn(n)
(n)
n=1
is a non-stationary binary independent sequence with
 
(n) (n)
Pr Zi = 0 = 1 − Pr Zi = 1 = pi ,

then by the uniform boundedness (in i) of the variance of random variable

20
 
(n)
− log PZ (n) Zi , namely
i

     2 
(n) (n)
Var − log PZ (n) Zi ≤ E log PZ (n) Zi
i i
 
≤ sup pi (log pi )2 + (1 − pi )(log(1 − pi ))2
0<pi <1
< log(2),

we have (by Chebyshev’s inequality) that as n → ∞,


  
 1 n  
 1 
Pr − log PZ n (Z n ) −
(n)
H Zi  > γ → 0,
 n n i=1 

for any γ > 0. Therefore, H̄(1−ε) (Z) is equal to

1   (n)  1
n n
H̄(1−ε) (Z) = H̄(Z) = lim sup H Zi = lim sup hb (pi )
n→∞ n i=1 n→∞ n
i=1

for ε ∈ (0, 1], and infinity for ε = 0, where hb (pi ) = −pi log(pi )−(1−pi ) log(1−pi ).
Consequently,

 n
1 − H̄(Z) = 1 − lim sup 1 hb (pi ), for ε ∈ [0, 1),
I ε (X; Y ) = n→∞ n
¯ 

i=1
∞, for ε = 1.

This result is illustrated in Figures 1.4 and 1.5.

clustering points

···

-
H(Z) H̄(Z)

Figure 1.4: The limiting spectrums of (1/n)hZ n (Z n ) for Example 1.9

21
clustering points

···

-
log(2) − H̄(Z) log(2) −H(Z)

Figure 1.5: The possible limiting spectrums of (1/n)iX n ,Y n (X n ; Y n ) for


Example 1.9.

22
Chapter 2

General Data Compression Theorems

It is known that the entropy rate


1
lim H(X n )
n→∞ n

is the minimum data compression rate (nats per source symbol) for arbitrarily
small data compression error for block coding of the stationary-ergodic source
[2]. For a more complicated situations where the sources become non-stationary,
the quantity limn→∞ (1/n)H(X n ) may not exist, and can no longer be used to
characterize the source compression. This results in the need to establish a
new entropy measure which appropriately characterizes the operational limits of
arbitrary stochastic systems, which was done in the previous chapter.
The role of a source code is to represent the output of a source efficiently.
Specifically, a source code design is to minimize the source description rate of
the code subject to a fidelity criterion constraint. One commonly used fidelity
criterion constraint is to place an upper bound on the probability of decoding
error Pe . If Pe is made arbitrarily small, we obtain a traditional (almost) error-
free source coding system.1 Lossy data compression codes are a larger class
of codes in the sense that the fidelity criterion used in the coding scheme is a
general distortion measure. In this chapter, we only demonstrate the bounded-
error data compression theorems for arbitrary (not necessarily stationary ergodic,
information stable, etc.) sources. The general lossy data compression theorems
will be introduced in subsequent chapters.
1
Recall that only for variable-length codes, a complete error-free data compression is re-
quired. A lossless data compression block codes only dictates that the compression error can
be made arbitrarily small, or asymptotically error-free.

23
2.1 Fixed-length data compression codes for arbitrary
sources

Equipped with the general information measures, we herein demonstrate a gen-


eralized Asymptotic Equipartition Property (AEP) Theorem and establish ex-
pressions for the minimum ε-achievable (fixed-length) coding rate of an arbitrary
source X.
Here, we have made an implicit assumption in the following derivation, which
is the source alphabet X is finite.2

Definition 2.1 (cf. Definition 3.2 and its associated footnote in [2]) An
(n, M) block code for data compression is a set

∼Cn := {c1 , c2 , . . . , cM }

consisting of M sourcewords3 of blocklength n (and a binary-indexing codeword


for each sourceword ci ); each sourceword represents a group of source symbols
of length n.

Definition 2.2 Fix ε ∈ [0, 1]. R is an ε-achievable data compression rate for
a source X if there exists a sequence of block data compression codes {∼Cn =
(n, Mn )}∞
n=1 with
1
lim sup log Mn ≤ R,
n→∞ n

and
lim sup Pe (∼Cn ) ≤ ε,
n→∞
n
where Pe (∼Cn ) := Pr (X ∈
/ ∼Cn ) is the probability of decoding error.
The infimum of all ε-achievable data compression rate for X is denoted by
Tε (X).
2
Actually, the theorems introduced also apply for sources with countable alphabets. We
assume finite alphabets in order to avoid uninteresting cases (such as H̄ε (X) = ∞) that might
arise with countable alphabets.
3
In [2, Def. 3.2], the (n, M ) block data compression code is defined by M codewords,
where each codeword represents a group of sourcewords of length n. However, we can actually
pick up one source symbol from each group, and equivalently define the code using these M
representative sourcewords. Later, it will be shown that this viewpoint facilitates the proving
of the general source coding theorem.

24
Lemma 2.3 (Lemma 1.5 in [25]) Fix a positive integer n. There exists an
(n, Mn ) source block code ∼Cn for PX n such that its error probability satisfies
 
1 n 1
Pe (∼Cn ) ≤ Pr hX n (X ) > log Mn .
n n

Proof: Observe that



1 ≥ PX n (xn )
{xn ∈X n : (1/n)hX n (xn )≤(1/n) log Mn }
 1

Mn
{xn ∈X n : (1/n)hX n (xn )≤(1/n) log Mn }
 
 n 1 1  1
≥ {x ∈ X : hX n (x ) ≤ log Mn }
n n
.
n n Mn

Therefore, |{xn ∈ X n : (1/n)hX n (xn ) ≤ (1/n) log Mn }| ≤ Mn . We can then


choose a code
 
n n 1 n 1
∼Cn ⊃ x ∈ X : hX n (x ) ≤ log Mn
n n

with |∼Cn | = Mn and


 
1 n 1
Pe (∼Cn ) = 1 − PX n {∼Cn } ≤ Pr hX n (X ) > log Mn .
n n
2

Lemma 2.4 (Lemma 1.6 in [25]) Every (n, Mn ) source block code ∼Cn for
PX n satisfies
 
1 n 1
Pe (∼Cn ) ≥ Pr hX n (X ) > log Mn + γ − exp{−nγ},
n n

for every γ > 0.

Proof: It suffices to prove that


 
n 1 n 1
1 − Pe (∼Cn ) = Pr {X ∈ ∼Cn } < Pr hX n (X ) ≤ log Mn + γ + exp{−nγ}.
n n

25
Clearly,
 
n n 1 n 1
Pr {X ∈ ∼Cn } = Pr X ∈ ∼Cn and hX n (X ) ≤ log Mn + γ
n n
 
n 1 n 1
+ Pr X ∈ ∼Cn and hX n (X ) > log Mn + γ
n n
 
1 1
≤ Pr hX n (X n ) ≤ log Mn + γ
n n
 
n 1 n 1
+ Pr X ∈ ∼Cn and hX n (X ) > log Mn + γ
n n
 
1 1
= Pr hX n (X n ) ≤ log Mn + γ
n n
  
n 1 n 1
+ PX n (x ) · 1 hX n (x ) > log Mn + γ
n n
x ∈∼
n Cn
 
1 n 1
= Pr hX n (X ) ≤ log Mn + γ
n n
  
n n 1
+ PX n (x ) · 1 PX n (x ) < exp{−nγ}
Mn
xn ∈ ∼
C
 n 
1 n 1 1
< Pr hX n (X ) ≤ log Mn + γ + |∼Cn | exp{−nγ}
n n Mn
 
1 n 1
= Pr hX n (X ) ≤ log Mn + γ + exp{−nγ}.
n n
2
We now apply Lemmas 2.3 and 2.4 to prove a general source coding theorems
for block codes.

Theorem 2.5 (general source coding theorem) For any source X,



 lim H̄δ (X), for ε ∈ [0, 1);
Tε (X) = δ↑(1−ε)
0, for ε = 1.

Proof: The case of ε = 1 follows directly from its definition; hence, the proof
only focuses on the case of ε ∈ [0, 1).

1. Forward part (achievability): Tε (X) ≤ limδ↑(1−ε) H̄δ (X)


We need to prove the existence of a sequence of block codes {∼Cn =
(n, Mn )}n≥1 such that for every γ > 0,
1
lim sup log Mn ≤ lim H̄δ (X) + γ and lim sup Pe (∼Cn ) ≤ ε.
n→∞ n δ↑(1−ε) n→∞

26
Lemma 2.3 ensures the existence (for any γ > 0) of a source block code
∼Cn = (n, Mn = exp{n(limδ↑(1−ε) H̄δ (X) + γ)} ) with error probability
 
1 n 1
Pe (∼Cn ) ≤ Pr hX n (X ) > log Mn
n n
 
1 n
≤ Pr hX n (X ) > lim H̄δ (X) + γ .
n δ↑(1−ε)

Therefore,
 
1 n
lim sup Pe (∼Cn ) ≤ lim sup Pr hX n (X ) > lim H̄δ (X) + γ
n→∞ n→∞ n δ↑(1−ε)
 
1 n
= 1 − lim inf Pr hX n (X ) ≤ lim H̄δ (X) + γ
n→∞ n δ↑(1−ε)

≤ 1 − (1 − ε) = ε,
where the last inequality follows from
   
1 n
lim H̄δ (X) = sup θ : lim inf Pr hX n (X ) ≤ θ < 1 − ε . (2.1.1)
δ↑(1−ε) n→∞ n

2. Converse part: Tε (X) ≥ limδ↑(1−ε) H̄δ (X)


Assume without loss of generality that limδ↑(1−ε) H̄δ (X) > 0. We will prove
the converse by contradiction. Suppose that Tε (X) < limδ↑(1−ε) H̄δ (X).
Then (∃ γ > 0) Tε (X) < limδ↑(1−ε) H̄δ (X) − 4γ. By definition of Tε (X),
there exists a sequence of codes ∼Cn = (n, Mn ) such that
 
1
lim sup log Mn ≤ lim H̄δ (X) − 4γ + γ
n→∞ n δ↑(1−ε)

< lim H̄δ (X) − 2γ (2.1.2)


δ↑(1−ε)

and
lim sup Pe (∼Cn ) ≤ ε. (2.1.3)
n→∞
(2.1.2) implies that
1
log Mn ≤ lim H̄δ (X) − 2γ
n δ↑(1−ε)

for all sufficiently large n. Hence, for those n satisfying the above inequality
and also by Lemma 2.4,
 
1 1
Pe (∼Cn ) ≥ Pr hX n (X ) > log Mn + γ − e−nγ
n
n n
   
1 n
≥ Pr hX n (X ) > lim H̄δ (X) − 2γ + γ − e−nγ .
n δ↑(1−ε)

27
Therefore,
 
1
lim sup Pe (∼Cn ) ≥ 1 − lim inf Pr hX n (X n ) ≤ lim H̄δ (X) − γ
n→∞ n→∞ n δ↑(1−ε)

> 1 − (1 − ε) = ε,
where the last inequality follows from (2.1.1). Thus, a contradiction to
(2.1.3) is obtained. 2

A few remarks are made based on the previous theorem.

• Note that as ε = 0, limδ↑(1−ε) H̄δ (X) = H̄(X). Hence, the above theorem
generalizes the block source coding theorem in [26], which states that the
minimum achievable fixed-length source coding rate of any finite-alphabet
source is H̄(X).
• Consider the special case where −(1/n) log PX n (X n ) converges in proba-
bility to a constant H, which holds for all information stable sources.4 In
this case, both the inf- and sup-spectrums of X degenerate to a unit step
function: 
1, if θ > H;
u(θ) =
0, if θ < H,
where H is the source entropy rate. Thus, H̄ε (X) = H for all ε ∈ [0, 1).
Hence, the general source coding theorem reduces to the conventional
source coding theorem.
• More generally, if −(1/n) log PX n (X n ) converges in probability to a random
variable Z whose cumulative distribution function (CDF) is FZ (·), then the
minimum achievable data compression rate subject to decoding error being
no greater than ε is
lim H̄δ (X) = sup {R : FZ (R) < 1 − ε} .
δ↑(1−ε)

Therefore, the relationship between the code rate and the ultimate optimal
error probability is also clearly defined. We further explore the case in the
next example.
4 (n) (n)
A source X = {X n = (X1 , . . . , Xn )}∞ n
n=1 is said to be information stable if H(X ) =
n
E [− log PX n (x )] > 0 for all n, and
  
 − log PX n (xn ) 
lim Pr  − 1  > ε = 0,

n→∞ H(X n )
for every ε > 0. By the definition, any stationary-ergodic source with finite n-fold entropy is
information stable; hence, it can be viewed a generalized source model for stationary-ergodic
sources.

28
Example 2.6 Consider a binary source X with each X n is Bernoulli(Θ)
distributed, where Θ is a random variable defined over (0, 1). This is a sta-
tionary but non-ergodic source [1]. We can view the source as a mixture of
Bernoulli(θ) processes where the parameter θ ∈ Θ = (0, 1), and has distri-
bution PΘ [1, Corollary 1]. Therefore, it can be shown by ergodic decom-
position theorem (which states that any stationary source can be viewed
as a mixture of stationary-ergodic sources) that −(1/n) log PX n (X n ) con-
verges in probability to a random variable Z = hb (Θ) [1], where hb (x) :=
−x log2 (x) − (1 − x) log2 (1 − x) is the binary entropy function. Conse-
quently, the CDF of Z is FZ (z) = Pr{hb (Θ) ≤ z}; and the minimum
achievable fixed-length source coding rate with compression error being no
larger than ε is
sup{R : FZ (R) < 1 − ε} = sup{R : Pr{hb (Θ) ≤ R} < 1 − ε}.

• From the above example, or from Theorem 2.5, it shows that the strong
converse theorem (which states that codes with rate below entropy rate
will ultimately have decompression error approaching one, cf. [2, Thm. 3.6])
does not hold in general. However, one can always claim the weak converse
statement for arbitrary sources.

Theorem 2.7 (weak converse theorem) For any block code sequence
of ultimate rate R < H̄(X), the probability of block decoding failure Pe
cannot be made arbitrarily small. In other words, there exists ε > 0 such
that Pe is lower bounded by ε infinitely often in blocklength n.

The possible behavior of the probability of block decompression error of


an arbitrary source is depicted in Figure 2.1. As shown in the figure,
there exist two bounds, denoted by H̄(X) and H̄0 (X), where H̄(X) is
the tight bound for lossless data compression rate. In other words, it is
possible to find a sequence of block codes with compression rate larger than
H̄(X) and the probability of decoding error is asymptotically zero. When
the data compression rate lies between H̄(X) and H̄0 (X), the minimum
probability of decoding error achievable is bounded below by a positive
absolute constant in (0, 1) infinitely often in blocklength n. In the case that
the data compression rate is less than H̄0 (X), the probability of decoding
error of all codes will eventually go to 1 (for n infinitely often). This
fact tells the block code designer that all codes with long blocklength are
bad when data compression rate is smaller than H̄0 (X). From the strong
converse theorem, the two bounds in Figure 2.1 coincide for memoryless
sources. In fact, these two bounds coincide even for stationary-ergodic
sources.

29
n (i.o.) n→∞
Pe −
−−→1 Pe is lower Pe −
−−→0
bounded (i.o. in n) - R
H̄0 (X) H̄(X)

Figure 2.1: Behavior of the probability of block decoding error as block-


length n goes to infinity for an arbitrary source X.

We close this section by remarking that the definition that we adopt for the
ε-achievable data compression rate is slightly different from, but equivalent to,
the one used in [26, Def. 8]. The definition in [26] also brings the same result,
which was separately proved by Steinberg and Verdú as a direct consequence of
Theorem 10(a) (or Corollary 3) in [36]. To be precise, they showed that Tε (X),
denoted by Te (ε, X) in [36], is equal to R̄v (2ε) (cf. Def. 17 in [36]). By a simple
derivation, we obtain:
Te (ε, X) = R̄v (2ε)
   
1 n
= inf θ : lim sup PX n − log PX n (X ) > θ ≤ ε
n→∞ n
   
1 n
= inf θ : lim inf PX n − log PX n (X ) ≤ θ ≥ 1 − ε
n→∞ n
   
1 n
= sup θ : lim inf PX n − log PX n (X ) ≤ θ < 1 − ε
n→∞ n
= lim H̄δ (X).
δ↑(1−ε)

Note that Theorem 10(a) in [36] is a lossless data compression theorem for ar-
bitrary sources, which the authors show as a by-product of their results on
finite-precision resolvability theory. Specifically, they proved T0 (X) = S(X)
[26, Thm. 1] and S(X) = H̄(X) [26, Thm. 3], where S(X) is the resolvability5
of an arbitrary source X. Here, we establish Theorem 2.5 in a different and
more direct way.

2.2 Generalized AEP theorem

For discrete memoryless sources, the data compression theorem is proved by


choosing the codebook ∼Cn to be the weakly δ-typical set and applying the Asymp-
5
The resolvability, which is a measure of randomness for random variables, will be intro-
duced in subsequent chapters.

30
totic Equipartition Property (AEP) which states that (1/n)hX n (X n ) converges
to H(X) with probability one (and hence in probability). The AEP – which
implies that the probability of the typical set is close to one for sufficiently large
n – also holds for stationary-ergodic sources. It is however invalid for more gen-
eral sources – e.g., non-stationary, non-ergodic sources. We herein demonstrate
a generalized AEP theorem.

Theorem 2.8 (generalized asymptotic equipartition property for arbi-


trary sources) Fix ε ∈ [0, 1). Given an arbitrary source X, define
 
n n 1 n
Tn [R] := x ∈ X : − log PX n (x ) ≤ R .
n

Then for any δ > 0, the following statements hold.

1.
lim inf Pr Tn [H̄ε (X) − δ] ≤ ε (2.2.1)
n→∞

2.
lim inf Pr Tn [H̄ε (X) + δ] > ε (2.2.2)
n→∞

3. The number of elements in

Fn (δ; ε) := Tn [H̄ε (X) + δ] \ Tn [H̄ε (X) − δ],

denoted by |Fn (δ; ε)|, satisfies

|Fn (δ; ε)| ≤ exp n(H̄ε (X) + δ) , (2.2.3)

where the operation A \ B between two sets A and B is defined by A \ B :=


A ∩ Bc with Bc denoting the complement set of B.

4. There exists ρ = ρ(δ) > 0 and a subsequence {nj }∞


j=1 such that

|Fn (δ; ε)| > ρ · exp nj (H̄ε (X) − δ) . (2.2.4)

Proof: (2.2.1) and (2.2.2) follow from the definitions. For (2.2.3), we have

1 ≥ PX n (xn )
xn ∈Fn (δ;ε)

≥ exp −n (H̄ε (X) + δ)
xn ∈Fn (δ;ε)

= |Fn (δ; ε)| exp −n (H̄ε (X) + δ) .

31
Fn (δ; ε)

Tn [H̄ε (X) + δ] Tn [H̄ε (X) − δ]

Figure 2.2: Illustration of generalized AEP Theorem. Fn (δ; ε) :=


Tn [H̄ε (X) + δ] \ Tn [H̄ε (X) − δ] is the dashed region.

It remains to show (2.2.4). (2.2.2) implies that there exist ρ = ρ(δ) > 0 and
N1 such that for all n > N1 ,
Pr Tn [H̄ε (X) + δ] > ε + 2ρ.
Furthermore, (2.2.1) implies that for the previously chosen ρ, there exists a
subsequence {nj }∞
j=1 such that

Pr Tnj [H̄ε (X) − δ] < ε + ρ.

Therefore, for all nj > N1 ,



ρ < Pr Tnj [H̄ε (X) + δ] \ Tnj [H̄ε (X) − δ]
 
 
< Tnj [H̄ε (X) + δ] \ Tnj [H̄ε (X) − δ] exp −nj (H̄ε (X) − δ)
 
  
= Fnj (δ; ε) exp −nj (H̄ε (X) − δ) .

The desired subsequence {nj }∞ 


j=1 is then defined as n1 is the first nj > N1 , and

n2 is the second nj > N1 , etc. 2
With the illustration depicted in Figure 2.2, we can clearly deduce that The-
orem 2.8 is indeed a generalized version of the AEP since:

• The set
$$
F_n(\delta;\varepsilon) := T_n[\bar H_\varepsilon(X) + \delta] \setminus T_n[\bar H_\varepsilon(X) - \delta]
= \left\{ x^n \in \mathcal{X}^n : \left| -\frac{1}{n}\log P_{X^n}(x^n) - \bar H_\varepsilon(X) \right| \le \delta \right\}
$$
is nothing but the weakly δ-typical set.
• (2.2.1) and (2.2.2) imply that qn := Pr{Fn (δ; ε)} > 0 infinitely often in n.

• (2.2.3) and (2.2.4) imply that the number of sequences in Fn (δ; ε) (the
dashed region) is approximately equal to exp{nH̄ε (X)}, and the proba-
bility of each sequence in Fn (δ; ε) can be estimated by qn · exp{−nH̄ε (X)}.

• In particular, if X is a stationary-ergodic source, then H̄ε (X) is indepen-
dent of ε ∈ [0, 1) and H̄ε (X) = H̲ε (X) = H for all ε ∈ [0, 1), where H is
the source entropy rate
$$H = \lim_{n\to\infty} \frac{1}{n} E\left[-\log P_{X^n}(X^n)\right].$$

In this case, (2.2.1)-(2.2.2) and the fact that H̄ε (X) = Hε (X) for all ε ∈
[0, 1) imply that the probability of the typical set Fn (δ; ε) is close to one
(for n sufficiently large), and (2.2.3) and (2.2.4) imply that there are about
enH typical sequences of length n, each with probability about e−nH . Hence
we obtain the conventional AEP.

• The general source coding theorem can also be proved in terms of the
generalized AEP theorem. For more detail, readers can refer to [11].
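To make the conventional AEP recovered in the fourth bullet above concrete, the following short C program is an illustrative sketch only (the parameter values p = 0.3 and n = 10000 are arbitrary choices, not taken from the text): it samples an i.i.d. Bernoulli(p) source and prints the normalized self-information −(1/n) log2 PX n (X n ), which concentrates around hb (p), so the spectrum degenerates to a single point.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Empirical check of the AEP for an i.i.d. Bernoulli(p) source:
 * -(1/n) log2 P_{X^n}(X^n) should concentrate around h_b(p).      */
int main(void)
{
    const double p = 0.3;          /* P{X = 1}; arbitrary choice     */
    const int    n = 10000;        /* blocklength; arbitrary choice  */
    const double hb = -p * log2(p) - (1 - p) * log2(1 - p);

    srand(2019);
    for (int trial = 0; trial < 5; trial++) {
        double self_info = 0.0;    /* -log2 P_{X^n}(x^n), accumulated letter by letter */
        for (int i = 0; i < n; i++) {
            int x = ((double)rand() / RAND_MAX) < p;   /* one source letter */
            self_info += -log2(x ? p : 1 - p);
        }
        printf("trial %d: -(1/n) log2 P = %.4f bits (h_b(p) = %.4f)\n",
               trial, self_info / n, hb);
    }
    return 0;
}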

Chapter 3

Measure of Randomness for Stochastic


Processes

In the previous chapter, it was shown that the sup-entropy rate is the minimum
lossless data compression rate achievable by block codes. Finding an optimal block
code is therefore a well-defined task: for any source with a well-formulated statistical
model, the sup-entropy rate can be computed and used as a criterion to evaluate
the optimality of a designed block code.
In [26], Verdú and Han found that, besides being the minimum lossless data
compression rate, the sup-entropy rate has another operational meaning, called
resolvability. In this chapter, we explore this concept in detail.

3.1 Motivation for resolvability : measure of randomness


of random variables

In simulations of statistical communication systems, the generation of random vari-
ables by a computer algorithm is essential. The computer usually has access to a
basic random experiment (through a pre-defined Application Programming Inter-
face) that generates equally likely random values, such as rand( ), which returns a
real number uniformly distributed over (0, 1). Conceptually, random variables with
complex models are more difficult to generate by computer than random variables
with simple models. The question is how to quantify the “complexity” of generating
random variables by computer. One way to define such a “complexity” measure is:

Definition 3.1 The complexity of generating a random variable is defined as
the number of random bits that the most efficient algorithm requires in order to
generate the random variable by a computer that has access to an equally likely
random experiment.

To understand the above definition quantitatively, a simple example is de-


monstrated below.

Example 3.2 Consider the generation of the random variable with probability
masses PX (−1) = 1/4, PX (0) = 1/2, and PX (1) = 1/4. An algorithm is written
as:

Flip-a-fair-coin; \\ one random bit


If “Head”, then output 0;
else
{
Flip-a-fair-coin; \\ one random bit
If “Head”, then output −1;
else output 1;
}

On average, the above algorithm requires 1.5 coin flips, and in the worst case,
2 coin flips are necessary. Therefore, the complexity measure can take two
fundamental forms: worst-case or average-case over the range of outcomes of
the random variable. Note that we did not show in the above example that
the algorithm is the most efficient one in the sense of using the minimum number
of random bits; however, it is indeed optimal because it achieves the corresponding
lower bound. Later, we will show that the lower bound on the average number of
random bits required to generate a random variable is its entropy, which is exactly
1.5 bits in the above example. As for the worst-case bound, a new quantity, the
resolution, will be introduced; the above algorithm also achieves the lower bound on
the worst-case complexity, which is the resolution of the random variable.
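The figures quoted in Example 3.2 (masses 1/4, 1/2, 1/4; an average of 1.5 flips; a worst case of 2 flips) can be verified with the short C simulation below, given as an illustration only; the number of trials is an arbitrary choice.

#include <stdio.h>
#include <stdlib.h>

static int flip(void) { return rand() & 1; }   /* one fair random bit: 1 = Head */

int main(void)
{
    long count[3] = {0, 0, 0};     /* counts for the outcomes -1, 0, +1 */
    long flips = 0, worst = 0;
    const long trials = 1000000;

    srand(2019);
    for (long t = 0; t < trials; t++) {
        long used = 1;
        int out;
        if (flip()) out = 0;                       /* "Head" on the first flip   */
        else { used = 2; out = flip() ? -1 : 1; }  /* second flip decides -1 / 1 */
        count[out + 1]++;
        flips += used;
        if (used > worst) worst = used;
    }
    printf("P(-1)=%.3f  P(0)=%.3f  P(1)=%.3f\n",
           (double)count[0] / trials, (double)count[1] / trials,
           (double)count[2] / trials);
    printf("average flips = %.3f (entropy = 1.5 bits), worst case = %ld\n",
           (double)flips / trials, worst);
    return 0;
}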

3.2 Notation and definitions regarding resolvability

Definition 3.3 (M-type) For any positive integer M, a probability distribu-
tion P is said to be M-type if
$$P(\omega) \in \left\{0, \frac{1}{M}, \frac{2}{M}, \ldots, 1\right\} \quad \text{for all } \omega \in \Omega.$$

Definition 3.4 (resolution of a random variable) The resolution1 R(X) of
a random variable X is the minimum log(M) such that PX is M-type. If PX is
not M-type for any integer M, then R(X) = ∞.
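For a distribution with rational masses pi = ai /bi , PX is M-type exactly when every M pi is an integer, so the smallest such M is the least common multiple of the reduced denominators. The sketch below (an illustrative routine, not part of the text) computes R(X) in this way for the distribution of Example 3.2, for which R(X) = 2 bits — matching the worst-case complexity of 2 coin flips observed there.

#include <stdio.h>
#include <math.h>

static long gcd(long a, long b) { return b ? gcd(b, a % b) : a; }
static long lcm(long a, long b) { return a / gcd(a, b) * b; }

/* Resolution R(X) = log2(min M such that P_X is M-type), for rational masses. */
static double resolution(const long num[], const long den[], int k)
{
    long M = 1;
    for (int i = 0; i < k; i++)
        M = lcm(M, den[i] / gcd(num[i], den[i]));  /* the reduced denominator must divide M */
    return log2((double)M);
}

int main(void)
{
    /* P_X(-1) = 1/4, P_X(0) = 1/2, P_X(1) = 1/4 as in Example 3.2 */
    long num[] = {1, 1, 1}, den[] = {4, 2, 4};
    printf("R(X) = %.1f bits\n", resolution(num, den, 3));   /* prints 2.0 */
    return 0;
}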

As revealed previously, a random source needs to be resolved (meaning, it can
be generated by a computer algorithm with access to equally likely random
experiments). As anticipated, a random variable with finite resolution is resolv-
able by computer algorithms. Yet, it is possible that the resolution of a random
variable is infinite. A quick example is the random variable X with distribution
PX (0) = 1/π and PX (1) = 1 − 1/π. (PX is not M-type for any finite M.) In
such a case, one can alternatively choose another computer-resolvable random
variable, which resembles the true source within some acceptable range, to
simulate the original one.
One criterion that can be used as a measure of resemblance of two random
variables is the variational distance. For the same example as in the above
paragraph, choose a random variable X̃ with distribution PX̃ (0) = 1/3 and
PX̃ (1) = 2/3. Then ‖X − X̃‖ ≈ 0.03, and X̃ is 3-type, which is computer-
resolvable. A program that generates the 3-type X̃ is as follows (in C language).

#include <stdio.h>
#include <stdlib.h>

static int flip_fair_coin(void) { return rand() & 1; }   /* one random bit: 1 = "Head" */

/* Outputs 0 with probability 1/3 and 1 with probability 2/3 (the 3-type X~). */
int generate_X_tilde(void)
{
    int even = 0;                       /* parity of the number of Tails seen so far */
    while (1) {
        if (flip_fair_coin())           /* "Head": stop and output                   */
            return even ? 0 : 1;
        even = !even;                   /* "Tail": toggle the parity and flip again  */
    }
}

Then, by denoting H = Head and T = Tail, the probability of outputting 1 equals
the probability of observing H, TTH, TTTTH, . . ., which is 1/2 + 1/2³ + 1/2⁵ + · · · = 2/3.
The average complexity of this algorithm is 1 · (1/2) + 2 · (1/4) + 3 · (1/8) + · · · = 2 bits, but
its worst-case complexity is infinite.
We proceed to give the definitions of variational distance and ε-resolution as
well as its extension for the generation of a sequence of random variables.
1
If the base of the logarithmic operation is 2, the resolution is measured in bits; however,
if natural logarithm is taken, nats becomes the basic measurement unit of resolution.

Definition 3.5 (variational distance) The variational distance (or ℓ1 dis-
tance) between two distributions P and Q defined on a common measurable space
(Ω, F) is
$$\|P - Q\| := \sum_{\omega\in\Omega} |P(\omega) - Q(\omega)|.$$
(Note that an alternative way to formulate the variational distance is
$$\|P - Q\| = 2 \cdot \sup_{E\in\mathcal{F}} |P(E) - Q(E)| = 2 \sum_{\omega\in\Omega:\, P(\omega)\ge Q(\omega)} [P(\omega) - Q(\omega)].$$
These definitions are equivalent.)
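As a quick numerical check of the two equivalent expressions in Definition 3.5 (and of the value ≈ 0.03 quoted earlier for the 1/π example), the following illustrative C routine computes both forms:

#include <stdio.h>
#include <math.h>

/* Variational distance computed two ways:
 *   (a) sum over the alphabet of |P - Q|
 *   (b) 2 * sum over the points where P >= Q of (P - Q)                */
int main(void)
{
    const double pi = acos(-1.0);
    /* The example preceding Definition 3.5: P_X(0)=1/pi, P_X(1)=1-1/pi
       versus the 3-type approximation (1/3, 2/3).                       */
    double P[2] = {1.0 / pi, 1.0 - 1.0 / pi};
    double Q[2] = {1.0 / 3.0, 2.0 / 3.0};

    double a = 0.0, b = 0.0;
    for (int x = 0; x < 2; x++) {
        a += fabs(P[x] - Q[x]);
        if (P[x] >= Q[x]) b += 2.0 * (P[x] - Q[x]);
    }
    printf("sum |P-Q|         = %.4f\n", a);   /* about 0.0301 */
    printf("2*sum_{P>=Q}(P-Q) = %.4f\n", b);   /* same value   */
    return 0;
}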

Definition 3.6 (ε-achievable resolution) Fix ε ≥ 0. R is an ε-achievable
resolution for input X if for all γ > 0, there exists X̃ satisfying
$$R(\tilde X) < R + \gamma \quad \text{and} \quad \|X - \tilde X\| < \varepsilon.$$

The ε-achievable resolution reveals the possibility of choosing another
computer-resolvable random variable whose variational distance to the true
source is within an acceptable range, ε.
Next we define the ε-achievable resolution rate for a sequence of random
variables, which is an extension of the ε-achievable resolution defined for a single
random variable. This extension is analogous to extending the entropy of a single
source to the entropy rate of a random source sequence.

Definition 3.7 (ε-achievable resolution rate) Fix ε ≥ 0 and input X. R is
an ε-achievable resolution rate2 for input X if for every γ > 0, there exists X̃
satisfying
$$\frac{1}{n} R(\tilde X^n) < R + \gamma \quad \text{and} \quad \|X^n - \tilde X^n\| < \varepsilon,$$
for all sufficiently large n.

Definition 3.8 (ε-resolvability for X) Fix ε > 0. The ε-resolvability for in-
put X, denoted by Sε (X), is the minimum ε-achievable resolution rate of the
same input, i.e.,
$$
S_\varepsilon(X) := \min\Big\{ R : (\forall\, \gamma > 0)(\exists\, \tilde X \text{ and } N)(\forall\, n > N)\ \
\frac{1}{n} R(\tilde X^n) < R + \gamma \ \text{ and } \ \|X^n - \tilde X^n\| < \varepsilon \Big\}.
$$
2
Note that our definition of resolution rate is different from its original form (cf. Definition
7 in [26] and the statements following Definition 7 of the same paper for its modified Definition
for specific input X), which involves an arbitrary channel model W . Readers may treat our
definition as a special case of theirs over identity channel.

Here, we define Sε (X) using the “minimum” instead of the more general “in-
fimum” operation simply because Sε (X) indeed belongs to the set over which the
minimum is taken,3 i.e.,
$$
S_\varepsilon(X) \in \mathcal{R}_\varepsilon(X) := \Big\{ R : (\forall\, \gamma > 0)(\exists\, \tilde X \text{ and } N)(\forall\, n > N)\ \
\frac{1}{n} R(\tilde X^n) < R + \gamma \ \text{ and } \ \|X^n - \tilde X^n\| < \varepsilon \Big\}.
$$
A similar convention will be applied throughout the rest of this chapter.

Definition 3.9 (resolvability for X) The resolvability for input X, denoted
by S(X), is
$$S(X) := \lim_{\varepsilon\downarrow 0} S_\varepsilon(X).$$

From its definition, the ε-resolvability is obviously non-increasing in ε.
Hence, the resolvability can also be defined using the supremum operation as
$$S(X) := \sup_{\varepsilon > 0} S_\varepsilon(X).$$

The resolvability is pertinent to the worst-case complexity measure for ran-
dom variables (cf. Example 3.2 and the discussion following it). Using the en-
tropy function in place of the resolution, information theorists also define the
ε-mean-resolvability and mean-resolvability for input X, which characterize the
average-case complexity of random variables.

Definition 3.10 (ε-mean-achievable resolution rate) Fix ε ≥ 0. R is an
ε-mean-achievable resolution rate for input X if for all γ > 0, there exists X̃
satisfying
$$\frac{1}{n} H(\tilde X^n) < R + \gamma \quad \text{and} \quad \|X^n - \tilde X^n\| < \varepsilon,$$
for all sufficiently large n.
3 By its definition, Rε (X) is either (Sε (X), ∞) or [Sε (X), ∞). Suppose Rε (X) =
(Sε (X), ∞). Since Sε (X) ∉ Rε (X), there exists γ0 > 0 satisfying
$$(\forall\, \tilde X \text{ and } N)(\exists\, n > N)\quad \frac{1}{n} R(\tilde X^n) \ge S_\varepsilon(X) + \gamma_0 \ \text{ or } \ \|X^n - \tilde X^n\| \ge \varepsilon. \qquad (3.2.1)$$
Let R0 := Sε (X) + γ0/2 and note that R0 is contained in Rε (X); hence, for γ = γ0/2,
$$(\exists\, \tilde X^0 \text{ and } N_0)(\forall\, n > N_0)\quad \frac{1}{n} R(\tilde X^{0n}) < R_0 + \gamma = S_\varepsilon(X) + \gamma_0 \ \text{ and } \ \|X^n - \tilde X^{0n}\| < \varepsilon.$$
The existence of X̃⁰ and N0 then contradicts (3.2.1).
Definition 3.11 (ε-mean-resolvability for X) Fix ε > 0. The ε-mean-re-
solvability for input X, denoted by S̄ε (X), is the minimum ε-mean-achievable
resolution rate for the same input, i.e.,
$$
\bar S_\varepsilon(X) := \min\Big\{ R : (\forall\, \gamma > 0)(\exists\, \tilde X \text{ and } N)(\forall\, n > N)\ \
\frac{1}{n} H(\tilde X^n) < R + \gamma \ \text{ and } \ \|X^n - \tilde X^n\| < \varepsilon \Big\}.
$$

Definition 3.12 (mean-resolvability for X) The mean-resolvability for in-
put X, denoted by S̄(X), is
$$\bar S(X) := \lim_{\varepsilon\downarrow 0} \bar S_\varepsilon(X) = \sup_{\varepsilon > 0} \bar S_\varepsilon(X).$$

The only difference between resolvability and mean-resolvability is that the
former employs the resolution function, while the latter replaces it by the entropy
function. Since the entropy is the minimum average codeword length of uniquely
decodable codes, one interpretation of mean-resolvability is that the new random
variable X̃ n can be resolved by realizing the optimal variable-length code for it.
One can think of the probability mass of each outcome of X̃ n as being approximately
2^{−ℓ}, where ℓ is the codeword length assigned by the optimal lossless variable-length
code for X n (see also [2, Sect. 3.3.3(B)]). Such a probability mass can actually be
generated by flipping a fair coin ℓ times, and the average number of fair coin flips
spent on this outcome is then ℓ · 2^{−ℓ}. As one may expect, the mean-resolvability
is shown to be the average complexity of a random variable. A concrete sketch of
this coin-flipping construction for a dyadic distribution is given below.
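The following C sketch makes the coin-flipping interpretation concrete for a dyadic distribution; it is an illustration only (the distribution P = (1/2, 1/4, 1/8, 1/8) and its prefix code {0, 10, 110, 111} are arbitrary choices, not taken from the text). Generating the source by walking the code tree with fair coin flips consumes on average Σ ℓ 2^{−ℓ} = H(X) = 1.75 flips.

#include <stdio.h>
#include <stdlib.h>

static int flip(void) { return rand() & 1; }       /* one fair random bit */

/* Generate a symbol with P = (1/2, 1/4, 1/8, 1/8) by walking the prefix
 * code {0, 10, 110, 111}; *flips counts the random bits consumed.       */
static int generate(long *flips)
{
    (*flips)++; if (flip() == 0) return 0;         /* codeword "0"   */
    (*flips)++; if (flip() == 0) return 1;         /* codeword "10"  */
    (*flips)++; return flip() == 0 ? 2 : 3;        /* "110" or "111" */
}

int main(void)
{
    const long trials = 1000000;
    long flips = 0, count[4] = {0, 0, 0, 0};

    srand(2019);
    for (long t = 0; t < trials; t++) count[generate(&flips)]++;

    printf("empirical masses:");
    for (int s = 0; s < 4; s++) printf(" %.3f", (double)count[s] / trials);
    printf("\naverage flips = %.3f (H(X) = 1.75 bits)\n", (double)flips / trials);
    return 0;
}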

3.3 Operational meanings of resolvability and mean-resolvability

The operational meanings of the resolution and of the entropy (a new operational
meaning for entropy other than the one from the source coding theorem) follow
from the next theorem.

Theorem 3.13 For a single random variable X,

1. the worst-case complexity is lower-bounded by its resolution R(X) [26];

2. the average-case complexity is lower-bounded by its entropy H(X), and is
upper-bounded by the entropy H(X) plus 2 bits [28].

Next, we reveal the operational meanings for resolvability and mean-resolva-
bility in source coding. We begin with some lemmas that are useful in charac-
terizing the resolvability.

Lemma 3.14 (bound on variational distance) For every µ > 0,
$$\|P - Q\| \le 2\mu + 2 \cdot P\left\{ x \in \mathcal{X} : \log \frac{P(x)}{Q(x)} > \mu \right\}.$$

Proof:
$$
\begin{aligned}
\|P - Q\| &= 2 \sum_{x\in\mathcal{X}:\, P(x)\ge Q(x)} [P(x) - Q(x)]
 = 2 \sum_{x\in\mathcal{X}:\, \log[P(x)/Q(x)]\ge 0} [P(x) - Q(x)] \\
&= 2\left( \sum_{x\in\mathcal{X}:\, \log[P(x)/Q(x)]>\mu} [P(x) - Q(x)]
      + \sum_{x\in\mathcal{X}:\, \mu\ge\log[P(x)/Q(x)]\ge 0} [P(x) - Q(x)] \right) \\
&\le 2\left( \sum_{x\in\mathcal{X}:\, \log[P(x)/Q(x)]>\mu} P(x)
      + \sum_{x\in\mathcal{X}:\, \mu\ge\log[P(x)/Q(x)]\ge 0} P(x)\left[1 - \frac{Q(x)}{P(x)}\right] \right) \\
&\le 2\left( P\left\{x\in\mathcal{X}: \log\frac{P(x)}{Q(x)} > \mu\right\}
      + \sum_{x\in\mathcal{X}:\, \mu\ge\log[P(x)/Q(x)]\ge 0} P(x)\log\frac{P(x)}{Q(x)} \right)
      \quad \text{(by the fundamental inequality)} \\
&\le 2\left( P\left\{x\in\mathcal{X}: \log\frac{P(x)}{Q(x)} > \mu\right\}
      + \sum_{x\in\mathcal{X}:\, \mu\ge\log[P(x)/Q(x)]\ge 0} P(x)\cdot\mu \right) \\
&= 2\left( P\left\{x\in\mathcal{X}: \log\frac{P(x)}{Q(x)} > \mu\right\}
      + \mu\, P\left\{x\in\mathcal{X}: \mu \ge \log\frac{P(x)}{Q(x)} \ge 0\right\} \right) \\
&\le 2\left( P\left\{x\in\mathcal{X}: \log\frac{P(x)}{Q(x)} > \mu\right\} + \mu \right). \qquad\square
\end{aligned}
$$

Lemma 3.15 For every n,
$$P_{\tilde X^n}\left\{ x^n \in \mathcal{X}^n : -\frac{1}{n}\log P_{\tilde X^n}(x^n) \le \frac{1}{n} R(\tilde X^n) \right\} = 1.$$

Proof: By definition of R(X̃ n ),
$$P_{\tilde X^n}(x^n) \ge \exp\{-R(\tilde X^n)\}$$
for all x n ∈ X n with PX̃ n (x n ) > 0. Hence, for all such x n ,
$$-\frac{1}{n}\log P_{\tilde X^n}(x^n) \le \frac{1}{n} R(\tilde X^n).$$
The lemma then holds. □

Theorem 3.16 The resolvability for input X is equal to its sup-entropy rate,
i.e.,
S(X) = H̄(X).

Proof:

1. S(X) ≥ H̄(X).
It suffices to show that S(X) < H̄(X) contradicts to Lemma 3.15.
Suppose S(X) < H̄(X). Then there exists δ > 0 such that

S(X) + δ < H̄(X).

Let  
n n1 n
D0 := x ∈ X : − log PX n (x ) ≥ S(X) + δ .
n

By definition of H̄(X),4
lim sup PX n (D0 ) > 0.
n→∞

Therefore, there exists α > 0 such that


lim sup PX n (D0 ) > α,
n→∞

which immediately implies


PX n (D0 ) > α
infinitely often in n.
Select 0 < ε < min{α2 , 1} and observe that Sε (X) ≤ S(X), we can choose
 n to satisfy
X
1  n ) < S(X) + δ  n < ε
R(X and X n − X (3.3.1)
n 2
for sufficiently large n. Define
D1 := {xn ∈ X n : PX n (xn ) > 0
  √
and PX n (xn ) − PX n (xn ) ≤ ε · PX n (xn ) .
Then
PX n (D1c ) = PX n {xn ∈ X n : PX n (xn ) = 0
  √
or PX n (xn ) − PX n (xn ) > ε · PX n (xn )
≤ PX n {xn ∈ X n : PX n (xn ) = 0}
  √
+PX n xn ∈ X n : PX n (xn ) − PX n (xn ) > ε · PX n (xn )
  √
= PX n xn ∈ X n : PX n (xn ) − PX n (xn ) > ε · PX n (xn )

= PX n (xn )

xn ∈X n : PX n (xn )<(1/ ε)|PX n (xn )−PX
 n (x )|
n

 1
≤ √ |PX n (xn ) − PX n (xn )|
xn ∈X n
ε
ε √
≤ √ = ε.
ε
Consider that
PX n (D1 ∩ D0 ) ≥ PX n (D0 ) − PX n (D1c )

≥ α − ε > 0, (3.3.2)
4
lim inf n→∞ PX (D0c ) ≤ h(S(X)+δ) < 1 because if h(S(X)+δ) ≥ 1, then H̄(X) ≤ S(X)+δ.
¯ ¯

which holds infinitely often in n; and every xn0 in D1 ∩ D0 satisfies

PX n (xn0 ) ≤ (1 + ε)PX n (xn0 ) (since xn0 ∈ D1 )
and
1 1 1 1
− log PX n (xn0 ) ≥ − log PX n (xn0 ) + log √
n n n 1+ ε
1 1
≥ (S(X) + δ) + log √ (since xn0 ∈ D0 )
n 1+ ε
δ
≥ S(X) + ,
2

for n > (2/δ) log(1 + ε). Therefore, for those n that (3.3.2) holds,
 
n n 1 n 1  n
PX n x ∈ X : − log PX n (x ) > R(X )
n n
 
n n 1 n δ
≥ PX n x ∈ X : − log PX n (x ) > S(X) + (From (3.3.1))
n 2
 
n n 1 n
≥ PX n x ∈ X : − log PX n (x ) ≥ S(X) + δ
n
$ %& '
=D0
≥ PX n (D1 ∩ D0 )

≥ (1 − ε)PX n (D1 ∩ D0 ) (By definition of D1 )
> 0,
which contradicts to the result of Lemma 3.15.
2. S(X) ≤ H̄(X).
It suffices to show the existence of X̃ for arbitrary γ > 0 such that
 n = 0
lim X n − X
n→∞

 n is an M-type distribution with


and X
( γ
)
M = en(H̄(X)+ 2 ) ,
2
which ensures that for n > γ
log(2),
γ γ
M < en(H̄(X)+ 2 ) + 1 < 2en(H̄(X)+ 2 ) < en(H̄(X)+γ) .

n = X
Let X  n (X n ) be uniformly distributed over a set

G := {Uj ∈ X n : j = 1, . . . , M}

which drawn randomly (independently) according to PX n . Define for µ > 0,
 
n n 1 n γ µ
D := x ∈ X : − log PX n (x ) > H̄(X) + + .
n 2 n
For each G chosen, we obtain from Lemma 3.14 that
 n
X n − X
 
n n PX n (xn )
≤ 2µ + 2 · PX n x ∈ X : log >µ
PX n (xn )
 
n 1/M
= 2µ + 2 · PX n x ∈ G : log > µ (since PX n (G c ) = 0)
PX n (xn )
 ( ) µ
n 1 n 1 n(H̄(X)+ γ2 )
= 2µ + 2 · PX n x ∈ G : − log PX n (x ) > log e +
n n n
 
1 γ µ
≤ 2µ + PX n xn ∈ G : − log PX n (xn ) > H̄(X) + +
n 2 n
= 2µ + PX n (G ∩ D)
1
= 2µ + |G ∩ D| .
M
Since G is chosen randomly, we can take the expectation values (with
respect to the random G) of the above inequality to obtain:
 
EG X n − X n  ≤ 2µ + 1 EG [|G ∩ D|] .
M
Observe that each Uj is either in D or not in D. From the i.i.d. assumption
of {Uj }M
j=1 , we can then evaluate EG [|G ∩ D|] by
5

EG [|G ∩ D|] = MPXMn [D] + (M − 1)PXMn−1[D]PX n [D c ]


+ · · · + PX n [D]PXMn−1[D c ]
= MPX n [D].
Hence,
 
n  n
lim sup EG X − X  ≤ 2µ + lim sup PX n [D] = 2µ,
n→∞ n→∞

which implies  
n  n
lim sup EG X − X  = 0 (3.3.3)
n→∞
since µ can be chosen arbitrarily small. (3.3.3) therefore guarantees the
existence of the desired X̃. 2
5
Readers may imagine that there are cases: where |G ∩ D| = M , |G ∩ D| = M − 1, . . .,
M M−1
|G ∩D| = 1 and |G ∩D| = 0, respectively with drawing probability PX n (D), PX n (D)PX n (Dc ),
M−1 c M c
. . ., PX (D)PX n (D ) and PX n (D ).
n

The next two lemmas are useful in characterizing mean-resolvability.

Lemma 3.17 With 0 < a, b ≤ 1,
$$
\left| a\log\frac{1}{a} - b\log\frac{1}{b} \right| \le
\begin{cases}
|a-b|\cdot\log\dfrac{1}{|a-b|}, & |a-b| < \tfrac{1}{2}; \\[1ex]
(1-|a-b|)\cdot\log\dfrac{1}{1-|a-b|}, & \tfrac{1}{2} \le |a-b| < 1.
\end{cases}
$$

Proof: Without loss of generality, assume a = t + τ and b = t with 0 < t ≤


t + τ < 1 and τ < 1.
Subject to f (t) := t log( 1t ), we have that for 0 < t ≤ 1 − τ ,

∂[f (t + τ ) − f (t)] t
= log ≤0
∂t t+τ
Hence,
sup [f (t + τ ) − f (t)] = f (τ ) − f (0) = f (τ )
0<t≤1−τ

and
sup [f (t) − f (t + τ )] = f (1 − τ ) − f (0) = f (1 − τ ).
0<t≤1−τ

Thus

|f (a) − f (b)| = |f (t + τ ) − f (t)|


≤ max{f (τ ), f (1 − τ )}
= max{f (|a − b|), f (1 − |a − b|)}
 1

|a − b| · log , |a − b| < 12 ;
|a − b|
= 1

(1 − |a − b|) · log , 1 ≤ |a − b| < 1.
(1 − |a − b|) 2
2

Lemma 3.18 (variational distance and entropy difference [16, p. 33])
$$|H(X^n) - H(\tilde X^n)| \le \|X^n - \tilde X^n\| \cdot \log\frac{|\mathcal{X}|^n}{\|X^n - \tilde X^n\|},$$
provided ‖X n − X̃ n ‖ ≤ 1/2.

Proof:
|H(X n ) − H(X n )|
 
   
 n 1 n 1 
=  PX (x ) log
n − PX̃ n (x ) log 
 n n n
PX (x ) xn ∈X n
n PX̃ n (x ) 
n
x ∈X
 
  1 1 
≤ PX n (xn ) log − P (xn
) log 
 PX n (xn ) X̃ n
P n (xn ) 
xn ∈X n X̃
 1
≤ |PX n (xn ) − PX̃ n (xn )| · log (3.3.4)
xn ∈X n
|PX n
n (x ) − PX̃ n (xn )|
 

   xn ∈X n 1
n n
≤ |PX n (x ) − PX̃ n (x )| log   
n n
xn ∈X n |PX n (x ) − PX̃ n (x )|
xn ∈X n
(3.3.5)
n
|X |
= X n − X̃ n  log ,
X n − X̃ n 
 n ≤
where (3.3.4) follows from X n − X 1
and Lemma 3.17, and (3.3.5) uses
2
the log-sum inequality. 2
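Lemma 3.18 (here spot-checked for n = 1) can be verified numerically; the C sketch below uses an arbitrary pair of distributions on a ternary alphabet and is only an illustration of the bound, not part of the text.

#include <stdio.h>
#include <math.h>

static double entropy(const double *p, int k)      /* entropy in nats */
{
    double h = 0.0;
    for (int i = 0; i < k; i++)
        if (p[i] > 0.0) h -= p[i] * log(p[i]);
    return h;
}

int main(void)
{
    /* Two arbitrary distributions on an alphabet of size 3. */
    double P[3] = {0.50, 0.30, 0.20};
    double Q[3] = {0.45, 0.35, 0.20};

    double d = 0.0;                                 /* variational distance */
    for (int i = 0; i < 3; i++) d += fabs(P[i] - Q[i]);

    double lhs = fabs(entropy(P, 3) - entropy(Q, 3));
    double rhs = d * log(3.0 / d);                  /* valid here since d <= 1/2 */
    printf("|H(P)-H(Q)| = %.4f <= %.4f = d*log(|X|/d)\n", lhs, rhs);
    return 0;
}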

Theorem 3.19 For any X,
$$\bar S(X) = \limsup_{n\to\infty} \frac{1}{n} H(X^n).$$

Proof:
1. S̄(X) ≤ lim supn→∞ (1/n)H(X n).
It suffices to prove that S̄ε (X) ≤ lim supn→∞ (1/n)H(X n) for every ε > 0.
This is equivalent to show that for all γ > 0, there exists X̃ such that
1  n ) < lim sup 1 H(X n ) + γ
H(X
n n→∞ n

and
 n < ε
X n − X
for sufficiently large n. This can be trivially achieved by letting X̃ = X,
since for sufficiently many n,
1 1
H(X n ) < lim sup H(X n ) + γ
n n→∞ n

and
X n − X n  = 0.

2. S̄(X) ≥ lim supn→∞ (1/n)H(X n).
Observe that S̄(X) ≥ S̄ε (X) for any 0 < ε < e−1 ≈ 0.36788. Then for any
 n such that
γ > 0 and all sufficiently large n, there exists X
1  n ) < S̄(X) + γ
H(X (3.3.6)
n
and
 n  < ε.
X n − X

From Lemma 3.18 that states

 n )| ≤ X n − X
 n  · log |X |n |X |n
|H(X n ) − H(X ≤ ε log ,
X n − X n ε

where the last inequality holds because t log(1/t) is increasing for 0 < t <
e−1 , we obtain
 n ) ≥ H(X n ) − ε log |X |n + ε log ε.
H(X

which, together with (3.3.6), implies that


1
lim sup H(X n ) − ε log |X | < S̄(X) + γ.
n→∞ n

Since ε and γ can be taken arbitrarily small, we have


1
S̄(X) ≥ lim sup H(X n ).
n→∞ n

3.4 Resolvability, mean-resolvability and source coding

In the previous chapter, we proved that the lossless data compression rate
for block codes is lower bounded by H̄(X). We also showed in Section 3.3 that
H̄(X) is the resolvability of the source X. We can therefore conclude that the
resolvability is equal to the minimum lossless data compression rate for block
codes. We will justify their equivalence directly in this section.
As explained in the AEP theorem for memoryless sources, the set Fn (δ) contains
approximately 2^{nH(X)} elements, and the probability of the source sequences
falling outside Fn (δ) eventually goes to 0. Therefore, we can binary-index the
source sequences in Fn (δ) by
$$\log_2 2^{nH(X)} = nH(X) \ \text{bits},$$
and encode the source sequences outside Fn (δ) by a unique default binary code-
word, which results in an asymptotically vanishing probability of decoding error.
This is indeed the main idea behind Shannon’s source coding theorem for block codes.
By further exploring the above concept, we find that the key is actually the
existence of a set An = {xn1 , xn2 , . . . , xnM } with M ≈ 2^{nH(X)} and PX n (Acn ) → 0.
Thus, if we can find such a typical set, Shannon’s source coding theorem for block
codes can be generalized to more general sources, such as non-stationary sources.
Furthermore, extensions of the theorem to codes of some specific types become
feasible.

Definition 3.20 (minimum ε-source compression rate for fixed-length
codes) R is an achievable ε-source compression rate for fixed-length codes if there
exists a sequence of sets {An }∞n=1 with An ⊂ X n such that
$$\limsup_{n\to\infty} \frac{1}{n}\log|A_n| \le R \quad \text{and} \quad \limsup_{n\to\infty} P_{X^n}[A_n^c] \le \varepsilon.$$
Tε (X) is the minimum of all such rates.

Note that the definition of Tε (X) is equivalent to the one in Definition 2.2.

Definition 3.21 (minimum source compression rate for fixed-length


codes) T (X) represents the minimum source compression rate for fixed-length
codes, which is defined as:
T (X) := lim Tε (X).
ε↓0

Definition 3.22 (minimum source compression rate for variable-length
codes) R is an achievable source compression rate for variable-length codes if
there exists a sequence of error-free prefix codes {∼Cn }∞n=1 such that
$$\limsup_{n\to\infty} \frac{1}{n}\bar\ell_n \le R,$$
where ℓ̄n is the average codeword length of ∼Cn . T̄ (X) is the minimum of all
such rates.

Recall that for a single source, the measure of its uncertainty is the entropy.
Although the entropy can also be used to characterize the overall uncertainty of
a random sequence X, source coding is concerned more with the “average”
entropy of it. So far, we have seen four expressions of “average” entropy:
$$
\begin{aligned}
\limsup_{n\to\infty} \frac{1}{n} H(X^n) &:= \limsup_{n\to\infty} \frac{1}{n} \sum_{x^n\in\mathcal{X}^n} -P_{X^n}(x^n)\log P_{X^n}(x^n); \\
\liminf_{n\to\infty} \frac{1}{n} H(X^n) &:= \liminf_{n\to\infty} \frac{1}{n} \sum_{x^n\in\mathcal{X}^n} -P_{X^n}(x^n)\log P_{X^n}(x^n); \\
\bar H(X) &:= \inf_{\beta\in\mathbb{R}}\left\{ \beta : \limsup_{n\to\infty} P_{X^n}\!\left[ -\frac{1}{n}\log P_{X^n}(X^n) > \beta \right] = 0 \right\}; \\
\underline{H}(X) &:= \sup_{\alpha\in\mathbb{R}}\left\{ \alpha : \limsup_{n\to\infty} P_{X^n}\!\left[ -\frac{1}{n}\log P_{X^n}(X^n) < \alpha \right] = 0 \right\}.
\end{aligned}
$$
If
$$\lim_{n\to\infty} \frac{1}{n} H(X^n) = \limsup_{n\to\infty} \frac{1}{n} H(X^n) = \liminf_{n\to\infty} \frac{1}{n} H(X^n),$$
then limn→∞ (1/n)H(X n ) is called the entropy rate of the source. H̄(X) and
H̲(X) are called the sup-entropy rate and inf-entropy rate, which were introduced
in Section 1.3.
Next we will prove that T (X) = S(X) = H̄(X) and T̄ (X) = S̄(X) =
lim supn→∞ (1/n)H(X n ) for a source X. The operational characterizations of
lim inf n→∞ (1/n)H(X n ) and H̲(X) will be introduced in Chapter 5.

Theorem 3.23 (equality of resolvability and minimum source coding


rate for fixed-length codes)

T (X) = S(X) = H̄(X).

Proof: Equality of S(X) and H̄(X) is already given in Theorem 3.16. Also,
T (X) = H̄(X) can be obtained from Theorem 2.5 by letting ε = 0. Here, we
provide an alternative proof for T (X) = S(X).

1. T (X) ≤ S(X).
If we can show that, for any ε fixed, Tε (X) ≤ S2ε (X), then the proof is
completed. This claim is proved as follows.
• By definition of S2ε (X), we know that for any γ > 0, there exists X̃
and N such that for n > N,
1  n ) < S2ε (X) + γ  n  < 2ε.
R(X and X n − X
n
 n ) < S2ε (X) + γ,
• Let An := xn : PX n (xn ) > 0 . Since (1/n)R(X

 n )} < exp{n(S2ε (X) + γ)}.


|An | ≤ exp{R(X

Therefore,
1
lim sup log |An | ≤ S2ε (X) + γ.
n→∞ n

• Also,
 n  = 2 sup |PX n (E) − P  n (E)|
2ε > X n − X X
E⊂X n
≥ 2|PX n (Acn ) − PX n (Acn )|
= 2PX n (Acn ), (sincePX n (Acn ) = 0).

Hence, lim supn→∞ PX n (Acn ) ≤ ε.


• Since S2ε (X) + γ is just one of the rates that satisfy the condition of
the minimum ε-source compression rate, and Tε (X) is the smallest
one of such rates,

Tε (X) ≤ S2ε (X) + γ for any γ > 0.

2. T (X) ≥ S(X).
Similarly, if we can show that, for any ε fixed, Tε (X) ≥ S3ε (X), then the
proof is completed. This claim can be proved as follows.

• Fix α > 0. By definition of Tε (X), we know that for any γ > 0, there
exists N and a sequence of sets {An }∞
n=1 such that for n > N,

1
log |An | < Tε (X) + γ and PX n (Acn ) < ε + α.
n
• Choose Mn to satisfy

exp{n(Tε (X) + 2γ)} ≤ Mn ≤ exp{n(Tε (X) + 3γ)}. (3.4.1)

Also select one element xn0 from Acn . Define a new random variable
 n as follows:
X

 0, if xn ∈ {xn0 } ∪ An ;
n n
PX n (x ) = k(x )
 , if xn ∈ {xn0 } ∪ An ,
Mn
where  n
 Mn PX
n (x ), if xn ∈ An ;
k(xn ) := k(xn ), if xn = xn0 .
 Mn −
xn ∈An

 n satisfies the next four properties:


It can then be easily verified that X
(a) X n is Mn -type;
(b) PX n (xn0 ) ≤ PX n (Acn ) < ε + α, since xn0 ∈ Acn ;

(c) for all xn ∈ An ,
  n
P  n (xn ) − PX n (xn ) = PX n (xn ) − Mn PX n (x ) ≤ 1 .
X
Mn Mn
(d) PX n (An ) + PX n (xn0 ) = 1.
• Consequently,
1  n ) ≤ Tε (X) + 3γ, (by (3.4.1))
R(X
n
and
    
 n =
X n − X P  n (xn ) − PX n (xn ) + P  n (xn ) − PX n (xn )
X X 0 0
xn ∈An
  
+ P  n (xn ) − PX n (xn )
X
xn ∈Acn \{xn
0}
  
≤ P  n (xn ) − PX n (xn ) + P  n (xn ) + PX n (xn )
X X 0 0
xn ∈An
  
+ P  n (xn ) − PX n (xn )
X
xn ∈Acn \{xn
0}
 1
≤ + PX n (xn0 ) + PX n (xn0 )
xn ∈An
M n

+ PX n (xn )
xn ∈Acn \{xn
0}

|An | 
= + PX n (xn0 ) + PX n (xn )
Mn xn ∈Ac n

exp{n(Tε (X) + γ)}


≤ + (ε + α) + PX n (Acn )
exp{n(Tε (X) + 2γ)}
≤ e−nγ + (ε + α) + (ε + α)
≤ 3(ε + α), for n ≥ − log(ε + α)/γ.

• Since Tε (X) is just one of the rates that satisfy the condition of 3(ε +
α)-resolvability, and S3(ε+α) (X) is the smallest one of such quantities,
S3(ε+α) (X) ≤ Tε (X).
The proof is completed by noting that α can be made arbitrarily
small. 2

This theorem tells us that the minimum source compression ratio for fixed-
length code is the resolvability, which in turn is equal to the sup-entropy rate.

Theorem 3.24 (equality of mean-resolvability and minimum source
coding rate for variable-length codes)
1
T̄ (X) = S̄(X) = lim sup H(X n ).
n→∞ n

Proof: Equality of S̄(X) and lim supn→∞ (1/n)H(X n ) is already given in The-
orem 3.19.

1. S̄(X) ≤ T̄ (X).
Definition 3.22 states that there exists, for all γ > 0 and all sufficiently
large n, an error-free variable-length code whose average codeword length
n satisfies
1
n < T̄ (X) + γ.
n
Moreover, the fundamental source coding lower bound for a uniquely de-
codable code (cf. [2, Thm. 3.22]) is

H(X n ) ≤ n.

 n  = 0 and
Thus, by letting X̃ = X, we obtain X n − X
1  n ) = 1 H(X n ) ≤ 1
H(X n < T̄ (X) + γ,
n n n
which concludes that T̄ (X) is an ε-achievable mean-resolution rate of X
for any ε > 0, i.e.,
S̄(X) = lim S̄ε (X) ≤ T̄ (X).
ε→0

2. T̄ (X) ≤ S̄(X).
Observe that S̄ε (X) ≤ S̄(X) for 0 < ε < e−1 ≈ 0.36788. Hence, by taking
γ satisfying ε log |X | < γ < 2ε log |X | and for all sufficiently large n, there
exists X n such that
1  n ) < S̄(X) + γ
H(X
n
and
X n − X  n  < ε. (3.4.2)
On the other hand, Theorem 3.27 in [2] proves the existence of an error-free
prefix code for X n with average codeword length n satisfies

n ≤ H(X n ) + log(2) (nats).

From Lemma 3.18 that states

 n )| ≤ X n − X
 n  · log |X |n |X |n
|H(X n ) − H(X ≤ ε log ,
X n − X n ε

where the last inequality holds because t log(1/t) is increasing for 0 < t <
e−1 , we obtain
1 1 1
n ≤ H(X n ) + log(2)
n n n
1  n ) + ε log |X | − 1 ε log(ε) + 1 log(2)
≤ H(X
n n n
1 1
≤ S̄(X) + γ + ε log |X | − ε log(ε) + log(2)
n n
≤ S̄(X) + 2γ,
if n > (log(2) − ε log(ε))/(γ − ε log |X |). Since γ can be made arbitrarily
small, S̄(X) is an achievable source compression rate for variable-length
codes; and hence,
T̄ (X) ≤ S̄(X).
2

Again, the above theorem tells us that the minimum source compression rate
for variable-length codes is the mean-resolvability, and the mean-resolvability is
exactly lim supn→∞ (1/n)H(X n ).
Note that lim supn→∞ (1/n)H(X n ) ≤ H̄(X), which follows straightforwardly
from the fact that the mean of the random variable −(1/n) log PX n (X n ) cannot
asymptotically exceed the right endpoint H̄(X) of its limiting spectrum. Also note
that for stationary-ergodic sources, all of these quantities are equal, i.e.,
$$T(X) = S(X) = \bar H(X) = \bar T(X) = \bar S(X) = \limsup_{n\to\infty} \frac{1}{n} H(X^n).$$

We end this chapter by computing these quantities for a specific example.

Example 3.25 Consider a binary random source X1 , X2 , . . . where {Xi }∞


i=1 are
independent random variables with individual distribution
PXi (0) = Zi and PXi (1) = 1 − Zi ,
where {Zi }∞
i=1 are pair-wise independent with common uniform marginal distri-
bution over (0, 1).
You may imagine that the source is formed by selecting from infinitely many
binary number generators as shown in Figure 3.1. The selecting process {Zi }∞
i=1
is independent for each time instance.

Figure 3.1: Source generator: {Xt }0<t<1 is an independent random
process with PXt (0) = t and PXt (1) = 1 − t, and is also independent
of the selector Z, where Xt is output if Z = t. The source generator at
each time instant is independent temporally.

It can be shown that such a source is not stationary. Nevertheless, by an
argument similar to that of the AEP theorem, we can show that
$$-\frac{\log P_X(X_1) + \log P_X(X_2) + \cdots + \log P_X(X_n)}{n} \to h_b(Z) \quad \text{in probability},$$
where hb (a) := −a log2 (a) − (1 − a) log2 (1 − a) is the binary entropy function.
To compute the ultimate average entropy rate in terms of the random variable
hb (Z), we require that
$$-\frac{\log P_X(X_1) + \log P_X(X_2) + \cdots + \log P_X(X_n)}{n} \to h_b(Z) \quad \text{in mean},$$
which is a stronger result than convergence in probability. By the fundamen-
tal properties of convergence, convergence in probability implies convergence in
mean provided the sequence of random variables is uniformly integrable, which

is true for $-\frac{1}{n}\sum_{i=1}^n \log P_X(X_i)$ since
$$
\begin{aligned}
\sup_{n>0} E\left[ \left| \frac{1}{n}\sum_{i=1}^n \log P_X(X_i) \right| \right]
&\le \sup_{n>0} \frac{1}{n}\sum_{i=1}^n E\left[ |\log P_X(X_i)| \right] \\
&= \sup_{n>0} E\left[ |\log P_X(X)| \right], \quad \text{because of the i.i.d.\ of } \{X_i\}_{i=1}^n \\
&= E\left[ |\log P_X(X)| \right]
 = E\Big[ E\big[ |\log P_X(X)| \,\big|\, Z \big] \Big] \\
&= \int_0^1 E\big[ |\log P_X(X)| \,\big|\, Z = z \big]\, dz \\
&= \int_0^1 \big[ z\,|\log(z)| + (1-z)\,|\log(1-z)| \big]\, dz \\
&\le \int_0^1 \log(2)\, dz = \log(2).
\end{aligned}
$$

We therefore have
$$
\left| \frac{1}{n} H(X^n) - E[h_b(Z)] \right|
= \left| E\left[ -\frac{1}{n}\log P_{X^n}(X^n) \right] - E[h_b(Z)] \right|
\le E\left[ \left| -\frac{1}{n}\log P_{X^n}(X^n) - h_b(Z) \right| \right] \to 0 \ \text{ as } n\to\infty.
$$

Consequently,
$$
\begin{aligned}
\limsup_{n\to\infty} \frac{1}{n} H(X^n) = E[h_b(Z)]
&= -\int_0^1 \big[ z\log(z) + (1-z)\log(1-z) \big]\, dz \\
&= -\int_0^1 2z\log(z)\, dz
 = -2\left[ \frac{z^2}{2}\log(z) - \frac{z^2}{4} \right]_0^1 \\
&= \frac{1}{2} \ \text{nats, or } \ \frac{1}{2\log(2)} \approx 0.72135 \ \text{bits}.
\end{aligned}
$$

However, it can be shown that the ultimate cumulative distribution function of


−(1/n) log PX n (X n ) is Pr[hb (Z) ≤ t] for t ∈ [0, log(2)] (cf. Figure 3.2).

Figure 3.2: The ultimate CDF of −(1/n) log PX n (X n ): Pr{hb (Z) ≤ t},
supported on [0, log(2)] nats.

The sup-entropy rate of X is log(2) nats or 1 bit (the right endpoint of
the ultimate CDF of −(1/n) log PX n (X n )). Hence, for this non-stationary source,
the minimum compression rates for fixed-length codes and variable-length codes
are different: they are 1 bit and 0.72135 bits per source letter, respectively.
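The numbers in Example 3.25 can be illustrated by simulation. The C sketch below adopts the reading that reproduces Figure 3.2 — a single selector value z is drawn uniformly for each realization of the source, after which the letters are conditionally i.i.d. with P{Xi = 0} = z — so that PX n (x n ) = ∫₀¹ z^{n0} (1 − z)^{n1} dz = n0! n1!/(n + 1)!, which the code evaluates via lgamma. Under this reading, the normalized self-information lands near hb (z) (spread over [0, 1] bit), while its average over realizations approaches 1/(2 log 2) ≈ 0.72135 bits. All parameter values below are arbitrary choices for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double unif(void) { return (rand() + 1.0) / (RAND_MAX + 2.0); }  /* uniform on (0,1) */

int main(void)
{
    const int n = 100000, trials = 200;
    double sum = 0.0;

    srand(2019);
    for (int t = 0; t < trials; t++) {
        double z = unif();                  /* selector value: P{X_i = 0} = z          */
        long n0 = 0;
        for (int i = 0; i < n; i++)
            if (unif() < z) n0++;           /* number of zeros in the realization x^n  */
        long n1 = n - n0;
        /* -(1/n) log2 P_{X^n}(x^n), with P_{X^n}(x^n) = n0! n1! / (n+1)!              */
        double info = -(lgamma(n0 + 1.0) + lgamma(n1 + 1.0) - lgamma(n + 2.0))
                      / (n * log(2.0));
        if (t < 3)
            printf("z = %.3f : %.4f bits  (h_b(z) = %.4f bits)\n", z, info,
                   -(z * log2(z) + (1 - z) * log2(1 - z)));
        sum += info;
    }
    printf("average over %d realizations = %.4f bits  (E[h_b(Z)] = 0.7213 bits)\n",
           trials, sum / trials);
    return 0;
}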

Chapter 4

Channel Coding Theorems and


Approximations of Output Statistics for
Arbitrary Channels

Shannon’s channel capacity [2] is usually derived under the assumption that the
channel is memoryless. With moderate modification of the proof, this result
can be extended to stationary-ergodic channels for which the capacity formula
becomes the maximization of the mutual information rate:
$$\lim_{n\to\infty} \sup_{X^n} \frac{1}{n} I(X^n; Y^n).$$

Yet, for more general channels, such as non-stationary or non-ergodic channels,


a more general expression for channel capacity needs to be derived. This general
formula will be introduced and established in this chapter.

4.1 General channel models

The channel transition probability in its most general form is denoted by {W n =


PY n |X n }∞
n=1 , which is abbreviated by W for convenience. Similarly, the input
and output random processes are respectively denoted by X and Y . Throughout
the text, we denote for convenience PX n ,Y n = PX n W n , where Y n is the output of
channel W n = PY n |X n under input X n . Please refer also to Section 1.3 for the
description of general channels.
Now, similar to the definitions of sup- and inf-entropy rates for sequence of

sources, the sup- and inf-(mutual-)information rates are respectively defined by1
$$\bar I(X; Y) := \sup\{\theta : \underline{i}(\theta) < 1\}$$
and
$$\underline{I}(X; Y) := \sup\{\theta : \bar i(\theta) \le 0\},$$
where
$$\underline{i}(\theta) := \liminf_{n\to\infty} \Pr\left[ \frac{1}{n} i_{X^n W^n}(X^n; Y^n) \le \theta \right]$$
is the inf-spectrum of the normalized information density,
$$\bar i(\theta) := \limsup_{n\to\infty} \Pr\left[ \frac{1}{n} i_{X^n W^n}(X^n; Y^n) \le \theta \right]$$
is the sup-spectrum of the normalized information density, and
$$i_{X^n W^n}(x^n; y^n) := \log \frac{P_{Y^n|X^n}(y^n|x^n)}{P_{Y^n}(y^n)}$$
is the information density.
In 1994, Verdú and Han [42] showed that the channel capacity in its most
general form is
$$C := \sup_{X} \underline{I}(X; Y).$$
In their proof, they showed the achievability part via Feinstein’s lemma for the
channel coding average error probability. More importantly, they provided a new
converse based on an error lower bound for multihypothesis testing, which can
be seen as a natural counterpart to the error upper bound afforded by Feinstein’s
lemma. In this chapter, we do not present Verdú and Han’s original proof of
the converse theorem. Instead, we will first derive and illustrate in Section 4.3 a
general lower bound on the minimum error probability of multihypothesis testing
[14].2 We then use a special case of the bound, which yields the so-called Poor-
Verdú bound [34], to complete the proof of the converse theorem.
1 In the paper of Verdú and Han [42], these two quantities are defined by
$$\bar I(X; Y) := \inf_{\beta\in\mathbb{R}}\left\{ \beta : (\forall\, \gamma > 0)\ \limsup_{n\to\infty} P_{X^n W^n}\!\left[ \frac{1}{n} i_{X^n W^n}(X^n; Y^n) > \beta + \gamma \right] = 0 \right\}$$
and
$$\underline{I}(X; Y) := \sup_{\alpha\in\mathbb{R}}\left\{ \alpha : (\forall\, \gamma > 0)\ \limsup_{n\to\infty} P_{X^n W^n}\!\left[ \frac{1}{n} i_{X^n W^n}(X^n; Y^n) < \alpha + \gamma \right] = 0 \right\}.$$
The above definitions are in fact equivalent to ours.
2
We refer the reader to [40] for other tight characterizations of the error probability of
multihypothesis testing.

4.2 Channel coding and Feinstein’s lemma

Definition 4.1 (fixed-length data transmission code) An (n, M) fixed-


length data transmission code for channel input alphabet X n and output alpha-
bet Y n consists of:

1. M messages intended for transmission;


2. an encoding function
f : {1, 2, . . . , M} → X n ;

3. a decoding function
g : Y n → {1, 2, . . . , M},
which is (usually) a deterministic rule that assigns a guess to each possible
received vector.

The channel inputs in {xn ∈ X n : xn = f (m) for some 1 ≤ m ≤ M} are the


codewords of the data transmission code.

Definition 4.2 (average probability of error) The average probability of


error for a ∼Cn = (n, M) code with encoder f (·) and decoder g(·) transmitted
over channel W n = PY n |X n is defined as

Pe (∼Cn ) = (1/M) Σ_{i=1}^{M} λi ,
where
λi := Σ_{y^n ∈ Y^n : g(y^n) ≠ i} PY n |X n (y^n | f(i)).

We assume that the message set (of size M) is governed by a uniform distri-
bution. Thus, under the average probability of error criterion, all codewords are
treated equally (having a uniform prior distribution).
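To make Definitions 4.1 and 4.2 concrete, the C sketch below evaluates Pe exactly for a toy ∼Cn = (3, 2) repetition code over a memoryless binary symmetric channel with crossover probability p, decoded by majority vote; the channel and code are illustrative choices, not taken from the text, and the exact answer is 3p²(1 − p) + p³.

#include <stdio.h>

/* Exact average error probability of the (3,2) repetition code over a
 * BSC(p) with majority-vote decoding, by enumerating all 8 outputs.    */
int main(void)
{
    const double p = 0.1;                    /* crossover probability        */
    const int codeword[2][3] = {{0,0,0}, {1,1,1}};
    double pe = 0.0;

    for (int m = 0; m < 2; m++) {            /* transmitted message          */
        double lambda = 0.0;                 /* error prob. given message m  */
        for (int y = 0; y < 8; y++) {        /* all output triples           */
            int bit[3] = {y & 1, (y >> 1) & 1, (y >> 2) & 1};
            double prob = 1.0;
            int ones = 0;
            for (int i = 0; i < 3; i++) {
                prob *= (bit[i] == codeword[m][i]) ? 1 - p : p;
                ones += bit[i];
            }
            int decoded = (ones >= 2);       /* majority vote                */
            if (decoded != m) lambda += prob;
        }
        pe += lambda / 2.0;                  /* uniform prior on messages    */
    }
    printf("Pe = %.6f (formula: %.6f)\n", pe, 3*p*p*(1-p) + p*p*p);
    return 0;
}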

Definition 4.3 (channel capacity C) The channel capacity C is the supre-


mum of all the rates R for which there exists a sequence of ∼Cn = (n, Mn )
channel block codes such that
1
lim inf log Mn ≥ R,
n→∞ n

and
lim sup Pe (∼Cn ) = 0.
n→∞

Lemma 4.4 (Feinstein’s Lemma) Fix a positive integer n. For every γ > 0 and
input distribution PX n on X n , there exists an (n, M) block code ∼Cn for the transi-
tion probability W n = PY n |X n whose average error probability Pe (∼Cn ) satisfies
$$P_e({\sim}C_n) \le \Pr\left[ \frac{1}{n} i_{X^n W^n}(X^n; Y^n) < \frac{1}{n}\log M + \gamma \right] + e^{-n\gamma}.$$

Proof:

Step 1: Notations. Define


 
n n n n 1 n n 1
G := (x , y ) ∈ X × Y : iX n W n (x ; y ) ≥ log M + γ .
n n

Let ν := e−nγ + PX n W n (G c ). Feinstein’s Lemma obviously holds if ν ≥ 1,


because then
 
1 1
Pe (∼Cn ) ≤ 1 ≤ ν := Pr iX n W n (X ; Y ) < log M + γ + e−nγ .
n n
n n

So we assume ν < 1, which immediately results in

PX n W n (G c ) < ν < 1,

or equivalently,
PX n W n (G) > 1 − ν > 0. (4.2.1)
Therefore, denoting

A := {xn ∈ X n : PY n |X n (Gxn |xn ) > 1 − ν}

with Gxn := {y n ∈ Y n : (xn , y n ) ∈ G}, we have

PX n (A) > 0,

because if PX n (A) = 0,

(∀ xn with PX n (xn ) > 0) PY n |X n (Gxn |xn ) ≤ 1 − ν



⇒ PX n (xn )PY n |X n (Gxn |xn ) = PX n W n (G) ≤ 1 − ν,
xn ∈X n

and a contradiction to (4.2.1) is obtained.

Step 2: Encoder. Choose an xn1 in A (Recall that PX n (A) > 0.) Define Γ1 =
Gxn1 . (Then PY n |X n (Γ1 |xn1 ) > 1 − ν.)
Next choose, if possible, a point xn2 ∈ X n without replacement (i.e., xn2 can
be identical to xn1 ) for which

PY n |X n Gxn2 − Γ1  xn2 > 1 − ν,

and define Γ2 := Gxn2 − Γ1 .


Continue in the following way as for codeword i: choose xni to satisfy
-  /
. 
i−1
PY n |X n Gxni − Γj  xni > 1 − ν,

j=1

0i−1
and define Γi := Gxni − j=1 Γj .
Repeat the above codeword selecting procedure until either M codewords
are selected or all the points in A are exhausted.

Step 3: Decoder. Define the decoding rule as



n i, if y n ∈ Γi
φ(y ) =
arbitrary, otherwise.

Step 4: Probability of error. For all selected codewords, the error probabi-
lity given codeword i is transmitted, λe|i , satisfies

λe|i ≤ PY n |X n (Γci |xni ) < ν.

(Note that (∀ i) PY n |X n (Γi |xni ) ≥ 1−ν by Step 2.) Therefore, if we can show
that the above codeword selecting procedures will not terminate before M,
then
1 
M
Pe (∼Cn ) = λe|i < ν.
M i=1

Step 5: Claim. The codeword selecting procedure in Step 2 will not terminate
before M.
Proof: We will prove it by contradiction.
Suppose the above procedure terminates before M, say at N < M. Define
the set
.
N
F := Γi ∈ Y n .
i=1

Consider the probability
PX n W n (G) = PX n W n [G ∩ (X n × F )] + PX n W n [G ∩ (X n × F c )]. (4.2.2)

Since for any y n ∈ Gxni ,


PY n |X n (y n |xni )
PY n (y n ) ≤ ,
M · enγ
we have
PY n (Γi ) ≤ PY n (Gxni )
1 −nγ
≤ e PY n |X n (Gxni |xni )
M
1 −nγ
≤ e .
M
So the first term of the right hand side in (4.2.2) can be upper bounded by
PX n W n [G ∩ (X n × F )] ≤ PX n W n (X n × F )
= PY n (F )
N
= PY n (Γi )
i=1
1 −nγ N −nγ
≤ N× e = e .
M M
As for the second term of the right hand side in (4.2.2), we can upper
bound it by

PX n W n [G ∩ (X n × F c )] = PX n (xn )PY n |X n (Gxn ∩ F c |xn )
xn ∈X n
-  /
 .
N 

= PX n (xn )PY n |X n Gxn − Γi  xn

xn ∈X n i=1

≤ PX n (xn ) · (1 − ν) ≤ 1 − ν,
xn ∈X n

where the last step follows since for all xn ∈ X n ,


-  /
.N 

PY n |X n Gxn − Γi  xn ≤ 1 − ν.

i=1

(Because otherwise we could find the (N + 1)-th codeword.)


Consequently, PX n W n (G) ≤ (N/M)e−nγ + 1 − ν. By definition of G, we
obtain
N −nγ
PX n W n (G) = 1 − ν + e−nγ ≤ e + 1 − ν,
M
which implies N ≥ M, resulting in a contradiction. 2

4.3 Error bounds for multihypothesis testing

We next introduce the generalized Poor-Verdú bound parameterized by θ ≥ 1.


Note that when θ = 1, this bound reduces to the original Poor-Verdú bound in
[34].

Lemma 4.5 (generalized Poor-Verdú bound [14]) Suppose X and Y are
random variables, where X takes values in a discrete (i.e., finite or countably
infinite) alphabet X = {x1 , x2 , x3 , . . .} and Y takes values in an arbitrary
alphabet Y. The minimum probability of error Pe in estimating X from Y
satisfies
$$P_e \ge (1-\alpha)\cdot P_{X,Y}\left\{ (x,y)\in\mathcal{X}\times\mathcal{Y} : P^{(\theta)}_{X|Y}(x|y) \le \alpha \right\} \qquad (4.3.1)$$
for each α ∈ [0, 1] and θ ≥ 1, where for each y ∈ Y,
$$P^{(\theta)}_{X|Y}(x|y) := \frac{\big(P_{X|Y}(x|y)\big)^\theta}{\sum_{x'\in\mathcal{X}} \big(P_{X|Y}(x'|y)\big)^\theta}, \qquad x\in\mathcal{X}, \qquad (4.3.2)$$
is the tilted distribution of PX|Y (·|y) with parameter θ.

Proof: Fix θ ≥ 1. We only provide the proof for 0 < α < 1 since the lower
bound trivially holds when α = 0 and α = 1.
It is known that the estimate e(Y ) of X from observing Y that minimizes
the error probability is the maximum a posteriori (MAP) estimate given by3
e(Y ) = arg max PX|Y (x|Y ). (4.3.3)
x∈X

Therefore, the error probability incurred in testing among the values of X is


given by
1 − Pe = Pr{X = e(Y )}
,   
= PX|Y (x|y) dPY (y)
Y {x : x=e(y)}
,  
= max PX|Y (x|y) dPY (y)
Y x∈X
,  
= max fx (y) dPY (y)
Y x∈X
 
= E max fx (Y ) ,
x∈X

3
Since randomization among those x’s that achieve maxx∈X PX|Y (x|y) results in the same
optimal error probability, we assume without loss of generality that e(y) is a deterministic
mapping that selects the maximizing x ∈ X of lowest index.

where fx (y) := PX|Y (x|y). Note that fx (y) satisfies
 
fx (y) = PX|Y (x|y) = 1.
x∈X x∈X

For a fixed y ∈ Y, let hj (y) be the j-th element in the set


{fx1 (y), fx2 (y), fx3 (y), . . .}
such that its elements are listed in non-increasing order; i.e.,
h1 (y) ≥ h2 (y) ≥ h3 (y) ≥ · · ·
and
{h1 (y), h2(y), h3(y), . . .} = {fx1 (y), fx2 (y), fx3 (y), . . .}.
Then
1 − Pe = E[h1 (Y )]. (4.3.4)
(θ) (θ)
Furthermore, for each hj (y) above, define hj (y) such that hj (y) is the respec-
tive element for hj (y), satisfying
(θ) (θ)
hj (y) = fxj (y) = PX|Y (xj |y) ⇔ hj (y) = PX|Y (xj |y).

Since h1 (y) is the largest among {hj (y)}j≥1, we note that


hθ1 (y) 1
h1 (y) = 
(θ)
θ
= 
j≥1 hj (y) 1 + j≥2[hj (y)/h1(y)]θ
is non-decreasing in θ for each y; this implies that
(θ)
h1 (y) ≥ h1 (y) for θ ≥ 1 and y ∈ Y. (4.3.5)

For any α ∈ (0, 1), we can write



(θ)
PX,Y (x, y) ∈ X × Y : PX|Y (x|y) > α
, 
(θ)
= PX|Y x ∈ X : PX|Y (x|y) > α dPY (y)
Y
, - ∞  
/
(θ)
= hj (y) · 1 hj (y) > α dPY (y)
Y j=1
,  
(θ)
≥ h1 (y) · 1 h1 (y) > α dPY (y)
,Y
≥ h1 (y) · 1(h1 (y) > α)dPY (y)
Y
= E[h1 (Y ) · 1(h1 (Y ) > α)], (4.3.6)

where 1(·) is the indicator function4 and the second inequality follows from
(4.3.5).
To complete the proof, we next relate E[h1 (Y )·1(h1 (Y ) > α)] with E[h1 (Y )],
which is exactly 1 − Pe . For any α ∈ (0, 1) and any random variable U with
Pr{0 ≤ U ≤ 1} = 1, the following inequality holds with probability one:

U ≤ α + (1 − α) · U · 1(U > α).

This can be easily proved by upper-bounding U in terms of α when 0 ≤ U ≤ α,


and α + (1 − α)U, otherwise. Thus

E[U] ≤ α + (1 − α)E[U · 1(U > α)].

Applying the above inequality to (4.3.6) by setting U = h1 (Y ), we obtain



(θ)
(1 − α)PX,Y (x, y) ∈ X × Y : PX|Y (x|y) > α
≥ E[h1 (Y )] − α
= (1 − Pe ) − α
= (1 − α) − Pe ,

where the first equality follows from (4.3.4). This completes the proof. 2
We next show that if the MAP estimate e(Y ) of X from Y is almost surely
unique in (4.3.3), then the bound of Lemma 4.5, without the (1 − α) factor, is
tight in the limit of θ going to infinity.

Lemma 4.6 Consider two random variables X and Y , where X has a finite or
countably infinite alphabet X = {x1 , x2 , x3 , . . .} and Y has an arbitrary alphabet
Y. Assume that
PX|Y (e(y)|y) > max PX|Y (x|y) (4.3.7)
x∈X :x=e(y)

holds almost surely in PY , where e(y) is the MAP estimate from y as defined in
(4.3.3); in other words, the MAP estimate is almost surely unique in PY . Then,
the error probability in the MAP estimation of X from Y satisfies

(θ)
Pe = lim PX,Y (x, y) ∈ X × Y : PX|Y (x|y) ≤ α (4.3.8)
θ→∞

(θ)
for each α ∈ (0, 1), where the tilted distribution PX|Y (·|y) is given in (4.3.2) for
y ∈ Y.
4 (θ)
I.e., if h1 (y) > α is true, 1(·) = 1; else, it is zero.

(θ)
Proof: It can be easily verified from the definitions of hj (·) and hj (·) that the
following two limits hold for each y ∈ Y:
1 1
(θ)
lim h1 (y) = lim  θ
= ,
θ→∞ θ→∞ 1+ j≥2 [hj (y)/h1 (y)] (y)
where
(y) := max{j ∈ N : hj (y) = h1 (y)} (4.3.9)
and N := {1, 2, 3, . . .} is the set of positive integers, and
 
(θ)
lim hj (y) · 1 hj (y) > α
θ→∞
  
hj (y) · 1 (y)
1
>α for j = 1, · · · , (y);
= (4.3.10)
0 for j > (y).

As a result, we obtain that for any α ∈ (0, 1),



(θ)
lim PX,Y (x, y) ∈ X × Y : PX|Y (x|y) > α
θ→∞
, - ∞  
/
(θ)
= lim hj (y) · 1 hj (y) > α dPY (y)
θ→∞ Y j=1
, - /

∞  
(θ)
= lim hj (y) · 1 hj (y) > α dPY (y) (4.3.11)
Y θ→∞ j=1
 
, (y)  
 1
= h1 (y) · 1 > α  dPY (y), (4.3.12)
Y j=1
(y)
,  
1
= h1 (y) · (y) · 1 > α dPY (y),
Y (y)
where (4.3.11) holds by the dominated convergence theorem since
 
∞    ∞
 (θ) 
 hj (y) · 1 hj (y) > α  ≤ hj (y) = 1,
 
j=1 j=1

and (4.3.12) holds since the limit (in θ) of


 
(θ)
aθ,j := hj (y) · 1 hj (y) > α

exists for every j = 1, 2, · · · by (4.3.10), hence implying that



∞ 

lim aθ,j = lim aθ,j .
θ→∞ θ→∞
j=1 j=1

Now the condition in (4.3.7) is equivalent to
Pr[ (Y ) = 1] := PY {y ∈ Y : (y) = 1} = 1; (4.3.13)
thus,

(θ)
lim PX,Y (x, y) ∈ X × Y : PX|Y (x|y) > α
θ→∞
,
= h1 (y) · 1 (1 > α) dPY (y)
Y
= E[h1 (Y )] = 1 − Pe , (4.3.14)
where (4.3.14) follows from (4.3.4). This immediately yields that for 0 < α < 1,

(θ)
Pe = 1 − lim PX,Y (x, y) ∈ X × Y : PX|Y (x|y) > α
θ→∞

(θ)
= lim PX,Y (x, y) ∈ X × Y : PX|Y (x|y) ≤ α .
θ→∞
2
The following two examples illustrate that the condition specified in (4.3.7)
holds for certain situations and hence the generalized Poor-Verdú bound can be
made arbitrarily close to the minimum probability of error Pe by adjusting θ and
α.
Example 4.7 (binary erasure channel) Suppose that X and Y are respec-
tively the channel input and channel output of a binary erasure channel (BEC)
with erasure probability ε, where X = {0, 1} and Y = {0, 1, E}, and
$$P_{Y|X}(y|x) = \begin{cases} 1-\varepsilon, & y = x \in \{0,1\}; \\ \varepsilon, & y = E. \end{cases}$$
Let PX (0) = 1 − p = 1 − PX (1) with 0 < p < 1/2. Then, the MAP estimate of
X from Y is given by
$$e(y) = \begin{cases} y & \text{if } y \in \{0,1\}, \\ 0 & \text{if } y = E, \end{cases}$$
and the resulting minimum probability of error is Pe = εp.
Calculating bound (4.3.1) of Lemma 4.5 yields
$$
(1-\alpha)\, P_{X,Y}\left\{ (x,y)\in\mathcal{X}\times\mathcal{Y} : P^{(\theta)}_{X|Y}(x|y) \le \alpha \right\}
= \begin{cases}
0, & 0 \le \alpha < \dfrac{p^\theta}{p^\theta + (1-p)^\theta}; \\[2ex]
\varepsilon p\,(1-\alpha), & \dfrac{p^\theta}{p^\theta + (1-p)^\theta} \le \alpha < \dfrac{(1-p)^\theta}{p^\theta + (1-p)^\theta}; \\[2ex]
\varepsilon\,(1-\alpha), & \dfrac{(1-p)^\theta}{p^\theta + (1-p)^\theta} \le \alpha < 1.
\end{cases}
\qquad (4.3.15)
$$
Thus, taking θ ↑ ∞ and then α ↓ 0 in (4.3.15) results in the exact error proba-
bility εp. Note that in this example, the original Poor-Verdú bound (i.e., with
θ = 1) also achieves the minimum probability of error εp by choosing α = 1 − p;
however this maximizing choice of α = 1 − p for the original bound is a function
of system’s statistics (here, the input distribution p) which may be undesirable.
On the other hand, the generalized bound (4.3.1) can herein achieve its peak by
systematically taking θ ↑ ∞ and then letting α ↓ 0.
Furthermore, since ℓ(y) = 1 for every y ∈ {0, 1, E}, we have that (4.3.7)
holds; hence, by Lemma 4.6, (4.3.8) yields that for 0 < α < 1,
$$P_e = \lim_{\theta\to\infty} P_{X,Y}\left\{ (x,y)\in\mathcal{X}\times\mathcal{Y} : P^{(\theta)}_{X|Y}(x|y) \le \alpha \right\} = \varepsilon p,$$
where the last equality follows directly from (4.3.15) without the (1 − α) factor.
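The piecewise expression (4.3.15) is simple enough to evaluate directly. The C sketch below (with the illustrative values p = 0.2, ε = 0.3, α = 10⁻³, none taken from the text) shows the bound reaching Pe = εp once θ is large enough at a fixed small α.

#include <stdio.h>
#include <math.h>

/* Bound (4.3.15) for the BEC example: P_X(0) = 1-p, erasure probability eps. */
static double bound(double p, double eps, double theta, double alpha)
{
    double lo = pow(p, theta) / (pow(p, theta) + pow(1 - p, theta));
    double hi = pow(1 - p, theta) / (pow(p, theta) + pow(1 - p, theta));
    if (alpha < lo) return 0.0;
    if (alpha < hi) return eps * p * (1 - alpha);
    return eps * (1 - alpha);
}

int main(void)
{
    const double p = 0.2, eps = 0.3;         /* illustrative values; Pe = eps*p */
    const double alpha = 1e-3;
    for (double theta = 1; theta <= 64; theta *= 4)
        printf("theta = %4.0f : bound = %.6f (Pe = %.6f)\n",
               theta, bound(p, eps, theta, alpha), eps * p);
    return 0;
}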

Example 4.8 (binary-input additive Gaussian noise channel) We herein


consider an example with a continuous observation alphabet Y = R, where R is
the set of real numbers. Specifically, let the observation be given by Y = X + N,
where X is uniformly distributed over X = {−1, +1} and N is a zero-mean
Gaussian random variable with variance σ 2 . Assuming that X and N are inde-
pendent of each other, then for x ∈ {−1, +1} and y ∈ R,
$$P_{X|Y}(x|y) = \frac{1}{1 + \exp\{-2xy/\sigma^2\}}, \qquad (4.3.16)$$
which directly yields a MAP estimate of X from Y given by
$$e(y) = \begin{cases} +1, & y > 0; \\ -1, & y < 0; \\ \text{arbitrary}, & y = 0, \end{cases}$$
with a resulting error probability of Pe = Φ(−1/σ), where
$$\Phi(z) := \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$$
is the cdf of the standard (zero-mean, unit-variance) Gaussian distribution.


Furthermore, since x ∈ {−1, +1}, we have
$$P^{(\theta)}_{X|Y}(x|y) = \frac{1}{1 + \exp\{-2xy/(\sigma^2/\theta)\}},$$
and the generalized Poor-Verdú bound (4.3.1) yields
$$
\begin{aligned}
P_e &\ge (1-\alpha)\, P_{X,Y}\left\{ (x,y)\in\mathcal{X}\times\mathcal{Y} : P^{(\theta)}_{X|Y}(x|y) \le \alpha \right\} \\
&= (1-\alpha) \int_{-\infty}^{-\frac{\sigma^2}{2\theta}\log\left(\frac{1}{\alpha}-1\right)-1} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{t^2}{2\sigma^2} \right\} dt \\
&= (1-\alpha)\,\Phi\!\left( -\frac{\sigma}{2\theta}\log\left(\frac{1}{\alpha}-1\right) - \frac{1}{\sigma} \right). \qquad (4.3.17)
\end{aligned}
$$

Now taking the limits θ ↑ ∞ followed by α ↓ 0 for the right-hand side


term in (4.3.17) yields exactly Φ (−1/σ) = Pe ; hence the generalized Poor-Verdú
bound (4.3.1) is asymptotically tight as a consequence of the validity of condition
(4.3.7).
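Expression (4.3.17) can likewise be evaluated numerically. The C sketch below uses Φ(z) = erfc(−z/√2)/2 and the illustrative values σ = 1 and α = 10⁻³ (arbitrary choices), showing the bound climbing toward Pe = Φ(−1/σ) ≈ 0.1587 as θ increases.

#include <stdio.h>
#include <math.h>

static double Phi(double z) { return 0.5 * erfc(-z / sqrt(2.0)); }   /* standard normal CDF */

int main(void)
{
    const double sigma = 1.0, alpha = 1e-3;
    const double pe = Phi(-1.0 / sigma);     /* exact MAP error probability */

    for (double theta = 1; theta <= 1000; theta *= 10) {
        double arg = -sigma / (2.0 * theta) * log(1.0 / alpha - 1.0) - 1.0 / sigma;
        double bnd = (1.0 - alpha) * Phi(arg);                        /* bound (4.3.17) */
        printf("theta = %6.0f : bound = %.6f (Pe = %.6f)\n", theta, bnd, pe);
    }
    return 0;
}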

We close this section by applying the Poor-Verdú bound to the decoding of


an (n, M) block code used over the channel W n .

Corollary 4.9 Every ∼Cn = (n, M) code satisfies
$$P_e({\sim}C_n) \ge \left(1 - e^{-n\gamma}\right) \Pr\left[ \frac{1}{n} i_{X^n W^n}(X^n; Y^n) \le \frac{1}{n}\log M - \gamma \right]$$
for every γ > 0, where X n places probability mass 1/M on each codeword, and
Pe (∼Cn ) denotes the error probability of the code.

Proof: Taking α = e−nγ and θ = 1 in Lemma 4.5, and replacing X and Y in


Lemma 4.5 by its n-fold counterparts, i.e., X n and Y n , we obtain
 
Pe (∼Cn ) ≥ 1 − e−nγ PX n W n (xn , y n) ∈ X n × Y n : PX n |Y n (xn |y n) ≤ e−nγ
 
−nγ n n n n PX n |Y n (xn |y n) e−nγ
= 1−e PX n W n (x , y ) ∈ X × Y : ≤
1/M 1/M
 n n 
−nγ n n n n PX n |Y n (x |y ) e−nγ
= 1−e PX n W n (x , y ) ∈ X × Y : ≤
PX n (xn ) 1/M
−nγ n n n n
= 1−e PX n W n [(x , y ) ∈ X × Y :

1 PX n |Y n (xn |y n ) 1
log ≤ log M − γ
n PX n (xn ) n
 
−nγ 1 n n 1
= 1−e Pr iX n W n (X ; Y ) ≤ log M − γ .
n n
2

4.4 Capacity formulas for general channels

In this section, the general formulas for several operational notions of channel
capacity are derived. The units of these capacity quantities are in nats/channel
use (assuming that the natural logarithm is used).

Definition 4.10 (ε-achievable rate) Fix ε ∈ [0, 1]. R ≥ 0 is an ε-achievable


rate if there exists a sequence of ∼Cn = (n, Mn ) channel block codes such that
1
lim inf log Mn ≥ R
n→∞ n
and
lim sup Pe (∼Cn ) ≤ ε.
n→∞

Definition 4.11 (ε-capacity Cε ) Fix ε ∈ [0, 1]. The supremum of ε-achievable


rates is called the ε-capacity, Cε .

It is straightforward from the definition that Cε is non-decreasing in ε, and


C1 = log |X |.

Observation 4.12 (capacity C) Note that the channel capacity C is equal to the
supremum of the rates that are ε-achievable for all ε ∈ [0, 1]:5
$$C = \inf_{0\le\varepsilon\le 1} C_\varepsilon = \lim_{\varepsilon\downarrow 0} C_\varepsilon = C_0.$$

Definition 4.13 (strong capacity CSC ) Define the strong converse capacity
(or strong capacity) CSC as the infimum of the rates R such that for all ∼Cn =
(n, Mn ) channel block codes with
1
lim inf log Mn ≥ R,
n→∞ n
we have
lim inf Pe (∼Cn ) = 1.
n→∞

5
The claim of C0 = limε↓0 Cε can be proved by contradiction as follows.
Suppose C0 + 2γ < limε↓0 Cε for some γ > 0. For any positive integer j, and by definition
of C1/j , there exist Nj and a sequence of block codes ∼Cn = (n, Mn ) such that for n > Nj ,
(1/n) log Mn > C1/j − γ ≥ limε↓0 Cε − γ > C0 + γ and Pe ( ∼Cn ) < 2/j. Construct a sequence
of block codes ∼ Cn = (n, M̃n ) as: ∼
Cn = ∼Cn , if max1≤i≤j−1 Ni ≤ n < max1≤i≤j Ni . Then
lim supn→∞ (1/n) log M̃n ≥ C0 + γ and lim inf n→∞ Pe ( ∼ Cn ) = 0, which contradicts to the
definition of C0 .

Based on these definitions, general formulas for the above capacity notions
are established as follows.

Theorem 4.14 (ε-capacity) For 0 < ε < 1, the ε-capacity Cε for arbitrary
channels satisfies
$$C_\varepsilon = \sup_{X} \underline{I}_\varepsilon(X; Y).$$

Proof:
1. Cε ≥ supX I ε (X; Y ).
¯
Fix input X. It suffices to show the existence of ∼Cn = (n, Mn ) data
transmission code with rate
1 γ
I ε (X; Y ) − γ < log Mn < I ε (X; Y ) −
¯ n ¯ 2
and probability of decoding error satisfying
lim sup Pe (∼Cn ) ≤ ε
n→∞

for every γ > 0. (Because if such code exists, then lim inf n→∞ (1/n) log Mn ≥
I ε (X; Y ) − γ, which implies Cε ≥ I ε (X; Y ) − γ for arbitrarily small γ.)
¯ ¯
From Lemma 4.4, there exists an ∼Cn = (n, Mn ) code whose error probabi-
lity satisfies
 
1 n n 1 γ
Pe (∼Cn ) < Pr iX n W n (X ; Y ) < log Mn + + e−nγ/4
n n 4
  
1 n n γ γ
≤ Pr iX n W n (X ; Y ) < I ε (X; Y ) − + + e−nγ/4
n ¯ 2 4
 
1 n n γ
≤ Pr iX n W n (X ; Y ) < I ε (X; Y ) − + e−nγ/4 .
n ¯ 4
Since
   
1 n n
I ε (X; Y ) := sup R : lim sup Pr iW n W n (X ; Y ) ≤ R ≤ ε ,
¯ n→∞ n
we obtain
 
1 n n γ
lim sup Pr iX n W n (X ; Y ) < lim I ε (X; Y ) − ≤ ε.
n→∞ n δ↑ε ¯ 4
Hence, the proof of the direct part is completed by noting that
 
1 n n γ
lim sup Pe (∼Cn ) ≤ lim sup Pr iX n W n (X ; Y ) < I ε (X; Y ) −
n→∞ n→∞ n ¯ 4
+ lim sup e−nγ/4 .
n→∞
≤ ε.

2. Cε ≤ supX I ε (X; Y ).
¯
Suppose that there exists a sequence of ∼Cn = (n, Mn ) codes with rate
strictly larger than supX I ε (X; Y ) and lim supn→∞ Pe (∼Cn ) ≤ ε. Let the
¯
ultimate code rate for this code be supX I ε (X; Y ) + 3ρ for some ρ > 0.
¯
Then for sufficiently large n,
1
log Mn > sup I ε (X; Y ) + 2ρ.
n X ¯

Since the above inequality holds for every X, it certainly holds if taking
input X̂ n which places probability mass 1/Mn on each codeword, i.e.,
1
log Mn > I ε (X̂; Ŷ ) + 2ρ, (4.4.1)
n ¯
where Ŷ is the channel output due to channel input X̂. Then from Corol-
lary 4.9, the error probability of the code satisfies
 
−nρ 1 n n 1
Pe (∼Cn ) ≥ 1 − e Pr iX̂ n W n (X̂ ; Ŷ ) ≤ log Mn − ρ
n n
 
−nρ 1 n n
≥ 1−e Pr iX̂ n W n (X̂ ; Ŷ ) ≤ I ε (X̂; Ŷ ) + ρ ,
n ¯

where the last inequality follows from (4.4.1). Taking the limsup of both
sides, we have

ε ≥ lim sup Pe (∼Cn )


n→∞
 
1 n n
≥ lim sup Pr iX̂ n W n (X̂ ; Ŷ ) ≤ I ε (X̂; Ŷ ) + ρ > ε,
n→∞ n ¯
and a desired contradiction is obtained. 2

Theorem 4.15 (general channel capacity) The channel capacity C for an ar-
bitrary channel satisfies
$$C = \sup_{X} \underline{I}(X; Y).$$

Proof: Observe that

C = C0 = lim Cε = lim sup I ε (X; Y ).


ε↓0 ε↓0 X ¯

Hence, from Theorem 4.14, we note that

C = lim sup I ε (X; Y ) ≥ lim sup I (X; Y ) = sup I (X; Y ).


ε↓0 X ¯ ε↓0 X ¯ X ¯

It remains to show that C ≤ supX I (X; Y ).
¯
Suppose that there exists a sequence of ∼Cn = (n, Mn ) codes with rate strictly
larger than supX I (X; Y ) and error probability tends to 0 as n → ∞. Let the
¯
ultimate code rate for this code be supX I (X; Y ) + 3ρ for some ρ > 0. Then for
¯
sufficiently large n,
1
log Mn > sup I (X; Y ) + 2ρ.
n X ¯
Since the above inequality holds for every X, it certainly holds if taking input
X̂ n which places probability mass 1/Mn on each codeword, i.e.,
1
log Mn > I (X̂; Ŷ ) + 2ρ, (4.4.2)
n ¯
where Ŷ is the channel output due to channel input X̂. Then from Corollary
4.9, the error probability of the code satisfies
 
−nρ 1 n n 1
Pe (∼Cn ) ≥ 1 − e Pr iX̂ n W n (X̂ ; Ŷ ) ≤ log Mn − ρ
n n
 
−nρ 1 n n
≥ 1−e Pr iX̂ n W n (X̂ ; Ŷ ) ≤ I (X̂; Ŷ ) + ρ , (4.4.3)
n ¯
where the last inequality follows from (4.4.2). Since, by assumption, Pe (∼Cn )
vanishes as n → ∞ but (4.4.3) cannot vanish by definition of I (X̂; Ŷ ), we
¯
obtain the desired contradiction. 2
We close this section by providing the general formula for strong capacity for
which the proof follows similar steps as the previous two theorems and hence we
omit it. Note that in the general formula for strong capacity, the sup-information
rate is used as opposed to the inf-information rate formula for channel capacity.

Theorem 4.16 (general strong capacity)
$$C_{SC} = \sup_{X} \bar I(X; Y).$$

4.5 Examples for the general capacity formulas

With the general capacity formulas shown in the previous section, we can now
compute them for some non-stationary or non-ergodic channels, and analyze
their properties.

Example 4.17 (capacity) Let the input and output alphabets be {0, 1}, and
let every output Yi be given by:
Yi = Xi ⊕ Ni ,

where “⊕” represents modulo-2 addition operation. Assume the input process
X and the noise process N are independent.
A general relation between the inf-information rate and the inf/sup-entropy rates
can be derived from (1.4.2) and (1.4.4) as follows:
$$\underline{H}(Y) - \bar H(Y|X) \le \underline{I}(X; Y) \le \bar H(Y) - \bar H(Y|X).$$
Since N n is completely determined from Y n under the knowledge of X n ,
$$\bar H(Y|X) = \bar H(N).$$
Indeed, this channel is symmetric [2]; thus a uniform input yields a uniform
output (i.e., a Bernoulli process with parameter 1/2), and H̲(Y) = H̄(Y) =
log(2) nats. We thus have
$$C = \log(2) - \bar H(N).$$
We next compute the channel capacity for the following two noise cases.

Case A) If N is a non-stationary binary independent sequence with
$$\Pr\{N_i = 1\} = p_i,$$
then by the uniform boundedness (in i) of the variance of the random variable
− log PNi (Ni ), namely,
$$
\mathrm{Var}[-\log P_{N_i}(N_i)] \le E[(\log P_{N_i}(N_i))^2]
\le \sup_{0<p_i<1}\left[ p_i(\log p_i)^2 + (1-p_i)(\log(1-p_i))^2 \right] \le \log(2),
$$
we have by Chebyshev’s inequality that, as n → ∞,
$$\Pr\left\{ \left| -\frac{1}{n}\log P_{N^n}(N^n) - \frac{1}{n}\sum_{i=1}^{n} H(N_i) \right| > \gamma \right\} \to 0,$$
for any γ > 0. Therefore,
$$\bar H(N) = \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} h_b(p_i),$$
where hb (p) := −p log(p) − (1 − p) log(1 − p) is the binary entropy function.
Consequently,
$$C = \log(2) - \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} h_b(p_i).$$
This result is illustrated in Figure 4.1.

Figure 4.1: The ultimate CDFs of −(1/n) log PN n (N n ); their cluster
points lie between H̲(N) and H̄(N).
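For a concrete instance of Case A, suppose the crossover probabilities alternate between 0.1 and 0.4 (values chosen purely for illustration, not from the text). Then lim sup (1/n) Σ hb (pi ) is simply the average of the two binary entropies, and the capacity follows at once; the C sketch below computes it in bits.

#include <stdio.h>
#include <math.h>

static double hb(double p) { return -p * log2(p) - (1 - p) * log2(1 - p); }

int main(void)
{
    /* Case A with a period-2 crossover pattern alternating between 0.1 and 0.4. */
    const double p[2] = {0.1, 0.4};
    const int n = 1000000;                    /* large n approximates the lim sup */
    double avg = 0.0;

    for (int i = 1; i <= n; i++) avg += hb(p[i % 2]);
    avg /= n;

    printf("(1/n) sum h_b(p_i) = %.4f bits\n", avg);
    printf("C = 1 - %.4f = %.4f bits/channel use\n", avg, 1.0 - avg);
    return 0;
}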

Case B) If N has the same distribution as the source process in Example 3.25,
then H̄(N ) = log(2) nats, which yields a zero channel capacity.

Example 4.18 (strong capacity) Consider the same additive noise chan-
nel as in Example 4.17. Under a uniform input (in this case,
PX n (xn ) = PY n (y n ) = 2−n ), we have
$$
\begin{aligned}
\Pr\left[ \frac{1}{n}\log\frac{P_{Y^n|X^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \le \theta \right]
&= \Pr\left[ \frac{1}{n}\log P_{N^n}(N^n) - \frac{1}{n}\log P_{Y^n}(Y^n) \le \theta \right] \\
&= \Pr\left[ \frac{1}{n}\log P_{N^n}(N^n) \le \theta - \log(2) \right] \\
&= \Pr\left[ -\frac{1}{n}\log P_{N^n}(N^n) \ge \log(2) - \theta \right] \\
&= 1 - \Pr\left[ -\frac{1}{n}\log P_{N^n}(N^n) < \log(2) - \theta \right]. \qquad (4.5.1)
\end{aligned}
$$
We again consider the two cases in Example 4.17.

Case A) From (4.5.1), we directly have that
$$C_{SC} = \log(2) - \liminf_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} h_b(p_i).$$

Case B) From (4.5.1) and also from Figure 4.2, we obtain that
CSC = log(2).

In Figure 4.2, we plot the ultimate CDF of the channel’s normalized infor-
mation density for Case B. Recall that this limiting CDF is the spectrum of the
normalized information density.

[Figure 4.2: The ultimate CDF of the normalized information density for Example 4.18, Case B); it increases from 0 to 1 over the interval $[0, \log 2]$.]

Figure 4.2 indicates that the channel capacity is 0 and that the strong capacity is $\log 2$; the operational meanings of these two extreme values are thus determined. One may then naturally ask: what is the operational meaning of the function values between 0 and $\log 2$? The answer comes from Theorem 4.14, namely the ε-capacity. Indeed, in practice it may not be easy to design a block code that transmits information with (asymptotically) no error over a very noisy channel at a rate equal to the channel capacity. However, if we allow some errors during transmission, say an error probability bounded above by 0.001, we have a better chance of constructing a practical block code.

Example 4.19 (ε-capacity) Consider the channel in Case B of Example 4.17. Let the spectrum of the normalized information density be $i(\theta)$. Then the ε-capacity of this channel is given by the inverse of $i(\cdot)$, i.e.,
\[
C_\varepsilon = i^{-1}(\varepsilon).
\]
Note that the capacity can be written as
\[
C = \lim_{\varepsilon\downarrow 0} C_\varepsilon.
\]

In general, the strong capacity satisfies
\[
C_{SC} \ge \lim_{\varepsilon\uparrow 1} C_\varepsilon,
\]
since the strong capacity dictates the stronger condition $\liminf_{n\to\infty} P_e(\mathcal{C}_n) = 1$ (for codes with rates above $C_{SC}$), as opposed to the condition $\limsup_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon$ required for the ε-capacity. The above inequality, however, holds with equality in this example.
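The following sketch illustrates the relation $C_\varepsilon = i^{-1}(\varepsilon)$ numerically. The linear spectrum used here is only an assumed stand-in sharing the qualitative features of Figure 4.2 ($i(0) = 0$ and $i(\log 2) = 1$); it is not the exact spectrum of Case B.

# Minimal sketch: eps-capacity as the inverse of an assumed limiting spectrum i(theta).
import numpy as np

def i_spectrum(theta):
    return np.clip(theta / np.log(2), 0.0, 1.0)      # assumed continuous limiting CDF

def C_eps(eps, grid=np.linspace(0, np.log(2), 10001)):
    """eps-capacity = sup{theta : i(theta) <= eps}, i.e. i^{-1}(eps) here."""
    return grid[i_spectrum(grid) <= eps].max()

for eps in (0.001, 0.1, 0.5, 0.999):
    print(f"eps = {eps:5.3f}  ->  C_eps = {C_eps(eps):.4f} nats")
print("C  = lim_{eps->0} C_eps  ~", C_eps(1e-6))      # ~ 0, the capacity
print("sup_eps C_eps            ~", C_eps(1 - 1e-6))  # ~ log 2, the strong capacity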

4.6 Capacity and resolvability for channels

The channel capacity of a discrete memoryless channel is known to be
\[
C = \max_{P_X} I(P_X, Q_{Y|X}).
\]
Let $P_{\bar{X}}$ denote the optimizer of the above maximization. Then
\[
C = \max_{P_X} I(P_X, Q_{Y|X}) = I(P_{\bar{X}}, Q_{Y|X}).
\]
Here, the performance of the code is measured by the average error probability, namely
\[
P_e(\mathcal{C}_n) = \frac{1}{M}\sum_{i=1}^{M} P_e(\mathcal{C}_n\,|\,x_i^n),
\]
if the codebook is $\mathcal{C}_n := \{x_1^n, x_2^n, \ldots, x_M^n\}$. By the random coding argument, a deterministic good code with arbitrarily small error probability and with rate less than the channel capacity must exist. One can then ask: what is the relationship between a good code and the optimizer $P_{\bar{X}}$? It is widely believed that if the code is good (with rate close to capacity and low error probability), then the output statistics $P_{Y^n}$ due to the equally-likely code must approximate the output distribution $P_{\bar{Y}^n}$ due to the input distribution achieving the channel capacity.
This fact is actually reflected in the next theorem.

Theorem 4.20 ([26]) For any channel $W^n = (Y^n|X^n)$ with finite input alphabet and capacity $C$ that satisfies the strong converse (i.e., $C = C_{SC}$), the following statement holds.
Fix $\gamma > 0$ and a sequence of block codes $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^{\infty}$ with
\[
\frac{1}{n}\log M_n \ge C - \gamma/2
\]
and vanishing error probability (i.e., the error probability approaches zero as the blocklength $n$ tends to infinity). Then
\[
\frac{1}{n}\|Y^n - \bar{Y}^n\| \le \gamma \quad\text{for all sufficiently large } n,
\]
where $Y^n$ is the output due to the block code and $\bar{Y}^n$ is the output due to the input $\bar{X}^n$ that satisfies
\[
I(\bar{X}^n; \bar{Y}^n) = \max_{X^n} I(X^n; Y^n).
\]
To be specific,
\[
P_{Y^n}(y^n) = \sum_{x^n \in \mathcal{C}_n} P_{X^n}(x^n) P_{W^n}(y^n|x^n) = \frac{1}{M_n}\sum_{x^n \in \mathcal{C}_n} P_{W^n}(y^n|x^n)
\]
and
\[
P_{\bar{Y}^n}(y^n) = \sum_{x^n \in \mathcal{X}^n} P_{\bar{X}^n}(x^n) P_{W^n}(y^n|x^n).
\]

Note that the above theorem holds for arbitrary channels, not restricted to
only discrete memoryless channels.
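As a small numerical illustration of the quantity bounded in Theorem 4.20, the sketch below (a hypothetical two-codeword code over a BSC(0.1), not an example from the text) computes the variational distance, taken here as the sum of absolute differences, between the output distribution induced by an equally-likely codebook and the one induced by the capacity-achieving uniform input.

# Minimal sketch comparing code-induced and capacity-achieving output statistics (n = 3).
import itertools
import numpy as np

n, p = 3, 0.1
inputs = list(itertools.product([0, 1], repeat=n))

def W(y, x):                                   # memoryless BSC(p) transition probability
    d = sum(a != b for a, b in zip(x, y))
    return (p ** d) * ((1 - p) ** (n - d))

def output_dist(input_dist):                   # P_{Y^n}(y) = sum_x P_{X^n}(x) W(y|x)
    return {y: sum(input_dist[x] * W(y, x) for x in inputs) for y in inputs}

code = [(0, 0, 0), (1, 1, 1)]                  # hypothetical (n = 3, M = 2) codebook
P_code = {x: (1 / len(code) if x in code else 0.0) for x in inputs}
P_bar  = {x: 2.0 ** (-n) for x in inputs}      # capacity-achieving input for the BSC

PY_code, PY_bar = output_dist(P_code), output_dist(P_bar)
vd = sum(abs(PY_code[y] - PY_bar[y]) for y in inputs)
print("variational distance ||Y^n - Ybar^n|| =", vd)
print("normalized distance (1/n)||.||        =", vd / n)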
One may wonder whether a result in the spirit of the above theorem can be proved for the input statistics rather than the output statistics. The answer is negative. Hence, the statement that the input statistics of any good code must approximate those that maximize the mutual information is erroneously taken for granted. (However, we do not rule out the possibility of the existence of good codes that approximate those that maximize the mutual information.) To see this, simply consider the normalized entropy of $\bar{X}^n$ versus that of $\hat{X}^n$ (which is uniformly distributed over the codewords) for discrete memoryless channels:
\[
\frac{1}{n}H(\bar{X}^n) - \frac{1}{n}H(\hat{X}^n)
= \frac{1}{n}H(\bar{X}^n|\bar{Y}^n) + \frac{1}{n}I(\bar{X}^n;\bar{Y}^n) - \frac{1}{n}\log M_n
= H(\bar{X}|\bar{Y}) + I(\bar{X};\bar{Y}) - \frac{1}{n}\log M_n
= H(\bar{X}|\bar{Y}) + C - \frac{1}{n}\log M_n.
\]
A good code with vanishing error probability exists for $(1/n)\log M_n$ arbitrarily close to $C$; hence, we can find a good code sequence satisfying
\[
\lim_{n\to\infty}\left[\frac{1}{n}H(\bar{X}^n) - \frac{1}{n}H(\hat{X}^n)\right] = H(\bar{X}|\bar{Y}).
\]
The term $H(\bar{X}|\bar{Y})$ is in general positive; a quick example is the BSC with crossover probability $p$, for which $\bar{X}$ and $\bar{Y}$ are both uniform and
\[
H(\bar{X}|\bar{Y}) = H(\bar{X}) - I(\bar{X};\bar{Y}) = H(\bar{X}) - H(\bar{Y}) + H(\bar{Y}|\bar{X}) = H(\bar{Y}|\bar{X}) = -p\log(p) - (1-p)\log(1-p).
\]
Consequently,
the two input distributions do not necessarily resemble each other.

[Figure 4.3: The communication system: the true source $(\ldots, X_3, X_2, X_1)$ is fed into the true channel $P_{Y^n|X^n}$, producing the true output $(\ldots, Y_3, Y_2, Y_1)$.]

[Figure 4.4: The simulated communication system: a computer-generated source $(\ldots, \tilde{X}_3, \tilde{X}_2, \tilde{X}_1)$ is fed into the true channel $P_{Y^n|X^n}$, producing the corresponding output $(\ldots, \tilde{Y}_3, \tilde{Y}_2, \tilde{Y}_1)$.]
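A minimal numerical check (in nats, with an arbitrarily chosen crossover probability) of the entropy gap just derived: for a BSC with uniform input, $H(\bar{X}|\bar{Y}) = H(\bar{Y}|\bar{X}) = h_b(p) > 0$.

import numpy as np

def hb(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

p = 0.11                                    # an arbitrary crossover probability
# joint distribution of (X, Y) for a uniform input over {0,1} and a BSC(p)
joint = np.array([[0.5 * (1 - p), 0.5 * p],
                  [0.5 * p, 0.5 * (1 - p)]])
PY = joint.sum(axis=0)
H_X_given_Y = -np.sum(joint * np.log(joint / PY))   # H(X|Y) = -sum_{x,y} P(x,y) log P(x|y)
print("H(X|Y)  =", H_X_given_Y)
print("h_b(p)  =", hb(p))                   # the two agree, and both are strictly positive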


The previous discussion motivates the search for an input distribution that is equally distributed over a subset of the input alphabet and that generates output statistics close to the output due to the input maximizing the mutual information. Since such approximations are usually performed by computers, it is natural to connect the approximation of input and output statistics with the concept of resolvability.
In a data transmission system as shown in Figure 4.3, suppose that the source,
channel and output are respectively denoted by

X n := (X1 , . . . , Xn ),

W n := (W1 , . . . , Wn ),
and
Y n := (Y1 , . . . , Yn ),
where $W_i$ has distribution $P_{Y_i|X_i}$. In order to simulate the behavior of the channel, a computer-generated input may be necessary, as shown in Figure 4.4. As stated in Chapter 3, such a computer-generated input is based on an algorithm formed from a few basic uniform random experiments, and hence has finite resolution. Our goal is to find a good computer-generated input $\tilde{X}^n$ such that the corresponding output $\tilde{Y}^n$ is very close to the true output $Y^n$.

Definition 4.21 (ε-resolvability for input X and channel W) Fix $\varepsilon > 0$, and suppose that the (true) input random variable and (true) channel statistics are $X$ and $W = (Y|X)$, respectively. Then the ε-resolvability $S_\varepsilon(X, W)$ for input $X$ and channel $W$ is defined by
\[
S_\varepsilon(X, W) := \min\Big\{ R : (\forall\,\gamma>0)(\exists\, \tilde{X} \text{ and } N)(\forall\, n > N)\ \ \frac{1}{n}R(\tilde{X}^n) < R + \gamma \ \text{ and } \ \|Y^n - \tilde{Y}^n\| < \varepsilon \Big\},
\]
where $P_{\tilde{Y}^n} = P_{\tilde{X}^n} P_{W^n}$. (The definitions of the resolution $R(\cdot)$ and the variational distance $\|\cdot - \cdot\|$ are given in Definitions 3.4 and 3.5.)

Note that if we take the channel W n to be an identity channel for all n,


namely X n = Y n and PY n |X n (y n |xn ) is either 1 or 0, then the ε-resolvability for
input X and channel W is reduced to source ε-resolvability for X:

Sε (X, W Identity ) = Sε (X).

Similar reductions can be applied to all the following definitions.
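Before stating the remaining definitions, here is a minimal sketch (hypothetical numbers, not from the text) of the quantity appearing in Definition 4.21: a finite-resolution input $\tilde{X}^n$, uniform over an $M$-type approximation of the true input, is passed through the channel and the resulting output is compared with the true output in variational distance (sum of absolute differences).

# Minimal sketch: output approximation by a finite-resolution (M-type) input, n = 2, BSC(0.1).
import itertools
import numpy as np

n, p = 2, 0.1
alphabet = list(itertools.product([0, 1], repeat=n))

def W(y, x):                                          # BSC(p) used n times
    d = sum(a != b for a, b in zip(x, y))
    return (p ** d) * ((1 - p) ** (n - d))

def push(PX):                                         # output distribution P_X P_W
    return {y: sum(PX[x] * W(y, x) for x in alphabet) for y in alphabet}

PX_true = {x: (0.3 ** sum(x)) * (0.7 ** (n - sum(x))) for x in alphabet}   # i.i.d. Bern(0.3)
M = 10                                                # resolution of roughly log M random bits
counts = {x: round(PX_true[x] * M) for x in alphabet} # a simple M-type quantization
PX_tilde = {x: counts[x] / sum(counts.values()) for x in alphabet}

vd = sum(abs(push(PX_true)[y] - push(PX_tilde)[y]) for y in alphabet)
print("||Y^n - Ytilde^n|| =", vd)                     # small distance <=> good approximation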

Definition 4.22 (ε-mean-resolvability for input X and channel W) Fix $\varepsilon > 0$, and suppose that the (true) input random variable and (true) channel statistics are respectively $X$ and $W$. Then the ε-mean-resolvability $\bar{S}_\varepsilon(X, W)$ for input $X$ and channel $W$ is defined by
\[
\bar{S}_\varepsilon(X, W) := \min\Big\{ R : (\forall\,\gamma>0)(\exists\, \tilde{X} \text{ and } N)(\forall\, n > N)\ \ \frac{1}{n}H(\tilde{X}^n) < R + \gamma \ \text{ and } \ \|Y^n - \tilde{Y}^n\| < \varepsilon \Big\},
\]
where $P_{Y^n} = P_{X^n} P_{W^n}$ and $P_{\tilde{Y}^n} = P_{\tilde{X}^n} P_{W^n}$.

Definition 4.23 (resolvability and mean-resolvability for input X and channel W) The resolvability and mean-resolvability for input $X$ and channel $W$ are defined respectively as
\[
S(X, W) := \sup_{\varepsilon>0} S_\varepsilon(X, W) \quad\text{and}\quad \bar{S}(X, W) := \sup_{\varepsilon>0} \bar{S}_\varepsilon(X, W).
\]

Definition 4.24 (resolvability and mean-resolvability for channel W) The resolvability and mean-resolvability for channel $W$ are defined respectively as
\[
S(W) := \sup_{X} S(X, W) \quad\text{and}\quad \bar{S}(W) := \sup_{X} \bar{S}(X, W).
\]

As an extension of Chapter 3, the above definitions lead to the following theorem.

Theorem 4.25 ([26])
\[
S(W) = C_{SC} = \sup_{X}\,\overline{I}(X;Y)
\]
and
\[
\bar{S}(W) = C = \sup_{X}\,\underline{I}(X;Y).
\]

It is thus a reasonable inference that if no computer algorithm can produce the desired output statistics using the specified number of random nats, then all codes of that rate must be bad codes.

Chapter 5

Optimistic Shannon Coding Theorems


for Arbitrary Single-User Systems

As seen in Chapters 2 and 4, the conventional definitions of the source coding


rate and channel capacity require the existence of reliable codes for all suffi-
ciently large blocklengths. Alternatively, if it is required that good codes exist
for infinitely many blocklengths, then optimistic definitions of source coding rate
and channel capacity are obtained.
In this chapter, formulas for the optimistic minimum achievable fixed-length
source coding rate and the minimum ε-achievable source coding rate for arbi-
trary finite-alphabet sources are established. The expressions for the optimistic
capacity and the optimistic ε-capacity of arbitrary single-user channels are also
provided. The expressions of the optimistic source coding rate and capacity are
examined for the class of information stable sources and channels, respectively.
Finally, examples for the computation of optimistic capacity are presented.

5.1 Motivations

The conventional definition of the minimum achievable fixed-length source cod-


ing rate T (X) (or T0 (X)) for a source X (cf. Definition 2.2) requires the exis-
tence of reliable source codes for all sufficiently large blocklengths. Alternatively,
if it is required that reliable codes exist for infinitely many blocklengths, a new,
more optimistic definition of source coding rate (denoted by $\bar{T}(X)$) is obtained
[41]. Similarly, the optimistic capacity C̄ is defined by requiring the existence of
reliable channel codes for infinitely many blocklengths, as opposed to the defini-
tion of the conventional channel capacity C (see Definition 4.3 or [42, Definition
1]).
The concepts of optimistic source coding rate and capacity were investigated by Verdú et al. for arbitrary (not necessarily stationary, ergodic, information stable, etc.) sources and single-user channels in [41, 42]. More specifically, they establish an additional operational characterization of the optimistic minimum achievable source coding rate ($\bar{T}(X)$ for source $X$) by demonstrating that, for a given source, the classical statement of the source-channel separation theorem¹ holds for every channel if $\bar{T}(X) = T(X)$ [41]. In a dual fashion, they also show that for channels with $\bar{C} = C$, the classical separation theorem holds for every source. They also conjecture that $\bar{T}(X)$ and $\bar{C}$ do not admit a simple expression.
In this chapter, we demonstrate that $\bar{T}(X)$ and $\bar{C}$ do indeed have a general
formula. The key to these results is the application of the generalized sup-
information rate introduced in Chapter 1 to the existing proofs by Verdú and
Han [42] of the direct and converse parts of the conventional coding theorems.
We also provide a general expression for the optimistic minimum ε-achievable
source coding rate and the optimistic ε-capacity.

5.2 Optimistic source coding theorems

In this section, we provide the optimistic source coding theorems. They are
shown based on two new bounds due to Han [25] on the error probability of a
source code as a function of its size. Interestingly, these bounds constitute the
natural counterparts of the upper bound provided by Feinstein’s Lemma (see
Lemma 4.4) and the Verdú-Han lower bound [42] to the error probability of a
channel code. Furthermore, we show that for information stable sources, the
formula for $\bar{T}(X)$ reduces to
\[
\bar{T}(X) = \liminf_{n\to\infty}\frac{1}{n}H(X^n).
\]
This is in contrast to the expression for $T(X)$, which is known to be
\[
T(X) = \limsup_{n\to\infty}\frac{1}{n}H(X^n).
\]

The above result leads us to observe that for sources that are both stationary and
information stable, the classical separation theorem is valid for every channel.
In [41], Vembu et al. characterize the sources for which the classical separation
theorem holds for every channel. They demonstrate that for a given source X,
¹By the “classical statement of the source-channel separation theorem,” we mean the following. Given a source X with (conventional) source coding rate T(X) and a channel W with capacity C, X can be reliably transmitted over W if T(X) < C. Conversely, if T(X) > C, then X cannot be reliably transmitted over W. By reliable transmissibility of the source over the channel, we mean that there exists a sequence of joint source-channel codes such that the decoding error probability vanishes as the blocklength n → ∞ (cf. [41]).

the separation theorem holds for every channel if its optimistic minimum achievable source coding rate $\bar{T}(X)$ coincides with its conventional (or pessimistic) minimum achievable source coding rate $T(X)$; i.e., if $\bar{T}(X) = T(X)$.
We herein establish a general formula for $\bar{T}(X)$. We prove that for any source $X$,
\[
\bar{T}(X) = \lim_{\delta\uparrow 1} H_\delta(X) = H_{1^-}(X).
\]

We also provide the general expression for the optimistic minimum ε-achievable
source coding rate. We show these results based on two new bounds due to Han
(one upper bound and one lower bound) on the error probability of a source code
[25, Chapter 1]. The upper bound (i.e., Lemma 2.3) consists of the counterpart
of Feinstein’s Lemma for channel codes, while the lower bound (i.e., Lemma 2.4)
consists of the counterpart of the Verdú-Han lower bound on the error probability
of a channel code ([42, Theorem 4]). As in the case of the channel coding bounds,
both source coding bounds (Lemmas 2.3 and 2.4) hold for arbitrary sources and
for arbitrary fixed blocklength.

Definition 5.1 An (n, M) fixed-length source code for X n is a collection of M


n-tuples ∼Cn = {cn1 , . . . , cnM }. The error probability of the code is
\[
P_e(\mathcal{C}_n) := \Pr\left[X^n \notin \mathcal{C}_n\right].
\]

Definition 5.2 (optimistic ε-achievable source coding rate) Fix 0 < ε <
1. R ≥ 0 is an optimistic ε-achievable rate if, for every γ > 0, there exists a
sequence of (n, Mn ) fixed-length source codes ∼Cn such that
\[
\limsup_{n\to\infty}\frac{1}{n}\log M_n \le R
\quad\text{and}\quad
\liminf_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon.
\]
The infimum of all optimistic ε-achievable source coding rates for source $X$ is denoted by $\bar{T}_\varepsilon(X)$. Also define $\bar{T}(X) := \sup_{0<\varepsilon<1}\bar{T}_\varepsilon(X) = \lim_{\varepsilon\downarrow 0}\bar{T}_\varepsilon(X) = \bar{T}_0(X)$ as the optimistic source coding rate.

We can then use Lemmas 2.3 and 2.4 (in a similar fashion to the general
source coding theorem in Theorem 2.5) to prove the general optimistic (fixed-
length) source coding theorems.

Theorem 5.3 (optimistic minimum ε-achievable source coding rate formula) For any source $X$,
\[
\bar{T}_\varepsilon(X) = \begin{cases} \displaystyle\lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X), & \text{for } \varepsilon \in [0,1); \\[1ex] 0, & \text{for } \varepsilon = 1. \end{cases}
\]

Proof: The case of ε = 1 follows directly from its definition; hence, the proof
only focuses on the case of ε ∈ [0, 1).

1. Forward part (achievability): $\bar{T}_\varepsilon(X) \le \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X)$
We need to prove the existence of a sequence of block codes {∼Cn =
(n, Mn )}n≥1 such that for every γ > 0,
\[
\limsup_{n\to\infty}\frac{1}{n}\log M_n \le \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) + \gamma
\quad\text{and}\quad
\liminf_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon.
\]
Lemma 2.3 ensures the existence (for any $\gamma > 0$) of a source block code $\mathcal{C}_n = \big(n, M_n = \big\lceil \exp\{n(\lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) + \gamma)\}\big\rceil\big)$ with error probability
\[
P_e(\mathcal{C}_n) \le \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \frac{1}{n}\log M_n\right]
\le \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) + \gamma\right].
\]
Therefore,
\[
\liminf_{n\to\infty} P_e(\mathcal{C}_n)
\le \liminf_{n\to\infty}\Pr\left[\frac{1}{n}h_{X^n}(X^n) > \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) + \gamma\right]
= 1 - \limsup_{n\to\infty}\Pr\left[\frac{1}{n}h_{X^n}(X^n) \le \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) + \gamma\right]
\le 1 - (1-\varepsilon) = \varepsilon,
\]
where the last inequality follows from
\[
\lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) = \sup\left\{\theta : \limsup_{n\to\infty}\Pr\left[\frac{1}{n}h_{X^n}(X^n) \le \theta\right] < 1-\varepsilon\right\}. \tag{5.2.1}
\]

2. Converse part: $\bar{T}_\varepsilon(X) \ge \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X)$
Assume without loss of generality that $\lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) > 0$. We will prove the converse by contradiction. Suppose that $\bar{T}_\varepsilon(X) < \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X)$. Then $(\exists\,\gamma>0)$ $\bar{T}_\varepsilon(X) < \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) - 4\gamma$. By definition of $\bar{T}_\varepsilon(X)$, there exists a sequence of codes $\mathcal{C}_n = (n, M_n)$ such that
\[
\limsup_{n\to\infty}\frac{1}{n}\log M_n \le \left(\lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) - 4\gamma\right) + \gamma
< \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) - 2\gamma \tag{5.2.2}
\]
and
\[
\liminf_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon. \tag{5.2.3}
\]

(5.2.2) implies that
\[
\frac{1}{n}\log M_n \le \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) - 2\gamma
\]
for all sufficiently large $n$. Hence, for those $n$ satisfying the above inequality and also by Lemma 2.4,
\[
P_e(\mathcal{C}_n) \ge \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \frac{1}{n}\log M_n + \gamma\right] - e^{-n\gamma}
\ge \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \left(\lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) - 2\gamma\right) + \gamma\right] - e^{-n\gamma}.
\]
Therefore,
\[
\liminf_{n\to\infty} P_e(\mathcal{C}_n) \ge 1 - \limsup_{n\to\infty}\Pr\left[\frac{1}{n}h_{X^n}(X^n) \le \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) - \gamma\right]
> 1 - (1-\varepsilon) = \varepsilon,
\]

where the last inequality follows from (5.2.1). Thus, a contradiction to (5.2.3) is obtained. □
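The quantile on the right-hand side of (5.2.1) can be read off an empirical entropy spectrum. The Monte Carlo sketch below uses a hypothetical mixture of two i.i.d. Bernoulli components (natural logarithms); for such a convergent spectrum the same quantile gives both the conventional and the optimistic quantities, so the sketch is only meant to illustrate the mechanics of (5.2.1).

# Monte Carlo sketch: empirical entropy spectrum of a hypothetical mixed source.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 2000, 20000
w, p1, p2 = 0.5, 0.05, 0.40          # mixture weight and component biases (assumed)

def hb(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

# draw which component generated each block, then the number of ones in the block
which = rng.random(trials) < w
ones = np.where(which, rng.binomial(n, p1, trials), rng.binomial(n, p2, trials))

# exact normalized entropy density (1/n) h_{X^n}(x^n) under the mixture distribution
logP1 = ones * np.log(p1) + (n - ones) * np.log(1 - p1)
logP2 = ones * np.log(p2) + (n - ones) * np.log(1 - p2)
h = -np.logaddexp(np.log(w) + logP1, np.log(1 - w) + logP2) / n

eps = 0.3
print("empirical (1-eps)-quantile of the spectrum:", np.quantile(h, 1 - eps))
print("component entropies h_b(p1), h_b(p2)      :", hb(p1), hb(p2))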

We conclude this section by examining the expression of $\bar{T}(X)$ for information stable sources. It is already known (cf. for example [41]) that for an information stable source $X$,
\[
T(X) = \limsup_{n\to\infty}\frac{1}{n}H(X^n).
\]

We herein prove a parallel expression for $\bar{T}(X)$.

Definition 5.4 (information stable sources [41]) A source X is said to be


information stable if $H(X^n) > 0$ for $n$ sufficiently large, and $h_{X^n}(X^n)/H(X^n)$ converges in probability to one as $n \to \infty$, i.e.,
\[
\limsup_{n\to\infty}\Pr\left\{\left|\frac{h_{X^n}(X^n)}{H(X^n)} - 1\right| > \gamma\right\} = 0 \qquad \forall\,\gamma > 0,
\]
where $H(X^n) = E[h_{X^n}(X^n)]$ is the entropy of $X^n$.

Lemma 5.5 Every information stable source $X$ satisfies
\[
\bar{T}(X) = \liminf_{n\to\infty}\frac{1}{n}H(X^n).
\]

Proof:

1. $\bar{T}(X) \ge \liminf_{n\to\infty}(1/n)H(X^n)$
Fix ε > 0 arbitrarily small. Using the fact that hX n (X n ) is a non-negative
bounded random variable for finite alphabet, we can write the normalized
block entropy as
\[
\frac{1}{n}H(X^n) = E\left[\frac{1}{n}h_{X^n}(X^n)\right]
= E\left[\frac{1}{n}h_{X^n}(X^n)\,\mathbf{1}\!\left\{0 \le \frac{1}{n}h_{X^n}(X^n) \le \lim_{\delta\uparrow 1}H_\delta(X) + \varepsilon\right\}\right]
+ E\left[\frac{1}{n}h_{X^n}(X^n)\,\mathbf{1}\!\left\{\frac{1}{n}h_{X^n}(X^n) > \lim_{\delta\uparrow 1}H_\delta(X) + \varepsilon\right\}\right]. \tag{5.2.4}
\]
From the definition of $\lim_{\delta\uparrow 1}H_\delta(X)$, it directly follows that the first term on the right-hand side of (5.2.4) is upper bounded by $\lim_{\delta\uparrow 1}H_\delta(X) + \varepsilon$, and that the liminf of the second term is zero. Thus
\[
\bar{T}(X) = \lim_{\delta\uparrow 1}H_\delta(X) \ge \liminf_{n\to\infty}\frac{1}{n}H(X^n).
\]

2. $\bar{T}(X) \le \liminf_{n\to\infty}(1/n)H(X^n)$
Fix $\varepsilon > 0$. For infinitely many $n$,
\[
\Pr\left[\frac{h_{X^n}(X^n)}{H(X^n)} - 1 > \varepsilon\right]
= \Pr\left[\frac{1}{n}h_{X^n}(X^n) > (1+\varepsilon)\,\frac{1}{n}H(X^n)\right]
\ge \Pr\left[\frac{1}{n}h_{X^n}(X^n) > (1+\varepsilon)\left(\liminf_{n\to\infty}\frac{1}{n}H(X^n) + \varepsilon\right)\right].
\]
Since $X$ is information stable, we obtain that
\[
\liminf_{n\to\infty}\Pr\left[\frac{1}{n}h_{X^n}(X^n) > (1+\varepsilon)\left(\liminf_{n\to\infty}\frac{1}{n}H(X^n) + \varepsilon\right)\right]
\le \lim_{n\to\infty}\Pr\left[\frac{h_{X^n}(X^n)}{H(X^n)} - 1 > \varepsilon\right] = 0.
\]
By definition of $\lim_{\delta\uparrow 1}H_\delta(X)$, the above implies that
\[
\bar{T}(X) = \lim_{\delta\uparrow 1}H_\delta(X) \le (1+\varepsilon)\left(\liminf_{n\to\infty}\frac{1}{n}H(X^n) + \varepsilon\right).
\]
The proof is completed by noting that $\varepsilon$ can be made arbitrarily small. □

It is worth pointing out that if the source $X$ is both information stable and stationary, the above lemma yields
\[
\bar{T}(X) = T(X) = \lim_{n\to\infty}\frac{1}{n}H(X^n).
\]
This implies that, given a stationary and information stable source $X$, the classical separation theorem holds for every channel.
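The following sketch (with an assumed block-alternating bias pattern, natural logarithms) illustrates Definition 5.4 and Lemma 5.5 together: the source is information stable, yet $(1/n)H(X^n)$ keeps oscillating, so $\bar{T}(X) = \liminf_n (1/n)H(X^n)$ and $T(X) = \limsup_n (1/n)H(X^n)$ differ.

# Minimal sketch: an information stable but non-stationary independent-bit source.
import numpy as np

rng = np.random.default_rng(1)

def hb(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

# biases over doubling blocks: 0.05, 0.45, 0.05, ...  (an assumed pattern)
p = np.concatenate([np.full(2**k, 0.05 if k % 2 else 0.45) for k in range(1, 14)])
n = len(p)
Hn = np.cumsum(hb(p))                      # H(X^n) = sum_i h_b(p_i) for independent bits
avg = Hn / np.arange(1, n + 1)
tail = avg[n // 2:]
print("approx liminf (1/n)H(X^n):", tail.min(), "  approx limsup:", tail.max())

# information-stability check at blocklength n: h_{X^n}(X^n)/H(X^n) concentrates near 1
x = (rng.random((500, n)) < p).astype(float)
h = -(x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1)
print("P(|h/H - 1| > 0.05) ~", np.mean(np.abs(h / Hn[-1] - 1) > 0.05))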

5.3 Optimistic channel coding theorems

In this section, we state without proof the general expressions for the optimistic ε-capacity² ($\bar{C}_\varepsilon$) and for the optimistic capacity ($\bar{C}$) of arbitrary single-user channels. The proofs of these expressions are straightforward once the right definition (of $\bar{I}_\varepsilon(X;Y)$) is made. They employ Feinstein’s Lemma and the Poor-Verdú bound, and follow the same arguments used in Theorems 4.14 and 4.15 to show the general expressions of the conventional ε-capacity
\[
C_\varepsilon = \sup_{X}\,\underline{I}_\varepsilon(X;Y),
\]
and of the conventional channel capacity
\[
C = \sup_{X}\,\underline{I}_0(X;Y) = \sup_{X}\,\underline{I}(X;Y).
\]

We close this section by proving the formula of C̄ for information stable channels.

Definition 5.6 (optimistic ε-achievable rate) Fix 0 < ε < 1. R ≥ 0 is an


optimistic ε-achievable rate if there exists a sequence of ∼Cn = (n, Mn ) channel
block codes such that
\[
\liminf_{n\to\infty}\frac{1}{n}\log M_n \ge R
\quad\text{and}\quad
\liminf_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon.
\]

Definition 5.7 (optimistic ε-capacity C̄ε ) Fix 0 < ε < 1. The supremum of
optimistic ε-achievable rates is called the optimistic ε-capacity, C̄ε .

It is straightforward from the definition that C̄ε is non-decreasing in ε, and


C̄1 = log |X |.

2
Note that the expression of C̄ε was also separately obtained in [37, Theorem 7].

Definition 5.8 (optimistic capacity C̄) The optimistic channel capacity C̄
is defined as the supremum of the rates that are ε-achievable for all ε ∈ [0, 1]. It
follows immediately from the definition that C̄ = inf 0≤ε≤1 C̄ε = limε↓0 C̄ε = C̄0
and that C̄ is the supremum of all the rates R for which there exists a sequence
of ∼Cn = (n, Mn ) channel block codes such that
\[
\liminf_{n\to\infty}\frac{1}{n}\log M_n \ge R
\quad\text{and}\quad
\liminf_{n\to\infty} P_e(\mathcal{C}_n) = 0.
\]

Theorem 5.9 (optimistic ε-capacity formula) Fix 0 < ε < 1. The opti-
mistic ε-capacity $\bar{C}_\varepsilon$ satisfies
\[
\bar{C}_\varepsilon = \sup_{X}\,\bar{I}_\varepsilon(X;Y). \tag{5.3.1}
\]

Theorem 5.10 (optimistic capacity formula) The optimistic capacity C̄


satisfies
\[
\bar{C} = \sup_{X}\,\bar{I}_0(X;Y).
\]

We next investigate the expression of C̄ for information stable channels.


The expression for the capacity of information stable channels is already known
(cf. for example [41])
\[
C = \liminf_{n\to\infty}\,\sup_{X^n}\,\frac{1}{n}I(X^n;Y^n).
\]

We prove a dual formula for C̄.

Definition 5.11 (Information stable channels [18, 32]) A channel W is


said to be information stable if there exists an input process X such that 0 <
$C_n := \sup_{X^n}(1/n)I(X^n;Y^n) < \infty$ for $n$ sufficiently large, and
\[
\limsup_{n\to\infty}\Pr\left\{\left|\frac{i_{X^n W^n}(X^n;Y^n)}{n\,C_n} - 1\right| > \gamma\right\} = 0
\]

for every γ > 0.

Lemma 5.12 Every information stable channel W satisfies


\[
\bar{C} = \limsup_{n\to\infty}\,\sup_{X^n}\,\frac{1}{n}I(X^n;Y^n).
\]

Proof:
1. C̄ ≤ lim supn→∞ supX n (1/n)I(X n ; Y n )
By using a similar argument as in the proof of [42, Theorem 8, property
h)], we have
\[
\bar{I}_0(X;Y) \le \limsup_{n\to\infty}\,\sup_{X^n}\,\frac{1}{n}I(X^n;Y^n).
\]
Hence,
\[
\bar{C} = \sup_{X}\,\bar{I}_0(X;Y) \le \limsup_{n\to\infty}\,\sup_{X^n}\,\frac{1}{n}I(X^n;Y^n).
\]

2. C̄ ≥ lim supn→∞ supX n (1/n)I(X n ; Y n )


Suppose X̃ is the input process that makes the channel information stable.
Fix $\varepsilon > 0$. Then for infinitely many $n$,
\[
P_{\tilde{X}^n W^n}\left[\frac{1}{n}\, i_{\tilde{X}^n W^n}(\tilde{X}^n; Y^n) \le (1-\varepsilon)\left(\limsup_{n\to\infty} C_n - \varepsilon\right)\right]
\le P_{\tilde{X}^n W^n}\left[\frac{i_{\tilde{X}^n W^n}(\tilde{X}^n; Y^n)}{n} < (1-\varepsilon)\, C_n\right]
= P_{\tilde{X}^n W^n}\left[\frac{i_{\tilde{X}^n W^n}(\tilde{X}^n; Y^n)}{n\, C_n} - 1 < -\varepsilon\right].
\]
Since the channel is information stable, we get that
\[
\liminf_{n\to\infty} P_{\tilde{X}^n W^n}\left[\frac{1}{n}\, i_{\tilde{X}^n W^n}(\tilde{X}^n; Y^n) \le (1-\varepsilon)\left(\limsup_{n\to\infty} C_n - \varepsilon\right)\right] = 0.
\]
By the definition of $\bar{C}$, the above immediately implies that
\[
\bar{C} = \sup_{X}\,\bar{I}_0(X;Y) \ge \bar{I}_0(\tilde{X};Y) \ge (1-\varepsilon)\left(\limsup_{n\to\infty} C_n - \varepsilon\right).
\]

The proof is completed by noting that $\varepsilon$ can be made arbitrarily small. □

Observations:
• It is known that for discrete memoryless channels, the optimistic capacity
C̄ is equal to the (conventional) capacity C [42, 16]. The same result
holds for modulo-q additive noise channels with stationary ergodic noise.
However, in general, $\bar{C} \ge C$ since $\bar{I}_0(X;Y) \ge \underline{I}(X;Y)$ [10, 11].

• Remark that Theorem 11 in [41] holds if, and only if,
\[
\sup_{X}\,\underline{I}(X;Y) = \sup_{X}\,\bar{I}_0(X;Y).
\]

Furthermore, note that, if C̄ = C and there exists an input distribution


PX̂ that achieves C, then PX̂ also achieves C̄.

5.4 Examples for the computation of capacity and strong capacity

We provide four examples to illustrate the computation of C and C̄. The first two
examples present information stable channels for which C̄ > C. The third exam-
ple shows an information unstable channel for which C̄ = C. These examples indicate that information stability is neither necessary nor sufficient to ensure that C̄ = C, and thereby the validity of the classical source-channel separation theorem. The last example illustrates the situation where $0 < C < \bar{C} < C_{SC} < \log_2|\mathcal{Y}|$, where $C_{SC}$ denotes the strong capacity. We assume in this section that all logarithms are in base 2, so that C and C̄ are measured in bits.

5.4.1 Information stable channels


Example 5.13 Consider a nonstationary channel W such that at odd time instances n = 1, 3, · · · , $W^n$ is the product of the transition distributions of a binary symmetric channel with crossover probability 1/8 (BSC(1/8)), and at even time instances n = 2, 4, 6, · · · , $W^n$ is the product of the distributions of a BSC(1/4). It can be easily verified that this channel is information stable. Since the channel is symmetric, a Bernoulli(1/2) input achieves $C_n = \sup_{X^n}(1/n)I(X^n;Y^n)$; thus
\[
C_n = \begin{cases} 1 - h_b(1/8), & \text{for } n \text{ odd};\\ 1 - h_b(1/4), & \text{for } n \text{ even}, \end{cases}
\]
where hb (a) := −a log2 a − (1 − a) log2 (1 − a) is the binary entropy function.


Therefore, C = lim inf n→∞ Cn = 1 − hb (1/4) and C̄ = lim supn→∞ Cn = 1 −
hb (1/8) > C.
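A two-line numerical check of the values in Example 5.13 (logarithms in base 2, as assumed in this section).

import numpy as np

def hb2(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

print("C    = 1 - h_b(1/4) =", 1 - hb2(0.25))   # ~ 0.1887 bits
print("Cbar = 1 - h_b(1/8) =", 1 - hb2(0.125))  # ~ 0.4564 bits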

Example 5.14 Here we use the information stable channel provided in [41,
Section III] to show that C̄ > C. Let N be the set of all positive integers.
Define the set J as
\[
\mathcal{J} := \{n \in \mathcal{N} : 2^{2i+1} \le n < 2^{2i+2},\ i = 0, 1, 2, \ldots\}
= \{2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 32, 33, \cdots, 63, 128, 129, \cdots, 255, \cdots\}.
\]
Consider the following nonstationary symmetric channel $W$. At times $n \in \mathcal{J}$, $W_n$ is a BSC(0), whereas at times $n \notin \mathcal{J}$, $W_n$ is a BSC(1/2). Put $W^n = W_1 \times W_2 \times \cdots \times W_n$. Here again $C_n$ is achieved by a Bernoulli(1/2) input $\hat{X}^n$. Since the set $\mathcal{J}$ is deterministic and is known to both transmitter and receiver, we then obtain
\[
C_n = \frac{1}{n}\sum_{i=1}^{n} I(\hat{X}_i; Y_i) = \frac{1}{n}\left[J(n)\cdot 1 + (n - J(n))\cdot 0\right] = \frac{J(n)}{n},
\]
where $J(n) := |\mathcal{J} \cap \{1, 2, \cdots, n\}|$. It can be shown that
\[
\frac{J(n)}{n} = \begin{cases} \displaystyle 1 - \frac{2}{3}\cdot\frac{2^{\lfloor \log_2 n\rfloor}}{n} + \frac{1}{3n}, & \text{for } \lfloor\log_2 n\rfloor \text{ odd};\\[2ex] \displaystyle \frac{2}{3}\cdot\frac{2^{\lfloor \log_2 n\rfloor}}{n} - \frac{2}{3n}, & \text{for } \lfloor\log_2 n\rfloor \text{ even}. \end{cases}
\]
Consequently, C = lim inf n→∞ Cn = 1/3 and C̄ = lim supn→∞ Cn = 2/3.
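The behaviour of $C_n = J(n)/n$ can also be verified directly from the definition of the set $\mathcal{J}$, as in the following minimal sketch.

# Minimal sketch: compute J(n)/n from the definition of J and observe its limit points.
def in_J(k):
    # k is in J iff 2^(2i+1) <= k < 2^(2i+2) for some i >= 0
    e = k.bit_length() - 1          # largest exponent e with 2^e <= k
    return e % 2 == 1

J_count, nmax, records = 0, 2**16, []
for n in range(1, nmax + 1):
    J_count += in_J(n)
    records.append(J_count / n)

print("min of J(n)/n over the last half:", min(records[nmax // 2:]))   # ~ 1/3
print("max of J(n)/n over the last half:", max(records[nmax // 2:]))   # ~ 2/3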

5.4.2 Information unstable channels


Example 5.15 (Polya-contagion channel) Consider a discrete additive cha-
nnel with binary input and output alphabet {0, 1} described by
Y i = Xi ⊕ Z i , i = 1, 2, · · · ,
where Xi , Yi and Zi are respectively the i-th input, i-th output and i-th noise,
and ⊕ represents modulo-2 addition. Suppose that the input process is indepen-
dent of the noise process. Also assume that the noise sequence {Zn }n≥1 is drawn
according to the Polya contagion urn scheme [1, 33] as follows: an urn originally
contains R red balls and B black balls with R < B; the noise just makes succes-
sive draws from the urn; after each draw, it returns to the urn 1 + ∆ balls of the
same color as was just drawn (∆ > 0). The noise sequence {Zi } corresponds to
the outcomes of the draws from the Polya urn: Zi = 1 if i-th ball drawn is red
and Zi = 0, otherwise. Let ρ := R/(R + B) and δ := ∆/(R + B). It is shown in
[1] that the noise process {Zi } is stationary and nonergodic; thus the channel is
information unstable. We then obtain³
\[
C_\varepsilon = 1 - \lim_{\delta\uparrow(1-\varepsilon)} \overline{H}_\delta(X),
\]

3
Some work adopts slightly different definitions of Cε and C̄ε from ours. For example, the ε-
achievable rate in [42] and the optimistic ε-achievable rate in [13] are defined via arbitrary γ > 0
and a sequence of ∼Cn = (n, Mn ) block codes such that i) (1/n) log Mn > R − γ for sufficiently
large n and ii) Pe ( ∼Cn ) ≤ ε for sufficiently large n under ε-achievability and Pe ( ∼Cn ) ≤ ε for
infinitely many n under optimistic ε-achievability. We, however, define the two quantities
without the auxiliary (arbitrarily small) γ but dictate the existence of a sequence of ∼Cn =
(n, Mn ) block codes such that i) lim inf n→∞ (1/n) log Mn ≥ R and ii) lim supn→∞ Pe ( ∼Cn ) ≤ ε
for ε-achievability and lim inf n→∞ Pe ( ∼Cn ) ≤ ε for optimistic ε-achievability (cf. Definitions 4.10
and 5.6). Notably, by adopting the definitions in [42] and [13], one can only obtain (cf. [11,
Part I])
\[
1 - \overline{H}_{1-\varepsilon}(X) \le C_\varepsilon \le 1 - \lim_{\delta\uparrow(1-\varepsilon)} \overline{H}_\delta(X),
\]
and
\[
1 - H_{1-\varepsilon}(X) \le \bar{C}_\varepsilon \le 1 - \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X).
\]

Our definitions accordingly provide simpler equality formulas for ε-capacity and optimistic
ε-capacity.

and
\[
\bar{C}_\varepsilon = 1 - \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X).
\]

It is shown [1] that −(1/n) log PX n (X n ) converges in distribution to the contin-


uous random variable V := hb (U), where U is beta-distributed with parameters
$(\rho/\delta, (1-\rho)/\delta)$, and $h_b(\cdot)$ is the binary entropy function. Thus
\[
\overline{H}_{1-\varepsilon}(X) = \lim_{\delta\uparrow(1-\varepsilon)} \overline{H}_\delta(X) = H_{1-\varepsilon}(X) = \lim_{\delta\uparrow(1-\varepsilon)} H_\delta(X) = F_V^{-1}(1-\varepsilon),
\]

where FV (a) := Pr{V ≤ a} is the cumulative distribution function of V , and


FV−1 (·) is its inverse [1]. Consequently,

\[
C_\varepsilon = \bar{C}_\varepsilon = 1 - F_V^{-1}(1-\varepsilon),
\]
and
\[
C = \bar{C} = \lim_{\varepsilon\downarrow 0}\left[1 - F_V^{-1}(1-\varepsilon)\right] = 0.
\]
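A Monte Carlo sketch of Example 5.15 (base-2 logarithms): $V = h_b(U)$ with $U$ beta-distributed, and $C_\varepsilon = \bar{C}_\varepsilon = 1 - F_V^{-1}(1-\varepsilon)$. The urn parameters below ($\rho = 0.4$, $\delta = 0.2$, i.e., beta shape parameters 2 and 3) are an arbitrary choice satisfying $R < B$.

# Monte Carlo sketch of the Polya-contagion channel capacities under assumed urn parameters.
import numpy as np

rng = np.random.default_rng(2)
rho, delta = 0.4, 0.2                       # rho = R/(R+B) < 1/2, delta = Delta/(R+B)

def hb2(u):
    return -u * np.log2(u) - (1 - u) * np.log2(1 - u)

U = rng.beta(rho / delta, (1 - rho) / delta, size=200000)
V = hb2(U)                                  # limiting law of -(1/n) log2 P_{Z^n}(Z^n)

for eps in (0.01, 0.1, 0.3):
    C_eps = 1 - np.quantile(V, 1 - eps)     # = 1 - F_V^{-1}(1 - eps)
    print(f"eps = {eps:4.2f}  ->  C_eps = Cbar_eps ~ {C_eps:.4f} bits")
print("capacity C = lim_{eps->0} C_eps ~", 1 - np.quantile(V, 1.0))  # ~ 0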

Example 5.16 Let W̃1 , W̃2 , . . . consist of the channel in Example 5.14, and let
Ŵ1 , Ŵ2 , . . . consist of the channel in Example 5.15. Define a new channel W as
follows:
W2i = W̃i and W2i−1 = Ŵi for i = 1, 2, · · · .
As in the previous examples, the channel is symmetric, and a Bernoulli(1/2)
input maximizes the inf/sup-information rates. Therefore for a Bernoulli(1/2)
input X, we have
\[
\Pr\left[\frac{1}{n}\log\frac{P_{W^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \le \theta\right]
= \begin{cases}
\Pr\left[\dfrac{1}{2i}\left(\log\dfrac{P_{\tilde{W}^i}(Y^i|X^i)}{P_{Y^i}(Y^i)} + \log\dfrac{P_{\hat{W}^i}(Y^i|X^i)}{P_{Y^i}(Y^i)}\right) \le \theta\right], & \text{if } n = 2i;\\[2.5ex]
\Pr\left[\dfrac{1}{2i+1}\left(\log\dfrac{P_{\tilde{W}^i}(Y^i|X^i)}{P_{Y^i}(Y^i)} + \log\dfrac{P_{\hat{W}^{i+1}}(Y^{i+1}|X^{i+1})}{P_{Y^{i+1}}(Y^{i+1})}\right) \le \theta\right], & \text{if } n = 2i+1;
\end{cases}
\]
\[
= \begin{cases}
1 - \Pr\left[-\dfrac{1}{i}\log P_{Z^i}(Z^i) < 1 - 2\theta + \dfrac{1}{i}J(i)\right], & \text{if } n = 2i;\\[2.5ex]
1 - \Pr\left[-\dfrac{1}{i+1}\log P_{Z^{i+1}}(Z^{i+1}) < 1 - \left(2 - \dfrac{1}{i+1}\right)\theta + \dfrac{1}{i+1}J(i)\right], & \text{if } n = 2i+1.
\end{cases}
\]
The fact that −(1/i) log[PZ i (Z i )] converges in distribution to the continuous
random variable V := hb (U), where U is beta-distributed with parameters
(ρ/δ, (1 − ρ)/δ), and the fact that
\[
\liminf_{n\to\infty}\frac{1}{n}J(n) = \frac{1}{3} \quad\text{and}\quad \limsup_{n\to\infty}\frac{1}{n}J(n) = \frac{2}{3}
\]
imply that
\[
i(\theta) := \liminf_{n\to\infty}\Pr\left[\frac{1}{n}\log\frac{P_{W^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \le \theta\right] = 1 - F_V\!\left(\frac{5}{3} - 2\theta\right),
\]
and
\[
\bar{i}(\theta) := \limsup_{n\to\infty}\Pr\left[\frac{1}{n}\log\frac{P_{W^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \le \theta\right] = 1 - F_V\!\left(\frac{4}{3} - 2\theta\right).
\]

Consequently,
\[
\bar{C}_\varepsilon = \frac{5}{6} - \frac{1}{2}F_V^{-1}(1-\varepsilon) \quad\text{and}\quad C_\varepsilon = \frac{2}{3} - \frac{1}{2}F_V^{-1}(1-\varepsilon).
\]
Thus
\[
0 < C = \frac{1}{6} < \bar{C} = \frac{1}{3} < C_{SC} = \frac{5}{6} < \log_2|\mathcal{Y}| = 1.
\]

Bibliography

[1] F. Alajaji and T. Fuja, “A communication channel modeled on contagion,”


IEEE Trans. Inf. Theory, vol. 40, no. 6, pp. 2035-2041, Nov. 1994.

[2] F. Alajaji and P.-N. Chen. An Introduction to Single-User Information The-


ory, Springer, 2018.

[3] P. Billingsley. Probability and Measure, 2nd edition, Wiley, New York, 1995.

[4] R. E. Blahut. Principles and Practice of Information Theory, Addison Wes-


ley, Massachusetts, 1988.

[5] V. Blinovsky. Asymptotic Combinatorial Coding Theory. Kluwer Academic,


1997.

[6] J. A. Bucklew. Large Deviation Techniques in Decision, Simulation, and


Estimation, Wiley, New York, 1990.

[7] P.-N. Chen, “General formulas for the Neyman-Pearson type-II error ex-
ponent subject to fixed and exponential type-I error bound,” IEEE Trans.
Inf. Theory, vol. 42, no. 1, pp. 316-323, Jan 1996.

[8] P.-N. Chen, “Generalization of Gärtner-Ellis theorem,” IEEE Trans. Inf.


Theory, vol. 46, no. 7, pp. 2752-2760, Nov. 2000.

[9] P.-N. Chen and F. Alajaji, “Strong converse, feedback capacity and hy-
pothesis testing,” Proc. Conf. Inf. Sciences Systems, Johns Hopkins Univ.,
Baltimore, Mar. 1995.

[10] P.-N. Chen and F. Alajaji, “Generalization of information measures,” Proc.


Int. Symp. Inf. Theory & Applications, Victoria, Canada, Sep. 1996.

[11] P.-N. Chen and F. Alajaji, “Generalized source coding theorems and hy-
pothesis testing (Parts I and II),” J. Chin. Inst. Eng., vol. 21, no. 3, pp. 283-
303, May 1998.

[12] P.-N. Chen and F. Alajaji, “On the optimistic capacity of arbitrary chan-
nels,” in Proc. IEEE Int. Symp. Inf. Theory, Cambridge, MA, Aug. 1998.

[13] P.-N. Chen and F. Alajaji, “Optimistic Shannon coding theorems for arbi-
trary single-user systems,” IEEE Trans. Inf. Theory, vol. 45, no. 7, pp. 2623-
2629, Nov. 1999.

[14] P.-N. Chen and F. Alajaji, “A generalized Poor-Verdú error bound for mul-
tihypothesis testing,” IEEE Trans. Inf. Theory, vol. 58, no. 1, pp. 311-316,
Jan. 2012.

[15] P.-N. Chen and A. Papamarcou, “New asymptotic results in parallel dis-
tributed detection,” IEEE Trans. Inf. Theory, vol. 39, no. 6, pp. 1847-1863,
Nov. 1993.

[16] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete
Memoryless Systems, Academic Press, New York, 1981.

[17] J.-D. Deuschel and D. W. Stroock, Large Deviations, Academic Press, San
Diego, 1989.

[18] R. L. Dobrushin, “General formulation of Shannon’s basic theorems of infor-


mation theory,” AMS Translations, vol. 33, pp. 323-438, AMS, Providence,
RI, 1963.

[19] T. Ericson and V. A. Zinoviev, “An improvement of the Gilbert bound for
constant weight codes,” IEEE Trans. Inf. Theory, vol. 33, no. 5, pp. 721-723,
Sep. 1987.

[20] W. Feller, An Introduction to Probability Theory and its Applications, 2nd


edition, Wiley, New York, 1970.

[21] J. A. Fill and M. J. Wichura, “The convergence rate for the strong law
of large numbers: General lattice distributions,” Probab. Th. Rel. Fields,
vol. 81, pp. 189-212, 1989.

[22] R. G. Gallager. Information Theory and Reliable Communications, Wiley,


1968.

[23] G. van der Geer and J. H. van Lint. Introduction to Coding Theory and
Algebraic Geometry. Birkhauser, Basel, 1988.

[24] R. M. Gray, Entropy and Information Theory, Springer-Verlag, New York,


1990.

[25] T. S. Han, Information-Spectrum Methods in Information Theory, Springer,
2003.
[26] T. S. Han and S. Verdú, “Approximation theory of output statistics,” IEEE
Trans. Inf. Theory, vol. 39, no. 3, pp. 752-772, May 1993.
[27] A. N. Kolmogorov and S. V. Fomin. Introductory Real Analysis. Dover Pub-
lications, New York, 1970.
[28] D. E. Knuth and A. C. Yao, “The complexity of random number gener-
ation,” in Proc. Symp. Algorithms and Complexity: New Directions and
Recent Results, Academic Press, New York, 1976.
[29] J. H. van Lint, Introduction to Coding Theory. 2nd edition, Springer-Verlag,
New York, 1992.
[30] S. N. Litsyn and M. A. Tsfasman, “A note on lower bounds,” IEEE Trans.
Inf. Theory, vol. 32, no. 5, pp. 705-706, Sep. 1986.
[31] J. K. Omura, “On general Gilbert bounds,” IEEE Trans. Inf. Theory,
vol. 19, no. 5, pp. 661-666, Sep. 1973.
[32] M. S. Pinsker, Information and Information Stability of Random Variables
and Processes, Holden-Day, 1964.
[33] G. Pólya, “Sur quelques points de la théorie des probabilités,” Ann. Inst. H. Poincaré, vol. 1, pp. 117-161, 1931.
[34] H. V. Poor and S. Verdú, “A lower bound on the probability of error in
multihypothesis testing,” IEEE Trans. Inf. Theory, vol. 41, no. 6, pp. 1992-
1994, Nov. 1995.
[35] H. L. Royden. Real Analysis, 3rd edition, Macmillan, New York, 1988.
[36] Y. Steinberg and S. Verdú, “Simulation of random processes and rate-
distortion theory,” IEEE Trans. Inf. Theory, vol. 42, no. 1, pp. 63-86,
Jan. 1996.
[37] Y. Steinberg, “New converses in the theory of identification via channels,”
IEEE Trans. Inf. Theory, vol. 44, no. 3, pp. 984-998, May 1998.
[38] M. A. Tsfasman and S. G. Vladut. Algebraic-Geometric Codes, Kluwer Aca-
demic, Netherlands, 1991.
[39] J. N. Tsitsiklis, “Decentralized detection by a large number of sensors,”
Mathematics of Control, Signals and Systems, vol. 1, no. 2, pp. 167-182,
1988.

[40] G. Vazquez-Vilar, A. Tauste Campo, A. Guillen i Fabregas, and A. Mar-
tinez, “Bayesian M-ary hypothesis testing: The meta-converse and Verdú-
Han bounds are tight,” IEEE Trans. Inf. Theory, vol. 62, no. 5, pp. 2324-
2333, May 2016.

[41] S. Vembu, S. Verdú and Y. Steinberg, “The source-channel separation theo-


rem revisited,” IEEE Trans. Inf. Theory, vol. 41, no. 1, pp. 44-54, Jan. 1995.

[42] S. Verdú and T. S. Han, “A general formula for channel capacity,” IEEE
Trans. Inf. Theory, vol. 40, no. 4, pp. 1147-1157, Jul. 1994.

[43] S. G. Vladut, “An exhaustion bound for algebraic-geometric modular


codes,” Probl. Inf. Transm., vol. 23, pp. 22-34, 1987.

[44] V. A. Zinoviev and S. N. Litsyn, “Codes that exceed the Gilbert bound,”
Probl. Inf. Transm., vol. 21, no. 1, pp. 105-108, 1985.

Index

M-type, 35
  Algorithm for 3-type, 36
  Algorithm for 4-type, 35
δ-inf-divergence rate, 10
δ-inf-entropy rate, 9
δ-inf-information rate, 9
δ-sup-divergence rate, 10
δ-sup-entropy rate, 9
δ-sup-information rate, 9
ε-achievable data compression rate, 24
ε-achievable rate, 70
  optimistic, 88
ε-achievable resolution, 37
ε-achievable resolution rate, 37
ε-achievable source coding rate
  optimistic, 84
ε-capacity, 70
  Example, 76
  optimistic, 88, 89
  theorem, 71
ε-mean-achievable resolution rate, 38
ε-mean-resolvability, 39
  for input and channel, 80
ε-resolvability, 37
  for input and channel, 79
ε-source compression rate
  fixed-length codes, 48
AEP, 24
  generalized, 30
arbitrary statistics system, 1
arbitrary system with memory, 2
average probability of error, 59
average-case complexity for generating a random variable, 35
capacity, 59
  ε-capacity, 70
  definition, 70
  Example, 73
  for arbitrary channels, 58
  general formula, 72
  optimistic, 89
  strong, 70, 73
channel capacity, 59
channels
  general models, 57
data processing lemma, 14
data transmission code
  fixed length, 59
entropy, 1
  entropy rate, 1
entropy rate, 1
  inf-entropy rate, 9
  sup-entropy rate, 9
Feinstein’s lemma, 60
fixed-length
  data transmission code, 59
general lossless data compression theorem, 23
general optimistic source coding theorem, 84
general source coding theorem, 26
  optimistic, 84
generalized AEP theorem, 31
generalized divergence measure, 12
generalized entropy measure, 10
generalized information measure, 1, 8
  δ-inf-divergence rate, 10
  δ-inf-entropy rate, 9
  δ-inf-information rate, 9
  δ-sup-divergence rate, 10
  δ-sup-entropy rate, 9
  δ-sup-information rate, 9
  examples, 17
  normalized entropy density, 8
  normalized information density, 9
  normalized log-likelihood ratio, 10
  properties, 11
generalized mutual information measure, 11
inf-entropy rate, 9
inf-information rate, 58
inf-spectrum, 58
information density, 58
information stable channels, 89
information stable sources, 28, 86
information unstable channels, 92
liminf in probability, 4
limsup in probability, 5
mean-resolvability, 39
  equivalence to capacity, 81
  for channel, 80
  for input and channel, 80
  operational meanings, 39
  variable-length codes minimum source coding rate, 51
normalized entropy density, 8
normalized information density, 9
normalized log-likelihood ratio, 10
optimality of independent inputs for δ-inf-information rate, 14
optimistic ε-achievable rate, 88
optimistic ε-achievable source coding rate, 84
optimistic ε-capacity, 88, 89
optimistic capacity, 89
  information stable channel, 89
optimistic source coding rate, 84
  information stable source, 86
Polya-contagion channel, 92
Poor-Verdú lemma
  example, 67, 68
  generalized, 63
probability of an event evaluated under a distribution, iii
quantile, 2, 3, 5
  properties, 5
resolution, 35
  ε-achievable resolution rate, 37
  ε-mean-achievable resolution rate, 38
resolvability, 30, 34, 38
  ε-mean-resolvability, 39
  ε-resolvability, 37
  equality to sup-entropy rate, 41
  equivalence to strong capacity, 81
  for channel, 80
  for input and channel, 80
  mean-resolvability, 39
  minimum source coding rate for fixed-length codes, 49
  operational meanings, 39
source coding, 47
source compression rate
  ε-source compression rate, 48
  fixed-length codes, 48
  variable-length codes, 48
source-channel separation theorem, 83
spectrum, 2
  inf-spectrum, 2, 58
  sup-spectrum, 2, 58
stationary-ergodic source, 29, 33
strong capacity, 70, 73
  Example, 75
strong converse theorem, 29
sup-entropy rate, 9
sup-information rate, 58
sup-spectrum, 58
variational distance, 36
  bound for entropy difference, 45
  bound for log-likelihood ratio spectrum, 40
weak converse theorem, 29
weakly δ-typical set, 30, 32
worst-case complexity for generating a random variable, 35
