Information Theory For Single-User Systems With Arbitrary Statistical Memory
by
†
Department of Mathematics & Statistics,
Queen’s University, Kingston, ON K7L 3N6, Canada
Email: [email protected]
‡
Department of Electrical & Computer Engineering
National Chiao Tung University
1001, Ta Hsueh Road
Hsin Chu, Taiwan 300
Republic of China
Email: [email protected]
8. Information theory of networks: distributed detection, data compression of distributed sources, capacity of multiple access channels, the degraded broadcast channel, and Gaussian multiple terminal channels.
The mathematical background on which these topics are based can be found
in Appendices A and B of [2].
Notation
For a discrete random variable X with alphabet X , we use PX to denote its
distribution. For convenience, we will use interchangeably the two expressions
for the probability of the elementary event [X = x]: Pr[X = x] and PX (x).
Similarly, the probability of a set characterized by an inequality, such as f (x) <
a, will be expressed by either
PX {x ∈ X : f (x) < a}
or
Pr [f (X) < a] .
In the second expression, we view f (X) as a new random variable defined through
X and a function f (·).
Obviously, the above expressions can be applied to any legitimate function
f (·) defined over X , including any probability function PX̂ (·) (or log PX̂ (x)) of a
random variable X̂. Therefore, the next two expressions denote the probability
of f (x) = PX̂ (x) < a evaluated under distribution PX :
PX {x ∈ X : f (x) < a} = PX {x ∈ X : PX̂ (x) < a}
and
Pr [f (X) < a] = Pr [PX̂ (X) < a] .
As a result, if we write
$$P_{X,Y}\left\{(x,y)\in\mathcal{X}\times\mathcal{Y} : \log\frac{P_{\hat{X},\hat{Y}}(x,y)}{P_{\hat{X}}(x)P_{\hat{Y}}(y)} < a\right\} = \Pr\left[\log\frac{P_{\hat{X},\hat{Y}}(X,Y)}{P_{\hat{X}}(X)P_{\hat{Y}}(Y)} < a\right],$$
we mean that we have defined a new function
$$f(x,y) := \log\frac{P_{\hat{X},\hat{Y}}(x,y)}{P_{\hat{X}}(x)P_{\hat{Y}}(y)}$$
in terms of the joint distribution $P_{\hat{X},\hat{Y}}$ and its two marginal distributions, and that we are interested in the probability of $f(X,Y) < a$, where $X$ and $Y$ have joint distribution $P_{X,Y}$.
Notes to readers
In this book, all the assumptions, claims, conjectures, corollaries, definitions, examples, exercises, lemmas, observations, properties, and theorems are numbered under a single counter for ease of reference. For example, the lemma that immediately follows Theorem 2.1 is numbered Lemma 2.2, instead of Lemma 2.1.
Acknowledgements
We thank our families for their full support while we were writing this book.
Table of Contents

Notation
Notes to readers

4.4 Capacity formulas for general channels
4.5 Examples for the general capacity formulas
4.6 Capacity and resolvability for channels

List of Tables

List of Figures
Chapter 1
assuming the limit exists. The above quantities have an operational significance
established via Shannon’s source coding theorems when the stochastic systems
under consideration satisfy certain regularity conditions, such as stationarity and
ergodicity [24, 42]. However, in more complicated situations such as when the
systems have a time-varying nature and are non-stationary, these information
rates are no longer valid and lose their operational significance. This results in
the need to establish new entropy measures which appropriately characterize the
operational limits of arbitrary stochastic systems.
Let us begin with the model of an arbitrary system with memory. In gen-
eral, there are two indices for random variables or observations: a time in-
dex and a space index. When a sequence of random variables is denoted by
X1 , X2 , . . . , Xn , . . ., the subscript i of Xi can be treated as either a time index
or a space index, but not both. Hence, when a sequence of random variables is
a function of both time and space, the notation of X1 , X2 , . . . , Xn , . . ., is by no
means sufficient; and therefore, a new model for a general time-varying source,
such as
$$X_1^{(n)}, X_2^{(n)}, \ldots, X_t^{(n)}, \ldots,$$
where t is the time index and n is the space or position index (or vice versa),
becomes significant.
When block-wise (fixed-length) compression of such a source (with blocklength
n) is considered, the same question as for the compression of an i.i.d. source
arises:
what is the minimum compression rate (say in bits per
source sample) for which the probability of error can be (1.0.1)
made arbitrarily small as the blocklength goes to infinity?
To answer this question, information theorists have to find a sequence of data
compression codes for each blocklength n and investigate if the decompression
error probability goes to zero as n approaches infinity. However, unlike simple
source models such as discrete memoryless sources [2], an arbitrary source
may exhibit distinct statistics for each blocklength n; e.g., for
$$\begin{array}{ll}
n = 1: & X_1^{(1)} \\
n = 2: & X_1^{(2)}, X_2^{(2)} \\
n = 3: & X_1^{(3)}, X_2^{(3)}, X_3^{(3)} \\
n = 4: & X_1^{(4)}, X_2^{(4)}, X_3^{(4)}, X_4^{(4)} \\
& \vdots
\end{array} \tag{1.0.2}$$
the statistics of $X_1^{(4)}$ could be different from those of $X_1^{(1)}$, $X_1^{(2)}$ and $X_1^{(3)}$ (i.e., the source statistics are not necessarily consistent). Since the model behind question (1.0.1) is general, and the system statistics can be arbitrarily defined, it is named an arbitrary system with memory.
The triangular array of random variables in (1.0.2) is denoted by a boldface
letter as
$$\boldsymbol{X} := \{X^n\}_{n=1}^{\infty} = \left\{X^n = \left(X_1^{(n)}, X_2^{(n)}, \ldots, X_n^{(n)}\right)\right\}_{n=1}^{\infty}.$$
by
$$\underline{u}(\theta) := \liminf_{n\to\infty}\Pr\{A_n \le \theta\}$$
and
$$\bar{u}(\theta) := \limsup_{n\to\infty}\Pr\{A_n \le \theta\},$$
respectively, where $\theta \in \mathbb{R}$.
In other words, $\underline{u}(\cdot)$ and $\bar{u}(\cdot)$ are respectively the liminf and the limsup of the cumulative distribution function (CDF) of $A_n$. Note that by definition, the CDF of $A_n$ — $\Pr\{A_n \le \theta\}$ — is non-decreasing and right-continuous. However, for $\underline{u}(\cdot)$ and $\bar{u}(\cdot)$, only the non-decreasing property remains.¹
Definition 1.2 (Quantile of inf/sup-spectrum) For any $0 \le \delta \le 1$, the quantile $\underline{U}_\delta$ of the sup-spectrum $\bar{u}(\cdot)$ and the quantile $\bar{U}_\delta$ of the inf-spectrum $\underline{u}(\cdot)$ are defined by²
$$\underline{U}_\delta := \sup\{\theta : \bar{u}(\theta) \le \delta\}$$
and
$$\bar{U}_\delta := \sup\{\theta : \underline{u}(\theta) \le \delta\},$$
respectively.
It follows from the above definitions that $\underline{U}_\delta$ and $\bar{U}_\delta$ are right-continuous and non-decreasing in $\delta$. Note that the supremum of an empty set is defined to be $-\infty$.
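The quantile operation of Definition 1.2 is easy to evaluate numerically. The following Python sketch (ours, not from the text) approximates the sup/inf-spectrum of a toy sequence $\{A_n\}$ by Monte Carlo over a few blocklengths — the finite maximum/minimum over those blocklengths serves only as a crude proxy for the limsup/liminf — and then reads off the quantiles as $\sup\{\theta : \text{spectrum}(\theta)\le\delta\}$.

    # Illustrative sketch of Definitions 1.1-1.2 (assumes finite-n proxies for limsup/liminf).
    import numpy as np

    rng = np.random.default_rng(0)

    def empirical_cdf(samples, thetas):
        """Estimate Pr{A_n <= theta} from samples, for each theta on a grid."""
        return np.array([(samples <= t).mean() for t in thetas])

    def quantile(spectrum, thetas, delta):
        """sup{theta : spectrum(theta) <= delta}; -inf if the set is empty."""
        feasible = thetas[spectrum <= delta]
        return feasible.max() if feasible.size else -np.inf

    thetas = np.linspace(-1.0, 2.0, 601)
    cdfs = []
    for n in (50, 100, 200, 400, 800):
        # Toy choice: A_n is the sample mean of n i.i.d. Bernoulli(0.4) variables.
        samples = rng.binomial(1, 0.4, size=(20000, n)).mean(axis=1)
        cdfs.append(empirical_cdf(samples, thetas))

    cdfs = np.array(cdfs)
    sup_spectrum = cdfs.max(axis=0)   # proxy for the limsup over n
    inf_spectrum = cdfs.min(axis=0)   # proxy for the liminf over n

    delta = 0.1
    print("U_delta (quantile of sup-spectrum) ~", quantile(sup_spectrum, thetas, delta))
    print("Ubar_delta (quantile of inf-spectrum) ~", quantile(inf_spectrum, thetas, delta))

For this concentrating example both quantiles come out close to 0.4, as expected.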
Observation 1.3 Generally speaking, one can define “quantile” in four different
ways. For example, for the quantile of the inf-spectrum, we can define:
$$\bar{U}_\delta := \sup\{\theta : \liminf_{n\to\infty}\Pr[A_n \le \theta] \le \delta\}$$
$$\bar{U}_\delta^{-} := \sup\{\theta : \liminf_{n\to\infty}\Pr[A_n \le \theta] < \delta\}$$
$$\bar{U}_\delta^{+} := \sup\{\theta : \liminf_{n\to\infty}\Pr[A_n < \theta] \le \delta\}$$
$$\bar{U}_\delta^{+-} := \sup\{\theta : \liminf_{n\to\infty}\Pr[A_n < \theta] < \delta\}.$$
¹ It is pertinent to also point out that even if we do not require right-continuity as a fundamental property of a CDF, the spectrums $\underline{u}(\cdot)$ and $\bar{u}(\cdot)$ are not necessarily legitimate CDFs of (conventional real-valued) random variables, since there might exist cases where the "probability mass escapes to infinity" (cf. [3, pp. 346]). A necessary and sufficient condition for $\underline{u}(\cdot)$ and $\bar{u}(\cdot)$ to be conventional CDFs (without requiring right-continuity) is that the sequence of distribution functions of $A_n$ is tight [3, pp. 346]. Tightness is actually guaranteed if the alphabet of $A_n$ is finite.
² Note that the usual definition of the quantile function $\phi(\delta)$ of a non-decreasing function $F(\cdot)$ is slightly different from our definition [3, pp. 190], where $\phi(\delta) := \sup\{\theta : F(\theta) < \delta\}$. Remark that if $F(\cdot)$ is strictly increasing, then the quantile is nothing but the inverse of $F(\cdot)$: $\phi(\delta) = F^{-1}(\delta)$.
The general relations between these four quantities are as follows:
$$\bar{U}_\delta^{-} = \bar{U}_\delta^{+-} \le \bar{U}_\delta = \bar{U}_\delta^{+}.$$
Obviously, $\bar{U}_\delta^{-} \le \bar{U}_\delta \le \bar{U}_\delta^{+}$ and $\bar{U}_\delta^{-} \le \bar{U}_\delta^{+-}$ by their definitions. It remains to show that $\bar{U}_\delta^{+-} \le \bar{U}_\delta$, that $\bar{U}_\delta^{+} \le \bar{U}_\delta$, and that $\bar{U}_\delta^{+-} \le \bar{U}_\delta^{-}$.
Suppose $\bar{U}_\delta^{+-} > \bar{U}_\delta + \gamma$ for some $\gamma > 0$. Then by definition of $\bar{U}_\delta^{+-}$,
$$\liminf_{n\to\infty}\Pr[A_n < \bar{U}_\delta + \gamma] < \delta,$$
which violates the definition of $\bar{U}_\delta$. This completes the proof of $\bar{U}_\delta^{+-} \le \bar{U}_\delta$. To prove that $\bar{U}_\delta^{+} \le \bar{U}_\delta$, note that from the definition of $\bar{U}_\delta^{+}$, we have that for any $\varepsilon > 0$, $\liminf_{n\to\infty}\Pr[A_n < \bar{U}_\delta^{+} - \varepsilon] \le \delta$ and hence $\liminf_{n\to\infty}\Pr[A_n \le \bar{U}_\delta^{+} - 2\varepsilon] \le \delta$, implying that $\bar{U}_\delta^{+} - 2\varepsilon \le \bar{U}_\delta$. Since this holds for every $\varepsilon > 0$, we obtain $\bar{U}_\delta^{+} = \sup_{\varepsilon>0}[\bar{U}_\delta^{+} - 2\varepsilon] \le \bar{U}_\delta$, which completes the proof. Proving that $\bar{U}_\delta^{+-} \le \bar{U}_\delta^{-}$ follows a similar argument.
It is worth noting that $\bar{U}_\delta^{-} = \lim_{\xi\uparrow\delta}\bar{U}_\xi$. Their equality can be proved by first observing that $\bar{U}_\delta^{-} \ge \lim_{\xi\uparrow\delta}\bar{U}_\xi$ by their definitions, and then assuming that $\gamma := \bar{U}_\delta^{-} - \lim_{\xi\uparrow\delta}\bar{U}_\xi > 0$. Then $\bar{U}_\xi < \bar{U}_\xi + \gamma/2 \le \bar{U}_\delta^{-} - \gamma/2$ implies that $\underline{u}(\bar{U}_\delta^{-} - \gamma/2) > \xi$ for $\xi$ arbitrarily close to $\delta$ from below, which in turn implies $\underline{u}(\bar{U}_\delta^{-} - \gamma/2) \ge \delta$, contradicting the definition of $\bar{U}_\delta^{-}$. Throughout, we will interchangeably use $\bar{U}_\delta^{-}$ and $\lim_{\xi\uparrow\delta}\bar{U}_\xi$ for convenience.
The final note in this observation is that $\bar{U}_\delta^{+-}$ and $\bar{U}_\delta^{+}$ will not be used in defining our general information measures. They are introduced only for mathematical interest. □
Based on the above definitions, the liminf in probability $\underline{U}$ of $\{A_n\}_{n=1}^{\infty}$ [26], which is defined as the largest extended real number such that for all $\xi > 0$,
$$\lim_{n\to\infty}\Pr[A_n \le \underline{U} - \xi] = 0,$$
satisfies³
$$\underline{U} = \lim_{\delta\downarrow 0}\underline{U}_\delta = \underline{U}_0.$$
³ It is obvious from their definitions that
$$\lim_{\delta\downarrow 0}\underline{U}_\delta \ge \underline{U}_0 \ge \underline{U}.$$
The equality of $\lim_{\delta\downarrow 0}\underline{U}_\delta$ and $\underline{U}$ can be proved by contradiction by first assuming
$$\gamma := \lim_{\delta\downarrow 0}\underline{U}_\delta - \underline{U} > 0.$$
Then $\bar{u}(\underline{U} + \gamma/2) \le \delta$ for arbitrarily small $\delta > 0$, which immediately implies $\bar{u}(\underline{U} + \gamma/2) = 0$, contradicting the definition of $\underline{U}$.
Also, the limsup in probability $\bar{U}$ (cf. [26]), defined as the smallest extended real number such that for all $\xi > 0$,
$$\lim_{n\to\infty}\Pr[A_n \ge \bar{U} + \xi] = 0,$$
is exactly⁴
$$\bar{U} = \lim_{\delta\uparrow 1}\bar{U}_\delta = \sup\{\theta : \underline{u}(\theta) < 1\}.$$
It readily follows from the above definitions that
$$\underline{U} \le \underline{U}_\delta \le \bar{U}_\delta \le \bar{U}$$
for $\delta \in [0,1)$.
Remark that $\underline{U}_\delta$ and $\bar{U}_\delta$ always exist. Furthermore, if $\underline{U}_\delta = \bar{U}_\delta$ for all $\delta$ in $[0,1]$, the sequence of random variables $A_n$ converges in distribution to a random variable $A$, provided the sequence of $A_n$ is tight. Finally, for a better understanding of the quantities defined above, we depict them in Figure 1.1.
⁴ Since $1 = \lim_{n\to\infty}\Pr\{A_n < \bar{U} + \xi\} \le \lim_{n\to\infty}\Pr\{A_n \le \bar{U} + \xi\} = \underline{u}(\bar{U} + \xi)$, it is straightforward that
$$\bar{U} \ge \sup\{\theta : \underline{u}(\theta) < 1\} = \lim_{\delta\uparrow 1}\bar{U}_\delta.$$
The equality of $\bar{U}$ and $\lim_{\delta\uparrow 1}\bar{U}_\delta$ can be proved by contradiction by first assuming that $\gamma := \bar{U} - \lim_{\delta\uparrow 1}\bar{U}_\delta > 0$. Then $1 \ge \underline{u}(\bar{U} - \gamma/2) > \delta$ for $\delta$ arbitrarily close to 1, which implies $\underline{u}(\bar{U} - \gamma/2) = 1$. Accordingly, by
$$1 \ge \liminf_{n\to\infty}\Pr\{A_n < \bar{U} - \gamma/4\} \ge \liminf_{n\to\infty}\Pr\{A_n \le \bar{U} - \gamma/2\} = \underline{u}(\bar{U} - \gamma/2) = 1,$$
we obtain the desired contradiction.
Figure 1.1: The asymptotic CDFs (spectrums) of a sequence of random variables $\{A_n\}_{n=1}^{\infty}$ and their quantiles: $\bar{u}(\cdot)$ = sup-spectrum of $\{A_n\}$, $\underline{u}(\cdot)$ = inf-spectrum of $\{A_n\}$, $\underline{U}_\delta$ = quantile of $\bar{u}(\cdot)$, $\bar{U}_\delta$ = quantile of $\underline{u}(\cdot)$, $\underline{U} = \lim_{\delta\downarrow 0}\underline{U}_\delta = \underline{U}_0$, $\underline{U}_{1^-} = \lim_{\xi\uparrow 1}\underline{U}_\xi$, $\bar{U} = \lim_{\delta\uparrow 1}\bar{U}_\delta$.
and
$$\underline{(u+v)}(\theta) := \liminf_{n\to\infty}\Pr\{A_n + B_n \le \theta\}.$$
$$(U+V)_{\delta+\gamma} \ge \underline{U}_\delta + V_\gamma, \tag{1.2.1}$$
and
$$(U+V)_{\delta+\gamma} \ge \underline{U}_\delta + \bar{V}_\gamma. \tag{1.2.2}$$
4. For $\delta \ge 0$, $\gamma \ge 0$, and $\delta + \gamma \le 1$,
and
$$(U+V)_\delta \le \bar{U}_{\delta+\gamma} + \bar{V}_{(1-\gamma)}. \tag{1.2.4}$$
Proof: The proof of Property 1 follows directly from the definitions of $\underline{U}_\delta$ and $\bar{U}_\delta$ and the fact that the inf-spectrum and the sup-spectrum are non-decreasing in $\delta$.
Property 2 can be proved by contradiction as follows. Suppose $\lim_{\delta\downarrow 0}\underline{U}_\delta > \underline{U}_0 + \varepsilon$ for some $\varepsilon > 0$. Then for any $\delta > 0$,
$$\bar{u}(\underline{U}_0 + \varepsilon/2) \le \delta.$$
Since the above inequality holds for every $\delta > 0$, and $\bar{u}(\cdot)$ is a non-negative function, we obtain $\bar{u}(\underline{U}_0 + \varepsilon/2) = 0$, which contradicts the definition of $\underline{U}_0$. We can prove $\lim_{\delta\downarrow 0}\bar{U}_\delta = \bar{U}_0$ in a similar fashion.
To show (1.2.1), we observe that for $\alpha > 0$,
$$(U+V)_{\delta+\gamma} \ge \underline{U}_\delta + V_\gamma - 2\alpha.$$
The proof is completed by noting that $\alpha$ can be made arbitrarily small.
Similarly, we note that for $\alpha > 0$,
$$\liminf_{n\to\infty}\Pr\{A_n + B_n \le \underline{U}_\delta + \bar{V}_\gamma - 2\alpha\} \le \liminf_{n\to\infty}\left[\Pr\{A_n \le \underline{U}_\delta - \alpha\} + \Pr\{B_n \le \bar{V}_\gamma - \alpha\}\right] \le \limsup_{n\to\infty}\Pr\{A_n \le \underline{U}_\delta - \alpha\} + \liminf_{n\to\infty}\Pr\{B_n \le \bar{V}_\gamma - \alpha\} \le \delta + \gamma,$$
(Note that the case $\gamma = 1$ is not allowed here because it results in $\underline{U}_1 = (-V)_1 = \infty$, and the subtraction between two infinite terms is undefined. That is why we need to exclude the case of $\gamma = 1$ in the subsequent proof.) The proof is then completed by showing that
$$-(-V)_\gamma \le \bar{V}_{(1-\gamma)}. \tag{1.2.5}$$
By definition,
$$(-v)(\theta) := \limsup_{n\to\infty}\Pr\{-B_n \le \theta\} = 1 - \liminf_{n\to\infty}\Pr\{B_n < -\theta\}.$$
Then
$$\begin{aligned}
\bar{V}_{(1-\gamma)} &:= \sup\{\theta : \underline{v}(\theta) \le 1-\gamma\} \\
&= \sup\{\theta : \liminf_{n\to\infty}\Pr[B_n \le \theta] \le 1-\gamma\} \\
&\ge \sup\{\theta : \liminf_{n\to\infty}\Pr[B_n < \theta] < 1-\gamma\} \quad \text{(cf. footnote 1)} \\
&= \sup\{-\theta : \liminf_{n\to\infty}\Pr[B_n < -\theta] < 1-\gamma\} \\
&= \sup\{-\theta : 1 - \limsup_{n\to\infty}\Pr[-B_n \le \theta] < 1-\gamma\} \\
&= \sup\{-\theta : \limsup_{n\to\infty}\Pr[-B_n \le \theta] > \gamma\} \\
&= -\inf\{\theta : \limsup_{n\to\infty}\Pr[-B_n \le \theta] > \gamma\} \\
&= -\sup\{\theta : \limsup_{n\to\infty}\Pr[-B_n \le \theta] \le \gamma\} \\
&= -\sup\{\theta : (-v)(\theta) \le \gamma\} \\
&= -(-V)_\gamma.
\end{aligned}$$
Finally, to show (1.2.4), we again note that it is trivially true for $\gamma = 1$. We then observe from (1.2.2) that $(U+V)_\delta + (-V)_\gamma \le (U+V-V)_{\delta+\gamma} = \bar{U}_{\delta+\gamma}$. Hence,
$$(U+V)_\delta \le \bar{U}_{\delta+\gamma} - (-V)_\gamma.$$
Using (1.2.5), we have the desired result. □
In Definitions 1.1 and 1.2, if we let the random variable $A_n$ equal the normalized entropy density⁵
$$\frac{1}{n}h_{X^n}(X^n) := -\frac{1}{n}\log P_{X^n}(X^n)$$
⁵ The random variable $h_{X^n}(X^n) := -\log P_{X^n}(X^n)$ is named the entropy density. Hence, the normalized entropy density is equal to $h_{X^n}(X^n)$ divided (or normalized) by the blocklength $n$.
of an arbitrary source
$$\boldsymbol{X} = \left\{X^n = \left(X_1^{(n)}, X_2^{(n)}, \ldots, X_n^{(n)}\right)\right\}_{n=1}^{\infty},$$
we obtain the
$$\delta\text{-inf-entropy rate } \underline{H}_\delta(\boldsymbol{X}) = \text{quantile of the sup-spectrum of } \tfrac{1}{n}h_{X^n}(X^n),$$
$$\delta\text{-sup-entropy rate } \bar{H}_\delta(\boldsymbol{X}) = \text{quantile of the inf-spectrum of } \tfrac{1}{n}h_{X^n}(X^n).$$
Note that the inf-entropy-rate $\underline{H}(\boldsymbol{X})$ and the sup-entropy-rate $\bar{H}(\boldsymbol{X})$ introduced in [25, 26] are special cases of the δ-inf/sup-entropy rate measures. Similarly, letting $A_n$ equal the normalized information density
$$\frac{1}{n}i_{X^nW^n}(X^n;Y^n) = \frac{1}{n}i_{X^n,Y^n}(X^n;Y^n) := \frac{1}{n}\log\frac{P_{X^n,Y^n}(X^n,Y^n)}{P_{X^n}(X^n)P_{Y^n}(Y^n)},$$
we obtain the
$$\delta\text{-inf-information rate } \underline{I}_\delta(\boldsymbol{X};\boldsymbol{Y}) = \text{quantile of the sup-spectrum of } \tfrac{1}{n}i_{X^nW^n}(X^n;Y^n),$$
$$\delta\text{-sup-information rate } \bar{I}_\delta(\boldsymbol{X};\boldsymbol{Y}) = \text{quantile of the inf-spectrum of } \tfrac{1}{n}i_{X^nW^n}(X^n;Y^n).$$
we can replace $A_n$ in Definitions 1.1 and 1.2 by the normalized log-likelihood ratio
$$\frac{1}{n}d_{X^n}(X^n\|\hat{X}^n) := \frac{1}{n}\log\frac{P_{X^n}(X^n)}{P_{\hat{X}^n}(X^n)}$$
to obtain the δ-inf/sup-divergence rates, denoted respectively by $\underline{D}_\delta(\boldsymbol{X}\|\hat{\boldsymbol{X}})$ and $\bar{D}_\delta(\boldsymbol{X}\|\hat{\boldsymbol{X}})$, as
$$\underline{D}_\delta(\boldsymbol{X}\|\hat{\boldsymbol{X}}) = \text{quantile of the sup-spectrum of } \tfrac{1}{n}d_{X^n}(X^n\|\hat{X}^n),$$
$$\bar{D}_\delta(\boldsymbol{X}\|\hat{\boldsymbol{X}}) = \text{quantile of the inf-spectrum of } \tfrac{1}{n}d_{X^n}(X^n\|\hat{X}^n).$$
The above information measure definitions are summarized in Tables 1.1–1.3.
Mutual Information Measures

arbitrary system: channel $P_{\boldsymbol{W}} = P_{\boldsymbol{Y}|\boldsymbol{X}}$ with input $\boldsymbol{X}$ and output $\boldsymbol{Y}$
normalized information density: $\frac{1}{n}i_{X^nW^n}(X^n;Y^n) := \frac{1}{n}\log\frac{P_{X^n,Y^n}(X^n,Y^n)}{P_{X^n}(X^n)\times P_{Y^n}(Y^n)}$
information sup-spectrum: $\bar{i}(\theta) := \limsup_{n\to\infty}\Pr\left[\frac{1}{n}i_{X^nW^n}(X^n;Y^n) \le \theta\right]$
information inf-spectrum: $\underline{i}(\theta) := \liminf_{n\to\infty}\Pr\left[\frac{1}{n}i_{X^nW^n}(X^n;Y^n) \le \theta\right]$
δ-inf-information rate: $\underline{I}_\delta(\boldsymbol{X};\boldsymbol{Y}) := \sup\{\theta : \bar{i}(\theta) \le \delta\}$
δ-sup-information rate: $\bar{I}_\delta(\boldsymbol{X};\boldsymbol{Y}) := \sup\{\theta : \underline{i}(\theta) \le \delta\}$
sup-information rate: $\bar{I}(\boldsymbol{X};\boldsymbol{Y}) := \lim_{\delta\uparrow 1}\bar{I}_\delta(\boldsymbol{X};\boldsymbol{Y})$
Lemma 1.5 For a source X with finite alphabet X and arbitrary sources Y
and Z, the following properties hold.
$$\bar{I}_\gamma(\boldsymbol{X};\boldsymbol{Y}) \le \bar{H}_{\delta+\gamma}(\boldsymbol{Y}) - \underline{H}_\delta(\boldsymbol{Y}|\boldsymbol{X}), \tag{1.4.3}$$
$$\underline{I}_{\delta+\gamma}(\boldsymbol{X};\boldsymbol{Y}) \ge \underline{H}_\delta(\boldsymbol{Y}) - \bar{H}_{(1-\gamma)}(\boldsymbol{Y}|\boldsymbol{X}), \tag{1.4.4}$$
and
$$\bar{I}_{\delta+\gamma}(\boldsymbol{X};\boldsymbol{Y}) \ge \bar{H}_\delta(\boldsymbol{Y}) - \bar{H}_{(1-\gamma)}(\boldsymbol{Y}|\boldsymbol{X}). \tag{1.4.5}$$
(Note that the case of $(\delta,\gamma) = (1,0)$ holds for (1.4.1) and (1.4.2), and the case of $(\delta,\gamma) = (0,1)$ holds for (1.4.3), (1.4.4) and (1.4.5).)
4. $0 \le \underline{H}_\delta(\boldsymbol{X}) \le \bar{H}_\delta(\boldsymbol{X}) \le \log|\mathcal{X}|$ for $\delta\in[0,1)$, where each $X_i^{(n)}$ takes values in $\mathcal{X}$ for $i = 1,\ldots,n$ and $n = 1,2,\ldots$.
≤ e−νn , (1.4.6)
With this fact and for 0 ≤ δ < 1, 0 < γ < 1 and δ + γ ≤ 1, (1.4.1) follows
directly from (1.2.1); (1.4.2) and (1.4.3) follow from (1.2.2); (1.4.4) follows from
(1.2.3); and (1.4.5) follows from (1.2.4). (Note that in (1.4.1), if δ = 0 and γ = 1,
then the right-hand-side would be a difference between two infinite terms, which
is undefined; hence, such case is therefore excluded by the condition that γ < 1.
For the same reason, we also exclude the cases of γ = 0 and δ = 1.) Now when
0 ≤ δ < 1 and γ = 0, we can confirm the validity of (1.4.1), (1.4.2) and (1.4.3),
again, by (1.2.1) and (1.2.2), and also examine the validity of (1.4.4) and (1.4.5)
by directly taking these values inside. The validity of (1.4.1) and (1.4.2) for
(δ, γ) = (1, 0), and the validity of (1.4.3), (1.4.4) and (1.4.5) for (δ, γ) = (0, 1)
can be checked by directly replacing δ and γ with the respective numbers.
Property 4 follows from the facts that H̄δ (·) is non-decreasing in δ, H̄δ (X) ≤
H̄(X), and H̄(X) ≤ log |X |. The last inequality can be proved as follows.
$$\Pr\left[\frac{1}{n}h_{X^n}(X^n) \le \log|\mathcal{X}| + \nu\right] = 1 - P_{X^n}\left\{x^n\in\mathcal{X}^n : \frac{1}{n}\log\frac{P_{X^n}(x^n)}{1/|\mathcal{X}|^n} < -\nu\right\} \ge 1 - e^{-n\nu},$$
where the last step can be obtained by using the same procedure as (1.4.6). Therefore, $\underline{h}(\log|\mathcal{X}| + \nu) = 1$ for any $\nu > 0$, which indicates that $\bar{H}(\boldsymbol{X}) \le \log|\mathcal{X}|$.
Property 5 can be proved using the fact that
$$\frac{1}{n}i_{X^n,Y^n,Z^n}(X^n,Y^n;Z^n) = \frac{1}{n}i_{X^n,Z^n}(X^n;Z^n) + \frac{1}{n}i_{X^n,Y^n,Z^n}(Y^n;Z^n|X^n).$$
By applying (1.2.1) with $\gamma = 0$, and observing $\underline{I}(\boldsymbol{Y};\boldsymbol{Z}|\boldsymbol{X}) \ge 0$, we obtain the desired result. □
Lemma 1.6 (Data processing lemma) Fix $\delta\in[0,1]$. Consider arbitrary sources $\boldsymbol{X}_1$, $\boldsymbol{X}_2$ and $\boldsymbol{X}_3$ such that for every $n$, $X_1^n$ and $X_3^n$ are conditionally independent given $X_2^n$. Then
$$\underline{I}_\delta(\boldsymbol{X}_1;\boldsymbol{X}_3) \le \underline{I}_\delta(\boldsymbol{X}_1;\boldsymbol{X}_2).$$
Proof: By Property 5 of Lemma 1.5, we get
$$\underline{I}_\delta(\boldsymbol{X}_1;\boldsymbol{X}_3) \le \underline{I}_\delta(\boldsymbol{X}_1;(\boldsymbol{X}_2,\boldsymbol{X}_3)) = \underline{I}_\delta(\boldsymbol{X}_1;\boldsymbol{X}_2),$$
where the equality holds because
$$\frac{1}{n}\log\frac{P_{X_1^n,X_2^n,X_3^n}(x_1^n,x_2^n,x_3^n)}{P_{X_1^n}(x_1^n)P_{X_2^n,X_3^n}(x_2^n,x_3^n)} = \frac{1}{n}\log\frac{P_{X_1^n,X_2^n}(x_1^n,x_2^n)}{P_{X_1^n}(x_1^n)P_{X_2^n}(x_2^n)}. \qquad\square$$
Lemma 1.7 (Optimality of independent inputs) Fix $\delta\in[0,1)$. Consider a channel with finite input and output alphabets, $\mathcal{X}$ and $\mathcal{Y}$, respectively, and with distribution $P_{W^n}(y^n|x^n) = P_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^{n} P_{Y_i|X_i}(y_i|x_i)$ for all $n$,
1 PW n (y n |xn ) 1 PY n (y n ) 1 PW n (y n |xn )
log + log = log .
n PY n (y n ) n PY n (y n ) n PY n (y n )
In other words,
1 PX n W n (xn , y n) 1 PY n (y n ) 1 PX n W n (xn , y n )
log + log = log ,
n PX n (xn )PY n (y n ) n PY n (y n ) n PX n (xn )PY n (y n )
and
Zδ (X̄; Ȳ ) := sup{θ : z̄(θ) ≤ δ},
we obtain from (1.2.1) (with γ = 0) that
an independent process and has the same marginal statistics6 as Y . Hence,
n
1
PX i Wi (Xi , Yi ) PX i Wi (Xi , Yi )
Pr log − EXi Wi log >γ
n P X (X i )P Y (Y i ) P X (X i )P Y (Y i )
i=1 i i i i
n
1
PXi Wi (Xi , Yi ) PXi Wi (Xi , Yi )
= Pr log − EXi Wi log >γ
n PXi (Xi )PYi (Yi ) PXi (Xi )PYi (Yi )
i=1
→ 0,
for any γ > 0, where the convergence to zero follows from Chebyshev’s inequality
and the finiteness of the channel alphabets (or more directly, the finiteness of
individual variances). Consequently, z̄(θ) = 1 for
1 1
n n
PXi Wi (Xi , Yi )
θ > lim inf EXi Wi log = lim inf I(Xi ; Yi );
n→∞ n PXi (Xi )PYi (Yi ) n→∞ n
i=1 i=1
n
and z̄(θ) = 0 for θ < lim inf n→∞ (1/n) i=1 I(Xi ; Yi ), which implies
1
n
Z(X̄; Ȳ ) = Zδ (X̄; Ȳ ) = lim inf I(Xi ; Yi )
n→∞ n
i=1
1
n
I (X̄; Ȳ ) = I δ (X̄; Ȳ ) = lim inf I(Xi ; Yi )
¯ ¯ n→∞ n
i=1
6
The claim can be justified as follows. First, for y1 ∈ Y,
PY1 (y1 ) = PX n (xn )PW n (y n |xn )
y2n ∈Y n−1 xn ∈X n
= PX1 (x1 )PW1 (y1 |x1 ) PX2n |X1 (xn2 |x1 )PW2n (y2n |xn2 )
x1 ∈X y2n ∈Y n−1 xn
2 ∈X
n−1
= PX1 (x1 )PW1 (y1 |x1 ) PX2n W2n |X1 (xn2 , y2n |x1 )
x1 ∈X y2n ∈Y n−1 xn
2 ∈X
n−1
= PX1 (x1 )PW1 (y1 |x1 ),
x1 ∈X
where xn2 := (x2 , · · · , xn ), y2n := (y2 , · · · , yn ) and W2n is the channel law between input and
output tuples from time 2 to n. It can be similarly shown that for any time index 1 ≤ i ≤ n,
PYi (yi ) = PXi (xi )PWi (yi |xi )
xi ∈X
n
for xi ∈ X and yi ∈ Y. Hence, for a channel with PW n (y n |xn ) = i=1 PWi (yi |xi ), the output
marginal only depends on the respective input marginal.
for any δ ∈ [0, 1). Accordingly,
Consider a channel with binary input and output alphabets, i.e., $\mathcal{X} = \mathcal{Y} = \{0,1\}$, and let every channel output be given by
$$Y_i^{(n)} = X_i^{(n)} \oplus Z_i^{(n)}.$$
Given that the channel noise is additive and independent of the channel input, $\bar{H}_\gamma(\boldsymbol{Y}|\boldsymbol{X}) = \bar{H}_\gamma(\boldsymbol{Z})$, which is independent of $\boldsymbol{X}$. Hence,
where the last step follows from Property 4 of Lemma 1.5. Since $\log(2) - \bar{H}_\gamma(\boldsymbol{Z})$ is non-increasing in $\gamma$,
On the other hand, we can derive the lower bound to $\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y})$ in (1.5.1) from the fact that $\boldsymbol{Y}$ is Bernoulli uniform. We thus obtain that for $\varepsilon\in(0,1]$,
To summarize,
$$\log(2) - \bar{H}_{(1-\varepsilon)}(\boldsymbol{Z}) \le \underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) \le \log(2) - \lim_{\gamma\uparrow(1-\varepsilon)}\bar{H}_\gamma(\boldsymbol{Z}) \quad\text{for } \varepsilon\in(0,1)$$
and
$$\underline{I}(\boldsymbol{X};\boldsymbol{Y}) = \underline{I}_0(\boldsymbol{X};\boldsymbol{Y}) = \log(2) - \bar{H}(\boldsymbol{Z}).$$
An alternative method to compute $\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y})$ is to derive its corresponding sup-spectrum in terms of the inf-spectrum of the noise process. Under the equally likely Bernoulli input $\boldsymbol{X}$, we can write
$$\begin{aligned}
\bar{i}(\theta) &:= \limsup_{n\to\infty}\Pr\left[\frac{1}{n}\log\frac{P_{Y^n|X^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \le \theta\right] \\
&= \limsup_{n\to\infty}\Pr\left[\frac{1}{n}\log P_{Z^n}(Z^n) - \frac{1}{n}\log P_{Y^n}(Y^n) \le \theta\right] \\
&= \limsup_{n\to\infty}\Pr\left[\frac{1}{n}\log P_{Z^n}(Z^n) \le \theta - \log(2)\right] \\
&= \limsup_{n\to\infty}\Pr\left[-\frac{1}{n}\log P_{Z^n}(Z^n) \ge \log(2) - \theta\right] \\
&= 1 - \liminf_{n\to\infty}\Pr\left[-\frac{1}{n}\log P_{Z^n}(Z^n) < \log(2) - \theta\right].
\end{aligned}$$
Hence, for $\varepsilon\in(0,1)$,
where (1.5.3) follows from the fact described in Footnote 1. Therefore,
$$\log(2) - \bar{H}_{(1-\varepsilon)}(\boldsymbol{Z}) \le \underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) \le \log(2) - \lim_{\gamma\uparrow(1-\varepsilon)}\bar{H}_\gamma(\boldsymbol{Z}) \quad\text{for } \varepsilon\in(0,1).$$
By taking $\varepsilon\downarrow 0$, we obtain
[Figures 1.2 and 1.3: step plots of the noise inf-spectrum $\underline{h}(\theta)$, with jumps of heights $\beta$ and $1-\beta$ at $0$ and $h_b(p)$, and of the resulting sup-spectrum $\bar{i}(\theta)$, with jumps at $1-h_b(p)$ and $1$.]
with any of its random variables equaling one with probability $p$. Then as $n\to\infty$, the sequence of random variables $(1/n)h_{Z^n}(Z^n)$ converges to $0$ and $h_b(p)$ with respective masses $\beta$ and $1-\beta$, where $h_b(p) := -p\log p - (1-p)\log(1-p)$ is the binary entropy function. The resulting $\underline{h}(\theta)$ is depicted in Figure 1.2. From (1.5.3), we obtain $\bar{i}(\theta)$ as shown in Figure 1.3.
Therefore,
$$\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) = \begin{cases} 1 - h_b(p), & \text{if } 0 < \varepsilon < 1-\beta; \\ 1, & \text{if } 1-\beta \le \varepsilon < 1. \end{cases}$$
Example 1.9 If
$$\boldsymbol{Z} = \left\{Z^n = \left(Z_1^{(n)}, \ldots, Z_n^{(n)}\right)\right\}_{n=1}^{\infty}$$
is a non-stationary binary independent sequence with
$$\Pr\left[Z_i^{(n)} = 0\right] = 1 - \Pr\left[Z_i^{(n)} = 1\right] = p_i,$$
$-\log P_{Z_i^{(n)}}\big(Z_i^{(n)}\big)$, namely
$$\mathrm{Var}\left[-\log P_{Z_i^{(n)}}\big(Z_i^{(n)}\big)\right] \le E\left[\left(\log P_{Z_i^{(n)}}\big(Z_i^{(n)}\big)\right)^2\right] \le \sup_{0<p_i<1}\left[p_i(\log p_i)^2 + (1-p_i)(\log(1-p_i))^2\right] < \log(2),$$
$$\bar{H}_{(1-\varepsilon)}(\boldsymbol{Z}) = \bar{H}(\boldsymbol{Z}) = \limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n} H\big(Z_i^{(n)}\big) = \limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n} h_b(p_i)$$
for $\varepsilon\in(0,1]$, and infinity for $\varepsilon = 0$, where $h_b(p_i) = -p_i\log(p_i) - (1-p_i)\log(1-p_i)$. Consequently,
$$\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) = \begin{cases} 1 - \bar{H}(\boldsymbol{Z}) = 1 - \limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n} h_b(p_i), & \text{for } \varepsilon\in[0,1), \\ \infty, & \text{for } \varepsilon = 1. \end{cases}$$
[Figures: the clustering points of $(1/n)h_{Z^n}(Z^n)$ lie between $\underline{H}(\boldsymbol{Z})$ and $\bar{H}(\boldsymbol{Z})$; correspondingly, those of the normalized information density lie between $\log(2) - \bar{H}(\boldsymbol{Z})$ and $\log(2) - \underline{H}(\boldsymbol{Z})$.]
Chapter 2
is the minimum data compression rate (nats per source symbol) for arbitrarily
small data compression error for block coding of the stationary-ergodic source
[2]. For more complicated situations where the sources become non-stationary, the quantity $\lim_{n\to\infty}(1/n)H(X^n)$ may not exist and can no longer be used to characterize the source compression. This results in the need to establish a new entropy measure which appropriately characterizes the operational limits of arbitrary stochastic systems; this was done in the previous chapter.
The role of a source code is to represent the output of a source efficiently.
Specifically, a source code is designed to minimize the source description rate of the code subject to a fidelity criterion constraint. One commonly used fidelity
criterion constraint is to place an upper bound on the probability of decoding
error Pe . If Pe is made arbitrarily small, we obtain a traditional (almost) error-
free source coding system.1 Lossy data compression codes are a larger class
of codes in the sense that the fidelity criterion used in the coding scheme is a
general distortion measure. In this chapter, we only demonstrate the bounded-
error data compression theorems for arbitrary (not necessarily stationary ergodic,
information stable, etc.) sources. The general lossy data compression theorems
will be introduced in subsequent chapters.
¹ Recall that completely error-free data compression is required only for variable-length codes. A lossless data compression block code only dictates that the compression error can be made arbitrarily small, i.e., asymptotically error-free.
2.1 Fixed-length data compression codes for arbitrary
sources
Definition 2.1 (cf. Definition 3.2 and its associated footnote in [2]) An
(n, M) block code for data compression is a set
∼Cn := {c1 , c2 , . . . , cM }
Definition 2.2 Fix $\varepsilon\in[0,1]$. $R$ is an ε-achievable data compression rate for a source $\boldsymbol{X}$ if there exists a sequence of block data compression codes $\{∼C_n = (n, M_n)\}_{n=1}^{\infty}$ with
$$\limsup_{n\to\infty}\frac{1}{n}\log M_n \le R$$
and
$$\limsup_{n\to\infty} P_e(∼C_n) \le \varepsilon,$$
where $P_e(∼C_n) := \Pr(X^n \notin ∼C_n)$ is the probability of decoding error.
The infimum of all ε-achievable data compression rates for $\boldsymbol{X}$ is denoted by $T_\varepsilon(\boldsymbol{X})$.
² Actually, the theorems introduced also apply to sources with countable alphabets. We assume finite alphabets in order to avoid uninteresting cases (such as $\bar{H}_\varepsilon(\boldsymbol{X}) = \infty$) that might arise with countable alphabets.
³ In [2, Def. 3.2], the (n, M) block data compression code is defined by M codewords, where each codeword represents a group of sourcewords of length n. However, we can actually pick one representative sourceword from each group, and equivalently define the code using these M representative sourcewords. Later, it will be shown that this viewpoint facilitates the proof of the general source coding theorem.
Lemma 2.3 (Lemma 1.5 in [25]) Fix a positive integer $n$. There exists an $(n, M_n)$ source block code $∼C_n$ for $P_{X^n}$ such that its error probability satisfies
$$P_e(∼C_n) \le \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \frac{1}{n}\log M_n\right].$$
Lemma 2.4 (Lemma 1.6 in [25]) Every $(n, M_n)$ source block code $∼C_n$ for $P_{X^n}$ satisfies
$$P_e(∼C_n) \ge \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \frac{1}{n}\log M_n + \gamma\right] - \exp\{-n\gamma\},$$
25
Clearly,
$$\begin{aligned}
\Pr\{X^n\in ∼C_n\} &= \Pr\left\{X^n\in ∼C_n \text{ and } \frac{1}{n}h_{X^n}(X^n) \le \frac{1}{n}\log M_n + \gamma\right\} + \Pr\left\{X^n\in ∼C_n \text{ and } \frac{1}{n}h_{X^n}(X^n) > \frac{1}{n}\log M_n + \gamma\right\} \\
&\le \Pr\left\{\frac{1}{n}h_{X^n}(X^n) \le \frac{1}{n}\log M_n + \gamma\right\} + \Pr\left\{X^n\in ∼C_n \text{ and } \frac{1}{n}h_{X^n}(X^n) > \frac{1}{n}\log M_n + \gamma\right\} \\
&= \Pr\left\{\frac{1}{n}h_{X^n}(X^n) \le \frac{1}{n}\log M_n + \gamma\right\} + \sum_{x^n\in ∼C_n} P_{X^n}(x^n)\cdot\mathbf{1}\left\{\frac{1}{n}h_{X^n}(x^n) > \frac{1}{n}\log M_n + \gamma\right\} \\
&= \Pr\left\{\frac{1}{n}h_{X^n}(X^n) \le \frac{1}{n}\log M_n + \gamma\right\} + \sum_{x^n\in ∼C_n} P_{X^n}(x^n)\cdot\mathbf{1}\left\{P_{X^n}(x^n) < \frac{1}{M_n}\exp\{-n\gamma\}\right\} \\
&< \Pr\left\{\frac{1}{n}h_{X^n}(X^n) \le \frac{1}{n}\log M_n + \gamma\right\} + |∼C_n|\,\frac{1}{M_n}\exp\{-n\gamma\} \\
&= \Pr\left\{\frac{1}{n}h_{X^n}(X^n) \le \frac{1}{n}\log M_n + \gamma\right\} + \exp\{-n\gamma\}. \qquad\square
\end{aligned}$$
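Lemmas 2.3 and 2.4 are easy to check numerically for simple sources. The following Python sketch (our own illustration, not from the text) does so for an i.i.d. Bernoulli(p) source: the best (n, M) block code keeps the M most probable sequences, and its error probability is compared with the upper bound of Lemma 2.3 and the lower bound of Lemma 2.4. The parameter choices (n, p, R, γ) are arbitrary.

    # Numerical check of the bounds in Lemmas 2.3 and 2.4 for a Bernoulli(p) source.
    import math

    def best_code_error(n, p, M):
        """Error of the best (n, M) code: keep the M most probable sequences."""
        # Sequences with k ones all have probability p^k (1-p)^(n-k).
        types = sorted(range(n + 1),
                       key=lambda k: -(k * math.log(p) + (n - k) * math.log(1 - p)))
        kept, covered = 0.0, 0
        for k in types:
            cnt = math.comb(n, k)
            prob = math.exp(k * math.log(p) + (n - k) * math.log(1 - p))
            take = min(cnt, M - covered)
            kept += take * prob
            covered += take
            if covered >= M:
                break
        return 1.0 - kept

    def spectrum_tail(n, p, thresh):
        """Pr[(1/n) h_{X^n}(X^n) > thresh] for the Bernoulli(p) source."""
        total = 0.0
        for k in range(n + 1):
            h = -(k * math.log(p) + (n - k) * math.log(1 - p)) / n
            if h > thresh:
                total += math.comb(n, k) * math.exp(k * math.log(p) + (n - k) * math.log(1 - p))
        return total

    n, p, R, gamma = 200, 0.3, 0.65, 0.03        # R in nats/sample, slightly above H ~ 0.611
    M = int(math.exp(n * R))
    pe_opt = best_code_error(n, p, M)
    upper = spectrum_tail(n, p, math.log(M) / n)                                # Lemma 2.3
    lower = spectrum_tail(n, p, math.log(M) / n + gamma) - math.exp(-n * gamma)  # Lemma 2.4
    print(f"lower bound {lower:.4f} <= optimal Pe {pe_opt:.4f} <= upper bound {upper:.4f}")

Since the optimal code is at least as good as the code guaranteed by Lemma 2.3, and every code obeys Lemma 2.4, the printed inequalities must hold.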
We now apply Lemmas 2.3 and 2.4 to prove a general source coding theorem for block codes.
Proof: The case of ε = 1 follows directly from its definition; hence, the proof
only focuses on the case of ε ∈ [0, 1).
Lemma 2.3 ensures the existence (for any $\gamma > 0$) of a source block code $∼C_n = (n, M_n = \lceil\exp\{n(\lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X}) + \gamma)\}\rceil)$ with error probability
$$P_e(∼C_n) \le \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \frac{1}{n}\log M_n\right] \le \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X}) + \gamma\right].$$
Therefore,
$$\limsup_{n\to\infty} P_e(∼C_n) \le \limsup_{n\to\infty}\Pr\left[\frac{1}{n}h_{X^n}(X^n) > \lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X}) + \gamma\right] = 1 - \liminf_{n\to\infty}\Pr\left[\frac{1}{n}h_{X^n}(X^n) \le \lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X}) + \gamma\right] \le 1 - (1-\varepsilon) = \varepsilon,$$
where the last inequality follows from
$$\lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X}) = \sup\left\{\theta : \liminf_{n\to\infty}\Pr\left[\frac{1}{n}h_{X^n}(X^n) \le \theta\right] < 1-\varepsilon\right\}. \tag{2.1.1}$$
and
$$\limsup_{n\to\infty} P_e(∼C_n) \le \varepsilon. \tag{2.1.3}$$
(2.1.2) implies that
$$\frac{1}{n}\log M_n \le \lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X}) - 2\gamma$$
for all sufficiently large $n$. Hence, for those $n$ satisfying the above inequality, Lemma 2.4 gives
$$P_e(∼C_n) \ge \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \frac{1}{n}\log M_n + \gamma\right] - e^{-n\gamma} \ge \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X}) - 2\gamma + \gamma\right] - e^{-n\gamma}.$$
Therefore,
$$\limsup_{n\to\infty} P_e(∼C_n) \ge 1 - \liminf_{n\to\infty}\Pr\left[\frac{1}{n}h_{X^n}(X^n) \le \lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X}) - \gamma\right] > 1 - (1-\varepsilon) = \varepsilon,$$
where the last inequality follows from (2.1.1). Thus, a contradiction to (2.1.3) is obtained. □
• Note that as ε = 0, limδ↑(1−ε) H̄δ (X) = H̄(X). Hence, the above theorem
generalizes the block source coding theorem in [26], which states that the
minimum achievable fixed-length source coding rate of any finite-alphabet
source is H̄(X).
• Consider the special case where $-(1/n)\log P_{X^n}(X^n)$ converges in probability to a constant $H$, which holds for all information stable sources.⁴ In this case, both the inf- and sup-spectrums of $\boldsymbol{X}$ degenerate to a unit step function:
$$\underline{u}(\theta) = \bar{u}(\theta) = \begin{cases} 1, & \text{if } \theta > H; \\ 0, & \text{if } \theta < H, \end{cases}$$
where $H$ is the source entropy rate. Thus, $\bar{H}_\varepsilon(\boldsymbol{X}) = H$ for all $\varepsilon\in[0,1)$. Hence, the general source coding theorem reduces to the conventional source coding theorem.
• More generally, if −(1/n) log PX n (X n ) converges in probability to a random
variable Z whose cumulative distribution function (CDF) is FZ (·), then the
minimum achievable data compression rate subject to decoding error being
no greater than ε is
lim H̄δ (X) = sup {R : FZ (R) < 1 − ε} .
δ↑(1−ε)
Therefore, the relationship between the code rate and the ultimate optimal
error probability is also clearly defined. We further explore the case in the
next example.
⁴ A source $\boldsymbol{X} = \{X^n = (X_1^{(n)},\ldots,X_n^{(n)})\}_{n=1}^{\infty}$ is said to be information stable if $H(X^n) = E[-\log P_{X^n}(X^n)] > 0$ for all $n$, and
$$\lim_{n\to\infty}\Pr\left[\left|\frac{-\log P_{X^n}(X^n)}{H(X^n)} - 1\right| > \varepsilon\right] = 0$$
for every $\varepsilon > 0$. By this definition, any stationary-ergodic source with finite $n$-fold entropy is information stable; hence, information stability can be viewed as a generalized source model for stationary-ergodic sources.
Example 2.6 Consider a binary source $\boldsymbol{X}$ where each $X^n$ is Bernoulli(Θ) distributed, with Θ a random variable taking values in (0, 1). This is a stationary but non-ergodic source [1]. We can view the source as a mixture of Bernoulli(θ) processes, where the parameter θ ranges over (0, 1) with distribution $P_\Theta$ [1, Corollary 1]. Therefore, it can be shown by the ergodic decomposition theorem (which states that any stationary source can be viewed as a mixture of stationary-ergodic sources) that $-(1/n)\log P_{X^n}(X^n)$ converges in probability to the random variable $Z = h_b(\Theta)$ [1], where $h_b(x) := -x\log_2(x) - (1-x)\log_2(1-x)$ is the binary entropy function. Consequently, the CDF of $Z$ is $F_Z(z) = \Pr\{h_b(\Theta) \le z\}$, and the minimum achievable fixed-length source coding rate with compression error no larger than ε is
$$\sup\{R : F_Z(R) < 1-\varepsilon\} = \sup\{R : \Pr\{h_b(\Theta) \le R\} < 1-\varepsilon\}.$$
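As a concrete illustration (ours, not from the text), the sketch below evaluates this minimum rate numerically when Θ is uniform on (0, 1) — a particular choice of $P_\Theta$ assumed here only for the example — by Monte Carlo estimation of the CDF of $h_b(\Theta)$.

    # Minimum eps-achievable fixed-length coding rate for the mixed Bernoulli source,
    # assuming Theta ~ Uniform(0,1); rates are in bits per source symbol.
    import numpy as np

    rng = np.random.default_rng(1)
    theta = np.clip(rng.uniform(0.0, 1.0, size=1_000_000), 1e-12, 1 - 1e-12)
    hb = -theta * np.log2(theta) - (1 - theta) * np.log2(1 - theta)   # binary entropy of Theta

    def min_rate(eps, samples, grid=np.linspace(0.0, 1.0, 2001)):
        """sup{R : Pr[h_b(Theta) <= R] < 1 - eps}, estimated from samples."""
        cdf = np.searchsorted(np.sort(samples), grid, side="right") / samples.size
        feasible = grid[cdf < 1 - eps]
        return feasible.max() if feasible.size else 0.0

    for eps in (0.0, 0.1, 0.25, 0.5):
        print(f"eps = {eps:4.2f}  ->  minimum rate ~ {min_rate(eps, hb):.3f} bits/symbol")

For ε = 0 this returns (approximately) 1 bit/symbol, since $h_b(\Theta)$ gets arbitrarily close to 1; larger ε allows ignoring the high-entropy part of the mixture and yields strictly smaller rates.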
• The above example, and more generally Theorem 2.5, shows that the strong converse theorem (which states that codes with rate below the entropy rate ultimately have decompression error approaching one, cf. [2, Thm. 3.6]) does not hold in general. However, one can always claim the weak converse statement for arbitrary sources.
Theorem 2.7 (weak converse theorem) For any block code sequence
of ultimate rate R < H̄(X), the probability of block decoding failure Pe
cannot be made arbitrarily small. In other words, there exists ε > 0 such
that Pe is lower bounded by ε infinitely often in blocklength n.
29
n (i.o.) n→∞
Pe −
−−→1 Pe is lower Pe −
−−→0
bounded (i.o. in n) - R
H̄0 (X) H̄(X)
We close this section by remarking that the definition that we adopt for the ε-achievable data compression rate is slightly different from, but equivalent to, the one used in [26, Def. 8]. The definition in [26] also yields the same result, which was separately proved by Steinberg and Verdú as a direct consequence of Theorem 10(a) (or Corollary 3) in [36]. To be precise, they showed that $T_\varepsilon(\boldsymbol{X})$, denoted by $T_e(\varepsilon,\boldsymbol{X})$ in [36], is equal to $\bar{R}_v(2\varepsilon)$ (cf. Def. 17 in [36]). By a simple derivation, we obtain:
$$\begin{aligned}
T_e(\varepsilon,\boldsymbol{X}) = \bar{R}_v(2\varepsilon) &= \inf\left\{\theta : \limsup_{n\to\infty} P_{X^n}\left[-\frac{1}{n}\log P_{X^n}(X^n) > \theta\right] \le \varepsilon\right\} \\
&= \inf\left\{\theta : \liminf_{n\to\infty} P_{X^n}\left[-\frac{1}{n}\log P_{X^n}(X^n) \le \theta\right] \ge 1-\varepsilon\right\} \\
&= \sup\left\{\theta : \liminf_{n\to\infty} P_{X^n}\left[-\frac{1}{n}\log P_{X^n}(X^n) \le \theta\right] < 1-\varepsilon\right\} \\
&= \lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X}).
\end{aligned}$$
Note that Theorem 10(a) in [36] is a lossless data compression theorem for ar-
bitrary sources, which the authors show as a by-product of their results on
finite-precision resolvability theory. Specifically, they proved T0 (X) = S(X)
[26, Thm. 1] and S(X) = H̄(X) [26, Thm. 3], where S(X) is the resolvability5
of an arbitrary source X. Here, we establish Theorem 2.5 in a different and
more direct way.
totic Equipartition Property (AEP) which states that (1/n)hX n (X n ) converges
to H(X) with probability one (and hence in probability). The AEP – which
implies that the probability of the typical set is close to one for sufficiently large
n – also holds for stationary-ergodic sources. It is however invalid for more gen-
eral sources – e.g., non-stationary, non-ergodic sources. We herein demonstrate
a generalized AEP theorem.
1. $$\liminf_{n\to\infty}\Pr\left\{T_n[\bar{H}_\varepsilon(\boldsymbol{X}) - \delta]\right\} \le \varepsilon \tag{2.2.1}$$
2. $$\liminf_{n\to\infty}\Pr\left\{T_n[\bar{H}_\varepsilon(\boldsymbol{X}) + \delta]\right\} > \varepsilon \tag{2.2.2}$$
Proof: (2.2.1) and (2.2.2) follow from the definitions. For (2.2.3), we have
$$1 \ge \sum_{x^n\in F_n(\delta;\varepsilon)} P_{X^n}(x^n) \ge \sum_{x^n\in F_n(\delta;\varepsilon)} \exp\left\{-n\left(\bar{H}_\varepsilon(\boldsymbol{X}) + \delta\right)\right\} = |F_n(\delta;\varepsilon)|\,\exp\left\{-n\left(\bar{H}_\varepsilon(\boldsymbol{X}) + \delta\right)\right\}.$$
It remains to show (2.2.4). (2.2.2) implies that there exist ρ = ρ(δ) > 0 and
N1 such that for all n > N1 ,
Pr Tn [H̄ε (X) + δ] > ε + 2ρ.
Furthermore, (2.2.1) implies that for the previously chosen ρ, there exists a
subsequence {nj }∞
j=1 such that
Pr Tnj [H̄ε (X) − δ] < ε + ρ.
• The set
$$F_n(\delta;\varepsilon) := T_n[\bar{H}_\varepsilon(\boldsymbol{X}) + \delta] \setminus T_n[\bar{H}_\varepsilon(\boldsymbol{X}) - \delta] = \left\{x^n\in\mathcal{X}^n : \left|-\frac{1}{n}\log P_{X^n}(x^n) - \bar{H}_\varepsilon(\boldsymbol{X})\right| \le \delta\right\}$$
is nothing but the weakly δ-typical set.
• (2.2.1) and (2.2.2) imply that $q_n := \Pr\{F_n(\delta;\varepsilon)\} > 0$ infinitely often in $n$.
• (2.2.3) and (2.2.4) imply that the number of sequences in $F_n(\delta;\varepsilon)$ (the dashed region) is approximately equal to $\exp\{n\bar{H}_\varepsilon(\boldsymbol{X})\}$, and the probability of each sequence in $F_n(\delta;\varepsilon)$ can be estimated by $q_n\cdot\exp\{-n\bar{H}_\varepsilon(\boldsymbol{X})\}$.
In this case, (2.2.1)–(2.2.2) and the fact that $\bar{H}_\varepsilon(\boldsymbol{X}) = \underline{H}_\varepsilon(\boldsymbol{X})$ for all $\varepsilon\in[0,1)$ imply that the probability of the typical set $F_n(\delta;\varepsilon)$ is close to one (for $n$ sufficiently large), and (2.2.3) and (2.2.4) imply that there are about $e^{nH}$ typical sequences of length $n$, each with probability about $e^{-nH}$. Hence we obtain the conventional AEP.
• The general source coding theorem can also be proved in terms of the
generalized AEP theorem. For more detail, readers can refer to [11].
Chapter 3
In the previous chapter, it was shown that the sup-entropy rate is the minimum lossless data compression rate achievable with block codes. Hence, finding an optimal block code becomes a well-defined task, since for any source with a well-formulated statistical model, the sup-entropy rate can be computed and used as a criterion to evaluate the optimality of the designed block code.
In [26], Verdú and Han found that, other than being the minimum lossless data compression rate, the sup-entropy rate actually has another operational meaning, called resolvability. In this chapter, we will explore this new concept in detail.
generate the random variable by a computer that has access to equally likely random experiments.
Example 3.2 Consider the generation of the random variable with probability
masses PX (−1) = 1/4, PX (0) = 1/2, and PX (1) = 1/4. An algorithm is written
as:
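The algorithm itself is not reproduced here; the following Python sketch is our reconstruction, consistent with the description below (one flip half of the time, two flips otherwise, hence 1.5 flips on average):

    # Minimal sketch of an algorithm generating X with P(-1)=1/4, P(0)=1/2, P(1)=1/4
    # from fair coin flips (reconstruction; the original listing is not shown).
    import random

    def flip():
        """One fair coin flip: returns 'Head' or 'Tail' (one random bit)."""
        return 'Head' if random.random() < 0.5 else 'Tail'

    def generate_X():
        if flip() == 'Head':          # probability 1/2: one flip suffices
            return 0
        # probability 1/2: a second flip splits the remaining mass evenly
        return -1 if flip() == 'Head' else 1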
On the average, the above algorithm requires 1.5 coin flips, and in the worst-
case, 2 coin flips are necessary. Therefore, the complexity measure can take two
fundamental forms: worst-case or average-case over the range of outcomes of
the random variables. Note that we did not show in the above example that
the algorithm is the most efficient one in the sense of using minimum number of
random bits; however, it is indeed an optimal algorithm because it achieves the
lower bound of the minimum number of random bits. Later, we will show that
such bound for average minimum number of random bits required for generat-
ing the random variables is the entropy, which is exactly 1.5 bits in the above
example. As for the worst-case bound, a new terminology, resolution, will be
introduced. As a result, the above algorithm also achieves the lower bound of
the worst-case complexity, which is the resolution of the random variable.
Definition 3.4 (resolution of a random variable) The resolution1 R(X) of
a random variable X is the minimum log(M) such that PX is M-type. If PX is
not M-type for any integer M, then R(X) = ∞.
even = False;
while (1)
{   Flip-a-fair-coin;                  \\ one random bit per iteration
    if (Head)
    {   if (even == True) { output 0; break; }
        else              { output 1; break; }
    }
    else
    {   if (even == True) even = False;
        else              even = True;
    }
}
\\ The loop stops at the first Head: it outputs 1 if the first Head occurs on an
\\ odd-numbered flip (probability 1/2 + 1/8 + ... = 2/3) and outputs 0 otherwise
\\ (probability 1/3). The number of flips is unbounded in the worst case.
Definition 3.5 (variational distance) The variational distance (or $\ell_1$ distance) between two distributions $P$ and $Q$ defined on a common measurable space $(\Omega,\mathcal{F})$ is
$$\|P - Q\| := \sum_{\omega\in\Omega} |P(\omega) - Q(\omega)|.$$
(Note that an alternative way to formulate the variational distance is:
$$\|P - Q\| = 2\cdot\sup_{E\in\mathcal{F}} |P(E) - Q(E)| = 2\sum_{x\in\mathcal{X}:\,P(x)\ge Q(x)} [P(x) - Q(x)].$$
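For finite alphabets the definition and the alternative one-sided formula are easy to transcribe directly; the short Python helper below (ours) checks that they agree on a toy pair of distributions.

    # Variational distance per Definition 3.5, for finite distributions given as dicts.
    def variational_distance(P, Q):
        support = set(P) | set(Q)
        return sum(abs(P.get(x, 0.0) - Q.get(x, 0.0)) for x in support)

    def variational_distance_onesided(P, Q):
        """2 * sum over {x : P(x) >= Q(x)} of [P(x) - Q(x)]; should agree with the above."""
        support = set(P) | set(Q)
        return 2.0 * sum(P.get(x, 0.0) - Q.get(x, 0.0)
                         for x in support if P.get(x, 0.0) >= Q.get(x, 0.0))

    P = {'a': 0.5, 'b': 0.25, 'c': 0.25}
    Q = {'a': 0.25, 'b': 0.25, 'c': 0.5}
    assert abs(variational_distance(P, Q) - variational_distance_onesided(P, Q)) < 1e-12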
The ε-achievable resolution reveals the possibility that one can choose another computer-resolvable random variable whose variational distance to the true source is within an acceptable range, ε.
Next we define the ε-achievable resolution rate for a sequence of random
variables, which is an extension of ε-achievable resolution defined for a single
random variable. Such extension is analogous to extending entropy for a single
source to entropy rate for a random source sequence.
Definition 3.8 (ε-resolvability for X) Fix $\varepsilon > 0$. The ε-resolvability for input $\boldsymbol{X}$, denoted by $S_\varepsilon(\boldsymbol{X})$, is the minimum ε-achievable resolution rate of the same input, i.e.,
$$S_\varepsilon(\boldsymbol{X}) := \min\left\{R : (\forall\,\gamma>0)(\exists\,\tilde{\boldsymbol{X}} \text{ and } N)(\forall\,n>N)\ \ \frac{1}{n}R(\tilde{X}^n) < R + \gamma \text{ and } \|X^n - \tilde{X}^n\| < \varepsilon\right\}.$$
² Note that our definition of resolution rate is different from its original form (cf. Definition 7 in [26] and the statements following Definition 7 of the same paper for its modified definition for a specific input $\boldsymbol{X}$), which involves an arbitrary channel model $\boldsymbol{W}$. Readers may treat our definition as a special case of theirs over the identity channel.
Here, we define $S_\varepsilon(\boldsymbol{X})$ using the "minimum" instead of the more general "infimum" operation simply because $S_\varepsilon(\boldsymbol{X})$ indeed belongs to the range of the minimum operation,³ i.e.,
$$S_\varepsilon(\boldsymbol{X}) \in \mathcal{R}_\varepsilon(\boldsymbol{X}) := \left\{R : (\forall\,\gamma>0)(\exists\,\tilde{\boldsymbol{X}} \text{ and } N)(\forall\,n>N)\ \ \frac{1}{n}R(\tilde{X}^n) < R + \gamma \text{ and } \|X^n - \tilde{X}^n\| < \varepsilon\right\}.$$
Similar conventions will be applied throughout the rest of this chapter.
Definition 3.11 (ε-mean-resolvability for X) Fix $\varepsilon > 0$. The ε-mean-resolvability for input $\boldsymbol{X}$, denoted by $\bar{S}_\varepsilon(\boldsymbol{X})$, is the minimum ε-mean achievable resolution rate for the same input, i.e.,
$$\bar{S}_\varepsilon(\boldsymbol{X}) := \min\left\{R : (\forall\,\gamma>0)(\exists\,\tilde{\boldsymbol{X}} \text{ and } N)(\forall\,n>N)\ \ \frac{1}{n}H(\tilde{X}^n) < R + \gamma \text{ and } \|X^n - \tilde{X}^n\| < \varepsilon\right\}.$$
The operational meanings of the resolution and the entropy (a new operational meaning for entropy other than the one from the source coding theorem) follow from the next theorem.
Next, we reveal the operational meanings for resolvability and mean-resolva-
bility in source coding. We begin with some lemmas that are useful in charac-
terizing the resolvability.
Proof:
$$\begin{aligned}
\|P - Q\| &= 2\sum_{x\in\mathcal{X}:\,P(x)\ge Q(x)} [P(x) - Q(x)] \\
&= 2\sum_{x\in\mathcal{X}:\,\log[P(x)/Q(x)]\ge 0} [P(x) - Q(x)] \\
&= 2\left[\sum_{x\in\mathcal{X}:\,\log[P(x)/Q(x)]>\mu} [P(x) - Q(x)] + \sum_{x\in\mathcal{X}:\,\mu\ge\log[P(x)/Q(x)]\ge 0} [P(x) - Q(x)]\right] \\
&\le 2\left[\sum_{x\in\mathcal{X}:\,\log[P(x)/Q(x)]>\mu} P(x) + \sum_{x\in\mathcal{X}:\,\mu\ge\log[P(x)/Q(x)]\ge 0} P(x)\left(1 - \frac{Q(x)}{P(x)}\right)\right] \\
&\le 2\left[P\left\{x\in\mathcal{X}:\log\frac{P(x)}{Q(x)}>\mu\right\} + \sum_{x\in\mathcal{X}:\,\mu\ge\log[P(x)/Q(x)]\ge 0} P(x)\log\frac{P(x)}{Q(x)}\right] \\
&\le 2\left[P\left\{x\in\mathcal{X}:\log\frac{P(x)}{Q(x)}>\mu\right\} + \mu\cdot P\left\{x\in\mathcal{X}:\mu\ge\log\frac{P(x)}{Q(x)}\ge 0\right\}\right] \\
&\le 2\left[P\left\{x\in\mathcal{X}:\log\frac{P(x)}{Q(x)}>\mu\right\} + \mu\right]. \qquad\square
\end{aligned}$$
Lemma 3.15
$$P_{\tilde{X}^n}\left\{x^n\in\mathcal{X}^n : -\frac{1}{n}\log P_{\tilde{X}^n}(x^n) \le \frac{1}{n}R(\tilde{X}^n)\right\} = 1,$$
for every $n$.
Proof: By definition of $R(\tilde{X}^n)$,
$$P_{\tilde{X}^n}(x^n) \ge \exp\{-R(\tilde{X}^n)\}$$
Theorem 3.16 The resolvability for input $\boldsymbol{X}$ is equal to its sup-entropy rate, i.e.,
$$S(\boldsymbol{X}) = \bar{H}(\boldsymbol{X}).$$
Proof:
1. $S(\boldsymbol{X}) \ge \bar{H}(\boldsymbol{X})$.
It suffices to show that $S(\boldsymbol{X}) < \bar{H}(\boldsymbol{X})$ contradicts Lemma 3.15.
Suppose $S(\boldsymbol{X}) < \bar{H}(\boldsymbol{X})$. Then there exists $\delta > 0$ such that
Let
$$D_0 := \left\{x^n\in\mathcal{X}^n : -\frac{1}{n}\log P_{X^n}(x^n) \ge S(\boldsymbol{X}) + \delta\right\}.$$
By definition of H̄(X),4
lim sup PX n (D0 ) > 0.
n→∞
1
≤ √ |PX n (xn ) − PX n (xn )|
xn ∈X n
ε
ε √
≤ √ = ε.
ε
Consider that
PX n (D1 ∩ D0 ) ≥ PX n (D0 ) − PX n (D1c )
√
≥ α − ε > 0, (3.3.2)
4
lim inf n→∞ PX (D0c ) ≤ h(S(X)+δ) < 1 because if h(S(X)+δ) ≥ 1, then H̄(X) ≤ S(X)+δ.
¯ ¯
which holds infinitely often in n; and every xn0 in D1 ∩ D0 satisfies
√
PX n (xn0 ) ≤ (1 + ε)PX n (xn0 ) (since xn0 ∈ D1 )
and
1 1 1 1
− log PX n (xn0 ) ≥ − log PX n (xn0 ) + log √
n n n 1+ ε
1 1
≥ (S(X) + δ) + log √ (since xn0 ∈ D0 )
n 1+ ε
δ
≥ S(X) + ,
2
√
for n > (2/δ) log(1 + ε). Therefore, for those n that (3.3.2) holds,
n n 1 n 1 n
PX n x ∈ X : − log PX n (x ) > R(X )
n n
n n 1 n δ
≥ PX n x ∈ X : − log PX n (x ) > S(X) + (From (3.3.1))
n 2
n n 1 n
≥ PX n x ∈ X : − log PX n (x ) ≥ S(X) + δ
n
$ %& '
=D0
≥ PX n (D1 ∩ D0 )
√
≥ (1 − ε)PX n (D1 ∩ D0 ) (By definition of D1 )
> 0,
which contradicts to the result of Lemma 3.15.
2. S(X) ≤ H̄(X).
It suffices to show the existence of X̃ for arbitrary γ > 0 such that
n = 0
lim X n − X
n→∞
n = X
Let X n (X n ) be uniformly distributed over a set
G := {Uj ∈ X n : j = 1, . . . , M}
which drawn randomly (independently) according to PX n . Define for µ > 0,
n n 1 n γ µ
D := x ∈ X : − log PX n (x ) > H̄(X) + + .
n 2 n
For each G chosen, we obtain from Lemma 3.14 that
n
X n − X
n n PX n (xn )
≤ 2µ + 2 · PX n x ∈ X : log >µ
PX n (xn )
n 1/M
= 2µ + 2 · PX n x ∈ G : log > µ (since PX n (G c ) = 0)
PX n (xn )
( ) µ
n 1 n 1 n(H̄(X)+ γ2 )
= 2µ + 2 · PX n x ∈ G : − log PX n (x ) > log e +
n n n
1 γ µ
≤ 2µ + PX n xn ∈ G : − log PX n (xn ) > H̄(X) + +
n 2 n
= 2µ + PX n (G ∩ D)
1
= 2µ + |G ∩ D| .
M
Since G is chosen randomly, we can take the expectation values (with
respect to the random G) of the above inequality to obtain:
EG X n − X n ≤ 2µ + 1 EG [|G ∩ D|] .
M
Observe that each Uj is either in D or not in D. From the i.i.d. assumption
of {Uj }M
j=1 , we can then evaluate EG [|G ∩ D|] by
5
which implies
n n
lim sup EG X − X = 0 (3.3.3)
n→∞
since µ can be chosen arbitrarily small. (3.3.3) therefore guarantees the
existence of the desired X̃. 2
5
Readers may imagine that there are cases: where |G ∩ D| = M , |G ∩ D| = M − 1, . . .,
M M−1
|G ∩D| = 1 and |G ∩D| = 0, respectively with drawing probability PX n (D), PX n (D)PX n (Dc ),
M−1 c M c
. . ., PX (D)PX n (D ) and PX n (D ).
n
The next two lemmas are useful in characterizing mean-resolvability.
$$\frac{\partial[f(t+\tau) - f(t)]}{\partial t} = \log\frac{t}{t+\tau} \le 0.$$
Hence,
$$\sup_{0<t\le 1-\tau}[f(t+\tau) - f(t)] = f(\tau) - f(0) = f(\tau)$$
and
$$\sup_{0<t\le 1-\tau}[f(t) - f(t+\tau)] = f(1-\tau) - f(1) = f(1-\tau).$$
Thus
$$|H(X^n) - H(\tilde{X}^n)| \le \|X^n - \tilde{X}^n\| \cdot \log\frac{|\mathcal{X}|^n}{\|X^n - \tilde{X}^n\|},$$
provided $\|X^n - \tilde{X}^n\| \le \frac{1}{2}$.
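As a quick numerical sanity check of the bound just stated, the following Python sketch (ours) draws random pairs of distributions on an alphabet of size K — playing the role of $|\mathcal{X}|^n$ — whose variational distance is below 1/2, and verifies that the entropy difference never exceeds $d\,\log(K/d)$.

    # Sanity check of |H(P) - H(Q)| <= d * log(K/d) for d = ||P - Q|| <= 1/2 (natural log).
    import numpy as np

    rng = np.random.default_rng(2)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    K = 64
    for _ in range(1000):
        p = rng.dirichlet(np.ones(K))
        q = p * (1.0 + 0.2 * rng.uniform(-1, 1, K))   # small multiplicative perturbation
        q = np.clip(q, 1e-12, None)
        q /= q.sum()
        d = np.abs(p - q).sum()
        if 0 < d <= 0.5:
            assert abs(entropy(p) - entropy(q)) <= d * np.log(K / d) + 1e-9
    print("bound verified on random examples")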
Proof:
|H(X n ) − H(X n )|
n 1 n 1
= PX (x ) log
n − PX̃ n (x ) log
n n n
PX (x ) xn ∈X n
n PX̃ n (x )
n
x ∈X
1 1
≤ PX n (xn ) log − P (xn
) log
PX n (xn ) X̃ n
P n (xn )
xn ∈X n X̃
1
≤ |PX n (xn ) − PX̃ n (xn )| · log (3.3.4)
xn ∈X n
|PX n
n (x ) − PX̃ n (xn )|
xn ∈X n 1
n n
≤ |PX n (x ) − PX̃ n (x )| log
n n
xn ∈X n |PX n (x ) − PX̃ n (x )|
xn ∈X n
(3.3.5)
n
|X |
= X n − X̃ n log ,
X n − X̃ n
n ≤
where (3.3.4) follows from X n − X 1
and Lemma 3.17, and (3.3.5) uses
2
the log-sum inequality. 2
Proof:
1. $\bar{S}(\boldsymbol{X}) \le \limsup_{n\to\infty}(1/n)H(X^n)$.
It suffices to prove that $\bar{S}_\varepsilon(\boldsymbol{X}) \le \limsup_{n\to\infty}(1/n)H(X^n)$ for every $\varepsilon > 0$. This is equivalent to showing that for all $\gamma > 0$, there exists $\tilde{\boldsymbol{X}}$ such that
$$\frac{1}{n}H(\tilde{X}^n) < \limsup_{n\to\infty}\frac{1}{n}H(X^n) + \gamma$$
and
$$\|X^n - \tilde{X}^n\| < \varepsilon$$
for sufficiently large $n$. This can be trivially achieved by letting $\tilde{\boldsymbol{X}} = \boldsymbol{X}$, since for all sufficiently large $n$,
$$\frac{1}{n}H(X^n) < \limsup_{n\to\infty}\frac{1}{n}H(X^n) + \gamma$$
and
$$\|X^n - \tilde{X}^n\| = 0.$$
2. $\bar{S}(\boldsymbol{X}) \ge \limsup_{n\to\infty}(1/n)H(X^n)$.
Observe that $\bar{S}(\boldsymbol{X}) \ge \bar{S}_\varepsilon(\boldsymbol{X})$ for any $0 < \varepsilon < e^{-1} \approx 0.36788$. Then for any $\gamma > 0$ and all sufficiently large $n$, there exists $\tilde{X}^n$ such that
$$\frac{1}{n}H(\tilde{X}^n) < \bar{S}(\boldsymbol{X}) + \gamma \tag{3.3.6}$$
and
$$\|X^n - \tilde{X}^n\| < \varepsilon.$$
Since
$$|H(X^n) - H(\tilde{X}^n)| \le \|X^n - \tilde{X}^n\|\cdot\log\frac{|\mathcal{X}|^n}{\|X^n - \tilde{X}^n\|} \le \varepsilon\log\frac{|\mathcal{X}|^n}{\varepsilon},$$
where the last inequality holds because $t\log(1/t)$ is increasing for $0 < t < e^{-1}$, we obtain
$$H(\tilde{X}^n) \ge H(X^n) - \varepsilon\log|\mathcal{X}|^n + \varepsilon\log\varepsilon.$$
In the previous chapter, we have proved that the lossless data compression rate
for block codes is lower bounded by H̄(X). We also show in Section 3.3 that
H̄(X) is also the resolvability for source X. We can therefore conclude that
resolvability is equal to the minimum lossless data compression rate for block
codes. We will justify their equivalence directly in this section.
As explained in the AEP theorem for memoryless sources, the set $F_n(\delta)$ contains approximately $2^{nH(X)}$ elements, and the probability that a source sequence falls outside $F_n(\delta)$ eventually goes to 0. Therefore, we can binary-index the source sequences in $F_n(\delta)$ by
and encode the source sequences outside Fn (δ) by a unique default binary code-
word, which results in an asymptotically zero probability of decoding error. This
is indeed the main idea for Shannon’s source coding theorem for block codes.
By further exploring the above concept, we find that the key is actually the existence of a set $A_n = \{x_1^n, x_2^n, \ldots, x_M^n\}$ with $M \approx 2^{nH(X)}$ and $P_{X^n}(A_n^c) \to 0$. Thus, if we can find such a typical set, Shannon's source coding theorem for block codes can actually be generalized to more general sources, such as non-stationary sources. Furthermore, extension of the theorems to codes of some specific types becomes feasible.
$$\limsup_{n\to\infty}\frac{1}{n}\log|A_n| \le R \quad\text{and}\quad \limsup_{n\to\infty} P_{X^n}[A_n^c] \le \varepsilon.$$
Note that the definition of Tε (X) is equivalent to the one in Definition 2.2.
$$\limsup_{n\to\infty}\frac{1}{n}\ell_n \le R,$$
where $\ell_n$ is the average codeword length of $∼C_n$. $\bar{T}(\boldsymbol{X})$ is the minimum of all such rates.
Recall that for a single source, the measure of its uncertainty is the entropy. Although the entropy can also be used to characterize the overall uncertainty of a random sequence $\boldsymbol{X}$, source coding is more concerned with the "average" entropy of it. So far, we have seen four expressions of "average" entropy:
$$\limsup_{n\to\infty}\frac{1}{n}H(X^n) := \limsup_{n\to\infty}\frac{1}{n}\sum_{x^n\in\mathcal{X}^n} -P_{X^n}(x^n)\log P_{X^n}(x^n);$$
$$\liminf_{n\to\infty}\frac{1}{n}H(X^n) := \liminf_{n\to\infty}\frac{1}{n}\sum_{x^n\in\mathcal{X}^n} -P_{X^n}(x^n)\log P_{X^n}(x^n);$$
$$\bar{H}(\boldsymbol{X}) := \inf\left\{\beta\in\mathbb{R} : \limsup_{n\to\infty} P_{X^n}\left[-\frac{1}{n}\log P_{X^n}(X^n) > \beta\right] = 0\right\};$$
$$\underline{H}(\boldsymbol{X}) := \sup\left\{\alpha\in\mathbb{R} : \limsup_{n\to\infty} P_{X^n}\left[-\frac{1}{n}\log P_{X^n}(X^n) < \alpha\right] = 0\right\}.$$
If
$$\lim_{n\to\infty}\frac{1}{n}H(X^n) = \limsup_{n\to\infty}\frac{1}{n}H(X^n) = \liminf_{n\to\infty}\frac{1}{n}H(X^n),$$
then $\lim_{n\to\infty}(1/n)H(X^n)$ is named the entropy rate of the source. $\bar{H}(\boldsymbol{X})$ and $\underline{H}(\boldsymbol{X})$ are called the sup-entropy rate and inf-entropy rate, which were introduced in Section 1.3.
Next we will prove that T (X) = S(X) = H̄(X) and T̄ (X) = S̄(X) =
lim supn→∞ (1/n)H(X n ) for a source X. The operational characterization of
lim inf n→∞ (1/n)H(X n ) and H(X) will be introduced in Chapter 5.
Proof: Equality of S(X) and H̄(X) is already given in Theorem 3.16. Also,
T (X) = H̄(X) can be obtained from Theorem 2.5 by letting ε = 0. Here, we
provide an alternative proof for T (X) = S(X).
1. T (X) ≤ S(X).
If we can show that, for any ε fixed, Tε (X) ≤ S2ε (X), then the proof is
completed. This claim is proved as follows.
• By definition of S2ε (X), we know that for any γ > 0, there exists X̃
and N such that for n > N,
1 n ) < S2ε (X) + γ n < 2ε.
R(X and X n − X
n
n ) < S2ε (X) + γ,
• Let An := xn : PX n (xn ) > 0 . Since (1/n)R(X
Therefore,
1
lim sup log |An | ≤ S2ε (X) + γ.
n→∞ n
• Also,
n = 2 sup |PX n (E) − P n (E)|
2ε > X n − X X
E⊂X n
≥ 2|PX n (Acn ) − PX n (Acn )|
= 2PX n (Acn ), (sincePX n (Acn ) = 0).
2. T (X) ≥ S(X).
Similarly, if we can show that, for any ε fixed, Tε (X) ≥ S3ε (X), then the
proof is completed. This claim can be proved as follows.
• Fix α > 0. By definition of Tε (X), we know that for any γ > 0, there
exists N and a sequence of sets {An }∞
n=1 such that for n > N,
1
log |An | < Tε (X) + γ and PX n (Acn ) < ε + α.
n
• Choose Mn to satisfy
Also select one element xn0 from Acn . Define a new random variable
n as follows:
X
0, if xn ∈ {xn0 } ∪ An ;
n n
PX n (x ) = k(x )
, if xn ∈ {xn0 } ∪ An ,
Mn
where n
Mn PX
n (x ), if xn ∈ An ;
k(xn ) := k(xn ), if xn = xn0 .
Mn −
xn ∈An
(c) for all xn ∈ An ,
n
P n (xn ) − PX n (xn ) = PX n (xn ) − Mn PX n (x ) ≤ 1 .
X
Mn Mn
(d) PX n (An ) + PX n (xn0 ) = 1.
• Consequently,
1 n ) ≤ Tε (X) + 3γ, (by (3.4.1))
R(X
n
and
n =
X n − X P n (xn ) − PX n (xn ) + P n (xn ) − PX n (xn )
X X 0 0
xn ∈An
+ P n (xn ) − PX n (xn )
X
xn ∈Acn \{xn
0}
≤ P n (xn ) − PX n (xn ) + P n (xn ) + PX n (xn )
X X 0 0
xn ∈An
+ P n (xn ) − PX n (xn )
X
xn ∈Acn \{xn
0}
1
≤ + PX n (xn0 ) + PX n (xn0 )
xn ∈An
M n
+ PX n (xn )
xn ∈Acn \{xn
0}
|An |
= + PX n (xn0 ) + PX n (xn )
Mn xn ∈Ac n
• Since Tε (X) is just one of the rates that satisfy the condition of 3(ε +
α)-resolvability, and S3(ε+α) (X) is the smallest one of such quantities,
S3(ε+α) (X) ≤ Tε (X).
The proof is completed by noting that α can be made arbitrarily
small. 2
This theorem tells us that the minimum source compression ratio for fixed-
length code is the resolvability, which in turn is equal to the sup-entropy rate.
Theorem 3.24 (equality of mean-resolvability and minimum source coding rate for variable-length codes)
$$\bar{T}(\boldsymbol{X}) = \bar{S}(\boldsymbol{X}) = \limsup_{n\to\infty}\frac{1}{n}H(X^n).$$
Proof: Equality of S̄(X) and lim supn→∞ (1/n)H(X n ) is already given in The-
orem 3.19.
1. S̄(X) ≤ T̄ (X).
Definition 3.22 states that there exists, for all γ > 0 and all sufficiently
large n, an error-free variable-length code whose average codeword length
n satisfies
1
n < T̄ (X) + γ.
n
Moreover, the fundamental source coding lower bound for a uniquely de-
codable code (cf. [2, Thm. 3.22]) is
H(X n ) ≤ n.
n = 0 and
Thus, by letting X̃ = X, we obtain X n − X
1 n ) = 1 H(X n ) ≤ 1
H(X n < T̄ (X) + γ,
n n n
which concludes that T̄ (X) is an ε-achievable mean-resolution rate of X
for any ε > 0, i.e.,
S̄(X) = lim S̄ε (X) ≤ T̄ (X).
ε→0
2. T̄ (X) ≤ S̄(X).
Observe that S̄ε (X) ≤ S̄(X) for 0 < ε < e−1 ≈ 0.36788. Hence, by taking
γ satisfying ε log |X | < γ < 2ε log |X | and for all sufficiently large n, there
exists X n such that
1 n ) < S̄(X) + γ
H(X
n
and
X n − X n < ε. (3.4.2)
On the other hand, Theorem 3.27 in [2] proves the existence of an error-free
prefix code for X n with average codeword length n satisfies
From Lemma 3.18 that states
n )| ≤ X n − X
n · log |X |n |X |n
|H(X n ) − H(X ≤ ε log ,
X n − X n ε
where the last inequality holds because t log(1/t) is increasing for 0 < t <
e−1 , we obtain
1 1 1
n ≤ H(X n ) + log(2)
n n n
1 n ) + ε log |X | − 1 ε log(ε) + 1 log(2)
≤ H(X
n n n
1 1
≤ S̄(X) + γ + ε log |X | − ε log(ε) + log(2)
n n
≤ S̄(X) + 2γ,
if n > (log(2) − ε log(ε))/(γ − ε log |X |). Since γ can be made arbitrarily
small, S̄(X) is an achievable source compression rate for variable-length
codes; and hence,
T̄ (X) ≤ S̄(X).
2
Again, the above theorem tells us that the minimum source compression rate for variable-length codes is the mean-resolvability, and the mean-resolvability is exactly $\limsup_{n\to\infty}(1/n)H(X^n)$.
Note that $\limsup_{n\to\infty}(1/n)H(X^n) \le \bar{H}(\boldsymbol{X})$, which follows straightforwardly from the fact that the mean of the random variable $-(1/n)\log P_{X^n}(X^n)$ is no greater than the right margin of its support. Also note that for stationary-ergodic sources, all these quantities are equal, i.e.,
$$T(\boldsymbol{X}) = S(\boldsymbol{X}) = \bar{H}(\boldsymbol{X}) = \bar{T}(\boldsymbol{X}) = \bar{S}(\boldsymbol{X}) = \limsup_{n\to\infty}\frac{1}{n}H(X^n).$$
[Figure: a mixed-source generator — a selector, driven by $Z$, chooses among the component sources $\{X_t : t\in I\}$ to produce the output sequence $\ldots, X_2, X_1$.]
is true for $-\frac{1}{n}\sum_{i=1}^{n}\log P_X(X_i)$ since
$$\begin{aligned}
\sup_{n>0} E\left[\left|\frac{1}{n}\sum_{i=1}^{n}\log P_X(X_i)\right|\right] &\le \sup_{n>0}\frac{1}{n}\sum_{i=1}^{n} E\left[|\log P_X(X_i)|\right] \\
&= \sup_{n>0} E\left[|\log P_X(X)|\right] \quad\text{(by the i.i.d. property of } \{X_i\}_{i=1}^n\text{)} \\
&= E\left[|\log P_X(X)|\right] = E\big[E[\,|\log P_X(X)|\;\big|\;Z\,]\big] \\
&= \int_0^1 E[\,|\log P_X(X)|\;\big|\;Z = z\,]\,dz \\
&= \int_0^1 \left[z\cdot|\log(z)| + (1-z)\cdot|\log(1-z)|\right] dz \\
&\le \int_0^1 \log(2)\,dz = \log(2).
\end{aligned}$$
We therefore have:
$$\left|\frac{1}{n}H(X^n) - E[h_b(Z)]\right| = \left|E\left[-\frac{1}{n}\log P_{X^n}(X^n)\right] - E[h_b(Z)]\right| \le E\left[\left|-\frac{1}{n}\log P_{X^n}(X^n) - h_b(Z)\right|\right] \to 0 \quad\text{as } n\to\infty.$$
Consequently,
$$\limsup_{n\to\infty}\frac{1}{n}H(X^n) = E[h_b(Z)] = -\int_0^1\left[z\log(z) + (1-z)\log(1-z)\right]dz = -\int_0^1 2z\log(z)\,dz = -\left[z^2\log(z) - \frac{z^2}{2}\right]_0^1 = \frac{1}{2}\text{ nats, or } \frac{1}{2\log(2)} \approx 0.72135\text{ bits.}$$
Figure 3.2: The ultimate CDF of $-(1/n)\log P_{X^n}(X^n)$: $\Pr\{h_b(Z) \le t\}$.
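The value $E[h_b(Z)] = 1/2$ nat obtained above is easy to confirm numerically; the short Python sketch below (ours) evaluates the integral of the binary entropy function over (0, 1) by a simple trapezoidal rule.

    # Numerical confirmation that the integral of h_b over (0,1) is 1/2 nat (~0.72135 bits).
    import numpy as np

    z = np.linspace(1e-9, 1 - 1e-9, 2_000_001)
    hb_nats = -z * np.log(z) - (1 - z) * np.log(1 - z)
    val = np.sum((hb_nats[1:] + hb_nats[:-1]) / 2 * np.diff(z))   # trapezoidal rule
    print(f"integral of h_b over (0,1): {val:.6f} nats = {val / np.log(2):.5f} bits")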
Chapter 4
Shannon's channel capacity [2] is usually derived under the assumption that the channel is memoryless. With moderate modification of the proof, this result can be extended to stationary-ergodic channels, for which the capacity formula becomes the maximization of the mutual information rate:
$$\lim_{n\to\infty}\sup_{X^n}\frac{1}{n}I(X^n;Y^n).$$
sources, the sup- and inf-(mutual-)information rates are respectively defined by¹
$$\bar{I}(\boldsymbol{X};\boldsymbol{Y}) := \sup\{\theta : \underline{i}(\theta) < 1\}$$
and
$$\underline{I}(\boldsymbol{X};\boldsymbol{Y}) := \sup\{\theta : \bar{i}(\theta) \le 0\},$$
where
$$\underline{i}(\theta) := \liminf_{n\to\infty}\Pr\left[\frac{1}{n}i_{X^nW^n}(X^n;Y^n) \le \theta\right]$$
is the inf-spectrum of the normalized information density,
$$\bar{i}(\theta) := \limsup_{n\to\infty}\Pr\left[\frac{1}{n}i_{X^nW^n}(X^n;Y^n) \le \theta\right]$$
is the sup-spectrum of the normalized information density, and
$$i_{X^nW^n}(x^n;y^n) := \log\frac{P_{Y^n|X^n}(y^n|x^n)}{P_{Y^n}(y^n)}$$
is the information density.
In 1994, Verdú and Han [42] showed that the channel capacity in its most
general form is
$$C := \sup_{\boldsymbol{X}} \underline{I}(\boldsymbol{X};\boldsymbol{Y}).$$
In their proof, they showed the achievability part via Feinstein’s lemma for the
channel coding average error probability. More importantly, they provided a new
converse based on an error lower bound for multihypothesis testing, which can
be seen as a natural counterpart to the error upper bound afforded by Feinstein’s
lemma. In this chapter, we do not present the original proof of Verdú and Han in
the converse theorem. Instead, we will first derive and illustrate in Section 4.3 a
general lower bound on the minimum error probability of multihypothesis testing
[14].² We then use a special case of the bound, which yields the so-called Poor-Verdú bound [34], to complete the proof of the converse theorem.
¹ In the paper of Verdú and Han [42], these two quantities are defined by:
$$\bar{I}(\boldsymbol{X};\boldsymbol{Y}) := \inf\left\{\beta : (\forall\,\gamma>0)\ \limsup_{n\to\infty} P_{X^nW^n}\left[\frac{1}{n}i_{X^nW^n}(X^n;Y^n) > \beta + \gamma\right] = 0\right\}$$
and
$$\underline{I}(\boldsymbol{X};\boldsymbol{Y}) := \sup\left\{\alpha : (\forall\,\gamma>0)\ \limsup_{n\to\infty} P_{X^nW^n}\left[\frac{1}{n}i_{X^nW^n}(X^n;Y^n) < \alpha - \gamma\right] = 0\right\}.$$
The above definitions are in fact equivalent to ours.
² We refer the reader to [40] for other tight characterizations of the error probability of multihypothesis testing.
4.2 Channel coding and Feinstein’s lemma
3. a decoding function
$$g : \mathcal{Y}^n \to \{1, 2, \ldots, M\},$$
which is (usually) a deterministic rule that assigns a guess to each possible received vector.
$$P_e(∼C_n) = \frac{1}{M}\sum_{i=1}^{M}\lambda_i,$$
where
$$\lambda_i := \sum_{y^n\in\mathcal{Y}^n:\,g(y^n)\ne i} P_{Y^n|X^n}(y^n|f(i)).$$
We assume that the message set (of size M) is governed by a uniform distri-
bution. Thus, under the average probability of error criterion, all codewords are
treated equally (having a uniform prior distribution).
and
$$\limsup_{n\to\infty} P_e(∼C_n) = 0.$$
Lemma 4.4 (Feinstein's Lemma) Fix a positive $n$. For every $\gamma > 0$ and input distribution $P_{X^n}$ on $\mathcal{X}^n$, there exists an $(n, M)$ block code for the transition probability $P_{W^n} = P_{Y^n|X^n}$ whose average error probability $P_e(∼C_n)$ satisfies
$$P_e(∼C_n) \le \Pr\left[\frac{1}{n}i_{X^nW^n}(X^n;Y^n) < \frac{1}{n}\log M + \gamma\right] + e^{-n\gamma}.$$
Proof:
PX n W n (G c ) < ν < 1,
or equivalently,
PX n W n (G) > 1 − ν > 0. (4.2.1)
Therefore, denoting
PX n (A) > 0,
because if PX n (A) = 0,
Step 2: Encoder. Choose an xn1 in A (Recall that PX n (A) > 0.) Define Γ1 =
Gxn1 . (Then PY n |X n (Γ1 |xn1 ) > 1 − ν.)
Next choose, if possible, a point xn2 ∈ X n without replacement (i.e., xn2 can
be identical to xn1 ) for which
PY n |X n Gxn2 − Γ1 xn2 > 1 − ν,
0i−1
and define Γi := Gxni − j=1 Γj .
Repeat the above codeword selecting procedure until either M codewords
are selected or all the points in A are exhausted.
Step 4: Probability of error. For all selected codewords, the error probabi-
lity given codeword i is transmitted, λe|i , satisfies
(Note that (∀ i) PY n |X n (Γi |xni ) ≥ 1−ν by Step 2.) Therefore, if we can show
that the above codeword selecting procedures will not terminate before M,
then
1
M
Pe (∼Cn ) = λe|i < ν.
M i=1
Step 5: Claim. The codeword selecting procedure in Step 2 will not terminate
before M.
Proof: We will prove it by contradiction.
Suppose the above procedure terminates before M, say at N < M. Define
the set
.
N
F := Γi ∈ Y n .
i=1
Consider the probability
PX n W n (G) = PX n W n [G ∩ (X n × F )] + PX n W n [G ∩ (X n × F c )]. (4.2.2)
4.3 Error bounds for multihypothesis testing
Proof: Fix θ ≥ 1. We only provide the proof for 0 < α < 1 since the lower
bound trivially holds when α = 0 and α = 1.
It is known that the estimate e(Y ) of X from observing Y that minimizes
the error probability is the maximum a posteriori (MAP) estimate given by3
$$e(Y) = \arg\max_{x\in\mathcal{X}} P_{X|Y}(x|Y). \tag{4.3.3}$$
3
Since randomization among those x’s that achieve maxx∈X PX|Y (x|y) results in the same
optimal error probability, we assume without loss of generality that e(y) is a deterministic
mapping that selects the maximizing x ∈ X of lowest index.
where $f_x(y) := P_{X|Y}(x|y)$. Note that $f_x(y)$ satisfies
$$\sum_{x\in\mathcal{X}} f_x(y) = \sum_{x\in\mathcal{X}} P_{X|Y}(x|y) = 1.$$
where 1(·) is the indicator function4 and the second inequality follows from
(4.3.5).
To complete the proof, we next relate E[h1 (Y )·1(h1 (Y ) > α)] with E[h1 (Y )],
which is exactly 1 − Pe . For any α ∈ (0, 1) and any random variable U with
Pr{0 ≤ U ≤ 1} = 1, the following inequality holds with probability one:
where the first equality follows from (4.3.4). This completes the proof. 2
We next show that if the MAP estimate e(Y ) of X from Y is almost surely
unique in (4.3.3), then the bound of Lemma 4.5, without the (1 − α) factor, is
tight in the limit of θ going to infinity.
Lemma 4.6 Consider two random variables X and Y , where X has a finite or
countably infinite alphabet X = {x1 , x2 , x3 , . . .} and Y has an arbitrary alphabet
Y. Assume that
$$P_{X|Y}(e(y)|y) > \max_{x\in\mathcal{X}:\,x\ne e(y)} P_{X|Y}(x|y) \tag{4.3.7}$$
holds almost surely in PY , where e(y) is the MAP estimate from y as defined in
(4.3.3); in other words, the MAP estimate is almost surely unique in PY . Then,
the error probability in the MAP estimation of X from Y satisfies
Pe = lim_{θ→∞} P_{X,Y}{ (x, y) ∈ X × Y : P^{(θ)}_{X|Y}(x|y) ≤ α }   (4.3.8)
for each α ∈ (0, 1), where the tilted distribution P^{(θ)}_{X|Y}(·|y) is given in (4.3.2) for y ∈ Y.
4 I.e., if h^{(θ)}_1(y) > α is true, 1(·) = 1; else, it is zero.
Proof: It can be easily verified from the definitions of h_j(·) and h^{(θ)}_j(·) that the following two limits hold for each y ∈ Y:
lim_{θ→∞} h^{(θ)}_1(y) = lim_{θ→∞} 1 / ( 1 + Σ_{j≥2} [h_j(y)/h_1(y)]^θ ) = 1/ℓ(y),
where
ℓ(y) := max{ j ∈ N : h_j(y) = h_1(y) }   (4.3.9)
and N := {1, 2, 3, . . .} is the set of positive integers, and
lim_{θ→∞} h_j(y) · 1( h^{(θ)}_j(y) > α )
= h_j(y) · 1( 1/ℓ(y) > α )   for j = 1, · · · , ℓ(y);
= 0                          for j > ℓ(y).   (4.3.10)
Now the condition in (4.3.7) is equivalent to
Pr[ ℓ(Y) = 1 ] := P_Y{ y ∈ Y : ℓ(y) = 1 } = 1;   (4.3.13)
thus,
lim_{θ→∞} P_{X,Y}{ (x, y) ∈ X × Y : P^{(θ)}_{X|Y}(x|y) > α }
= ∫_Y h_1(y) · 1(1 > α) dP_Y(y)
= E[h_1(Y)] = 1 − Pe,   (4.3.14)
where (4.3.14) follows from (4.3.4). This immediately yields that for 0 < α < 1,
Pe = 1 − lim_{θ→∞} P_{X,Y}{ (x, y) ∈ X × Y : P^{(θ)}_{X|Y}(x|y) > α }
   = lim_{θ→∞} P_{X,Y}{ (x, y) ∈ X × Y : P^{(θ)}_{X|Y}(x|y) ≤ α }.
2
The following two examples illustrate that the condition specified in (4.3.7)
holds for certain situations and hence the generalized Poor-Verdú bound can be
made arbitrarily close to the minimum probability of error Pe by adjusting θ and
α.
Example 4.7 (binary erasure channel) Suppose that X and Y are respec-
tively the channel input and channel output of a binary erasure channel (BEC)
with erasure probability ε, where X = {0, 1} and Y = {0, 1, E}, and
P_{Y|X}(y|x) = 1 − ε   if y = x ∈ {0, 1},   and   P_{Y|X}(E|x) = ε.
Let PX (0) = 1 − p = 1 − PX (1) with 0 < p < 1/2. Then, the MAP estimate of
X from Y is given by
e(y) = y if y ∈ {0, 1}, and e(E) = 0,
and the resulting minimum probability of error is Pe = εp.
Calculating bound (4.3.1) of Lemma 4.5 yields
(1 − α) P_{X,Y}{ (x, y) ∈ X × Y : P^{(θ)}_{X|Y}(x|y) ≤ α }
= 0,            if 0 ≤ α < p^θ / (p^θ + (1 − p)^θ);
= εp(1 − α),    if p^θ / (p^θ + (1 − p)^θ) ≤ α < (1 − p)^θ / (p^θ + (1 − p)^θ);   (4.3.15)
= ε(1 − α),     if (1 − p)^θ / (p^θ + (1 − p)^θ) ≤ α < 1.
Thus, taking θ ↑ ∞ and then α ↓ 0 in (4.3.15) results in the exact error proba-
bility εp. Note that in this example, the original Poor-Verdú bound (i.e., with
θ = 1) also achieves the minimum probability of error εp by choosing α = 1 − p;
however, this maximizing choice of α = 1 − p for the original bound is a function of the system's statistics (here, the input distribution parameter p), which may be undesirable.
On the other hand, the generalized bound (4.3.1) can herein achieve its peak by
systematically taking θ ↑ ∞ and then letting α ↓ 0.
Furthermore, since ℓ(y) = 1 for every y ∈ {0, 1, E}, we have that (4.3.7) holds; hence, by Lemma 4.6, (4.3.8) yields that for 0 < α < 1,
Pe = lim_{θ→∞} P_{X,Y}{ (x, y) ∈ X × Y : P^{(θ)}_{X|Y}(x|y) ≤ α } = εp,
where the last equality follows directly from (4.3.15) without the (1 − α) factor.
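The numbers in Example 4.7 are easy to reproduce by brute-force enumeration over the six (x, y) pairs. The following sketch is our own code (variable names are ours): it computes the generalized bound (1 − α)P_{X,Y}[P^{(θ)}_{X|Y}(X|Y) ≤ α] for a BEC and shows it approaching εp, and also that θ = 1 with α = 1 − p already attains εp, as noted above.

```python
def gpv_bound_bec(eps, p, theta, alpha):
    """(1 - alpha) * Pr[P^(theta)_{X|Y}(X|Y) <= alpha] for a BEC(eps) with
    P_X(0) = 1 - p, computed by enumerating all (x, y) pairs."""
    PX = {0: 1 - p, 1: p}
    PYgX = lambda y, x: eps if y == 'E' else ((1 - eps) if y == x else 0.0)
    ys = [0, 1, 'E']
    PY = {y: sum(PX[x] * PYgX(y, x) for x in PX) for y in ys}
    post = {(x, y): PX[x] * PYgX(y, x) / PY[y] for x in PX for y in ys}
    tilted = {(x, y): post[(x, y)] ** theta /
                      sum(post[(xx, y)] ** theta for xx in PX)
              for x in PX for y in ys}
    prob = sum(PX[x] * PYgX(y, x) for x in PX for y in ys if tilted[(x, y)] <= alpha)
    return (1 - alpha) * prob

eps, p = 0.3, 0.2                      # minimum error probability eps * p = 0.06
for theta, alpha in [(1, 1 - p), (10, 0.01), (100, 0.001)]:
    print(theta, alpha, gpv_bound_bec(eps, p, theta, alpha))
```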
P^{(θ)}_{X|Y}(x|y) = 1 / ( 1 + exp{ −2xy / (σ²/θ) } ),
and the generalized Poor-Verdú bound (4.3.1) yields
Pe ≥ (1 − α) P_{X,Y}{ (x, y) ∈ X × Y : P^{(θ)}_{X|Y}(x|y) ≤ α }
   = (1 − α) ∫_{−∞}^{ −(σ²/(2θ)) log(1/α − 1) − 1 } (1/√(2πσ²)) exp{ −t²/(2σ²) } dt
   = (1 − α) Φ( −(σ/(2θ)) log(1/α − 1) − 1/σ ).   (4.3.17)
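Although the setup leading to (4.3.17) is not reproduced above, the bound itself is easy to evaluate. Assuming it arises from equiprobable antipodal signalling X ∈ {−1, +1} in zero-mean Gaussian noise of variance σ² (an assumption on our part), for which the exact MAP error is Φ(−1/σ), the sketch below shows the bound approaching that value as θ grows and α shrinks.

```python
import math

def Phi(z):                                   # standard normal CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bound_4_3_17(sigma, theta, alpha):        # right-hand side of (4.3.17)
    return (1 - alpha) * Phi(-(sigma / (2 * theta)) * math.log(1 / alpha - 1) - 1 / sigma)

sigma = 1.0
print("exact MAP error:", Phi(-1 / sigma))    # about 0.1587 for sigma = 1
for theta, alpha in [(1, 0.5), (10, 0.1), (1000, 0.001)]:
    print(theta, alpha, bound_4_3_17(sigma, theta, alpha))
```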
for every γ > 0, where X n places probability mass 1/M on each codeword, and
Pe (∼Cn ) denotes the error probability of the code.
4.4 Capacity formulas for general channels
In this section, the general formulas for several operational notions of channel
capacity are derived. The units of these capacity quantities are in nats/channel
use (assuming that the natural logarithm is used).
C = inf_{0≤ε≤1} Cε = lim_{ε↓0} Cε = C0.
Definition 4.13 (strong capacity CSC ) Define the strong converse capacity
(or strong capacity) CSC as the infimum of the rates R such that for all ∼Cn =
(n, Mn ) channel block codes with
lim inf_{n→∞} (1/n) log Mn ≥ R,
we have
lim inf_{n→∞} Pe(∼Cn) = 1.
5 The claim of C0 = lim_{ε↓0} Cε can be proved by contradiction as follows.
Suppose C0 + 2γ < lim_{ε↓0} Cε for some γ > 0. For any positive integer j, and by definition of C_{1/j}, there exist Nj and a sequence of block codes ∼Cn = (n, Mn) such that for n > Nj, (1/n) log Mn > C_{1/j} − γ ≥ lim_{ε↓0} Cε − γ > C0 + γ and Pe(∼Cn) < 2/j. Construct a sequence of block codes ∼C̃n = (n, M̃n) as ∼C̃n := ∼Cn if max_{1≤i≤j−1} Ni ≤ n < max_{1≤i≤j} Ni. Then lim sup_{n→∞} (1/n) log M̃n ≥ C0 + γ and lim inf_{n→∞} Pe(∼C̃n) = 0, which contradicts the definition of C0.
Based on these definitions, general formulas for the above capacity notions
are established as follows.
Theorem 4.14 (ε-capacity) For 0 < ε < 1, the ε-capacity Cε for arbitrary channels satisfies
Cε = sup_X I̲ε(X; Y).
Proof:
1. Cε ≥ sup_X I̲ε(X; Y).
Fix an input X. It suffices to show the existence of an ∼Cn = (n, Mn) data transmission code with rate
I̲ε(X; Y) − γ < (1/n) log Mn < I̲ε(X; Y) − γ/2
and probability of decoding error satisfying
lim sup_{n→∞} Pe(∼Cn) ≤ ε
for every γ > 0. (Because if such a code exists, then lim inf_{n→∞} (1/n) log Mn ≥ I̲ε(X; Y) − γ, which implies Cε ≥ I̲ε(X; Y) − γ for arbitrarily small γ.)
From Lemma 4.4, there exists an ∼Cn = (n, Mn) code whose error probability satisfies
Pe(∼Cn) ≤ Pr[ (1/n) i_{X^n W^n}(X^n; Y^n) < (1/n) log Mn + γ/4 ] + e^{−nγ/4}
        ≤ Pr[ (1/n) i_{X^n W^n}(X^n; Y^n) < I̲ε(X; Y) − γ/2 + γ/4 ] + e^{−nγ/4}
        ≤ Pr[ (1/n) i_{X^n W^n}(X^n; Y^n) < I̲ε(X; Y) − γ/4 ] + e^{−nγ/4}.
Since
I̲ε(X; Y) := sup{ R : lim sup_{n→∞} Pr[ (1/n) i_{X^n W^n}(X^n; Y^n) ≤ R ] ≤ ε },
we obtain
lim sup_{n→∞} Pr[ (1/n) i_{X^n W^n}(X^n; Y^n) < I̲ε(X; Y) − γ/4 ] ≤ ε.
Hence, the proof of the direct part is completed by noting that
lim sup_{n→∞} Pe(∼Cn) ≤ lim sup_{n→∞} Pr[ (1/n) i_{X^n W^n}(X^n; Y^n) < I̲ε(X; Y) − γ/4 ] + lim sup_{n→∞} e^{−nγ/4} ≤ ε.
2. Cε ≤ sup_X I̲ε(X; Y).
Suppose that there exists a sequence of ∼Cn = (n, Mn) codes with rate strictly larger than sup_X I̲ε(X; Y) and lim sup_{n→∞} Pe(∼Cn) ≤ ε. Let the ultimate code rate for this code be sup_X I̲ε(X; Y) + 3ρ for some ρ > 0. Then for sufficiently large n,
(1/n) log Mn > sup_X I̲ε(X; Y) + 2ρ.
Since the above inequality holds for every X, it certainly holds if we take the input X̂^n which places probability mass 1/Mn on each codeword, i.e.,
(1/n) log Mn > I̲ε(X̂; Ŷ) + 2ρ,   (4.4.1)
where Ŷ is the channel output due to channel input X̂. Then from Corollary 4.9, the error probability of the code satisfies
Pe(∼Cn) ≥ (1 − e^{−nρ}) Pr[ (1/n) i_{X̂^n W^n}(X̂^n; Ŷ^n) ≤ (1/n) log Mn − ρ ]
        ≥ (1 − e^{−nρ}) Pr[ (1/n) i_{X̂^n W^n}(X̂^n; Ŷ^n) ≤ I̲ε(X̂; Ŷ) + ρ ],
where the last inequality follows from (4.4.1). Taking the limsup of both sides, we have
Theorem 4.15 (general channel capacity) The channel capacity C for arbitrary channels satisfies
C = sup_X I̲(X; Y).
It remains to show that C ≤ sup_X I̲(X; Y).
Suppose that there exists a sequence of ∼Cn = (n, Mn) codes with rate strictly larger than sup_X I̲(X; Y) and error probability tending to 0 as n → ∞. Let the ultimate code rate for this code be sup_X I̲(X; Y) + 3ρ for some ρ > 0. Then for sufficiently large n,
(1/n) log Mn > sup_X I̲(X; Y) + 2ρ.
Since the above inequality holds for every X, it certainly holds if we take the input X̂^n which places probability mass 1/Mn on each codeword, i.e.,
(1/n) log Mn > I̲(X̂; Ŷ) + 2ρ,   (4.4.2)
where Ŷ is the channel output due to channel input X̂. Then from Corollary 4.9, the error probability of the code satisfies
Pe(∼Cn) ≥ (1 − e^{−nρ}) Pr[ (1/n) i_{X̂^n W^n}(X̂^n; Ŷ^n) ≤ (1/n) log Mn − ρ ]
        ≥ (1 − e^{−nρ}) Pr[ (1/n) i_{X̂^n W^n}(X̂^n; Ŷ^n) ≤ I̲(X̂; Ŷ) + ρ ],   (4.4.3)
where the last inequality follows from (4.4.2). Since, by assumption, Pe(∼Cn) vanishes as n → ∞ but (4.4.3) cannot vanish by definition of I̲(X̂; Ŷ), we obtain the desired contradiction. 2
We close this section by providing the general formula for the strong capacity; its proof follows steps similar to those of the previous two theorems and is hence omitted. Note that in the general formula for the strong capacity, the sup-information rate is used, as opposed to the inf-information rate appearing in the formula for the channel capacity.
With the general capacity formulas shown in the previous section, we can now
compute them for some non-stationary or non-ergodic channels, and analyze
their properties.
Example 4.17 (capacity) Let the input and output alphabets be {0, 1}, and
let every output Yi be given by:
Yi = Xi ⊕ Ni ,
where “⊕” represents the modulo-2 addition operation. Assume that the input process X and the noise process N are independent.
A general relation between the inf-information rate and inf/sup-entropy rates
can be derived from (1.4.2) and (1.4.4) as follows:
H̲(Y) − H̄(Y|X) ≤ I̲(X; Y) ≤ H̄(Y) − H̄(Y|X).
Since N is completely determined from Y n under the knowledge of X n ,
[Figure: the cluster points lie between H̲(N) and H̄(N).]
Case B) If N has the same distribution as the source process in Example 3.25,
then H̄(N ) = log(2) nats, which yields a zero channel capacity.
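The step omitted at the page break can be sketched as follows; this is our own reconstruction, consistent with (4.5.1) below, and it explains why H̄(N) = log(2) forces a zero capacity.

```latex
% Our reconstruction of the omitted step: for the additive channel with the
% uniform input (which is optimal for this symmetric channel),
% P_{Y^n}(y^n) = 2^{-n} and P_{Y^n|X^n}(y^n|x^n) = P_{N^n}(y^n \oplus x^n), so
\frac{1}{n}\, i_{X^n W^n}(X^n;Y^n)
   = \frac{1}{n}\log\frac{P_{Y^n|X^n}(Y^n|X^n)}{P_{Y^n}(Y^n)}
   = \log 2 + \frac{1}{n}\log P_{N^n}(N^n).
% Taking the liminf in probability gives \underline{I}(X;Y) = \log 2 - \bar{H}(N),
% so the capacity equals \log 2 - \bar{H}(N): zero in Case B, where \bar{H}(N) = \log 2.
```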
Example 4.18 (strong capacity) Considering the same additive noise chan-
nel as in Example 4.17, we have that under a uniform input (in this case,
PX n (xn ) = PY n (y n ) = 2−n ),
Pr[ (1/n) log( P_{Y^n|X^n}(Y^n|X^n) / P_{Y^n}(Y^n) ) ≤ θ ]
= Pr[ (1/n) log P_{N^n}(N^n) − (1/n) log P_{Y^n}(Y^n) ≤ θ ]
= Pr[ (1/n) log P_{N^n}(N^n) ≤ θ − log(2) ]
= Pr[ −(1/n) log P_{N^n}(N^n) ≥ log(2) − θ ]
= 1 − Pr[ −(1/n) log P_{N^n}(N^n) < log(2) − θ ].   (4.5.1)
We again consider the two cases in Example 4.17.
Case A) C_SC = 1 − lim inf_{n→∞} (1/n) Σ_{i=1}^{n} h_b(p_i).
Case B) From (4.5.1) and also from Figure 4.2, we obtain that
CSC = log(2).
In Figure 4.2, we plot the ultimate CDF of the channel's normalized information density for Case B. Recall that this limiting CDF is the spectrum of the normalized information density.
[Figure 4.2: the limiting CDF of the normalized information density for Case B; it increases from 0 to 1 over the interval [0, log(2)].]
Figure 4.2 indicates that the channel capacity is 0 and that the strong capacity is log(2). Hence, the operational meanings of these two extreme values are determined. One can then naturally ask: "what is the operational meaning of the function values between 0 and log(2)?" The answer to this question comes from Theorem 4.14, namely the ε-capacity. Indeed, in practice it may not be easy to design a block code which transmits information with (asymptotically) no error through a very noisy channel at a rate equal to the channel capacity. However, if we allow some errors during transmission, say an error probability bounded above by 0.001, we have a better chance of constructing a practical block code.
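Reading the ε-capacity off a limiting CDF such as the one in Figure 4.2 amounts to locating the largest rate at which the spectrum does not exceed ε. A minimal sketch of our own (the linear CDF below is a stand-in, not the book's Case B distribution):

```python
import math

def eps_capacity(F, eps, lo=0.0, hi=1.0, tol=1e-9):
    """Largest rate R with F(R) <= eps, found by bisection; F is assumed to be
    a continuous, nondecreasing limiting CDF (spectrum) of the normalized
    information density, so this is the eps-capacity read off the plot."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if F(mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

# stand-in spectrum rising linearly from 0 at rate 0 to 1 at rate log 2
F = lambda r: min(max(r / math.log(2.0), 0.0), 1.0)
print(eps_capacity(F, eps=0.001, hi=math.log(2.0)))   # about 0.001 * log 2
```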
since the strong capacity dictates the strong condition lim inf_{n→∞} Pe(∼Cn) = 1, as opposed to the condition lim sup_{n→∞} Pe(∼Cn) ≤ ε required for the ε-capacity. However, the above inequality holds with equality in this example.
C := max_{P_X} I(P_X, Q_{Y|X}).
Here, the performance of the code is assumed to be the average error probability, namely
Pe(∼Cn) = (1/M) Σ_{i=1}^{M} Pe(∼Cn | x^n_i),
if the codebook is ∼Cn := {x^n_1, x^n_2, . . . , x^n_M}. Due to the random coding argument, a deterministic good code with arbitrarily small error probability and with rate less than the channel capacity must exist. One can ask: what is the relationship between a good code and the optimizer PX? It is widely believed that if the code is good (with rate close to capacity and low error probability), then the output statistics P_{Y^n} induced by the (equally likely) code must approximate the output distribution P_{Ȳ^n} induced by the input distribution achieving the channel capacity.
This fact is actually reflected in the next theorem.
Theorem 4.20 ([26]) For any channel W n = (Y n |X n ) with finite input al-
phabet and capacity C that satisfies the strong converse (i.e., C = CSC ), the
following statement holds.
Fix γ > 0 and a sequence of {∼Cn = (n, Mn)}_{n=1}^{∞} block codes with
(1/n) log Mn ≥ C − γ/2,
and vanishing error probability (i.e., the error probability approaches zero as the blocklength n tends to infinity). Then,
(1/n) ‖Y^n − Ȳ^n‖ ≤ γ   for all sufficiently large n,
where Y^n is the output due to the block code and Ȳ^n is the output due to the X̄^n that satisfies
I(X̄^n; Ȳ^n) = max_{X^n} I(X^n; Y^n).
To be specific,
P_{Y^n}(y^n) = Σ_{x^n ∈ ∼Cn} P_{X^n}(x^n) P_{W^n}(y^n|x^n) = (1/Mn) Σ_{x^n ∈ ∼Cn} P_{W^n}(y^n|x^n)
and
P_{Ȳ^n}(y^n) = Σ_{x^n ∈ X^n} P_{X̄^n}(x^n) P_{W^n}(y^n|x^n).
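The two output distributions in the theorem are straightforward to compute for toy examples. The sketch below is our own illustration (the length-3 repetition codebook is deliberately not a capacity-approaching code); it only makes the formulas for P_{Y^n} and P_{Ȳ^n} concrete for a BSC.

```python
from itertools import product
from math import prod

p, n = 0.11, 3
def W(y, x):                      # BSC(p), single channel use
    return (1 - p) if y == x else p

def output_dist(input_dist):
    """P_{Y^n}(y^n) = sum_{x^n} P_{X^n}(x^n) * prod_k W(y_k | x_k)."""
    return {yn: sum(q * prod(W(yk, xk) for yk, xk in zip(yn, xn))
                    for xn, q in input_dist.items())
            for yn in product((0, 1), repeat=n)}

code_input = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}                    # uniform over a toy code
iid_input = {xn: 0.5 ** n for xn in product((0, 1), repeat=n)}   # capacity-achieving input
PY, PY_bar = output_dist(code_input), output_dist(iid_input)
print("variational distance:", sum(abs(PY[y] - PY_bar[y]) for y in PY_bar))
```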
Note that the above theorem holds for arbitrary channels, not restricted to
only discrete memoryless channels.
One can wonder whether a result in the spirit of the above theorem can be proved for the input statistics rather than the output statistics. The answer is negative. Hence, the statement that the input statistics of any good code must approximate those that maximize the mutual information, although often taken for granted, is erroneous. (However, we do not rule out the possibility of the existence of good codes whose statistics do approximate those that maximize the mutual information.) To see this, simply consider the normalized entropy of X̄^n versus that of X̂^n (which is uniformly distributed over the codewords) for discrete memoryless channels:
(1/n) H(X̄^n) − (1/n) H(X̂^n) = (1/n) H(X̄^n|Ȳ^n) + (1/n) I(X̄^n; Ȳ^n) − (1/n) log(Mn)
                              = H(X̄|Ȳ) + I(X̄; Ȳ) − (1/n) log(Mn)
                              = H(X̄|Ȳ) + C − (1/n) log(Mn).
A good code with vanishing error probability exists for (1/n) log(Mn) arbitrarily close to C; hence, we can find a good code sequence satisfying
lim_{n→∞} [ (1/n) H(X̄^n) − (1/n) H(X̂^n) ] = H(X̄|Ȳ).
Since the term H(X̄|Ȳ) is in general positive, where a quick example is the BSC with crossover probability p, for which X̄ and Ȳ are both uniform and
[Diagram: the true source . . . , X3, X2, X1 drives the true channel P_{Y^n|X^n} to produce the true output . . . , Y3, Y2, Y1; a computer-generated source . . . , X̃3, X̃2, X̃1 drives the same channel to produce the corresponding output . . . , Ỹ3, Ỹ2, Ỹ1.]
Figure 4.4: The simulated communication system.
X n := (X1 , . . . , Xn ),
W n := (W1 , . . . , Wn ),
and
Y n := (Y1 , . . . , Yn ),
where Wi has distribution PYi |Xi . In order to simulate the behavior of the chan-
nel, a computer-generated input may be necessary as shown in Figure 4.4. As
stated in Chapter 3, such computer-generated input is based on an algorithm
formed by a few basic uniform random experiments, which has finite resolu-
tion. Our goal is to find a good computer-generated input X n such that the
corresponding output Y n is very close to the true output Y n .
input X and channel W is defined by:
Sε(X, W) := min{ R : (∀ γ > 0)(∃ X̃ and N)(∀ n > N)  (1/n) R(X̃^n) < R + γ  and  ‖Y^n − Ỹ^n‖ < ε },
where P_{Ỹ^n} = P_{X̃^n} P_{W^n}. (The definitions of the resolution R(·) and the variational distance ‖(·) − (·)‖ are given by Definitions 3.4 and 3.5.)
and
S̄(W) := sup_X S̄(X, W).
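To make the resolvability idea concrete, one can quantize an input distribution to an M-type (all probabilities multiples of 1/M, a notion from Chapter 3) and watch the induced output distributions converge. The single-letter sketch below is our own toy illustration, with our own rounding rule; it uses the unnormalized L1 distance, so adjust by a factor of 1/2 if Definition 3.5 normalizes differently.

```python
import numpy as np

def m_type_output_gap(P_X, W, M):
    """Quantize P_X to an M-type distribution (all masses multiples of 1/M),
    push both through the channel matrix W, and return the L1 distance between
    the two output distributions."""
    counts = np.floor(P_X * M).astype(int)
    leftovers = np.argsort(P_X * M - counts)[::-1][: M - counts.sum()]
    counts[leftovers] += 1                     # give the remaining mass to the
    Q_X = counts / M                           # largest fractional parts
    return np.abs(P_X @ W - Q_X @ W).sum()

W = np.array([[0.9, 0.1],
              [0.2, 0.8]])
P_X = np.array([0.3141, 0.6859])
for M in (2, 8, 64, 1024):
    print(M, m_type_output_gap(P_X, W, M))
```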
As an extension to Chapter 3, the above definitions lead to the following
theorem.
and
S̄(W) = C = sup_X I̲(X; Y).
Chapter 5
5.1 Motivations
ble, etc.) sources and single-user channels in [41, 42]. More specifically, they
establish an additional operational characterization for the optimistic minimum achievable source coding rate (T̲(X) for source X) by demonstrating that, for a given source X, the classical statement of the source-channel separation theorem¹ holds for every channel if T̲(X) = T(X) [41]. In a dual fashion, they also show that for channels with C̄ = C, the classical separation theorem holds for every source. They also conjecture that T̲(X) and C̄ do not admit simple expressions.
In this chapter, we demonstrate that T̲(X) and C̄ do indeed admit general formulas. The key to these results is the application of the generalized sup-
information rate introduced in Chapter 1 to the existing proofs by Verdú and
Han [42] of the direct and converse parts of the conventional coding theorems.
We also provide a general expression for the optimistic minimum ε-achievable
source coding rate and the optimistic ε-capacity.
In this section, we provide the optimistic source coding theorems. They are
shown based on two new bounds due to Han [25] on the error probability of a
source code as a function of its size. Interestingly, these bounds constitute the
natural counterparts of the upper bound provided by Feinstein’s Lemma (see
Lemma 4.4) and the Verdú-Han lower bound [42] to the error probability of a
channel code. Furthermore, we show that for information stable sources, the
formula for T̲(X) reduces to
T̲(X) = lim inf_{n→∞} (1/n) H(X^n).
This is in contrast to the expression for the conventional rate T(X), which is known to be
T(X) = lim sup_{n→∞} (1/n) H(X^n).
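A toy source makes the gap between the two expressions visible. The sketch below reflects our own choice of source (independent bits that are fair on the index set J of Example 5.14 and deterministic elsewhere, so its entropy density is deterministic and the source is information stable); it estimates the liminf and limsup of (1/n)H(X^n) numerically, which approach (1/3) log 2 and (2/3) log 2, respectively.

```python
import math

def hb(p):                                     # binary entropy in nats
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

def in_J(i):                                   # J from Example 5.14
    return (i.bit_length() - 1) % 2 == 1       # floor(log2 i) odd

N, running, avg = 2 ** 16, 0.0, []
for i in range(1, N + 1):                      # p_i = 1/2 on J, p_i = 0 off J
    running += hb(0.5) if in_J(i) else hb(0.0)
    avg.append(running / i)                    # (1/n) H(X^n) for independent bits

print("liminf ~", min(avg[N // 2:]))           # approaches (1/3) log 2 ~ 0.231
print("limsup ~", max(avg[N // 2:]))           # approaches (2/3) log 2 ~ 0.462
```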
The above result leads us to observe that for sources that are both stationary and
information stable, the classical separation theorem is valid for every channel.
In [41], Vembu et al. characterize the sources for which the classical separation
theorem holds for every channel. They demonstrate that for a given source X,
1 By the “classical statement of the source-channel separation theorem,” we mean the following. Given a source X with (conventional) source coding rate T(X) and a channel W with capacity C, X can be reliably transmitted over W if T(X) < C. Conversely, if T(X) > C, then X cannot be reliably transmitted over W. By reliable transmissibility of the source over the channel, we mean that there exists a sequence of joint source-channel codes such that the decoding error probability vanishes as the blocklength n → ∞ (cf. [41]).
the separation theorem holds for every channel if its optimistic minimum achievable source coding rate T̲(X) coincides with its conventional (or pessimistic) minimum achievable source coding rate T(X); i.e., if T̲(X) = T(X).
We herein establish a general formula for T̲(X). We prove that for any source X,
T̲(X) = lim_{δ↑1} H̲δ(X) = H̲1−(X).
We also provide the general expression for the optimistic minimum ε-achievable
source coding rate. We show these results based on two new bounds due to Han
(one upper bound and one lower bound) on the error probability of a source code
[25, Chapter 1]. The upper bound (i.e., Lemma 2.3) constitutes the counterpart of Feinstein's Lemma for channel codes, while the lower bound (i.e., Lemma 2.4) constitutes the counterpart of the Verdú-Han lower bound on the error probability of a channel code ([42, Theorem 4]). As in the case of the channel coding bounds,
both source coding bounds (Lemmas 2.3 and 2.4) hold for arbitrary sources and
for arbitrary fixed blocklength.
Definition 5.2 (optimistic ε-achievable source coding rate) Fix 0 < ε <
1. R ≥ 0 is an optimistic ε-achievable rate if, for every γ > 0, there exists a
sequence of (n, Mn ) fixed-length source codes ∼Cn such that
lim sup_{n→∞} (1/n) log Mn ≤ R
and
lim inf_{n→∞} Pe(∼Cn) ≤ ε.
The infimum of all optimistic ε-achievable source coding rates for source X is denoted by T̲ε(X). Also define T̲(X) := sup_{0<ε<1} T̲ε(X) = lim_{ε↓0} T̲ε(X) = T̲0(X) as the optimistic source coding rate.
We can then use Lemmas 2.3 and 2.4 (in a similar fashion to the general
source coding theorem in Theorem 2.5) to prove the general optimistic (fixed-
length) source coding theorems.
Proof: The case of ε = 1 follows directly from its definition; hence, the proof
only focuses on the case of ε ∈ [0, 1).
Lemma 2.3 ensures the existence (for any γ > 0) of a source block code ∼Cn = (n, Mn = ⌈exp{n(lim_{δ↑(1−ε)} H̲δ(X) + γ)}⌉) with error probability
Pe(∼Cn) ≤ Pr[ (1/n) h_{X^n}(X^n) > (1/n) log Mn ]
        ≤ Pr[ (1/n) h_{X^n}(X^n) > lim_{δ↑(1−ε)} H̲δ(X) + γ ].
Therefore,
lim inf_{n→∞} Pe(∼Cn) ≤ lim inf_{n→∞} Pr[ (1/n) h_{X^n}(X^n) > lim_{δ↑(1−ε)} H̲δ(X) + γ ]
  = 1 − lim sup_{n→∞} Pr[ (1/n) h_{X^n}(X^n) ≤ lim_{δ↑(1−ε)} H̲δ(X) + γ ]
  ≤ 1 − (1 − ε) = ε,
and
lim inf_{n→∞} Pe(∼Cn) ≤ ε.   (5.2.3)
(5.2.2) implies that
(1/n) log Mn ≤ lim_{δ↑(1−ε)} H̲δ(X) − 2γ
for all sufficiently large n. Hence, for those n satisfying the above inequality and also by Lemma 2.4,
Pe(∼Cn) ≥ Pr[ (1/n) h_{X^n}(X^n) > (1/n) log Mn + γ ] − e^{−nγ}
        ≥ Pr[ (1/n) h_{X^n}(X^n) > lim_{δ↑(1−ε)} H̲δ(X) − 2γ + γ ] − e^{−nγ}.
Therefore,
lim inf_{n→∞} Pe(∼Cn) ≥ 1 − lim sup_{n→∞} Pr[ (1/n) h_{X^n}(X^n) ≤ lim_{δ↑(1−ε)} H̲δ(X) − γ ]
                      > 1 − (1 − ε) = ε,
Proof:
(5.2.4)
From the definition of lim_{δ↑1} H̲δ(X), it directly follows that the first term on the right-hand side of (5.2.4) is upper bounded by lim_{δ↑1} H̲δ(X) + ε, and that the liminf of the second term is zero. Thus
T̲(X) = lim_{δ↑1} H̲δ(X) ≥ lim inf_{n→∞} (1/n) H(X^n).
It is worth pointing out that if the source X is both information stable and
stationary, the above Lemma yields
T̲(X) = T(X) = lim_{n→∞} (1/n) H(X^n).
This implies that given a stationary and information stable source X, the clas-
sical separation theorem holds for every channel.
In this section, we state, without proof, the general expressions for the optimistic ε-capacity² (C̄ε) and for the optimistic capacity (C̄) of arbitrary single-user channels. The proofs of these expressions are straightforward once the right definition (of Īε(X; Y)) is made. They employ Feinstein's Lemma and the Poor-Verdú bound, and follow the same arguments used in Theorems 4.14 and 4.15 to show the general expressions of the conventional ε-capacity
Cε = sup_X I̲ε(X; Y),
We close this section by proving the formula of C̄ for information stable channels.
and
lim inf_{n→∞} Pe(∼Cn) ≤ ε.
Definition 5.7 (optimistic ε-capacity C̄ε ) Fix 0 < ε < 1. The supremum of
optimistic ε-achievable rates is called the optimistic ε-capacity, C̄ε .
2
Note that the expression of C̄ε was also separately obtained in [37, Theorem 7].
Definition 5.8 (optimistic capacity C̄) The optimistic channel capacity C̄ is defined as the supremum of the rates that are optimistic ε-achievable for all ε ∈ [0, 1]. It follows immediately from the definition that C̄ = inf_{0≤ε≤1} C̄ε = lim_{ε↓0} C̄ε = C̄0 and that C̄ is the supremum of all the rates R for which there exists a sequence of ∼Cn = (n, Mn) channel block codes such that
lim inf_{n→∞} (1/n) log Mn ≥ R,
and
lim inf_{n→∞} Pe(∼Cn) = 0.
Theorem 5.9 (optimistic ε-capacity formula) Fix 0 < ε < 1. The opti-
mistic ε-capacity C̄ε satisfies
Proof:
1. C̄ ≤ lim sup_{n→∞} sup_{X^n} (1/n) I(X^n; Y^n).
By using a similar argument as in the proof of [42, Theorem 8, property h)], we have
Ī0(X; Y) ≤ lim sup_{n→∞} sup_{X^n} (1/n) I(X^n; Y^n).
Hence,
C̄ = sup_X Ī0(X; Y) ≤ lim sup_{n→∞} sup_{X^n} (1/n) I(X^n; Y^n).
Observations:
• It is known that for discrete memoryless channels, the optimistic capacity
C̄ is equal to the (conventional) capacity C [42, 16]. The same result
holds for modulo-q additive noise channels with stationary ergodic noise.
However, in general, C̄ ≥ C since Ī0(X; Y) ≥ I̲(X; Y) [10, 11].
• Remark that Theorem 11 in [41] holds if, and only if,
sup_X I̲(X; Y) = sup_X Ī0(X; Y).
5.4 Examples for the computations of capacity and strong capacity
We provide four examples to illustrate the computation of C and C̄. The first two examples present information stable channels for which C̄ > C. The third example shows an information unstable channel for which C̄ = C. These examples indicate that information stability is neither necessary nor sufficient to ensure that C̄ = C, and thereby the validity of the classical source-channel separation theorem. The last example illustrates the situation where 0 < C < C̄ < C_SC < log2 |Y|, where C_SC denotes the strong capacity. We assume in this section that all logarithms are in base 2, so that C and C̄ are measured in bits.
Example 5.14 Here we use the information stable channel provided in [41,
Section III] to show that C̄ > C. Let N be the set of all positive integers.
Define the set J as
J := {n ∈ N : 22i+1 ≤ n < 22i+2 , i = 0, 1, 2, . . .}
= {2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 32, 33, · · · , 63, 128, 129, · · · , 255, · · · }.
Consider the following nonstationary symmetric channel W. At times n ∈ J, Wn is a BSC(0), whereas at times n ∉ J, Wn is a BSC(1/2). Put W^n = W1 × W2 × · · · × Wn. Here again Cn is achieved by a Bernoulli(1/2) input X̂^n. Since the set J is deterministic and is known to both transmitter and receiver, we then obtain
Cn = (1/n) Σ_{i=1}^{n} I(X̂i; Yi) = (1/n) [ J(n) · 1 + (n − J(n)) · 0 ] = J(n)/n,
where J(n) := |J ∩ {1, 2, · · · , n}|. It can be shown that
J(n)/n = 1 − (2/3) · (2^⌊log2 n⌋ / n) + 1/(3n),   for ⌊log2 n⌋ odd;
J(n)/n = (2 · 2^⌊log2 n⌋) / (3n) − 2/(3n),        for ⌊log2 n⌋ even.
Consequently, C = lim inf n→∞ Cn = 1/3 and C̄ = lim supn→∞ Cn = 2/3.
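The closed form for J(n) quoted above can be checked mechanically. The snippet below is our own verification code; it compares the formula against direct counting and prints a few values of J(n)/n near the oscillation extremes.

```python
def J_count(n):                                # |J ∩ {1,...,n}| by direct counting
    return sum(1 for i in range(1, n + 1) if (i.bit_length() - 1) % 2 == 1)

def J_formula(n):                              # closed form quoted above, exactly
    k = n.bit_length() - 1                     # floor(log2 n)
    return n - (2 * 2**k - 1) // 3 if k % 2 == 1 else (2 * 2**k - 2) // 3

assert all(J_count(n) == J_formula(n) for n in range(1, 2000))
print([round(J_formula(n) / n, 3) for n in (2**7, 2**8 - 1, 2**8, 2**9 - 1)])
# J(n)/n oscillates between 1/3 and 2/3, giving C = 1/3 and C̄ = 2/3
```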
3
Some work adopts slightly different definitions of Cε and C̄ε from ours. For example, the ε-
achievable rate in [42] and the optimistic ε-achievable rate in [13] are defined via arbitrary γ > 0
and a sequence of ∼Cn = (n, Mn ) block codes such that i) (1/n) log Mn > R − γ for sufficiently
large n and ii) Pe ( ∼Cn ) ≤ ε for sufficiently large n under ε-achievability and Pe ( ∼Cn ) ≤ ε for
infinitely many n under optimistic ε-achievability. We, however, define the two quantities
without the auxiliary (arbitrarily small) γ but dictate the existence of a sequence of ∼Cn =
(n, Mn ) block codes such that i) lim inf n→∞ (1/n) log Mn ≥ R and ii) lim supn→∞ Pe ( ∼Cn ) ≤ ε
for ε-achievability and lim inf n→∞ Pe ( ∼Cn ) ≤ ε for optimistic ε-achievability (cf. Definitions 4.10
and 5.6). Notably, by adopting the definitions in [42] and [13], one can only obtain (cf. [11,
Part I])
1 − H̄1−ε(X) ≤ Cε ≤ 1 − lim_{δ↑(1−ε)} H̄δ(X),
and
1 − H̲1−ε(X) ≤ C̄ε ≤ 1 − lim_{δ↑(1−ε)} H̲δ(X).
Our definitions accordingly provide simpler equality formulas for ε-capacity and optimistic
ε-capacity.
and
C̄ε = 1 − lim_{δ↑(1−ε)} H̲δ(X).
H̄1−ε(X) = lim_{δ↑(1−ε)} H̄δ(X) = H̲1−ε(X) = lim_{δ↑(1−ε)} H̲δ(X) = F_V^{−1}(1 − ε),
and
C = C̄ = lim_{ε↓0} [ 1 − F_V^{−1}(1 − ε) ] = 0.
Example 5.16 Let W̃1 , W̃2 , . . . consist of the channel in Example 5.14, and let
Ŵ1 , Ŵ2 , . . . consist of the channel in Example 5.15. Define a new channel W as
follows:
W2i = W̃i and W2i−1 = Ŵi for i = 1, 2, · · · .
As in the previous examples, the channel is symmetric, and a Bernoulli(1/2)
input maximizes the inf/sup-information rates. Therefore for a Bernoulli(1/2)
input X, we have
Pr[ (1/n) log( P_{W^n}(Y^n|X^n) / P_{Y^n}(Y^n) ) ≤ θ ]

= Pr[ (1/(2i)) ( log( P_{W̃^i}(Y^i|X^i) / P_{Y^i}(Y^i) ) + log( P_{Ŵ^i}(Y^i|X^i) / P_{Y^i}(Y^i) ) ) ≤ θ ],   if n = 2i;
  Pr[ (1/(2i+1)) ( log( P_{W̃^i}(Y^i|X^i) / P_{Y^i}(Y^i) ) + log( P_{Ŵ^{i+1}}(Y^{i+1}|X^{i+1}) / P_{Y^{i+1}}(Y^{i+1}) ) ) ≤ θ ],   if n = 2i + 1;

= 1 − Pr[ −(1/i) log P_{Z^i}(Z^i) < 1 − 2θ + (1/i) J(i) ],   if n = 2i;
  1 − Pr[ −(1/(i+1)) log P_{Z^{i+1}}(Z^{i+1}) < 1 − (2 − 1/(i+1)) θ + (1/(i+1)) J(i) ],   if n = 2i + 1.
The fact that −(1/i) log P_{Z^i}(Z^i) converges in distribution to the continuous random variable V := h_b(U), where U is beta-distributed with parameters (ρ/δ, (1 − ρ)/δ), and the fact that
lim inf_{n→∞} (1/n) J(n) = 1/3   and   lim sup_{n→∞} (1/n) J(n) = 2/3
imply that
i̲(θ) := lim inf_{n→∞} Pr[ (1/n) log( P_{W^n}(Y^n|X^n) / P_{Y^n}(Y^n) ) ≤ θ ] = 1 − F_V( 5/3 − 2θ ),
and
ī(θ) := lim sup_{n→∞} Pr[ (1/n) log( P_{W^n}(Y^n|X^n) / P_{Y^n}(Y^n) ) ≤ θ ] = 1 − F_V( 4/3 − 2θ ).
Consequently,
C̄ε = 5/6 − (1/2) F_V^{−1}(1 − ε)   and   Cε = 2/3 − (1/2) F_V^{−1}(1 − ε).
Thus
0 < C = 1/6 < C̄ = 1/3 < C_SC = 5/6 < log2 |Y| = 1.
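One way to recover the displayed formulas from i̲(θ) and ī(θ), assuming (as in Theorem 4.14 and its optimistic counterpart) that Cε and C̄ε are the ε-quantiles of the limsup- and liminf-spectra, is the following sketch of ours:

```latex
\bar{C}_\varepsilon
  = \sup\{\theta : \underline{i}(\theta) \le \varepsilon\}
  = \sup\{\theta : 1 - F_V(\tfrac{5}{3} - 2\theta) \le \varepsilon\}
  = \tfrac{5}{6} - \tfrac{1}{2} F_V^{-1}(1-\varepsilon),
\qquad
C_\varepsilon
  = \sup\{\theta : \bar{i}(\theta) \le \varepsilon\}
  = \tfrac{2}{3} - \tfrac{1}{2} F_V^{-1}(1-\varepsilon).
% Letting \varepsilon \downarrow 0 (so that F_V^{-1}(1-\varepsilon) \to 1 bit) gives
% C = 1/6 and \bar{C} = 1/3, while C_{SC} = \sup\{\theta : \underline{i}(\theta) < 1\} = 5/6.
```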
Bibliography
[3] P. Billingsley. Probability and Measure, 2nd edition, Wiley, New York, 1995.
[7] P.-N. Chen, “General formulas for the Neyman-Pearson type-II error ex-
ponent subject to fixed and exponential type-I error bound,” IEEE Trans.
Inf. Theory, vol. 42, no. 1, pp. 316-323, Jan 1996.
[9] P.-N. Chen and F. Alajaji, “Strong converse, feedback capacity and hy-
pothesis testing,” Proc. Conf. Inf. Sciences Systems, Johns Hopkins Univ.,
Baltimore, Mar. 1995.
[11] P.-N. Chen and F. Alajaji, “Generalized source coding theorems and hy-
pothesis testing (Parts I and II),” J. Chin. Inst. Eng., vol. 21, no. 3, pp. 283-
303, May 1998.
[12] P.-N. Chen and F. Alajaji, “On the optimistic capacity of arbitrary chan-
nels,” in Proc. IEEE Int. Symp. Inf. Theory, Cambridge, MA, Aug. 1998.
[13] P.-N. Chen and F. Alajaji, “Optimistic Shannon coding theorems for arbi-
trary single-user systems,” IEEE Trans. Inf. Theory, vol. 45, no. 7, pp. 2623-
2629, Nov. 1999.
[14] P.-N. Chen and F. Alajaji, “A generalized Poor-Verdú error bound for mul-
tihypothesis testing,” IEEE Trans. Inf. Theory, vol. 58, no. 1, pp. 311-316,
Jan. 2012.
[15] P.-N. Chen and A. Papamarcou, “New asymptotic results in parallel dis-
tributed detection,” IEEE Trans. Inf. Theory, vol. 39, no. 6, pp. 1847-1863,
Nov. 1993.
[16] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete
Memoryless Systems, Academic Press, New York, 1981.
[17] J.-D. Deuschel and D. W. Stroock, Large Deviations, Academic Press, San
Diego, 1989.
[19] T. Ericson and V. A. Zinoviev, “An improvement of the Gilbert bound for
constant weight codes,” IEEE Trans. Inf. Theory, vol. 33, no. 5, pp. 721-723,
Sep. 1987.
[21] J. A. Fill and M. J. Wichura, “The convergence rate for the strong law
of large numbers: General lattice distributions,” Probab. Th. Rel. Fields,
vol. 81, pp. 189-212, 1989.
[23] G. van der Geer and J. H. van Lint. Introduction to Coding Theory and
Algebraic Geometry. Birkhauser, Basel, 1988.
[25] T. S. Han, Information-Spectrum Methods in Information Theory, Springer,
2003.
[26] T. S. Han and S. Verdú, “Approximation theory of output statistics,” IEEE
Trans. Inf. Theory, vol. 39, no. 3, pp. 752-772, May 1993.
[27] A. N. Kolmogorov and S. V. Fomin. Introductory Real Analysis. Dover Pub-
lications, New York, 1970.
[28] D. E. Knuth and A. C. Yao, “The complexity of random number gener-
ation,” in Proc. Symp. Algorithms and Complexity: New Directions and
Recent Results, Academic Press, New York, 1976.
[29] J. H. van Lint, Introduction to Coding Theory. 2nd edition, Springer-Verlag,
New York, 1992.
[30] S. N. Litsyn and M. A. Tsfasman, “A note on lower bounds,” IEEE Trans.
Inf. Theory, vol. 32, no. 5, pp. 705-706, Sep. 1986.
[31] J. K. Omura, “On general Gilbert bounds,” IEEE Trans. Inf. Theory,
vol. 19, no. 5, pp. 661-666, Sep. 1973.
[32] M. S. Pinsker, Information and Information Stability of Random Variables
and Processes, Holden-Day, 1964.
[33] G. Polya, “Sur quelques points de la théorie des probabilités,” Ann. Inst.
H. Poincaré, vol. 1, pp. 117-161, 1931.
[34] H. V. Poor and S. Verdú, “A lower bound on the probability of error in
multihypothesis testing,” IEEE Trans. Inf. Theory, vol. 41, no. 6, pp. 1992-
1994, Nov. 1995.
[35] H. L. Royden. Real Analysis, 3rd edition, Macmillan, New York, 1988.
[36] Y. Steinberg and S. Verdú, “Simulation of random processes and rate-
distortion theory,” IEEE Trans. Inf. Theory, vol. 42, no. 1, pp. 63-86,
Jan. 1996.
[37] Y. Steinberg, “New converses in the theory of identification via channels,”
IEEE Trans. Inf. Theory, vol. 44, no. 3, pp. 984-998, May 1998.
[38] M. A. Tsfasman and S. G. Vladut. Algebraic-Geometric Codes, Kluwer Aca-
demic, Netherlands, 1991.
[39] J. N. Tsitsiklis, “Decentralized detection by a large number of sensors,”
Mathematics of Control, Signals and Systems, vol. 1, no. 2, pp. 167-182,
1988.
[40] G. Vazquez-Vilar, A. Tauste Campo, A. Guillen i Fabregas, and A. Mar-
tinez, “Bayesian M-ary hypothesis testing: The meta-converse and Verdú-
Han bounds are tight,” IEEE Trans. Inf. Theory, vol. 62, no. 5, pp. 2324-
2333, May 2016.
[42] S. Verdú and T. S. Han, “A general formula for channel capacity,” IEEE
Trans. Inf. Theory, vol. 40, no. 4, pp. 1147-1157, Jul. 1994.
[44] V. A. Zinoviev and S. N. Litsyn, “Codes that exceed the Gilbert bound,”
Probl. Inf. Transm., vol. 21, no. 1, pp. 105-108, 1985.
Index
M-type, 35 capacity, 59
Algorithm for 3-type, 36 ε-capacity, 70
Algorithm for 4-type, 35 definition, 70
δ-inf-divergence rate, 10 Example, 73
δ-inf-entropy rate, 9 for arbitrary channels, 58
δ-inf-information rate, 9 general formula, 72
δ-sup-divergence rate, 10 optimistic, 89
δ-sup-entropy rate, 9 strong, 70, 73
δ-sup-information rate, 9 channel capacity, 59
ε-achievable data compression rate, 24 channels
ε-achievable rate, 70 general models, 57
optimistic, 88
ε-achievable resolution, 37 data processing lemma, 14
ε-achievable resolution rate, 37 data transmission code
ε-achievable source coding rate fixed length, 59
optimistic, 84 entropy, 1
ε-capacity, 70 entropy rate, 1
Example, 76 entropy rate, 1
optimistic, 88, 89 inf-entropy rate, 9
theorem, 71 sup-entropy rate, 9
ε-mean-achievable resolution rate, 38
ε-mean-resolvability, 39 Feinstein’s lemma, 60
for input and channel, 80 fixed-length
ε-resolvability, 37 data transmission code, 59
for input and channel, 79
ε-source compression rate general lossless data compression theo-
fixed-length codes, 48 rem, 23
general optimistic source coding theo-
AEP, 24 rem, 84
generalized, 30 general source coding theorem, 26
arbitrary statistics system, 1 optimistic, 84
arbitrary system with memory, 2 generalized AEP theorem, 31
average probability of error, 59 generalized divergence measure, 12
average-case complexity for generating generalized entropy measure, 10
a random variable, 35 generalized information measure, 1, 8
δ-inf-divergence rate, 10 optimistic capacity, 89
δ-inf-entropy rate, 9 information stable channel, 89
δ-inf-information rate, 9 optimistic source coding rate, 84
δ-sup-divergence rate, 10 information stable source, 86
δ-sup-entropy rate, 9
δ-sup-information rate, 9 Polya-contagion channel, 92
examples, 17 Poor-Verdú lemma
normalized entropy density, 8 example, 67, 68
normalized information density, 9 generalized, 63
normalized log-likelihood ratio, 10 probability of an event evaluated under
properties, 11 a distribution, iii
generalized mutual information measure, quantile, 2, 3, 5
11 properties, 5
inf-entropy rate, 9 resolution, 35
inf-information rate, 58 ε-achievable resolution rate, 37
inf-spectrum, 58 ε-mean-achievable resolution rate,
information density, 58 38
information stable channels, 89 resolvability, 30, 34, 38
information stable sources, 28, 86 ε-mean-resolvability, 39
information unstable channels, 92 ε-resolvability, 37
liminf in probability, 4 equality to sup-entropy rate, 41
limsup in probability, 5 equivalence to strong capacity, 81
for channel, 80
mean-resolvability, 39 for input and channel, 80
equivalence to capacity, 81 mean-resolvability, 39
for channel, 80 minimum source coding rate for fixed-
for input and channel, 80 length codes, 49
operational meanings, 39 operational meanings, 39
variable-length codes source coding, 47
minimum source coding rate, 51
source compression rate
normalized entropy density, 8 ε-source compression rate, 48
normalized information density, 9 fixed-length codes, 48
normalized log-likelihood ratio, 10 variable-length codes, 48
source-channel separation theorem, 83
optimality of independent inputs for δ- spectrum, 2
inf-information rate, 14 inf-spectrum, 2, 58
optimistic ε-achievable rate, 88 sup-spectrum, 2, 58
optimistic ε-achievable source coding rate,stationary-ergodic source, 29, 33
84 strong capacity, 70, 73
optimistic ε-capacity, 88, 89 Example, 75
strong converse theorem, 29
sup-entropy rate, 9
sup-information rate, 58
sup-spectrum, 58
variational distance, 36
bound for entropy difference, 45
bound for log-likelihood ratio spec-
trum, 40