
Information Theory

Lecture 5: Channel Capacity

Basak Guler

1 / 51
Channel Capacity
• In previous lectures, we have learned how to efficiently compress the
outputs of a given source, and the fundamental limits of compression.

• Next, we will learn the fundamental limits of the amount of information we can transfer to a receiver through a “channel”.

2 / 51
Discrete Memoryless Channel (DMC)
• Definition 26. A discrete channel is a system consisting of an input
alphabet X , an output alphabet Y, and a conditional PMF p(y|x), which
represents the probability of observing y when x is sent. The channel
is memoryless if the output depends only on the present input (and is
independent of all previous inputs and outputs, given the present
input). The discrete memoryless channel is abbreviated as DMC.

3 / 51
Channel Capacity
• Definition 27. The “information” channel capacity of a DMC is defined
as:
C = max_{p(x)} I(X; Y)

where the maximum is taken over all possible input distributions.

• Soon we will get to an “operational” definition of channel capacity, by establishing this quantity as the highest rate (in bits per channel use) at which reliable communication can be accomplished.
• Consequently we will drop the term “information” and call this simply
the channel capacity.

4 / 51
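As a concrete illustration of Definition 27 (not part of the original slides), the sketch below computes the quantity being maximized, I(X; Y) = Σ_{x,y} p(x) p(y|x) log2 [ p(y|x) / p(y) ], for a given input PMF p(x) and channel matrix p(y|x). The 2x2 channel matrix used in the demo is an arbitrary illustrative choice.

import numpy as np

def mutual_information(p_x, p_y_given_x):
    """I(X; Y) in bits for an input PMF p_x (length |X|) and a
    channel matrix p_y_given_x (shape |X| x |Y|, rows summing to 1)."""
    p_x = np.asarray(p_x, dtype=float)
    W = np.asarray(p_y_given_x, dtype=float)
    p_xy = p_x[:, None] * W            # joint PMF p(x, y)
    p_y = p_xy.sum(axis=0)             # output PMF p(y)
    mask = p_xy > 0                    # skip zero-probability terms (0 log 0 = 0)
    ratio = p_xy[mask] / (p_x[:, None] * p_y[None, :])[mask]
    return float(np.sum(p_xy[mask] * np.log2(ratio)))

# Illustrative noisy binary channel with a uniform input distribution.
W = [[0.9, 0.1],
     [0.2, 0.8]]
print(mutual_information([0.5, 0.5], W))   # I(X; Y) for this particular p(x)

The capacity would then be the maximum of this quantity over all input PMFs p(x).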
Noiseless Binary Channel
• Noiseless binary channel is defined as Y = X with X = Y = {0, 1}.

(Channel diagram: input 0 is always received as 0, and input 1 is always received as 1.)
• The capacity is:

C = max_{p(x)} I(X; Y) = max_{p(x)} I(X; X) = max_{p(x)} (H(X) − H(X|X)) = max_{p(x)} H(X)

since H(X|X) = 0.

• Recall that the entropy H(X) is maximized by the uniform distribution p(x) = (1/2, 1/2), so

C = max_{p(x)} H(X) = log |X| = 1 bit

which is consistent with our intuition.


5 / 51
Binary Symmetric Channel
• The binary symmetric channel is characterized by the conditional probability:

p(y|x) = 1 − p if y = x, and p(y|x) = p if y ≠ x,   (1)

where X = Y = {0, 1}.
(Channel diagram: 0 → 0 and 1 → 1 each with probability 1 − p; 0 → 1 and 1 → 0 each with probability p.)

Channel matrix p(y|x):
             y = 0     y = 1
    x = 0    1 − p     p
    x = 1    p         1 − p

• An error occurs with probability p, also called the bit flip/cross-over probability.

6 / 51
Binary Symmetric Channel (BSC)
• Then,

I(X; Y) = H(Y) − H(Y|X)
        = H(Y) − Σ_{x∈X} p(x) H(Y|X = x)
        = H(Y) − Σ_{x∈X} p(x) H(p)
        = H(Y) − H(p)
        ≤ 1 − H(p)

where H(p) = −p log p − (1 − p) log(1 − p) is the binary entropy function, and H(Y|X = x) = H(p) for each x.

• Note that H(Y) = 1 when p(y) = (1/2, 1/2), which can be achieved by choosing p(x) = (1/2, 1/2). Therefore, the capacity is

C = max_{p(x)} I(X; Y) = 1 − H(p)

7 / 51
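As a numerical sanity check of C = 1 − H(p) (a sketch, not from the slides; the crossover probability p = 0.1 and the grid over input distributions are illustrative choices), a brute-force search over Bernoulli(q) inputs recovers the same value, with the maximum at q = 1/2:

import numpy as np

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bsc_mutual_information(q, p):
    """I(X; Y) for a BSC with crossover p and input PMF P[X = 1] = q."""
    p_y1 = q * (1 - p) + (1 - q) * p                   # P[Y = 1]
    return binary_entropy(p_y1) - binary_entropy(p)    # H(Y) - H(Y|X)

p = 0.1
best = max(bsc_mutual_information(q, p) for q in np.linspace(0, 1, 1001))
print(best, 1 - binary_entropy(p))   # both ~0.531 bits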
Binary Erasure Channel (BEC)
• The binary erasure channel is characterized by the conditional
probability:

p(y|x) = α if y = e, and p(y|x) = 1 − α if y = x,   (2)

where X = {0, 1} and Y = {0, 1, e}.

(Channel diagram: input x ∈ {0, 1} is received as x with probability 1 − α and as the erasure symbol e with probability α.)

• In other words, the transmitted bit gets erased with probability α.

8 / 51
Binary Erasure Channel (BEC)
• Then,

I(X; Y) = H(X) − H(X|Y)
        = H(X) − ( H(X|Y = e) P[Y = e] + H(X|Y = 0) P[Y = 0] + H(X|Y = 1) P[Y = 1] )
        = H(X) − ( H(X) · α + 0 + 0 )
        = (1 − α) H(X)

where H(X|Y = e) = H(X), P[Y = e] = α, and H(X|Y = 0) = H(X|Y = 1) = 0. H(X) is maximized by p(x) = (1/2, 1/2).

• Then, the capacity is

C = max_{p(x)} I(X; Y) = 1 − α

• Intuitively, this means that an α fraction of the bits gets “lost”.

9 / 51
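A quick numerical check of C = 1 − α (a sketch; α = 0.25 and the input-distribution grid are illustrative choices). It uses the identity I(X; Y) = (1 − α) H(X) derived above, with H(X) the binary entropy of the input PMF P[X = 1] = q:

import numpy as np

def binary_entropy(q):
    return 0.0 if q in (0.0, 1.0) else -q * np.log2(q) - (1 - q) * np.log2(1 - q)

alpha = 0.25
# Sweep the input PMF P[X = 1] = q; the maximum of (1 - alpha) * H(q)
# is attained at q = 1/2 and equals 1 - alpha.
capacity = max((1 - alpha) * binary_entropy(q) for q in np.linspace(0, 1, 1001))
print(capacity, 1 - alpha)   # both 0.75 bits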
Channel Coding Theorem

(Block diagram: the source produces a message W; the encoder maps W to X^n; the channel p(y|x) outputs Y^n; the decoder produces the estimate Ŵ.)

A mathematical model of a physical communication system consists of:

• Message W drawn from a set {1, . . . , M}.


• Encoding function X^n : {1, . . . , M} → X^n that maps the messages W to codewords X^n(W). The set of all codewords is called a codebook.
• A channel p(y^n | x^n) through which the codeword X^n is sent, and which is then received as Y^n.
• Decoding function g : Y^n → {1, . . . , M} that maps the channel output Y^n to a guess Ŵ = g(Y^n) of the original message W.
• An error occurs if the guess is wrong, i.e., Ŵ ≠ W.

10 / 51
Channel Coding Theorem - Definitions
• Definition 28. Recall that a discrete channel denoted by
(X , p(y|x), Y) consists of an input alphabet X , output alphabet Y, and
conditional PMF (channel transition matrix) p(y |x), where x is the
input and y is the output of the channel.

11 / 51
Channel Coding Theorem - Definitions
• Definition 29. The nth extension of a discrete memoryless channel
(DMC) is defined as (X n , p(y n |x n ), Y n ), where

p(y_k | x^k, y^{k−1}) = p(y_k | x_k),   k = 1, . . . , n   (memoryless)

Assuming no feedback, i.e., the input symbols do not depend on the past output symbols,

p(x_k | x^{k−1}, y^{k−1}) = p(x_k | x^{k−1}),

the nth extension of the DMC becomes

p(y^n | x^n) = p(y^n, x^n) / p(x^n)
             = [ ∏_{k=1}^{n} p(y_k, x_k | y^{k−1}, x^{k−1}) ] / p(x^n)
             = [ ∏_{k=1}^{n} p(y_k | x_k) p(x_k | x^{k−1}) ] / p(x^n)
             = ∏_{k=1}^{n} p(y_k | x_k)

12 / 51
Channel Coding Theorem - Definitions
Definition 30. An (M, n) code for the channel (X , p(y |x), Y) consists of:
• An index set {1, . . . , M}.
• An encoding function X^n : {1, . . . , M} → X^n that maps the messages {1, . . . , M} to codewords x^n(1), . . . , x^n(M):

MESSAGES          CODEBOOK
W = 1   →   x^n(1) = (x_1(1), . . . , x_n(1))
W = 2   →   x^n(2) = (x_1(2), . . . , x_n(2))
...
W = M   →   x^n(M) = (x_1(M), . . . , x_n(M))

• A deterministic decoding function g : Y^n → {1, . . . , M}.

13 / 51
Channel Coding Theorem - Definitions
• Definition 31. Probability of error (conditional):

λ_i = P[g(Y^n) ≠ i | X^n = x^n(i)]

denotes the probability of guessing the wrong message given that message i was sent.

• Definition 32. Maximal probability of error:

λ^(n) = max_{i=1,...,M} λ_i

• Definition 33. The average (arithmetic) probability of error for an (M, n) code is given by:

P_e^(n) = (1/M) Σ_{i=1}^{M} λ_i

• Note that by definition P_e^(n) ≤ λ^(n).

• If W is distributed uniformly at random over {1, . . . , M}, then

P_e^(n) = P[g(Y^n) ≠ W]

14 / 51
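A tiny numerical illustration of Definitions 32-33 (the λ_i values below are made up for the example): the average error probability never exceeds the maximal one.

# Hypothetical per-message error probabilities lambda_i for an (M, n) code with M = 4.
lambdas = [0.01, 0.03, 0.02, 0.10]

P_e_avg = sum(lambdas) / len(lambdas)   # Definition 33: average probability of error
lam_max = max(lambdas)                  # Definition 32: maximal probability of error
print(P_e_avg, lam_max, P_e_avg <= lam_max)   # 0.04 0.1 True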
Channel Coding Theorem - Definitions
• Definition 34. The rate R of an (M, n) code is defined as:

R = (log M) / n   bits per transmission

• Definition 35. A rate R is said to be achievable if there exists a sequence of (⌈2^{nR}⌉, n) codes such that λ^(n) → 0 as n → ∞.

• We want to find the best possible performance (highest rate R) for a given noisy communication channel.

15 / 51
Channel Coding Theorem - Definitions

• Recall the definition of channel capacity C = max_{p(x)} I(X; Y).

• Now we are ready to prove the channel coding theorem:

• Theorem 34 (Channel coding theorem). For a discrete memoryless channel, any rate below the channel capacity C is achievable. That is, for any rate R < C, there exists a sequence of (2^{nR}, n) codes with maximal probability of error λ^(n) → 0 as n → ∞. Conversely, for any R > C, no sequence of (2^{nR}, n) codes can have λ^(n) → 0.

16 / 51
Channel Coding Theorem - Definitions

• This theorem consists of two parts:

1. Achievability: All rates below C are achievable. That is, if R < C, then we can design (2^{nR}, n) codes such that λ^(n) (and hence P_e^(n)) → 0 as n → ∞.

2. Converse: No rate above C is achievable. That is, if R > C, then for all (2^{nR}, n) codes, P_e^(n) (and hence λ^(n)) is bounded away from 0.

• In both the achievability and converse, we will assume the message W is distributed uniformly at random over {1, . . . , 2^{nR}}.

• This way we can characterize the behavior of the channel (i.e., the error probability P_e^(n)) irrespective of the distribution of the messages.

• This is arguably the most important theorem in information theory.

17 / 51
Channel Coding Theorem - Converse
• We will start by proving the converse.
• Recall Fano’s inequality: we observe Y and want to guess X. Let the guess be X̂ = g(Y) and P_e = P[X̂ ≠ X]. Then,

H(P_e) + P_e log(|X| − 1) ≥ H(X|Y)   (3)

which can be relaxed to

1 + P_e log |X| ≥ H(X|Y)   (4)

• Note that Fano’s inequality does not assume any particular decoding/guessing method.

18 / 51
Channel Coding Theorem - Converse
• In channel coding, we estimate W from Y^n through a decoding function Ŵ = g(Y^n). The alphabet of W is {1, . . . , 2^{nR}}.

• If the messages W ∈ {1, . . . , 2^{nR}} are equally likely, P_e^(n) = P[W ≠ Ŵ].

• Then, Fano’s inequality says that

H(W | Y^n) ≤ 1 + P_e^(n) log(2^{nR})   (5)

in other words,

H(W | Y^n) ≤ 1 + P_e^(n) nR   (6)

19 / 51
Channel Coding Theorem - Converse
• Since we assume the messages W ∈ {1, . . . , 2^{nR}} are equally likely, H(W) = log 2^{nR} = nR. Then,

nR = H(W)                                                            (W has uniform distribution)   (7)
   = H(W | Y^n) + I(W; Y^n)                                          (by definition)                (8)
   ≤ H(W | Y^n) + I(X^n; Y^n)                                        (data-processing inequality)   (9)
   ≤ 1 + P_e^(n) nR + I(X^n; Y^n)                                    (Fano's inequality)            (10)
   = 1 + P_e^(n) nR + H(Y^n) − H(Y^n | X^n)                          (by definition)                (11)
   = 1 + P_e^(n) nR + H(Y^n) − Σ_{i=1}^{n} H(Y_i | Y_{i−1}, . . . , Y_1, X^n)   (chain rule)
   = 1 + P_e^(n) nR + H(Y^n) − Σ_{i=1}^{n} H(Y_i | X_i)              (DMC)                          (12)
   ≤ 1 + P_e^(n) nR + Σ_{i=1}^{n} H(Y_i) − Σ_{i=1}^{n} H(Y_i | X_i)  (conditioning)                 (13)

20 / 51
Channel Coding Theorem - Converse
(continued)

   = 1 + P_e^(n) nR + Σ_{i=1}^{n} I(X_i; Y_i)                        (by definition)                (14)
   ≤ 1 + P_e^(n) nR + nC                                             (15)

• Equation (9) follows from the data-processing inequality (DPI), since W − X^n − Y^n forms a Markov chain.
• Equation (10) follows from Fano’s inequality (6).
• Equation (12) follows from the fact that p(y^n | x^n) = ∏_{i=1}^{n} p(y_i | x_i) for the discrete memoryless channel (without feedback).
• Equation (13) holds by the independence bound on entropy (Theorem 10).
• Finally, (15) follows from the fact that I(X_i; Y_i) ≤ C, since C = max_{p(x)} I(X; Y).

21 / 51
Channel Coding Theorem - Converse
• Dividing both sides of equation (15) by nR,

P_e^(n) ≥ 1 − C/R − 1/(nR)   (16)

where the term 1/(nR) → 0 as n → ∞.

• If R > C, then

P_e^(n) ≥ 1 − C/R > 0   (17)

and P_e^(n) is bounded away from zero.

22 / 51
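A worked example of the converse bound (a sketch with illustrative numbers, not from the slides): for a BSC with crossover probability 0.11, C = 1 − H(0.11) ≈ 0.5 bits, so any code attempting rate R = 0.75 must, by (17), have error probability at least about 1/3 for large n.

import numpy as np

def binary_entropy(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

C = 1 - binary_entropy(0.11)   # capacity of BSC(0.11), ~0.5 bits
R = 0.75                       # attempted rate above capacity
lower_bound = 1 - C / R        # bound (17), ignoring the vanishing 1/(nR) term
print(C, lower_bound)          # ~0.50, ~0.33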
Channel Coding Theorem - Achievability
• So far we have introduced the channel capacity and proved the
converse part of the channel coding theorem. We have covered
Sections 7.1-7.5 and 7.9.

• That is, no code can achieve a rate of R > C with probability of error
going to zero.

• We will now prove the achievability part of the channel coding theorem
(we will cover Sections 7.6-7.7).

• That is, for any rate R < C, there exists a deterministic code
(encoding and decoding function) that achieves a rate of R.

• For the proof, we will need to extend the notion of typicality. This
definition will be very important in proving the channel coding theorem
- we will use a decoding technique based on this notion.

23 / 51
Jointly Typical Sequences
• Recall that typical sequences x^n = (x_1, . . . , x_n) were those whose empirical entropy was close to the true entropy H(X).

• We will now extend this notion to pairs of random variables

(x^n, y^n) = ((x_1, y_1), . . . , (x_n, y_n))

• Definition 36 (Jointly Typical Sequences). The set A_ε^(n) of jointly typical sequences with respect to a joint PMF p(x, y) is the set of all sequences (x^n, y^n) whose empirical entropies are ε-close to the true entropies:

A_ε^(n) = { (x^n, y^n) ∈ X^n × Y^n :
    | −(1/n) log p(x^n) − H(X) | < ε,        (18)
    | −(1/n) log p(y^n) − H(Y) | < ε,        (19)
    | −(1/n) log p(x^n, y^n) − H(X, Y) | < ε }   (20)

where p(x^n, y^n) = ∏_{i=1}^{n} p(x_i, y_i).
24 / 51
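A minimal sketch of testing Definition 36 empirically (the joint PMF, blocklength n, ε, and the random seed are all arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)

# Illustrative joint PMF p(x, y) on X = {0, 1}, Y = {0, 1}.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))   # entropy in bits
H_X, H_Y, H_XY = H(p_x), H(p_y), H(p_xy.ravel())

def jointly_typical(xs, ys, eps):
    """Check conditions (18)-(20) for the pair (x^n, y^n)."""
    n = len(xs)
    log_px = np.sum(np.log2(p_x[xs]))
    log_py = np.sum(np.log2(p_y[ys]))
    log_pxy = np.sum(np.log2(p_xy[xs, ys]))
    return (abs(-log_px / n - H_X) < eps and
            abs(-log_py / n - H_Y) < eps and
            abs(-log_pxy / n - H_XY) < eps)

# Draw (x^n, y^n) i.i.d. from p(x, y) and test joint typicality.
n, eps = 2000, 0.05
flat = rng.choice(4, size=n, p=p_xy.ravel())   # indices 0..3 encode (x, y) pairs
xs, ys = flat // 2, flat % 2
print(jointly_typical(xs, ys, eps))   # True with high probability for large n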
Big Picture
• All y^n sequences: |Y|^n = 2^{n log |Y|}, of which about 2^{nH(Y)} are typical.
• All x^n sequences: |X|^n = 2^{n log |X|}, of which about 2^{nH(X)} are typical.
• About 2^{nH(X,Y)} pairs (x^n, y^n) are jointly typical.

25 / 51
Jointly Asymptotic Equipartition Property (Joint AEP)
• We will now extend the notion of Asymptotic Equipartition Property
(AEP) to jointly typical sequences.

• Definition 37 (Joint AEP). Let (X^n, Y^n) be drawn i.i.d. according to the distribution p(x^n, y^n) = ∏_{i=1}^{n} p(x_i, y_i). Then,

1. lim_{n→∞} P[(X^n, Y^n) ∈ A_ε^(n)] = 1   (21)

2. (1 − ε) 2^{n(H(X,Y)−ε)} ≤ |A_ε^(n)| ≤ 2^{n(H(X,Y)+ε)}   (22)

3. If (X̃^n, Ỹ^n) ∼ p(x^n) p(y^n), i.e., X̃^n and Ỹ^n are independent with marginal PMFs p(x^n) and p(y^n) obtained from the joint PMF p(x^n, y^n), then:

(1 − ε) 2^{−n(I(X;Y)+3ε)} ≤ P[(X̃^n, Ỹ^n) ∈ A_ε^(n)] ≤ 2^{−n(I(X;Y)−3ε)}   (23)

26 / 51
Jointly Asymptotic Equipartition Property (Joint AEP)
Proof.

• Parts 1, 2: Similar to the original AEP.


• Part 3: Here X̃^n and Ỹ^n are independent and have the same marginals as X^n and Y^n. Then,

P[(X̃^n, Ỹ^n) ∈ A_ε^(n)] = Σ_{(x̃^n, ỹ^n) ∈ A_ε^(n)} p(x̃^n, ỹ^n)
                         = Σ_{(x̃^n, ỹ^n) ∈ A_ε^(n)} p(x̃^n) p(ỹ^n)
                         = Σ_{(x^n, y^n) ∈ A_ε^(n)} p(x^n) p(y^n)

27 / 51
Jointly Asymptotic Equipartition Property (Joint AEP)
Proof (continued).

• When (x^n, y^n) ∈ A_ε^(n), we know that:

2^{−n(H(X)+ε)} ≤ p(x^n) ≤ 2^{−n(H(X)−ε)}   (24)

and

2^{−n(H(Y)+ε)} ≤ p(y^n) ≤ 2^{−n(H(Y)−ε)}   (25)

• So:

P[(X̃^n, Ỹ^n) ∈ A_ε^(n)] ≤ Σ_{(x^n, y^n) ∈ A_ε^(n)} 2^{−n(H(X)−ε)} 2^{−n(H(Y)−ε)}
                         = |A_ε^(n)| 2^{−n(H(X)−ε)} 2^{−n(H(Y)−ε)}
                         ≤ 2^{n(H(X,Y)+ε)} 2^{−n(H(X)−ε)} 2^{−n(H(Y)−ε)}   (from part 2)
                         = 2^{−n(H(Y) − (H(X,Y) − H(X)) − 3ε)}
                         = 2^{−n(I(X;Y)−3ε)}   (since H(X,Y) − H(X) = H(Y|X) and H(Y) − H(Y|X) = I(X;Y))

28 / 51
Jointly Asymptotic Equipartition Property (Joint AEP)
Proof (continued).

• Similarly,

P[(X̃^n, Ỹ^n) ∈ A_ε^(n)] = Σ_{(x^n, y^n) ∈ A_ε^(n)} p(x^n) p(y^n)
                         ≥ |A_ε^(n)| 2^{−n(H(X)+ε)} 2^{−n(H(Y)+ε)}
                         ≥ (1 − ε) 2^{n(H(X,Y)−ε)} 2^{−n(H(X)+ε)} 2^{−n(H(Y)+ε)}   (from part 2)
                         = (1 − ε) 2^{−n(I(X;Y)+3ε)}

which concludes the proof.

29 / 51
Jointly Asymptotic Equipartition Property (Joint AEP)
Basically, this theorem says that:

• The collective probability of the jointly typical set is “about” 1.

• There are about 2^{nH(X,Y)} jointly typical sequences, and they are almost equally likely, i.e.,

p(x^n, y^n) ≈ 2^{−nH(X,Y)} if (x^n, y^n) ∈ A_ε^(n)

• A pair (x̃^n, ỹ^n) drawn independently from the marginals is jointly typical with probability ≈ 2^{−nI(X;Y)}.

30 / 51
Achievability of Channel Coding Theorem
• We are now ready to prove the achievability part of the channel coding theorem.

• That is, for any R < C, there exists a (2^{nR}, n) code such that the maximal probability of error λ^(n) → 0 as n → ∞.

31 / 51
Achievability of Channel Coding Theorem
• First, fix p(x). Then, generate a (2^{nR}, n) code at random according to the distribution p(x). That is, generate 2^{nR} codewords independently according to:

p(x^n) = ∏_{i=1}^{n} p(x_i)   (26)

• The codebook can be expressed in matrix form, where each row represents a codeword:

C = [ x_1(1)       . . .   x_n(1)
      ...                  ...
      x_1(2^{nR})  . . .   x_n(2^{nR}) ]   (27)

• (x_1(1), . . . , x_n(1)) represents the codeword for message 1 (codeword 1), and (x_1(2^{nR}), . . . , x_n(2^{nR})) represents the codeword for message 2^{nR} (codeword 2^{nR}).

32 / 51
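A minimal sketch of the random codebook construction in (26)-(27) (the alphabet, p(x), blocklength n, and rate R below are illustrative choices):

import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters: binary alphabet, uniform p(x), blocklength n, rate R.
n, R = 10, 0.4
M = int(2 ** (n * R))        # number of messages, 2^{nR} (an integer here since nR = 4)
p_x = [0.5, 0.5]

# Each of the M * n entries is drawn i.i.d. from p(x); row w is the codeword for message w+1.
codebook = rng.choice(len(p_x), size=(M, n), p=p_x)
print(codebook.shape)        # (16, 10): one row per message, as in the matrix (27)
print(codebook[0])           # the codeword assigned to message W = 1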
Achievability of Channel Coding Theorem
• Note that each entry of the codebook C is generated i.i.d. according to p(x).

• Then, the probability that we generate a particular code is:

P(C) = ∏_{w=1}^{2^{nR}} ∏_{i=1}^{n} p(x_i(w))   (28)

where w is the message index.

• This technique is called random coding. This is just a proof technique, and not a method we use for actual transmission.

• It just simplifies the analysis needed to show that a good deterministic code exists.

33 / 51
Achievability of Channel Coding Theorem
• Now, consider the following sequence of events:
1. Generate a random codebook C.
2. The code is revealed to the sender and receiver. They also both know the channel transition matrix p(y|x).
3. Choose a message W according to a (discrete) uniform distribution:

   P[W = w] = 2^{−nR},   w = 1, . . . , 2^{nR}   (29)

4. Send the corresponding codeword X^n(w) through the channel.
5. The receiver receives a sequence Y^n with probability:

   p(y^n | x^n(w)) = ∏_{i=1}^{n} p(y_i | x_i(w))   (30)

6. The receiver guesses the message. To do so, it declares that Ŵ was sent if the following two conditions are satisfied (see the sketch after this slide):
   a) (X^n(Ŵ), Y^n) is jointly typical.
   b) There is no other W′ ≠ Ŵ such that (X^n(W′), Y^n) is jointly typical.
7. An error occurs if Ŵ ≠ W. Let E be the event {Ŵ ≠ W}.

34 / 51
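Below is a minimal end-to-end sketch of steps 1-7 with joint-typicality decoding over a BSC (everything here, the crossover probability 0.1, n = 200, R = 0.05, ε = 0.15, and the seed, is an illustrative choice, not from the slides; the typicality test is specialized to the uniform-input BSC, and a convincing demonstration needs much larger n):

import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: BSC(p) with uniform inputs, so I(X; Y) = 1 - H(p) ~ 0.53 bits.
p, n, R, eps = 0.1, 200, 0.05, 0.15
M = int(2 ** (n * R))                            # 2^{nR} = 1024 messages
Hb = lambda q: -q * np.log2(q) - (1 - q) * np.log2(1 - q)
H_XY = 1.0 + Hb(p)                               # H(X, Y) = H(X) + H(Y|X) for uniform X

def jointly_typical(x, y):
    """Conditions (18)-(20), specialized: for uniform inputs, -(1/n) log p(x^n)
    and -(1/n) log p(y^n) both equal 1 exactly, so only condition (20) can fail."""
    flips = np.count_nonzero(x != y)
    emp_H_XY = 1.0 - (flips * np.log2(p) + (n - flips) * np.log2(1 - p)) / n
    return abs(emp_H_XY - H_XY) < eps

# Steps 1-2: random codebook, known to sender and receiver.
codebook = rng.integers(0, 2, size=(M, n))
# Steps 3-5: send codeword 1 (WLOG W = 1) through the BSC.
x = codebook[0]
y = (x + (rng.random(n) < p)) % 2
# Step 6: declare w_hat only if it is the unique jointly typical codeword.
hits = [w for w in range(M) if jointly_typical(codebook[w], y)]
w_hat = hits[0] if len(hits) == 1 else None
# Step 7: the error event E is {w_hat != W}.
print("decoded correctly:", w_hat == 0)          # True on most runs with these parameters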
Achievability of Channel Coding Theorem
• This decoding rule is called “joint typicality decoding”.

• Note that this decoding rule based on joint typicality is sub-optimal in terms of minimizing the probability of error.

• The rule that minimizes the probability of error is the MAP (maximum a posteriori) rule, which declares ŵ = arg max_w p(x^n(w) | y^n); this is equivalent to the ML (maximum likelihood) rule since all w's are equally likely.

• But this decoding rule (the maximum likelihood decoder) is difficult to analyze, and the decoding rule based on joint typicality will still get us the desired result.

35 / 51
Achievability of Channel Coding Theorem
• Probability of error calculation: The average probability of error over all codewords in a codebook and over all randomly generated codebooks is:

P(E) = P[Ŵ ≠ W] = Σ_C P(C) P_e^(n)(C)                           (31)
     = Σ_C P(C) (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C)              (32)
     = (1/2^{nR}) Σ_{w=1}^{2^{nR}} Σ_C P(C) λ_w(C)              (33)
     = Σ_C P(C) λ_1(C) = P(E | W = 1)                           (34)

• (∗) The inner sum Σ_C P(C) λ_w(C) in (33) does not depend on w, since we average over all possible codebooks and every codeword is generated the same way. Therefore, without loss of generality, we can assume W = 1: X^n(1) is sent and Y^n is received.
36 / 51
Achievability of Channel Coding Theorem
• Now, let’s define the events

E_i = {(X^n(i), Y^n) ∈ A_ε^(n)},   i = 1, . . . , 2^{nR}   (35)

• Hence, E_1^c and the E_j for j > 1 are all error events conditioned on W = 1, and

P(E | W = 1) = P(E_1^c ∪ E_2 ∪ E_3 ∪ . . . ∪ E_{2^{nR}} | W = 1).   (36)

• Then, using the union bound,

P(E | W = 1) ≤ P(E_1^c | W = 1) + Σ_{i=2}^{2^{nR}} P(E_i | W = 1)   (37)

37 / 51
Achievability of Channel Coding Theorem
• By the joint AEP, P(E_1^c | W = 1) → 0 as n → ∞, which means that for large enough n we have:

P(E_1^c | W = 1) ≤ ε   (38)

• Also, since the codewords X^n(1) and X^n(w) for w ≠ 1 are independently generated, Y^n is independent of X^n(w) for w ≠ 1 (because Y^n is generated from X^n(1)).

• Recall that the probability of X^n(w) being jointly typical with Y^n when they are independent is ≤ 2^{−n(I(X;Y)−3ε)} by the joint AEP (part 3).

• Then, for i ≠ 1,

P(E_i | W = 1) ≤ 2^{−n(I(X;Y)−3ε)}   (39)

38 / 51
Achievability of Channel Coding Theorem
• Therefore, the average error can be bounded by:

P(E | W = 1) ≤ ε + 2^{nR} 2^{−n(I(X;Y)−3ε)}   (41)
             = ε + 2^{−n(I(X;Y)−R−3ε)}        (42)

• When

I(X; Y) − R > 3ε   (43)

i.e.,

R < I(X; Y) − 3ε   (44)

the exponent in (42) becomes −na for some a > 0, and the whole second term in (42) is ≤ ε for large n.

• Then, the average error becomes:

P(E | W = 1) ≤ 2ε   (45)

for R < I(X; Y) − 3ε.


39 / 51
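To see the bound (41)-(42) numerically (a sketch with illustrative values I(X; Y) = 0.5 bits, R = 0.3, ε = 0.05, so that I(X; Y) − R − 3ε = 0.05 > 0):

# Evaluate the bound P(E | W = 1) <= eps + 2^{-n (I - R - 3 eps)} from (42).
I_xy, R, eps = 0.5, 0.3, 0.05
for n in (100, 500, 1000):
    bound = eps + 2 ** (-n * (I_xy - R - 3 * eps))
    print(n, bound)   # the second term decays geometrically, so the bound approaches eps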
Achievability of Channel Coding Theorem
• So far we have shown that when R < I(X; Y), the average probability of error P(E) will be ≤ 2ε.

• To finish the proof, we will show that the maximal error probability λ^(n) goes to 0 as well (not just the average error probability P(E)).

• We will do this by taking the following steps.

40 / 51
Achievability of Channel Coding Theorem
1. First, choose p(x) to be the distribution that maximizes I(X; Y). Recall that this is the definition of C. So,

p*(x) = arg max_{p(x)} I(X; Y),   (46)

where

max_{p(x)} I(X; Y) = I(X; Y)|_{p(x)=p*(x)} = C   (47)

By doing this we replace the condition R < I(X; Y) with R < C.

41 / 51
Achievability of Channel Coding Theorem
2. Now, since the average P(E) over all randomly generated codebooks is ≤ 2ε, there must be at least one codebook (let’s call it C*) whose error performance is at least as good as the average, i.e.,

P(E | C*) ≤ 2ε   (48)

where

P(E | C*) = P_e^(n)(C*) = (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C*)   (49)

since the messages w ∈ {1, . . . , 2^{nR}} are chosen uniformly at random.

42 / 51
Achievability of Channel Coding Theorem
3. Throw away the worst half of the codewords in the best codebook C*. The worst codewords are the ones with the highest probability of error λ_w. Then, since the (arithmetic) average probability of error satisfies

P_e^(n)(C*) = (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C*) ≤ 2ε,   (50)

for each of the remaining codewords we must have:

λ_w ≤ 4ε   (51)

Otherwise the average P_e^(n)(C*) would be higher than 2ε! So, the best half of the codewords have a maximal probability of error ≤ 4ε. By using only the remaining codewords, we can transmit 2^{nR}/2 messages.

43 / 51
Achievability of Channel Coding Theorem
4. So, we are now sending 2^{nR}/2 messages instead of 2^{nR} messages. This changes our rate from (log 2^{nR})/n = R to (log(2^{nR}/2))/n = R − 1/n, but now we are guaranteed a maximal error probability λ^(n) ≤ 4ε.

Note that the term 1/n vanishes as n → ∞.

This means that for any R < C, we can find a code with rate R for which λ^(n) → 0 for large enough n, which completes the proof.

44 / 51
Capacity with Feedback
• In feedback channels, the encoder observes the previous channel outputs.

(Block diagram: the encoder forms X_i = X_i(W, Y^{i−1}) from the message W and the fed-back past outputs Y^{i−1}; the channel produces Y_i; the decoder forms Ŵ = g(Y^n).)

• For the discrete memoryless channel (DMC) we have discussed here, feedback cannot increase capacity.

• For more advanced channels (channels with memory, multi-terminal channels), feedback can increase capacity.

45 / 51
Source-Channel Separation Theorem
• Recall that earlier we established the fundamental limits of data compression, and more recently we established the fundamental limits on how much information can be transferred over a given channel.

• Now we will combine the two results.

46 / 51
Source-Channel Separation Theorem
• Suppose we want to send an arbitrary source V (e.g., speech) that
generates symbols V1 , . . . , Vn , through a channel p(y|x) with capacity
C. The source has an entropy rate H(V).

• Consider the following two options:

1) Joint source-channel coding: We can map the source symbols V^n = (V_1, . . . , V_n) directly into channel inputs X^n = (X_1, . . . , X_n). Then we can estimate the original source symbols (V_1, . . . , V_n) from the channel output Y^n = (Y_1, . . . , Y_n). Let’s call the estimated symbols V̂^n = (V̂_1, . . . , V̂_n).

(Block diagram: source symbols V^n → encoder X^n(V^n) → channel p(y|x) → Y^n → decoder V̂^n = g(Y^n).)
47 / 51
Source-Channel Separation Theorem
2) Separate source and channel coding: Alternatively, we can separate source and channel coding.

Recall that we can compress the source down to rate R = (log 2^{nH(V)})/n = H(V), as we saw in the consequences of the AEP in Lecture 2. Then, we can first compress the source as efficiently as possible, and then use an optimal channel code to transfer this compressed representation, as long as R < C.

(Block diagram: V^n → source encoder → channel encoder X^n → channel p(y|x) → Y^n → channel decoder → source decoder V̂^n.)

48 / 51
Source-Channel Separation Theorem
• The following theorem says that the two options perform equally well.

• Theorem 35 (Source-Channel Coding Theorem). Consider a source represented by a stochastic process V_1, . . . , V_n drawn from a finite alphabet with entropy rate H(V). If H(V) < C, then there exists a source-channel code with probability of error P[V̂^n ≠ V^n] → 0. Conversely, we cannot send any source with H(V) > C through the channel with arbitrarily small probability of error.

• If V_1, . . . , V_n are i.i.d. with PMF p(v), then H(V) can be replaced with the entropy H(V) = −Σ_v p(v) log p(v).

49 / 51
Source-Channel Separation Theorem
Proof:

• (Achievability) From the AEP, there are about 2^{nH(V)} typical sequences for the source V_1, . . . , V_n (recall that the typical sequences are “almost” equally likely). Map each of these typical sequences to one of the “messages” of a channel code, and ignore the non-typical sequences. Use a (2^{nR}, n) channel code for sending 2^{nR} = 2^{nH(V)} messages. Then, from the channel coding theorem, as long as H(V) = R < C, the messages can be transferred with vanishing error.

• (Converse) Follows from Fano’s inequality, similarly to the converse of the channel coding theorem.

50 / 51
Source-Channel Separation Theorem
• This theorem states that joint source-channel coding does not provide
any advantage over separate source coding (compression) and
channel coding.

• Advantages:
• Greatly simplifies our system design!
• Design the best code for the source, design the best code for the
channel, and combine them together.
• Can even use the same source code with different channel codes, and
vice versa.

• Limitations:
• Requires infinite codeword lengths! The result doesn’t hold when codeword lengths are finite, and real-world codes have finite length!
• Doesn’t hold for more advanced channel models, such as
multi-terminal channels.

51 / 51
