
Information Theory

Lecture 5: Channel Capacity

Basak Guler

1 / 51
Channel Capacity
• In previous lectures, we have learned how to efficiently compress the
outputs of a given source, and the fundamental limits of compression.

• Next, we will learn the fundamental limits of the amount of information we can transfer to a receiver through a “channel”.

2 / 51
Discrete Memoryless Channel (DMC)
• Definition 26. A discrete channel is a system consisting of an input
alphabet X , an output alphabet Y, and a conditional PMF p(y|x), which
represents the probability of observing y when x is sent. The channel
is memoryless if the output depends only on the present input (and is
independent of all previous inputs and outputs, given the present
input). The discrete memoryless channel is abbreviated as DMC.

3 / 51
Channel Capacity
• Definition 27. The “information” channel capacity of a DMC is defined
as:
C = max_{p(x)} I(X; Y)

where the maximum is taken over all possible input distributions.

• Soon we will get to an “operational” definition of channel capacity, by establishing this quantity as the highest rate (in bits per channel use) at which reliable communication can be accomplished.
• Consequently we will drop the term “information” and call this simply
the channel capacity.

4 / 51
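As a concrete illustration of Definition 27 (not part of the original slides), the sketch below computes the quantity being maximized, I(X; Y) = Σ_{x,y} p(x) p(y|x) log2 [ p(y|x) / p(y) ], for a given input PMF p(x) and channel matrix p(y|x). The 2x2 channel matrix used in the demo is an arbitrary illustrative choice.

import numpy as np

def mutual_information(p_x, p_y_given_x):
    """I(X; Y) in bits for an input PMF p_x (length |X|) and a
    channel matrix p_y_given_x (shape |X| x |Y|, rows summing to 1)."""
    p_x = np.asarray(p_x, dtype=float)
    W = np.asarray(p_y_given_x, dtype=float)
    p_xy = p_x[:, None] * W            # joint PMF p(x, y)
    p_y = p_xy.sum(axis=0)             # output PMF p(y)
    mask = p_xy > 0                    # skip zero-probability terms (0 log 0 = 0)
    ratio = p_xy[mask] / (p_x[:, None] * p_y[None, :])[mask]
    return float(np.sum(p_xy[mask] * np.log2(ratio)))

# Illustrative noisy binary channel with a uniform input distribution.
W = [[0.9, 0.1],
     [0.2, 0.8]]
print(mutual_information([0.5, 0.5], W))   # I(X; Y) for this particular p(x)

The capacity would then be the maximum of this quantity over all input PMFs p(x).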
Noiseless Binary Channel
• Noiseless binary channel is defined as Y = X with X = Y = {0, 1}.

(Channel diagram: input 0 is always received as 0, and input 1 is always received as 1.)
• The capacity is:

C = max_{p(x)} I(X; Y) = max_{p(x)} I(X; X) = max_{p(x)} (H(X) − H(X|X)) = max_{p(x)} H(X)

since H(X|X) = 0.

• Recall that the entropy H(X) is maximized by the uniform distribution p(x) = (1/2, 1/2), so

C = max_{p(x)} H(X) = log |X| = 1 bit

which is consistent with our intuition.


5 / 51
Binary Symmetric Channel
• The binary symmetric channel is characterized by the conditional probability:

p(y|x) = 1 − p if y = x, and p(y|x) = p if y ≠ x,   (1)

where X = Y = {0, 1}.
(Channel diagram: 0 → 0 and 1 → 1 each with probability 1 − p; 0 → 1 and 1 → 0 each with probability p.)

Channel matrix p(y|x):
             y = 0     y = 1
    x = 0    1 − p     p
    x = 1    p         1 − p

• An error occurs with probability p, also called the bit flip/cross-over probability.

6 / 51
Binary Symmetric Channel (BSC)
• Then,

I(X; Y) = H(Y) − H(Y|X)
        = H(Y) − Σ_{x∈X} p(x) H(Y|X = x)
        = H(Y) − Σ_{x∈X} p(x) H(p)
        = H(Y) − H(p)
        ≤ 1 − H(p)

where H(p) = −p log p − (1 − p) log(1 − p) is the binary entropy function, and H(Y|X = x) = H(p) for each x.

• Note that H(Y) = 1 when p(y) = (1/2, 1/2), which can be achieved by choosing p(x) = (1/2, 1/2). Therefore, the capacity is

C = max_{p(x)} I(X; Y) = 1 − H(p)

7 / 51
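As a numerical sanity check of C = 1 − H(p) (a sketch, not from the slides; the crossover probability p = 0.1 and the grid over input distributions are illustrative choices), a brute-force search over Bernoulli(q) inputs recovers the same value, with the maximum at q = 1/2:

import numpy as np

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bsc_mutual_information(q, p):
    """I(X; Y) for a BSC with crossover p and input PMF P[X = 1] = q."""
    p_y1 = q * (1 - p) + (1 - q) * p                   # P[Y = 1]
    return binary_entropy(p_y1) - binary_entropy(p)    # H(Y) - H(Y|X)

p = 0.1
best = max(bsc_mutual_information(q, p) for q in np.linspace(0, 1, 1001))
print(best, 1 - binary_entropy(p))   # both ~0.531 bits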
Binary Erasure Channel (BEC)
• The binary erasure channel is characterized by the conditional
probability:

p(y|x) = α if y = e, and p(y|x) = 1 − α if y = x,   (2)

where X = {0, 1} and Y = {0, 1, e}.

(Channel diagram: input x ∈ {0, 1} is received as x with probability 1 − α and as the erasure symbol e with probability α.)

• In other words, the transmitted bit gets erased with probability α.

8 / 51
Binary Erasure Channel (BEC)
• Then,

I(X; Y) = H(X) − H(X|Y)
        = H(X) − ( H(X|Y = e) P[Y = e] + H(X|Y = 0) P[Y = 0] + H(X|Y = 1) P[Y = 1] )
        = H(X) − ( H(X) · α + 0 + 0 )
        = (1 − α) H(X)

where H(X|Y = e) = H(X), P[Y = e] = α, and H(X|Y = 0) = H(X|Y = 1) = 0. H(X) is maximized by p(x) = (1/2, 1/2).

• Then, the capacity is

C = max_{p(x)} I(X; Y) = 1 − α

• Intuitively, this means that an α fraction of the bits gets “lost”.

9 / 51
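A quick numerical check of C = 1 − α (a sketch; α = 0.25 and the input-distribution grid are illustrative choices). It uses the identity I(X; Y) = (1 − α) H(X) derived above, with H(X) the binary entropy of the input PMF P[X = 1] = q:

import numpy as np

def binary_entropy(q):
    return 0.0 if q in (0.0, 1.0) else -q * np.log2(q) - (1 - q) * np.log2(1 - q)

alpha = 0.25
# Sweep the input PMF P[X = 1] = q; the maximum of (1 - alpha) * H(q)
# is attained at q = 1/2 and equals 1 - alpha.
capacity = max((1 - alpha) * binary_entropy(q) for q in np.linspace(0, 1, 1001))
print(capacity, 1 - alpha)   # both 0.75 bits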
Channel Coding Theorem

(Block diagram: the source produces a message W; the encoder maps W to X^n; the channel p(y|x) outputs Y^n; the decoder produces the estimate Ŵ.)

A mathematical model of a physical communication system consists of:

• Message W drawn from a set {1, . . . , M}.


• Encoding function X^n : {1, . . . , M} → X^n that maps the messages W to codewords X^n(W). The set of all codewords is called a codebook.
• A channel p(y^n | x^n) through which the codeword X^n is sent, and which is then received as Y^n.
• Decoding function g : Y^n → {1, . . . , M} that maps the channel output Y^n to a guess Ŵ = g(Y^n) of the original message W.
• An error occurs if the guess is wrong, i.e., Ŵ ≠ W.

10 / 51
Channel Coding Theorem - Definitions
• Definition 28. Recall that a discrete channel denoted by
(X , p(y|x), Y) consists of an input alphabet X , output alphabet Y, and
conditional PMF (channel transition matrix) p(y |x), where x is the
input and y is the output of the channel.

11 / 51
Channel Coding Theorem - Definitions
• Definition 29. The nth extension of a discrete memoryless channel
(DMC) is defined as (X n , p(y n |x n ), Y n ), where

p(y_k | x^k, y^{k−1}) = p(y_k | x_k),   k = 1, . . . , n   (memoryless)

Assuming no feedback, i.e., the input symbols do not depend on the past output symbols,

p(x_k | x^{k−1}, y^{k−1}) = p(x_k | x^{k−1}),

the nth extension of the DMC becomes

p(y^n | x^n) = p(y^n, x^n) / p(x^n)
             = [ ∏_{k=1}^{n} p(y_k, x_k | y^{k−1}, x^{k−1}) ] / p(x^n)
             = [ ∏_{k=1}^{n} p(y_k | x_k) p(x_k | x^{k−1}) ] / p(x^n)
             = ∏_{k=1}^{n} p(y_k | x_k)

12 / 51
Channel Coding Theorem - Definitions
Definition 30. An (M, n) code for the channel (X , p(y |x), Y) consists of:
• An index set {1, . . . , M}.
• An encoding function X^n : {1, . . . , M} → X^n that maps the messages {1, . . . , M} to codewords x^n(1), . . . , x^n(M):

MESSAGES          CODEBOOK
W = 1   →   x^n(1) = (x_1(1), . . . , x_n(1))
W = 2   →   x^n(2) = (x_1(2), . . . , x_n(2))
...
W = M   →   x^n(M) = (x_1(M), . . . , x_n(M))

• A deterministic decoding function g : Y^n → {1, . . . , M}.

13 / 51
Channel Coding Theorem - Definitions
• Definition 31. Probability of error (conditional):

λ_i = P[g(Y^n) ≠ i | X^n = x^n(i)]

denotes the probability of guessing the wrong message given that message i was sent.

• Definition 32. Maximal probability of error:

λ^(n) = max_{i=1,...,M} λ_i

• Definition 33. The average (arithmetic) probability of error for an (M, n) code is given by:

P_e^(n) = (1/M) Σ_{i=1}^{M} λ_i

• Note that by definition P_e^(n) ≤ λ^(n).

• If W is distributed uniformly at random over {1, . . . , M}, then

P_e^(n) = P[g(Y^n) ≠ W]

14 / 51
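A tiny numerical illustration of Definitions 32-33 (the λ_i values below are made up for the example): the average error probability never exceeds the maximal one.

# Hypothetical per-message error probabilities lambda_i for an (M, n) code with M = 4.
lambdas = [0.01, 0.03, 0.02, 0.10]

P_e_avg = sum(lambdas) / len(lambdas)   # Definition 33: average probability of error
lam_max = max(lambdas)                  # Definition 32: maximal probability of error
print(P_e_avg, lam_max, P_e_avg <= lam_max)   # 0.04 0.1 True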
Channel Coding Theorem - Definitions
• Definition 34. The rate R of an (M, n) code is defined as:

R = (log M) / n   bits per transmission

• Definition 35. A rate R is said to be achievable if there exists a sequence of (⌈2^{nR}⌉, n) codes such that λ^(n) → 0 as n → ∞.

• We want to find the best possible performance (highest rate R) for a given noisy communication channel.

15 / 51
Channel Coding Theorem - Definitions

• Recall the definition of channel capacity C = max_{p(x)} I(X; Y).

• Now we are ready to prove the channel coding theorem:

• Theorem 34 (Channel coding theorem). For a discrete memoryless channel, any rate below the channel capacity C is achievable. That is, for any rate R < C, there exists a sequence of (2^{nR}, n) codes with maximal probability of error λ^(n) → 0 as n → ∞. Conversely, for any R > C, no sequence of (2^{nR}, n) codes can have λ^(n) → 0.

16 / 51
Channel Coding Theorem - Definitions

• This theorem consists of two parts:

1. Achievability: All rates below C are achievable. That is, if R < C, then we can design (2^{nR}, n) codes such that λ^(n) (and hence P_e^(n)) → 0 as n → ∞.

2. Converse: No rate above C is achievable. That is, if R > C, then for all (2^{nR}, n) codes, P_e^(n) (and hence λ^(n)) is bounded away from 0.

• In both the achievability and converse, we will assume the message W is distributed uniformly at random over {1, . . . , 2^{nR}}.

• This way we can characterize the behavior of the channel (i.e., the error probability P_e^(n)) irrespective of the distribution of the messages.

• This is arguably the most important theorem in information theory.

17 / 51
Channel Coding Theorem - Converse
• We will start by proving the converse.
• Recall Fano’s inequality: we observe Y and want to guess X. Let the guess be X̂ = g(Y) and P_e = P[X̂ ≠ X]. Then,

H(P_e) + P_e log(|X| − 1) ≥ H(X|Y)   (3)

which can be relaxed to

1 + P_e log |X| ≥ H(X|Y)   (4)

• Note that Fano’s inequality does not assume any particular decoding/guessing method.

18 / 51
Channel Coding Theorem - Converse
• In channel coding, we estimate W from Y^n through a decoding function Ŵ = g(Y^n). The alphabet of W is {1, . . . , 2^{nR}}.

• If the messages W ∈ {1, . . . , 2^{nR}} are equally likely, P_e^(n) = P[W ≠ Ŵ].

• Then, Fano’s inequality says that

H(W | Y^n) ≤ 1 + P_e^(n) log(2^{nR})   (5)

in other words,

H(W | Y^n) ≤ 1 + P_e^(n) nR   (6)

19 / 51
Channel Coding Theorem - Converse
• Since we assume the messages W ∈ {1, . . . , 2^{nR}} are equally likely, H(W) = log 2^{nR} = nR. Then,

nR = H(W)                                                            (W has uniform distribution)   (7)
   = H(W | Y^n) + I(W; Y^n)                                          (by definition)                (8)
   ≤ H(W | Y^n) + I(X^n; Y^n)                                        (data-processing inequality)   (9)
   ≤ 1 + P_e^(n) nR + I(X^n; Y^n)                                    (Fano's inequality)            (10)
   = 1 + P_e^(n) nR + H(Y^n) − H(Y^n | X^n)                          (by definition)                (11)
   = 1 + P_e^(n) nR + H(Y^n) − Σ_{i=1}^{n} H(Y_i | Y_{i−1}, . . . , Y_1, X^n)   (chain rule)
   = 1 + P_e^(n) nR + H(Y^n) − Σ_{i=1}^{n} H(Y_i | X_i)              (DMC)                          (12)
   ≤ 1 + P_e^(n) nR + Σ_{i=1}^{n} H(Y_i) − Σ_{i=1}^{n} H(Y_i | X_i)  (conditioning)                 (13)

20 / 51
Channel Coding Theorem - Converse
(continued)

   = 1 + P_e^(n) nR + Σ_{i=1}^{n} I(X_i; Y_i)                        (by definition)                (14)
   ≤ 1 + P_e^(n) nR + nC                                             (15)

• Equation (9) follows from the data-processing inequality (DPI), since W − X^n − Y^n forms a Markov chain.
• Equation (10) follows from Fano’s inequality (6).
• Equation (12) follows from the fact that p(y^n | x^n) = ∏_{i=1}^{n} p(y_i | x_i) for the discrete memoryless channel (without feedback).
• Equation (13) holds by the independence bound on entropy (Theorem 10).
• Finally, (15) follows from the fact that I(X_i; Y_i) ≤ C, since C = max_{p(x)} I(X; Y).

21 / 51
Channel Coding Theorem - Converse
• Dividing both sides of equation (15) by nR,

P_e^(n) ≥ 1 − C/R − 1/(nR)   (16)

where the term 1/(nR) → 0 as n → ∞.

• If R > C, then

P_e^(n) ≥ 1 − C/R > 0   (17)

and P_e^(n) is bounded away from zero.

22 / 51
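A worked example of the converse bound (a sketch with illustrative numbers, not from the slides): for a BSC with crossover probability 0.11, C = 1 − H(0.11) ≈ 0.5 bits, so any code attempting rate R = 0.75 must, by (17), have error probability at least about 1/3 for large n.

import numpy as np

def binary_entropy(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

C = 1 - binary_entropy(0.11)   # capacity of BSC(0.11), ~0.5 bits
R = 0.75                       # attempted rate above capacity
lower_bound = 1 - C / R        # bound (17), ignoring the vanishing 1/(nR) term
print(C, lower_bound)          # ~0.50, ~0.33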
Channel Coding Theorem - Achievability
• So far we have introduced the channel capacity and proved the
converse part of the channel coding theorem. We have covered
Sections 7.1-7.5 and 7.9.

• That is, no code can achieve a rate of R > C with probability of error
going to zero.

• We will now prove the achievability part of the channel coding theorem
(we will cover Sections 7.6-7.7).

• That is, for any rate R < C, there exists a deterministic code
(encoding and decoding function) that achieves a rate of R.

• For the proof, we will need to extend the notion of typicality. This
definition will be very important in proving the channel coding theorem
- we will use a decoding technique based on this notion.

23 / 51
Jointly Typical Sequences
• Recall that typical sequences x^n = (x_1, . . . , x_n) were those whose empirical entropy was close to the true entropy H(X).

• We will now extend this notion to pairs of random variables

(x^n, y^n) = ((x_1, y_1), . . . , (x_n, y_n))

• Definition 36 (Jointly Typical Sequences). The set A_ε^(n) of jointly typical sequences with respect to a joint PMF p(x, y) is the set of all sequences (x^n, y^n) whose empirical entropies are ε-close to the true entropies:

A_ε^(n) = { (x^n, y^n) ∈ X^n × Y^n :
    | −(1/n) log p(x^n) − H(X) | < ε,        (18)
    | −(1/n) log p(y^n) − H(Y) | < ε,        (19)
    | −(1/n) log p(x^n, y^n) − H(X, Y) | < ε }   (20)

where p(x^n, y^n) = ∏_{i=1}^{n} p(x_i, y_i).
24 / 51
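A minimal sketch of testing Definition 36 empirically (the joint PMF, blocklength n, ε, and the random seed are all arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)

# Illustrative joint PMF p(x, y) on X = {0, 1}, Y = {0, 1}.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))   # entropy in bits
H_X, H_Y, H_XY = H(p_x), H(p_y), H(p_xy.ravel())

def jointly_typical(xs, ys, eps):
    """Check conditions (18)-(20) for the pair (x^n, y^n)."""
    n = len(xs)
    log_px = np.sum(np.log2(p_x[xs]))
    log_py = np.sum(np.log2(p_y[ys]))
    log_pxy = np.sum(np.log2(p_xy[xs, ys]))
    return (abs(-log_px / n - H_X) < eps and
            abs(-log_py / n - H_Y) < eps and
            abs(-log_pxy / n - H_XY) < eps)

# Draw (x^n, y^n) i.i.d. from p(x, y) and test joint typicality.
n, eps = 2000, 0.05
flat = rng.choice(4, size=n, p=p_xy.ravel())   # indices 0..3 encode (x, y) pairs
xs, ys = flat // 2, flat % 2
print(jointly_typical(xs, ys, eps))   # True with high probability for large n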
Big Picture
• All y^n sequences: |Y|^n = 2^{n log |Y|}, of which about 2^{nH(Y)} are typical.
• All x^n sequences: |X|^n = 2^{n log |X|}, of which about 2^{nH(X)} are typical.
• About 2^{nH(X,Y)} pairs (x^n, y^n) are jointly typical.

25 / 51
Jointly Asymptotic Equipartition Property (Joint AEP)
• We will now extend the notion of Asymptotic Equipartition Property
(AEP) to jointly typical sequences.

• Definition 37 (Joint AEP). Let (X^n, Y^n) be drawn i.i.d. according to the distribution p(x^n, y^n) = ∏_{i=1}^{n} p(x_i, y_i). Then,

1. lim_{n→∞} P[(X^n, Y^n) ∈ A_ε^(n)] = 1   (21)

2. (1 − ε) 2^{n(H(X,Y)−ε)} ≤ |A_ε^(n)| ≤ 2^{n(H(X,Y)+ε)}   (22)

3. If (X̃^n, Ỹ^n) ∼ p(x^n) p(y^n), i.e., X̃^n and Ỹ^n are independent with marginal PMFs p(x^n) and p(y^n) obtained from the joint PMF p(x^n, y^n), then:

(1 − ε) 2^{−n(I(X;Y)+3ε)} ≤ P[(X̃^n, Ỹ^n) ∈ A_ε^(n)] ≤ 2^{−n(I(X;Y)−3ε)}   (23)

26 / 51
Jointly Asymptotic Equipartition Property (Joint AEP)
Proof.

• Parts 1, 2: Similar to the original AEP.


• Part 3: Here X̃^n and Ỹ^n are independent and have the same marginals as X^n and Y^n. Then,

P[(X̃^n, Ỹ^n) ∈ A_ε^(n)] = Σ_{(x̃^n, ỹ^n) ∈ A_ε^(n)} p(x̃^n, ỹ^n)
                         = Σ_{(x̃^n, ỹ^n) ∈ A_ε^(n)} p(x̃^n) p(ỹ^n)
                         = Σ_{(x^n, y^n) ∈ A_ε^(n)} p(x^n) p(y^n)

27 / 51
Jointly Asymptotic Equipartition Property (Joint AEP)
Proof (continued).

• When (x^n, y^n) ∈ A_ε^(n), we know that:

2^{−n(H(X)+ε)} ≤ p(x^n) ≤ 2^{−n(H(X)−ε)}   (24)

and

2^{−n(H(Y)+ε)} ≤ p(y^n) ≤ 2^{−n(H(Y)−ε)}   (25)

• So:

P[(X̃^n, Ỹ^n) ∈ A_ε^(n)] ≤ Σ_{(x^n, y^n) ∈ A_ε^(n)} 2^{−n(H(X)−ε)} 2^{−n(H(Y)−ε)}
                         = |A_ε^(n)| 2^{−n(H(X)−ε)} 2^{−n(H(Y)−ε)}
                         ≤ 2^{n(H(X,Y)+ε)} 2^{−n(H(X)−ε)} 2^{−n(H(Y)−ε)}   (from part 2)
                         = 2^{−n(H(Y) − (H(X,Y) − H(X)) − 3ε)}
                         = 2^{−n(I(X;Y)−3ε)}   (since H(X,Y) − H(X) = H(Y|X) and H(Y) − H(Y|X) = I(X;Y))

28 / 51
Jointly Asymptotic Equipartition Property (Joint AEP)
Proof (continued).

• Similarly,

P[(X̃^n, Ỹ^n) ∈ A_ε^(n)] = Σ_{(x^n, y^n) ∈ A_ε^(n)} p(x^n) p(y^n)
                         ≥ |A_ε^(n)| 2^{−n(H(X)+ε)} 2^{−n(H(Y)+ε)}
                         ≥ (1 − ε) 2^{n(H(X,Y)−ε)} 2^{−n(H(X)+ε)} 2^{−n(H(Y)+ε)}   (from part 2)
                         = (1 − ε) 2^{−n(I(X;Y)+3ε)}

which concludes the proof.

29 / 51
Jointly Asymptotic Equipartition Property (Joint AEP)
Basically, this theorem says that:

• The collective probability of the jointly typical set is “about” 1.

• There are about 2^{nH(X,Y)} jointly typical sequences, and they are almost equally likely, i.e.,

p(x^n, y^n) ≈ 2^{−nH(X,Y)} if (x^n, y^n) ∈ A_ε^(n)

• A pair (x̃^n, ỹ^n) drawn independently from the marginals is jointly typical with probability ≈ 2^{−nI(X;Y)}.

30 / 51
Achievability of Channel Coding Theorem
• We are now ready to prove the achievability part of the channel coding theorem.

• That is, for any R < C, there exists a (2^{nR}, n) code such that the maximal probability of error λ^(n) → 0 as n → ∞.

31 / 51
Achievability of Channel Coding Theorem
• First, fix p(x). Then, generate a (2^{nR}, n) code at random according to the distribution p(x). That is, generate 2^{nR} codewords independently according to:

p(x^n) = ∏_{i=1}^{n} p(x_i)   (26)

• The codebook can be expressed in matrix form, where each row represents a codeword:

C = [ x_1(1)       . . .   x_n(1)
      ...                  ...
      x_1(2^{nR})  . . .   x_n(2^{nR}) ]   (27)

• (x_1(1), . . . , x_n(1)) represents the codeword for message 1 (codeword 1), and (x_1(2^{nR}), . . . , x_n(2^{nR})) represents the codeword for message 2^{nR} (codeword 2^{nR}).

32 / 51
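A minimal sketch of the random codebook construction in (26)-(27) (the alphabet, p(x), blocklength n, and rate R below are illustrative choices):

import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters: binary alphabet, uniform p(x), blocklength n, rate R.
n, R = 10, 0.4
M = int(2 ** (n * R))        # number of messages, 2^{nR} (an integer here since nR = 4)
p_x = [0.5, 0.5]

# Each of the M * n entries is drawn i.i.d. from p(x); row w is the codeword for message w+1.
codebook = rng.choice(len(p_x), size=(M, n), p=p_x)
print(codebook.shape)        # (16, 10): one row per message, as in the matrix (27)
print(codebook[0])           # the codeword assigned to message W = 1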
Achievability of Channel Coding Theorem
• Note that each entry of the codebook C is generated i.i.d. according to p(x).

• Then, the probability that we generate a particular code is:

P(C) = ∏_{w=1}^{2^{nR}} ∏_{i=1}^{n} p(x_i(w))   (28)

where w is the message index.

• This technique is called random coding. This is just a proof technique, and not a method we use for actual transmission.

• It just simplifies the analysis needed to show that a good deterministic code exists.

33 / 51
Achievability of Channel Coding Theorem
• Now, consider the following sequence of events:
1. Generate a random codebook C.
2. The code is revealed to the sender and receiver. They also both know the channel transition matrix p(y|x).
3. Choose a message W according to a (discrete) uniform distribution:

   P[W = w] = 2^{−nR},   w = 1, . . . , 2^{nR}   (29)

4. Send the corresponding codeword X^n(w) through the channel.
5. The receiver receives a sequence Y^n with probability:

   p(y^n | x^n(w)) = ∏_{i=1}^{n} p(y_i | x_i(w))   (30)

6. The receiver guesses the message. To do so, it declares that Ŵ was sent if the following two conditions are satisfied (see the sketch after this slide):
   a) (X^n(Ŵ), Y^n) is jointly typical.
   b) There is no other W′ ≠ Ŵ such that (X^n(W′), Y^n) is jointly typical.
7. An error occurs if Ŵ ≠ W. Let E be the event {Ŵ ≠ W}.

34 / 51
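Below is a minimal end-to-end sketch of steps 1-7 with joint-typicality decoding over a BSC (everything here, the crossover probability 0.1, n = 200, R = 0.05, ε = 0.15, and the seed, is an illustrative choice, not from the slides; the typicality test is specialized to the uniform-input BSC, and a convincing demonstration needs much larger n):

import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: BSC(p) with uniform inputs, so I(X; Y) = 1 - H(p) ~ 0.53 bits.
p, n, R, eps = 0.1, 200, 0.05, 0.15
M = int(2 ** (n * R))                            # 2^{nR} = 1024 messages
Hb = lambda q: -q * np.log2(q) - (1 - q) * np.log2(1 - q)
H_XY = 1.0 + Hb(p)                               # H(X, Y) = H(X) + H(Y|X) for uniform X

def jointly_typical(x, y):
    """Conditions (18)-(20), specialized: for uniform inputs, -(1/n) log p(x^n)
    and -(1/n) log p(y^n) both equal 1 exactly, so only condition (20) can fail."""
    flips = np.count_nonzero(x != y)
    emp_H_XY = 1.0 - (flips * np.log2(p) + (n - flips) * np.log2(1 - p)) / n
    return abs(emp_H_XY - H_XY) < eps

# Steps 1-2: random codebook, known to sender and receiver.
codebook = rng.integers(0, 2, size=(M, n))
# Steps 3-5: send codeword 1 (WLOG W = 1) through the BSC.
x = codebook[0]
y = (x + (rng.random(n) < p)) % 2
# Step 6: declare w_hat only if it is the unique jointly typical codeword.
hits = [w for w in range(M) if jointly_typical(codebook[w], y)]
w_hat = hits[0] if len(hits) == 1 else None
# Step 7: the error event E is {w_hat != W}.
print("decoded correctly:", w_hat == 0)          # True on most runs with these parameters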
Achievability of Channel Coding Theorem
• This decoding rule is called “joint typicality decoding”.

• Note that this decoding rule based on joint typicality is sub-optimal in terms of minimizing the probability of error.

• The rule that minimizes the probability of error is the MAP (maximum a posteriori) rule, which declares ŵ = arg max_w p(x^n(w) | y^n); this is equivalent to the ML (maximum likelihood) rule since all w's are equally likely.

• But this decoding rule (the maximum likelihood decoder) is difficult to analyze, and the decoding rule based on joint typicality will still get us the desired result.

35 / 51
Achievability of Channel Coding Theorem
• Probability of error calculation: The average probability of error over all codewords in a codebook and over all randomly generated codebooks is:

P(E) = P[Ŵ ≠ W] = Σ_C P(C) P_e^(n)(C)                           (31)
     = Σ_C P(C) (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C)              (32)
     = (1/2^{nR}) Σ_{w=1}^{2^{nR}} Σ_C P(C) λ_w(C)              (33)
     = Σ_C P(C) λ_1(C) = P(E | W = 1)                           (34)

• (∗) The inner sum Σ_C P(C) λ_w(C) in (33) does not depend on w, since we average over all possible codebooks and every codeword is generated the same way. Therefore, without loss of generality, we can assume W = 1: X^n(1) is sent and Y^n is received.
36 / 51
Achievability of Channel Coding Theorem
• Now, let’s define the events

E_i = {(X^n(i), Y^n) ∈ A_ε^(n)},   i = 1, . . . , 2^{nR}   (35)

• Hence, E_1^c and the E_j for j > 1 are all error events conditioned on W = 1, and

P(E | W = 1) = P(E_1^c ∪ E_2 ∪ E_3 ∪ . . . ∪ E_{2^{nR}} | W = 1).   (36)

• Then, using the union bound,

P(E | W = 1) ≤ P(E_1^c | W = 1) + Σ_{i=2}^{2^{nR}} P(E_i | W = 1)   (37)

37 / 51
Achievability of Channel Coding Theorem
• By the joint AEP, P(E_1^c | W = 1) → 0 as n → ∞, which means that for large enough n we have:

P(E_1^c | W = 1) ≤ ε   (38)

• Also, since the codewords X^n(1) and X^n(w) for w ≠ 1 are independently generated, Y^n is independent of X^n(w) for w ≠ 1 (because Y^n is generated from X^n(1)).

• Recall that the probability of X^n(w) being jointly typical with Y^n when they are independent is ≤ 2^{−n(I(X;Y)−3ε)} by the joint AEP (part 3).

• Then, for i ≠ 1,

P(E_i | W = 1) ≤ 2^{−n(I(X;Y)−3ε)}   (39)

38 / 51
Achievability of Channel Coding Theorem
• Therefore, the average error can be bounded by:

P(E | W = 1) ≤ ε + 2^{nR} 2^{−n(I(X;Y)−3ε)}   (41)
             = ε + 2^{−n(I(X;Y)−R−3ε)}        (42)

• When

I(X; Y) − R > 3ε   (43)

i.e.,

R < I(X; Y) − 3ε   (44)

the exponent in (42) becomes −na for some a > 0, and the whole second term in (42) is ≤ ε for large n.

• Then, the average error becomes:

P(E | W = 1) ≤ 2ε   (45)

for R < I(X; Y) − 3ε.


39 / 51
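To see the bound (41)-(42) numerically (a sketch with illustrative values I(X; Y) = 0.5 bits, R = 0.3, ε = 0.05, so that I(X; Y) − R − 3ε = 0.05 > 0):

# Evaluate the bound P(E | W = 1) <= eps + 2^{-n (I - R - 3 eps)} from (42).
I_xy, R, eps = 0.5, 0.3, 0.05
for n in (100, 500, 1000):
    bound = eps + 2 ** (-n * (I_xy - R - 3 * eps))
    print(n, bound)   # the second term decays geometrically, so the bound approaches eps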
Achievability of Channel Coding Theorem
• So far we have shown that when R < I(X; Y), the average probability of error P(E) will be ≤ 2ε.

• To finish the proof, we will show that the maximal error probability λ^(n) goes to 0 as well (not just the average error probability P(E)).

• We will do this by taking the following steps.

40 / 51
Achievability of Channel Coding Theorem
1. First, choose p(x) to be the distribution that maximizes I(X; Y). Recall that this is the definition of C. So,

p*(x) = arg max_{p(x)} I(X; Y),   (46)

where

max_{p(x)} I(X; Y) = I(X; Y)|_{p(x)=p*(x)} = C   (47)

By doing this we replace the condition R < I(X; Y) with R < C.

41 / 51
Achievability of Channel Coding Theorem
2. Now, since the average P(E) over all randomly generated codebooks is ≤ 2ε, there must be at least one codebook (let’s call it C*) whose error performance is at least as good as the average, i.e.,

P(E | C*) ≤ 2ε   (48)

where

P(E | C*) = P_e^(n)(C*) = (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C*)   (49)

since the messages w ∈ {1, . . . , 2^{nR}} are chosen uniformly at random.

42 / 51
Achievability of Channel Coding Theorem
3. Throw away the worst half of the codewords in the best codebook C*. The worst codewords are the ones with the highest probability of error λ_w. Then, since the (arithmetic) average probability of error satisfies

P_e^(n)(C*) = (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C*) ≤ 2ε,   (50)

for each of the remaining codewords we must have:

λ_w ≤ 4ε   (51)

Otherwise the average P_e^(n)(C*) would be higher than 2ε! So, the best half of the codewords have a maximal probability of error ≤ 4ε. By using only the remaining codewords, we can transmit 2^{nR}/2 messages.

43 / 51
Achievability of Channel Coding Theorem
4. So, we are now sending 2^{nR}/2 messages instead of 2^{nR} messages. This changes our rate from (log 2^{nR})/n = R to (log(2^{nR}/2))/n = R − 1/n, but now we are guaranteed a maximal error probability λ^(n) ≤ 4ε.

Note that the term 1/n vanishes as n → ∞.

This means that for any R < C, we can find a code with rate R for which λ^(n) → 0 for large enough n, which completes the proof.

44 / 51
Capacity with Feedback
• In feedback channels, the encoder observes the previous channel outputs.

(Block diagram: the encoder forms X_i = X_i(W, Y^{i−1}) from the message W and the fed-back past outputs Y^{i−1}; the channel produces Y_i; the decoder forms Ŵ = g(Y^n).)

• For the discrete memoryless channel (DMC) we have discussed here, feedback cannot increase capacity.

• For more advanced channels (channels with memory, multi-terminal channels), feedback can increase capacity.

45 / 51
Source-Channel Separation Theorem
• Recall that earlier we established the fundamental limits of data compression, and more recently we established the fundamental limits on how much information can be transferred over a given channel.

• Now we will combine the two results.

46 / 51
Source-Channel Separation Theorem
• Suppose we want to send an arbitrary source V (e.g., speech) that
generates symbols V1 , . . . , Vn , through a channel p(y|x) with capacity
C. The source has an entropy rate H(V).

• Consider the following two options:

1) Joint source-channel coding: We can map the source symbols V^n = (V_1, . . . , V_n) directly into channel inputs X^n = (X_1, . . . , X_n). Then we can estimate the original source symbols (V_1, . . . , V_n) from the channel output Y^n = (Y_1, . . . , Y_n). Let’s call the estimated symbols V̂^n = (V̂_1, . . . , V̂_n).

(Block diagram: source symbols V^n → encoder X^n(V^n) → channel p(y|x) → Y^n → decoder V̂^n = g(Y^n).)
47 / 51
Source-Channel Separation Theorem
2) Separate source and channel coding: Alternatively, we can separate source and channel coding.

Recall that we can compress the source down to rate R = (log 2^{nH(V)})/n = H(V), as we saw in the consequences of the AEP in Lecture 2. Then, we can first compress the source as efficiently as possible, and then use an optimal channel code to transfer this compressed representation, as long as R < C.

(Block diagram: V^n → source encoder → channel encoder X^n → channel p(y|x) → Y^n → channel decoder → source decoder V̂^n.)

48 / 51
Source-Channel Separation Theorem
• The following theorem says that the two options perform equally well.

• Theorem 35 (Source-Channel Coding Theorem). Consider a source represented by a stochastic process V_1, . . . , V_n drawn from a finite alphabet with entropy rate H(V). If H(V) < C, then there exists a source-channel code with probability of error P[V̂^n ≠ V^n] → 0. Conversely, we cannot send any source with H(V) > C through the channel with arbitrarily small probability of error.

• If V_1, . . . , V_n are i.i.d. with PMF p(v), then H(V) can be replaced with the entropy H(V) = −Σ_v p(v) log p(v).

49 / 51
Source-Channel Separation Theorem
Proof:

• (Achievability) From the AEP, there are about 2^{nH(V)} typical sequences for the source V_1, . . . , V_n (recall that the typical sequences are “almost” equally likely). Map each of these typical sequences to one of the “messages” of a channel code, and ignore the non-typical sequences. Use a (2^{nR}, n) channel code for sending 2^{nR} = 2^{nH(V)} messages. Then, from the channel coding theorem, as long as H(V) = R < C, the messages can be transferred with vanishing error.

• (Converse) Follows from Fano’s inequality, similarly to the converse of the channel coding theorem.

50 / 51
Source-Channel Separation Theorem
• This theorem states that joint source-channel coding does not provide
any advantage over separate source coding (compression) and
channel coding.

• Advantages:
• Greatly simplifies our system design!
• Design the best code for the source, design the best code for the
channel, and combine them together.
• Can even use the same source code with different channel codes, and
vice versa.

• Limitations:
• Requires infinite codeword lengths! The result doesn’t hold when codeword lengths are finite, and real-world codes have finite length!
• Doesn’t hold for more advanced channel models, such as
multi-terminal channels.

51 / 51
