Lecture 5
Basak Guler
1 / 51
Channel Capacity
• In previous lectures, we learned how to efficiently compress the outputs of a given source and studied the fundamental limits of compression.
• In this lecture, we turn to the problem of transmitting information reliably over a noisy channel.
2 / 51
Discrete Memoryless Channel (DMC)
• Definition 26. A discrete channel is a system consisting of an input alphabet X, an output alphabet Y, and a conditional PMF p(y|x), which represents the probability of observing y when x is sent. The channel is memoryless if the output depends only on the present input and is independent of all previous inputs and outputs given the present input. The discrete memoryless channel is abbreviated as DMC.
3 / 51
Channel Capacity
• Definition 27. The “information” channel capacity of a DMC is defined
as:
C = max_{p(x)} I(X; Y)
where the maximum is taken over all possible input distributions p(x).
4 / 51
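The following is a small numerical sketch (not from the slides) of this definition: it evaluates I(X; Y) for a candidate input PMF and a channel transition matrix, and approximates C for a binary-input channel by a simple grid search over p(x). The function names and the example BSC(0.1) matrix are illustrative; numpy is assumed.

```python
# Sketch (not from the slides): numerically evaluate C = max_{p(x)} I(X;Y)
# for a channel with binary input, given its transition matrix p(y|x).
import numpy as np

def mutual_information(px, P):
    """I(X;Y) in bits, for input PMF px and channel matrix P[x][y] = p(y|x)."""
    pxy = px[:, None] * P          # joint p(x, y)
    py = pxy.sum(axis=0)           # output marginal p(y)
    mask = pxy > 0                 # convention: 0 log 0 = 0
    return float((pxy[mask] * np.log2(pxy[mask] / (px[:, None] * py[None, :])[mask])).sum())

def binary_input_capacity(P, grid=10001):
    """Grid search over p(x) = (q, 1 - q) for a 2-row channel matrix P."""
    best = 0.0
    for q in np.linspace(0.0, 1.0, grid):
        best = max(best, mutual_information(np.array([q, 1.0 - q]), P))
    return best

# Example: BSC with crossover probability 0.1; capacity should be 1 - H(0.1) ≈ 0.531.
P_bsc = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
print(binary_input_capacity(P_bsc))
```

For larger input alphabets one would typically replace the grid search with the Blahut-Arimoto algorithm, but the small search is enough to illustrate the definition.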
Noiseless Binary Channel
• Noiseless binary channel is defined as Y = X with X = Y = {0, 1}.
[Figure: noiseless binary channel, 0 → 0 and 1 → 1.]
• The capacity is:
C = max_{p(x)} I(X; Y) = 1 bit per transmission,
achieved by the uniform input distribution p(x) = (1/2, 1/2).
6 / 51
Binary Symmetric Channel (BSC)
• The BSC with crossover probability p flips each input bit with probability p: p(y|x) = 1 − p if y = x and p if y ≠ x, with X = Y = {0, 1}.
• Then,
I(X; Y) = H(Y) − H(Y|X) = H(Y) − H(p) ≤ 1 − H(p),
with equality for the uniform input distribution, so the capacity is C = 1 − H(p) bits per transmission.
7 / 51
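As a quick check of the formula above, the short sketch below (illustrative, assuming numpy) evaluates C = 1 − H(p) for a few crossover probabilities.

```python
# Sketch: evaluate the BSC capacity C = 1 - H(p) for a few crossover probabilities p.
import numpy as np

def binary_entropy(p):
    """H(p) in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

for p in (0.0, 0.1, 0.25, 0.5):
    print(f"p = {p:0.2f}  ->  C = {1 - binary_entropy(p):0.3f} bits/transmission")
```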
Binary Erasure Channel (BEC)
• The binary erasure channel is characterized by the conditional probability:
p(y|x) = α if y = e, and 1 − α if y = x,   (2)
where X = {0, 1}, Y = {0, 1, e}, and e denotes an erasure.
[Figure: BEC transition diagram, X → Y. Each input is received correctly with probability 1 − α and erased (mapped to e) with probability α.]
8 / 51
Binary Erasure Channel (BEC)
• Then,
I(X; Y) = H(X) − H(X|Y) = (1 − α)H(X),
so the capacity is
C = max_{p(x)} I(X; Y) = 1 − α,
achieved by the uniform input distribution.
9 / 51
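A small numerical sanity check (illustrative, assuming numpy): build the 2×3 BEC transition matrix, compute I(X; Y) for the uniform input, and compare against 1 − α.

```python
# Sketch: for the BEC with erasure probability alpha, compute I(X;Y) with a
# uniform input and check that it equals 1 - alpha. Output alphabet ordered (0, 1, e).
import numpy as np

def mutual_information(px, P):
    """I(X;Y) in bits for input PMF px and channel matrix P[x][y] = p(y|x)."""
    pxy = px[:, None] * P
    py = pxy.sum(axis=0)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask] / (px[:, None] * py[None, :])[mask])).sum())

alpha = 0.3
P_bec = np.array([[1 - alpha, 0.0, alpha],    # row x = 0; columns y = 0, 1, e
                  [0.0, 1 - alpha, alpha]])   # row x = 1
px_uniform = np.array([0.5, 0.5])
print(mutual_information(px_uniform, P_bec), 1 - alpha)   # both ≈ 0.7
```

The maximum over p(x) is attained at the uniform input here, so this value is exactly the capacity 1 − α.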
Channel Coding Theorem
[Figure: communication system. A source produces the message W; the encoder maps W to the codeword X^n; the channel p(y|x) outputs Y^n; the decoder produces the estimate Ŵ.]
10 / 51
Channel Coding Theorem - Definitions
• Definition 28. Recall that a discrete channel denoted by
(X , p(y|x), Y) consists of an input alphabet X , output alphabet Y, and
conditional PMF (channel transition matrix) p(y |x), where x is the
input and y is the output of the channel.
11 / 51
Channel Coding Theorem - Definitions
• Definition 29. The nth extension of a discrete memoryless channel (DMC) is defined as (X^n, p(y^n|x^n), Y^n), where, when the channel is used without feedback,
p(y^n|x^n) = ∏_{i=1}^n p(y_i|x_i).
12 / 51
Channel Coding Theorem - Definitions
Definition 30. An (M, n) code for the channel (X, p(y|x), Y) consists of:
• An index set {1, . . . , M}.
• An encoding function X^n : {1, . . . , M} → X^n that maps each message to a codeword x^n(1), . . . , x^n(M):
MESSAGES    CODEBOOK
W = 1 → x^n(1) = (x_1(1), . . . , x_n(1))
W = 2 → x^n(2) = (x_1(2), . . . , x_n(2))
...
W = M → x^n(M) = (x_1(M), . . . , x_n(M))
• A decoding function g : Y^n → {1, . . . , M}, which assigns a guess Ŵ = g(Y^n) to each received sequence.
13 / 51
Channel Coding Theorem - Definitions
• Definition 31. Let λ_i = P[g(Y^n) ≠ i | W = i] be the conditional probability of error given that message i was sent. The maximal probability of error is:
λ^(n) = max_{i=1,...,M} λ_i
14 / 51
Channel Coding Theorem - Definitions
• Definition 34. The rate R of an (M, n) code is defined as:
R = (log M)/n bits per transmission.
• Definition 35. A rate R is said to be achievable if there exists a sequence of (⌈2^{nR}⌉, n) codes such that λ^(n) → 0 as n → ∞.
15 / 51
Channel Coding Theorem - Definitions
1. Achievability: All rates below C are achievable. That is, if R < C, then we can design a (2^{nR}, n) code such that the maximal error probability λ^(n) (and hence the average error probability P_e^(n)) goes to 0 as n → ∞.
2. Converse: No rate above C is achievable. That is, if R > C, then for all (2^{nR}, n) codes P_e^(n) (and λ^(n)) is bounded away from 0.
• This way we can characterize the behavior of the channel (i.e., the error probability P_e^(n)) irrespective of the distribution of the messages.
17 / 51
Channel Coding Theorem - Converse
• We will start by proving the converse.
• Recall Fano's inequality: we observe Y and want to guess X. Let the guess be X̂ = g(Y) and P_e = P[X̂ ≠ X]. Then,
H(X|Y) ≤ 1 + P_e log|X|.
18 / 51
Channel Coding Theorem - Converse
• In channel coding, we estimate W from Y^n through a decoding function Ŵ = g(Y^n). The alphabet of W is {1, . . . , 2^{nR}}.
• If the messages W ∈ {1, . . . , 2^{nR}} are equally likely, P_e^(n) = P[W ≠ Ŵ].
• Then, Fano's inequality says that
H(W|Y^n) ≤ 1 + P_e^(n) log(2^{nR})   (5)
in other words,
H(W|Y^n) ≤ 1 + P_e^(n) nR   (6)
19 / 51
Channel Coding Theorem - Converse
• Since we assume the messages W ∈ {1, . . . , 2^{nR}} are equally likely, H(W) = log 2^{nR} = nR. Then,
nR = H(W)   (7)
= H(W|Y^n) + I(W; Y^n)   (8)
≤ H(W|Y^n) + I(X^n; Y^n)   (9)
≤ 1 + P_e^(n) nR + I(X^n; Y^n)   (10)
= 1 + P_e^(n) nR + H(Y^n) − H(Y^n|X^n)   (11)
= 1 + P_e^(n) nR + H(Y^n) − Σ_{i=1}^n H(Y_i|X_i)   (12)
≤ 1 + P_e^(n) nR + Σ_{i=1}^n H(Y_i) − Σ_{i=1}^n H(Y_i|X_i)   (13)
20 / 51
Channel Coding Theorem - Converse
(continued)
= 1 + P_e^(n) nR + Σ_{i=1}^n I(X_i; Y_i)   (by definition of mutual information)   (14)
≤ 1 + P_e^(n) nR + nC   (15)
• Equation (9) follows from the data processing inequality (DPI), since W − X^n − Y^n forms a Markov chain.
• Equation (10) follows from Fano's inequality (6).
• Equation (12) follows from the fact that p(y^n|x^n) = ∏_{i=1}^n p(y_i|x_i) for the DMC.
• Equation (15) follows from the definition of capacity, since I(X_i; Y_i) ≤ C for each i.
21 / 51
Channel Coding Theorem - Converse
• Dividing both sides of equation (15) by nR,
P_e^(n) ≥ 1 − C/R − 1/(nR)   (16)
where the last term 1/(nR) → 0 as n → ∞.
• If R > C, then
P_e^(n) ≥ 1 − C/R > 0   (17)
and P_e^(n) is bounded away from zero.
22 / 51
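A tiny worked example of the bound (illustrative, assuming numpy): for a BSC with crossover probability 0.1 (so C ≈ 0.531) and an attempted rate R = 0.7 > C, the converse forces P_e^(n) to stay above roughly 0.24 for every code, no matter how large n is.

```python
# Sketch: plug numbers into the converse bounds (16)-(17) for a BSC(0.1) at rate R = 0.7.
import numpy as np

def binary_entropy(p):
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

C = 1 - binary_entropy(0.1)     # capacity of BSC(0.1) ≈ 0.531
R = 0.7                         # attempted rate, R > C
for n in (10, 100, 1000):
    lower = 1 - C / R - 1 / (n * R)
    print(f"n = {n:5d}: Pe >= {lower:0.3f}")
print(f"as n -> infinity: Pe >= {1 - C / R:0.3f}")
```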
Channel Coding Theorem - Achievability
• So far we have introduced the channel capacity and proved the
converse part of the channel coding theorem. We have covered
Sections 7.1-7.5 and 7.9.
• That is, no code can achieve a rate of R > C with probability of error
going to zero.
• We will now prove the achievability part of the channel coding theorem
(we will cover Sections 7.6-7.7).
• That is, for any rate R < C, there exists a deterministic code
(encoding and decoding function) that achieves a rate of R.
• For the proof, we will need to extend the notion of typicality. This
definition will be very important in proving the channel coding theorem
- we will use a decoding technique based on this notion.
23 / 51
Jointly Typical Sequences
• Recall that typical sequences x n = (x1 , . . . , xn ) were those whose
empirical entropy was close to the true entropy H(X ).
• The set A_ε^(n) of jointly typical sequences with respect to the joint PMF p(x, y) is the set of pairs (x^n, y^n) whose empirical entropies are ε-close to the true entropies:
A_ε^(n) = {(x^n, y^n) ∈ X^n × Y^n :
|−(1/n) log p(x^n) − H(X)| < ε,   (18)
|−(1/n) log p(y^n) − H(Y)| < ε,   (19)
|−(1/n) log p(x^n, y^n) − H(X, Y)| < ε}   (20)
where p(x^n, y^n) = ∏_{i=1}^n p(x_i, y_i).
24 / 51
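The sketch below (illustrative, assuming numpy; names such as is_jointly_typical are not from the slides) checks conditions (18)-(20) directly for a pair of observed sequences, given the joint PMF p(x, y).

```python
# Sketch: a direct check of conditions (18)-(20), with the joint PMF given as a dictionary.
import numpy as np

def is_jointly_typical(xs, ys, pxy, eps):
    """Return True if (xs, ys) is in A_eps^(n) with respect to pxy[(x, y)]."""
    n = len(xs)
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    Hx = -sum(p * np.log2(p) for p in px.values() if p > 0)
    Hy = -sum(p * np.log2(p) for p in py.values() if p > 0)
    Hxy = -sum(p * np.log2(p) for p in pxy.values() if p > 0)
    # empirical quantities -(1/n) log2 p(.) of the observed sequences
    ex = -sum(np.log2(px[x]) for x in xs) / n
    ey = -sum(np.log2(py[y]) for y in ys) / n
    exy = -sum(np.log2(pxy[(x, y)]) for x, y in zip(xs, ys)) / n
    return abs(ex - Hx) < eps and abs(ey - Hy) < eps and abs(exy - Hxy) < eps

# Example: X uniform on {0, 1}, Y = X passed through a BSC(0.1).
rng = np.random.default_rng(0)
p = 0.1
pxy = {(0, 0): 0.5 * (1 - p), (0, 1): 0.5 * p, (1, 0): 0.5 * p, (1, 1): 0.5 * (1 - p)}
x = rng.integers(0, 2, size=2000)
y = np.where(rng.random(2000) < p, 1 - x, x)
print(is_jointly_typical(x, y, pxy, eps=0.1))   # typically True for large n
```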
Big Picture
[Figure: among all |X|^n = 2^{n log |X|} sequences x^n, about 2^{nH(X)} are typical; among all |Y|^n = 2^{n log |Y|} sequences y^n, about 2^{nH(Y)} are typical.]
25 / 51
Jointly Asymptotic Equipartition Property (Joint AEP)
• We will now extend the notion of the Asymptotic Equipartition Property (AEP) to jointly typical sequences.
• Joint AEP: Let (X^n, Y^n) be drawn i.i.d. according to p(x^n, y^n) = ∏_{i=1}^n p(x_i, y_i). Then:
1. P[(X^n, Y^n) ∈ A_ε^(n)] → 1 as n → ∞.
2. |A_ε^(n)| ≤ 2^{n(H(X,Y)+ε)}, and for sufficiently large n, |A_ε^(n)| ≥ (1 − ε) 2^{n(H(X,Y)−ε)}.
3. If (X̃^n, Ỹ^n) ~ p(x^n) p(y^n), i.e., X̃^n and Ỹ^n are independent but have the same marginals as p(x^n, y^n), then
P[(X̃^n, Ỹ^n) ∈ A_ε^(n)] ≤ 2^{−n(I(X;Y)−3ε)},
and for sufficiently large n, P[(X̃^n, Ỹ^n) ∈ A_ε^(n)] ≥ (1 − ε) 2^{−n(I(X;Y)+3ε)}.
26 / 51
Jointly Asymptotic Equipartition Property (Joint AEP)
Proof.
• Part 1 follows from the weak law of large numbers applied to −(1/n) log p(X^n), −(1/n) log p(Y^n), and −(1/n) log p(X^n, Y^n).
• Part 2 follows from the same counting argument as in the AEP: 1 ≥ Σ_{(x^n,y^n) ∈ A_ε^(n)} p(x^n, y^n) ≥ |A_ε^(n)| 2^{−n(H(X,Y)+ε)}.
27 / 51
Jointly Asymptotic Equipartition Property (Joint AEP)
Proof (continued).
• When (x^n, y^n) ∈ A_ε^(n), we know that:
2^{−n(H(X)+ε)} ≤ p(x^n) ≤ 2^{−n(H(X)−ε)}   (24)
and
2^{−n(H(Y)+ε)} ≤ p(y^n) ≤ 2^{−n(H(Y)−ε)}   (25)
• So:
P[(X̃^n, Ỹ^n) ∈ A_ε^(n)] = Σ_{(x^n,y^n) ∈ A_ε^(n)} p(x^n) p(y^n)
≤ |A_ε^(n)| 2^{−n(H(X)−ε)} 2^{−n(H(Y)−ε)}
≤ 2^{n(H(X,Y)+ε)} 2^{−n(H(X)−ε)} 2^{−n(H(Y)−ε)}   (from part 2)
= 2^{−n(H(Y)−(H(X,Y)−H(X)))} 2^{3nε}
= 2^{−n(I(X;Y)−3ε)}   (since H(X, Y) − H(X) = H(Y|X))
28 / 51
Jointly Asymptotic Equipartition Property (Joint AEP)
Proof (continued).
• Similarly,
P[(X̃^n, Ỹ^n) ∈ A_ε^(n)] = Σ_{(x^n,y^n) ∈ A_ε^(n)} p(x^n) p(y^n)
≥ |A_ε^(n)| 2^{−n(H(X)+ε)} 2^{−n(H(Y)+ε)}
≥ (1 − ε) 2^{n(H(X,Y)−ε)} 2^{−n(H(X)+ε)} 2^{−n(H(Y)+ε)}   (from part 2)
= (1 − ε) 2^{−n(I(X;Y)+3ε)}
29 / 51
Jointly Asymptotic Equipartition Property (Joint AEP)
Basically, this theorem says that:
• There are about 2^{nH(X,Y)} jointly typical sequence pairs. They are almost equally likely, each with probability about 2^{−nH(X,Y)}.
• If we instead pick X̃^n and Ỹ^n independently from their marginals, the probability that the pair is jointly typical is only about 2^{−nI(X;Y)}.
30 / 51
Achievability of Channel Coding Theorem
• We are now ready to prove the achievability part of the channel coding theorem.
• That is, for any R < C, there exists a (2^{nR}, n) code such that the maximal probability of error λ^(n) → 0 as n → ∞.
31 / 51
Achievability of Channel Coding Theorem
• First, fix p(x). Then, generate a (2^{nR}, n) code at random according to the distribution p(x). That is, generate 2^{nR} codewords independently, each according to:
p(x^n) = ∏_{i=1}^n p(x_i)   (26)
(A small sketch of this random codebook generation is given below.)
32 / 51
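A minimal sketch of this random codebook generation (illustrative parameter values, assuming numpy): each of the 2^{nR} rows is a codeword whose entries are drawn i.i.d. from a Bernoulli input distribution.

```python
# Sketch: generate a random (2^{nR}, n) codebook as in (26), entries i.i.d. Bernoulli(q).
import numpy as np

def random_codebook(n, R, q, rng):
    """Rows are codewords x^n(1), ..., x^n(2^{nR}); entries are i.i.d. ~ p(x)."""
    M = int(2 ** (n * R))                     # number of messages / codewords
    return (rng.random((M, n)) < q).astype(int)

rng = np.random.default_rng(0)
C = random_codebook(n=20, R=0.25, q=0.5, rng=rng)   # 2^5 = 32 codewords of length 20
print(C.shape)    # (32, 20)
```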
Achievability of Channel Coding Theorem
• Note that each entry in the code C is generated i.i.d. with p(x).
33 / 51
Achievability of Channel Coding Theorem
• Now, consider the following sequence of events:
1. Generate a random codebook C.
2. The code is revealed to the sender and receiver. They also both know the channel transition matrix p(y|x).
3. Choose a message W according to a (discrete) uniform distribution: P[W = w] = 2^{−nR}, w ∈ {1, . . . , 2^{nR}}.
4. Send the corresponding codeword X^n(W) over the channel.
5. The receiver observes the sequence Y^n, generated according to p(y^n|x^n(W)) = ∏_{i=1}^n p(y_i|x_i(W)).
6. The receiver guesses the message. To do so, it will declare Ŵ was sent if the following two conditions are satisfied:
a) (X^n(Ŵ), Y^n) is jointly typical.
b) There is no other W′ ≠ Ŵ such that (X^n(W′), Y^n) is jointly typical.
If no such Ŵ exists, or more than one exists, an error is declared.
7. An error occurs if Ŵ ≠ W. Let E be the event Ŵ ≠ W.
(A small end-to-end simulation of this procedure over a BSC is sketched below.)
34 / 51
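The sketch below (not from the slides; parameter values and helper names are illustrative, numpy assumed) simulates steps 1-7 over a BSC with a uniform input. Because the input is uniform, conditions (18) and (19) hold exactly, so only the joint condition (20) needs to be checked by the decoder.

```python
# Sketch: random coding + joint typicality decoding over a BSC(p) with uniform input.
import numpy as np

def binary_entropy(p):
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def jointly_typical_bsc(x, y, p, eps):
    """Check (x, y) against A_eps^(n) for p(x, y) = 0.5 * BSC(p) transition probability."""
    Hxy = 1.0 + binary_entropy(p)                      # H(X,Y) for uniform input
    frac_flips = np.mean(x != y)                       # empirical crossover fraction
    emp_joint = 1.0 - frac_flips * np.log2(p) - (1 - frac_flips) * np.log2(1 - p)
    return abs(emp_joint - Hxy) < eps                  # conditions (18),(19) hold exactly here

rng = np.random.default_rng(1)
n, R, p, eps = 2000, 0.004, 0.1, 0.1                   # R far below C = 1 - H(0.1) ≈ 0.531
M = int(round(2 ** (n * R)))                           # 2^{nR} = 256 messages

# Steps 1-2: generate the random codebook (known to sender and receiver).
codebook = rng.integers(0, 2, size=(M, n))

errors, trials = 0, 50
for _ in range(trials):
    # Steps 3-4: pick W uniformly and send X^n(W).
    W = rng.integers(0, M)
    x = codebook[W]
    # Step 5: the channel flips each bit independently with probability p.
    y = np.where(rng.random(n) < p, 1 - x, x)
    # Step 6: joint typicality decoding -- find all codewords jointly typical with y.
    candidates = [w for w in range(M) if jointly_typical_bsc(codebook[w], y, p, eps)]
    W_hat = candidates[0] if len(candidates) == 1 else None
    # Step 7: count an error if the guess is wrong or ambiguous.
    errors += (W_hat != W)
print(f"empirical error rate: {errors / trials:.2f}")
```

With R far below C and n in the thousands, the empirical error rate should come out at (or very near) 0, consistent with the achievability claim.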
Achievability of Channel Coding Theorem
• This decoding rule is called "joint typicality decoding". It is not the optimal decoder, but it is simple to analyze and suffices to achieve capacity.
• The rule that minimizes the probability of error is the MAP (maximum a posteriori) rule, which declares ŵ = arg max_w p(x^n(w)|y^n); since all w's are equally likely, this is equivalent to the ML (maximum likelihood) rule.
35 / 51
Achievability of Channel Coding Theorem
• Probability of error calculation: The average probability of error over all codewords in a codebook and over all randomly generated codebooks is:
P(E) = P[Ŵ ≠ W] = Σ_C P(C) P_e^(n)(C)   (31)
= Σ_C P(C) (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C)   (32)
= (1/2^{nR}) Σ_{w=1}^{2^{nR}} Σ_C P(C) λ_w(C)   (33)
= Σ_C P(C) λ_1(C) = P(E|W = 1)   (34)
where (34) holds because, by the symmetry of the random codebook construction, the inner sum Σ_C P(C) λ_w(C) in (33) does not depend on w; so we can condition on W = 1 without loss of generality.
37 / 51
Achievability of Channel Coding Theorem
• Now, let's define the events E_i = {(X^n(i), Y^n) ∈ A_ε^(n)}, i ∈ {1, . . . , 2^{nR}}, i.e., the event that the ith codeword and the received sequence Y^n are jointly typical.
• Hence, E_1^c and E_j for j > 1 are all error events conditioned on W = 1, and P(E|W = 1) can be bounded by the union bound over these events.
37 / 51
Achievability of Channel Coding Theorem
• By the joint AEP, P(E_1^c|W = 1) → 0 as n → ∞, which means that for large enough n we have:
P(E_1^c|W = 1) ≤ ε   (38)
• Also, since the codewords X^n(1) and X^n(w) for w ≠ 1 are independently generated, Y^n is independent of X^n(w) for w ≠ 1 (because Y^n is generated from X^n(1)).
• Recall that the probability of X^n(w) being jointly typical with Y^n when they are independent is ≤ 2^{−n(I(X;Y)−3ε)} by the joint AEP (part 3), so P(E_w|W = 1) ≤ 2^{−n(I(X;Y)−3ε)} for each w ≠ 1.
• Then,
38 / 51
Achievability of Channel Coding Theorem
• Therefore, the average error can be bounded by:
P(E|W = 1) ≤ P(E_1^c|W = 1) + Σ_{w=2}^{2^{nR}} P(E_w|W = 1)
≤ ε + (2^{nR} − 1) 2^{−n(I(X;Y)−3ε)}
≤ ε + 2^{−n(I(X;Y)−3ε−R)}   (42)
39 / 51
Achievability of Channel Coding Theorem
• When
R < I(X; Y) − 3ε,
the exponent in (42) becomes −na for some a > 0, and the whole second term in (42) is ≤ ε for large n.
• Then, the average error becomes:
P(E|W = 1) ≤ 2ε   (45)
• To finish the proof, we will show that the maximal error probability λ^(n) goes to 0 as well (not just the average error probability P(E)).
40 / 51
Achievability of Channel Coding Theorem
1. First, choose p(x) to be the distribution p*(x) that maximizes I(X; Y). Recall that this is the definition of C. So, the condition R < I(X; Y) − 3ε becomes R < C − 3ε, where ε > 0 can be made arbitrarily small; hence any rate R < C works.
41 / 51
Achievability of Channel Coding Theorem
2. Now, since the average of P(E) over all randomly generated codebooks is ≤ 2ε, there must be at least one codebook (let's call this C*) whose error performance is at least as good as the average, i.e.,
P(E|C*) ≤ 2ε   (48)
where
P(E|C*) = P_e^(n)(C*) = (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C*)   (49)
42 / 51
Achievability of Channel Coding Theorem
3. Throw away the worst half of the codewords in the best codebook C*. The worst codewords are the ones with the highest probability of error λ_w. Since the (arithmetic) average probability of error satisfies
P_e^(n)(C*) = (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C*) ≤ 2ε,   (50)
at least half of the codewords must satisfy
λ_w ≤ 4ε   (51)
Otherwise the average P_e^(n)(C*) would be higher than 2ε!
So, the best half of the codewords have a maximal probability of error ≤ 4ε. By using only these remaining codewords, we can transmit 2^{nR}/2 messages.
43 / 51
Achievability of Channel Coding Theorem
4. So, we are now sending 2^{nR}/2 = 2^{nR−1} messages instead of 2^{nR} messages.
This changes our rate from (log 2^{nR})/n = R to (log 2^{nR−1})/n = R − 1/n, but now we are guaranteed to have a maximal error probability λ^(n) ≤ 4ε.
Note that the term 1/n vanishes as n → ∞.
This means that for any R < C, we can find a code with rate (arbitrarily close to) R for which λ^(n) → 0 for large enough n, which completes the proof.
44 / 51
Capacity with Feedback
• In feedback channels, the encoder observes the previous channel outputs, so the ith channel input can depend on both the message and the past outputs: X_i = X_i(W, Y^{i−1}).
[Figure: feedback channel. The encoder maps (W, Y^{i−1}) to X_i, the channel p(y|x) produces Y_i, the decoder computes Ŵ = g(Y^n), and the channel outputs are fed back to the encoder.]
• For a DMC, feedback does not increase the capacity: C_FB = C.
45 / 51
Source-Channel Separation Theorem
• Recall that earlier we established the fundamental limits of data compression, and we have now established the fundamental limits on how much information can be transmitted over a given channel.
46 / 51
Source-Channel Separation Theorem
• Suppose we want to send an arbitrary source V (e.g., speech) that
generates symbols V1 , . . . , Vn , through a channel p(y|x) with capacity
C. The source has an entropy rate H(V).
• One option is joint source-channel coding: map the source symbols V^n directly to channel inputs X^n(V^n) and decode V̂^n = g(Y^n) from the channel output.
[Figure: source symbols V^n → Encoder X^n(V^n) → Channel p(y|x) → Y^n → Decoder V̂^n = g(Y^n).]
47 / 51
Source-Channel Separation Theorem
2) Separate source and channel coding: Alternatively, we can separate source coding and channel coding.
Recall that we can compress the source down to a rate of R = (log 2^{nH(V)})/n = H(V) bits per source symbol, as we saw in the consequences of the AEP in Lecture 2.
Then, we can first compress the source as efficiently as possible, and then use an optimal channel code to transmit this compressed representation, as long as R < C. (A small numerical check of this condition is sketched below.)
48 / 51
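A small numerical check of the separation condition (illustrative source and channel parameters, numpy assumed): a Bernoulli(0.11) source has H(V) ≈ 0.50 bits per symbol, and a BSC(0.1) has C ≈ 0.53 bits per use, so sending one source symbol per channel use is (just barely) feasible.

```python
# Sketch: check the separation condition H(V) < C for a Bernoulli(q) source over a BSC(p).
import numpy as np

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

q = 0.11          # source: i.i.d. Bernoulli(q), entropy rate H(V) = H(q)
p = 0.1           # channel: BSC(p), capacity C = 1 - H(p)
H_V = binary_entropy(q)
C = 1 - binary_entropy(p)
print(f"H(V) = {H_V:.3f} bits/symbol, C = {C:.3f} bits/transmission")
print("reliable transmission (one source symbol per channel use):", H_V < C)
```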
Source-Channel Separation Theorem
• The following theorem says that the two options perform the same.
• Source-channel separation theorem: A source V (satisfying the AEP) with entropy rate H(V) can be transmitted reliably over a DMC of capacity C if H(V) < C; conversely, if H(V) > C, the probability of error is bounded away from zero.
49 / 51
Source-Channel Separation Theorem
Proof:
• (Achievability) From the AEP, there are about 2^{nH(V)} typical sequences for the source V_1, . . . , V_n (recall that the typical sequences are "almost" equally likely). Map each of these typical sequences to one of the "messages" of the channel code. Ignore the non-typical sequences. Use a (2^{nR}, n) channel code for sending the 2^{nR} = 2^{nH(V)} messages. Then, from the channel coding theorem, as long as H(V) = R < C, the messages can be transmitted with vanishing error probability.
50 / 51
Source-Channel Separation Theorem
• This theorem states that joint source-channel coding does not provide
any advantage over separate source coding (compression) and
channel coding.
• Advantages:
• Greatly simplifies our system design!
• Design the best code for the source, design the best code for the
channel, and combine them together.
• Can even use the same source code with different channel codes, and
vice versa.
• Limitations:
• Requires an infinite codeword length! Doesn’t hold when codeword
lengths are finite. Real world codes have finite length!
• Doesn’t hold for more advanced channel models, such as
multi-terminal channels.
51 / 51