
An Introduction to Information Theory

Adrish Banerjee
Department of Electrical Engineering
Indian Institute of Technology Kanpur
Kanpur, Uttar Pradesh
India

Lecture 10B: NOISY CHANNEL CODING THEOREM


Welcome to the course An Introduction to Information Theory. In this lecture we are going to talk about Shannon's noisy channel coding theorem. We are going to prove the achievability part of this theorem and its converse. In the last lecture we built up the background required to prove this theorem. The outline of this lecture is as follows:

• Noisy channel coding theorem


• Converse to noisy channel coding theorem

We start our lecture with the basic model of the communication channel that we are going to consider.

Communication Channel Model: W → Encoder → X^n → Channel p(y|x) → Y^n → Decoder → Ŵ
The various blocks of the above model are explained below.
* W is the message that we want to send.
* The encoder encodes the message into an n-bit codeword denoted by X^n, which is transmitted over the DMC.
* The channel here is a discrete memoryless channel whose transition probabilities are given by p(y|x).
* The received word is Y^n.
* The decoder estimates the transmitted message as Ŵ based on the received word Y^n. An error occurs when Ŵ ≠ W.

Before going into the details of the channel coding theorem, we define the following terms. An (M, n) code for the channel (X, p(y|x), Y), where M is the number of codewords and n is the length of each codeword, consists of the following:
1. An index set {1, 2, · · · , M}, whose elements denote the messages to be transmitted.

2. An encoding function, which is a mapping X^n : {1, 2, · · · , M} → X^n that gives the codewords X^n(1), X^n(2), · · · , X^n(M). The set of codewords is referred to as the codebook.
3. At the receiver we receive a sequence from the channel, so we need a decoding function

g : Y^n → {1, 2, · · · , M}

which is a deterministic rule that assigns a guess to each possible received vector.

4. Let λ_i be the conditional probability of error given that the codeword corresponding to index i was sent. An error occurs whenever the decoded message index is not equal to the transmitted index i, and this probability is given by

λ_i = Pr(g(Y^n) ≠ i | X^n = X^n(i)) = Σ_{y^n} p(y^n | x^n(i)) I(g(y^n) ≠ i)

where I(·) is the indicator function.


5. The maximal probability of error λ^{(n)} for an (M, n) code is defined as

λ^{(n)} = max_{i ∈ {1, 2, · · · , M}} λ_i

6. Similarly, we define the average probability of error

P_e^{(n)} = (1/M) Σ_{i=1}^{M} λ_i

We also know that

P_e^{(n)} ≤ λ^{(n)}

In the proof of the theorem we are going to show that, as the value of n increases, the average probability of error goes to 0 provided the transmission rate is below the channel capacity. (A small numerical sketch of these quantities is given after this list of definitions.)

7. The number of messages we can transmit is M and each codeword has length n. So, the rate of an (M, n) code is given by

R = (log M) / n   bits per transmission
8. We say a rate R is achievable if there exists a sequence of (⌈2^{nR}⌉, n) codes such that the maximal probability of error λ^{(n)} → 0 as n → ∞. Here ⌈·⌉ denotes the ceiling function.

9. We define the capacity of the discrete memoryless channel as the supremum of all achievable rates.
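To make these definitions concrete, here is a minimal Python sketch (not part of the lecture; the 3-bit repetition code, the binary symmetric channel with crossover probability 0.1, and the majority-vote decoder are all illustrative assumptions) that computes the rate R, the conditional error probabilities λ_i, the maximal probability of error λ^{(n)}, and the average probability of error P_e^{(n)} for a toy (M, n) code.

```python
import itertools
import math

# Toy (M, n) code: M = 2 messages, n = 3 (repetition code), used over a binary
# symmetric channel with crossover probability p. All of these choices are
# illustrative assumptions, not taken from the lecture.
p = 0.1
codebook = {1: (0, 0, 0), 2: (1, 1, 1)}            # encoding function X^n(i)
M, n = len(codebook), 3
R = math.log2(M) / n                               # rate in bits per transmission

def channel_prob(y, x):
    """p(y^n | x^n) for the memoryless BSC."""
    return math.prod(p if yi != xi else 1 - p for yi, xi in zip(y, x))

def decode(y):
    """Deterministic decoding rule g : Y^n -> {1, ..., M} (majority vote)."""
    return 2 if sum(y) > n / 2 else 1

# lambda_i = sum over y^n of p(y^n | x^n(i)) * I(g(y^n) != i)
lam = {i: sum(channel_prob(y, x)
              for y in itertools.product((0, 1), repeat=n)
              if decode(y) != i)
       for i, x in codebook.items()}

max_err = max(lam.values())                        # maximal probability of error
avg_err = sum(lam.values()) / M                    # average probability of error
print(f"R = {R:.3f}, lambda^(n) = {max_err:.4f}, P_e^(n) = {avg_err:.4f}")
```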

Noisy Channel Coding Theorem


For a discrete memoryless channel, all rates below the capacity C are achievable. Specifically, for every rate R < C there exists a sequence of (2^{nR}, n) codes with maximum probability of error λ^{(n)} → 0. We are going to prove the achievability part of the theorem: first we show that the average probability of error goes to 0 as the length of the codeword becomes large, and then we show that the maximum probability of error also goes to 0.

Proof:

In order to prove this, we proceed along the following steps:


• We fix the input distribution p(x) and independently generate 2^{nR} codewords according to the distribution

p(x^n) = Π_{i=1}^{n} p(x_i)

• We arrange the set of codewords as rows of a matrix, where each element of the matrix is randomly generated:

C = [ x_1(1)         x_2(1)         · · ·   x_n(1)
      ⋮              ⋮              ⋱       ⋮
      x_1(2^{nR})    x_2(2^{nR})    · · ·   x_n(2^{nR}) ]

• The probability that we generate a particular code C is

Pr(C) = Π_{w=1}^{2^{nR}} Π_{i=1}^{n} p(x_i(w))

• Once we have generated the code, we assume that:
  – The code that is generated is revealed to both the sender and the receiver.
  – The channel transition matrix is known at both the transmitter and the receiver.
• Now we know that a message W is chosen according to the uniform distribution; then

Pr(W = w) = 2^{−nR},   w = 1, 2, · · · , 2^{nR}

• Now the w-th codeword, corresponding to the w-th row of C, is sent over the channel.
• At the receiver, the sequence Y^n is received according to the distribution shown below:

P(y^n | x^n(w)) = Π_{i=1}^{n} p(y_i | x_i(w))

This is because the channel is a discrete memoryless channel.


• Once we have received the sequence Y^n, we perform jointly typical decoding, which is a sub-optimal technique but is asymptotically optimal.

How does typical set decoding work?

It works as follows.
o The receiver declares that the index W̃ was sent if the following conditions are satisfied:
– (X^n(W̃), Y^n) is jointly typical, i.e. (X^n(W̃), Y^n) ∈ A_ε^{(n)}.
– There is no other index k such that (X^n(k), Y^n) ∈ A_ε^{(n)}.
o In typical set decoding, an error occurs in the following cases:
1. There exists no such W̃ for the transmitted index.
2. There is more than one such W̃.
o In both these cases a decoding error occurs.
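The sketch below is a small Monte Carlo illustration of this decoding rule. The channel (a BSC with crossover probability q), the uniform Bernoulli(1/2) input distribution, and the particular values of n, R and ε are all assumptions made only for this sketch, not taken from the lecture. It generates a random codebook, transmits codeword 1, and declares an error exactly when no codeword, more than one codeword, or a codeword other than the first is jointly typical with the received sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions, not from the lecture): BSC crossover q,
# blocklength n, rate R < C, typicality slack eps, number of Monte Carlo trials.
q, n, R, eps, trials = 0.1, 200, 0.05, 0.1, 100

def h2(p):
    """Binary entropy in bits."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

H_XY = 1.0 + h2(q)      # joint entropy per symbol for uniform inputs on a BSC(q)
C = 1.0 - h2(q)         # BSC capacity, shown only for reference

errors = 0
for _ in range(trials):
    codebook = rng.integers(0, 2, size=(2 ** round(n * R), n))  # random code C
    x = codebook[0]                                             # send message W = 1
    y = (x + (rng.random(n) < q)) % 2                           # channel output Y^n

    # Empirical value of -(1/n) log2 p(x^n(w), y^n) for every codeword w.
    d = np.count_nonzero(codebook != y, axis=1)                 # Hamming distances
    emp_xy = 1.0 - (d * np.log2(q) + (n - d) * np.log2(1 - q)) / n

    # With uniform inputs, -(1/n) log2 p(x^n) and -(1/n) log2 p(y^n) are exactly 1,
    # so only the joint condition of joint typicality can fail here.
    typical = np.flatnonzero(np.abs(emp_xy - H_XY) < eps)

    # Declare W~ only if exactly one codeword is jointly typical with Y^n;
    # any other outcome is a decoding error (given that W = 1 was sent).
    if typical.size != 1 or typical[0] != 0:
        errors += 1

print(f"R = {R}, C = {C:.3f}, estimated Pr(error) = {errors / trials:.2f}")
```

At these modest parameters most residual errors come from the transmitted pair itself falling outside A_ε^{(n)}; as n grows that contribution also vanishes, in line with the proof that follows.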
Let us try to calculate the average probability of error, averaged over all codewords in a codebook and also over all codebooks. Here P(C) is the probability of choosing a codebook C, P_e^{(n)}(C) is the average probability of error associated with the codebook C, and λ_w(C) is the probability of error associated with the codeword of index w.
Pr(E) = Σ_C P(C) P_e^{(n)}(C)
      = Σ_C P(C) · (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C)
      = (1/2^{nR}) Σ_{w=1}^{2^{nR}} Σ_C P(C) λ_w(C)

From the above expression we can see that the average probability of error does not depend on the particular index w; this is because of the symmetry in the code construction.

So, without loss of generality, we assume that the message W = 1 was sent, and the average probability of error is given by
Pr(E) = (1/2^{nR}) Σ_{w=1}^{2^{nR}} Σ_C P(C) λ_w(C)
      = Σ_C P(C) λ_1(C)
      = Pr(E | W = 1)

Let E_i be the event that the i-th codeword and Y^n are jointly typical:

E_i = { (X^n(i), Y^n) ∈ A_ε^{(n)} },   i ∈ {1, 2, · · · , 2^{nR}}


As we know, the probability of error does not depend on which index was transmitted, so assume that index W = 1 was sent. We have

Pr(E | W = 1) = P(E_1^c ∪ E_2 ∪ · · · ∪ E_{2^{nR}} | W = 1)

This follows from the fact that, if index W = 1 was transmitted, the event E_1^c denotes that the pair (X^n(1), Y^n) does not form a jointly typical sequence, which is the first type of error, while the events E_2, · · · , E_{2^{nR}} denote the case where some other sequence X^n(i) is jointly typical with Y^n, which is the second type of error. So the error event is the union of all the events given above. Now, using the union bound, we can write the above expression as
P(E_1^c ∪ E_2 ∪ · · · ∪ E_{2^{nR}} | W = 1) ≤ P(E_1^c | W = 1) + Σ_{i=2}^{2^{nR}} P(E_i | W = 1)

The first term P(E_1^c | W = 1) corresponds to the event that (X^n(1), Y^n) is not jointly typical, and the second summation term corresponds to the event that (X^n(i), Y^n) is jointly typical for some i ≠ 1.

Now, from the properties of joint AEP, we have P(E_1^c | W = 1) ≤ ε for large n. This follows from the property of joint AEP that the probability of the transmitted pair (X^n(1), Y^n) being jointly typical → 1 as n → ∞.

We also know that X^n(1) and X^n(i) are independent for i ≠ 1, and hence so are Y^n and X^n(i). This implies that the probability of Y^n and X^n(i) being jointly typical is ≤ 2^{−n(I(X;Y)−3ε)}. We proved this result in the previous lecture.

Hence we have

P(E) = P(E | W = 1) ≤ P(E_1^c | W = 1) + Σ_{i=2}^{2^{nR}} P(E_i | W = 1)
                    ≤ ε + Σ_{i=2}^{2^{nR}} 2^{−n(I(X;Y)−3ε)}
                    = ε + (2^{nR} − 1) · 2^{−n(I(X;Y)−3ε)}
                    ≤ ε + 2^{3nε} · 2^{−n(I(X;Y)−R)}
                    ≤ 2ε
The last inequality holds as long as R < I(X;Y) − 3ε and n is sufficiently large. Thus, by choosing ε and n appropriately, we can make the average probability of error less than 2ε. A numerical illustration of how quickly the second term decays is given below.
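As a quick numerical sanity check, the sketch below evaluates the second term of the bound, (2^{nR} − 1)·2^{−n(I(X;Y)−3ε)}, for assumed, illustrative values of I(X;Y), ε and R (not from the lecture) and shows it decaying roughly like 2^{−n(I(X;Y)−R−3ε)}.

```python
# Second term of the bound under illustrative values (assumptions):
# I(X;Y) = 0.5 bit, eps = 0.01, R = 0.4, so R < I - 3*eps = 0.47.
I, eps, R = 0.5, 0.01, 0.4
for n in (50, 100, 200, 500, 1000):
    term = (2 ** (n * R) - 1) * 2 ** (-n * (I - 3 * eps))
    print(f"n = {n:4d}: (2^nR - 1) * 2^(-n(I - 3 eps)) = {term:.3e}")
# The term decays like 2^(-n(I - R - 3 eps)), so the overall bound approaches eps.
```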

Now we choose the distribution p(x) in the proof to be the distribution p*(x) that achieves capacity. The condition R < I(X;Y) can then be replaced by R < C.

So far we have shown that the average probability of error is less than 2ε as long as the condition R < C is satisfied. Now we have to show that the maximal probability of error can also be made small. This is done as follows.

Since the average probability of error over the codebooks is small, there exists at least one codebook C* with small average probability of error. We throw away the worst half of the codewords in this best codebook C*, that is, the codewords with the higher probabilities of error. The remaining best half of the codewords then have maximal probability of error less than 4ε. Now we reindex these best codewords, and in all we have 2^{nR−1} codewords. A small numerical sketch of this expurgation step is given below.
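Here is a small numerical sketch of that expurgation argument, using fabricated per-codeword error probabilities (an assumption purely for illustration): if the average error probability equals 2ε, then by Markov's inequality at most half of the codewords can have error probability above 4ε, so every codeword in the kept half satisfies λ_i ≤ 4ε.

```python
import numpy as np

# Fake per-codeword error probabilities lambda_i with average exactly 2*eps.
# The exponential shape is arbitrary; only the average matters for the argument.
rng = np.random.default_rng(1)
two_eps = 0.02
lam = rng.exponential(two_eps, size=1024)
lam *= two_eps / lam.mean()

kept = np.sort(lam)[: lam.size // 2]          # throw away the worst half
print(f"average over all codewords = {lam.mean():.4f}  (= 2*eps)")
print(f"max over the kept half     = {kept.max():.4f}  (<= 4*eps = {2 * two_eps})")
```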

So we have constructed a code of rate R′ = R − 1/n with maximal probability of error λ^{(n)} ≤ 4ε. Thus we have proved that, as long as the transmission rate is less than the capacity, the probability of error goes to 0 as n goes to infinity. In the next half of the lecture we prove the converse of the noisy channel coding theorem.

In the converse of the theorem, we show that if our transmission rate is above the channel capacity, then the probability of error is bounded away from 0.

Converse to the Noisy Channel Coding Theorem
Consider a binary symmetric source (BSS) emitting information bits at a rate R via a DMC of capacity C without feedback. Then the error probability at the destination satisfies

P_b ≥ H^{−1}(1 − C/R),   if R > C

where H^{−1} denotes the inverse binary entropy function, defined by H^{−1}(x) = min{p : H(p) = x}. Consider the following binary symmetric source model.
Binary Symmetric Source Model: BSS → U_1 · · · U_K → Channel Encoder → X_1 · · · X_N → DMC → Y_1 · · · Y_N → Channel Decoder → Û_1 · · · Û_K → Destination

The binary symmetric source outputs the bits U_i, which are fed to the channel encoder that generates the codeword symbols X_i; these are passed through the DMC. The rate of the code is K/N. At the receiver side, the received symbols Y_i are fed to the channel decoder, which produces the estimated message bits Û_i. We have to show that if the transmission rate is above capacity, then the probability of error is lower bounded by a non-zero quantity.

Proof:
Let us consider a BSS, i.e. a DMS with P_U(0) = P_U(1) = 1/2, so that H(U) = 1 bit. Also, for a DMC without feedback we can write the following expressions:
P(y_1, · · · , y_N | x_1, · · · , x_N) = Π_{i=1}^{N} P(y_i | x_i)    (1)

⟹ H(Y_1 · · · Y_N | X_1 · · · X_N) = Σ_{i=1}^{N} H(Y_i | X_i)    (2)
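A brute-force check of (1) and (2) on a tiny example may be helpful. The sketch below assumes, purely for illustration, a BSC with crossover probability q used N = 3 times with i.i.d. uniform inputs, and verifies that H(Y_1 · · · Y_N | X_1 · · · X_N) equals Σ_i H(Y_i | X_i) = N·H(q).

```python
import itertools
import math

q, N = 0.2, 3        # illustrative BSC crossover probability and number of uses

def px(x):
    """P(X_i = x): i.i.d. uniform inputs (an assumption for this check)."""
    return 0.5

def pyx(y, x):
    """P(Y_i = y | X_i = x) for the BSC."""
    return q if y != x else 1 - q

# H(Y_1...Y_N | X_1...X_N) computed directly from the joint distribution.
H_joint = 0.0
for xs in itertools.product((0, 1), repeat=N):
    p_xs = math.prod(px(x) for x in xs)
    for ys in itertools.product((0, 1), repeat=N):
        p_ys_given_xs = math.prod(pyx(y, x) for y, x in zip(ys, xs))  # eq. (1)
        H_joint -= p_xs * p_ys_given_xs * math.log2(p_ys_given_xs)

# Right-hand side of eq. (2): each term is H(Y_i | X_i) = H(q) for the BSC.
h_q = -q * math.log2(q) - (1 - q) * math.log2(1 - q)
print(f"H(Y^N|X^N) = {H_joint:.6f}   vs   sum_i H(Y_i|X_i) = {N * h_q:.6f}")
```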

Remember that our transmission rate is given by R = K/N bits per channel use. We are going to apply the data processing lemma, which states that further processing of data cannot increase mutual information.

Consider the Markov chain U → X → Y → Û. From the data processing lemma we get

I(U_1 · · · U_K; Û_1 · · · Û_K) ≤ I(X_1 · · · X_N; Û_1 · · · Û_K)    (3)

Again applying the data processing lemma, we get

I(X_1 · · · X_N; Û_1 · · · Û_K) ≤ I(X_1 · · · X_N; Y_1 · · · Y_N)    (4)

From the inequalities (3) and (4), we get

I(U_1 · · · U_K; Û_1 · · · Û_K) ≤ I(X_1 · · · X_N; Y_1 · · · Y_N)    (5)

On simplifying the term on the RHS of (5) using the definition of mutual information,

I(X_1 · · · X_N; Y_1 · · · Y_N) = H(Y_1 · · · Y_N) − H(Y_1 · · · Y_N | X_1 · · · X_N)

Since the channel is a DMC, each Y_i depends only on X_i.


H(Y_1 · · · Y_N) − H(Y_1 · · · Y_N | X_1 · · · X_N) = H(Y_1 · · · Y_N) − Σ_{i=1}^{N} H(Y_i | X_i)

Expanding the joint entropy term using the chain rule, followed by the fact that conditioning cannot increase entropy, we can write

H(Y_1 · · · Y_N) − Σ_{i=1}^{N} H(Y_i | X_i) ≤ Σ_{i=1}^{N} [H(Y_i) − H(Y_i | X_i)] = Σ_{i=1}^{N} I(X_i; Y_i) ≤ N C    (6)

Combining the results in (5) and (6), we can write

I(U_1 · · · U_K; Û_1 · · · Û_K) ≤ N C

We define the bit error probability as

P_b = (1/K) Σ_{i=1}^{K} P_{e,i}

where P_{e,i} = P(Û_i ≠ U_i).

Now let us find H(U_1 · · · U_K | Û_1 · · · Û_K):

H(U_1 · · · U_K | Û_1 · · · Û_K) = H(U_1 · · · U_K) − I(U_1 · · · U_K; Û_1 · · · Û_K)

Since the U_i are i.i.d. and each has an uncertainty of 1 bit, we can write

H(U_1 · · · U_K) − I(U_1 · · · U_K; Û_1 · · · Û_K) = K − I(U_1 · · · U_K; Û_1 · · · Û_K) ≥ K − N C


Finally, using the definition of the rate R = K/N, we can write

H(U_1 · · · U_K | Û_1 · · · Û_K) ≥ N(R − C)

On further simplifying the term H(U_1 · · · U_K | Û_1 · · · Û_K), we get

H(U_1 · · · U_K | Û_1 · · · Û_K) = Σ_{i=1}^{K} H(U_i | Û_1 · · · Û_K, U_1 · · · U_{i−1})    (7)
                                ≤ Σ_{i=1}^{K} H(U_i | Û_i)    (8)

Hence we get

Σ_{i=1}^{K} H(U_i | Û_i) ≥ N(R − C)    (9)

Using Fano's lemma, and since the input is binary [so the term log_2(2 − 1) = 0], we get

Σ_{i=1}^{K} H(U_i | Û_i) ≤ Σ_{i=1}^{K} H(P_{e,i})    (10)

Combining all the previous results and dividing both sides by K, we get

(1/K) Σ_{i=1}^{K} H(P_{e,i}) ≥ (N/K)(R − C) = 1 − C/R    (11)

Since H(p) is a concave function, by applying Jensen's inequality (which says that if a function is concave, then the expected value of the function is less than or equal to the function evaluated at the expected value) we get

(1/K) Σ_{i=1}^{K} H(P_{e,i}) ≤ H( (1/K) Σ_{i=1}^{K} P_{e,i} ) = H(P_b)    (12)
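A quick numerical check of (12), using a handful of assumed values for the P_{e,i} (chosen only for illustration), is sketched below.

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    p = np.clip(p, 1e-12, 1 - 1e-12)           # guard against log(0)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

Pe = np.array([0.01, 0.05, 0.10, 0.02, 0.20])  # assumed per-bit error probabilities
lhs = h2(Pe).mean()                            # (1/K) * sum_i H(P_e,i)
rhs = h2(Pe.mean())                            # H(P_b), P_b = (1/K) * sum_i P_e,i
print(f"(1/K) sum H(P_e,i) = {lhs:.4f}  <=  H(P_b) = {rhs:.4f}")
```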

Combining the above results (11) and (12), we get

H(P_b) ≥ 1 − C/R
Since R > C, the right-hand side is strictly positive, and hence P_b ≥ H^{−1}(1 − C/R) > 0. This shows that the bit error probability is bounded away from zero whenever the transmission rate exceeds capacity, and thereby we have shown the converse of the noisy channel coding theorem.
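Since H^{−1} has no closed form, the bound is easiest to evaluate numerically. The sketch below assumes, purely as an example, a BSC with crossover probability 0.1 (so C ≈ 0.531 bits per use), inverts the binary entropy function by bisection, and prints the resulting lower bound on P_b for a few rates above capacity.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def h2_inv(x):
    """H^{-1}(x) = min{p : H(p) = x}, found by bisection on [0, 1/2]."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        if h2(mid) < x:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

C = 1.0 - h2(0.1)                  # capacity of the illustrative BSC(0.1)
for R in (0.6, 0.8, 1.0):          # source rates above capacity
    print(f"R = {R:.1f} > C = {C:.3f}:  P_b >= {h2_inv(1.0 - C / R):.4f}")
```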
