An Introduction To Information Theory: Adrish Banerjee
Adrish Banerjee
Department of Electrical Engineering
Indian Institute of Technology Kanpur
Kanpur, Uttar Pradesh
India
So we start our lecture with the basic model of the communication channel that we are going to consider.
Figure: Communication channel model. The message W enters the encoder, which produces X^n; X^n is transmitted over the channel p(y|x); the channel output Y^n enters the decoder, which produces the estimate Ŵ.
The various blocks of the above model are explained below.
* W is the message that we want to send.
* The encoder encodes the message into an n-bit codeword, denoted X^n, which is transmitted over the DMC.
* The channel here is a discrete memoryless channel whose transition probabilities are given by p(y|x).
* The received sequence is Y^n.
* The decoder estimates the transmitted message as Ŵ based on the received sequence Y^n. An error occurs when Ŵ ≠ W.
Before going into the details of the channel coding theorem, we define the following terms. An (M, n) code for the channel (X, p(y|x), Y), where M is the number of codewords and n is the length of each codeword, consists of the following:
1. An index set {1, 2, ..., M}, whose elements denote the corresponding messages to be transmitted.
2. An encoding function X^n : {1, 2, ..., M} → X^n, which gives the codewords X^n(1), X^n(2), ..., X^n(M). The set of codewords is referred to as the codebook.
3. At the receiver we receive a channel output sequence, so we need a decoding function
   g : Y^n → {1, 2, ..., M},
   which is a deterministic rule that assigns a guess to each possible received vector.
4. Let λ_i be the conditional probability of error given that the codeword corresponding to index i was sent. An error occurs whenever the decoded message index is not equal to the transmitted index i, and λ_i is given as follows, where I(·) denotes the indicator function:
   λ_i = Pr(g(Y^n) ≠ i | X^n = x^n(i)) = Σ_{y^n} p(y^n | x^n(i)) I(g(y^n) ≠ i)
5. The maximal probability of error for an (M, n) code is defined as
   λ^{(n)} = max_{i ∈ {1, 2, ..., M}} λ_i
6. The average probability of error for an (M, n) code is defined as
   P_e^{(n)} = (1/M) Σ_{i=1}^{M} λ_i
7. The number of message indices is M and each codeword has length n. So the rate of an (M, n) code is given by
   R = (log M)/n   bits per transmission
8. We say a rate R is achievable if there exists a sequence of (⌈2^{nR}⌉, n) codes such that the maximal probability of error λ^{(n)} → 0 as n → ∞. Here ⌈·⌉ denotes the ceiling function.
9. We define the capacity of the discrete memoryless channel as the supremum of all achievable rates.
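To make these definitions concrete, here is a minimal Python sketch (not from the lecture) that builds a toy (M, n) = (2, 3) repetition code for a binary symmetric channel and computes the rate R = log2(M)/n and the conditional error probabilities λ_i by exhaustive enumeration; the crossover probability p = 0.1 and the majority-vote decoding rule g are illustrative assumptions.

```python
# Toy (M, n) = (2, 3) code over a BSC: rate and per-index error probabilities.
from itertools import product
from math import log2

p = 0.1                                   # assumed BSC crossover probability
codebook = {1: (0, 0, 0), 2: (1, 1, 1)}   # encoding function: index -> X^n(i)
M, n = len(codebook), 3
R = log2(M) / n                           # rate in bits per transmission

def g(y):
    """Decoding rule: majority vote maps each received vector to an index."""
    return 2 if sum(y) >= 2 else 1

def channel_prob(y, x):
    """P(y^n | x^n) for a memoryless BSC: product of per-symbol probabilities."""
    prob = 1.0
    for yi, xi in zip(y, x):
        prob *= p if yi != xi else (1 - p)
    return prob

# lambda_i = sum over y^n of p(y^n | x^n(i)) * 1{g(y^n) != i}
lam = {}
for i, x in codebook.items():
    lam[i] = sum(channel_prob(y, x) for y in product((0, 1), repeat=n) if g(y) != i)

max_err = max(lam.values())               # maximal probability of error lambda^(n)
print(f"R = {R:.3f} bits/transmission, lambda = {lam}, max error = {max_err:.4f}")
```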
Channel coding theorem (achievability): for a discrete memoryless channel, all rates R below the capacity C are achievable.
Proof:
• We arrange the set of codewords as rows of a matrix, as shown below, where each element of the matrix is generated independently at random according to the input distribution p(x):
C = [ x_1(1)        x_2(1)        ···   x_n(1)
        ⋮              ⋮            ⋱      ⋮
      x_1(2^{nR})   x_2(2^{nR})   ···   x_n(2^{nR}) ]
• Once we have generated the code, we assume the following:
  – The code that is generated is revealed to both the sender and the receiver.
  – The channel transition matrix p(y|x) is known to both the transmitter and the receiver.
• A message W is chosen according to the uniform distribution, i.e., Pr(W = w) = 2^{−nR}, w = 1, 2, ..., 2^{nR}.
• Now the wth codeword corresponding to the wth row of C is sent over the channel.
• At the receiver, the sequence Y^n is received according to the distribution
  P(y^n | x^n(w)) = ∏_{i=1}^{n} p(y_i | x_i(w))
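As an illustration of this product form, the short sketch below (assuming a small binary-input, binary-output DMC given by a hypothetical transition matrix and an arbitrary codeword) draws Y^n symbol by symbol and evaluates P(y^n | x^n(w)) as the product of the per-symbol probabilities p(y_i | x_i(w)).

```python
# Simulate a memoryless channel and evaluate P(y^n | x^n(w)) as a product.
import random

# p_y_given_x[x][y]: transition probabilities of an assumed binary-input/binary-output DMC
p_y_given_x = {0: {0: 0.9, 1: 0.1},
               1: {0: 0.2, 1: 0.8}}

x_n = [0, 1, 1, 0, 1]                     # hypothetical transmitted codeword x^n(w)

# Draw Y^n: each Y_i depends only on X_i (memoryless property)
y_n = [random.choices((0, 1), weights=[p_y_given_x[x][0], p_y_given_x[x][1]])[0]
       for x in x_n]

# P(y^n | x^n(w)) = product over i of p(y_i | x_i(w))
prob = 1.0
for xi, yi in zip(x_n, y_n):
    prob *= p_y_given_x[xi][yi]

print("received", y_n, "with conditional probability", prob)
```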
The decoder uses jointly typical decoding, which works as follows:
o The receiver declares that the index W̃ was sent if the following conditions are satisfied:
  – (X^n(W̃), Y^n) is jointly typical, i.e., (X^n(W̃), Y^n) ∈ A_ε^{(n)}.
  – There is no other index k such that (X^n(k), Y^n) ∈ A_ε^{(n)}.
o In typical set decoding, an error occurs in the following cases:
  1. There exists no such W̃ for the transmitted index.
  2. There is more than one such W̃ for the transmitted index.
In both these cases the decoder declares a decoding error.
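The sketch below is a rough illustration of jointly typical decoding, not the lecture's construction: it generates a small random codebook, passes one codeword through an assumed BSC, and declares an index W̃ only if exactly one codeword satisfies the three joint-AEP conditions defining A_ε^{(n)}. The parameters (ε, n, R, crossover probability) and the use of a BSC are illustrative assumptions.

```python
# Jointly typical decoding over a BSC with a small random codebook.
import random
from math import log2

eps, n, R = 0.1, 200, 0.02                     # illustrative parameters; 2^{nR} = 16 codewords
p_flip = 0.05                                  # assumed BSC crossover probability
px = {0: 0.5, 1: 0.5}                          # assumed input distribution p(x)
py = {0: 0.5, 1: 0.5}                          # resulting output distribution (symmetric setup)
pxy = {(x, y): px[x] * (p_flip if x != y else 1 - p_flip)
       for x in (0, 1) for y in (0, 1)}

def H(dist):
    """Entropy (bits) of a distribution given as a dict of probabilities."""
    return -sum(q * log2(q) for q in dist.values() if q > 0)

def jointly_typical(xs, ys):
    """Check the three conditions that define the jointly typical set A_eps^(n)."""
    ex = -sum(log2(px[x]) for x in xs) / n
    ey = -sum(log2(py[y]) for y in ys) / n
    exy = -sum(log2(pxy[(x, y)]) for x, y in zip(xs, ys)) / n
    return (abs(ex - H(px)) < eps and abs(ey - H(py)) < eps
            and abs(exy - H(pxy)) < eps)

random.seed(1)
M = round(2 ** (n * R))                        # 2^{nR} codewords
codebook = [[random.randint(0, 1) for _ in range(n)] for _ in range(M)]

x_sent = codebook[0]                           # send the codeword with the first index
y_recv = [x ^ (random.random() < p_flip) for x in x_sent]   # pass it through the BSC

# Declare W~ only if exactly one codeword is jointly typical with Y^n; otherwise an error.
hits = [i for i, cw in enumerate(codebook) if jointly_typical(cw, y_recv)]
w_hat = hits[0] if len(hits) == 1 else None
print("decoded index:", w_hat, "->", "correct" if w_hat == 0 else "decoding error")
```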
Let us try to calculate the average probability of error, averaged over all codewords in a codebook and also averaged over all codebooks. Here P(C) is the probability of choosing a particular codebook C, P^{(n)}(C) is the average probability of error associated with codebook C, and λ_w(C) is the probability of error associated with codeword w of codebook C.
Pr(E) = Σ_C P(C) P^{(n)}(C)
      = Σ_C P(C) · (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C)
      = (1/2^{nR}) Σ_{w=1}^{2^{nR}} Σ_C P(C) λ_w(C)
From the above expression we can see that the average probability of error does not depend on the particular index w; this is because of the symmetry in the code construction. So, without loss of generality, we can assume that the message W = 1 was sent, and the average probability of error is given by
Pr(E) = (1/2^{nR}) Σ_{w=1}^{2^{nR}} Σ_C P(C) λ_w(C)
      = Σ_C P(C) λ_1(C)
      = Pr(E | W = 1)
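As a quick sanity check of this symmetry (not part of the lecture), the following Monte Carlo sketch averages the exact per-message error probabilities over many randomly generated codebooks for a toy BSC setup; a minimum-distance decoder that declares an error on ties stands in for the typical-set decoder, and all parameters are illustrative.

```python
# Averaged over random codebooks, the error probability is the same for every index.
import random
from itertools import product
from math import prod

p, n, M, trials = 0.1, 3, 2, 2000              # assumed BSC crossover, blocklength, messages
random.seed(0)

def lam(codebook, w):
    """Exact error probability lambda_w(C) for message w under the decoder below."""
    def decode(y):
        dists = [sum(a != b for a, b in zip(y, codebook[i])) for i in range(M)]
        winners = [i for i, d in enumerate(dists) if d == min(dists)]
        return winners[0] if len(winners) == 1 else None   # ties are declared errors
    return sum(prod(p if yi != xi else 1 - p for yi, xi in zip(y, codebook[w]))
               for y in product((0, 1), repeat=n) if decode(y) != w)

avg = [0.0] * M
for _ in range(trials):
    C = [[random.randint(0, 1) for _ in range(n)] for _ in range(M)]   # random codebook
    for w in range(M):
        avg[w] += lam(C, w) / trials

print("average error probability by message index:", [round(a, 3) for a in avg])
```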
Let E_i be the event that the ith codeword and Y^n are jointly typical:
E_i = { (X^n(i), Y^n) ∈ A_ε^{(n)} },   i ∈ {1, 2, ..., 2^{nR}}
As we noted, the probability of error does not depend on which index has been transmitted, so assume that index W = 1 was sent. We have
Pr(E | W = 1) = P(E_1^c ∪ E_2 ∪ ⋯ ∪ E_{2^{nR}} | W = 1)
This follows from the fact that, if index W = 1 was transmitted, the event E_1^c denotes that the pair (X^n(1), Y^n) does not form a jointly typical sequence, which is the first type of error, and the events E_2, ⋯, E_{2^{nR}} denote the cases where some other sequence X^n(i) is jointly typical with Y^n, which is the second type of error. So the error event is the union of all the events given above. Now, using the union bound, we can write the above expression as
P(E_1^c ∪ E_2 ∪ ⋯ ∪ E_{2^{nR}} | W = 1) ≤ P(E_1^c | W = 1) + Σ_{i=2}^{2^{nR}} P(E_i | W = 1)
The first term P(E_1^c | W = 1) corresponds to the event that (X^n(1), Y^n) is not jointly typical, and the terms in the summation correspond to the events that (X^n(i), Y^n), i ≥ 2, is jointly typical.
From the properties of the joint AEP, we have P(E_1^c | W = 1) ≤ ε for large n. This follows because, for large n, the transmitted pair (X^n(1), Y^n) is jointly typical with probability that → 1 as n → ∞.
We also know that X^n(1) and X^n(i), i ≠ 1, are independent, and therefore so are Y^n and X^n(i). This implies that the probability of Y^n and X^n(i) being jointly typical is ≤ 2^{−n(I(X;Y) − 3ε)}. We proved this result in the previous lecture.
Hence we have
P(E) = P(E | W = 1) ≤ P(E_1^c | W = 1) + Σ_{i=2}^{2^{nR}} P(E_i | W = 1)
     ≤ ε + Σ_{i=2}^{2^{nR}} 2^{−n(I(X;Y) − 3ε)}
     = ε + (2^{nR} − 1) 2^{−n(I(X;Y) − 3ε)}
     ≤ ε + 2^{3nε} · 2^{−n(I(X;Y) − R)}
     ≤ 2ε
The above relation holds as long as R < I(X;Y) − 3ε and n is sufficiently large. We can make the bound as tight as desired by choosing ε and n so that the average probability of error is less than 2ε. Now, choose the input distribution p(x) in the proof to be the capacity-achieving distribution p*(x); then the condition R < I(X;Y) can be replaced by R < C.
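As a quick numeric illustration (not from the lecture), the sketch below evaluates the bound ε + 2^{3nε} · 2^{−n(I(X;Y)−R)} for assumed values of I(X;Y), R, and ε satisfying R < I(X;Y) − 3ε, showing how it drops below 2ε once n is large enough.

```python
# Evaluate the random-coding error bound for a few block lengths.
I_xy, R, eps = 0.5, 0.3, 0.05          # assumed values satisfying R < I(X;Y) - 3*eps

def bound(n):
    """Upper bound eps + 2^{3*n*eps} * 2^{-n*(I(X;Y) - R)} on the average error probability."""
    return eps + 2 ** (3 * n * eps) * 2 ** (-n * (I_xy - R))

for n in (10, 50, 100, 200, 500):
    print(f"n = {n:4d}: bound = {bound(n):.6f}   (2*eps = {2 * eps})")
```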
So far we have shown that the average probability of error is less than 2ε as long as the condition R < C is satisfied. Now we have to show that the maximal probability of error is also small. This is done as follows.
Since the average probability of error over the codebooks is small (at most 2ε), there exists at least one codebook C* with average probability of error at most 2ε. We throw away the worst half of the codewords in this best codebook C*, i.e., the codewords with the largest probability of error. The remaining best half of the codewords must all have probability of error less than 4ε, since otherwise the average would exceed 2ε. We then re-index these best codewords, leaving 2^{nR−1} codewords in all.
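The following small sketch (with illustrative numbers, not from the lecture) mimics this expurgation step: given per-codeword error probabilities whose average is at most 2ε, discarding the worst half leaves codewords whose error probabilities are all at most 4ε.

```python
# Expurgation: throwing away the worst half bounds the maximum of the rest by 4*eps.
import random

eps, M = 0.05, 64                                    # illustrative values; M = 2^{nR} codewords
random.seed(0)
# Hypothetical per-codeword error probabilities whose average is at most 2*eps
lambdas = [random.uniform(0, 2 * eps) for _ in range(M)]

best_half = sorted(lambdas)[: M // 2]                # keep the better half of the codewords
print("average over all codewords:", round(sum(lambdas) / M, 4), "(<= 2*eps =", 2 * eps, ")")
print("max over the kept half    :", round(max(best_half), 4), "(<= 4*eps =", 4 * eps, ")")
```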
So we have constructed a code of rate R′ = R − 1/n with maximal probability of error λ^{(n)} ≤ 4ε. We have thus proved that, as long as the transmission rate is less than the capacity, the probability of error goes to 0 as n goes to infinity. In the next half of the lecture we are going to prove the converse of the noisy channel coding theorem.
In the converse of the theorem, we show that if our transmission rate is above the channel capacity, then the probability of error can no longer be driven to 0; it is bounded away from 0.
Converse to Noisy channel coding theorem
Consider a binary symmetric source (BSS) emitting information bits at rate R over a DMC of capacity C without feedback. Then the error probability at the destination satisfies
P_b ≥ H^{−1}(1 − C/R),   if R > C,
where H^{−1} denotes the inverse binary entropy function, defined by H^{−1}(x) = min{p : H(p) = x}. Consider the following binary symmetric source model.
Figure: Binary symmetric source model. The BSS emits the information bits U_1 ⋯ U_K, the channel encoder maps them to the codeword X_1 ⋯ X_N, the DMC delivers Y_1 ⋯ Y_N, and the channel decoder produces the estimates Û_1 ⋯ Û_K for the destination.
The binary symmetric source outputs the bits U_i, which are fed to the channel encoder; the channel encoder generates the codeword symbols X_i, which are passed through the DMC. So the rate of the code is K/N. At the receiver side, the received symbols Y_i are fed to the channel decoder, which produces the estimated message bits Û_i. We now have to show that if the transmission rate is above capacity, then the bit error probability is lower bounded by a non-zero quantity.
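Before the proof, here is a minimal numeric sketch (not from the lecture) that evaluates the converse bound P_b ≥ H^{−1}(1 − C/R); H^{−1} is computed by bisection on [0, 1/2], where H is increasing, and the values of C and R below are illustrative assumptions.

```python
# Numerically evaluate the converse lower bound on the bit error probability.
from math import log2

def H(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def H_inv(x, tol=1e-9):
    """Inverse binary entropy: the p in [0, 1/2] with H(p) = x, found by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if H(mid) < x:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

C, R = 0.25, 0.5                                 # assumed capacity and transmission rate, R > C
print("lower bound on P_b:", H_inv(1 - C / R))   # approx 0.11 when 1 - C/R = 0.5
```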
Proof:
Let us consider the BSS, i.e., a DMS with P_U(0) = P_U(1) = 1/2; therefore H(U) = 1 bit. Also, for a DMC without feedback we can write the following expressions:
P(y_1, ⋯, y_N | x_1, ⋯, x_N) = ∏_{i=1}^{N} P(y_i | x_i)   (1)
⟹ H(Y_1 ⋯ Y_N | X_1 ⋯ X_N) = Σ_{i=1}^{N} H(Y_i | X_i)   (2)
Remember that our transmission rate is given by R = K/N bits per channel use. We are going to apply the data processing lemma, which states that further processing of data cannot increase mutual information.
The variables U → X → Y → Û form a Markov chain. From the data processing lemma we get
I(U_1 ⋯ U_K; Û_1 ⋯ Û_K) ≤ I(X_1 ⋯ X_N; Y_1 ⋯ Y_N)   (5)
Simplifying the term on the right-hand side of (5) using the definition of mutual information together with (2), expanding the joint entropy term using the chain rule, and then using the fact that conditioning cannot increase entropy, we can write
H(Y_1 ⋯ Y_N) − Σ_{i=1}^{N} H(Y_i | X_i) ≤ Σ_{i=1}^{N} [H(Y_i) − H(Y_i | X_i)] = Σ_{i=1}^{N} I(X_i; Y_i) ≤ NC   (6)
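As a small numeric check of the last inequality (not from the lecture), the sketch below computes I(X;Y) = H(Y) − H(Y|X) for a BSC with an assumed crossover probability q under a few input distributions and compares it with the capacity C = 1 − H(q).

```python
# For a BSC, I(X;Y) under any input distribution is at most C = 1 - H(q).
from math import log2

def H(probs):
    """Entropy (bits) of a probability vector."""
    return -sum(t * log2(t) for t in probs if t > 0)

q = 0.1                                     # assumed crossover probability
for a in (0.5, 0.3, 0.1):                   # a = Pr(X = 1), a few trial input distributions
    py1 = a * (1 - q) + (1 - a) * q         # Pr(Y = 1)
    I = H([py1, 1 - py1]) - H([q, 1 - q])   # I(X;Y) = H(Y) - H(Y|X), and H(Y|X) = H(q)
    print(f"Pr(X=1) = {a}: I(X;Y) = {I:.4f}  <=  C = {1 - H([q, 1 - q]):.4f}")
```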
We define the bit error probability as
P_b = (1/K) Σ_{i=1}^{K} P_{ei},
where P_{ei} = Pr(Û_i ≠ U_i).
Hence, using H(U_1 ⋯ U_K) = K, K = NR, the bound I(U_1 ⋯ U_K; Û_1 ⋯ Û_K) ≤ NC from (5) and (6), and the fact that Σ_{i=1}^{K} H(U_i | Û_i) ≥ H(U_1 ⋯ U_K | Û_1 ⋯ Û_K), we get
Σ_{i=1}^{K} H(U_i | Û_i) ≥ H(U_1 ⋯ U_K) − I(U_1 ⋯ U_K; Û_1 ⋯ Û_K) ≥ K − NC = N(R − C)   (9)
By Fano's inequality for a binary random variable, H(U_i | Û_i) ≤ H(P_{ei}). Combining this with (9) and dividing both sides by K = NR, we get
(1/K) Σ_{i=1}^{K} H(P_{ei}) ≥ (N/K)(R − C) = 1 − C/R   (11)
Since H(p) is a concave function, applying Jensen's inequality (which says that for a concave function the expected value of the function is less than or equal to the function evaluated at the expected value) gives
(1/K) Σ_{i=1}^{K} H(P_{ei}) ≤ H( (1/K) Σ_{i=1}^{K} P_{ei} ) = H(P_b)   (12)
Combining (11) and (12), we obtain
H(P_b) ≥ 1 − C/R
Since R > C, the right-hand side is strictly positive, and inverting the binary entropy function gives P_b ≥ H^{−1}(1 − C/R) > 0. This shows that the bit error probability is bounded away from zero whenever the transmission rate exceeds capacity, and thereby we have shown the converse of the noisy channel coding theorem.