Shannon Entropy
Dipan Kumar Ghosh
Indian Institute of Technology Bombay,
Powai, Mumbai 400076
April 15, 2017
1 Introduction
In the last lecture we introduced the concept of information. We discussed a method
of quantifying information and found that unlike the colloquial usage of the term ‘information’, the word in technical terms implies a measure of the uncertainty in a given
statement or a given situation. It was pointed out that when an event actually takes place
out of various possibilities that could arise before the event, the amount of uncertainty
that gets removed is a measure of the information associated with that event. We defined
a function H(p_1, p_2, . . . , p_M) as a measure of such information, where there are M possibilities associated with that event, occurring with probabilities p_1, p_2, . . . , p_M. We also defined an auxiliary function f(M) as equal to H when the probability of each of the M events is identical, i.e. equal to 1/M. We found that f(M) must satisfy certain properties,
which are as follows:
1. f(M) = H(1/M, 1/M, . . . , 1/M) is a non-negative, monotonic and continuously increasing function of M.
3. f(MN) = f(M) + f(N)
In the following we will find the explicit form of a function which satisfies the four properties mentioned above.
2 Information Measure
We claim that the function f(M) = C log M, with C > 0, satisfies the four properties mentioned above.
3. Let M > 1 and let r be an arbitrary positive integer. For any integer M we can then find an integer k such that M^k ≤ 2^r ≤ M^{k+1}. (For example, let M = 4 and r = 3; then 2^r = 8, which lies between 4 = 4^1 and 16 = 4^2, so that k = 1.) Since f(M) is a monotonic function of M, it then follows that
f(M^k) ≤ f(2^r) ≤ f(M^{k+1})

k f(M) ≤ r f(2) ≤ (k + 1) f(M)

k/r ≤ f(2)/f(M) ≤ (k + 1)/r

Consider now the function C log M. Since M^k ≤ 2^r ≤ M^{k+1}, taking logarithms we have

k/r ≤ log 2/log M ≤ (k + 1)/r
Thus both f(2)/f(M) and log 2/log M lie between k/r and (k + 1)/r. Clearly, the distance between them on the real line must be less than 1/r. Since r is arbitrary, we can make it indefinitely large, and in this limit
log 2/log M = f(2)/f(M), i.e. f(M) = (f(2)/log 2) log M ≡ C log M with C = f(2)/log 2 > 0.
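The sandwich argument above can also be checked numerically. The following is a minimal Python sketch (the choice M = 3 and the particular values of r are arbitrary and serve only as an illustration); it finds the integer k with M^k ≤ 2^r ≤ M^{k+1} for increasing r and shows k/r and (k + 1)/r squeezing log 2/log M:

import math

M = 3                      # arbitrary choice of M > 1 for the illustration
target = math.log(2) / math.log(M)

for r in (5, 50, 500, 5000):
    # find the integer k with M**k <= 2**r <= M**(k+1) by direct search
    k = 0
    while M ** (k + 1) <= 2 ** r:
        k += 1
    print(f"r={r:5d}   k/r={k / r:.6f}   (k+1)/r={(k + 1) / r:.6f}   "
          f"log2/logM={target:.6f}")

As r grows, the two bounds printed on each line close in on log 2/log M, exactly as in the argument above.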
To obtain H for general (unequal) probabilities, we use the grouping property: divide the M events into a group A containing the first r events and a group B containing the remaining M − r events. Then

H(p_1, p_2, . . . , p_M) = H( Σ_{i=1}^{r} p_i , Σ_{i=r+1}^{M} p_i )
    + ( Σ_{i=1}^{r} p_i ) H( p_1/Σ_{i=1}^{r} p_i , . . . , p_r/Σ_{i=1}^{r} p_i )
    + ( Σ_{i=r+1}^{M} p_i ) H( p_{r+1}/Σ_{i=r+1}^{M} p_i , . . . , p_M/Σ_{i=r+1}^{M} p_i )
Consider a total of s events, each having the same probability, with r of them in group A and s − r of them in group B. We can then write, using p_i = 1/s for each of the events,
H(1/s, 1/s, . . . , 1/s) − H(r/s, (s − r)/s) = (r/s) H(1/r, . . . , 1/r) + ((s − r)/s) H(1/(s − r), . . . , 1/(s − r))
where, in the above expression, there are r arguments of H in the first term on the right and s − r arguments in the second term. Using the definition of f(M), this gives
f(s) = H(r/s, (s − r)/s) + (r/s) f(r) + ((s − r)/s) f(s − r)
Substituting f(M) = C log M,

H(r/s, (s − r)/s) = C log s − (r/s) C log r − ((s − r)/s) C log(s − r)
                  = −C [ (r/s) log(r/s) + ((s − r)/s) log((s − r)/s) ]

which gives, writing p = r/s,

H(p, 1 − p) = −C [ p log p + (1 − p) log(1 − p) ]
We generalize the above to more than two events and assert that
H({p_i}) = −C Σ_{i=1}^{M} p_i log p_i
In the above we have proved this for M = 1 (where H = 0) and for M = 2. We can use the method of induction to prove that if the theorem is valid for M − 1, it is valid for M. Dividing the M events into two groups, one containing a single event (with probability p_1) and the other containing the remaining M − 1 events, we have, using the grouping property and the induction hypothesis,

H(p_1, p_2, . . . , p_M) = H(p_1, 1 − p_1) + (1 − p_1) H( p_2/(1 − p_1), . . . , p_M/(1 − p_1) )
                        = −C Σ_{i=1}^{M} p_i log p_i
[Figure: Plot of H(p, 1 − p) as a function of p for 0 ≤ p ≤ 1; the maximum occurs at p = 1/2.]
We will take C = 1 and the base of the logarithm to be 2. The above shows that the uncertainty associated with an event does not depend on the values that X takes but only on the probabilities of occurrence of the events. Consider the tossing of a coin. According to what we have shown above, since the head and the tail occur with probability 1/2 each, the uncertainty associated with a coin toss is
H(1/2, 1/2) = −Σ_i p_i log2 p_i = −(1/2) log2(1/2) − (1 − 1/2) log2(1 − 1/2) = 1
The uncertainty has its maximum value (1 bit) at p_head = p_tail = 1/2. If the coin is biased, the uncertainty decreases, because we become more certain about which way the coin is likely to land (Figure 2).
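The dependence on the bias alone can be made concrete with a minimal Python sketch (the particular values of p below are arbitrary) which evaluates H(p, 1 − p) for a few biases:

import math

def h2(p):
    # binary entropy H(p, 1-p) in bits; the p = 0 and p = 1 cases contribute zero
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 1.0):
    print(f"p(head) = {p:.1f}   H = {h2(p):.4f} bits")

The maximum, 1 bit, occurs at p = 1/2; any bias lowers the uncertainty.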
There are several interpretations of the concept of uncertainty measure.
1. The relation H({p_i}) = −Σ_i p_i log2 p_i is the weighted average, with weights p_i, of a random variable W(X) which assumes the value −log2 p_i when the random variable X takes the value x_i, i.e. W takes the value equal to the negative logarithm of the probability of X = x_i.

[Figure: A decision tree of yes/no questions ("Is it x1?", "Is it either x1 or x2?", "Is it x3?", "Is it x4?") that identifies the value of X in the example below; x1, x2 and x3 are each determined by two questions, x4 and x5 by three.]
Example : Suppose X takes five values x1, x2, x3, x4 and x5 with probabilities 0.3, 0.2, 0.2, 0.15 and 0.15 respectively. W takes the values −log2(0.3) = 1.737, −log2(0.2) = 2.322, 2.322, −log2(0.15) = 2.737 and 2.737 respectively, with the corresponding probabilities. Adding the contributions, we get H = 2.27 bits of uncertainty.
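The same arithmetic is easy to verify directly; the following short Python sketch (using the probabilities of the example above) lists W = −log2 p_i for each value and accumulates their weighted average:

import math

probs = {"x1": 0.3, "x2": 0.2, "x3": 0.2, "x4": 0.15, "x5": 0.15}

H = 0.0
for x, p in probs.items():
    w = -math.log2(p)          # value taken by W when X = x
    H += p * w                 # weighted average builds up the entropy
    print(f"{x}: p = {p:.2f}   W = -log2(p) = {w:.3f}")
print(f"H = {H:.2f} bits")      # about 2.27 bits, as in the text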
Flipping a coin once gives 1 bit of information. Flipping a coin n times (which is the same as flipping n coins simultaneously) gives n bits of information, because there are 2^n events, each with probability 1/2^n:

H = −2^n × (1/2^n) log2(1/2^n) = n
The above can easily be generalized to the case of a continuous variable, in which case

H(P) = ∫ P(x) log(1/P(x)) dx
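As a sanity check of the continuous formula, the minimal Python sketch below (the uniform density on [0, 4] is an arbitrary example) approximates the integral by a Riemann sum and compares it with the exact value log2(4) = 2 bits:

import math

a = 4.0                      # support [0, a] of a uniform density, chosen as an example
def P(x):
    return 1.0 / a           # uniform probability density on [0, a]

# Riemann-sum approximation of  H = integral of P(x) log2(1/P(x)) dx
n = 100000
dx = a / n
H = sum(P(i * dx) * math.log2(1.0 / P(i * dx)) * dx for i in range(n))
print(f"numerical H = {H:.4f} bits,  exact log2({a:g}) = {math.log2(a):.4f} bits")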
Gibbs’ Inequality
It can be seen from Figure 4 that ln(x) ≤ x − 1, with equality only at x = 1. The slope of ln x is 1/x, so at x = 1 the slope is 1; the tangent to ln x at x = 1 is therefore the line y = x − 1, which passes through the point (1, 0), and the concave curve ln x lies below this tangent everywhere. For two probability distributions P = {p_i} and Q = {q_i}, applying this with x = q_i/p_i gives Gibbs’ inequality, Σ_i p_i ln(q_i/p_i) ≤ Σ_i p_i (q_i/p_i − 1) = Σ_i q_i − Σ_i p_i = 0; the result holds in any base of the logarithm, since changing the base only multiplies the left-hand side by a positive constant.
H(P) − log(n) = Σ_i p_i log(1/p_i) − log(n) Σ_i p_i
             = Σ_i p_i [ log(1/p_i) − log(n) ]
             = Σ_i p_i log( (1/n)/p_i ) ≤ 0
where we have used Gibbs’ inequality in the last step, with P = {p_1, p_2, . . . , p_n} and Q = {1/n, 1/n, . . . , 1/n}, i.e. Q is the distribution in which each of the n events has the same probability 1/n. Thus we have, for the function H(P),
0 ≤ H(P ) ≤ log(n)
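These bounds are easy to verify numerically; the short Python sketch below (the example distributions are arbitrary) confirms that H(P) never exceeds log2 n and reaches it only for the uniform distribution:

import math

def entropy(p):
    # Shannon entropy in bits; terms with p_i = 0 contribute nothing
    return -sum(x * math.log2(x) for x in p if x > 0)

n = 4
examples = [
    [1.0, 0.0, 0.0, 0.0],        # all probability on one outcome
    [0.4, 0.3, 0.2, 0.1],        # a generic distribution
    [0.25, 0.25, 0.25, 0.25],    # the uniform distribution
]
for p in examples:
    print(f"H = {entropy(p):.4f} bits   (bound log2 n = {math.log2(n):.4f})")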
H(P) can be zero only when one of the p_i is 1 and the rest are zero, while it attains its maximum value log(n) when the distribution is uniform.
As an illustration, consider a large number of particles distributed over L boxes, p_i being the probability of a particle being found in the i-th box.
1. Consider the case where all particles are in a single box, i.e. p_i = 1 for that particular box and all other probabilities are zero. Clearly the entropy in this case is zero. The number of such configurations is the same as the number of boxes, viz. L.
2. Consider the case where the particles are distributed equally between two specific boxes. The number of different configurations is found by choosing two boxes out of L (we take L = 10^6) and putting half of the particles in one of the boxes and the other half in the second box. This gives L(L − 1)/2 ≈ 5 × 10^11 configurations. Since the probability of a particle being in either box is 1/2, the entropy of this configuration is (1/2) ln 2 + (1/2) ln 2 = ln 2. The entropy is somewhat higher than in the case where the particles are all in one single box. The number of configurations in the single-box case is 10^6, while in the case of two boxes it is 5 × 10^11. Thus if we started with a zero-entropy situation (and if these two situations were the only ones possible), the probability that the entropy becomes ln 2 is 5 × 10^11 / (5 × 10^11 + 10^6) ≈ 1 − 2 × 10^−6. This is simply a statement of the fact that the system equilibrates to a state of maximum entropy.
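The numbers quoted above can be reproduced with the short Python sketch below (it assumes L = 10^6 boxes, as in the text):

import math

L = 10**6                                    # number of boxes
single_box = L                               # configurations with all particles in one box
two_boxes = L * (L - 1) // 2                 # choose 2 boxes out of L: about 5 x 10^11

entropy_two = 2 * (0.5 * math.log(2))        # -(1/2)ln(1/2) - (1/2)ln(1/2) = ln 2
prob_two = two_boxes / (two_boxes + single_box)

print(f"single-box configurations : {single_box:.3g}")
print(f"two-box configurations    : {two_boxes:.3g}")
print(f"entropy of two-box state  : {entropy_two:.4f} (= ln 2)")
print(f"probability of ln 2 state : {prob_two:.8f}")   # about 1 - 2e-6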
4 Communication System
A typical communication system consists of a source which emits signals; an encoder, which provides a symbolic representation of the message using the bits generated by the source; a channel for transmission, such as an optical fiber, which may pick up stray noise along the way that degrades the signal; a receiver which intercepts the message; and finally a decoder. A channel's information capacity is defined as the rate (say, in Kbps) of user information that can be carried over a noisy channel with as small an error as possible. This is less than the raw channel capacity, which is the capacity in the absence of any noise. Suppose we wish to code the letters A, C, G, T by a two-bit code.
absence of any noise. Suppose we wish to code the letters A, C, G, T by a two bit code.
Assume that the letter A appears with 40% frequency, C with 30%, G and T with 15%
each. If we code A=00, C=01, G=10 and T=11, we have on average 2 bits of code per
letter. However, consider a new scheme where we code A=0, C=10, G=110 and T=111.
[Figure: Block diagram of a typical communication system: Source → Encoder → Channel (with noise added in transit) → Receiver → Decoder.]
The number of bits per letter (on average) is 0.4 × 1 + 0.3 × 2 + 0.15 × 3 + 0.15 × 3 = 1.9, which is a small saving over the previous scheme, but a saving nevertheless.
The entropy associated with the source (which sets the optimal compression possible) is −Σ_i p_i log2 p_i = −0.4 log2(0.4) − 0.3 log2(0.3) − 0.15 log2(0.15) − 0.15 log2(0.15) = 1.871 bits per letter. This does not tell us how to construct codes but gives an idea of the optimal compression.
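The saving of the variable-length scheme and the entropy bound can be verified with the minimal Python sketch below (it uses the letter frequencies and the code given above):

import math

freq = {"A": 0.40, "C": 0.30, "G": 0.15, "T": 0.15}
code = {"A": "0", "C": "10", "G": "110", "T": "111"}   # the variable-length scheme

avg_len = sum(freq[s] * len(code[s]) for s in freq)    # average bits per letter
entropy = -sum(p * math.log2(p) for p in freq.values())

print("fixed 2-bit code : 2.000 bits/letter")
print(f"variable code    : {avg_len:.3f} bits/letter")
print(f"source entropy   : {entropy:.3f} bits/letter")  # optimal compression limit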
Shannon’s Noiseless Coding theorem, which is applicable to all uniquely decipherable codes, provides a limit on the average length of a code that can be carried with a high degree of fidelity over a noiseless channel. We will prove the theorem for the special case of a “prefix code”, in which no code word is a prefix of another code word. The following examples illustrate these notions. Consider first the code
A = 0
B = 1
C = 00
D = 11
This is not a uniquely decipherable code. The following is an example of a uniquely decipherable code which is not a prefix code.
word   code   comments
A      0
B      01     A is a prefix of B
C      011    B is a prefix of C
D      0111   C is a prefix of D
The following two are valid prefix codes.
Code 1: A = 00, B = 01, C = 10, D = 11
Code 2: A = 0, B = 10, C = 110, D = 111
A prefix code is best illustrated through a tree diagram which hangs upside down from a node. From the node we take one step to the left if the bit is 0 and one step to the right if the bit is 1. When a code word terminates at a letter, we have a ‘leaf’. Take the following illustration for coding the word “QUANTUM” with the prefix code given below.
[Figure: Binary tree of the prefix code below; left branches are 0, right branches are 1, and the leaves are the letters A, Q, T, N, U and M.]
word   code
A      0
Q      100
T      1010
N      1011
U      110
M      111
The word “QUANTUM” will then be coded as 100 110 0 1011 1010 110 111, which has 21 bits against the 56 bits required to code it using a byte for every letter; the coded message is thus only 37.5% of the original size. The corresponding tree is shown above.
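The encoding, and the decoding that the prefix property makes possible, can be demonstrated with the short Python sketch below (it uses the code table given above):

code = {"A": "0", "Q": "100", "T": "1010", "N": "1011", "U": "110", "M": "111"}

word = "QUANTUM"
bits = "".join(code[letter] for letter in word)
print(bits, len(bits), "bits")          # 21 bits versus 8*7 = 56 bits with one byte per letter

# Decoding: because no code word is a prefix of another, we can read the bit
# stream left to right and emit a letter as soon as a code word matches.
decode = {v: k for k, v in code.items()}
decoded, buf = [], ""
for b in bits:
    buf += b
    if buf in decode:
        decoded.append(decode[buf])
        buf = ""
print("".join(decoded))                 # QUANTUM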
If the i-th code word is a leaf at depth n_i, the length of that code word is n_i itself. If n_k is the depth of the tree, we have n_k ≥ n_{k−1} ≥ . . . ≥ n_1. The maximum number of leaves appears in the tree when the only terminal points of the tree are at level n_k, in which case there are 2^{n_k} of them. A leaf at level n_i removes a fraction 1/2^{n_i} of the possible leaves at level n_k, i.e. it removes 2^{n_k − n_i} of them. Thus we have

Σ_{i=1}^{k} 2^{n_k − n_i} ≤ 2^{n_k}   ⟹   Σ_{i=1}^{k} 1/2^{n_i} ≤ 1
The last relation is known as the “Kraft inequality”. For a set of integers n_1, n_2, . . . , n_k, satisfying the Kraft inequality is both a necessary and a sufficient condition for the existence of a prefix code whose code word lengths are equal to these integers.
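The Kraft sum is easy to evaluate for the codes listed earlier; a minimal Python check (the labels merely recall the examples above):

def kraft_sum(lengths):
    # sum of 2**(-n_i) over the code word lengths n_i
    return sum(2.0 ** (-n) for n in lengths)

codes = {
    "A=0, B=1, C=00, D=11 (not uniquely decipherable)": [1, 1, 2, 2],
    "prefix code A=00, B=01, C=10, D=11": [2, 2, 2, 2],
    "prefix code A=0, B=10, C=110, D=111": [1, 2, 3, 3],
}
for name, lengths in codes.items():
    print(f"{name}: Kraft sum = {kraft_sum(lengths):.3f}")

The first set of lengths gives a Kraft sum of 1.5 > 1, so no prefix code with those lengths can exist; the two prefix codes give sums equal to 1, consistent with the inequality.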
Shannon’s Theorem
Given a source with alphabet {a_1, a_2, . . . , a_k} whose letters occur with probabilities {p_1, p_2, . . . , p_k} and entropy H(X) = −Σ_{i=1}^{k} p_i log2 p_i, the average length of a uniquely decipherable code satisfies

n̄ ≥ H(X),   i.e.   Σ_i p_i n_i ≥ H
Proof:

H − n̄ = −Σ_i p_i log2 p_i − Σ_i p_i n_i
      = Σ_i p_i [ log2(1/p_i) − n_i ]
      = Σ_i p_i [ log2(1/p_i) + log2 2^{−n_i} ]
      = Σ_i p_i log2( 2^{−n_i}/p_i )
      ≤ (1/ln 2) Σ_i p_i ( 2^{−n_i}/p_i − 1 ) = (1/ln 2) ( Σ_i 2^{−n_i} − 1 ) ≤ 0

where the inequality uses ln x ≤ x − 1 (the positive factor 1/ln 2 comes from changing the base of the logarithm and does not affect the sign), and the last step uses the Kraft inequality.
Example :
There are two coins of which one is a fair coin while the other has heads on both sides.
A coin is selected at random and tossed twice. If the tosses result in two heads, what
information does one get regarding the coin that was selected to begin with? Let X be a
random variable which takes value 0 if the coin chosen is a fair coin and takes value 1 for
the biased coin. Let Y be the number of heads. H(X) is the initial uncertainty regarding
the selected coin (which is a one bit uncertainty). The uncertainty remaining when the
number of heads is revealed is H(X|Y). The information conveyed about the value of X by revealing Y is then given by I(X; Y) = H(X) − H(X|Y). Note that if the value of Y is
zero or 1, there is no uncertainty remaining because the coin must then be a fair coin. If
the coin is fair, the probability of selecting it and getting Y = 2 is (1/2) × (1/4) = 1/8. If the coin is biased, the corresponding probability is (1/2) × 1 = 1/2 (in both cases the factor 1/2 is the probability that the particular coin is selected). Thus the probability of getting Y = 2 is 1/8 + 1/2 = 5/8. We now need to weight the remaining uncertainty H(X | Y = 2) by this probability to obtain H(X|Y). Using Bayes' theorem, we have
P(X | Y = 2) = P(Y = 2 | X) P(X) / P(Y = 2)
Using the above probability, we can see that given that Y = 2, the probability of X =
0 is 1/5 while the corresponding probability for X = 1 is 4/5. We then have
H(X|Y) = (5/8) [ (4/5) log2(5/4) + (1/5) log2 5 ] = 0.45
Thus the information conveyed about X is I(X; Y) = H(X) − H(X|Y) = 1 − 0.45 = 0.55 bits.
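The arithmetic of this example can be checked with the minimal Python sketch below (here X = 0 denotes the fair coin, X = 1 the two-headed coin, and Y is the number of heads in the two tosses):

import math

# Joint probabilities P(X, Y) for choosing a coin at random and tossing it twice.
# Fair coin (X=0): Y is binomial(2, 1/2); two-headed coin (X=1): Y = 2 always.
joint = {(0, 0): 0.5 * 0.25, (0, 1): 0.5 * 0.5, (0, 2): 0.5 * 0.25,
         (1, 2): 0.5 * 1.0}

p_y2 = sum(p for (x, y), p in joint.items() if y == 2)     # 5/8
post = {x: joint[(x, 2)] / p_y2 for x in (0, 1)}           # P(X | Y = 2): 1/5 and 4/5

h_x_given_y2 = -sum(p * math.log2(p) for p in post.values())
h_x_given_y = p_y2 * h_x_given_y2      # Y = 0 or 1 leaves no uncertainty about X
# H(X) = 1 bit, since each coin is chosen with probability 1/2
print(f"P(Y=2) = {p_y2}, P(X=0|Y=2) = {post[0]:.2f}, P(X=1|Y=2) = {post[1]:.2f}")
print(f"H(X|Y) = {h_x_given_y:.2f} bits, I(X;Y) = {1 - h_x_given_y:.2f} bits")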