Information Theory and Coding by Example
Information Theory and Coding by Example
This fundamental monograph introduces both the probabilistic and the algebraic
aspects of information theory and coding. It has evolved from the authors years
of experience teaching at the undergraduate level, including several Cambridge
Mathematical Tripos courses. The book provides relevant background material, a
wide range of worked examples and clear solutions to problems from real exam
papers. It is a valuable teaching aid for undergraduate and graduate students, or for
researchers and engineers who want to grasp the basic principles.
Mark Kelbert is a Reader in Statistics in the Department of Mathematics at
Swansea University. For many years he has also been associated with the Moscow
Institute of Information Transmission Problems and the International Institute of
Earthquake Prediction Theory and Mathematical Geophysics (Moscow).
Yuri Suhov is a Professor of Applied Probability in the Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge (Emeritus). He
is also affiliated to the University of Sao Paulo in Brazil and to the Moscow Institute
of Information Transmission Problems.
INFORMATION THEORY
A N D CODING BY EXAMPLE
M A R K K E L B E RT
Swansea University, and Universidade de Sao Paulo
Y U R I S U H OV
University of Cambridge, and Universidade de Sao Paulo
Contents
Preface
1
page vii
1
1
18
41
59
86
95
213
243
269
269
291
300
313
328
340
144
144
162
184
199
vi
Contents
366
366
397
409
436
453
480
Bibliography
Index
501
509
Preface
This book is partially based on the material covered in several Cambridge Mathematical Tripos courses: the third-year undergraduate courses Information Theory (which existed and evolved over the last four decades under slightly varied
titles) and Coding and Cryptography (a much younger and simplified course avoiding cumbersome technicalities), and a number of more advanced Part III courses
(Part III is a Cambridge equivalent to an MSc in Mathematics). The presentation
revolves, essentially, around the following core concepts: (a) the entropy of a probability distribution as a measure of uncertainty (and the entropy rate of a random
process as a measure of variability of its sample trajectories), and (b) coding as a
means to measure and use redundancy in information generated by the process.
Thus, the contents of this book includes a more or less standard package of
information-theoretical material which can be found nowadays in courses taught
across the world, mainly at Computer Science and Electrical Engineering Departments and sometimes at Probability and/or Statistics Departments. What makes this
book different is, first of all, a wide range of examples (a pattern that we followed
from the onset of the series of textbooks Probability and Statistics by Example
by the present authors, published by Cambridge University Press). Most of these
examples are of a particular level adopted in Cambridge Mathematical Tripos exams. Therefore, our readers can make their own judgement about what level they
have reached or want to reach.
The second difference between this book and the majority of other books
on information theory or coding theory is that it covers both possible directions: probabilistic and algebraic. Typically, these lines of inquiry are presented
in different monographs, textbooks and courses, often by people who work in
different departments. It helped that the present authors had a long-time association with the Institute for Information Transmission Problems, a section of the
Russian Academy of Sciences, Moscow, where the tradition of embracing a broad
spectrum of problems was strongly encouraged. It suffices to list, among others,
vii
viii
Preface
Preface
ix
was active in this area more than 40 years ago. [Although on several advanced
topics Shannon, probably, could have thought, re-phrasing Einsteins words: Since
mathematicians have invaded the theory of communication, I do not understand it
myself anymore.]
During the years that passed after Shannons inceptions and inventions, mathematics changed drastically, and so did electrical engineering, let alone computer
science. Who could have foreseen such a development back in the 1940s and 1950s,
as the great rivalry between Shannons information-theoretical and Wieners cybernetical approaches was emerging? In fact, the latter promised huge (even fantastic)
benefits for the whole of humanity while the former only asserted that a modest goal of correcting transmission errors could be achieved within certain limits.
Wieners book [171] captivated the minds of 1950s and 1960s thinkers in practically all domains of intellectual activity. In particular, cybernetics became a serious
political issue in the Soviet Union and its satellite countries: first it was declared
a bourgeois anti-scientific theory, then it was over-enthusiastically embraced. [A
quotation from a 1953 critical review of cybernetics in a leading Soviet ideology
journal Problems of Philosophy reads: Imperialists are unable to resolve the controversies destroying the capitalist society. They cant prevent the imminent economical crisis. And so they try to find a solution not only in the frenzied arms race
but also in ideological warfare. In their profound despair they resort to the help of
pseudo-sciences that give them some glimmer of hope to prolong their survival.
The 1954 edition of the Soviet Concise Dictionary of Philosophy printed in hundreds of thousands of copies defined cybernetics as a reactionary pseudo-science
which appeared in the USA after World War II and later spread across other capitalist countries: a kind of modern mechanicism. However, under pressure from
top Soviet physicists who gained authority after successes of the Soviet nuclear
programme, the same journal, Problems of Philosophy, had to print in 1955 an article proclaiming positive views on cybernetics. The authors of this article included
Alexei Lyapunov and Sergei Sobolev, prominent Soviet mathematicians.]
Curiously, as was discovered in a recent biography on Wiener [35], there exist
secret [US] government documents that show how the FBI and the CIA pursued
Wiener at the height of the Cold War to thwart his social activism and the growing
influence of cybernetics at home and abroad. Interesting comparisons can be found
in [65].
However, history went its own way. As Freeman Dyson put it in his review [41]
of [35]: [Shannons theory] was mathematically elegant, clear, and easy to apply
to practical problems of communication. It was far more user-friendly than cybernetics. It became the basis of a new discipline called information theory . . . [In
modern times] electronic engineers learned information theory, the gospel according to Shannon, as part of their basic training, and cybernetics was forgotten.
Preface
Not quite forgotten, however: in the former Soviet Union there still exist at
least seven functioning institutes or departments named after cybernetics: two in
Moscow and two in Minsk, and one in each of Tallinn, Tbilisi, Tashkent and Kiev
(the latter being a renowned centre of computer science in the whole of the former USSR). In the UK there are at least four departments, at the Universities of
Bolton, Bradford, Hull and Reading, not counting various associations and societies. Across the world, cybernetics-related societies seem to flourish, displaying
an assortment of names, from concise ones such as the Institute of the Method
(Switzerland) or the Cybernetics Academy (Italy) to the Argentinian Association of the General Theory of Systems and Cybernetics, Buenos Aires. And we
were delighted to discover the existence of the Cambridge Cybernetics Society
(Belmont, CA, USA). By contrast, information theory figures only in a handful of
institutions names. Apparently, the old Shannon vs. Wiener dispute may not be
over yet.
In any case, Wieners personal reputation in mathematics remains rock solid:
it suffices to name a few gems such as the PaleyWiener theorem (created on
Wieners numerous visits to Cambridge), the WienerHopf method and, of course,
the Wiener process, particularly close to our hearts, to understand his true role in
scientific research and applications. However, existing recollections of this giant of
science depict an image of a complex and often troubled personality. (The title of
the biography [35] is quite revealing but such views are disputed, e.g., in the review
[107]. In this book we attempt to adopt a more tempered tone from the chapter on
Wiener in [75], pp. 386391.) On the other hand, available accounts of Shannons
life (as well as other fathers of information and coding theory, notably, Richard
Hamming) give a consistent picture of a quiet, intelligent and humorous person.
It is our hope that this fact will not present a hindrance for writing Shannons
biographies and that in future we will see as many books on Shannon as we see on
Wiener.
As was said before, the purpose of this book is twofold: to provide a synthetic
introduction both to probabilistic and algebraic aspects of the theory supported by
a significant number of problems and examples, and to discuss a number of topics
rarely presented in most mainstream books. Chapters 13 give an introduction into
the basics of information theory and coding with some discussion spilling over to
more modern topics. We concentrate on typical problems and examples [many of
them originated in Cambridge courses] more than on providing a detailed presentation of the theory behind them. Chapter 4 gives a brief introduction into a variety
of topics from information theory. Here the presentation is more concise and some
important results are given without proofs.
Because the large part of the text stemmed from lecture notes and various solutions to class and exam problems, there are inevitable repetitions, multitudes of
Preface
xi
1
Essentials of Information Theory
(1.1.2)
(1.1.3a)
j=1
(1.1.3b)
j=1
where (u) = P(U1 = u), u I, are the initial probabilities and P(u, u ) = P(U j+1 =
u |U j = u), u, u I, are transition probabilities. A Markov source is called stationary if P(U j = u) = (u), j 1, i.e. = { (u), u = 1, . . . , m} is an invariant
row-vector for matrix P = {P(u, v)}: (u)P(u, v) = (v), v I, or, shortly,
uI
P = .
(c) A degenerated example of a Markov source is where a source emits repeated
symbols. Here,
P(U1 = U2 = = Uk = u) = p(u), u I,
P(Uk = Uk ) = 0, 1 k < k ,
(1.1.3c)
Strings f (u) that are images, under f , of symbols u I are called codewords
(in code f ). A code has (constant) length N if the value s (the length of a codeword) equals N for all codewords. A message u(n) = u1 u2 . . . un is represented as a
concatenation of codewords
f (u(n) ) = f (u1 ) f (u2 ) . . . f (un );
it is again a string from J .
Definition 1.1.3 We say that a code is lossless if u = u implies that f (u) = f (u ).
(That is, the map f : I J is one-to-one.) A code is called decipherable if any
string from J is the image of at most one message. A string x is a prefix in another
string y if y = xz, i.e. y may be represented as a result of a concatenation of x and z.
A code is prefix-free if no codeword is a prefix in any other codeword (e.g. a code
of constant length is prefix-free).
A prefix-free code is decipherable, but not vice versa:
Example 1.1.4 A code with three source letters 1, 2, 3 and the binary encoder
alphabet J = {0, 1} given by
f (1) = 0, f (2) = 01, f (3) = 011
is decipherable, but not prefix-free.
Theorem 1.1.5 (The Kraft inequality) Given positive integers s1 , . . . , sm , there
exists a decipherable code f : I J , with codewords of lengths s1 , . . . , sm , iff
m
qs
1.
(1.1.4)
i=1
Furthermore, under condition (1.1.4) there exists a prefix-free code with codewords
of lengths s1 , . . . , sm .
Proof (I) Sufficiency. Let (1.1.4) hold. Our goal is to construct a prefix-free code
with codewords of lengths s1 , . . . , sm . Rewrite (1.1.4) as
s
nl ql 1,
l=1
(1.1.5)
or
s1
ns qs 1 nl ql ,
l=1
(1.1.6a)
(1.1.6b)
(1.1.6.s1)
(1.1.6.s)
Observe that actually either ni+1 = 0 or ni is less than the RHS of the inequality,
for all i = 1, . . . , s 1 (by definition, ns 1 so that for i = s 1 the second possibility occurs). We can perform the following construction. First choose n1 words
of length 1, using distinct symbols from J: this is possible in view of (1.1.6.s).
It leaves (q n1 ) symbols unused; we can form (q n1 )q words of length 2 by
appending a symbol to each. Choose n2 codewords from these: we can do so in
view of (1.1.6.s1). We still have q2 n1 q n2 words unused: form n3 codewords,
etc. In the course of the construction, no new word contains a previous codeword
as a prefix. Hence, the code constructed is prefix-free.
(II) Necessity. Suppose there exists a decipherable code in J with codeword
lengths s1 , . . . , sm . Set s = max si and observe that for any positive integer r
rs
r
s
q 1 + + qsm = bl ql
l=1
where bl is the number of ways r codewords can be put together to form a string of
length l.
sm
++q
1
= exp (log r + log s) .
r
1/r 1/r
Leon G. Kraft introduced inequality (1.1.4) in his MIT PhD thesis in 1949.
One of the principal aims of the theory is to find the best (that is, the shortest)
decipherable (or prefix-free) code. We now adopt a probabilistic point of view and
assume that symbol u I is emitted by a source with probability p(u):
P(Uk = u) = p(u).
[At this point, there is no need to specify a joint probability of more than one
subsequently emitted symbol.]
Recall, given a code f : I J , we encode a letter i I by a prescribed codeword f (i) = x1 . . . xs(i) of length s(i). For a random symbol, the generated codeword
becomes a random string from J . When f is lossless, the probability of generating
a given string as a codeword for a symbol is precisely p(i) if the string coincides
with f (i) and 0 if there is no letter i I with this property. If f is not one-to-one,
the probability of a string equals the sum of terms p(i) for which the codeword f (i)
equals this string. Then the length of a codeword becomes a random variable, S,
with the probability distribution
P(S = s) =
1(s(i) = s)p(i).
(1.1.7)
1im
We are looking for a decipherable code that minimises the expected word-length:
ES =
sP(S = s) = s(i)p(i).
s1
i=1
(1.1.8)
Theorem 1.1.7
lows:
The optimal value for problem (1.1.8) is lower-bounded as folmin ES hq (p(1), . . . , p(m)),
(1.1.9)
(1.1.10)
where
i
< 0,
z = 0, and
L
= p(i) + qs(i) ln q = 0,
s(i)
whence
p(i)
= qs(i) ,
ln q
p(i)/( ln q) = 1,
i.e. ln q = 1.
Hence,
s(i) = logq p(i),
1 i m,
is the (unique) optimiser for the relaxed problem, giving the value hq from (1.1.10).
The relaxed problem is solved on a larger set of variables s(i); hence, its minimal
value does not exceed that in the original one.
Remark 1.1.8 The quantity hq defined in (1.1.10) plays a central role in the
whole of information theory. It is called the q-ary entropy of the probability distribution (p(x), x I) and will emerge in a great number of situations. Here we note
that the dependence on q is captured in the formula
hq (p(1), . . . , p(m)) =
1
h2 (p(1), . . . , p(m))
log q
(1.1.11)
Worked Example 1.1.9 (a) Give an example of a lossless code with alphabet
Jq which does not satisfy the Kraft inequality. Give an example of a lossless code
with the expected code-length strictly less than hq (X).
(b) Show that the Kraft sum qs(i) associated with a lossless code may be
i
p(1) = p(2) = 1/3 the expected codeword-length Es(X) = 4/3 < h(X) = log 3 =
1.585.
(b) Assume that the alphabet size m = I = 2(2L 1) for some positive
integer L. Consider the lossless code assigning to the letters x I the codewords
0, 1, 00, 01, 10, 11, 000, . . ., with the maximum codeword-length L. The Kraft sum is
2s(x) =
xI
2s(x) =
lL x:s(x)=l
2l 2l = L,
lL
(1.1.12)
where hq = p(i) logq p(i) is the q-ary entropy of the source; see (1.1.10).
i
Proof The LHS inequality is established in (1.1.9). For the RHS inequality, let
s(i) be a positive integer such that
qs(i) p(i) < qs(i)+1 .
The non-strict bound here implies qs(i) p(i) = 1, i.e. the Kraft inequality.
i
Hence, there exists a decipherable code with codeword-lengths s(1), . . . , s(m). The
strict bound implies
s(i) <
log p(i)
+ 1,
log q
and thus
ES <
log q
+ p(i) =
i
h
+ 1.
log q
on average, at least k binary digits for decipherable encoding. Using a term bit for
a unit of entropy, we say that on average the encoding requires at least k bits.
Moreover, the NLCT leads to a ShannonFano encoding procedure: we fix positive integer codeword-lengths s(1), . . . , s(m) such that qs(i) p(i) < qs(i)+1 , or,
equivalently,
(1.1.13)
logq p(i) s(i) < logq p(i) + 1; that is, s(i) = logq p(i) .
Then construct a prefix-free code, from the shortest s(i) upwards, ensuring that
the previous codewords are not prefixes. The Kraft inequality guarantees enough
room. The obtained code may not be optimal but has the mean codeword-length
satisfying the same inequalities (1.1.13) as an optimal code.
Optimality is achieved by Huffmans encoding fmH : Im Jq . We first discuss
it for binary encodings, when q = 2 (i.e. J = {0, 1}). The algorithm constructs a
binary tree, as follows.
(i) First, order the letters i I so that p(1) p(2) p(m).
(ii) Assign symbol 0 to letter m 1 and 1 to letter m.
(iii) Construct a reduced alphabet Im1 = {1, . . . , m 2, (m 1, m)}, with probabilities
p(1), . . . , p(m 2), p(m 1) + p(m).
Repeat steps (i) and (ii) with the reduced alphabet, etc. We obtain a binary tree. For
an example of Huffmans encoding for m = 7 see Figure 1.1.
The number of branches we must pass through in order to reach a root i of the
tree equals s(i). The tree structure, together with the identification of the roots
as source letters, guarantees that encoding is prefix-free. The optimality of binary
Huffman encoding follows from the following two simple lemmas.
10
m= 7
i
pi
f(i)
0
100
101
110
1110
11110
1
3
3
3
4
5
7 .025 11111
1 .5
2 .15
3 .15
4 .1
5 .05
6 .025
1.0
si
.5
.15
.025
Figure 1.1
Lemma 1.1.12 Any optimal prefix-free binary code has the codeword-lengths
reverse-ordered versus probabilities:
p(i) p(i ) implies
s(i) s(i ).
(1.1.14)
Proof If not, we can form a new code, by swapping the codewords for i and i .
This shortens the expected codeword-length and preserves the prefix-free property.
Lemma 1.1.13 In any optimal prefix-free binary code there exist, among the
codewords of maximum length, precisely two agreeing in all but the last digit.
Proof If not, then either (i) there exists a single codeword of maximum length,
or (ii) there exist two or more codewords of maximum length, and they all differ
before the last digit. In both cases we can drop the last digit from some word of
maximum length, without affecting the prefix-free property.
Theorem 1.1.14
codes.
Proof The proof proceeds with induction in m. For m = 2, the Huffman code f2H
has f2H (1) = 0, f2H (2) = 1, or vice versa, and is optimal. Assume the Huffman code
H is optimal for I
fm1
m1 , whatever the probability distribution. Suppose further that
11
the Huffman code fmH is not optimal for Im for some probability distribution. That
is, there is another prefix-free code, fm , for Im with a shorter expected word-length:
H
< ESm
.
ESm
(1.1.15)
H and ES ,
and ESm1
are the same as the corresponding contributions to ESm
m
H
respectively. Therefore, fm1 is better than fm1 : ESm1 < ESm1 , which contradicts the assumption.
In view of Theorem 1.1.14, we obtain
Corollary 1.1.15
codes.
The generalisation of the Huffman procedure to q-ary codes (with the code
alphabet Jq = {0, 1, . . . , q 1}) is straightforward: instead of merging two symbols, m 1, m Im , having lowest probabilities, you merge q of them (again with
the smallest probabilities), repeating the above argument. In fact, Huffmans original 1952 paper was written for a general encoding alphabet. There are numerous
modifications of the Huffman code covering unequal coding costs (where some of
the encoding digits j Jq are more expensive than others) and other factors; we
will not discuss them in this book.
12
1
p (1) < 1/
3
1
p (b)
_ p (1) < 2/
3 p (1) + p (b)
p (b)
(a)
p (b)
p (4)
(b)
Figure 1.2
Solution (a) Two cases are possible: the letter 1 either was, or was not merged with
other letters before two last steps in constructing a Huffman code. In the first case,
s(1) 2. Otherwise we have symbols 1, b and b , with
p(1) < 1/3,
Then letter 1 is to be merged, at the last but one step, with one of b, b , and hence
s(1) 2. Indeed, suppose that at least one codeword has length 1, and this codeword is assigned to letter 1 with p(1) < 1/3. Hence, the top of the Huffman tree is
as in Figure 1.2(a) with 0 p(b), p(b ) 1 p(1) and p(b) + p(b ) = 1 p(1).
min p(b), p(b ) . Hence, Figure 1.2(a) is impossible, and letter 1 has codewordlength 2.
The bound is sharp as both codes
{0, 01, 110, 111} and {00, 01, 10, 11}
are binary Huffman codes, e.g. for a probability distribution 1/3, 1/3, 1/4, 1/12.
(b) Now let p(1) > 2/5 and assume that letter 1 has a codeword-length s(1) 2 in
a Huffman code. Thus, letter 1 was merged with other letters before the last step.
That is, at a certain stage, we had symbols 1, b and b say, with
(A)
(B)
(C)
(D)
13
Indeed, if, say, p(b) < 1/2p(b ) then b should be selected instead of p(3) or p(4)
on the previous step when p(b ) was formed. By virtue of (D), p(b) 1/5 which
makes (A)+(C) impossible.
A piece of the Huffman tree over p(1) is then as in Figure 1.2(b), with p(3) +
p(4) = p(b ) and p(1) + p(b ) + p(b) 1. Write
p(1) = 2/5 + , p(b ) = 2/5 + + , p(b) = 2/5 + + ,
with > 0, , 0. Then
p(1) + p(b ) + p(b) = 6/5 + 3 + 2 1, and 1/5 + 3 + 2 .
This yields
p(b) 1/5 2 < 1/5.
However, since
probability p(b) should be merged with min p(3), p(4) , i.e. diagram (b) is
impossible. Hence, the letter 1 has codeword-length s(1) = 1.
Worked Example 1.1.17 Suppose that letters i1 , . . . , i5 are emitted with probabilities 0.45, 0.25, 0.2, 0.05, 0.05. Compute the expected word-length for Shannon
Fano and Huffman coding. Illustrate both methods by finding decipherable binary
codings in each case.
Solution In this case q = 2, and
ShannonFano:
with E codeword-length) = .9 + .5 + .6 + .25 + .25 = 2.5, and
Huffman:
pi codeword
.45
1
.25
01
.2
000
.05
0010
.05
0011
with E codeword-length) = 0.45 + 0.5 + 0.6 + 0.2 + 0.2 = 1.95.
14
Worked Example 1.1.18 A ShannonFano code is in general not optimal. However, it is not much longer than Huffmans. Prove that, if SSF is the Shannon
Fano codeword-length, then for any r = 1, 2, . . . and any decipherable code f with
codeword-length S ,
P S SSF r q1r .
Solution Write
P(S SSF r) =
p(i).
iI : s (i)sSF (i)r
p(i)
iI : s (i)sSF (i)r
p(i)
p(i)
p(i)
(i)+1r
iI : p(i)qs
(i)+1r
qs
iI
1r
=q
qs (i)
iI
1r
hq =
(1.1.16)
i1 ,...,in
(n)
hq
1
hq
en
+ .
n
n
n
(n)
We see that, for large n, en hq n.
(1.1.17)
15
Example 1.1.19 For a Bernoulli source emitting letter i with probability p(i) (cf.
Example 1.1.2), equation (1.1.16) yields
(n)
hq = p(i1 ) p(in ) logq p(i1 ) p(in )
i1 ,...,in
n
(1.1.18)
j=1 i1 ,...,in
where hq = p(i) logq p(i). Here, en hq . Thus, for n large, the minimal
expected codeword-length per source letter, in a segmented code, eventually attains the lower bound in (1.1.13), and hence does not exceed min ES, the minimal
expected codeword-length for letter-by-letter encodings. This phenomenon is much
more striking in the situation where the subsequent source letters are dependent. In
(n)
many cases hq n hq , i.e. en hq . This is the gist of data compression.
Therefore, statistics of long strings becomes an important property of a source.
Nominally, the strings u(n) = u1 . . . un of length n fill the Cartesian power I n ; the
total number of such strings is mn , and to encode them all we need mn = 2n log m
distinct codewords. If the codewords have a fixed length (which guarantees the
prefix-free property), this length is between n log m, and n log m, and the rate
of encoding, for large n, is log m bits/source letter. But if some strings are rare,
we can disregard them, reducing the number of codewords used. This leads to the
following definitions.
Definition 1.1.20 A source is said to be (reliably) encodable at rate R > 0 if, for
any n, we can find a set An I n such that
An 2nR
and
lim P(U(n) An ) = 1.
(1.1.19)
In other words, we can encode messages at rate R with a negligible error for long
source strings.
Definition 1.1.21 The information rate H of a given source is the infimum of the
reliable encoding rates:
H = inf[R : R is reliable].
Theorem 1.1.22
(1.1.20)
(1.1.21)
16
E
T
A
I
N
O
S
H
R
12000 9000 8000 8000 8000 8000 8000 6400 6200
D
L
U
C
M
F
W
Y
G
B
V
K
Q
J
X
Z
1700 1600 1200 800 500 400 400 200
(b) A similar idea was applied to the decimal and binary decomposition of a
given number. For example, take number . If the information rate for its binary
17
ln x
1
x
Figure 1.3
N 1nN n
or disprove that is a transcendental number, and transcendental numbers form
a set of probability one under the Bernoulli source of subsequent digits.) As the
results of numerical experiments show, for the number of digits N 500, 000 all
the above-mentioned numbers display the same pattern of behaviour as a completely random number; see [26]. In Section 1.3 we will calculate the information
rates of Bernoulli and Markov sources.
We conclude this section with the following simple but fundamental fact.
Theorem 1.1.24 (The Gibbs inequality: cf. PSE II, p. 421) Let {p(i)} and {p (i)}
be two probability distributions (on a finite or countable set I ). Then, for any b > 1,
p(i) logb
i
p (i)
0, i.e. p(i) logb p(i) p(i) logb p (i),
p(i)
i
i
The bound
logb x
x1
ln b
(1.1.22)
18
holds for each x > 0, with equality iff x = 1. Setting I = {i : p(i) > 0}, we have
1
p (i)
p (i)
p (i)
= p(i) logb
1
p(i)
p(i) logb
p(i)
p(i)
ln biI
i
iI
p(i)
1
1
=
p (i) p(i) =
p (i) 1 0.
ln b iI
ln b iI
iI
For equality we need: (a) p (i) = 1, i.e. p (i) = 0 when p(i) = 0; and (b)
p (i)/p(i) = 1 for i I .
iI
[In view of the adopted equality 0 log 0 = 0, the sum may be reduced to those xi
for which pX (xi ) > 0.]
Sometimes an alternative view is useful: i(A) represents the amount of information needed to specify event A and h(X) gives the expected amount of information
required to specify a random variable X.
Clearly, the entropy h(X) depends on the probability distribution, but not on
the values x1 , . . . , xm : h(X) = h(p1 , . . . , pm ). For m = 2 (a two-point probability
distribution), it is convenient to consider the function (p)(= 2 (p)) of a single
variable p [0, 1]:
(1.2.2a)
19
0.0
0.2
0.4
h(p,1p)
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
Figure 1.4
Entropy for p, q, p+q in [0,1]
1.5
h( p,q,1pq)
1.0
0.5
0.8
0.6
q
0.4
0.2
0.0
0.0
0.0
0.2
0.4
0.6
0.8
Figure 1.5
The graph of (p) is plotted in Figure 1.4. Observe that the graph is concave as
d2
(p) = log e [p(1 p)] < 0. See Figure 1.4.
2
dp
The graph of the entropy of a three-point distribution
20
(1.2.2b)
Here, pX,Y (i, j) is the joint probability P(X = xi ,Y = y j ) and pX|Y (xi |y j ) the conditional probability P(X = xi |Y = y j ). Clearly, (1.2.3) and (1.2.4) imply
h(X|Y ) = h(X,Y ) h(Y ).
(1.2.5)
(1.2.7)
21
the LHS equality occurring iff X takes a single value, and the RHS equality
occurring iff X takes m values with equal probabilities.
(b) The joint entropy obeys
h(X,Y ) h(X) + h(Y ),
(1.2.8)
with equality iff X and Y are independent, i.e. P(X = x,Y = y) = P(X = x)
P(Y = y) for all x, y I .
(c) The relative entropy is always non-negative:
h(X||Y ) 0,
(1.2.9)
i2
= h(X) + h(Y ).
We used here the identities pX,Y (i1 , i2 ) = pX (i1 ), pX,Y (i1 , i2 ) = pY (i2 ).
i2
i1
(1.2.10)
22
Show that:
(bi) when f E( f ) then the maximising probability distribution is uniform on K , with P(Z = k) = 1/( K), k K ;
(bii) when f < E( f ) and f is not constant then the maximising probability distribution has
P(Z = k) = pk = e f (k)
e f (i) ,
k K,
(1.2.11)
pk f (k) = .
(1.2.12)
Moreover, suppose that Z takes countably many values, but f 0 and for a
given there exists a < 0 such that e f (i) < and pk f (k) = where pk
i
(1.2.13)
23
Eq f = qk f (k) . Next, observe that the mean value (1.2.12) calculated for the
k
f
(k)
[ f (k)] e
f (k)e
d
k
2
2
=
k
2 = E[ f (Z)] [E f (Z)]
f
(i)
d
e
e f (i)
i
i
is positive (it yields the variance of the random variable f (Z)); for a non-constant
f the RHS is actually non-negative. Therefore, for non-constant f (i.e. with
f < E( f ) < f ), for all from the interval [ f , f ] there exists exactly one probability distribution of form (1.2.11) satisfying (1.2.12), and for f < E( f ) the
corresponding ( ) is < 0.
Next, we use the fact that the KullbackLeibler distance D(q||p ) (cf. (1.2.6))
satisfies D(q||p ) = qk log (qk /pk ) 0 (Gibbs inequality) and that qk f (k)
k
qk log pk
qk log e f (i)
k
=
k
pk
= pk log pk = h(p ).
k
For part (biii): the above argument still works for an infinite countable set K
provided that the value ( ) determined from (1.2.12) is < 0.
(c) By the Gibbs inequality hY (X) 0. Next, we use part (b) by taking f (k) = k,
= and = ln q. The maximum-entropy distribution can be written as pj =
(1 p)p j , j = 0, 1, 2, . . ., with kpk = , or = p/(1 p). The entropy of this
k
distribution equals
h(p ) = (1 p)p j log (1 p)p j
j
p
log p log(1 p) = ( + 1) log( + 1) log ,
1 p
24
Alternatively:
0 hY (X) = p(i) log
i
p(i)
(1 p)pi
ip(i)
i
log
= ( + 1) log( + 1) log .
+1
+1
The RHS is the entropy h(Y ) of the geometric random variable Y . Equality holds
iff X Y , i.e. X is geometric.
A simple but instructive corollary of the Gibbs inequality is
Lemma 1.2.5
(1.2.15)
25
log x
1/
2
Figure 1.6
p log p (1 p ) log(1 p )
(i) h
(ii) h log p ;
(iii) h 2(1 p ).
(p );
Solution Part (i) follows from the pooling inequality, and (ii) holds as
h pi log p = log p .
i
(1.2.16)
(1.2.17)
26
pi log pi
2im
p2
pm
= h(p1 , 1 p1 ) + (1 p1 )h
,...,
1 p1
1 p1
;
in the RHS the first term is ( ) and the second one does not exceed log(m 1).
Definition 1.2.9 Given random variables X, Y , Z, we say that X and Y are conditionally independent given Z if, for all x and y and for all z with P(Z = z) > 0,
P(X = x,Y = y|Z = z) = P(X = x|Z = z)P(Y = y|Z = z).
(1.2.18)
(1.2.19)
the first equality occurring iff X is a function of Y and the second equality holding
iff X and Y are independent.
(b) For all random variables X , Y , Z ,
h(X|Y, Z) h(X|Y ) h(X| (Y )),
(1.2.20)
the first equality occurring iff X and Z are conditionally independent given Y and
the second equality holding iff X and Z are conditionally independent given (Y ).
Proof (a) The LHS bound in (1.2.19) follows from definition (1.2.4) (since
h(X|Y ) is a sum of non-negative terms). The RHS bound follows from representation (1.2.5) and bound (1.2.8). The LHS quality in (1.2.19) is equivalent
to the equation h(X,Y ) = h(Y ) or h(X,Y ) = h( (X,Y )) with (X,Y ) = Y . In
view of Theorem 1.2.6, this occurs iff, with probability 1, the map (X,Y ) Y
is invertible, i.e. X is a function of Y . The RHS equality in (1.2.19) occurs iff
h(X,Y ) = h(X) + h(Y ), i.e. X and Y are independent.
(b) For the lower bound, use a formula analogous to (1.2.5):
h(X|Y, Z) = h(X, Z|Y ) h(Z|Y )
(1.2.21)
(1.2.22)
27
with equality iff X and Z are conditionally independent given Y . For the RHS
bound, use:
(i) a formula that is a particular case of (1.2.21): h(X|Y, (Y )) = h(X,Y | (Y ))
h(Y | (Y )), together with the remark that h(X|Y, (Y )) = h(X|Y );
(ii) an inequality which is a particular case of (1.2.22): h(X,Y | (Y ))
h(X| (Y )) + h(Y | (Y )), with equality iff X and Y are conditionally independent
given (Y ).
Theorems 1.2.8 above and 1.2.11 below show how the entropy h(X) and conditional entropy h(X|Y ) are controlled when X is nearly a constant (respectively,
nearly a function of Y ).
Theorem 1.2.11 (The generalised Fano inequality) For a pair of random variables, X and Y taking values x1 , . . . , xm and y1 , . . . , ym , if
m
P(X = x j ,Y = y j ) = 1 ,
(1.2.23)
(1.2.24)
j=1
then
where ( ) is defined in (1.2.3).
Proof
pY (y j ) j = P(X = x j ,Y = y j ) = .
j
(1.2.25)
If the random variable X takes countably many values {x1 , x2 , . . .}, the above
definitions may be repeated, as well as most of the statements; notable exceptions
are the RHS bound in (1.2.7) and inequalities (1.2.17) and (1.2.24).
Many properties of entropy listed so far are extended to the case of random
strings.
Theorem 1.2.12
(Y1 , . . . ,Yn ),
28
obeys
n
i=1
i=1
(1.2.26)
x(n) ,y(n)
satisfies
n
i=1
i=1
(1.2.27)
with the LHS equality holding iff X1 , . . . , Xn are conditionally independent, given
Y(n) , and the RHS equality holding iff, for each i = 1, . . . , n, Xi and {Yr : 1 r
n, r = i} are conditionally independent, given Yi .
Proof The proof repeats the arguments used previously in the scalar case.
Definition 1.2.13 The mutual information or mutual entropy, I(X : Y ), between
X and Y is defined as
I(X : Y ) := pX,Y (x, y) log
x,y
pX,Y (X,Y )
pX,Y (x, y)
= E log
pX (x)pY (y)
pX (X)pY (Y )
(1.2.28)
(1.2.29)
the first equality occurring iff X and (Y ) are independent, and the second iff X
and Y are conditionally independent, given (Y ).
29
1 p
1
, 0<z< ,
1 zp
p
Y =
K(1 p) + p
p
p
=
.
=K+
1 p
1 p
1 p
This yields
p =
K(1 p) + p
1 p
, E(zY ) =
,
1 + K(1 p)
1 + K(1 p) z(p + K(1 p))
and
E(zX ) =
1 zp
.
1 + K(1 p) z(p + K(1 p))
(1.2.30)
1 pX
,
1 zpX
p + K(1 p)
p
, 0 =
,
1 + K(1 p)
p + K(1 p)
(1.2.31)
30
i=1
i=1
(1.2.32)
(1.2.33)
i=1
Observe that
n
i=1
i=1
(1.2.34)
I(X : Y j ),
(1.2.35)
j=1
first under the assumption that Y1 , . . . ,Yn are independent random variables,
and then under the assumption that Y1 , . . . ,Yn are conditionally independent
given X .
(c) Prove or disprove by producing a counter-example the inequality
I(X : Y(n) )
I(X : Y j ),
(1.2.36)
j=1
first under the assumption that Y1 , . . . ,Yn are independent random variables,
and then under the assumption that Y1 , . . . ,Yn are conditionally independent
given X .
31
P(X = x, Z = z)
P(X = x)P(Z = z)
(1.2.37)
j=1
I(X : Y j ),
j=1
j=1
j=1
I(X : Y j ),
j=1
32
Recall that a real function f (y) defined on a convex set V Rm is called concave if
f (0 y(0) + 1 y(1) ) 0 f (y(0) ) + 1 f (y(1) )
for any y(0) , y(1) V and 0 , 1 [0, 1] with 0 + 1 = 1. It is called strictly concave
if the equality is attained only when either y(0) = y(1) or 0 1 = 0. We treat h(X) as
a function of variables p = (p1 , . . . , pm ); set V in this case is {y = (y1 , . . . , ym ) Rm :
yi 0, 1 i m, y1 + + ym = 1}.
Theorem 1.2.18
bution.
Proof Let the random variables X (i) have probability distributions p(i) , i = 0, 1,
and assume that the random variable takes values 0 and 1 with probabilities
0 and 1 , respectively, and is independent of X (0) , X (1) . Set X = X () ; then the
inequality h(0 p(0) + 1 p(1) ) 0 h(p(0) ) + 1 h(p(1) ) is equivalent to
h(X) h(X|)
(1.2.38)
(0)
and the RHS equals 0 0 pi +
(0)
(1)
(1 0 )pi = 1 pi ,
i.e. the probability distributions p(0) and p(1) are proportional. Then either they are
equal or 1 = 0, 0 = 1. The assumption 1 > 0 leads to a similar conclusion.
Worked Example 1.2.19
obeys
(X,Y ) = h(X) + h(Y ) 2I(X : Y )
= h(X,Y ) I(X : Y ) = 2h(X,Y ) h(X) h(Y ).
Prove that is symmetric, i.e. (X,Y ) = (Y, X) 0, and satisfies the triangle
inequality, i.e. (X,Y ) + (Y, Z) (X, Z). Show that (X,Y ) = 0 iff X and Y
are functions of each other. Also show that if X and X are functions of each other
then (X,Y ) = (X ,Y ). Hence, may be considered as a metric on the set of
the random variables X , considered up to equivalence: X X iff X and X are
functions of each other.
33
(1.2.39)
(1.2.40)
In fact, for all triples Xi1 , Xi2 , Xi3 as above, in the metric we have that
34
1in
I(X : Zi ) I Z I X : Z .
(1.2.41)
1in
As we show below, bound (1.2.41) holds for any X and Z (without referring to a
Markov property). Indeed, (1.2.41) is equivalent to
nh(X) h(X, Zi ) + h Z h(X) + h Z h X, Z
1in
or
h X, Z h(X)
h(X, Zi ) nh(X)
1in
which in turn is nothing but the inequality h Z|X h(Zi |X).
1in
(b) Show that h(p) p j Pjk log Pjk if P is a stochastic matrix and p is an
j=1 k=1
invariant vector of P: Pp = p.
Solution (a) By concavity of the log-function x log x, for all i , ci 0 such
m
i,k
the Gibbs inequality the RHS h(p). The equality holds iff PT Pp p, i.e. PT P =
I, the unit matrix. This happens iff P is a permutation matrix.
35
(b) The LHS equals h(Un ) for the stationary Markov source (U1 ,U2 , . . .) with equilibrium distribution p, whereas the RHS is h(Un |Un1 ). The general inequality
h(Un |Un1 ) h(Un ) gives the result.
Worked Example 1.2.23 The sequence of random variables {X j : j = 1, 2, . . .}
forms a DTMC with a finite state space.
(a) Quoting standard properties of conditional entropy, show that h(X j |X j1 )
h(X j |X j2 ) and, in the case of a stationary DTMC, h(X j |X j2 ) 2h(X j |X j1 ).
(b) Show that the mutual information I(Xm : Xn ) is non-decreasing in m and nonincreasing in n, 1 m n.
Solution (a) By the Markov property and stationarity
h(X j |X j1 ) = h(X j |X j1 , X j2 )
h(X j |X j2 ) h(X j , X j1 |X j2 )
= h(X j |X j1 , X j2 ) + h(X j1 |X j2 ) = 2h(X j |X j1 ).
(b) Write
I(Xm : Xn ) I(Xm : Xn+1 ) = h(Xm |Xn+1 ) h(Xm |Xn )
= h(Xm |Xn+1 ) h(Xm |Xn , Xn+1 ) (because Xm and
Xn+1 are conditionally independent, given Xn )
which is 0. Thus, I(Xm : Xn ) does not increase with n.
Similarly,
I(Xm1 : Xn ) I(Xm : Xn ) = h(Xn |Xm1 ) h(Xn |Xm , Xm1 ) 0.
Thus, I(Xm : Xn ) does not decrease with m.
Here, no assumption of stationarity has been used. The DTMC may not even be
time-homogeneous (i.e. the transition probabilities may depend not only on i and j
but also on the time of transition).
Worked Example 1.2.24
36
Solution By the Markov property, Xn1 and Xn+1 are conditionally independent,
given Xn . Hence,
h(Xn1 , Xn+1 |Xn ) = h(Xn+1 |Xn ) + h(Xn1 |Xn )
and I(Xn1 : Xn+1 |Xn ) = 0. Also,
I(Xn : Xn+m ) I(Xn : Xn+m+1 )
= h(Xn+m ) h(Xn+m+1 ) h(Xn , Xn+m+1 ) + h(Xn , Xn+m )
= h(Xn |Xn+m+1 ) h(Xn |Xn+m )
= h(Xn |Xn+m+1 ) h(Xn |Xn+m , Xn+m+1 ) 0,
the final equality holding because of the conditional independence and the last
inequality following from (1.2.21).
Worked Example 1.2.25
(1.2.43)
(1.2.44)
37
Solution (a) Using (1.2.42), we obtain for the function F(m) = h 1/m, . . . , 1/m
the following identity:
1 1 1
1 1
1
2
F(m ) = h
,..., , 2 ,..., 2
m m
m m m
m
1 1
1
1
, , . . . , 2 + F(m)
=h
m m2
m
m
..
.
1
m
1
,...,
+ F(m) = 2F(m).
=h
m
m
m
The induction hypothesis is F(mk1 ) = (k 1)F(m). Then
1
1
1
1
1
1
(m ) = h
k1 , . . . , k1 , k , . . . , k
m m
m m
m
m
1
1
1
1
=h
, , . . . , k + F(m)
mk1 mk
m
m
..
.
m
1
1
, . . . , k1 + F(m)
=h
k1
m
m
m
= (k 1)F(m) + F(m) = kF(m).
Now, for given positive integers b > 2 and m, we can find a positive integer n such
that 2n bm 2n+1 , i.e.
n
n 1
log2 b + .
m
m m
By monotonicity of F(m), we obtain nF(2) mF(b) (n + 1)F(2), or
F(b)
n 1
n
+ .
m F(2) m m
F(b)
1
We conclude that log2 b
, and letting m , F(b) = c log b with
F(2)
m
c = F(2).
38
r1
rm
, . . . , pm =
and obtain
r
r
r
rm
rm
r1
r1 1 r2
r1 1
1
,...,
,..., , ,...,
F(r1 )
h
=h
r
r
r r1
r r1 r
r
r
..
.
1
ri
1
,...,
c
log ri
=h
r
r
1im r
ri
ri
ri
log ri = c
log .
= c log r c
r
1im r
1im r
1
1
,...,
F(m) = h
m
m
m1
1
1
1 m1
,
h
,...,
+
,
=h
m m
m
m1
m1
i.e.
1 m1
m1
,
= F(m)
F(m 1).
m m
m
(1.2.45)
39
(1.2.46)
Next, we write
m
mF(m) =
F(k)
k=1
k1
F(k 1)
k
or, equivalently,
m
F(m) m + 1
2
k1
=
k F(k) k F(k 1) .
m
2m m(m + 1) k=1
The quantity in the square brackets is the arithmetic mean of m(m + 1)/2 terms of
a sequence
2
2
F(1), F(2) F(1), F(2) F(1), F(3) F(2), F(3) F(2),
3
3
k1
2
F(k 1), . . . ,
F(3) F(2), . . . , F(k)
3
k
k1
F(k)
F(k 1), . . .
k
that tends to 0. Hence, it goes to 0 and F(m)/m 0. Furthermore,
1
m1
F(m 1) F(m 1) 0,
F(m) F(m 1) = F(m)
m
m
and (1.2.46) holds. Now define
c(m) =
F(m)
,
log m
and prove that c(m) = const. It suffices to prove that c(p) = const for any prime
number p. First, let us prove that a sequence (c(p)) is bounded. Indeed, suppose the
numbers c(p) are not bounded from above. Then, we can find an infinite sequence
of primes p1 , p2 , . . . , pn , . . . such that pn is the minimal prime such that pn > pn1
and c(pn ) > c(pn1 ). By construction, if a prime q < pn then c(q) < c(pn ).
40
F(pn )
(1.2.47)
j=1
p
p
log
= 0, equations (1.2.46) and (1.2.47) imply that
p log p
p1
c(pn ) c(2) 0 which contradicts with the construction of c(p). Hence, c(p) is
bounded from above. Similarly, we check that c(p) is bounded from below. Moreover, the above proof yields that sup p c(p) and inf p c(p) are both attained.
Now assume that c( p) = sup p c(p) > c(2). Given a positive integer m, decompose into prime factors pm 1 = q1 1 . . . qs s with q1 = 2. Arguing as before, we
write the difference F( pm ) F( pm 1) as
Moreover, as lim
F( pm )
log( pm 1) + c( p) log( pm 1) F( pm 1)
log pm
s
pm
F( pm ) pm
+
=
log
j (c( p) c(q j )) log q j
F( pm )
c( pm ) pm
pm
+ (c( p) c(2)).
log
pm log pm
pm 1
1 1
2, 2
= 1 we get c = 1. Finally, as
h(p1 , . . . , pm ) = pi log pi
(1.2.48)
i=1
1.3 Shannons first coding theorem. The entropy rate of a Markov source
41
q1 qn ,
one has
k
i=1
pi qi ,
for all k = 1, . . . , n.
i=1
Then
h(p) h(q) whenever p q.
Solution We write the probability distributions p and q as non-increasing functions
of a discrete argument
p p(1) p(n) 0, q q(1) q(n) 0,
with
p(i) = q(i) = 1.
i
Condition p q means that if p = q then there exist i1 and i2 such that (a) 1 i1
i2 n, (b) q(i1 ) > p(i1 ) p(i2 ) > q(i2 ) and (c) q(i) p(i) for 1 i i1 , q(i) p(i) for
i i2 .
Now apply induction in s, the number of values i = 1, . . . , n for which q(i) = p(i) .
If s = 0 we have p = q and the entropies coincide. Make the induction hypothesis
and then increase s by 1. Take a pair i1 , i2 as above. Increase q(i2 ) and decrease q(i1 )
so that the sum q(i1 ) +q(i2 ) is preserved, until either q(i1 ) reaches p(i1 ) or q(i2 ) reaches
p(i2 ) (see Figure 1.7). Property (c) guarantees that the modified distributions p q.
As the function x (x) = x log x (1 x) log(1 x) strictly increases on
[0, 1/2]. Hence, the entropy of the modified distribution strictly increases. At the
end of this process we diminish s. Then we use our induction hypothesis.
42
i1
i2
Figure 1.7
(1.3.2)
(1.3.4)
1.3 Shannons first coding theorem. The entropy rate of a Markov source
Proof
43
For brevity, omit the upper index (n) in the notation u(n) and U(n) . Set
Bn := {u I n : pn (u) 2nR }
= {u I n : log pn (u) nR}
= {u I n : n (u) R}.
Then
1 P(U Bn ) =
uBn
Thus,
44
In fact,
P 2n(H+ ) pn (U(n) ) 2n(H )
1
(n)
= P H log pn (U ) H +
n
= P |n H| = 1 P |n H| > .
In other words, for all > 0 there exists n0 = n0 ( ) such that, for any n > n0 , the
set I n decomposes into disjoint subsets, n and Tn , with
(i) P U(n) n < ,
(ii) 2n(H+ ) P U(n) = u(n) 2n(H ) for all u(n) Tn .
Pictorially speaking, Tn is a set of typical strings and n is the residual set.
We conclude that, for a source with the asymptotic equipartition property, it is
worthwhile to encode the typical strings with codewords of the same length, and
the rest anyhow. Then we have the effective encoding rate H + o(1) bits/sourceletter, though the source emits log m bits/source-letter.
(b) Observe that
E n =
1
1
pn (u(n) ) log pn (u(n) ) = h(n) .
n u(n) I n
n
(1.3.7)
The simplest example of an information source (and one among the most
instructive) is a Bernoulli source.
Theorem 1.3.7
1.3 Shannons first coding theorem. The entropy rate of a Markov source
Proof
45
i=1
1 n
i . Observe that Ei = j p( j) log p( j) = h and
n i=1
E n = E
1 n
i
n i=1
=
1 n
1 n
Ei = h = h,
n i=1
n i=1
the final equality being in agreement with (1.3.7), since, for the Bernoulli source,
P
h(n) = nh (see (1.1.18)), and hence En = h. We immediately see that n h by
the law of large numbers. So H = h by Theorem 1.3.5 (FCT).
Theorem 1.3.8 (The law of large numbers for IID random variables) For any
sequence of IID random variables 1 , 2 , . . . with finite variance and mean Ei = r,
and for any > 0,
1 n
(1.3.8)
lim P | i r| = 0.
n
n i=1
Proof The proof of Theorem 1.3.8 is based on the famous Chebyshev inequality;
see PSE II, p. 368.
Lemma 1.3.9
Proof
1
E 2 .
2
(1.3.9)
46
This condition means that the DTMC is irreducible and aperiodic. Then (see
PSE II, p. 71), the DTMC has a unique invariant (equilibrium) distribution
(1), . . . , (m):
m
0 (u) 1,
(u) = 1,
(v) =
u=1
(u)P(u, v),
(1.3.10)
u=1
(1.3.11)
for all initial distribution { (u), u I}. Moreover, the convergence in (1.3.11) is
exponentially (geometrically) fast.
Theorem 1.3.10 Assume that condition (1.3.9) holds with r = 1. Then the DTMC
U1 ,U2 , . . . possesses a unique invariant distribution (1.3.10), and for any u, v I and
any initial distribution on I ,
|P(n) (u, v) (v)| (1 )n and |P(Un = v) (v)| (1 )n1 .
(1.3.12)
1u,vm
(1.3.13)
Proof We again use the Shannon FCT to check that n H where H is given by
1
(1.3.13), and n = log pn (U(n) ), cf. (1.3.3b). In other words, condition (1.3.9)
n
implies the asymptotic equipartition property for a Markov source.
The Markov property means that, for all string u(n) = u1 . . . un ,
pn (u(n) ) = (u1 )P(u1 , u2 ) P(un1 , un ),
(1.3.14a)
(1.3.14b)
For a random string, U(n) = U1 . . .Un , the random variable log pn (U(n) ) has a
similar form:
log (U1 ) log P(U1 ,U2 ) log P(Un1 ,Un ).
(1.3.15)
1.3 Shannons first coding theorem. The entropy rate of a Markov source
47
(1.3.16)
and write
n =
n1
1
1 + i+1 .
n
i=1
(1.3.17)
(1.3.18a)
= Pi1 (u)P(u, u ) log P(u, u ),
u,u
i 1.
(1.3.18b)
1 n
Ei = H,
n n
i=1
lim En = lim
n
P
and the convergence n H is again a law of large numbers, for the sequence
(i ):
1 n
(1.3.19)
lim P i H = 0.
n
n i=1
However, the situation here is not as simple as in the case of a Bernoulli source.
There are two difficulties to overcome: (i) Ei equals H only in the limit i ;
(ii) 1 , 2 , . . . are no longer independent. Even worse, they do not form a DTMC,
or even a Markov chain of a higher order. [A sequence U1 ,U2 , . . . is said to form a
DTMC of order k, if, for all n 1,
P(Un+k+1 = u |Un+k = uk , . . . ,Un+1 = u1 , . . .)
= P(Un+k+1 = u |Un+k = uk , . . . ,Un+1 = u1 ).
An obvious remark is that, in a DTMC of order k, the vectors U n =
(Un ,Un+1 , . . . ,Un+k1 ), n 1, form an ordinary DTMC.] In a sense, the memory in a sequence 1 , 2 , . . . is infinitely long. However, it decays exponentially:
the precise meaning of this is provided in Worked Example 1.3.14.
48
(1.3.20)
(i H)
C n,
(1.3.21)
i=1
C
and goes to zero as n .
n 2
(1.3.22)
Solution (Compare with PSE II, p. 72.) First, observe that (1.3.12) implies the
second bound in Theorem 1.3.10 as well as (1.3.10). Indeed, (v) is identified as
the limit
lim P(n) (u, v) = lim P(n1) (u, u)P(u, v) = (u)P(u, v),
(1.3.23)
(u) = 1,
u=1
then (v) = (u)P(n) (u, v) for all n 1. The limit n gives then
u
(1.3.24)
1.3 Shannons first coding theorem. The entropy rate of a Markov source
49
Then
mn+1 (v) = min P(n+1) (u, v) = min
u
Similarly,
Mn+1 (v) = max P(n+1) (u, v) = max
u
Since 0 mn (v) Mn (v) 1, both mn (v) and Mn (v) have the limits
m(v) = lim mn (v) lim Mn (v) = M(v).
n
n u,u
(1.3.25)
then M(v) = m(v) for each v. Furthermore, denoting the common value M(v) =
m(v) by (v), we obtain (1.3.22)
|P(n) (u, v) (v)| Mn (v) mn (v) (1 )n .
To prove (1.3.25), consider a DTMC on I I, with states (u1 , u2 ), and transition
probabilities
P(u1 , v1 )P(u2 , v2 ), if u1 = u2 ,
P (u1 , u2 ), (v1 , v2 ) =
P(u, v), if u1 = u2 = u; v1 = v2 = v,
0, if u1 = u2 and v1 = v2 .
(1.3.26)
It is easy to check that P (u1 , u2 ), (v1 , v2 ) is indeed a transition probability matrix
(of size m2 m2 ): if u1 = u2 = u then
P
(u
,
u
),
(v
,
v
)
= P(u, v) = 1
1 2
1 2
v1 ,v2
whereas if u1 = u2 then
P (u1 , u2 ), (v1 , v2 ) = P(u1 , v1 ) P(u2 , v2 ) = 1
v1 ,v2
v1
v2
50
(the inequalities 0 P (u1 , u2 ), (v1 , v2 ) 1 follow directly from the definition
(1.3.26)).
This is the so-called coupled DTMC on I I; we denote it by (Vn ,Wn ), n 1.
Observe that both components Vn and Wn are DTMCs with transition probabilities
P(u, v). More precisely, the components Vn and Wn move independently, until the
first (random) time when they coincide; we call it the coupling time. After time
the components Vn and Wn stick together and move synchronously, again with
transition probabilities P(u, v).
Suppose we start the coupled chain from a state (u, u ). Then
|P(n) (u, v) P(n) (u , v)|
= |P(Vn = v|V1 = u,W1 = u ) P(Wn = v|V1 = u,W1 = u )|
(because each component of (Vn ,Wn ) moves with the same transition probabilities)
= |P(Vn = v,Wn = v|V1 = u,W1 = u )
P(Vn = v,Wn = v|V1 = u,W1 = u )|
P(Vn = Wn |V1 = u,W1 = u )
= P( > n|V1 = u,W1 = u ).
(1.3.27)
(1.3.28)
2
|E (i H)(i+k H) | H + | log | (1 )k1 .
(1.3.29)
1.3 Shannons first coding theorem. The entropy rate of a Markov source
51
Solution For brevity, we assume i > 1; the case i = 1 requires minor changes.
Returning to the definition of random variables i , i > 1, write
E (i H)(i+k H)
= P(Ui = u,Ui+1 = u ;Ui+k = v,Ui+k+1 = v )
u,u v,v
log P(u, u ) H log P(v, v ) H .
(1.3.30)
i1
P
(u)P(u, u ) log P(u, u ) H
u,u v,v
(1.3.31)
Observe that (1.3.31) in fact vanishes because the sum vanishes due to the defiv,v
nition (1.3.13) of H.
The difference between sums (1.3.30) and (1.3.31) comes from the fact that the
probabilities
P(Ui = u,Ui+1 = u ;Ui+k = v,Ui+k+1 = v )
= Pi1 (u)P(u, u )P(k1) (u , v)P(v, v )
and
Pi1 (u)P(u, u ) (v)P(v, v )
E (i H) = E (i H)2
i=1
1in
+2
E (i H)( j H) .
(1.3.32)
1i< jn
The first sum in (1.3.32) is OK: it contains n terms E(i H)2 each bounded by a
2
constant (say, C may be taken to be H + | log | ). Thus this sum is at most C n.
52
It is the second sum that causes problems: it contains n(n 1) 2 terms. We bound
it as follows:
n
E (i H)( j H) |E (i H)(i+k H) | , (1.3.33)
i=1 k=1
1i< jn
and use (1.3.29) to finish the proof.
Our next theorem shows the role of the (relative) entropy in the asymptotic analysis of probabilities; see PSE I, p. 82.
Theorem 1.3.15 Let 1 , 2 , . . . be a sequence of IID random variables taking
values 0 and 1 with probabilities 1 p and p, respectively, 0 < p < 1. Then, for
any sequence kn of positive integers such that kn and n kn as n ,
P
i = kn
(1.3.34)
i=1
Here, means that the ratio of the left- and right-hand sides tends to 1 as n ,
kn
p (= pn ) denotes the ratio , and D(p||p ) stands for the relative entropy h(X||Y )
n
where X is distributed as i (i.e. it takes values 0 and 1 with probabilities 1 p
and p), while Y takes the same values with probabilities 1 p and p .
Proof Use Stirlings formula (see PSE I, p.72):
n! 2 nnn en .
(1.3.35)
[In fact, this formula admits a more precise form: n! = 2 nnn en+ (n) , where
1
1
< (n) <
, but for our purposes (1.3.35) is enough.] Then the proba12n + 1
12n
bility in the LHS of (1.3.34) is (for brevity, the subscript n in kn is omitted)
1/2
n k
n
nn
nk
p (1 p)
pk (1 p)nk
k
2 k(n k)
kk (n k)nk
1/2
= 2 np (1 p )
exp [k ln k/n (n k) ln (n k)/n + k ln p + (n k) ln(1 p)] .
But the RHS of the last formula coincides with the RHS of (1.3.34).
If p is close to p, we can write
1
1 1
+
D(p||p ) =
(p p)2 + O(|p p|3 ),
2 p 1 p
d
) |
D(p||p
as D(p||p )| p =p =
p =p = 0, and immediately obtain
dp
(1.3.36)
1.3 Shannons first coding theorem. The entropy rate of a Markov source
53
2
(p p) .
P i = kn $
exp
(1.3.37)
2p(1 p)
2 np(1 p)
i=1
Worked Example 1.3.17 At each time unit a device reads the current version
of a string of N characters each of which may be either 0 or 1. It then transmits
the number of characters which are equal to 1. Between each reading the string is
perturbed by changing one of the characters at random (from 0 to 1 or vice versa,
with each character being equally likely to be changed). Determine an expression
for the information rate of this source.
Solution The source is Markov, with the state space
probability matrix
0
1
0
0
1/N
0
(N
1)/N
0
2/N
0
(N 2)/N
0
...
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0 1/N
1
0
1 N1 N
N
j j log j .
N j=1
The distribution of the first symbol is equiprobable. Find the information rate of
the source. Does the result contradict Shannons FCT?
54
How does the answer change if m is odd? How can you use, for m odd, Shannons
FCT to derive the information rate of the above source?
Solution For m even, the DTMC is reducible: there are two communicating classes,
I1 = {0, 2, . . . , m} with m/2 + 1 states, and I2 = {1, 3, . . . , m 1} with m/2 states.
Correspondingly, for any set An of n-strings,
P(An ) = qP1 (An1 ) + (1 q)P2 (An2 ),
(1.3.38)
(1.3.39)
Both DTMCs are irreducible and aperiodic on their communicating classes and
their invariant distributions are uniform:
(1)
2
2
(2)
, i I1 , i = , i I2 .
m+2
m
8
8
and H (2) = log 3
.
3(m + 2)
3m
(1.3.40)
As follows from (1.3.38), the information rate of the whole DTMC equals
% (1)
H = max [H (1) , H (2) ], if 0 < q 1,
Hodd =
(1.3.41)
H (2) , if q = 0.
For 0 < q < 1 Shannons FCT is not applicable:
1
1
P1
P2
H (1) whereas log pn2 (U(n) )
H (2) ,
log pn1 (U(n) )
n
n
i.e. n converges to a non-constant limit. However, if q(1 q) = 0, then (1.3.41)
is reduced to a single line, and Shannons FCT is applicable: n converges to the
corresponding constant H (i) .
If m is odd, again there are two communicating classes, I1 = {0, 2, . . . , m 1}
and I2 = {1, 3, . . . , m}, each of which now contains (m + 1)/2 states. As before,
1.3 Shannons first coding theorem. The entropy rate of a Markov source
55
DTMCs P1 and P2 are irreducible and aperiodic and their invariant distributions
are uniform:
2
2
(1)
(2)
, i I1 , i =
, i I2 .
i =
m+1
m+1
Their common information rate equals
Hodd = log 3
8
,
3(m + 1)
(1.3.42)
which also gives the information rate of the whole DTMC. It agrees with Shannons
FCT, because now
1
P
n = log pn (U(n) ) Hodd .
(1.3.43)
n
Worked Example 1.3.19 Let a be the size of A and b the size of the alphabet B.
Consider a source with letters chosen from an alphabet A + B, with the constraint
that no two letters of A should ever occur consecutively.
(a) Suppose the message follows a DTMC, all characters which are permitted at a
given place being equally likely. Show that this source has information rate
H=
a log b + (a + b) log(a + b)
.
2a + b
(1.3.44)
(b) By solving a recurrence relation, or otherwise, find how many strings of length
n satisfy the constraint that no two letters of A occur consecutively. Suppose
these strings are equally likely and let n . Show that the limiting information rate becomes
b + b2 + 4ab
.
H = log
2
if x, y {1, . . . , a},
0,
P(x, y) = 1/b,
if x {1, . . . , a}, y {a + 1, . . . , a + b},
56
the detailed balance equations (DBEs) (x)P(x, y) = (y)P(y, x) (cf. PSE II, p. 82),
which yields
'
1 (2a + b),
x {1, . . . , a},
(x) =
(a + b) [b(2a + b)], x {a + 1, . . . , a + b}.
The DBEs imply that is invariant: (y) = (x)P(x, y), but not vice versa. Thus,
x
we obtain (1.3.44).
(b) Let Mn denote the number of allowed n-strings, An the number of allowed
n-strings ending with a letter from A, and Bn the number of allowed n-strings
ending with a letter from B. Then
Mn = An + Bn , An+1 = aBn , and Bn+1 = b(An + Bn ),
which yields
Bn+1 = bBn + abBn1 .
The last recursion is solved by
Bn = c+ +n + c n ,
where are the eigenvalues of the matrix
0 ab
,
1 b
i.e.
b b2 + 4ab
,
=
2
and c are constants, c+ > 0. Hence,
n1
n1
Mn = a c
+ c+ +n + c n
+ + + c
n1
n
1
+1
,
= +n c a n + n + c+ a
+
+
+
1
log Mn is represented as the sum
n
n1 n
1
1
.
+1
log + + log c a n + n + c+ a
n
+
+
+
Note that < 1. Thus, the limiting information rate equals
+
and
1
log Mn = log + .
n n
lim
1.3 Shannons first coding theorem. The entropy rate of a Markov source
57
The answers are different since the conditional equidistribution results in a strong
dependence between subsequent letters: they do not form a DTMC.
Worked Example 1.3.20 Let {U j : j = 1, 2, . . .} be an irreducible and aperiodic
DTMC with a finite state space. Given n 1 and (0, 1), order the
strings u(n)
(n)
(n)
according to their probabilities P(U(n) = u1 ) P(U(n) = u2 ) and select
them in this order until the probability of the remaining set becomes 1 . Let
1
Mn ( ) denote the number of the selected strings. Prove that lim log Mn ( ) = H ,
n n
the information rate of the source,
(a) in the case where the rows of the transition probability matrix P are all equal
(i.e. {U j } is a Bernoulli sequence),
(b) in the case where the rows of P are permutations of each other, and in a general
case. Comment on the significance of this result for coding theory.
Solution (a) Let P stand for the probability distribution of the IID sequence (Un )
m
and set H = p j log p j (the binary entropy of the source). Fix > 0 and partij=1
2n(H+ ) Mn ( ) P Mn ( ) P(K+ ) + 2n(H ) Mn ( ).
Excluding P Mn ( ) yields
2n(H+ ) Mn ( ) + 2n(H ) and 2n(H ) Mn ( ) P(K+ ).
58
1 2 3
4 5 6 .
(1.3.45)
7 8 9
Find the entropy rate for a rook, bishop (both kinds), queen and king.
59
Solution We consider the kings DTMC only; other cases are similar. The transition
probability matrix is
0 1/3 0
0 1/3 1/3 0
0
0
1/5 1/5 0
0 1/5 0 1/5 1/5 0
1/8
1/8
1/8
1/8
0
1/8
1/8
1/8
1/8
0
0
1/3
1/3
0
0
0
1/3
0
0
0 1/5 1/5 1/5 1/5 0 1/5
0
0
0
0 1/3 1/3 0 1/3 0
By symmetry the invariant distribution is 1 = 3 = 9 = 7 = , 4 = 2 = 6 =
8 = , 5 = , and by the DBEs
/3 = /5, /3 = /8, 4 + 4 + = 1
implies =
3
40 ,
= 18 , = 15 . Now
1
1
1
1
3
1
1
1
log 15 + .
H = 4 log 4 log log =
3
3
5
5
8
8 10
40
60
61
In other words, strings of length N sent to the channel will be codewords representing source messages of a shorter length n. The maximal ratio n/N which still
allows the receiver to recover the original message is an important characteristic of
the channel, called the capacity. As we will see, passing from a = 1 to a = 1
changes the capacity from 1 (no encoding needed) to 1/2 (where the codewordlength is twice as long as the length of the source message).
So, we need to introduce a decoding rule fN : J N I n such that the overall
probability of error (= ( fn , fN , P)) defined by
= P fN (Y(N) ) = u(n) , u(n) emitted
u(n)
= P U(n) = u(n) Pch fN (Y(N) ) = u(n) | fn (u(n) ) sent
(1.4.3)
u(n)
is small. We will try (and under certain conditions succeed) to have the errorprobability (1.4.3) tending to zero as n .
The idea which is behind the construction is based on the following facts:
(1) For a source with the asymptotic equipartition property the number of distinct n-strings emitted is 2n(H+o(1)) where H log m is the information rate of
the source. Therefore, we have to encode not mn = 2n log m messages, but only
2n(H+o(1)) which may be considerably less. That is, the code fn may be defined
on a subset of I n only, with the codeword-length N = nH.
1
(2) We may try even a larger N: N = R nH, where R is a constant with 0 < R <
1. In other words, the increasing length of the codewords used from nH to
1
R nH will allow us to introduce a redundancy in the code fn , and we may
hope to be able to use this redundancy to diminish the overall error-probability
(1.4.3) (provided that in addition a decoding rule is good). It is of course
1
desirable to minimise R , i.e. maximise R: it will give the codes with optimal
parameters. The question of how large R is allowed to be depends of course on
the channel.
It is instrumental to introduce a notational convention. As the codeword-length is
1
a crucial parameter, we write N instead of R Hn and RN instead of Hn: the number of distinct strings emitted by the source becomes 2N(R+o(1)) . In future, the index
NR
n
will be omitted wherever possible (and replaced by N otherwise). It is conH
venient to consider a typical set UN of distinct strings emitted by the source, with
UN = 2N(R+o(1)) . Formally, UN can include strings of different length; it is only
the log-asymptotics of UN that matter. Accordingly, we will omit the superscript
(n) in the notation u(n) .
62
log UN = R,
N N
J N , and a sequence
N UN
uUN Y(N)
C = sup R (0, 1) : R is a reliable transmission rate .
(1.4.5)
Definition 1.4.3
(1.4.6)
Remark 1.4.4
X xXN
Accordingly, it makes sense to write = ave and speak about the average probability of error. Another form is the maximum error-probability
max = max 1 Pch fN (Y(N) ) = x|x sent : x XN ;
obviously, ave max . In this section we work with ave 0 leaving the question
of whether max 0. However, in Section 2.2 we reduce the problem of assessing
max to that with ave , and as a result, the formulas for the channel capacity deduced
in this section will remain valid if ave is replaced by max .
Remark 1.4.5
given by
63
(1.4.7)
pXk
Here, I(Xk : Yk ) is the mutual information between a single pair of input and output
letters Xk and Yk (the index k may be omitted), with the joint distribution
P(X = x,Y = y) = pX (x)P(y|x), x, y = 0, 1,
(1.4.8)
where pX (x) = P(X = x). The supremum in (1.4.7) is over all possible distributions
pX = (pX (0), pX (1)). A useful formula is I(X : Y ) = h(Y ) h(Y |X) (see (1.3.12)).
In fact, for the MBSC
h(Y |X) =
x=0,1
pX (x)
y=0,1
(1.4.9)
y=0,1
(1.4.10)
pX
But sup h(Y ) is equal to log 2 = 1: it is attained at pX (0) = pX (1) = 1/2, and
pX
pY (0) = pY (1) = 1/2(p + 1 p) = 1/2. Therefore, for an MBSC, with the row
error-probability p,
C = 1 (p).
(1.4.11)
64
In fact, Shannons ideas have not been easily accepted by leading contemporary
mathematicians. It would be interesting to see the opinions of the leading scientists
who could be considered as creators of information theory.
Theorem 1.4.6 Fix a channel (i.e. conditional probabilities Pch in (1.4.1)) and a
set U of the source strings and denote by (P) the overall error-probability (1.4.3)
for U(n) having a probability distribution P over U , minimised over all encoding
and decoding rules. Then
(P) (P0 ),
(1.4.12)
(u) :=
y: f(y)=u
(= (P, f , f)) =
P(u) (u).
uU
1
U
uU
such
that ( ) . In fact, take a random permutation, , equidistributed among
all U ! permutations of degree U . Then
min ( ) E () = E P(u) (u)
uU
uU
1
(u) = .
U uU
Hence, given any f and f, we can find new encoding and decoding rules with
overall error-probability (P0 , f , f). Minimising over f and f leads to (1.4.12).
65
Let h(X | Y ) denote the conditional entropy of X given Y Define the ideal observer
decoding rule as a map f IO : J I such that P( f (y) | y) = maxxI P(x | y) for all
y J . Show that
(a) under this rule the error-probability
1
satisfies erIO (y) h(P( | y));
2
1
(b) the expected value of the error-probability obeys EerIO (Y ) h(X | Y ).
2
Solution Indeed, (a) follows from (iii) in Worked Example 1.2.7, as
IO
err
= 1 P f (y) | y = 1 Pmax ( |y),
1
h(P( |y)). Finally, (b) follows from (a) by taking
2
expectations, as h(X|Y ) = Eh(P( |Y )).
which is less than or equal to
As was noted before, a general decoding rule (or a decoder) is a map fN : J N
UN ; in the case of a lossless encoding rule fN , fN is a map J N XN . Here X
is a set of codewords. Sometimes it is convenient to identify the decoding rule by
(N)
(N)
fixing, for each codeword x(N) , a set A(x(N) ) J N , so that A(x1 ) and A(x2 ) are
(N)
(N)
disjoint for x1 = x2 , and the union x(N) XN A(x(N) ) gives the whole J N . Given
that y(N) A(x(N) ), we decode it as fN (y(N) ) = x(N) .
Although in the definition of the channel capacity we assume that the source
messages are equidistributed (as was mentioned, it gives the worst case in the
sense of Theorem 1.4.6), in reality of course the source does not always follow
this assumption. To this end, we need to distinguish between two situations: (i) the
receiver knows the probabilities
p(u) = P(U = u)
(1.4.13)
of the source strings (and hence the probability distribution pN (x(N) ) of the codewords x(N) XN ), and (ii) he does not know pN (x(N) ). Two natural decoding rules
are, respectively,
66
(i) the ideal observer (IO) rule decodes a received word y(N) by a codeword x(N)
that maximises the posterior probability
p (x(N) )P (y(N) |x(N) )
N
ch
,
(1.4.14)
P x(N) sent |y(N) received =
pY(N) (y(N) )
where
pY(N) (y(N) ) =
x(N) XN
and
(ii) the maximum likelihood (ML) rule decodes a received word y(N) by a codeword
(N)
x that maximises the prior probability
Pch (y(N) |x(N) ).
(1.4.15)
Theorem 1.4.8 Suppose that an encoding rule f is defined for all messages that
occur with positive probability and is one-to-one. Then:
(a) For any such encoding rule, the IO decoder minimises the overall errorprobability among all decoders.
(b) If the source message U is equiprobable on a set U , then for any encoding rule
f : U XN as above, the random codeword X(N) = f (U) is equiprobable on
XN , and the IO and ML decoders coincide.
Proof Again, for simplicity let us omit the upper index (N).
(a) Note that, given a received word y, the IO obviously maximises the joint
probability p(x)Pch (y|x) (the denominator in (1.4.14) is fixed when word y is
fixed). If we use an encoding rule f and decoding rule f, the overall errorprobability (see (1.4.3)) is
P(U = u)Pch f(y) = u| f (u) sent
u
= p(x) 1 f(y) = x Pch (y|x)
x
y
= 1 x = f(y) p(x)Pch (y|x)
y x
= p(x)Pch (y|x) p f(y) Pch y| f(y)
y x
y
= 1 p f (y) Pch y| f(y) .
y
It remains to note that each term in the sum p f(y) Pch y| f(y) is maximised
y
when f coincides with the IO rule. Hence, the whole sum is maximised, and the
overall error-probability minimised.
(b) The first statement is obvious, as, indeed is the second.
67
Assuming in the definition of the channel capacity that the source messages are
equidistributed, it is natural to explore further the ML decoder. While using the ML
decoder, an error can occur because either the decoder chooses a wrong codeword
x or an encoding rule f used is not one-to-one. The probability of this is assessed
in Theorem 1.4.8. For further simplification, we write P instead of Pch ; symbol P
is used mainly for the joint input/output distribution.
Lemma 1.4.9 If the source messages are equidistributed over a set U then, while
using the ML decoder and an encoding rule f , the overall error-probability satisfies
( f )
Proof
1
U
uU u U : u =u
P P Y| f (u ) P (Y| f (u)) |U = u .
(1.4.16)
(a) an error when P Y | f (u ) > P Y | f (u) for some u = u,
(b) possibly an error when P Y| f (u ) = P Y | f (u) for some u = u (this includes the case when f (u) = f (u )), and finally
(c) no error when P Y | f (u ) < P Y | f (u) for any u = u.
Thus, the probability is bounded as follows:
P error | U = u
P P Y | f (u ) P (Y | f (u)) for some u = u | U = u
1 u = u P P Y | f (u ) P (Y | f (u)) | U = u .
u U
Multiplying by
1
and summing up over u yields the result.
U
Remark 1.4.10
As was already noted, a random coding is a useful tool alongside with deterN
ministic encoding rules. A deterministic encoding rule
( is a map f : U) J ; if
U = r then f is given as a collection of codewords f (u1 ), . . . , f (ur ) or, equivalently, as a concatenated megastring (or codebook)
r
= {0, 1}Nr .
f (u1 ) . . . f (ur ) J N
Here, u1 , . . . , ur are the source strings (not letters!) constituting set U . If f is lossless then f (ui ) =
N r rule is a random eler f (u j ) whenever i = j. A random encoding
N
ment F of J
, with probabilities P(F = f ), f J
. Equivalently, F may
68
(1.4.17)
Theorem 1.4.11
(i) There exists a deterministic encoding rule f with ( f ) E .
E
(ii)
P (F) <
for any (0, 1).
1
Proof
For (ii), use the Chebyshev inequality (see PSE I, p. 75):
Part (i) is obvious.
E
1
E = 1 .
P (F)
1
E
Definition 1.4.12
69
(1.4.18)
Recall that I X(N) : Y(N) is the mutual entropy given by
h X(N) h X(N) |Y(N) = h Y(N) h Y(N) |X(N) .
Remark 1.4.13 A simple heuristic argument (which will be made rigorous in
Section 2.2) shows that the capacity of the channel cannot exceed the mutual
information between its input and output. Indeed, for each typical input Nsequence, there are
(N) |X(N) )
approximately 2h(Y
all of them equally likely. We will not be able to detect which sequence X was sent
unless no two X(N) sequences produce the same Y(N) output sequence. The total
(N)
number of typical Y(N) sequences is 2h(Y ) . This set has to be divided into subsets
(N) (N)
of size 2h(Y |X ) corresponding to the different input X(N) sequences. The total
number of disjoint sets is
(N) )h(Y(N) |X(N) )
2h(Y
(N) :Y(N)
= 2I (X
).
Hence, the total number of distinguishable signals of the length N could not be
(N) (N)
bigger than 2I (X :Y ) . Putting the same argument slightly differently, the number
(N)
(N) (N)
of typical sequences X(N) is 2Nh(X ) . However, there are only 2Nh(X ,Y ) jointly
typical sequences (X(N) , Y(N) ). So, the probability that any randomly chosen pair
(N) (N)
is jointly typical is about 2I (X :Y ) . So, the number of distinguished signals is
(N)
(N)
(N) (N)
bounded by 2h(X )+h(Y )h(X |Y ) .
Theorem 1.4.14
(1.4.19)
( f ) 1
CN + o(1)
.
R + o(1)
(1.4.20)
70
The assertion of the theorem immediately follows from (1.4.20) and the definition
of the channel capacity because
lim inf ( f ) 1
N
1
lim sup CN
R
N
( f ) = P(X(N) = xi , f(Y(N) ) = xi ).
(N)
(N)
i=1
( f )
C + o(1)
N(R + o(1)) NCN
= 1 N
.
R + o(1)
log 2N(R+o(1)) 1
Let p(X(N) , Y(N) ) be the random variable that assigns, to random words X(N) and
the joint probability of having these words at the input and output of a channel, respectively. Similarly, pX (X(N) ) and pY (Y(N) ) denote the random variables
that give the marginal probabilities of words X(N) and Y(N) , respectively.
Y(N) ,
71
Theorem 1.4.15 (Shannons SCT: direct part) Suppose we can find a constant
c (0, 1) such that for any R (0, c) and N 1 there exists a random coding
F(u1 ), . . . , F(ur ), where r = 2N(R+o(1)) , with IID codewords F(ui ) J N , such
that the (random) input/output mutual information
N :=
p(X(N) , Y(N) )
1
log
N
pX (X(N) )pY (Y(N) )
(1.4.21)
(1.4.22)
So, if the LHS and RHS sides of (1.4.22) coincide, then their common value gives
the channel capacity.
Next, we use Shannons SCT for calculating the capacity of an MBC. Recall (cf.
(1.4.2)), for an MBC,
N
(1.4.23)
P y(N) |x(N) = P(yi |xi ).
i=1
72
Theorem 1.4.17
For an MBC,
I X(N) : Y(N)
I(X j : Y j ),
(1.4.24)
j=1
I X(N) : Y(N) = h Y(N) h Y(N) |X(N)
= h Y(N) h(Y j |X j )
1 jN
h(Y j ) h(Y j |X j ) = I(X j : Y j ).
j
The equality holds iff Y1 , . . . ,YN are independent. But Y1 , . . . ,YN are independent if
X1 , . . . , XN are.
Remark 1.4.18 Compare with inequalities (1.4.24) and (1.2.27). Note the opposite inequalities in the bounds.
Theorem 1.4.19
(1.4.25)
73
any r, i.e. for any R (even R > 1!).] For this random coding, the (random) mutual
entropy N equals
p X(N) , Y(N)
1
log (N) (N)
N
pY Y
pX X
N
p(X j ,Y j )
1 N
1
= j,
= log
N j=1
p (X j )pY (Y j ) N j=1
p(X j ,Y j )
.
j )pY (Y j )
The random variables j are IID, and
where j := log
p (X
E j = E log
p(X j ,Y j )
= Ip (X1 : Y1 ).
j )pY (Y j )
p (X
By the law of large numbers for IID random variables (see Theorem 1.3.5), for the
random coding as suggested,
P
Remark 1.4.20
(1.4.26)
Under what condition does he not strictly decrease the capacity? Equality in
(1.4.26) holds iff, under the distribution pX that maximises I(X : Y ), the random variables X and Y are conditionally independent given g(Y ). [For example,
g(y1 ) = g(y2 ) iff for any x, PX|Y (x|y1 ) = PX|Y (x|y2 ); that is, g glues together only
those values of y for which the conditional probability PX|Y ( |y) is the same.] For
an MBC, equality holds iff g is one-to-one, or p = P(1|0) = P(0|1) = 1/2.
74
(1.4.27)
(see (1.4.11)). The channel capacity is realised by a random coding with the IID
symbols Vl j taking values 0 and 1 with probability 1/2.
Worked Example 1.4.23
(a) Consider a memoryless channel with two input symbols A and B, and three
output symbols, A, B, . Suppose each input symbol is left intact with probability 1/2, and transformed into a with probability 1/2. Write down the channel
matrix and calculate the capacity.
(b) Now calculate the new capacity of the channel if the output is further processed
by someone who cannot distinguish A and , so that the matrix becomes
1
0
.
1/2 1/2
Solution (a) The channel has the matrix
1/2 0 1/2
0 1/2 1/2
and is symmetric (the rows are permutations of each other). So, h(Y |X = x) =
1
1
2 log = 1 does not depend on the value of x = A, B. Then h(Y |X) = 1, and
2
2
I(X : Y ) = h(Y ) 1.
(1.4.28)
If P(X = A) = then Y has the output distribution
1 1
1
, (1 ),
2 2
2
and h(Y |X) is maximised at = 1/2. Then the capacity equals
1
h(1/4, 1/4, 1/2) 1 = .
2
(1.4.29)
(b) Here, the channel is not symmetric. If P(X = A) = then the conditional
entropy is decomposed as
h(Y |X) = h(Y |X = A) + (1 )h(Y |X = B)
= 0 + (1 ) 1 = (1 ).
Then
h(Y ) =
75
1+ 1
1
1+
log
log
2
2
2
2
and
1+
1+ 1
1
log
log
1+
2
2
2
2
which is maximised at = 3/5, with the capacity given by
log 5 2 = 0.321928.
I(X : Y ) =
Our next goal is to prove the direct part of Shannons SCT (Theorem 1.4.15). As
was demonstrated earlier, the proof is based on two consecutive Worked Examples
below.
Worked Example 1.4.24 Let F be a random coding, independent of the source
string U, such that the codewords F(u1 ), . . . , F(ur ) are IID, with a probability distribution pF :
pF (v) = P(F(u) = v),
v (= v(N) ) J N .
Here, u j , j = 1, . . . , r, are source strings, and r = 2N(R+o(1)) . Define random codewords V1 , . . . , Vr1 by
if U = u j then Vi := F(ui )
and Vi := F(ui+1 )
(1.4.30)
Then U (the message string), X = F(U) (the random codeword) and V1 , . . . , Vr1
are independent words, and each of X, V1 , . . . , Vr1 has distribution pF .
Solution This is straightforward and follows from the formula for the joint probability,
P(U = u j , X = x, V1 = v1 , . . . , Vr1 = vr1 )
= P(U = u j ) pF (x) pF (v1 ) . . . pF (vr1 ).
(1.4.31)
Worked Example 1.4.25 Check that for the random coding as in Worked Example 1.4.24, for any > 0,
E = E (F) P(N ) + r2N .
(1.4.32)
Here,
the random variable N is defined in (1.4.21), with EN =
1 (N) (N)
I X :Y
.
N
76
Solution For given words x(= x(N) ) and y(= y(N) ) J N , denote
*
+
Sy (x) := x J N : P(y | x ) P(y | x) .
(1.4.33)
That is, Sy (x) includes all words the ML decoder may produce in the situation
where x was sent and y received. Set, for a given non-random encoding rule f
and a source string u, ( f , u, y) = 1 if f (u ) Sy ( f (u)) for some u = u, and
( f , u, y) = 0 otherwise. Clearly, ( f , u, y) equals
1 1 f (u ) Sy ( f (u))
u : u =u
= 1 1 1 f (u ) Sy ( f (u)) .
u :u =u
It is plain that, for all non-random encoding f , ( f ) E ( f , U, Y), and for all
random encoding F, E = E (F) E (F, U, Y). Furthermore, for the random
encoding as in Worked Example 1.4.24, the expected value E (F, U, Y) does not
exceed
r1
= pX (x) P(y|x)
E 1 1 1 Vi SY (X)
x
y
i=1
r1
E 1 1 1 Vi SY (X) |X = x, Y = y ,
i=1
pX (x) P(y|x) 1 E 1 1{Vi Sy (x)} .
x
r1
i=1
Furthermore, due to the IID property (as explained in Worked Example 1.4.24),
r1
r1
E
1
{V S (x)} = (1 Qy (x)) ,
i
i=1
where
Qy (x) := 1 x Sy (x) pX (x ).
x
p(x, y)
1
log
>
N
pX (x)pY (y)
r2
(1 Qy (x)) j Qy (x).
j=0
(1.4.34)
77
when (x, y) T.
(1.4.35)
r1
1 (1 Qy (x))r = (1 Qy (x)) j Qy (x) (r 1)Qy (x),
j=1
this yields
E P (X,Y ) T + (r 1)
pX (x)P(y|x)Qy (x).
(1.4.36)
(x,y)T
P (X,Y ) T = P N .
(1.4.37)
Qy (x) 2N .
(1.4.38)
condition N c. Therefore, the random coding F gives the expected error probability that vanishes as N .
By Theorem 1.4.11(i), for any N 1 there exists a deterministic encoding f = fN
such that, for R = c 2 , lim ( f ) = 0. Hence, R is a reliable transmission rate.
N
78
Theorems 1.4.17 and 1.4.19 may be extended to the case of a memoryless channel with an arbitrary (finite) output alphabet, Jq = {0, . . . , q 1}. That is, at the
input of the channel we now have a word Y(N) = Y1 . . . YN where each Y j takes a
(random) value from Jq . The memoryless property means, as before, that
N
Pch y(N) |x(N) = P(yi | xi ),
(1.4.39)
i=1
(1.4.40)
where (p0 , . . . , pq1 ) is a row of the channel matrix. The equality is realised in the
case of a double-symmetric channel, and the maximising random coding has IID
symbols Vi taking values from Jq with probability 1/q.
Proof The proof is carried out as in the binary case, by using the fact that I(X1 :
Y1 ) = h(Y1 ) h(Y1 |X1 ) log q h(Y1 |X1 ). But in the symmetric case
h(Y1 | X1 ) = P(X1 = x)P(y | x) log P(y | x)
x,y
(1.4.41)
If, in addition, the columns of the channel matrix are permutations of each other,
then h(Y1 ) attains log q. Indeed, take a random coding as suggested. Then P(Y = y)
q1
1
= P(X1 = x)P(y|x) = P(y|x). The sum P(y|x) is along a column of the
q x
x
x=0
channel matrix, and it does not depend on y. Hence, P(Y = y) does not depend on
y Iq , which means equidistribution.
Remark 1.4.27 (a) In the random coding F used in Worked Examples 1.4.24 and
1.4.25 and Theorems 1.4.6, 1.4.15 and 1.4.17, the expected error-probability E 0
with N . This guarantees not only the existence of a good non-random coding
for which the error-probability E vanishes as N (see Theorem 1.4.11(i)), but
also that almost all codes are asymptotically good. In fact, by Theorem 1.4.11(ii),
79
C ( p)
p
1
Figure 1.8
1
1
;
the rows are permutations of each other, and hence have equal entropies. Therefore,
the conditional entropy h(Y |X) equals
h(1 , , ) = (1 ) log(1 ) log log ,
which does not depend on the distribution of the input symbol X.
Thus, I(X : Y ) is maximised when h(Y ) is. If pY (0) = p and pY (1) = q, then
h(Y ) = log p log p q log q,
80
(1 ) log
so the further processing of the second channel can only reduce the mutual
information.
The independence of the channels means that given Y , the random variables
X and Z are conditionally independent. Deduce that
h(X, Z|Y ) = h(X|Y ) + h(Z|Y )
and
h(X,Y, Z) + h(Z) = h(X, Z) + h(Y, Z).
Define I(X : Z|Y ) as h(X|Y ) + h(Z|Y ) h(X, Z|Y ) and show that
I(X : Z|Y ) = I(X : Y ) I(X : Z).
. . .
. . .
81
. . .
Figure 1.9
The equality holds iff X and Y are conditionally independent given Z, e.g. if the
second channel is error-free (Y, Z) Z is one-to-one, or the first channel is fully
noisy, i.e. X and Y are independent.
(b) The rows of the channel matrix are permutations of each other. Hence h(Y |X) =
h(p0 , . . . , pr1 ) does not depend on pX . The quantity h(Y ) is maximised when
pX (i) = 1/r, which gives
C = log r h(p0 , . . . , pr1 ).
82
then C is given by 2C = 2C1 + 2C2 . To what mode of operation does this correspond?
Solution (a)
X
channel 1
channel 2
Hence,
C = sup I(X : Z) sup I(X : Y ) = C1
pX
pX
and similarly
C sup I(Y : Z) = C2 ,
pY
i.e. C min[C1 ,C2 ]. A strict inequality may occur: take (0, 1/2) and the
matrices
1
1
, ch 2
,
ch 1
1
1
and
1
ch [1 + 2]
2
(1 )2 + 2
2 (1 )
2 (1 )
(1 )2 + 2
.
83
C = 1 h 2 (1 ), 1 2 (1 ) < Ci
channel 1
Y1
X2
channel 2
Y2
But
I (X1 , X2 ) : (Y1 ,Y2 ) = h(Y1 ,Y2 ) h Y1 ,Y2 |X1 , X2
h(Y1 ) + h(Y2 ) h(Y1 |X1 ) h(Y2 |X2 )
= I(X1 : Y1 ) + I(X2 : Y2 );
equality applies iff X1 and X2 are independent. Thus, C = C1 + C2 and the maximising p(X1 ,X2 ) is pX1 pX2 where pX1 and pX2 are maximisers for I(X1 : Y1 ) and
I(X2 : Y2 ).
(c)
channel 1
Y1
channel 2
Y2
X
Here,
C = sup I X : (Y1 : Y2 )
pX
and
I (Y1 : Y2 ) : X = h(X) h X|Y1 ,Y2
h(X) min h(X|Y j ) = min I(X : Y j ).
j=1,2
j=1,2
84
Thus, C max[C1 ,C2 ]. A strict inequality may occur: take an example from part
(a). Here, Ci = 1 h( , 1 ). Also,
I (Y1 ,Y2 ) : X = h(Y1 ,Y2 ) h Y1 ,Y2 |X
= h(Y1 ,Y2 ) h(Y1 |X) h(Y2 |X)
= h(Y1 ,Y2 ) 2h( , 1 ).
If we set pX (0) = pX (1) = 1/2 then
(Y1 ,Y2 ) = (0, 0) with probability (1 )2 + 2 2,
(Y1 ,Y2 ) = (1, 1) with probability (1 )2 + 2 2,
(Y1 ,Y2 ) = (1, 0) with probability (1 ),
(Y1 ,Y2 ) = (0, 1) with probability (1 ),
with
h(Y1 ,Y2 ) = 1 + h 2 (1 ), 1 2 (1 ) ,
and
I (Y1 ,Y2 ) : X = 1 + h 2 (1 ), 1 2 (1 ) 2h( , 1 )
> 1 h( , 1 ) = Ci .
Hence, C > Ci , i = 1, 2.
(d)
X1
channel 1
Y1
channel 2
Y2
X : X1 or X2
X2
The difference with part (c) is that every second only one symbol is sent, either
to channel 1 or 2. If we fix probabilities and 1 that a given symbol is sent
through a particular channel then
I(X : Y ) = h( , 1 ) + I(X1 : Y1 ) + (1 )I(X2 : Y2 ).
(1.4.42)
85
and
h(Y |X) = pX1 ,Y1 (x, y) log pY1 |X1 (y|x)
x,y
0 1
h( , 1 ) + C1 + (1 )C2 ;
pq
, 1 k N.
N
(1p)/p 1
1
1
1
1+
, 1.
q = min
p
Np 1 p
86
1
The condition q = 1 is equivalent to log N log(1 p), i.e.
p
1
N
.
(1 p)1/p
p(x)dx
A
Rn dxp(x) = 1.
The
(1.5.1)
under the assumption that the integral is absolutely convergent. As in the discrete
case, hdiff (X) may be considered as a functional of the density p : x Rn R+ =
[0, ). The difference is however that hdiff (X) may be negative, e.g. for a uniform
1
distribution on [0, a], hdiff (X) = 0a dx(1/a) log(1/a) = log a < 0 for a < 1. [We
write x instead of x when x R.] The relative, joint and conditional differential
entropy are defined similarly to the discrete case:
hdiff (X||Y ) = Ddiff (p||p ) =
hdiff (X,Y ) =
p(x) log
p (x)
dx,
p(x)
(1.5.2)
(1.5.3)
(1.5.4)
again under the assumption that the integrals are absolutely convergent. Here, pX,Y
is the joint probability density and pX|Y the conditional density (the PDF of the
conditional distribution). Henceforth we will omit the subscript diff when it is clear
what entropy is being addressed. The assertions of Theorems 1.2.3(b),(c), 1.2.12,
and 1.2.18 are carried through for the differential entropies: the proofs are completely similar and will not be repeated.
Remark 1.5.2
n (= n (x)) equals 0 or 1. For most of the numbers x the series is not reduced to a
finite sum (that is, there are infinitely many n such that n = 1; the formal statement
87
is that the (Lebesgue) measure of the set of numbers x (0, 1) with infinitely many
n (x) = 1 equals one). Thus, if we want to encode x by means of binary digits we
would need, typically, a codeword of an infinite length. In other words, a typical
value for a uniform random variable X with 0 X 1 requires infinitely many bits
for its exact description. It is easy to make a similar conclusion in a general case
when X has a PDF fX (x).
However, if we wish to represent the outcome of the random variable X with
an accuracy of first n binary digits then we need, on average, n + h(X) bits where
h(X) is the differential entropy of X. Differential entropies can be both positive
and negative, and can even be . Since h(X) can be of either sign, n + h(X) can
be greater or less than n. In the discrete case the entropy is both shift and scale
invariant since it depends only on probabilities p1 , . . . , pm , not on the values of the
random variable. However, the differential entropy is shift but not scale invariant
as is evident from the identity (cf. Theorem 1.5.7)
h(aX + b) = h(X) + log |a|.
However, the relative entropy, i.e. KullbackLeibler distance D(p||q), is scale
invariant.
Worked Example 1.5.3
Consider a PDF on 0 x e1 ,
fr (x) = Cr
1
, 0 < r < 1.
x( ln x)r+1
1
dx =
x( ln x)r+1
0
1
1
yr+1
1
dy = .
r
Hence,
h(X) =
x( ln x)
dx =
r+1
0
0
zerz dz =
1
.
r2
fr (x) ln fr (x)dx
= fr (x) ln r + ln x + (r + 1) ln( ln x) dx
0 e1
r
ln( ln x)
r(r + 1)
= ln r
dx,
x( ln x)r
x( ln x)r+1
0
0
so that for 0 < r < 1, the second term is infinite, and two others are finite.
88
1
log (2 e)d detC .
2
(1.5.5)
exp
,C
(x
)
, x Rd .
1/2
2
(2 )d detC
Then h(X) takes the form
log e
1
d
1
p(x) log (2 ) detC
x ,C (x ) dx
2
Rd
2
log e
1
=
E (xi i )(x j j ) C1 i j + log (2 )d detC
2
2
i, j
1
log e 1
=
C i j E(xi i )(x j j ) + log (2 )d detC
2 i, j
2
log e 1
1
=
C i j C ji + log (2 )d detC
2 i, j
2
1
d log e 1
+ log (2 )d detC = log (2 e)d detC .
=
2
2
2
0
1
log (2 e)d detC ,
2
(1.5.6)
89
p(x) log
p(x)
dx
p0 (x)
(a) Show that the exponential density maximises the differential entropy among
the PDFs on [0, ) with given mean, and the normal density maximises the
differential entropy among the PDFs on R with a given variance.
1
1
log 2 e(Var X + ) .
2
12
Solution (a) For the Gaussian case, see Theorem 1.5.5. In the exponential
case, by the Gibbs inequality, for any random variable Y with PDF f (y),
1
f (y) log f (y)e y / dy 0 or
h(Y ) ( EY log e log ) = h(Exp( )),
with equality iff Y Exp( ), = (EY )1 .
(b) Let X0 be a discrete random variable with P(X0 = i) = pi , i = 1, 2, . . ., and the
random variable U be independent of X0 and uniform on [0, 1]. Set X = X0 + U.
For a normal random variable Y with Var X = VarY ,
hdiff (X) hdiff (Y ) =
1
1
1
log 2 eVar Y = log 2 e(Var X + ) .
2
2
12
90
The value of EX is not essential for h(X) as the following theorem shows.
Theorem 1.5.7
(a) The differential entropy is not changed under the shift: for all y Rd ,
h(X + y) = h(X).
(b) The differential entropy changes additively under multiplication:
h(aX) = h(X) + log |a|,
for all a R.
(1.5.7)
p1 (x)(x, y)
p2 (x)(x, y)
x
p1 (x)
.
= p1 (x)(x, y) log
p2 (x)
x
p1 (x)(x, y) log
p1 (w)(w, y)
w
p2 (z)(z, y)
z
p1 (x)
= D(p1 ||p2 ).
p1 (x)(x, y) log
p
2 (x)
x y
In the continuous case a similar inequality holds if we replace summation by
integration.
91
The concept of differential entropy has proved to be useful in a great variety of situations, very often quite unexpectedly. We consider here inequalities for
determinants and ratios of determinants of positive definite matrices (cf. [39], [36]).
Recall that the covariance matrix C = (Ci j ) of a random vector X = (X1 , . . . , Xd )
is positive definite, i.e. for any complex vector y = (y1 , . . . , yd ), the scalar product
(y,Cy) = Ci j yi y j is written as
i, j
2
E(X
)(X
)y
y
=
E
(X
)y
i i j j i j i i i 0.
i, j
i
Conversely, for any positive definite matrix C there exists a PDF for which C is a
covariance matrix, e.g. a multivariate normal distribution (if C is not strictly positive definite, the distribution is degenerate).
Worked Example 1.5.9
Solution Take two positive definite matrices C(0) and C(1) and [0, 1]. Let X(0)
and X(1) be two multivariate normal vectors, X(i) N(0,C(i) ). Set, as in the proof
of Theorem 1.2.18, X = X() , where the random variable takes two values, 0 and
1, with probabilities and 1 , respectively, and is independent of X(0) and X(1) .
Then the random variable X has covariance C = C(0) + (1 )C(1) , although X
need not be normal. Thus,
1
1
log 2 e)d + log det C(0) + (1 )C(1)
2
2
1
= log (2 e)d detC h(X) (by Theorem 1.5.5)
2
h(X|) (by Theorem 1.2.11)
1
This property is often called the Ky Fan inequality and was proved initially in
1950 by using much more involved methods. Another famous inequality is due to
Hadamard:
Worked Example 1.5.10
(1.5.8)
92
(1.5.9)
where X and Y are independent normal random variables with h(X) = h(X ) and
h(Y ) = h(Y ).
In the d-dimensional case the entropypower inequality is as follows.
For two independent random variables X and Y with PDFs fX (x) and fY (x), x Rd ,
e2h(X+Y )/d e2h(X)/d + e2h(Y )/d .
(1.5.10)
It is easy to see that for d = 1 (1.5.9) and (1.5.10) are equivalent. In general,
inequality (1.5.9) implies (1.5.10) via (1.5.13) below which can be established
independently. Note that inequality (1.5.10) may be true or false for discrete random variables. Consider the following example: let X Y be independent with
PX (0) = 1/6, PX (1) = 2/3, PX (2) = 1/6. Then
16
18
2
h(X) = h(Y ) = ln 6 ln 4, h(X +Y ) = ln 36 ln 8 ln 18.
3
36
36
By inspection, e2h(X+Y ) = e2h(X) + e2h(Y ) . If X and Y are non-random constants
then h(X) = h(Y ) = h(X +Y ) = 0, and the EPI is obviously violated. We conclude
93
(BrunnMinkowski)
(1.5.11)
(b) The volume of the set sum of two sets A1 and A2 is greater than the volume
of the set sum of two balls B1 and B2 with the same volume as A1 and A2 ,
respectively:
V (A1 + A2 ) V (B1 + B2 ),
(1.5.12)
where B1 and B2 are spheres with V (A1 ) = V (B1 ) and V (A2 ) = V (B2 ).
Worked Example 1.5.13
(1.5.13)
(1.5.14)
94
|Cn |
= lim |Cn |1/n .
|Cn1 | n
(1.5.15)
h(X1 , X2 , . . . Xn )
1 n
= lim h(Xk |Xk1 , . . . , X1 )
n
n n
n
i=1
= lim h(Xn |Xn1 , . . . , X1 ).
lim
But, more interestingly, the following intermediate inequality holds true. Let
X1 , X2 , . . . , Xn+1 be IID square-integrable random variables. Then
Xi /d
1 n+1 2h i
2h X1 ++Xn /d
e =j
.
(1.5.16)
e
n j=1
As was established, the differential entropy is maximised by a Gaussian distribution, under the constraint that the variance of the random variable under consideration is bounded from above. We will state without proof the following important
result showing that the entropy increases on every summation step in the central
limit theorem.
95
h
h
.
(1.5.17)
n
n+1
Solution Introduce = n0 n , the set of all strings with digits from . We send
a message x1 x2 . . . xn 1 as the concatenation f (x1 ) f (x2 ) . . . f (xn ) 2 , i.e. f
extends to a function f : 1 2 . We say a code is decipherable if f is injective.
Krafts inequality states that a prefix-free code f : 1 2 with codewordlengths s1 , . . . , sm exists iff
m
qs
1.
(1.6.1)
i=1
i=1
i=1
(1.6.2)
96
In the example, the three extra codewords must be 00, 01, 10 (we cannot take
11, as then a sequence of ten 1s is not decodable). Reversing the order in every
codeword gives a prefix-free code. But prefix-free codes are decipherable. Hence,
the code is decipherable.
In conclusion, we present an alternative proof of necessity of Krafts inequality. Denote s = max si ; let us agree to extend any word in X to the length s,
say by adding some fixed symbol. If x = x1 x2 . . . xsi X , then any word of the
form x1 x2 . . . xsi ysi +1 . . . ys X because x is a prefix. But there are at most qssi of
such words. Summing up on i, we obtain that the total number of excluded words
ssi . But it cannot exceed the total number of words qs . Hence, (1.6.1)
is m
i=1 q
follows:
m
qs qsi qs .
i=1
Problem 1.2
Consider an alphabet with m letters each of which appears with
probability 1/m. A binary Huffman code is used to encode the letters, in order to
minimise the expected codeword-length (s1 + + sm )/m where si is the length of
a codeword assigned to letter i. Set s = max[si : 1 i m], and let n be the number
of codewords of length .
(a) Show that 2 ns m.
(b) For what values of m is ns = m?
(c) Determine s in terms of m.
(d) Prove that ns1 + ns = m, i.e. any two codeword-lengths differ by at most 1.
(e) Determine ns1 and ns .
(f) Describe the codeword-lengths for an idealised model of English (with m =
27) where all the symbols are equiprobable.
(g) Let now a binary Huffman code be used for encoding symbols 1, . . . , m occurring with probabilities p1 pm > 0 where p j = 1. Let s1 be the length
1 jm
c
b
i
2
97
Figure 1.10
Bound ns m is obvious. (From what is said below it will follow that ns is always
even.)
(b) ns = m means all codewords are of equal length. This, obviously, happens iff
m = 2k , in which case s = k (a perfect binary tree Tk with 2k leaves).
(c) In general,
'
s=
log m,
if m = 2k ,
log m, if m = 2k .
The case m = 2k was discussed in (b), so let us assume that m = 2k . Then 2k < m <
2k+1 where k = log m. This is clear from the observation that the binary tree for
probabilities 1/m (we will call it a binary m-tree Bm ) contains the perfect binary
tree Tk but is contained in Tk+1 . Hence, s is as above.
(d) Indeed, in the case of an equidistribution 1/m, . . ., 1/m it is impossible to have
a branch of the tree whose length differs from the maximal value s by two or more.
In fact, suppose there is such a branch, Bi , of the binary tree leading to some letter i
and choose a branch M j of maximal length s leading to a letter j. In a conventional
terminology, letter j was engaged in s merges and i in t s 2 merges. Ultimately,
the branches Bi and M j must merge, and this creates a contradiction. For example,
the least controversial picture is still illegal; see Figure 1.10. Here, vertex i
carrying probability 1/m should have been joined with vertex a or b carrying each
probability 2/m, instead of joining a and b (as in the figure), as it creates vertex c
carrying probability 4/m.
(e) We conclude that (i) for m = 2k , the m-tree Bm coincides with Tk , (ii) for m = 2k
we obtain Bm in the following way. First, take a binary tree Tk where k = [log m],
with 1 m 2k < 2k . Then m 2k leaves of Tk are allowed to branch one step
98
k+1 _
_ k
2 (m 2 )
Figure 1.11
further: this generates 2(m 2k ) = 2m 2k+1 leaves of tree Tk+1 . The remaining
2k (m 2k ) = 2k+1 m leaves of Tk are left intact. See Figure 1.11. So,
ns1 = 2k+1 m,
(f) In the example of English, with equidistribution among m = 27 = 16 + 11 symbols, we have 5 codewords of length 4 and 22 codewords of length 5. The average
codeword-length is
5 4 + 22 5 130
=
4.8.
27
27
3
4
(g) The minimal value for s1 is 1 (obviously). The maximal value is log2 m ,
i.e. the positive integer l with 2l < m 2l+1 . The maximal value for sm is m
1 (obviously). The minimal value is log2 m, i.e. the natural l such that 2l1 <
m 2l .
The tree that yields s1 = 1 and sm = m 1 is given in Figure 1.12.
It is characterised by
i
1
2
..
.
f (i)
0
10
..
.
si
1
2
..
.
m1
m
11. . . 10
11. . . 11
m1
m1
99
m1 m
Figure 1.12
m = 16
Figure 1.13
State Shannons second coding theorem (SCT) and use it to compute the capacity
of this channel.
100
m = 18
Figure 1.14
where the random variables X and Y represent an input and the corresponding
output.
The binary erasure channel keeps an input letter 0 or 1 intact with probability
1 p and turns it to a splodge with probability p. An input random variable X is
0 with probability and 1 with probability 1 . Then the output random variable
Y takes three values:
P(Y = 0) = (1 p) ,
P(Y = 1) = (1 p)(1 ),
P(Y = ) = p.
Thus, conditional on the value of Y , we have
h(X|Y = 0) = 0,
h(X|Y = ) = h( ),
Therefore,
capacity = max I(X : Y )
= max [h(X) h(X|Y )]
= max [h( ) ph( )]
= (1 p) max h( ) = 1 p,
101
pi
0. In fact, take i = (x, y) and
qi
x,y
102
(b) Define a random variable T equal to 0 with probability and 1 with probability
1 . Then the random variable Z has the distribution W ( ) where
'
Z=
X,
if T = 0,
Y,
if T = 1.
By part (a),
h(Z|T ) h(Z),
with the LHS = h(X) + (1 )h(Y ), and the RHS = h(W ( )).
(c) Observe that for independent random variables X and Y , h(X + Y |X) =
h(Y |X) = h(Y ). Hence, again by part (a),
h(X +Y ) h(X +Y |X) = h(Y ).
Using this fact, for all 1 < 2 , take X Po(1 ), Y Po(2 1 ), independently.
Then
h(X +Y ) h(X) implies hPo (2 ) hPo (1 ).
Problem 1.5
What does it mean to transmit reliably at rate R through a binary
symmetric channel (MBSC) with error-probability p? Assuming Shannons second coding theorem (SCT), compute the supremum of all possible reliable transmission rates of an MBSC. What happens if: (i) p is very small; (ii) p = 1/2; or
(iii) p > 1/2?
Solution An MBSC can
reliably at rate R if there is a sequence of codes
8 transmit
9
NR
codewords such that
XN , N = 1, 2, . . ., with 2
e(XN ) = max P error|x sent 0 as N .
xXN
By the SCT, the so-called operational channel capacity is sup R = max I(X : Y ),
the maximum information transmitted per input symbol. Here X is a Bernoulli
random variable taking values 0 and 1 with probabilities [0, 1] and 1 , and
Y is the output random variable for the given input X. Next, I(X : Y ) is the mutual
entropy (information):
I(X : Y ) = h(X) h(X|Y ) = h(Y ) h(Y |X).
103
Observe that the binary entropy function h(x) 1 with equality for x = 1/2.
Selecting = 1/2 conclude that the MBSC with error probability p has the
capacity
= max h( p + (1 )(1 p)) (p)
(1.6.3)
ai log bi ai log bi
i
i
i
(1.6.5)
and show that for any probability distributions p = (p(x)) and q = (q(x)),
2
log2 e
D(pq)
(1.6.6)
|p(x) q(x)| .
2
x
104
Solution (i) Denote x = ln , and taking the logarithm twice obtain the inequality
x 1 > ln x. This is true as x > 1, hence e > e .
(ii) Assume without loss of generality that ai > 0 and bi > 0. The function g(x) =
x log x is strictly convex. Hence, by the Jensen inequality for any coefficients ci =
1, ci 0,
ci g(xi ) g ci xi .
1
and xi = ai /bi , we obtain
Selecting ci = bi b j
j
ai
ai
ai
b j log bi b j
i
log
ai
i
bj
q(x)
c p(x) 1
= c 1 q(B) 0.
p(x)
xB
D(pq) =
g(A) = g(x),
xA
Then
f (x)
f (A)p(x)
xA
xA
= f (A)
p(x)
xA
;<
f (A) log
f (A)
.
g(A)
+ f (A) log
f (A)
g(A)
105
then
p(x)
p(x)
+ p(x) log
q(x) xAc
q(x)
xA
c)
p(A
p(A)
+ p(Ac ) log
p(A) log
q(A)
q(Ac )
2
2 log2 e
2 log2 e p(A) q(A) =
|p(x) q(x)| .
2
x
Problem 1.7 (a) Define the conditional entropy, and show that for random variables U and V the joint entropy satisfies
h(U,V ) = h(V |U) + h(U).
(1.6.7)
i=1
where h(XS ) = h(Xs1 , . . . , Xsk ) for S = {s1 , . . . , sk }. Assume that, for any i,
h(Xi |XS ) h(Xi |XT ) when T S, and i
/ S.
By considering terms of the form
h(X1 , . . . , Xn ) h(X1 , . . . Xi1 , Xi+1 , . . . , Xn )
(n)
(n)
(k)
(n)
(n)
(n)
(n)
t1 t2 tn .
106
(1.6.8)
and, in general,
h(X1 , . . . , Xn )
= h(X1 , . . . , Xi1 , Xi+1 , . . . , Xn ) + h(Xi |X1 , . . . , Xi1 , Xi+1 , . . . , Xn )
h(X1 , . . . , Xi1 , Xi+1 , . . . , Xn ) + h(Xi |X1 , . . . , Xi1 ),
(1.6.9)
because
h(Xi |X1 , . . . , Xi1 , Xi+1 , . . . , Xn ) h(Xi |X1 , . . . , Xi1 ).
Then adding equations (1.6.9) from i = 1 to n:
n
The second sum in the RHS equals h(X1 , . . . , Xn ) by the chain rule (1.6.7). So,
n
(n)
(n)
(1.6.10)
In general, fix a subset S of size k in {1, . . . , n}. Writing S(i) for S \{i}, we obtain
1
h(X[S(i)])
1
h[X(S)]
,
k
k iS k 1
107
.
h =
k k
k
k(k 1)
S{1,...,n}: |S|=k
S{1,...,n}: |S|=k iS
(1.6.11)
Finally, each subset of size k 1, S(i), appears [n (k 1)] times in the sum
(n)
(1.6.11). So, we can write hk as
n
h[X(T )] n (k 1)
k
k
T {1,...,n}: |T |=k1 k 1
n
h[X(T )]
=
= hnk1 .
1
k
1
T {1,...,n}: |T |=k1
(c) Starting from (1.6.11), exponentiate and then apply the arithmetic
mean/geometric mean inequality, to obtain for S0 = {1, 2, . . . , n}
e h(X(S0 ))/n e [h(S0 (1))++h(S0 (n))]/(n(n1))
(n)
1 n h(S0 (i))/(n1)
e
n i=1
(n)
which is equivalent to tn tn1 . Now we use the same argument as in (b), taking
(n)
(n)
the average over all subsets to prove that for all k n,tk tk1 .
Problem 1.8
Prove that
The random variables X and Y with values x and y from finite alphabets I and
J represent the input and output of a transmission channel, with the conditional
probability P(x | y) = P(X = x | Y = y). Let h(P( | y)) denote the entropy of the
conditional distribution P( | y), y J , and h(X | Y ) denote the conditional entropy
of X given Y . Define the ideal observer decoding rule as a map f : J I such
that P( f (y) | y) = maxxI P(x | y) for all y J . Show that under this rule the errorprobability
er (y) =
P(x | y)
xI: x= f (y)
1
satisfies er (y) h(P( | y)), and the expected error satisfies
2
1
Eer (Y ) h(X | Y ).
2
108
Solution Bound (i) follows from the pooling inequality. Bound (ii) holds as
pi log pi pi log
i
1
1
= log .
p
p
To check (iii), it is convenient to use (i) for p 1/2 and (ii) for p 1/2. Assume
first that p 1/2. Then, by (i),
h(p1 , . . . , pn ) h (p , 1 p ) .
The function x (0, 1) h(x, 1 x) is concave, and its graph on (1/2, 1) lies
strictly above the line x 2(1 x). Hence,
h(p1 , . . . , pn ) 2 (1 p ) .
On the other hand, if p 1/2, we use (ii):
h(p1 , . . . , pn ) log
1
.
p
1
1
2(1 x); equality iff x = .
x
2
109
where Psource stands for the source probability distribution, and one uses a code fN
and a decoding rule fN . A value R (0, 1) is said to be a reliable transmission rate
if, given that Psource is an equidistribution over a set UN of source strings u with
UN = 2N[R+o(1)] , there exist fN and fN such that
1
(N)
f
P
(Y
)
=
u
|
f
(u)
sent
= 0.
lim
N
N
ch
N UN
uUN
The channel capacity is the supremum of all reliable transmission rates.
For an erasure channel, the matrix is
1 p
0
p
0
1 p p
1 0
0
1
The conditional entropy h(Y |X) = h(p, 1 p) does not depend on pX . Thus,
C = sup I(X : Y ) = sup h(Y ) h(Y |X)
pX
pX
110
n1
1
n1
0 n2
2
n
n2
1
n1
1
1
n2
n2
n2
0
1
1
1
n2
n1
n2
1
...
...
...
n1
n2
1
n2
1
n1
n2
n2
n1
n1
The channel is double-symmetric (the rows and columns are permutations of each
other), hence the capacity-achieving input distribution is
1
pX (0) = pX (1) = ,
2
111
we find
1
1
1
dC(x) 3
=
+
log(x 1)
dx
x x 1 x(x 1) x2
1
2 1
= 2 log(x 1) = 2 [ 2x log(x 1)] > 0, x > 1.
x x
x
Thus, Cn increases with n for n 1. When i and i are treated as 0 or 1, the
capacity does not change.
Problem 1.12
Let Xi , i = 1, 2, . . . , be IID random variables, taking values 1, 0
with probabilities p and (1 p). Prove the local De MoivreLaplace theorem with
a remainder term:
1
exp[nh p (y) + n (k)], k = 1, . . . , n 1; (1.6.12)
P(Sn = k) = $
2 y(1 y)n
here
Sn =
Xi ,
y = k/n,
1in
1y
y
+ (1 y) ln
h p (y) = y ln
p
1 p
1
,
6ny(1 y)
y = k/n.
n! = 2 n, nn en+ (n) ,
where
1
1
< (n) <
.
12n + 1
12n
112
Solution Write
n!
n
(1 p)nk pk
(1 p)nk pk =
P(Sn = k) =
k
k!(n
k)!
?
nn
n
=
(1 p)nk pk
nk
2 k(n k) kk (n
k)
+k ln p + (n k) ln(1 p) exp (n) (k) (n k)
1
=$
exp nh p (y)
2 ny(1 y)
Now, as
| (n) (k) (n k)| <
1
1
2n2
1
+
+
<
,
12n 12k 12(n k) 12nk(n k)
(1.6.12) follows, with n (k) = (n) (k) (n k). By the Gibbs inequality,
h p (y) 0 and h p (y) = 0 iff y = p. Furthermore,
d2 h p (y) 1
dh p (y)
y
1y
1
= ln ln
, and
> 0,
= +
2
dy
p
1 p
dy
y 1y
which yields
dh p (y)
dh p (y)
dh p (y)
< 0, 0 < y < p, and
> 0, p < y < 1.
= 0,
dy y=p
dy
dy
Hence,
at y = p,
h p = min h p (y) = 0, attained
1
1
h p = max h p (y) = min ln , ln
, attained at y = 0 or y = 1.
p
1 p
Thus, the maximal probability for n 1 is for y = p, i.e. k+ = np:
1
exp n (np) ,
P Sn = np $
2 np(1 p)
where
|n (np)|
1
.
6np(1 p)
113
Problem 1.13
(a) Prove that the entropy h(X) = p(i) log p(i) of a discrete
i=1
(1.6.13)
Show also that inequality (1.6.13) remains true even when p < 1/2.
Solution (a) Concavity of h(p) means that
h(1 p1 + 2 p2 ) 1 h(p1 ) + 2 h(p2 )
(1.6.14)
(1.6.15)
If PY |X (.|.) are fixed, the second term is a linear function of pX , hence concave. The
first term, h(Y ), is a concave function of pY which in turn is a linear function of
pX . Thus, h(Y ) is concave in pX , and so is I(X : Y ).
(b) Consider two cases, (i) p 1/2 and (ii) p 1/2. In case (i), by pooling
inequality,
h(X) h(p , 1 p ) (1 p ) log
1
1
(1 p ) log = 2(1 p )
p (1 p )
4
114
115
i: i <i
pi
pi
i: i >i
= pi sign (i i )
i
= E sgn( ) E 2 1 ,
as sign x 2x 1 for integer x. Continue the argument with
E 2 1 = pi 2i i 1
i
= 2i 2i i 1 =
i
2 2
i
1 2i = 1 1 = 0
i
2i i 1, i.e. i = i .
Problem 1.15
Define the capacity C of a binary channel. Let CN =
(N)
(N)
(1/N) sup I(X : Y ), where I(X(N) : Y(N) ) denotes the mutual entropy between
X(N) , the random word of length N sent through the channel, and Y(N) , the received
word, and where the supremum is over the probability distribution of X(N) . Prove
that C lim supN CN .
Solution A binary channel is defined as a sequence of conditional probability distributions
(N)
116
e(N) = e(N) f (N) , f(N) .
The converse part of Shannons second coding theorem (SCT) states that
1
(1.6.20)
C lim sup sup I X(N) : Y(N) ,
N N P (N)
X
where I X(N) : Y(N) is the mutual entropy between the random input and output
strings X(N) and Y(N) and PX(N) is a distribution of X(N) .
For the proof, it suffices to check that if U (N) = 2N[R+O(1)] then, for all f (N)
and f(N) ,
CN + o(1)
e(N) 1
(1.6.21)
R + o(1)
where
CN =
1
sup I X(N) : Y(N) .
N P (N)
X
according
to
(1.6.21)
117
The last bound here follows from the generalised Fano inequality
h X(N) | f Y(N) e(N) log e(N) 1 e(N) log 1 e(N)
+e(N) log U (N) 1
1 + e(N) log U (N) 1 .
Now, from (1.6.22),
i.e.
(N)
N R + o(1) NCN 1
CN + o(1)
= 1
,
R + o(1)
log 2N[R+o(1)] 1
as required.
Problem 1.16
A memoryless channel has input 0 and 1, and output 0, 1 and
(illegible). The channel matrix is given by
P(0|0) = 1, P(0|1) = P(1|1) = P(|1) = 1/3.
Calculate the capacity of the channel and the input probabilities pX (0) and pX (1)
for which the capacity is achieved.
Someone suggests that, as the symbol may occur only from 1, it is to your
advantage to treat as 1: you gain more information from the output sequence, and
it improves the channel capacity. Do you agree? Justify your answer.
Solution Use the formula
pX
1 + 2p 2(1 p)
1 p
1 + 2p
log
log
.
3
3
3
3
118
Also,
h(Y |X) = pX (x) P(y|x) log P(y|x)
x=0,1
1 + 2p 2(1 p)
1 p
1 + 2p
log
log
(1 p) log 3.
3
3
3
3
Differentiating yields
d
I(X : Y ) = 2/3(log (1/3 + 2p/3) + 2/3 log (1/3 p/3) + log 3.
dp
Hence, the maximum max I(X : Y ) is found from relation
1 p
2
log
+ log 3 = 0.
3
1 + 2p
This yields
log
and
3
1 p
= log 3 := b,
1 + 2p
2
1 p
= 2b , i.e. 1 2b = p 1 + 2b+1 .
1 + 2p
The answer is
p=
1 2b
.
1 + 2b+1
where q(x, y) = P(X = x|Y = y). For which and for which X , Y does equality
hold here?
119
where
q(x, y) = P(X = x|Y = y).
The joint entropy is given by
h(X,Y ) = P(X = x,Y = y) log P(X = x,Y = y).
x,y
1
1
E log q(X,Y ) =
h(X|Y ).
log 1/
log 1/
Here equality holds iff
P q(X,Y ) = ) = 1.
This requires that (i) = 1/m where m is a positive integer and (ii) for all y
support Y , there exists a set Ay of cardinality m such that
P(X = x|Y = y) =
1
,
m
for x Ay .
and is impossible if
h(p1 , p2 , . . . , pm ) + h(p , 1 p ) > 1,
m
120
Solution The asymptotic equipartition property for a Bernoulli source states that
the number of distinct strings (words) of length n emitted by the source is typically 2nH+o(n) , and they have nearly equal probabilities 2nH+o(n) :
lim P 2n(H+ ) Pn (U(n) ) 2n(H ) = 1.
n
Here, H = h(p1 , . . . , pn ).
Denote
(
)
Tn (= Tn ( )) = u(n) : 2n(H+ ) Pn (u(n) ) 2n(H )
and observe that
1
1
log Tn = H, i.e. lim sup log Tn < H + .
n n
n n
lim
By the definition of the channel capacity, the words u(n) Tn ( ) may be encoded
1
by binary codewords of length R (H + ) and sent reliably through a memoryless
symmetric channel with matrix
p
1 p
p
1 p
for any R < C where
C = sup I(X : Y ) = sup[h(Y ) h(Y |X)].
pX
pX
121
independently of pX . Hence,
C = sup h(Y ) h(p , 1 p ) = 1 h(p , 1 p ),
pX
then R (H + ) can be made < 1, for > 0 small enough and R < C close to
C. This means that there exists a sequence of codes fn of length n such that the
error-probability, while using encoding fn and the ML decoder, is
P u(n) Tn
+P u(n) Tn ; an error while using fn (u(n) ) and the ML decoder
0, as n ,
since both probabilities go to 0.
On the other hand,
H >C
then R H > 1 for all R < C, and we cannot encode words u(n) Tn by codewords
of length n so that the error-probability tends to 0. Hence, no reliable transmission
is possible.
Problem 1.19 A Markov source with an alphabet of m characters has a transition
matrix Pm whose elements p jk are specified by
p11 = pmm = 2/3, p j j = 1/3 (1 < j < m),
p j j+1 = 1/3 (1 j < m), p j j1 = 1/3 (1 < j m).
All other elements are zero. Determine the information rate of the source.
. Consider
Denote the transition matrix thus specified by Pm
a source in an
Pm 0
, where the zeros
alphabet of m + n characters whose transition matrix is
0 Pn
indicate zero matrices of appropriate size. The initial character is supposed uniformly distributed over the alphabet. What is the information rate of the source?
122
2/3 1/3 0
0 ... 0
1/3 1/3 1/3 0 . . . 0
.
.
.
.
.
..
..
..
..
..
..
0
0
0
0 . . . 2/3
is Hermitian and so has the equilibrium distribution = (i ) with i = 1/m, 1
i m (equidistribution). The information rate equals
Hm = j p jk log p jk
j,k
2
2 1
1
1
1
1
2
log + log
+ 3(m 2) log
=
m
3
3 3
3
3
3
4
.
= log 3
3m
Pm 0
The source with transition matrix
is non-ergodic, and its information
0 Pn
rate is the maximum of the two rates
max Hm , Hn = Hmn .
Problem 1.20
Consider a source in a finite alphabet. Define Jn = n1 h(U(n) )
and Kn = h(Un+1 |U(n) ) for n = 1, 2, . . .. Here Un is the nth symbol in the sequence
and U(n) is the string constituted by the first n symbols, h(U(n) ) is the entropy and
h(Un+1 |U(n) ) the conditional entropy. Show that, if the source is stationary, then Jn
and Kn are non-increasing and have a common limit.
Suppose the source is Markov and not necessarily stationary. Show that the
mutual information between U1 and U2 is not smaller than that between U1 and U3 .
Solution For the second part, the Markov property implies that
P(U1 = u1 |U2 = u2 ,U3 = u3 ) = P(U1 = u1 |U2 = u2 ).
Hence,
P(U1 = u1 |U2 = u2 ,U3 = u3 )
= E log
= I(U1 : U2 ).
P(U1 = u1 )
Since
I(U1 : (U2 ,U3 )) I(U1 : U3 ),
the result follows.
123
Problem 1.21 Construct a Huffman code for a set of 5 messages with probabilities as indicated below
1
2
3
4
5
Message
Probability 0.1 0.15 0.2 0.26 0.29
Solution
Message
1
2
3
4
5
Probability 0.1 0.15 0.2 0.026 0.029
Codeword 101 100 11 01
00
The expected codeword-length equals 2.4.
Problem 1.22
State the first coding theorem (FCT), which evaluates the information rate for a source with suitable long-run properties. Give an interpretation of
the FCT as an asymptotic equipartition property. What is the information rate for a
Bernoulli source?
Consider a Bernoulli source that emits symbols 0, 1 with probabilities 1 p and
p respectively, where 0 < p < 1. Let (p) = p log p (1 p) log(1 p) and let
> 0 be fixed. Let U(n) be the string consisting of the first n symbols emitted by
the source. Prove that there is a set Sn of possible values of U(n) such that
(n)
P U Sn 1 log
2
p(1 p)
,
n 2
Sn the probability that P U(n) = u(n) lies between
p
1 p
1
Var [log P(U1 )] .
2n
(1.6.23)
124
Here
'
P(U j ) =
and
1 p,
if U j = 0,
p,
if U j = 1,
log P(U j )
1 jn
where
P(U j ),
1 jn
Var
Pn (U(n) ) =
Var log P(U j )
1 jn
2
Var log P(U j ) = E log P(U j ) E log P(U j )
2
= p(log p)2 + (1 p)(log(1 p))2 p log p + (1 p) log(1 p)
2
p
= p(1 p) log
.
1 p
S
value of q . Establish the lower bound E q
pi , and characterise
1im
pi .
q
1im
xi yi
1im
1im
xi2
1im
y2i
125
Solution By CauchySchwarz,
pi = pi qsi /2 qsi /2
1im
1/2
1/2
1/2
s
s
s
i
i
i
,
q
pi q
pi q
1/2
1/2
1im
1im
since, by Kraft,
1im
1im
qsi 1. Hence,
1im
2
EqS =
pi qsi
1im
1/2
pi
1im
qxi = 1 (so,
1im
1/2
pi
1im
1im
EqS =
pi qsi = q
1im
<q
1im
pi qsi 1
1im
pi qxi = qc
1im
2
1/2
pi
= q
1/2
pi
1im
Problem 1.24
A Bernoulli source of information of rate H is fed characterby-character into a transmission line which may be live or dead. If the line is live
when a character is transmitted then that character is received faithfully; if the line
is dead then the receiver learnt only that it is indeed dead. In shifting between
its two states the line follows a Markov chain (DTMC) with constant transition
probabilities, independent of the text being transmitted.
Show that the information rate of the source constituted by the received signal
is HL + L HS where HS is the signal, HL is the information rate of the DTMC
governing the functioning of the line and L is the equilibrium probability that the
line is alive.
Solution The rate of a Bernoulli source emitting letter j = 1, 2, . . . with probability
p j is H = p j log p j . The state of the line is a DTMC with a 2 2 transition
j
126
matrix
dead
live
1
1
, L (live) =
+
+
(assuming that + > 0). The received signal sequence follows a DTMC with
states 0 (dead), 1, 2, . . . and transition probabilities
q0 j = p j ,
q00 = 1 ,
j, k 1.
q jk = (1 )pk
q j0 = ,
This chain has a unique equilibrium distribution
RS (0) =
, RS ( j) =
p j , j 1.
+
+
(1 ) log(1 ) + p j log( p j )
=
+
j1
= HL +
HS .
+
Here HL is the entropy rate of the line state DTMC:
HL =
(1 ) log(1 ) + log
+
(1 ) log(1 ) + log ,
and = /( + ).
Problem 1.25
Consider a Bernoulli source in which the individual character
can take value i with probability pi (i = 1, . . . , m). Let ni be the number of times the
character value i appears in the sequence u(n) = u1 u2 . . . un of given length n. Let
An be the smallest set of sequences u(n) which has total probability at least 1 .
Show that each sequence in An satisfies the inequality
ni log pi nh + nk/ )1/2 ,
127
pni i .
1im
'
@
u(n) : ni log pi c ,
An =
(n)
< .
ni log pi c
1im
Now, for the random string U(n) = U1 . . . Un , let Ni is the number of appearances
of value i. Then
Ni log pi =
1im
1 jn
pi log pi := h
1im
and
2
Var j = E( j )2 E j =
pi (log pi )2
1im
Then
E
1 jn
2
j = nh and Var
pi log pi
1im
j = nv.
1 jn
:= v.
128
ni log pi nh +
1im
nk
:= c.
For an irreducible and aperiodic Markov source the assertion is similar, with
H =
i pi j log pi j ,
1i, jm
1
and v 0 a constant given by v = lim sup Var
n n
j .
1 jn
Problem 1.26
Demonstrate that an efficient and decipherable noiseless coding
procedure leads to an entropy as a measure of attainable performance.
Words of length si (i = 1, . . . , n) in an alphabet Fa = {0, 1, . . . , a 1} are to be
n
i=1
ity but also to the condition that qi si should not exceed a prescribed bound,
i=1
pi si .
i=1
Solution If we disregard the condition that s1 , . . . , sn are positive integers, the minimisation problem becomes
minimise
si pi
i
1in
si pi
1in
asi .
(1.6.24)
129
(1.6.25)
vrel =
pi loga pi := h,
1in
h si pi .
i
qi si b.
(1.6.26)
1in
The relaxed problem (1.6.24) complemented with (1.6.26) again can be solved by
the Lagrange method. Here, if
qi loga pi b
i
then adding the new constraint does not affect the minimiser (1.6.24), i.e. the
optimal positive s1 , . . . , sn are again given by (1.6.25), and the optimal value is h.
Otherwise, i.e. when qi loga pi > b, the new minimiser s1 , . . . , sn is still unique
i
(since the problem is still strong Lagrangian) and fulfils both constraints
as
= 1,
qi si = b.
i
In both cases, the optimal value vrel for the new relaxed problem satisfies h vrel .
Finally, the solution s1 , . . . , sn to the integer-valued word-length problem
minimise
si pi
i
subject to si 1 integer
and asi 1, qi si b
i
(1.6.27)
will satisfy
h vrel si pi ,
i
Problem 1.27
si qi b.
i
130
(Here the source is a DTMC (X_t) with transition probabilities p_{ij} and s-step transition probabilities p^{(s)}_{ij}, and each emitted digit is, independently, obliterated — replaced by a splodge — with probability θ.) The probability p_n(x_1^n) = P(X_1^n = x_1^n) of such a string is represented as a product of factors,                                                  (1.6.28)
the first factor involving P(X_{s_1} = x_{s_1}) or P(X_1 = x_1), depending on where the initial non-obliterated digit occurred in x_1^n (if at all). The subsequent factors contributing to (1.6.28) have a similar structure:
    p_{x_{t−1} x_t},   or   p^{(s)}_{x_{t−s} x_t} θ^{s−1},   or   1.
Consequently, the information −log p_n(x_1^n) carried by string x_1^n is calculated as
    −log P(X_{s_1} = x_{s_1}) − (s_1 − 1) log θ − log(1 − θ)
    − log p^{(s_2 − s_1)}_{x_{s_1} x_{s_2}} − (s_2 − s_1 − 1) log θ − log(1 − θ)
    − ⋯
    − log p^{(s_N − s_{N−1})}_{x_{s_{N−1}} x_{s_N}} − (s_N − s_{N−1} − 1) log θ − log(1 − θ) − ⋯,
where 1 ≤ s_1 < ⋯ < s_N ≤ n are the consecutive times of appearance of non-obliterated symbols in x_1^n.
Now take −(1/n) log p_n(X_1^n), the information rate provided by the random string X_1^n. Ignoring the initial bit, we can write
    −(1/n) log p_n(X_1^n) = −(N(∗)/n) log(1 − θ) − (N(⊔)/n) log θ − (1/n) ∑_{i,j,s} M(i, j; s) log p^{(s)}_{ij}.
Here
    N(∗) = number of non-obliterated digits in X_1^n,
    N(⊔) = number of obliterated digits in X_1^n,
    M(i, j; s) = number of series of digits i ⊔ … ⊔ j in X_1^n of length s + 1.
As n → ∞, we have the convergence of the limiting frequencies (the law of large numbers applies):
    N(∗)/n → 1 − θ,   N(⊔)/n → θ,   M(i, j; s)/n → (1 − θ)² θ^{s−1} π_i p^{(s)}_{ij}.
This yields
    −(1/n) log p_n(X_1^n) → −(1 − θ) log(1 − θ) − θ log θ − (1 − θ)² ∑_{i,j} ∑_{s≥1} θ^{s−1} π_i p^{(s)}_{ij} log p^{(s)}_{ij},
where π = (π_i) is the equilibrium distribution of the chain.
    00 (  q_0  q_1   0    0  )
    01 (   0    0   q_1  q_0 )
    10 (  q_1  q_0   0    0  )
    11 (   0    0   q_0  q_1 )
Since the equilibrium distribution is uniform, the information rate equals
    −(1/4) ∑_{rows} ∑_{α,β=0,1} p_{α,β} log p_{α,β}
and equals
    h(q_0, q_1) = −q_0 log q_0 − q_1 log q_1.
Problem 1.29
An input to a discrete memoryless channel has three letters 1, 2 and 3. The letter j is received as (j − 1) with probability p, as (j + 1) with probability p and as j with probability 1 − 2p, the letters from the output alphabet ranging from 0 to 4. Determine the form of the optimal input distribution, for general p, as explicitly as possible. Compute the channel capacity in the three cases p = 0, p = 1/3 and p = 1/2.
Solution  The channel matrix is 3 × 5:
    1 (  p   1−2p    p     0    0 )
    2 (  0     p    1−2p    p   0 )
    3 (  0     0     p    1−2p  p ).
The rows are permutations of each other, so the capacity equals
    C = max_{P_X} [ h(Y) − h(Y|X) ],
where h(Y|X) does not depend on P_X, h(Y) = −∑_{y=0,1,2,3,4} P_Y(y) log P_Y(y), and
    P_Y(0) = P_X(1) p,
    P_Y(1) = P_X(1)(1 − 2p) + P_X(2) p,
    P_Y(2) = P_X(1) p + P_X(2)(1 − 2p) + P_X(3) p,
    P_Y(3) = P_X(3)(1 − 2p) + P_X(2) p,
    P_Y(4) = P_X(3) p.                                                       (1.6.29)
The symmetry in (1.6.29) suggests that h(Y) is maximised when P_X(1) = P_X(3) = q and P_X(2) = 1 − 2q. So:
    (d/dq) h(Y) = −2p log(qp) − 2p − 2(1 − 4p) log( q(1 − 2p) + (1 − 2q)p ) − ⋯,
    ⋯,   P_Y(1) = P_Y(3) = 1/3,   P_Y(2) = 1/6.
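The optimal q can also be located numerically. A minimal sketch (it assumes, as in the solution, the symmetric input distribution (q, 1 − 2q, q) and simply scans over q; the value p = 1/3 is used only as an example):

```python
import numpy as np

def mutual_info(q, p):
    """I(X;Y) in bits for P_X = (q, 1-2q, q) on the channel of Problem 1.29."""
    P = np.array([[p, 1 - 2 * p, p, 0, 0],
                  [0, p, 1 - 2 * p, p, 0],
                  [0, 0, p, 1 - 2 * p, p]])
    px = np.array([q, 1 - 2 * q, q])
    py = px @ P
    def H(v):
        v = v[v > 0]
        return -np.sum(v * np.log2(v))
    return H(py) - np.sum(px * np.array([H(row) for row in P]))

p = 1 / 3
qs = np.linspace(1e-6, 0.5 - 1e-6, 2001)
vals = [mutual_info(q, p) for q in qs]
i = int(np.argmax(vals))
print(f"p={p:.3f}: best q ~ {qs[i]:.4f}, capacity ~ {vals[i]:.4f} bits")
```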
For this channel the matrix is
    (  1    0   0   …   0 )
    ( 1−p   p   0   …   0 )
    ( 1−p   0   p   …   0 )
    (  ⋮    ⋮   ⋮   ⋱   ⋮ ).
From ∑_{i≥1} a_i = c ∑_{i≥1} d^{i−1} = c/(1 − d) = 1 − a_0 and d = a_0, we get c = (1 − a_0)².
Next, we maximise, in a_0 ∈ [0, 1], the function
    f(a_0) = −(1 − p + p a_0) log(1 − p + p a_0) − p(1 − a_0) log[ (1 − a_0)²/a_0 ]
             − p log a_0 − p(1 − a_0) log p + (1 − a_0)[ (1 − p) log(1 − p) + p log p ].
Requiring that
    f′(a_0) = 0                                                               (1.6.30a)
and
    f″(a_0) = −p²/(1 − p + p a_0) − 2p/(1 − a_0) − p/a_0 ≤ 0,                 (1.6.30b)
one can solve equation (1.6.30a) numerically. Denote its root where (1.6.30b) holds by a_0*. Then we obtain the following answer for the optimal input distribution:
    a_i = a_0* for i = 0,   a_i = (1 − a_0*)² (a_0*)^{i−1} for i ≥ 1,
with the capacity C = f(a_0*).
Problem 1.31 The representation known as binary-coded decimal encodes 0 as
0000, 1 as 0001 and so on up to 9, coded as 1001, with other 4-digit binary strings
being discarded. Show that by encoding in blocks, one can get arbitrarily near the
lower bound on word-length per decimal digit.
Hint: Assume all integers to be equally probable.
Solution  Assume the decimal digits i_1, …, i_n of a block are IID and equidistributed. Denote by S(n) the random codeword-length while encoding in blocks of n digits. The minimal expected word-length per source letter is e_n := (1/n) min E S(n). By Shannon's noiseless coding (NC) theorem,
    h^(n)/(n log q) ≤ e_n ≤ h^(n)/(n log q) + 1/n,
where q is the size of the encoding alphabet and h^(n) the entropy of the block. We see that, for large n, e_n ≈ h^(n)/(n log q). With q = 2 and h^(n) = n log 10 this gives e_n → log 10/log 2 = log₂ 10 ≈ 3.32 bits per decimal digit, the lower bound on word-length per decimal digit.
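Concretely, assuming equiprobable decimal digits as in the hint, encoding blocks of n digits into ⌈log₂ 10ⁿ⌉ binary digits brings the word-length per decimal digit down from 4 (plain BCD) towards the lower bound log₂ 10 ≈ 3.3219, as the short sketch below illustrates.

```python
import math

lower_bound = math.log2(10)        # bits per decimal digit for equiprobable digits
print(f"plain BCD: 4.0000 bits/digit, lower bound: {lower_bound:.4f}")

for n in (1, 2, 3, 5, 10, 20, 50):
    bits = math.ceil(n * math.log2(10))   # encode a block of n digits as one binary word
    print(f"n = {n:3d}: {bits / n:.4f} bits per decimal digit")
```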
Relate this to the information rate of a two-state source with transition probabilities p and 1 − p.
Solution  The information rate of an m-state stationary DTMC with transition matrix P = (p_{ij}) and an equilibrium (invariant) distribution π = (π_i) equals
    h = −∑_{i,j} π_i p_{ij} log p_{ij}.
If matrix P is irreducible (i.e. has a unique communicating class) then this statement holds for the chain with any initial distribution (in this case the equilibrium
distribution is unique).
The transition matrix in question is the m × m circulant
    (  p   1−p   0   …   0  )
    (  0    p   1−p  …   0  )
    (  ⋮         ⋱    ⋱   ⋮  )
    (  0    0    …   p  1−p )
    ( 1−p   0    …   0   p  ).
The rows are permutations of each other, and each of them has entropy
    −p log p − (1 − p) log(1 − p).
The equilibrium distribution is π = (1/m, …, 1/m):
    ∑_{1≤i≤m} (1/m) p_{ij} = (1/m)(p + 1 − p) = 1/m,
and it is unique, as the chain has a unique communicating class. Therefore, the information rate equals
    h = −(1/m) ∑_{1≤i≤m} [ p log p + (1 − p) log(1 − p) ] = −p log p − (1 − p) log(1 − p).
For m = 2 we obtain precisely the matrix
    (  p   1−p )
    ( 1−p   p  ),
so with the equilibrium distribution π = (1/2, 1/2) the information rate is again h = −p log p − (1 − p) log(1 − p).
Problem 1.33 Define a symmetric channel and find its capacity.
A native American warrior sends smoke signals. The signal is coded in puffs of
smoke of different lengths: short, medium and long. One puff is sent per unit time.
Assume a puff is observed correctly with probability p, and with probability 1 − p
(a) a short signal appears to be medium to the recipient, (b) a medium puff appears
to be long, and (c) a long puff appears to be short. What is the maximum rate at
which the warrior can transmit reliably, assuming the recipient knows the encoding
system he uses?
It would be more reasonable to assume that a short puff may disperse completely
rather than appear medium. In what way would this affect your derivation of a
formula for channel capacity?
Solution Suppose we use an input alphabet I , of m letters, to feed a memoryless
channel that produces symbols from an output alphabet J of size n (including
illegibles). The channel is described by its m n matrix where entry pi j gives the
probability of receiving output letter j ∈ J when input letter i ∈ I is sent:
    ( p_11  …  p_1j  …  p_1n )
    (  ⋮        ⋮        ⋮   )
    ( p_i1  …  p_ij  …  p_in )
    (  ⋮        ⋮        ⋮   )
    ( p_m1  …  p_mj  …  p_mn ).
The channel is called symmetric if its rows are permutations of each other (or, more generally, have the same entropy E = h(p_i1, …, p_in) for all i ∈ I). The channel is said to be double-symmetric if in addition its columns are permutations of each other (or, more generally, have the same column sum ∑_{1≤i≤m} p_ij for all j ∈ J).
The capacity of such a channel equals C = max I(X : Y). Here, the maximum is taken over P_X = (P_X(i), i ∈ I), the input-letter probability distribution, and I(X : Y) is the mutual entropy between the input and output random letters X and Y tied through the channel matrix:
    I(X : Y) = h(Y) − h(Y|X) = h(X) − h(X|Y).
For the symmetric channel, the conditional entropy
    h(Y|X) = −∑_{i,j} P_X(i) p_{ij} log p_{ij} ≡ h
does not depend on P_X, and the maximisation needs only to be performed for the output symbol entropy
    h(Y) = −∑_j P_Y(j) log P_Y(j),  where P_Y(j) = ∑_i P_X(i) p_{ij}.
For the smoke-signal channel the matrix is 3 × 3:
    1 short   (  p    1−p   0  )
    2 medium  (  0     p   1−p )
    3 long    ( 1−p    0    p  ),
which is both symmetric and double-symmetric. This yields
    C = log 3 + p log p + (1 − p) log(1 − p).
In the modified example, the matrix becomes 3 × 4:
    (  p    0    0   1−p )
    (  0    p   1−p   0  )
    ( 1−p   0    p    0  );
column 4 corresponds to a no-signal output state (a splodge). The maximisation problem loses its symmetry:
    maximise  −∑_{j=1,…,4} ( ∑_{i=1,2,3} P_X(i) p_{ij} ) log ( ∑_{i=1,2,3} P_X(i) p_{ij} )
              + ∑_{i=1,2,3} P_X(i) ∑_{j=1,…,4} p_{ij} log p_{ij},
    subject to ∑_{i=1,2,3} P_X(i) = 1.                                        (1.6.31)
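For a numerical check (the value p = 0.9 below is an arbitrary illustration): the double-symmetric formula gives the capacity of the cyclic channel directly, while the modified 3 × 4 channel can be handled by maximising the mutual information over the input simplex, here by a crude grid search.

```python
import numpy as np

def I(px, P):
    """Mutual information (bits) between input with distribution px and the channel P."""
    py = px @ P
    def H(v):
        v = v[v > 0]
        return -np.sum(v * np.log2(v))
    return H(py) - np.sum(px * np.array([H(r) for r in P]))

p = 0.9
cyclic = np.array([[p, 1 - p, 0], [0, p, 1 - p], [1 - p, 0, p]])
print("cyclic:", np.log2(3) + p * np.log2(p) + (1 - p) * np.log2(1 - p),
      "=", I(np.ones(3) / 3, cyclic))

modified = np.array([[p, 0, 0, 1 - p], [0, p, 1 - p, 0], [1 - p, 0, p, 0]])
best = 0.0
for a in np.linspace(0, 1, 101):            # crude search over P_X = (a, b, 1-a-b)
    for b in np.linspace(0, 1 - a, 101):
        best = max(best, I(np.array([a, b, 1 - a - b]), modified))
print("modified channel capacity ~", round(best, 4), "bits")
```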
with equality iff X and Y are Gaussian with proportional covariance matrices.
Let X be a real-valued random variable with a PDF f_X and finite differential entropy h(X), and let the function g : R → R have a strictly positive derivative g′ everywhere. Prove that the random variable g(X) has differential entropy satisfying
    h(g(X)) = h(X) + E log₂ g′(X).
Solution  Since g is strictly increasing, F_{g(X)}(y) = P(g(X) ≤ y) = F_X(g^{−1}(y)), and the PDF f_{g(X)}(y) = dF_{g(X)}(y)/dy takes the form
    f_{g(X)}(y) = f_X(g^{−1}(y)) (g^{−1})′(y) = f_X(g^{−1}(y)) / g′(g^{−1}(y)).
Then
    h(g(X)) = −∫ f_{g(X)}(y) log₂ f_{g(X)}(y) dy
            = −∫ f_X(x) [ log₂ f_X(x) − log₂ g′(x) ] dx
            = h(X) + E log₂ g′(X).                                            (1.6.32)
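A quick sanity check of (1.6.32) using closed forms (the choice X ~ N(0, 1) and g(x) = eˣ is ours, purely for illustration): then g(X) is lognormal, E log₂ g′(X) = E X / ln 2 = 0, and indeed h(g(X)) = h(X).

```python
import math

# X ~ N(0,1); g(x) = exp(x), so g'(x) = exp(x) and E[log2 g'(X)] = E[X]/ln 2 = 0.
h_X = 0.5 * math.log2(2 * math.pi * math.e)          # differential entropy of N(0,1)

# g(X) is LogNormal(0,1); its differential entropy in bits:
mu, sigma = 0.0, 1.0
h_gX = math.log2(sigma * math.sqrt(2 * math.pi)) + (mu + 0.5) / math.log(2)

E_log2_gprime = mu / math.log(2)                     # E[X]/ln 2
print(h_gX, "=", h_X + E_log2_gprime)                # both ~ 2.0471 bits
```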
Problem 1.35  For 0 < a < b introduce the geometric, logarithmic and identric means
    G(a, b) = √(ab),   L(a, b) = (b − a)/log(b/a),   I(a, b) = (1/e) (b^b/a^a)^{1/(b−a)}.
Check that
    0 < a < G(a, b) < L(a, b) < I(a, b) < A(a, b) = (a + b)/2 < b.            (1.6.33)
Let m = min_i [q_i/p_i], M = max_i [q_i/p_i], ρ = min_i [p_i], r = max_i [p_i]. Prove the following bounds for the entropy h(X) and the Kullback–Leibler divergence D(p‖q) (cf. PSE II, p. 419):
0 log r h(X) log ( , ).
(1.6.34)
(1.6.35)
pi xi
r r
Next, we sketch the proof of (1.6.36); see details in [144], [50]. Let f be a convex function and p, q ≥ 0 with p + q = 1. Then for x_i ∈ [a, b] we have
    0 ≤ ∑_i p_i f(x_i) − f( ∑_i p_i x_i ) ≤ max_p [ p f(a) + q f(b) − f(pa + qb) ].        (1.6.37)
Applying (1.6.37) to the convex function f(x) = −log x, we obtain after some calculations that the maximum in (1.6.37) is achieved at p_0 = (b − L(a, b))/(b − a), with p_0 a + (1 − p_0) b = L(a, b), and
    0 ≤ log [ A(p, x)/G(p, x) ] ≤ log[ (b − a)/log(b/a) ] − log(ab) + [ log(b^b/a^a) ]/(b − a) − 1,
which is equivalent to (1.6.36). Finally, we establish (1.6.37). Write x_i = λ_i a + (1 − λ_i) b for some λ_i ∈ [0, 1]. Then, by convexity,
    0 ≤ ∑_i p_i f(x_i) − f( ∑_i p_i x_i )
      ≤ ∑_i p_i ( λ_i f(a) + (1 − λ_i) f(b) ) − f( a ∑_i p_i λ_i + b ∑_i p_i (1 − λ_i) ).
Denoting ∑_i p_i λ_i = p and 1 − ∑_i p_i λ_i = q and maximising over p, we obtain (1.6.37).
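The chain (1.6.33) is easy to test numerically; here is a short check on a few arbitrary pairs (a, b):

```python
import math

def means(a, b):
    G = math.sqrt(a * b)
    L = (b - a) / math.log(b / a)
    I = (b ** b / a ** a) ** (1.0 / (b - a)) / math.e
    A = (a + b) / 2
    return G, L, I, A

for a, b in [(1.0, 2.0), (0.3, 7.5), (2.0, 2.0001)]:
    G, L, I, A = means(a, b)
    assert a < G < L < I < A < b, (a, b)     # the chain (1.6.33)
    print(f"a={a}, b={b}: G={G:.5f} < L={L:.5f} < I={I:.5f} < A={A:.5f}")
```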
Problem 1.36 Let f be a strictly positive probability density function (PDF)
on the line R, define the KullbackLeibler divergence D(g|| f ) and prove that
D(g‖f) ≥ 0. Suppose in addition that ∫ eˣ f(x) dx < ∞. Show that, for any PDF g with ∫ |x| g(x) dx < ∞,
    −∫ x g(x) dx + D(g‖f) ≥ −ln Z,  where Z = ∫ e^z f(z) dz,                  (1.6.38)
with equality for g*(x) = eˣ f(x)/Z.

Solution  By definition,
    D(g‖f) = ∫ g(x) ln [ g(x)/f(x) ] dx,
and D(g‖f) ≥ 0 by the Gibbs inequality: since ln t ≤ t − 1,
    −D(g‖f) = ∫ g(x) ln [ f(x)/g(x) ] dx ≤ ∫ g(x) [ f(x)/g(x) − 1 ] dx = 0.
Now set
    g*(x) = eˣ f(x)/Z   and   W = ∫ x eˣ f(x) dx.
Then
    D(g*‖f) = (1/Z) ∫ eˣ f(x) ln (eˣ/Z) dx = (1/Z) ∫ eˣ f(x) (x − ln Z) dx = W/Z − ln Z,
and, since ∫ x g*(x) dx = W/Z,
    −∫ x g*(x) dx + D(g*‖f) = −ln Z.
For a general PDF g,
    0 ≤ D(g‖g*) = ∫ g(x) ln [ g(x) Z e^{−x}/f(x) ] dx = −∫ x g(x) dx + D(g‖f) + ln Z,
implying that
    −∫ x g(x) dx + D(g‖f) ≥ −ln Z = −∫ x g*(x) dx + D(g*‖f).
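As a numerical illustration of the last display (the particular densities below are our choice): take f to be the N(0, 1) density, so that Z = e^{1/2} and g* is the N(1, 1) density; for g = N(m, 1) the functional −∫ x g dx + D(g‖f) equals m²/2 − m, minimised at m = 1 with value −1/2 = −ln Z.

```python
# f = N(0,1) density; Z = integral of e^x f(x) dx = e^{1/2}; g* = e^x f / Z = N(1,1).
# For g = N(m,1): D(g||f) = m^2/2 (nats) and -int x g(x) dx = -m.
for m in (0.0, 0.5, 1.0, 2.0):
    value = m ** 2 / 2 - m
    print(f"g = N({m},1): -E_g[X] + D(g||f) = {value:+.3f}  (>= -ln Z = -0.5)")
```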
2
Introduction to Coding Theory

2.1 Hamming spaces. Geometry of codes. Basic bounds on the code size

The Hamming distance between words x^(N) = x_1 … x_N and y^(N) = y_1 … y_N over a q-ary alphabet is
    δ(x^(N), y^(N)) = the number of digits i with x_i ≠ y_i.                 (2.1.1a)

Figure 2.1  (panels N = 1, 2, 3, 4)
An important part is played by the distance (x(N) , 0(N) ) between words x(N) =
x1. . . xN and 0(N) = 0 . . . 0; it is called the weight of word x(N) and denoted by
w x(N) :
    w(x^(N)) = the number of digits i with x_i ≠ 0.                          (2.1.1b)
Lemma 2.1.1
(2.1.2)
(2.1.3a)
This makes the Hamming space HN,q a commutative group, with the zero codeword 0(N) = 0 . . . 0 playing the role of the zero of the group. (Words also may be
multiplied which generates a powerful apparatus; see below.)
For q = 2, we have a two-point code alphabet {0, 1} that is actually a two-point field, F_2, with the following arithmetic: 0 + 0 = 1 + 1 = 0 · 1 = 1 · 0 = 0, 0 + 1 = 1 + 0 = 1 · 1 = 1. (Recall, a field is a set equipped with two commutative operations, addition and multiplication, satisfying standard axioms of associativity and distributivity.) Thus, each point in the binary Hamming space H_{N,2} is opposite to itself: x^(N) + x′^(N) = 0^(N) iff x^(N) = x′^(N). In fact, H_{N,2} is a linear space over the coefficient field F_2, with 1 · x^(N) = x^(N), 0 · x^(N) = 0^(N).
Henceforth, all additions of q-ary words are understood digit-wise and mod q.
Lemma 2.1.2  The Hamming distance on H_{N,q} is invariant under group translations:
    δ(x^(N) + z^(N), y^(N) + z^(N)) = δ(x^(N), y^(N)).                        (2.1.3b)
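The definitions above translate into a few lines of code; the sketch below (added for illustration) also checks the translation invariance (2.1.3b) on a small binary example.

```python
def hamming_distance(x, y):
    """Number of digit positions where the q-ary words x and y differ."""
    assert len(x) == len(y)
    return sum(1 for a, b in zip(x, y) if a != b)

def weight(x):
    """Weight w(x): distance between x and the zero word."""
    return sum(1 for a in x if a != 0)

# translation invariance (2.1.3b) over F_2, checked on a small example
x, y, z = (1, 0, 1, 1), (0, 0, 1, 0), (1, 1, 0, 1)
xz = tuple((a + c) % 2 for a, c in zip(x, z))
yz = tuple((b + c) % 2 for b, c in zip(y, z))
assert hamming_distance(xz, yz) == hamming_distance(x, y) \
       == weight(tuple((a + b) % 2 for a, b in zip(x, y)))
print(hamming_distance(x, y), weight(x))
```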
if it allows the receiver to correct the error string e(N) , at least when word e(N)
does not contain too many non-zero digits.
Going back to an MBSC with the row probability of error p < 1/2: the ML decoder selects a codeword x*^(N) that leads to a word e^(N) with a minimal number of unit digits. In geometric terms:
    x*^(N) = a codeword closest to the received word in the Hamming distance.        (2.1.4)
The same rule can be applied in the q-ary case: we look for the codeword closest
to the received string. A drawback of this rule is that if several codewords have the
same minimal distance from a received word we are stuck. In this case we either
choose one of these codewords arbitrarily (possibly randomly or in connection with
the messages content; this is related to the so-called list decoding), or, when a high
quality of transmission is required, refuse to decode the received word and demand
a re-transmission.
Definition 2.1.3  We call N the length of a binary code X_N, M := ♯X_N the size and ρ := (log₂ M)/N the information rate. A code X_N is said to be D-error detecting if making up to D changes in any codeword does not produce another codeword, and E-error correcting if making up to E changes in any codeword x^(N) produces a word which is still (strictly) closer to x^(N) than to any other codeword (that is, x^(N) is correctly guessed from a distorted word under the rule (2.1.4)). A code has minimal distance (or briefly distance) d if
    d = min { δ(x^(N), x′^(N)) : x^(N), x′^(N) ∈ X_N, x^(N) ≠ x′^(N) }.       (2.1.5)
The minimal distance and the information rate of a code X_N will be sometimes denoted by d(X_N) and ρ(X_N), respectively.
This definition can be repeated almost verbatim for the general case of a q-ary code X_N ⊆ H_{N,q}, with information rate ρ = (log_q M)/N. Namely, a code X_N is called E-error correcting if, for all r = 1, …, E, x^(N) ∈ X_N and y^(N) ∈ H_{N,q} with δ(x^(N), y^(N)) = r, the distance δ(y^(N), x′^(N)) > r for all x′^(N) ∈ X_N such that x′^(N) ≠ x^(N). In words, it means that making up to E errors in a codeword produces a word that is still closer to it than to any other codeword. Geometrically, this property means that the balls of radius E about the codewords do not intersect:
    B_{N,q}(x^(N), E) ∩ B_{N,q}(x′^(N), E) = ∅  for all distinct x^(N), x′^(N) ∈ X_N.
Next, a code X_N is called D-error detecting if the ball of radius D about a codeword does not contain another codeword. Equivalently, the intersection B_{N,q}(x^(N), D) ∩ X_N is reduced to the single point x^(N).
A code X_N ⊆ H_{N,q} is called linear if
    x^(N) + y^(N) ∈ X_N  for all  x^(N), y^(N) ∈ X_N                          (2.1.6a)
and λ x^(N) ∈ X_N for all λ ∈ F_q. For a linear code X, the size M is given by
M = qk where k may take values 1, . . . , N and gives the dimension of the code, i.e.
the maximal number of linearly independent codewords. Accordingly, one writes
k = dim X . As in the usual geometry, if k = dim X then in X there exists a basis
of size k, i.e. a linearly independent collection of codewords x(1) , . . . , x(k) such that
any codeword x ∈ X can be (uniquely) written as a linear combination ∑_{1≤j≤k} a_j x^(j), with coordinates
    x_i = ∑_{1≤j≤k} a_j x_i^(j),   i = 1, …, N.                              (2.1.6b)
The volume of the ball of radius R in H_{N,q} does not depend on the choice of its centre x^(N) and equals
    v_{N,q}(R) = ♯ B_{N,q}(x^(N), R) = ∑_{0≤k≤R} \binom{N}{k} (q − 1)^k;      (2.1.7)
the detecting and correcting abilities). From this point of view, it is important to
understand basic bounds for codes.
Upper bounds are usually written for M_q(N, d), the largest size of a q-ary code of length N and distance d. We begin with elementary facts: M_q(N, 1) = q^N, M_q(N, N) = q, M_q(N, d) ≤ q M_q(N − 1, d) and, in the binary case, M_2(N, 2s) = M_2(N − 1, 2s − 1) (easy exercises).
Indeed, the number of codewords cannot be too large if we want to keep a good error-detecting and error-correcting ability. There are various bounds for
parameters of codes; the simplest bound was discovered by Hamming in the late
1940s.
Theorem 2.1.6  (The Hamming bound)
(i) If X_N is an E-error correcting q-ary code of size M then
    M v_{N,q}(E) ≤ q^N.                                                       (2.1.8a)
(ii) If X_N is a q-ary [N, M, d] code then
    M v_{N,q}(⌊(d − 1)/2⌋) ≤ q^N.                                             (2.1.8b)
Proof (i) The E-balls about the codewords x(N) XN must be disjoint. Hence,
the total number of points covered equals the product vN,q (E)M which should not
exceed qN , the cardinality of the Hamming space HN,q .
(ii) Likewise, if X_N is an [N, M, d] code then, as was noted above, for E = ⌊(d − 1)/2⌋ the balls B_{N,q}(x^(N), E), x^(N) ∈ X_N, do not intersect. The volume ♯B_{N,q}(x^(N), E) is given by
    v_{N,q}(E) = ∑_{0≤k≤E} \binom{N}{k} (q − 1)^k,
and the union of balls ∪_{x^(N)∈X_N} B_{N,q}(x^(N), E) has cardinality M v_{N,q}(E) ≤ q^N.
partition has an additional advantage: the code not only corrects errors, but never
leads to a refusal of decoding. More precisely:
Definition 2.1.7  An E-error correcting code X_N of size ♯X_N = M is called perfect when the equality is achieved in the Hamming bound:
    M = q^N / v_{N,q}(E).
If a code X_N is perfect, every word y^(N) ∈ H_{N,q} belongs to a (unique) ball B_{N,q}(x^(N), E) about a codeword. That is, we are always able to decode y^(N) by a codeword: this leads to the correct answer if the number of errors is ≤ E, and to a wrong answer if it is > E. But we never get stuck in the course of decoding.
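The Hamming bound and the perfection of particular parameter sets are easy to verify numerically; the short sketch below checks the [7, 4] Hamming parameters with E = 1 and the binary Golay parameters N = 23, M = 2¹², E = 3.

```python
from math import comb

def volume(N, q, E):
    """v_{N,q}(E): number of q-ary words within Hamming distance E of a fixed word."""
    return sum(comb(N, k) * (q - 1) ** k for k in range(E + 1))

def hamming_bound(N, q, E):
    """Largest M allowed by M * v_{N,q}(E) <= q**N."""
    return q ** N // volume(N, q, E)

# binary Hamming [7,4] code: M = 16 attains the bound with E = 1, hence perfect
print(volume(7, 2, 1), hamming_bound(7, 2, 1))      # 8, 16
# binary Golay parameters: N = 23, E = 3, M = 2**12
print(volume(23, 2, 3), hamming_bound(23, 2, 3))    # 2048, 4096
```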
The problem of finding perfect binary codes was solved about 20 years ago.
These codes exist only for
(a) E = 1: here N = 2^l − 1, M = 2^{2^l − 1 − l}, and these codes correspond to the so-called Hamming codes;
(b) E = 3: here N = 23, M = 2^{12}; they correspond to the so-called (binary) Golay code.
Both the Hamming and Golay codes are discussed below. The Golay code is
used (together with some modifications) in the US space programme: already in
the 1970s the quality of photographs encoded by this code and transmitted from
Mars and Venus was so excellent that it did not require any improving procedure.
In the former Soviet Union space vessels (and early American ones) other codes
were also used (and we also discuss them later): they generally produced lowerquality photographs, and further manipulations were required, based on statistics
of the pictorial images.
If we consider non-binary codes then there exists one more perfect code, for
three symbols (also named after Golay).
We will now describe a number of straightforward constructions producing new
codes from existing ones.
Example 2.1.8
(i) Extension: You add a digit xN+1 to each codeword x(N) = x1 . . . xN from
code X_N, following an agreed rule. Viz., the so-called parity-check extension requires that x_{N+1} + ∑_{1≤j≤N} x_j = 0 in the alphabet field F_q. Clearly, the
(N)
) : x(N) , x
(N)
XN ].
(v) Shortening: Take all codewords x^(N) ∈ X_N with the ith digit 0, say, and delete this digit (shortening on x_i = 0). In this way the original binary linear [N, M, d] code X_N is reduced to a binary linear code X^{sh,0}_{N−1}(i) of length N − 1, whose size can be M/2 or M and distance ≥ d or, in a trivial case, 0.
(vi) Repetition: Repeat each codeword x(= x^(N)) ∈ X_N a fixed number of times, say m, producing a concatenated (Nm)-word x x … x. The result is a code X^{re}_{Nm}, of length Nm and distance d(X^{re}_{Nm}) = m d(X_N).
(vii) Direct sum: Given two codes X_N and X′_{N′}, form a code X ⊕ X′ = {x x′ : x ∈ X, x′ ∈ X′}. Both the repetition and direct-sum constructions are not very effective and neither is particularly popular in coding (though we will return to these constructions in examples and problems). A more effective construction is
(viii) The bar-product (x|x + x′): For the [N, M, d] and [N, M′, d′] codes X_N and X′_N define a code X_N|X′_N of length 2N as the collection
    { x(x + x′) : x(= x^(N)) ∈ X_N, x′(= x′^(N)) ∈ X′_N }.
That is, each codeword in X|X′ is a concatenation of a codeword from X_N and its sum with a codeword from X′_N (formally, neither of X_N, X′_N in this construction is supposed to be linear). The resulting code is denoted by X_N|X′_N; it has size
    ♯(X_N|X′_N) = ♯X_N · ♯X′_N.
A useful exercise is to check that the distance
    d(X_N|X′_N) = min[ 2 d(X_N), d(X′_N) ].
(ix) The dual code: The concept of duality is based on the inner dot-product in the space H_{N,q} (with q = p^s): for x = x_1 … x_N and y = y_1 … y_N,
    ⟨ x^(N) · y^(N) ⟩ = x_1 y_1 + ⋯ + x_N y_N,
which yields a value from the field F_q. For a linear [N, k] code X_N its dual, X_N^⊥, is a linear [N, N − k] code defined by
    X_N^⊥ = { y^(N) ∈ H_{N,q} : ⟨ x^(N) · y^(N) ⟩ = 0 for all x^(N) ∈ X_N }.      (2.1.9)
Clearly, (X_N^⊥)^⊥ = X_N. Also dim X_N + dim X_N^⊥ = N. A code is called self-dual if X_N = X_N^⊥.
Worked Example 2.1.9
(a) Prove that if the distance d of an [N, M, d] code XN is an odd number then the
code may be extended to an [N + 1, M] code X + with distance d + 1.
(b) Show that an E-error correcting code X_N can be extended to a code X^+ that detects 2E + 1 errors.
(c) Show that the distance of a perfect binary code is an odd number.
Solution  (a) By adding the digit x_{N+1} to the codewords x = x_1 … x_N of an [N, M] code X_N so that x_{N+1} = ∑_{1≤j≤N} x_j, we obtain an [N + 1, M] code X^+. If the distance between two codewords was even it is unchanged; if it was odd it increases by 1, so d(X^+) = d + 1 when d is odd.
Next, a putative perfect binary code with N = 90, M = 2^{78} and E = 2 would fit the Hamming bound:
    v_{90,2}(2) = 1 + 90 + (90 · 89)/2 = 4096 = 2^{12}
and
    M v_{90,2}(2) = 2^{78} · 2^{12} = 2^{90} = 2^N.
However, such a code does not exist. Assume that it exists and that the zero word 0 = 0 … 0 is a codeword. The code must have d = 5. Consider the 88 words with
three non-zero digits, with 1 in the first two places:
1110 . . . 00 ,
1101 . . . 00 ,
... ,
110 . . . 01 .
(2.1.10)
Each of these words should be at distance ≤ 2 from a unique codeword. Say, the
codeword for 1110 … 00 must contain 5 non-zero digits. Assume that it is
111110 . . . 00.
This codeword is at distance 2 from two other subsequent words,
11010 . . . 00
and
11001 . . . 00 .
Continuing with this construction, we see that any word from list (2.1.10) is attracted to a codeword with 5 non-zero digits, along with two other words from
(2.1.10). But 88 is not divisible by 3.
Let us continue with bounds on codes.
Theorem 2.1.11  (The Gilbert–Varshamov (GV) bound)  For any q ≥ 2 and d ≥ 2, there exists a q-ary [N, M, d] code X_N such that
    M = ♯X_N ≥ q^N / v_{N,q}(d − 1).                                          (2.1.11)
Proof  Consider a code of maximal size among the codes of minimal distance ≥ d and length N. Then any word y^(N) ∈ H_{N,q} must be within distance d − 1 of some codeword: otherwise we could add y^(N) to the code without decreasing the minimal distance. Hence, the balls of radius d − 1 about the codewords cover the whole Hamming space H_{N,q}. That is, for the code of maximal size, X_N^max,
    ♯X_N^max · v_{N,q}(d − 1) ≥ q^N.
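The maximality argument above is constructive in spirit: greedily adding any word at distance ≥ d from all words chosen so far produces a code meeting the GV bound. A small illustration (the parameters N = 8, d = 3 are arbitrary):

```python
from itertools import product
from math import comb

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def greedy_code(N, d, q=2):
    """Greedily pick words pairwise at distance >= d; by maximality the GV bound holds."""
    code = []
    for w in product(range(q), repeat=N):
        if all(hamming(w, c) >= d for c in code):
            code.append(w)
    return code

N, d, q = 8, 3, 2
code = greedy_code(N, d, q)
v = sum(comb(N, k) * (q - 1) ** k for k in range(d))   # v_{N,q}(d-1)
print(len(code), ">=", q ** N / v)                     # GV guarantee q^N / v_{N,q}(d-1)
```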
As was listed before, there are ways of producing one code from another (or from a collection of codes). Let us apply truncation and drop the last digit x_N in each codeword x^(N) from an original code X_N. If code X_N had minimal distance d ≥ 2, the truncated code has distance ≥ d − 1 and the same size; iterating this d − 1 times produces a code of length N − d + 1 and size M. This proves the Singleton bound: any q-ary code X_N with minimal distance d has size
    M = ♯X_N ≤ M_q(N, d) ≤ q^{N−d+1}.                                         (2.1.12)
A code attaining equality in the Singleton bound,
    d = N − log_q M + 1,                                                      (2.1.13)
is called maximum distance separable (MDS). We will see below that, similarly to perfect codes, the family of the MDS codes is rather thin.

Corollary 2.1.14  For all q, N and d ≥ 2,
    q^N / v_{N,q}(d − 1) ≤ M_q(N, d) ≤ min [ q^N / v_{N,q}(⌊(d − 1)/2⌋), q^{N−d+1} ].     (2.1.14)
From now on we will omit indices N and (N) whenever it does not lead to confusion. The upper bound in (2.1.14) becomes too rough when d is comparable with N/2. Say, in the case of a binary [N, M, d] code with N = 10 and d = 5, expression (2.1.14) gives the upper bound M_2(10, 5) ≤ 18, whereas in fact there is no such code with M ≥ 13, but there exists a code with M = 12. The codewords of the latter are as follows:
    0000000000, 1111100000, 1001011010, 0100110110,
    1100001101, 0011010101, 0010011011, 1110010011,
    1001100111, 1010111100, 0111001110, 0101111001.
The lower bound gives in this case the value 2 (as 2^10/v_{10,2}(4) ≈ 2.65) and is also far from being satisfactory. (Some better bounds will be obtained below.)
Theorem 2.1.15  (The Plotkin bound)  For a binary code X of length N and distance d with N < 2d, the size M obeys
    M = ♯X ≤ 2 ⌊ d/(2d − N) ⌋.                                                (2.1.15)
Proof  The minimal distance cannot exceed the average distance, i.e.
    M(M − 1) d ≤ ∑_{x∈X} ∑_{x′∈X} δ(x, x′).
List the codewords as rows of an M × N matrix. If column i of this matrix contains s_i zeros and M − s_i ones then
    ∑_{x∈X} ∑_{x′∈X} δ(x, x′) = 2 ∑_{1≤i≤N} s_i (M − s_i).                     (2.1.16)
Since s_i(M − s_i) ≤ M²/4, this yields M(M − 1)d ≤ NM²/2, i.e. M(2d − N) ≤ 2d and, as
    2d/(2d − N) − N/(2d − N) = 1,
we get M ≤ 2d/(2d − N) ≤ 2⌊d/(2d − N)⌋ + 1; a parity argument then improves this to (2.1.15).

Theorem 2.1.16  For all positive integers d and N,
    M_2(N, 2d − 1) = M_2(N + 1, 2d)                                           (2.1.17)
and
    M_2(N, d) ≤ 2 M_2(N − 1, d).                                              (2.1.18)
Proof  To prove (2.1.17) let X be a code of length N, distance 2d − 1 and size M_2(N, 2d − 1). Take its parity-check extension X^+: that is, add the digit x_{N+1} to each codeword x = x_1 … x_N so that ∑_{i=1}^{N+1} x_i = 0. The extended code has length N + 1, the same size and distance 2d.
Turning to the proof of (2.1.18), given an [N, d] code, divide the codewords into two classes: those ending with 0 and those ending with 1. One class must contain at least half of the codewords. Hence the result.
Corollary 2.1.17
(2.1.19)
and
M2 (2d, d) 4d.
(2.1.20)
d +1
2
2d + 1 N
B
(2.1.21)
and
M2 (2d + 1, d) 4d + 4.
(2.1.22)
Proof Inequality (2.1.19) follows from (2.1.17), and (2.1.20) follows from
(2.1.18) and (2.1.19): if d = 2d then
M2 (4d , 2d ) = 2M2 (4d 1, 2d ) 8d = 4d.
Furthermore, (2.1.21) follows from (2.1.17):
M2 (N, d)
M2 (N + 1, d + 1)
B
d +1
2
.
2d + 1 N
Worked Example 2.1.18  Set θ = (q − 1)/q. Prove the q-ary Plotkin bound:
    M_q(N, d) ≤ d/(d − θN),   if d > θN.                                       (2.1.23)
Solution  Given a q-ary [N, M, d] code X_N, observe that the minimal distance d is bounded by the average distance:
    d ≤ S/(M(M − 1)),  where S = ∑_{x∈X} ∑_{x′∈X} δ(x, x′).
As before, let k_ij denote the number of letters j ∈ {0, …, q − 1} in the ith position in all codewords from X, i = 1, …, N. Then, clearly, ∑_{0≤j≤q−1} k_ij = M and the contribution of position i to S equals
    ∑_{0≤j≤q−1} k_ij (M − k_ij) = M² − ∑_{0≤j≤q−1} k_ij² ≤ M² − M²/q,
as the quadratic function (u_1, …, u_q) ↦ ∑_{1≤j≤q} u_j² achieves its minimum on the set {u = u_1 … u_q : u_j ≥ 0, ∑_j u_j = M} at u_1 = ⋯ = u_q = M/q. Summing over all N digits, we obtain, with θ = (q − 1)/q,
    M(M − 1) d ≤ M² θ N,
which yields the bound M ≤ d(d − θN)^{−1}. The proof is completed as in the binary case.
There exists a substantial theory related to the equality in the Plotkin bound
(Hadamard codes) but it will not be discussed in this book. We would also like
to point out the fact that all bounds established so far (Hamming, Singleton, GV
and Plotkin) hold for codes that are not necessarily linear. As far as the GV bound
is concerned, one can prove that it can be achieved by linear codes: see Theorem
2.3.26.
Worked Example 2.1.19 Prove that a 2-error correcting binary code of length
10 can have at most 12 codewords.
Solution The distance of the code must be 5. Suppose that it contains M codewords and extend it to an [11, M] code of distance 6. The Plotkin bound works as
follows. List all codewords of the extended code as rows of an M × 11 matrix. If column i in this matrix contains s_i zeros and M − s_i ones then
    6(M − 1)M ≤ ∑_{x∈X^+} ∑_{x′∈X^+} δ(x, x′) = 2 ∑_{i=1}^{11} s_i (M − s_i).          (2.1.24)
    lim_{N→∞} (1/N) log v_{N,2}(R_N) = −τ log τ − (1 − τ) log(1 − τ)  whenever R_N/N → τ ∈ (0, 1/2].     (2.1.25)
Solution  Since \binom{N}{R} ≤ v_{N,2}(R) ≤ (R + 1)\binom{N}{R} for R ≤ N/2, Stirling's formula gives
    −(R/N) log(R/N) − (1 − R/N) log(1 − R/N) − O(log N)/N ≤ (1/N) log v_{N,2}(R)
        ≤ (1/N) log(R + 1) + the LHS.
The limit R/N → τ yields the result. The case where R = Nτ is considered in a similar manner.
Worked Example 2.1.20 is useful in the study of the asymptotics of
    α(N, δ) = (1/N) log M_2(N, ⌊Nδ⌋),                                         (2.1.26)
the information rate for the maximum size of a code correcting ⌊Nδ/2⌋ and detecting ⌊Nδ⌋ errors (i.e. a linear portion of the total number N of digits). Set
    a(δ) := lim inf_{N→∞} α(N, δ) ≤ lim sup_{N→∞} α(N, δ) =: a̅(δ).           (2.1.27)
Theorem 2.1.21  With η(δ) = −δ log δ − (1 − δ) log(1 − δ),
    a̅(δ) ≤ 1 − η(δ/2),   0 ≤ δ ≤ 1/2   (Hamming),                             (2.1.28)
    a̅(δ) ≤ 1 − δ,          0 ≤ δ ≤ 1/2   (Singleton),                          (2.1.29)
    a(δ) ≥ 1 − η(δ),       0 ≤ δ ≤ 1/2   (GV),                                 (2.1.30)
    a̅(δ) = 0,              1/2 ≤ δ ≤ 1   (Plotkin).                            (2.1.31)
By using more elaborate bounds (also due to Plotkin), we'll show in Problem 2.10 that
    a̅(δ) ≤ 1 − 2δ,   0 ≤ δ ≤ 1/2.                                              (2.1.32)
The proof of Theorem 2.1.21 is based on a direct inspection of the abovementioned bounds; for Hamming and GV bounds it is carried in Worked Example
2.1.22 later.
Figure 2.2  Asymptotic bounds: Plotkin, Singleton, Hamming and Gilbert–Varshamov.
Figure 2.2 shows the behaviour of the bounds established. Good sequences of codes are those for which the pair (δ, α(N, δ)) is asymptotically confined to the domain between the curves indicating the asymptotic bounds. In particular, a good code should lie above the curve emerging from the GV bound. Constructing such sequences is a difficult problem: the first examples achieving the asymptotic GV bound appeared in 1973 (the Goppa codes, based on ideas from algebraic geometry). All families of codes discussed in this book produce values below the GV curve (in fact, they yield a(δ) = 0 in the limit), although these codes demonstrate quite impressive properties for particular values of N, M and d.
As to the upper bounds, the Hamming and Plotkin compete against each other,
while the Singleton bound turns out to be asymptotically insignificant (although
it is quite important for specific values of N, M and d). There are about a dozen
various other upper bounds, some of which will be discussed in this and subsequent
sections of the book.
The GilbertVarshamov bound itself is not necessarily optimal. Until 1982 there
was no better lower bound known (and in the case of binary coding there is still no
better lower bound known). However, if the alphabet used contains q 49 symbols
where q = p2m and p 7 is a prime number, there exists a construction, again based
on algebraic geometry, which produces a different lower bound and gives examples of (linear) codes that asymptotically exceed, as N , the GV curve [159].
Moreover, the TVZ construction carries a polynomial complexity. Subsequently,
two more lower bounds were proposed: (a) Elkies' bound, for q = p^{2m} + 1; and (b) Xing's bound, for q = p^m [43, 175]; see N. Elkies, 'Excellent codes from modular curves'. Manipulating with different coding constructions, the GV bound can also be improved for other alphabets.
Worked Example 2.1.22 Prove bounds (2.1.28) and (2.1.30) (that is, those parts
of Theorem 2.1.21 related to the asymptotical Hamming and GV bounds).
Solution  Picking up the Hamming and the GV parts in (2.1.14), we have
    2^N / v_{N,2}(d − 1) ≤ M_2(N, d) ≤ 2^N / v_{N,2}(⌊(d − 1)/2⌋).            (2.1.33)
The lower bound for the Hamming volume is trivial:
    v_{N,2}(⌊(d − 1)/2⌋) ≥ \binom{N}{⌊(d − 1)/2⌋}.
For the upper bound, observe that with d/N < 1/2,
d1i
d 1
N
i
N
d
+
1
0id1
d1i
N
1
N
.
d 1
1 2 d 1
0id1 1
vN,2 (d 1)
Then, for the information rate (log M2 (N, d)) N,
N
1
1
1 log
N
1 2 d 1
N
1
1
(2.1.34)
162
where
(2.1.35)
(q) (N, ) =
1
log Mq (N, N)
N
(2.1.36)
Theorem 2.1.24
(2.1.37)
(2.1.38)
a ( ) 1
(2.1.39)
(q)
(Singleton),
a(q) ( ) 1 (q) ( )
(GV),
a ( ) max[1 / , 0] (Plotkin).
(q)
(2.1.40)
(2.1.41)
Of course, the minimum of the right-hand sides of (2.1.38), (2.1.39) and (2.1.41)
provides the better of the three upper bounds. We omit the proof of Theorem 2.1.24,
leaving it as an exercise that is a repetition of the argument from Worked Example
2.1.22.
Example 2.1.25 Prove bounds (2.1.38) and (2.1.40), by modifying the solution
to Worked Example 2.1.22.
The capacity of the MBSC equals
    C = 1 + p log p + (1 − p) log(1 − p),                                      (2.2.1)
the channel matrix being
    Π = (  1−p   p  )
        (   p   1−p ).                                                         (2.2.2)
That is, we assume that the channel transmits a letter correctly with probability 1 − p and reverses it with probability p, independently for different letters.
In Theorem 2.2.1, it is asserted that there exists a sequence of one-to-one coding maps fn , for which the task of decoding is reduced to guessing the codewords
fn (u) HN . In other words, the theorem guarantees that for all R < C there exists a
sequence of subsets XN HN with XN 2NR for which the probability of incorrect guessing tends to 0, and the exact nature of the coding map fn is not important.
Nevertheless, it is convenient to keep the map fn firmly in sight, as the existence
will follow from a probabilistic construction (random coding) where sample coding maps are not necessarily one-to-one. Also, the decoding rule is geometric: upon
receiving a word a(N) HN , we look for the nearest codeword fn (u) XN . Consequently, an error is declared every time such a codeword is not unique or is a result
of multiple encodings or simply yields a wrong message. As we saw earlier, the
geometric decoding rule corresponds with the ML decoder when the probability
p (0, 1/2). Such a decoder enables us to use geometric arguments constituting
the core of the proof.
Again as in Section 1.4, the new proof of the direct part of the SCT/NCT only
guarantees the existence of good codes (and even their proliferation) but gives
no clue on how to construct such codes [apart from running again a random coding
scheme and picking its typical realisation].
In the statement of the SCT/NCT given below, we deal with the maximum errorprobability (2.2.4) rather than the averaged one over possible messages. However,
a large part of the proof is still based on a direct analysis of the error-probabilities
averaged over the codewords.
Theorem 2.2.1 (The SCT/NCT, the direct part) Consider an MBSC with channel
matrix as in (2.2.2), with 0 p < 1/2, and let C be as in (2.2.1). Then for any
164
(2.2.3)
Example 2.2.2
Then we can embed the blocks from A L in the alphabet A n and so encode the
blocks. The transmission rate is log |A L | n/R 0.577291. As was mentioned,
the SCT tells us that there are such codes but gives no idea of how to find (or
construct) them, which is difficult to do.
Before we embark on the proof of Theorem 2.2.1, we would like to explore
connections between the geometry of Hamming space HN and the
randomness
generated by the channel. As in Section 1.4, we use the symbol P | fn (u) as a
shorthand for Pch | fn (u) sent . The
expectation and
variance under this distribution will be denoted by E( | fn (u) and Var( | fn (u) .
165
Observe that, under distribution P | fn (u) , the number of distorted digits in
the (random) received word Y(N) can be written as
N
j=1
This is a random variable which has a binomial distribution Bin(N, p), with the
mean value
N
(N)
E 1(digit j in Y = digit j in fn (u)) fn (u)
j=1
N
= E 1(digit j in Y(N) = digit j in fn (u))| fn (u) = N p,
j=1
Var
1(digit j in
j=1
N
Y(N)
digit j in fn (u)) fn (u)
=
= Var 1(digit j in Y(N) = digit j in fn (u))| fn (u) = N p(1 p).
j=1
Then, by Chebyshev's inequality, for any given ε ∈ (0, 1 − p) and positive integer N > 1/ε, the probability that at least N(p + ε) − 1 digits have been distorted, given that the codeword f_n(u) has been sent, satisfies
    P( ≥ N(p + ε) − 1 digits distorted | f_n(u) ) ≤ p(1 − p) / ( N(ε − 1/N)² ).         (2.2.5)
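The estimate (2.2.5) is nothing but Chebyshev's inequality for a Binomial(N, p) number of distorted digits; a quick numerical comparison with the exact tail (the parameters below are illustrative):

```python
from math import comb

def binom_tail(N, p, k):
    """P(at least k of N digits distorted) for independent flips with probability p."""
    return sum(comb(N, j) * p ** j * (1 - p) ** (N - j) for j in range(k, N + 1))

N, p, eps = 500, 0.1, 0.05
k = int(N * (p + eps)) - 1                            # threshold used in (2.2.5)
chebyshev = p * (1 - p) / (N * (eps - 1 / N) ** 2)
print(f"exact tail = {binom_tail(N, p, k):.3e} <= Chebyshev bound = {chebyshev:.3e}")
```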
Proofof Theorem 2.2.1. Throughout the proof, we follow the set-up from (2.2.3).
Subscripts n and N will be often omitted; viz., we set
2n = M.
We will assume the ML/geometric decoder without any further mention. Similarly
to Section 1.4, we identify the set of source messages Un with Hamming space Hn .
As proposed by Shannon, we use again a random encoding. More precisely, a message u Hn is mapped to a random codeword Fn (u) HN , with IID digits taking
values 0 and 1 with probability 1/2 and independently of each other. In addition,
we make codewords Fn (u) independent for different messages u Hn ; labelling
the strings from Hn by u(1), . . . , u(M) (in no particular order) we obtain a family of IID random strings Fn (u(1)), . . . , Fn (u(M)) from HN . Finally, we make the
codewords independent of the channel. Again, in analogy with Section 1.4, we can
think of the random code under consideration as a random megastring/codebook
from HNM = {0, 1}NM with IID digits 0, 1 of equal probability. Every given sample
f (= fn ) of this random codebook (i.e. any given megastring from HNM ) specifies
166
f
f(u(1))
f( u(2))
f (u(r))
Figure 2.3
1
2NM
(2.2.6)
eave (Fn ),
M 1iM
Then the expected average error-probability is given by
1
eave f .
E n eave (Fn ) = NM
2
f (u(1)),..., f (u(M))H
(2.2.9)
Relation (2.2.9) implies (again in a manner similar to Section 1.4) that there
exists a sequence of deterministic codes fn such that the average error-probability
eave ( fn ) = eave ( fn (u(1)), . . . , fn (u(2n ))) obeys
lim eave ( fn ) = 0.
(2.2.10)
167
fn (u(i))
decoding
fn (u(i)) sent
N
...
..
a
a received
(a, m)
Figure 2.4
4
p(1 p)
M 1 3
+
v
)
,
N(p
+
N
N( 1/N)2
2N
(2.2.12)
where vN (b) stands for the number of points in the ball of radius b in the binary
Hamming space HN .
Proof Set m(= mN (p, )) := N(p + ). The ML decoder definitely returns the
codeword fn (u(i)) sent through the channel when fn (u(i)) is the only codeword in
168
the Hamming ball BN (y, m) around the received word y = y(N) HN (see
Figure 2.4). In any other situation (when fn (u(i)) BN (y, m) or fn (u(k))
BN (y, m) for some k = i) there is a possibility of error.
Hence,
P error while using codebook f | fn (u(i))
P y| fn (u(i)) 1 fn (u(i)) BN (y, m)
(2.2.13)
yHN
+ P(z| fn (u(i))) 1 fn (u(k)) BN (z, m) .
zHN
k=i
(2.2.14)
by virtue of (2.2.5). Observe that since the RHS in (2.2.14) does not depend on the
choice of the sample code f , the bound (2.2.14) will hold when we take first the
1
average
and then expectation E n .
M 1iM
The second sum in the RHS of (2.2.13) is more tricky: it requires averaging and
taking expectation. Here we have
En
over HN , the expectations
En P(z|Fn (u(i))) and En 1 Fn (u(k)) BN (z, m) can be calculated as
1
En P(z|Fn (u(i))) = N
2
and
P(z|x)
(2.2.16a)
xHN
vN (m)
.
En 1 Fn (u(k)) BN (z, m) =
2N
(2.2.16b)
169
zHN xHN
P(z|x) =
P(z|x) = 2N .
(2.2.17)
xHN zHN
M 1iM k=i 2N
(2.2.18)
vN (m)M(M 1) (M 1)vN (m)
=
=
.
2N M
2N
Collecting (2.2.12)(2.2.18) we have that EN [eave (Fn )] does not exceed the RHS
of (2.2.12).
the RHS of (2.2.15) =
At the next stage we estimate the volume vN (m) in terms of entropy h(p + )
where, recall, m = N(p + ). The argument here is close to that from Section 1.4
and based on the following result.
Lemma 2.2.5 Suppose that 0 < p < 1/2, > 0 and positive integer N satisfy
p + + 1/N < 1/2. Then the following bound holds true:
vN (N(p + )) 2N (p+ ) .
(2.2.19)
The proof of Lemma 2.2.5 will be given later, after Worked Example 2.2.7. For
the moment we proceed with the proof of Theorem 2.2.1. Recall, we want to establish (2.2.7). In fact, if p < 1/2 and R < C = 1 (p) then we set = C R > 0
and take > 0 so small that (i) p + < 1/2 and (ii) R + /2 < 1 (p + ). Then
we take N so large that (iii) N > 2/ . With this choice of and N, we have
1
> and R 1 + (p + ) < .
N
2
2
Then, starting with (2.2.12), we can write
(2.2.20)
4p(1 p) 2NR N (p+ )
EN e(Fn )
+ N 2
N 2
2
(2.2.21)
4
N /2
<
p(1
p)
+
2
.
N 2
This implies (2.2.7) and hence the existence of a sequence of codes fn : Hn HN
obeying (2.2.10).
To finish the proof of Theorem 2.2.1, we deduce (2.2.4) from (2.2.7), in the form
of Lemma 2.2.6:
170
(i) For all R (0,C), there exist codes fn with lim emax ( fn ) = 0.
n
(ii) For all R (0,C), there exist codes fn such that lim eave ( fn ) = 0.
n
Proof of Lemma 2.2.6. It is clear that assertion (i) implies (ii). To deduce (i) from
(ii), take R < C and set for N big enough
R = R +
1
< C, n = NR , M = 2n .
N
(2.2.22)
Here and below, M = 2NR and fn (u(1)), . . . , fn (u(M )) are the codewords for
source messages u(1), . . . , u(M ) Hn .
Instead of P error while using fn | fn (u(i)) , we write P fn -error| fn (u(i)) ,
for brevity. Now, at least half of summands P fn -error| fn (u(i)) in the RHS of
(2.2.23) must be < 2eave ( fn ). Observe that, in view of (2.2.22),
M /2 2NR1 .
(2.2.24)
Solution Write
*
+
vN (m) = points at distance m from 0 in HN =
171
N
k .
0km
<
, for 0 k < m.
1
1
Then, for 0 k < m, the product
k
=
(1 )N
1
m
>
(1 )N = m (1 )Nm .
1
k (1 )Nk
Hence,
N
N
k (1 )Nk >
k (1 )Nk
k
k
0kN
0km
N
m
Nm
= vN (m) m (1 )Nm
> (1 )
0km k
1 =
mmk
(N m + 1) (N k)
=
172
(m + 1) k
(N m)km
which is again 1 as the product of 2(k m) factors 1. Thus, the ratio pk /pm
1, and the desired bound follows.
We are now in position to prove Lemma 2.2.5.
Proof of Lemma 2.2.5 First, p + < 1/2 implies that m = N(p + ) < N/2 and
:=
m N(p + )
=
< p + ,
N
N
In other words, the noise acts on each symbol xi of the input string x independently,
and P(y|x) is the probability of having an output symbol y given that the input
symbol is x.
173
Symbol x runs over Ain , an input alphabet of a given size q, and y belongs to
Aout , an output alphabet of size r. Then probabilities P(y|x) form a q r stochastic
matrix (the channel matrix). A memoryless channel is called symmetric if the rows
of this matrix are permutations of each other, i.e. contain the same collection of
probabilities, say p1 , . . . , pr . A memoryless symmetric channel is said to be doublesymmetric if the columns of the channel matrix are also permutations of each other.
If m = n = 2 (typically, Ain = Aout = {0, 1}) a memoryless channel is called binary.
For a memoryless binary symmetric channel, the channel matrix entries P(y|x)
are P(0|0) = P(1|1) = 1 p, P(1|0) = P(0|1) = p, p (0, 1) being the flipping
probability and 1 p the probability of flawless transmission of a single binary
symbol.
A channel is characterised by its capacity: the value C 0 such that:
(a) for all R < C, R is a reliable transmission rate; and
(b) for all R > C, R is an unreliable transmission rate.
Here R is called a reliable transmission rate if there exists a sequence of codes
fn : Hn HN and decoding rules fN : HN Hn such that n NR and the (suitably defined) probability of error
e( fn , fN ) 0, as N .
In other words,
C = lim
1
log MN
N
where I(X : Y ) is the mutual information between a (random) input symbol X and
the corresponding output symbol Y , and the maximum is over all possible probability distributions pX of X.
Now in the case of a memoryless symmetric channel (MSC), the above maximisation procedure applies to the output symbols only:
C = max h(Y ) + pi log pi ;
pX
1ir
the sum pi log pi being the entropy of the row of channel matrix (P(y|x)). For
i
174
n(1,t 1)
n(1,t)
,
=A
n(1,t 2)
n(1,t 1)
1 1
A=
.
1 0
1
5+1
.
= lim log n(1,t) + n(0,t) = log
t t
2
C = lim
175
Remark 2.2.9
obeys
lim sup max ( fn , fN ) = 1.
(2.2.26b)
Proof As in Section 1.4, we can assume that codes fn are one-to-one and obey
fN ( fn (u)) = u, for all u Hn (otherwise, the chances of erroneous decoding will
be even larger). Assume the opposite of (2.2.26b):
(2.2.27)
Our aim is to deduce from (2.2.27) that R C. As before, set n = NR and let
fn (u(i)) be the codeword for string u(i) Hn , i = 1, . . . , 2n . Let Di HN be the set
of binary strings where fN returns fn (u(i)): fN (a) = fn (u(i)) if and only if a Di .
Then Di ! f (u(i)), sets Di are pairwise disjoint, and if the union i Di = HN then
on the complement HN \ i Di the channel declares an error. Set si = Di , the size
of set Di .
Our first step is to improve the decoding rule, by making it closer to the ML
rule. In other words, we want to replace each Di with a new set, Ci HN , of the
same cardinality Ci = si , but of a more rounded shape (i.e. closer to a Hamming
176
ball B( f (u(i)), bi )). That is, we look for pairwise disjoint sets Ci , of cardinalities
Ci = si , satisfying
BN ( f (u(i)), bi ) Ci BN ( f (u(i)), bi + 1), 1 i 2n ,
(2.2.28)
(2.2.30)
Then, clearly,
(2.2.31)
Next, suppose that there exists p < p such that, for any N large enough,
bi + 1 N p for some 1 i 2n .
(2.2.32)
Then, by virtue of (2.2.28) and (2.2.31), with Cic standing for the complement
HN \ Ci ,
P(at least N p digits distorted| fn (u(i)))
P(at least bi + 1 digits distorted| fn (u(i)))
P(Cic | fn (u(i))) max ( fn , gN ) c.
This would lead to a contradiction, since, by the law of large numbers, as N ,
the probability
P(at least N p digits distorted | x sent) 1
uniformly in the choice of the input word x HN . (In fact, this probability does
not depend on x HN .)
Thus, we cannot have p (0, p) such that, for N large enough, (2.2.32) holds
true. That is, the opposite is true: for any given p (0, p), we can find an arbitrarily
large N such that
bi > N p ,
for all i = 1, . . . , 2n .
(2.2.33)
177
(As we claim (2.2.33) for all p (0, p), it does not matter if in the LHS of (2.2.33)
we put bi or bi + 1.)
At this stage we again use the explicit expression for the volume of the Hamming
ball:
N
N
si = Di = Ci vN (bi ) =
j
bi
0 jbi
N
, provided that bi > N p .
(2.2.34)
N p
A useful bound has been provided in Worked Example 2.2.7 (see (2.2.25)):
R
1
2N N .
(2.2.35)
vN (R)
N +1
We are now in a position to finish the proof of Theorem 2.2.10. In view of
(2.2.35), we have that, for all p (0, p), we can find an arbitrarily large N such
that
si 2N( (p )N ) ,
for all 1 i 2n ,
with lim N = 0. As the original sets D1 , . . . , D2n are disjoint, we have that
N
NR
1
1, implying that R 1 (p ) + N + .
N
N
As N , the RHS tends to 1 (p). So, given any p (0, p), R 1 (p ).
This is true for all p < p, hence R 1 (p) = C. This completes the proof of
Theorem 2.2.10.
(p ) N +
yHN
Ky = MvN,q (s)
(2.2.36)
178
as each word x falls in vN,q (s) of the balls BN,q (y, s).
Lemma 2.2.11 If X is a q-ary [N, M] code then for all s = 1, . . . , N there
exists a ball BN,q (y, s) about an N -word y HN,q with the number Ky = X
BN,q (y, s) of codewords in BN,q (y, s) obeying
Ky MvN,q (s)/qN .
(2.2.37)
1
Ky gives the average numqN y
ber of codewords in ball BN,q (y, s). But there must be a ball containing at least as
many as the average number of codewords.
Proof Divide both sides of (2.2.36) by qN . Then
A ball BN,q (y, s) with property (2.2.37) is called critical (for code X ).
Theorem 2.2.12 (The Elias bound) Set = (q 1)/q. Then for all integers s 1
such that s < N and s2 2 Ns + Nd > 0, the maximum size Mq (N, d) of a q-ary
code of length N and distance d satisfies
Mq (N, d)
qN
Nd
.
s2 2 Ns + Nd vN,q (s)
(2.2.38)
Proof Fix a critical ball BN,q (y, s) and consider code X obtained by subtracting
word y from the codewords of X : X = {x y : x X }. Then X is again an
[N, M, d] code. So, we can assume that y = 0 and BN,q (0, s) is a critical ball.
Then take X1 = X BN,q (0, s) = {x X : w(x) s}. The code X1 is [N, K, e]
where e d and K (= K0 ) MvN,q (s)/qN . As in the proof of the Plotkin bound,
consider the sum of the distances between the codewords in X1 :
S1 =
xX1 x X1
(x, x ).
Again, we have that S1 K(K 1)e. On the other hand, if ki j is the number of
letters j Jq = {0, . . . , q 1} in the ith position in all codewords x X1 then
S1 =
ki j (K ki j ).
1iN 0 jq1
0 jq1
S = NK
2
1iN
2
+
ki0
1 jq1
ki2j
179
S
NK 2
1iN
2
ki j
1 jq1
1
(K ki0 )2 .
q1
1
2
(K ki0 )
q1
2 + K 2 2Kk + k2
(q 1)ki0
i0
i0
2 +
ki0
1
q 1 1iN
1
= NK 2
(qk2 + K 2 2Kki0 )
q 1 1iN i0
N
q
2 + 2 K
K2
= NK 2
ki0
ki0
q1
q 1 1iN
q 1 1iN
q2
q
2 + 2 KL,
NK 2
=
ki0
q1
q 1 1iN
q1
= NK 2
1iN
2
ki0
2
ki0
1iN
1 2
L .
N
Then
q2
q 1 2
2
NK 2
L +
KL
q1
q1 N
q1
1
q
=
(q 2)NK 2 L2 + 2KL .
q1
N
qs
1
K 2 s 2(q 1)
which can be
q1
N
Ne
s2 2 Ns + Ne
180
provided that s < N and s2 2 Ns + Ne > 0. Finally, recall that X (1) arose
from an [N, M, d] code X , with K Mv(s)/qN and e d. As a result, we obtain
that
MvN,q (s)
Nd
.
2
qN
s 2 Ns + Nd
This leads to the Elias bound (2.2.38).
The ideas used in the proof of the Elias bound (and earlier in the proof of the
Plotkin bounds) are also helpful in obtaining bounds for W2 (N, d, ), the maximal
size of a binary (non-linear) code X HN,2 of length N, distance d(X ) d and
with the property that the weight w(x) , x X . First, three obvious statements:
A B
N
W2 (N, 2k, )
B
kN
.
2 N + kN
N
and
2
(2.2.39)
Solution Take an [N, M, 2k] code X such that w(x) , x X . As before, let
ki1 be the number of 1s in
position
i in all codewords. Consider the sum of the
dot-products D = 1 x = x "x x #. We have
x,x X
1
"x x # = w(x x ) = w(x) + w(x ) x, x )
2
1
(2 2k) = k
2
and hence
D ( k)M(M 1).
On the other hand, the contribution to D from position i equals ki1 (ki1 1), i.e.
D=
1iN
ki1 (ki1 1) =
1iN
2
(ki1
ki1 ) =
1iN
2
ki1
M.
181
W2 (N, 2k, )
B
N
W (N 1, 2k, 1) .
2
N
and
2
(2.2.40)
Solution Again take an [N, M, 2k] code X such that w(x) for all x X .
Consider the shortening code X on x1 = 1 (cf. Example 2.1.8(v)): it gives a code
of length (N 1), distance 2k and constant weight ( 1). Hence, the size of
the cross-section is W2 (N 1, 2k, 1). Therefore, the number of 1s at position
1 in the codewords of X does not exceed W2 (N 1, 2k, 1). Repeating this
argument, we obtain that the total number of 1s in all positions is NW2 (N
1, 2k, 1). But this number equals M, i.e. M NW2 (N 1, 2k, 1). The
bound (2.2.40) then follows.
Corollary 2.2.15
N
and 2k 4k 2,
2
1
k
(2.2.41)
The remaining part of Section 2.2 focuses on the Johnson bound. This bound
aims at improving the binary Hamming bound (cf. (2.1.8b) with q 2):
(2.2.42)
M2 (N, 2E + 1) 2N vN (E) or vN (E) M2 (N, 2E + 1) 2N .
Namely, the Johnson bound asserts that
M2 (N, 2E + 1) 2N /vN (E) or vN (E) M2 (N, 2E + 1) 2N ,
where
vN (E)
N
E +1
2E + 1
.
W2 (N, 2E + 1, 2E + 1)
E
1
= vN (E) +
N/(E + 1)
(2.2.43)
(2.2.44)
182
N
Recall that vN (E) =
stands for the volume of the binary Hamming ball
0sE s
of radius E. We begin our derivation of bound (2.2.43) with the following result.
Lemma
2.2.16 If x, y are binary words, with (x, y) = 2 + 1, then there exists
2 + 1
binary words z such that (x, z) = + 1 and (y, z) = .
Proof Left as an exercise.
Consider the set T (= TN,E+1 ) of all binary N-words at distance exactly E + 1
from the codewords from X :
(
T = z HN : (z, x) = E + 1 for some x X
)
and (z, y) E + 1 for all y X .
(2.2.45)
Then we can write that
MvN (E) + T 2N ,
(2.2.46)
as none of the words z T falls in any of the balls of radius E about the codewords
y X . The bound (2.2.43) will follow when we solve the next worked example.
Worked Example 2.2.17 Prove that the cardinality T is greater than or equal
to the second term from the RHS of (2.2.44):
2E + 1
N
M
.
(2.2.47)
W2 (N, 2E + 1, 2E + 1)
E
N/(E + 1) E + 1
Solution We want to find a lower bound on T . Consider the set W (= WN,E+1 ))
of matched pairs of N-words defined by
)
(
W = (x, z) : x X , z TE+1 , (x, z) = E + 1
(
(2.2.48)
= (x, z) : x X , z HN : (x, z) = E + 1,
)
and (y, z) E + 1 for all y X .
Given x X , the x-section W x is defined as
W x = {z HN : (x, z) W }
= {z : (x, z) = E + 1, (y, z) E + 1 for all y X }.
(2.2.49)
for all y X }.
(2.2.50)
183
y X : (x, y) = 2E + 1 .
W =
E +1
E
Moreover, if we subtract x from every y X with (x, y) = 2E + 1, the result
is a code of length N whose codewords z have weight w(z) 2E + 1. Hence,
there are at most W (N, 2E + 1, 2E + 1) codewords y X with (x, y) = 2E + 1.
Consequently,
N
2E + 1
x
W (N, 2E + 1, 2E + 1)
(2.2.51)
W
E +1
E
and
W M the RHS of (2.2.51).
(2.2.52)
(2.2.53)
F
N
So, y v and z v have no digit 1 in common. Hence, there exist at most
E +1
E
F
N
words of the form y v where y W v , i.e. at most
words in W v . ThereE +1
fore,
E
F
N
W
T .
(2.2.54)
E +1
Collecting (2.2.51), (2.2.52) and (2.2.54) yields inequality (2.2.47).
184
N
M (N, 2E + 1) 2 vN (E)
A
B 1
(2.2.55)
N
N E
N E
1
N/(E + 1) E
E +1
E +1
Corollary 2.2.18  For the binary case N = 13, E = 2 the Johnson bound gives
    M_2(13, 5) ≤ 2^{13} / [ 1 + 13 + 78 + (286 − 10 · 23)/4 ] = 8192/106,  i.e.  M_2(13, 5) ≤ 77.
This bound is much better than Hamming's, which gives M_2(13, 5) ≤ 89. In fact, it is known that M_2(13, 5) = 64. Compare Section 3.4.
Any binary linear code of rank k contains 2^k vectors, i.e. has size 2^k.
Proof A basis of the code contains k linearly independent vectors. The code is
generated by the basis; hence it consists of the sums of basic vectors. There are
precisely 2k sums (the number of subsets of {1, . . . , k} indicating the summands),
and they all give different vectors.
Consequently, a binary linear code X of rank k may be used for encoding all
possible source strings of length k; the information rate of a binary linear [N, k]
code is k/N. Thus, indicating k N linearly independent words x HN identifies
a (unique) linear code X HN of rank k. In other words, a linear binary code of
rank k is characterised by a k N matrix of 0s and 1s with linearly independent
rows:
    G = ( g_11  g_12  …  g_1N )
        ( g_21  g_22  …  g_2N )
        (  ⋮     ⋮         ⋮  )
        ( g_k1  g_k2  …  g_kN ).
Namely, we take the rows g(i) = gi1 . . . giN , 1 i k, as the basic vectors of a
linear code.
Definition 2.3.3 A matrix G is called a generating matrix of a linear code. It is
clear that the generating matrix is not unique.
Equivalently, a linear [N, k] code X may be described as the kernel of a certain (N − k) × N matrix H, again with the entries 0 and 1: X = ker H, where
    H = ( h_11        h_12        …  h_1N     )
        ( h_21        h_22        …  h_2N     )
        (  ⋮           ⋮               ⋮      )
        ( h_(N−k)1    h_(N−k)2    …  h_(N−k)N )
and
    ker H = { x = x_1 … x_N : x H^T = 0^(N−k) }.                              (2.3.1)
Here, for x, y ∈ H_N, the dot-product is
    ⟨ x · y ⟩ = ∑_{1≤i≤N} x_i y_i  (addition mod 2).                          (2.3.2)
The inner product (2.3.2) possesses all properties of the Euclidean scalar product
in RN , but one: it is not positive definite (and therefore does not define a norm). That
is, there are non-zero vectors x HN with "x x# = 0. Luckily, we do not need the
positive definiteness.
However, the key ranknullity property holds true for the dot-product: if L is
a linear subspace in HN of rank k then its orthogonal complement L (i.e. the
collection of vectors z HN such that "x z# = 0 for all x L ) is a linear subspace
of rank N k. Thus, the (N k) rows of H can be considered as a basis in X ,
the orthogonal complement to X .
The matrix H (or sometimes its transpose H T ) with the property X = ker H or
"x h( j)# 0 (cf. (2.3.1)) is called a parity-check (or, simply, check) matrix of code
X . In many cases, the description of a code by a check matrix is more convenient
than by a generating one.
The parity-check matrix is again not unique as the basis in X can be chosen
non-uniquely. In addition, in some situations where a family of codes is considered, of varying length N, it is more natural to identify a check matrix where the
number of rows can be greater than N k (but some of these rows will be linearly
dependent); such examples appear in Chapter 3. However, for the time being we
will think of H as an (N k) N matrix with linearly independent rows.
Worked Example 2.3.4 Let X be a binary linear [N, k, d] code of information
rate = k/N . Let G and H be, respectively, the generating and parity-check matrices of X . In this example we refer to constructions introduced in Example 2.1.8.
(a) The parity-check extension of X is a binary code X + of length N +1 obtained
by adding, to each codeword x X , the symbol xN+1 = xi so that the
1iN
sum
1iN+1
xi is zero. Prove that X + is a linear code and find its rank and
minimal distance. How are the information rates and generating and paritycheck matrices of X and X + related?
(b) The truncation X of X is defined as a linear code of length N 1 obtained
by omitting the last symbol of each codeword x X . Suppose that code X
has distance d 2. Prove that X is linear and find its rank and generating
and parity-check matrices. Show that the minimal distance of X is at least
d 1.
(c) The m-repetition of X is a code X re (m) of length Nm obtained by repeating each codeword x X a total of m times. Prove that X re (m) is a linear
code and find its rank and minimal distance. How are the information rates and
generating and parity-check matrices of X re (m) related to , G and H ?
|
g1i
1iN
..
H
+
|
.
G =
G
,H =
..
|
.
|
gki
1iN
0 ... 0
187
|
|
|
|
|
|
|
1
1
The rank of X + equals the rank of X = k. If the minimal distance of X was even
it is not changed; if odd it increases by 1. The information rate + = (N 1) N.
(b) The generating matrix
g11 . . . g1N1
..
..
G =
.
..
.
gk1 . . . gkN1
The parity-check matrix H of X , after suitable column operations, may be written
as
H =
|
.
|
0 ... 0
|
The parity-check matrix of X is then identified with H . The rank is unchanged;
the distance may decrease maximum by 1. The information rate = N (N 1).
(c) The generating and parity-check matrices are
Gre (m) = (G . . . G) (m times),
and
H (m) =
re
0
0
I
..
.
...
...
...
..
.
H
I
I
..
.
0
I
0
..
.
0
0
0
..
.
0 0 ... I
188
Here, I is a unit N N matrix and the zeros mean the zero matrices (of size (N
k) N and N N, accordingly). The number of the unit matrices in the first column
equals m 1. (This is not a unique form of H re (m).) The size of H re (m) is (Nm
k) Nm.
The rank is unchanged, the minimal distance in X re (m) is md and the information rate /m.
Worked Example 2.3.5 A dual code of a linear binary [N, k] code X is defined
as the set X of the words y = y1 . . . yN such that the dot-product
"y x# =
yi xi = 0
1iN
189
Show that the subset of a linear binary code consisting of all words of even
weight is a linear code. Prove that, for d even, if there exists a linear [N, k, d] code
then there exists a linear [N, k, d] code with codewords of even weight.
Solution  The size is 2^k, and the number of different bases equals (1/k!) ∏_{i=0}^{k−1} (2^k − 2^i). Indeed, if the first l basis vectors are selected, all their 2^l linear combinations should be excluded on the next step. This gives 840 for k = 4, and 28 for k = 3.
Finally, for d even, we can truncate the original code and then use the parity-check extension.
Example 2.3.7  The binary Hamming [7, 4] code is determined by a 3 × 7 parity-check matrix. The columns of the check matrix are all non-zero words of length 3. Using lexicographic order of these words we obtain
    H^Ham_lex = ( 1 0 1 0 1 0 1 )
                ( 0 1 1 0 0 1 1 )
                ( 0 0 0 1 1 1 1 ).
The corresponding generating matrix may be written as
    G^Ham_lex = ( 0 0 1 1 0 0 1 )
                ( 0 1 0 0 1 0 1 )
                ( 0 0 1 0 1 1 0 )
                ( 1 1 1 0 0 0 0 ).                                            (2.3.3)
In many cases it is convenient to write the check matrix of a linear [N, k] code in a canonical (or standard) form:
    H^can = [ I_{N−k} | H′ ].                                                 (2.3.4a)
In the case of the Hamming [7, 4] code it gives
    H^Ham_can = ( 1 0 0 1 1 0 1 )
                ( 0 1 0 1 0 1 1 )
                ( 0 0 1 0 1 1 1 ),
with a generating matrix also in a canonical form:
    G^can = [ G′ | I_k ];                                                      (2.3.4b)
namely,
    G^Ham_can = ( 1 1 0 1 0 0 0 )
                ( 1 0 1 0 1 0 0 )
                ( 0 1 1 0 0 1 0 )
                ( 1 1 1 0 0 0 1 ).
190
Formally, Glex and Gcan determine different codes. However, these codes are
equivalent:
Definition 2.3.8 Two codes are called equivalent if they differ only in permutation of digits. For linear codes, equivalence means that their generating matrices
can be transformed into each other by permutation of columns and by rowoperations including addition of columns multiplied by scalars. It is plain that
equivalent codes have the same parameters (length, rank, distance).
In what follows, unless otherwise stated, we do not distinguish between equivalent linear codes.
Remark 2.3.9 An advantage of writing G in a canonical form is that a source
string u(k) Hk is encoded as an N-vector u(k) Gcan ; according to (2.3.4b), the last k
digits in u(k) Gcan form word u(k) (they are called information digits), whereas the
first N k are used for the parity-check (and called parity-check digits). Pictorially,
the parity-check digits carry the redundancy that allows the decoder to detect and
correct errors.
(2.3.5)
Theorem 2.3.11
(i) The distance of a linear binary code equals the minimal weight of its non-zero
codewords.
(ii) The distance of a linear binary code equals the minimal number of linearly
dependent columns in the check matrix.
Proof (i) As the code X is linear, the sum x + y X for each pair of codewords
x, y X . Owing to the shift invariance of the Hamming distance (see Lemma
2.1.1), (x, y) = (0, x + y) = w(x + y) for any pair of codewords. Hence, the
minimal distance of X equals the minimal distance between 0 and the rest of the
code, i.e. the minimal weight of a non-zero codeword from X .
(ii) Let X be a linear code with a parity-check matrix H and minimal distance d.
Then there exists a codeword x X with exactly d non-zero digits. Since xH T = 0,
191
we conclude that there are d columns of H which are linearly dependent (they
correspond to non-zero digits in x). On the other hand, if there exist (d 1) columns
of H which are linearly dependent then their sum is zero. But that means that there
exists a word y, of weight w(y) = d 1, such that yH T = 0. Then y must belong to
X which is impossible, since min[w(x) : x X , x = 0] = d.
Theorem 2.3.12 The Hamming [7, 4] code has minimal distance 3, i.e. it detects
2 errors and corrects 1. Moreover, it is a perfect code correcting a single error.
Proof  For any pair of columns, the parity-check matrix H^Ham_lex contains their sum as another column, so there exist linearly dependent triples of columns (viz. columns 1, 6, 7). No two columns are linearly dependent because they are distinct (x + y = 0 means x = y). Also, the volume v_7(1) equals 1 + 7 = 2³, and the code is perfect as its size is 2⁴ and 2⁴ · 2³ = 2⁷.
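To see the single-error correction at work, here is a short sketch (added for illustration) using H^Ham_lex and G^Ham_lex from Example 2.3.7: the syndrome of the received word is the binary representation of the position of the corrupted digit, read off the columns of the check matrix.

```python
import numpy as np

H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])       # H_lex from Example 2.3.7
G = np.array([[0, 0, 1, 1, 0, 0, 1],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1, 1, 0],
              [1, 1, 1, 0, 0, 0, 0]])       # generating matrix with G H^T = 0

assert not (G @ H.T % 2).any()              # codewords lie in ker H

u = np.array([1, 0, 1, 1])                  # a 4-bit message (illustrative)
x = u @ G % 2                               # encode
y = x.copy(); y[4] ^= 1                     # flip digit 5 in transit

s = y @ H.T % 2                             # syndrome
pos = int(s[0] + 2 * s[1] + 4 * s[2])       # column pos of H equals binary rep of pos
y[pos - 1] ^= 1                             # correct the single error
assert (y == x).all()
print("corrupted position:", pos)
```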
The construction of the Hamming [7, 4] code admits a straightforward generalisation to any length N = 2^l − 1; namely, consider an l × (2^l − 1) matrix H^Ham whose columns represent all possible non-zero binary vectors of length l:
    H^Ham = ( the 2^l − 1 columns run through all non-zero binary l-vectors ).        (2.3.6)
The rows of H^Ham are linearly independent, and hence H^Ham may be considered as a check matrix of a linear code of length N = 2^ℓ − 1 and rank N − ℓ = 2^ℓ − 1 − ℓ. Any two columns of H^Ham are linearly independent but there exist linearly dependent triples of columns, e.g. x, y and x + y. Hence, the code X^Ham with the check matrix H^Ham has minimal distance 3, i.e. it detects 2 errors and corrects 1. This code is called the Hamming [2^ℓ − 1, 2^ℓ − 1 − ℓ] code. It is a perfect one-error correcting code: the volume of the 1-ball v_{2^ℓ−1}(1) equals 1 + 2^ℓ − 1 = 2^ℓ, and size × volume = 2^{2^ℓ−1−ℓ} · 2^ℓ = 2^{2^ℓ−1} = 2^N. The information rate is (2^ℓ − 1 − ℓ)/(2^ℓ − 1) → 1 as ℓ → ∞. This proves

Theorem 2.3.13 The above construction defines a family of [2^ℓ − 1, 2^ℓ − 1 − ℓ, 3] linear binary codes X^Ham_{2^ℓ−1}, ℓ = 1, 2, . . ., which are perfect one-error correcting codes.
(2.3.7)
Worked Example 2.3.17 Let y be a received word and let w be the leader of the coset y + X (i.e. a word of minimal weight in this coset). Show that the word x = y + w is always a codeword that minimises the distance between y and the words from X.

Solution As y and w are in the same coset, y + w ∈ X (see Example 2.3.16(3)). All other words from X are obtained as the sums y + v where v runs over the coset y + X. Hence, for any x ∈ X,
and contains at most one non-zero digit. If the syndrome yH^T = s occupies position i among the columns of the parity-check matrix then, for the word e_i = 0 . . . 1 0 . . . 0 with the only non-zero digit at position i,

    (y + e_i)H^T = s + s = 0.

That is, (y + e_i) ∈ X and e_i ∈ y + X. Obviously, e_i is the leader.
The duals (X^Ham)⊥ of binary Hamming codes form a particular class, called simplex codes. If X^Ham is [2^ℓ − 1, 2^ℓ − 1 − ℓ], its dual (X^Ham)⊥ is [2^ℓ − 1, ℓ]; every non-zero codeword of (X^Ham)⊥ has weight 2^{ℓ−1}, and the distance between any two distinct codewords equals 2^{ℓ−1}. Justify the term simplex.

Solution If X = X^Ham is the binary Hamming [2^ℓ − 1, 2^ℓ − 1 − ℓ] code then the dual X⊥ is [2^ℓ − 1, ℓ], and its ℓ × (2^ℓ − 1) generating matrix is H^Ham. The weight of any row of H^Ham equals 2^{ℓ−1} (and so d(X⊥) = 2^{ℓ−1}). Indeed, the weight of row j of H^Ham equals the number of non-zero vectors of length ℓ with 1 at position j. This gives 2^{ℓ−1} as the weight, as half of all 2^ℓ vectors from H_ℓ have 1 at any given position.
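A quick check of the simplex property for ℓ = 3 (our own Python illustration, not from the book): the dual of the Hamming [7, 4] code is generated by H^Ham_lex above, and every non-zero codeword should have weight 2^{ℓ−1} = 4.

    from itertools import product

    H_lex = [[1,0,1,0,1,0,1],
             [0,1,1,0,0,1,1],
             [0,0,0,1,1,1,1]]

    def span(rows):
        """All F_2 linear combinations of the given rows."""
        for coeffs in product([0, 1], repeat=len(rows)):
            yield tuple(sum(c * r[j] for c, r in zip(coeffs, rows)) % 2
                        for j in range(len(rows[0])))

    weights = {sum(c) for c in span(H_lex) if any(c)}
    print("weights of non-zero dual codewords:", weights)   # expected: {4}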
    ∑_{1≤l≤N} x_l h_{jl} = 0.        (2.3.8)
Lemma 2.3.22 For any [N, k] code, there exists an equivalent code whose generating matrix G has a canonical form: G = ( G′ | I_k ) where I_k is the identity k × k matrix and G′ is a k × (N − k) matrix. Similarly, the parity-check matrix H may have a standard form, which is ( I_{N−k} | H′ ).
We now discuss the decoding procedure for a general linear code X of rank k. As was noted before, it may be used for encoding source messages (strings) u = u_1 . . . u_k of length k. The source encoding u ∈ F_q^k ↦ X becomes particularly simple when the generating and parity-check matrices are used in the canonical (or standard) form.

Theorem 2.3.23 For any linear code X there exists an equivalent code X′ with the generating matrix G^can and the check matrix H^can in standard form (2.3.4a), (2.3.4b) and G′ = −(H′)^T.
Proof Assume that the code X is non-trivial (i.e. not reduced to the zero word 0). Write a basis for X and take the corresponding generating matrix G. By performing row operations (where a pair of rows i and j are exchanged, or row i is replaced by row i plus row j) we can change the basis, but we do not change the code. Our matrix G contains a non-zero column, say l_1: perform row operations to make g_{1l_1} the only non-zero entry in this column. By permuting digits (columns), place column l_1 at the first position of the identity block, i.e. at position N − k + 1. Drop row 1 and this column and perform a similar procedure with the rest, ending up with the only non-zero entry g_{2l_2} in a column l_2. Place column l_2 at position N − k + 2. Continue until an upper triangular k × k submatrix emerges. Further operations may be reduced to this submatrix only. If this submatrix is a unit matrix, stop. If not, pick the first column with more than one non-zero entry. Add the corresponding rows from the bottom to kill the redundant non-zero entries. Repeat until a unit submatrix emerges. Now the generating matrix is in a standard form, and the new code is equivalent to the original one.
To complete the proof, observe that the matrices G^can and H^can figuring in (2.3.4a), (2.3.4b), with G′ = −(H′)^T, have k and N − k linearly independent rows, respectively. Besides, the k × (N − k) matrix G^can(H^can)^T vanishes. In fact,

    (G^can(H^can)^T)_{ij} = ⟨row i of G^can · column j of (H^can)^T⟩ = g′_{ij} − g′_{ij} = 0.

Hence, H^can is a check matrix for G^can.
Returning to source encoding, select the generating matrix in the canonical form G^can. Then, given a string u = u_1 . . . u_k, we set x = ∑_{i=1}^{k} u_i g^can(i), where g^can(i) represents row i of G^can. The last k digits in x give the string u; they are called the information digits. The first N − k digits are used to ensure that x ∈ X; they are called the parity-check digits.
The standard form is convenient because, in the above representation X = F_q^k G^can, the initial (N − k)-string of each codeword is used for error protection (enabling the detection and correction of errors), and the final k-string yields the message from F_q^k. As in the binary case, the parity-check matrix H satisfies Theorem 2.3.11. In particular, the minimal distance of a code equals the minimal number of linearly dependent columns in its parity-check matrix H.
Definition 2.3.24 Given an [N, k] linear q-ary code X with parity-check matrix H, the syndrome of an N-vector y ∈ F_q^N is the (N − k)-vector yH^T ∈ F_q^{N−k}, and the syndrome subspace is the image F_q^N H^T. A coset of X by a vector w ∈ F_q^N is denoted by w + X and formed by the words of the form w + x where x ∈ X. All cosets have the same number of elements, equal to q^k, and partition the whole Hamming space F_q^N into q^{N−k} disjoint subsets; the code X is one of them. The cosets are in one-to-one correspondence with the syndromes yH^T. The syndrome decoding procedure is carried out as in the binary case: a received vector y is decoded by x = y − w where w is the leader of the coset y + X (i.e. the word from y + X with minimum weight).
All drawbacks we had in the case of binary syndrome decoding persist in the general q-ary case, too (and in fact are more pronounced): the coset tables are bulky, and the leader of a coset may be not unique. However, for q-ary Hamming codes the syndrome decoding procedure works well, as we will see in Section 2.4.
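The following Python sketch (ours) tabulates coset leaders by syndrome for the binary Hamming [7, 4] code with the canonical check matrix of (2.3.4a)–(2.3.4b) and decodes a received word by subtracting the leader of its coset.

    from itertools import product

    H = [[1,0,0,1,1,0,1],
         [0,1,0,1,0,1,1],
         [0,0,1,0,1,1,1]]
    N = 7

    def syndrome(y):
        return tuple(sum(h[j] * y[j] for j in range(N)) % 2 for h in H)

    # Build the syndrome -> coset leader table by scanning words of low weight.
    leader = {}
    for y in sorted(product([0, 1], repeat=N), key=sum):
        s = syndrome(y)
        if s not in leader:            # first (lowest-weight) word wins
            leader[s] = y

    def decode(y):
        w = leader[syndrome(y)]
        return tuple((yj - wj) % 2 for yj, wj in zip(y, w))

    received = (1, 1, 1, 1, 0, 0, 0)   # codeword 1101000 with an error in digit 3
    print(decode(received))            # expected: (1, 1, 0, 1, 0, 0, 0)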
In the case of linear codes, some of the bounds can be improved (or rather new
bounds can be produced).
Worked Example 2.3.25
(a) Fix a codeword x ∈ X with exactly d non-zero digits. Prove that truncating X on the non-zero digits of x produces a code X′_{N−d} of length N − d, rank k − 1 and distance d′ for some d′ ≥ ⌈d/2⌉.
(b) Deduce the Griesmer bound improving the Singleton bound (2.1.12):

    N ≥ d + ∑_{1≤l≤k−1} ⌈d/2^l⌉.        (2.3.9)

Solution (a) Without loss of generality, assume that the non-zero digits in x are x_1 = ··· = x_d = 1. Truncating on digits 1, . . . , d will produce the code X′_{N−d} with the rank reduced by 1. Indeed, suppose that a linear combination of k − 1 basis vectors vanishes on positions d + 1, . . . , N. Then on the positions 1, . . . , d all the values equal either 0s or 1s, because d is the minimal distance. But the first case is impossible, unless the vectors are linearly dependent. The second case also leads to a contradiction, by adding the string x and obtaining k linearly dependent vectors in the code X. Next, suppose that X′_{N−d} has distance d′ < ⌈d/2⌉ and take y ∈ X with

    w(y′) = ∑_{j=d+1}^{N} y_j = d′.
    N ≥ d + ∑_{1≤l≤k−1} ⌈d/2^l⌉.        (2.3.10)
Proof Let X be a linear code with distance at least d, of maximal rank and maximal size. If inequality (2.3.10) is violated, the union of all Hamming spheres of radius d − 1 centred on the codewords cannot cover the whole Hamming space. So, there must be a point y that is not in any Hamming sphere around a codeword. Then for any codeword x and any scalar b ∈ F_q the vectors y and y + b·x are in the same coset by X. Also y + b·x cannot be in any Hamming sphere of radius d − 1. The same is true for x + b·y because, if it were, then y would be in a Hamming sphere around another codeword. Here we use the fact that F_q is a field. Then the vector subspace spanned by X and y is a linear code larger than X and with minimal distance at least d. That is a contradiction, which completes the proof.

For example, let q = 2 and N = 10. Then 2^5 < v_{10,2}(2) = 56 < 2^6. Upon taking d = 3, the Gilbert bound guarantees the existence of a binary [10, 5] code with d ≥ 3.
For brevity, we will now write X^H and H^H (or even simply H when possible) instead of X^Ham_{N,q} and H^Ham. In the binary case (with q = 2), the matrix H^H is composed of all non-zero binary column-vectors of length ℓ. For general q we have to exclude columns that are multiples of each other. To this end, we can choose as columns all non-zero ℓ-words that have 1 in their top-most non-zero component. No two such columns are proportional, and their total equals (q^ℓ − 1)/(q − 1). Next, as in the binary case, one can arrange words with digits from F_q in the lexicographic order. By construction, any two columns of H^H are linearly independent, but there exist triples of linearly dependent columns. Hence, d(X^H) = 3, and X^H detects two errors and corrects one. Furthermore, X^H is a perfect code correcting a single error, as

    M(1 + (q − 1)N) = q^k (1 + (q − 1)(q^ℓ − 1)/(q − 1)) = q^{k+ℓ} = q^N.
As in the binary case, the general Hamming codes admit an efficient (and elegant) decoding procedure. Suppose a parity-check matrix H = H^H has been constructed as above. Upon receiving a word y ∈ F_q^N we calculate the syndrome yH^T ∈ F_q^ℓ. If yH^T = 0 then y is a codeword. Otherwise, the column-vector Hy^T is a scalar multiple of a column h(j) of H: Hy^T = a·h(j), for some j = 1, . . . , N and a ∈ F_q \ {0}. In other words, yH^T = a·e(j)H^T where the word e(j) = 0 . . . 1 . . . 0 ∈ H_{N,q} (with the jth digit 1, all others 0). Then we decode y by x = y − a·e(j), i.e. simply change digit y_j in y to y_j − a.
Summarising, we have the following

Theorem 2.4.2 The q-ary Hamming codes form a family of

    [ (q^ℓ − 1)/(q − 1), (q^ℓ − 1)/(q − 1) − ℓ, 3 ]

perfect codes X^H_N, for N = (q^ℓ − 1)/(q − 1), ℓ = 1, 2, . . ., correcting one error, with a decoding rule that changes the digit y_j to y_j − a in a received word y = y_1 . . . y_N ∈ F_q^N, where 1 ≤ j ≤ N and a ∈ F_q \ {0} are determined from the condition that Hy^T = a·h(j), the a-multiple of column j of the parity-check matrix H.
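A minimal Python sketch (ours) of this decoding rule for the smallest non-binary case q = 3, ℓ = 2, i.e. the ternary Hamming [4, 2, 3] code; the columns of H are all non-zero vectors of F_3^2 whose top-most non-zero entry is 1.

    q = 3
    H = [[1, 1, 1, 0],
         [0, 1, 2, 1]]
    N = 4

    def syndrome(y):
        return tuple(sum(h[j] * y[j] for j in range(N)) % q for h in H)

    def decode(y):
        s = syndrome(y)
        if s == (0, 0):
            return tuple(y)                       # already a codeword
        # Find j and a != 0 with s = a * (column j of H); subtract a at digit j.
        for j in range(N):
            col = (H[0][j], H[1][j])
            for a in range(1, q):
                if s == tuple(a * c % q for c in col):
                    x = list(y)
                    x[j] = (x[j] - a) % q
                    return tuple(x)

    codeword = (1, 2, 0, 1)        # satisfies H . c^T = 0 mod 3
    corrupted = (1, 0, 0, 1)       # digit 2 changed from 2 to 0
    print(decode(corrupted))       # expected: (1, 2, 0, 1)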
Hamming codes were discovered by R. Hamming and M. Golay in the late
1940s. At that time Hamming, an electrical engineer turned computer scientist
during the Jurassic computers era, was working at Los Alamos (as an intellectual janitor to local nuclear physicists, in his own words). This discovery shaped
the theory of codes for more than two decades: people worked hard to extend properties of Hamming codes to wider classes of codes (with variable success). Most
of the topics on codes discussed in this book are related, in one way or another, to
Hamming codes. Richard Hamming was not only an outstanding scientist but also
an illustrious personality; his writings (and accounts of his life) are entertaining
and thought-provoking.
Until the late 1950s, the Hamming codes were a unique family of codes existing in dimensions N , with regular properties. It was then discovered that
these codes have a deep algebraic background. The development of the algebraic
methods based on these observations is still a dominant theme in modern coding
theory.
Another important example is the four Golay codes (two binary and two ternary).
Marcel Golay (1902–1989) was a Swiss electrical engineer who lived and worked
in the USA for a long time. He had an extraordinary ability to see the discrete
geometry of the Hamming spaces and guess the construction of various codes
without bothering about proofs.
The binary Golay code X^Gol_24 is a [24, 12] code with the generating matrix G = (I_12 | G′), where

    G′ = ( 0 1 1 1 1 1 1 1 1 1 1 1
           1 1 1 0 1 1 1 0 0 0 1 0
           1 1 0 1 1 1 0 0 0 1 0 1
           1 0 1 1 1 0 0 0 1 0 1 1
           1 1 1 1 0 0 0 1 0 1 1 0
           1 1 1 0 0 0 1 0 1 1 0 1
           1 1 0 0 0 1 0 1 1 0 1 1
           1 0 0 0 1 0 1 1 0 1 1 1
           1 0 0 1 0 1 1 0 1 1 1 0
           1 0 1 0 1 1 0 1 1 1 0 0
           1 1 0 1 1 0 1 1 1 0 0 0
           1 0 1 1 0 1 1 1 0 0 0 1 ).        (2.4.1)
The rule of forming the matrix G′ is ad hoc (and this is how it was determined by M. Golay in 1949). There will be further ad hoc arguments in the analysis of Golay codes.

Remark 2.4.3 Interestingly, there is a systematic way of constructing all codewords of X^Gol_24 (or its equivalent) by fitting together two versions of the Hamming [7, 4] code X^H_7. First, observe that reversing the order of all the digits of a Hamming code X^H_7 yields an equivalent code which we denote by X^K_7. Then add a parity-check to both X^H_7 and X^K_7, producing codes X^{H,+}_8 and X^{K,+}_8. Finally, select two words a, b ∈ X^{H,+}_8 (not necessarily distinct) and a word x ∈ X^{K,+}_8. Then all 2^12 codewords of X^Gol_24 of length 24 can be written as concatenations (a + x)(b + x)(a + b + x). This can be checked by inspection of the generating matrices.
Lemma 2.4.4 The binary Golay code X^Gol_24 is self-dual, with (X^Gol_24)⊥ = X^Gol_24. The code X^Gol_24 is also generated by the matrix G⊥ = (G′ | I_12).

Proof A direct calculation shows that any two rows of the matrix G are dot-orthogonal. Thus X^Gol_24 ⊆ (X^Gol_24)⊥. But the dimensions of X^Gol_24 and (X^Gol_24)⊥ coincide. Hence, X^Gol_24 = (X^Gol_24)⊥. The last assertion of the lemma now follows from the property (G′)^T = G′.
Worked Example 2.4.5 Prove that the minimal distance of X^Gol_24 equals 8.

Solution First, we check that for all x ∈ X^Gol_24 the weight w(x) is divisible by 4. This is true for every row of G = (I_12 | G′): the number of 1s is either 12 or 8. Next, for all binary N-words x, x′,

    w(x + x′) = w(x) + w(x′) − 2w(x ∧ x′),

where x ∧ y is the wedge-product, with digits (x ∧ y)_j = min(x_j, y_j), 1 ≤ j ≤ N (cf. (2.1.6b)). But for any pair g(j), g(j′) of the rows of G, w(g(j) ∧ g(j′)) = 0 mod 2. So, 4 divides w(x) for all x ∈ X^Gol_24.
On the other hand, X^Gol_24 does not have codewords of weight 4. To prove this, compare two generating matrices, (I_12 | G′) and ((G′)^T | I_12). If x ∈ X^Gol_24 has w(x) = 4, write x as a concatenation x_L x_R. Any non-trivial sum of rows of (I_12 | G′) has weight at least 1 in the L-half, so w(x_L) ≥ 1. Similarly, w(x_R) ≥ 1. But if w(x_L) = 1 then x must be one of the rows of (I_12 | G′), none of which has weight w(x_R) = 3. Hence, w(x_L) ≥ 2. Similarly, w(x_R) ≥ 2. But then the only possibility is that w(x_L) = w(x_R) = 2, which is ruled out by a direct check. Thus, w(x) ≥ 8. But (I_12 | G′) has rows of weight 8. So, d(X^Gol_24) = 8.
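A brute-force confirmation (our Python sketch, not from the book): the code generated by G = (I_12 | G′) from (2.4.1) has 2^12 = 4096 codewords, so its weight distribution is easy to enumerate directly.

    from itertools import product

    Gp = ["011111111111", "111011100010", "110111000101", "101110001011",
          "111100010110", "111000101101", "110001011011", "100010110111",
          "100101101110", "101011011100", "110110111000", "101101110001"]
    G = [[int(i == j) for j in range(12)] + [int(c) for c in row]
         for i, row in enumerate(Gp)]

    weights = set()
    for u in product([0, 1], repeat=12):
        c = [sum(ui * gi[j] for ui, gi in zip(u, G)) % 2 for j in range(24)]
        if any(c):
            weights.add(sum(c))

    print(sorted(weights))                          # expected: [8, 12, 16, 24]
    print(min(weights), all(w % 4 == 0 for w in weights))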
When we truncate X^Gol_24 at any digit, we get X^Gol_23, a [23, 12, 7] code. This code is a perfect 3-error correcting code. We recover X^Gol_24 from X^Gol_23 by adding a parity-check.
The Hamming [2^ℓ − 1, 2^ℓ − 1 − ℓ, 3] codes and the Golay [23, 12, 7] code are the only possible perfect binary linear codes.
The ternary Golay code X^Gol_{12,3} of length 12 has the generating matrix (I_6 | G′(3)), where

    G′(3) = ( 0 1 1 1 1 1
              1 0 1 2 2 1
              1 1 0 1 2 2
              1 2 1 0 1 2
              1 2 2 1 0 1
              1 1 2 2 1 0 ),   with (G′(3))^T = G′(3).        (2.4.2)

The ternary Golay code X^Gol_{11,3} is a truncation of X^Gol_{12,3} at the last digit.
Theorem 2.4.6
    3^6 (1 + 11·2 + (11·10/2)·2^2) = 3^6 · 3^5 = 3^{11}.        (2.4.3)
So,

    M = ( 0 1 0 1 . . . 1      v^(1)
          0 0 1 1 . . . 1      v^(2)
          ⋮ ⋮ ⋮ ⋮        ⋮      ⋮
          0 0 0 0 . . . 1 )    v^(m)        (2.4.4)

where column j, j = 0, 1, 2, . . . , 2^m − 1, carries the binary digits j_1, . . . , j_m of j = ∑_{1≤i≤m} j_i 2^{i−1}. The columns of M list all vectors from H_{m,2} and the rows are vectors from H_{N,2}, denoted by v^(1), . . . , v^(m). In particular, v^(m) has the first 2^{m−1} entries 0 and the last 2^{m−1} entries 1. To pass from M_m to M_{m−1}, one drops the last row and takes one of the two identical halves of the remaining (m − 1) × N matrix. Conversely, to pass from M_{m−1} to M_m, one concatenates two copies of M_{m−1} and adds the row v^(m):

    M_m = ( M_{m−1}   M_{m−1}
            0 . . . 0   1 . . . 1 ).        (2.4.5)
Each position j ∈ {0, 1, . . . , 2^m − 1} has the binary expansion

    j = ∑_{1≤i≤m} j_i 2^{i−1},        (2.4.6)

and v^(i) is the indicator function of the set A_i = {j : j_i = 1}. In terms of the wedge-product (cf. (2.1.6b)), v^(i_1) ∧ v^(i_2) ∧ ··· ∧ v^(i_k) is the indicator function of the intersection A_{i_1} ∩ ··· ∩ A_{i_k}. If all i_1, . . . , i_k are distinct, the cardinality ♯(A_{i_1} ∩ ··· ∩ A_{i_k}) = 2^{m−k}. In other words, we have the following.
Lemma 2.4.7
An important fact is
Theorem 2.4.8 The vectors v^(0) = 11 . . . 1 and ∧_{1≤j≤k} v^(i_j), 1 ≤ i_1 < ··· < i_k ≤ m, k = 1, . . . , m, form a basis in H_{N,2}.

Proof It suffices to check that the standard basis N-words e^(j) = 0 . . . 1 . . . 0 (1 in position j, 0 elsewhere) can be written as linear combinations of the above vectors. But

    e^(j) = y^(1) ∧ ··· ∧ y^(m),  where y^(i) = v^(i) if j_i = 1 and y^(i) = v^(0) + v^(i) if j_i = 0,   0 ≤ j ≤ N − 1.        (2.4.7)

[All factors in position j are equal to 1 and at least one factor in any position l ≠ j is equal to 0.]
Example 2.4.9
For m = 4, N = 16,

    v^(0) = 1111111111111111
    v^(1) = 0101010101010101
    v^(2) = 0011001100110011
    v^(3) = 0000111100001111
    v^(4) = 0000000011111111
    v^(1) ∧ v^(2) = 0001000100010001
    v^(1) ∧ v^(3) = 0000010100000101
    v^(1) ∧ v^(4) = 0000000001010101
    v^(2) ∧ v^(3) = 0000001100000011
    v^(2) ∧ v^(4) = 0000000000110011
    v^(3) ∧ v^(4) = 0000000000001111
    v^(1) ∧ v^(2) ∧ v^(3) = 0000000100000001
    v^(1) ∧ v^(2) ∧ v^(4) = 0000000000010001
    v^(1) ∧ v^(3) ∧ v^(4) = 0000000000000101
    v^(2) ∧ v^(3) ∧ v^(4) = 0000000000000011
    v^(1) ∧ v^(2) ∧ v^(3) ∧ v^(4) = 0000000000000001
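The table above is easy to regenerate programmatically. Our Python sketch builds the rows v^(i) from the binary digits of the column index and takes coordinatewise minima (the wedge-product) of all subsets of rows:

    from itertools import combinations

    m, N = 4, 16
    v = {i: [(j >> (i - 1)) & 1 for j in range(N)] for i in range(1, m + 1)}
    v[0] = [1] * N                                  # v^(0) = all-ones word

    def wedge(*rows):
        return [min(bits) for bits in zip(*rows)]

    for k in range(1, m + 1):
        for idx in combinations(range(1, m + 1), k):
            label = " ^ ".join(f"v({i})" for i in idx)
            print(label, "=", "".join(map(str, wedge(*(v[i] for i in idx)))))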
(2.4.8)
(2.4.9)
(2.4.10)
This means that the Reed–Muller codes are related via the bar-product construction (cf. Example 2.1.8(viii)):

    X^RM(r, m) = X^RM(r, m − 1) | X^RM(r − 1, m − 1).        (2.4.11)

Therefore, inductively,

    d(X^RM(r, m)) = 2^{m−r}.        (2.4.12)

In fact, for r = 0, d(X^RM(0, m)) = 2^m, and for all m, d(X^RM(m, m)) = 1 = 2^0. Assume d(X^RM(r − 1, m)) = 2^{m−r+1} for all m ≥ r − 1, and d(X^RM(r − 1, m − 1)) = 2^{m−r}. Then (cf. (2.4.14) below)

    d(X^RM(r, m)) = min[2d(X^RM(r, m − 1)), d(X^RM(r − 1, m − 1))]
                  = min[2 · 2^{m−1−r}, 2^{m−1−r+1}] = 2^{m−r}.        (2.4.13)
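A small numerical check (ours, assuming the standard description of X^RM(r, m) as the span of v^(0) and the wedge-products of at most r of the rows v^(i)): build a generator matrix of RM(r, m) and verify d = 2^{m−r} by enumeration for small parameters.

    from itertools import combinations, product

    def rm_generators(r, m):
        N = 2 ** m
        v = {i: [(j >> (i - 1)) & 1 for j in range(N)] for i in range(1, m + 1)}
        rows = [[1] * N]                                   # v^(0)
        for k in range(1, r + 1):
            for idx in combinations(range(1, m + 1), k):
                rows.append([min(v[i][j] for i in idx) for j in range(N)])
        return rows

    def min_distance(rows):
        N = len(rows[0])
        best = N
        for coeffs in product([0, 1], repeat=len(rows)):
            if any(coeffs):
                w = sum(sum(c * row[j] for c, row in zip(coeffs, rows)) % 2
                        for j in range(N))
                best = min(best, w)
        return best

    for m, r in [(2, 0), (2, 1), (2, 2), (3, 1), (3, 2), (4, 1), (4, 2)]:
        assert min_distance(rm_generators(r, m)) == 2 ** (m - r)
    print("d(RM(r, m)) = 2^(m-r) verified on the listed cases")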
Summarising,

Theorem 2.4.11

    d(X_1 | X_2) = min[2d(X_1), d(X_2)].        (2.4.14)

Proof For any x ∈ X_1 and y ∈ X_2, not both zero, w(x | x + y) ≥ min[2w(x), w(y)]: if y = 0 the weight equals 2w(x), while if y ≠ 0 then w(x) + w(x + y) ≥ w(y). Hence

    d(X_1 | X_2) ≥ min[2d(X_1), d(X_2)].        (2.4.15)

On the other hand, if x ∈ X_1 has w(x) = d(X_1) then d(X_1 | X_2) ≤ w(x | x) = 2d(X_1). Finally, if y ∈ X_2 has w(y) = d(X_2) then d(X_1 | X_2) ≤ w(0 | y) = d(X_2). We conclude that

    d(X_1 | X_2) ≤ min[2d(X_1), d(X_2)],        (2.4.16)

proving (2.4.14).
Next, we check that

    X_2⊥ | X_1⊥ ⊆ (X_1 | X_2)⊥.

Indeed, let (u | u + v) ∈ X_2⊥ | X_1⊥ and (x | x + y) ∈ X_1 | X_2. The dot-product

    ⟨(u | u + v) · (x | x + y)⟩ = ⟨u · x⟩ + ⟨(u + v) · (x + y)⟩
                               = ⟨u · y⟩ + ⟨v · (x + y)⟩ = 0,

since u ∈ X_2⊥, y ∈ X_2, v ∈ X_1⊥ and (x + y) ∈ X_1. In addition, we know that

    rank(X_2⊥ | X_1⊥) = N − rank(X_2) + N − rank(X_1)
                      = 2N − rank(X_1 | X_2) = rank((X_1 | X_2)⊥).

This implies that in fact

    X_2⊥ | X_1⊥ = (X_1 | X_2)⊥.        (2.4.17)

Next, we verify that

    X^RM(r, m)⊥ = X^RM(m − r − 1, m).

It is done by induction in m ≥ 3. By the above, and by the induction hypothesis,

    X^RM(r, m)⊥ = (X^RM(r, m − 1) | X^RM(r − 1, m − 1))⊥
                = X^RM(r − 1, m − 1)⊥ | X^RM(r, m − 1)⊥
                = X^RM(m − r − 1, m − 1) | X^RM(m − r − 2, m − 1)
                = X^RM(m − r − 1, m).
Definition 2.4.13 For 1 ≤ i_1 < ··· < i_k ≤ m, let C(i_1, . . . , i_k) denote the set of positions

    j = ∑_{1≤i≤m} j_i 2^{i−1}   with j_i = 0 for i ∉ {i_1, . . . , i_k}.        (2.4.18)

(2.4.19)

    ∑_{j∈C(i_1,...,i_k)} y_j v^(i_1) ∧ ··· ∧ v^(i_k)        (2.4.20)

(2.4.21)
We see that the information space H_{k,2} is embedded into H_{N,2} by identifying the entries a_j ↔ a_{i_1,...,i_l}, where j = j_0 2^0 + j_1 2^1 + ··· + j_{m−1} 2^{m−1} and i_1, . . . , i_l are the successive positions of the 1s among the binary digits of j, 1 ≤ l ≤ r. With such an identification we obtain:

Lemma 2.4.14

    ∑_{j∈C(i_1,...,i_l)} x_j = a_{i_1,...,i_l},  if l ≤ r,
                             = 0,                if l > r.        (2.4.22)
Proof
Lemma 2.4.15 For any t ∉ {i_1, . . . , i_r},

    a_{i_1,...,i_r} = ∑_{j∈C(i_1,...,i_r)+2^{t−1}} x_j.        (2.4.23)
Proof The proof follows from the fact that C(i_1, . . . , i_r, t) is the disjoint union C(i_1, . . . , i_r) ∪ (C(i_1, . . . , i_r) + 2^{t−1}) and the equation ∑_{j∈C(i_1,...,i_r,t)} x_j = 0 (cf. (2.4.19)).

Moreover:

Theorem 2.4.16 For any information symbol a_{i_1,...,i_r} corresponding to v^(i_1) ∧ ··· ∧ v^(i_r), we can split the set {0, . . . , N − 1} into 2^{m−r} disjoint subsets S, each containing 2^r elements, such that, for all such S, a_{i_1,...,i_r} = ∑_{j∈S} x_j.

Proof The list of sets S begins with C(i_1, . . . , i_r) and continues with the (m − r) disjoint sets C(i_1, . . . , i_r) + 2^{t−1} where 1 ≤ t ≤ m, t ∉ {i_1, . . . , i_r}. Next, we take any pair 1 ≤ t_1 < t_2 ≤ m such that {t_1, t_2} ∩ {i_1, . . . , i_r} = ∅. Then C(i_1, . . . , i_r, t_1, t_2) contains the disjoint sets C(i_1, . . . , i_r), C(i_1, . . . , i_r) + 2^{t_1−1} and C(i_1, . . . , i_r) + 2^{t_2−1}, and for each of them, a_{i_1,...,i_r} = ∑ x_j. Then the same is true for the remaining set

    C(i_1, . . . , i_r) + 2^{t_1−1} + 2^{t_2−1}
      = C(i_1, . . . , i_r, t_1, t_2) \ [ C(i_1, . . . , i_r) ∪ (C(i_1, . . . , i_r) + 2^{t_1−1}) ∪ (C(i_1, . . . , i_r) + 2^{t_2−1}) ].        (2.4.24)

(2.4.25)
Here each such set is labelled by a collection {t_1, . . . , t_s} where 0 ≤ s ≤ m − r, t_1 < ··· < t_s and {t_1, . . . , t_s} ∩ {i_1, . . . , i_r} = ∅. [The union over {t′_1, . . . , t′_{s′}} ⊂ {t_1, . . . , t_s} in (2.4.25) is over all (strict) subsets of {t_1, . . . , t_s}, with t′_1 < ··· < t′_{s′} and s′ = 0, . . . , s − 1 (s′ = 0 gives the empty subset).] The total number of sets obtained in this way equals 2^{m−r}, and each of them has 2^r elements by construction.
Theorem 2.4.16 provides a rationale for the so-called majority decoding for the Reed–Muller codes. Namely, upon receiving a word y = (y_0, . . . , y_{N−1}), produced from a codeword x ∈ X^RM_{r,m}, we take any 1 ≤ i_1 < ··· < i_r ≤ m and consider the sums ∑_{j∈S} y_j along the 2^{m−r} above sets S. If y ∈ X^RM_{r,m}, all these sums coincide and give the information symbol a_{i_1,...,i_r}. If y is not a codeword but the number of erroneous digits is < 2^{m−r−1}, then fewer than half of the sums are affected, and the majority of them still equals a_{i_1,...,i_r}; we take this majority value as the decoded symbol. Having recovered all symbols a_{i_1,...,i_r}, we form x^(1) = ∑ a_{i_1,...,i_r} v^(i_1) ∧ ··· ∧ v^(i_r); the word y + x^(1) will have δ(x + x^(1), y + x^(1)) = δ(x, y) errors, which is < 2^{m−r−1} ≤ d(X^RM_{r−1,m})/2. We can repeat the above procedure and obtain the correct a_{i_1,...,i_{r−1}} for any 1 ≤ i_1 < ··· < i_{r−1} ≤ m, etc. At the end, we recover the whole sequence of information symbols a_{i_1,...,i_r}.
Therefore, any word y ∈ H_{N,2} with distance δ(y, X^RM_{r,m}) < d(X^RM_{r,m})/2 is uniquely decoded.
Worked Example 2.4.18 The MDS codes [N, N, 1], [N, 1, N] and [N, N − 1, 2] always exist and are called trivial. Any [N, k] MDS code with 2 ≤ k ≤ N − 2 is called non-trivial. Show that there is no non-trivial MDS code over F_q with q ≤ k ≤ N − q. In particular, there is no non-trivial binary MDS code (which causes a discernible lack of enthusiasm about binary MDS codes).

Solution Indeed, the [N, N, 1], [N, N − 1, 2] and [N, 1, N] codes are MDS. Take q ≤ k ≤ N − q and assume X is a q-ary [N, k] MDS code. Take its generating matrix G in the standard form (I_k | G′) where G′ is k × (N − k), N − k ≥ q.
If some entry of G′ were zero then the corresponding column of G would be a linear combination of k − 1 columns of I_k. This is impossible by (b) in the previous example; hence G′ has no 0 entry. Next, assume that the first row of G′ is 1 . . . 1: otherwise we can perform scalar multiplications of columns, maintaining equivalence of codes. Now take the second row of G′: it is of length N − k ≥ q and has no 0 entry. Then it must contain repeated entries (as there are only q − 1 non-zero elements in F_q). That is,
together with some other sharp observations made at the end of the 1950s, particularly the invention of BCH codes, opened a connection from the theory of linear
codes (which was then at its initial stage) to algebra, particularly to the theory of finite fields. This created algebraic coding theory, a thriving direction in the modern
theory of linear codes.
We begin with binary cyclic codes. The coding and decoding procedures for
binary cyclic codes of length N are based on the related algebra of polynomials
with binary coefficients:
    a(X) = a_0 + a_1X + ··· + a_{N−1}X^{N−1},  where a_k ∈ F_2 for k = 0, . . . , N − 1.        (2.5.1)
Such polynomials can be added and multiplied in the usual fashion, except that
X k + X k = 0. This defines a binary polynomial algebra F2 [X]; the operations over
binary polynomials refer to this algebra. The degree deg a(X) of polynomial a(X)
equals the maximal label of its non-zero coefficient. The degree of the zero polynomial is set to be 0. Thus, the representation (2.5.1) covers polynomials of degree
< N.
Theorem 2.5.1
(b) (The division algorithm) Let f(X) and h(X) be two binary polynomials with h(X) ≢ 0. Then there exist unique polynomials g(X) and r(X) such that

    f(X) = g(X)h(X) + r(X)   with deg r(X) < deg h(X).        (2.5.2)
The polynomial g(X) is called the ratio, or quotient, and r(X) the remainder.
Proof
(a) The statement follows from the binomial decomposition where all
intermediate terms vanish.
(b) If deg h(X) > deg f (X) we simply set
f (X) = 0 h(X) + f (X).
If deg h(X) deg f (X), we can perform the standard procedure of long division, with the rules of the binary addition and multiplication.
Example 2.5.2
(a) (1 + X + X^3 + X^4)(X + X^2 + X^3) = X + X^7.
(b) 1 + X^N = (1 + X)(1 + X + ··· + X^{N−1}).
(c) The quotient of X + X^2 + X^6 + X^7 + X^8 divided by 1 + X + X^2 + X^4 equals X^3 + X^4; the remainder equals X + X^2 + X^3.
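Binary polynomial division is easy to mechanise. Our Python sketch encodes a polynomial as an integer (bit i = coefficient of X^i) and reproduces part (c):

    def poly_mul(a, b):
        result = 0
        while b:
            if b & 1:
                result ^= a
            a <<= 1
            b >>= 1
        return result

    def poly_divmod(f, h):
        """Long division of f by h over F_2; returns (quotient, remainder)."""
        q = 0
        while f and f.bit_length() >= h.bit_length():
            shift = f.bit_length() - h.bit_length()
            q ^= 1 << shift
            f ^= h << shift
        return q, f

    f = 0b111000110          # X + X^2 + X^6 + X^7 + X^8
    h = 0b10111              # 1 + X + X^2 + X^4
    q, r = poly_divmod(f, h)
    print(bin(q), bin(r))    # 0b11000 (X^3 + X^4) and 0b1110 (X + X^2 + X^3)
    assert poly_mul(q, h) ^ r == f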
Definition 2.5.3 Two polynomials, f_1(X) and f_2(X), are called equivalent mod h(X), written f_1(X) = f_2(X) mod h(X), if their remainders after division by h(X) coincide. That is,

    f_i(X) = g_i(X)h(X) + r(X),   i = 1, 2.        (2.5.3)

If, in addition, p_1(X) = p_2(X) mod h(X), with p_i(X) = q_i(X)h(X) + s(X), then the sums f_i(X) + p_i(X) and the products f_i(X)p_i(X), i = 1, 2, are also equivalent mod h(X).        (2.5.4)

Proof
We have, for i = 1, 2,
fi (X) = gi (X)h(X) + r(X),
with
deg r(X), deg s(X) < deg h(X).
Hence
fi (X) + pi (X) = (gi (X) + qi (X))h(X) + (r(X) + s(X))
with
deg(r(X) + s(X)) max[r(X), s(X)] < deg h(X).
Thus
f1 (X) + p1 (X) = f2 (X) + p2 (X) mod h(X).
Furthermore, for i = 1, 2, the product fi (X)pi (X) is represented as
gi (X)qi (X)h(X) + r(X)qi (X) + s(X)gi (X) h(X) + r(X)s(X).
Hence, the remainder for both polynomials f1 (X)p1 (X) and f2 (X)p2 (X) may come
only from r(X)s(X). Thus it is the same for both of them.
Note that every linear binary code X_N corresponds to a set of polynomials, with coefficients 0, 1, of degree ≤ N − 1, which is closed under addition mod 2:

    a(X) = a_0 + a_1X + ··· + a_{N−1}X^{N−1}  ↔  a^(N) = a_0 . . . a_{N−1},
    b(X) = b_0 + b_1X + ··· + b_{N−1}X^{N−1}  ↔  b^(N) = b_0 . . . b_{N−1},
    a(X) + b(X)  ↔  a^(N) + b^(N) = (a_0 + b_0) . . . (a_{N−1} + b_{N−1}).        (2.5.5)
    π^{−1}a(X) = (1/X)(a(X) + a_0) + a_0X^{N−1}.
Theorem 2.5.9 A binary cyclic code X contains, with each pair of polynomials a(X) and b(X), the sum a(X) + b(X) and any polynomial v(X)a(X) mod (1 + X^N).

Proof By linearity the sum a(X) + b(X) ∈ X. If v(X) = v_0 + v_1X + ··· + v_{N−1}X^{N−1} then each polynomial X^ka(X) mod (1 + X^N) corresponds to π^k a and hence belongs to X. As

    v(X)a(X) mod (1 + X^N) = ∑_{i=0}^{N−1} v_i X^i a(X) mod (1 + X^N),        (2.5.6)

the claim follows by linearity.

The polynomials of degree < N, with the multiplication a(X) ⊙ b(X) = a(X)b(X) mod (1 + X^N) and the usual F_2[X]-addition, form a commutative ring, denoted by F_2[X]/(1 + X^N). The binary cyclic codes are precisely the ideals of this ring.
Theorem 2.5.10
Proof
g(X)v(X) = vi X i g(X),
i=1
r < k.
Hence, each polynomial a(X) ∈ X is a linear combination of the polynomials g(X), Xg(X), . . . , X^{k−1}g(X) (all of which belong to X). On the other hand, the polynomials g(X), Xg(X), . . . , X^{k−1}g(X) have distinct degrees and hence are linearly independent. Therefore the words g, πg, . . . , π^{k−1}g, corresponding to g(X), Xg(X), . . . , X^{k−1}g(X), form a basis in X.
(iv) We know that each polynomial a(X) ∈ X has degree ≥ deg g(X). By the division algorithm,

    a(X) = v(X)g(X) + r(X).

Here, we must have deg v(X) < k and deg r(X) < deg g(X). But then v(X)g(X) belongs to X owing to Theorem 2.5.9 (as v(X)g(X) has degree ≤ N − 1, it coincides with v(X)g(X) mod (1 + X^N)). Hence,

    r(X) = a(X) + v(X)g(X) ∈ X

by linearity. As g(X) is the unique polynomial from X of minimum degree, r(X) = 0.
Corollary 2.5.11 Every binary cyclic code is obtained from the codeword corresponding to a polynomial of minimum degree, by cyclic shifts and linear combinations.
Definition 2.5.12 A polynomial g(X) of minimal degree in X is called a minimal degree generator of a (cyclic) binary code X , or briefly a generator of X .
Remark 2.5.13 There may be other polynomials that generate X in the sense
of Corollary 2.5.11. But the minimum degree polynomial is unique.
Theorem 2.5.14 A polynomial g(X) of degree ≤ N − 1 is the generator of a binary cyclic code of length N iff g(X) divides 1 + X^N, that is,

    1 + X^N = h(X)g(X)

for some polynomial h(X).

Proof (The only if part.) Let g(X) generate a cyclic code X. By the division algorithm, 1 + X^N = h(X)g(X) + r(X) with deg r(X) < deg g(X). That is,

    r(X) = h(X)g(X) + 1 + X^N,  i.e.  r(X) = h(X)g(X) mod (1 + X^N).        (2.5.7)

By Theorem 2.5.10, r(X) belongs to the cyclic code X generated by g(X). But g(X) must be the unique polynomial of minimum degree in X. Hence, r(X) = 0 and 1 + X^N = h(X)g(X).
(The if part.) Suppose that 1 + X^N = h(X)g(X), deg h(X) = N − deg g(X). Consider the set {a(X) : a(X) = u(X)g(X) mod (1 + X^N)}, i.e. the principal ideal, in the polynomial ring with the ⊙-multiplication, corresponding to g(X). This set forms a linear code; it contains g(X), Xg(X), . . . , X^{k−1}g(X) where k = deg h(X). It suffices to prove that X^kg(X) also belongs to the set. But X^kg(X) = (1 + X^N) + (X^k + h(X))g(X), and the polynomial X^k + h(X) has degree < k; hence X^kg(X) mod (1 + X^N) is a linear combination of g(X), Xg(X), . . . , X^{k−1}g(X).
    G^H_cycl = ( 1 1 0 1 0 0 0
                 0 1 1 0 1 0 0
                 0 0 1 1 0 1 0
                 0 0 0 1 1 0 1 )
and the cyclic shift of the last row is again in the code:

    π(0 0 0 1 1 0 1) = (1 0 0 0 1 1 0)
                     = (1 1 0 1 0 0 0) + (0 1 1 0 1 0 0) + (0 0 1 1 0 1 0).

By Lemma 2.5.6, the code is cyclic. By Theorem 2.5.10(iii), the generating polynomial g(X) corresponds to the framed part of the matrix G^H_cycl:

    1101  ↔  g(X) = 1 + X + X^3 = the generator.

But a similar argument can be used to show that an equivalent cyclic code is obtained from the word 1011 ↔ 1 + X^2 + X^3. There is no contradiction: it was not claimed that the polynomial ideal of a cyclic code is the principal ideal of a unique element.
If we choose a different order of the columns in the parity-check matrix, the code will be equivalent to the original code; that is, the code with the generator polynomial 1 + X^2 + X^3 is again a Hamming [7, 4] code.
In Problem 2.3 we will check that the Golay [23, 12] code is generated by the polynomial g(X) = 1 + X + X^5 + X^6 + X^7 + X^9 + X^11.
Worked Example 2.5.18 Using the factorisation

    X^7 + 1 = (X + 1)(X^3 + X + 1)(X^3 + X^2 + 1)        (2.5.8)

in F_2[X], find all cyclic binary codes of length 7. Identify those which are Hamming codes and their duals.

Solution See the table below.
    code X          generator of X                          generator of X⊥
    {0, 1}^7        1                                       1 + X^7
    parity-check    1 + X                                   1 + X + X^2 + X^3 + X^4 + X^5 + X^6
    Hamming         1 + X + X^3                             1 + X^2 + X^3 + X^4
    Hamming         1 + X^2 + X^3                           1 + X + X^2 + X^4
    dual Hamming    1 + X^2 + X^3 + X^4                     1 + X + X^3
    dual Hamming    1 + X + X^2 + X^4                       1 + X^2 + X^3
    repetition      1 + X + X^2 + X^3 + X^4 + X^5 + X^6     1 + X
    zero            1 + X^7                                 1
It is easy to check that all factors in (2.5.8) are irreducible. Any irreducible factor can either be included or not included in the decomposition of the generator polynomial. This argument proves that there exist exactly 8 binary cyclic codes in H_{7,2}, as demonstrated in the table.
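The enumeration can be replayed mechanically. Our Python sketch multiplies all subsets of the three irreducible factors of X^7 + 1 (polynomials as bit-masks) and prints the resulting generators and ranks:

    from itertools import combinations

    def poly_mul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            b >>= 1
        return r

    factors = {0b11: "1+X", 0b1011: "1+X+X^3", 0b1101: "1+X^2+X^3"}
    assert poly_mul(poly_mul(0b11, 0b1011), 0b1101) == (1 << 7) | 1   # X^7 + 1

    for k in range(len(factors) + 1):
        for subset in combinations(factors, k):
            g = 1
            for f in subset:
                g = poly_mul(g, f)
            rank = 7 - (g.bit_length() - 1)
            names = [factors[f] for f in subset] or ["1"]
            print(f"g = {' * '.join(names):35s} rank {rank}")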
Example 2.5.19 (a) Polynomials of the first degree, 1 + X and X, are irreducible (but X does not appear in the decomposition of 1 + X^N). There is one irreducible binary polynomial of degree 2: 1 + X + X^2; two of degree 3: 1 + X + X^3 and 1 + X^2 + X^3; and three of degree 4:

    1 + X + X^4,   1 + X^3 + X^4   and   1 + X + X^2 + X^3 + X^4,        (2.5.9)
    1 + X^4 + X^6 + X^7 + X^8   and   1 + X^2 + X^6 + X^8,        (2.5.10)

    1 + X^3,   1 + X^5,   1 + X^{11},   1 + X^{13}.
and
1 + X 25 : (1 + X + X 2 + X 3 + X 4 )(1 + X 5 + X 10 + X 15 + X 20 ).
For N even, 1 + X N can have multiple roots (see Example 2.5.35(c)).
Example 2.5.20 Irreducible polynomials of degree 2 and 3 over the field F3
(that is, from F3 [X]) are as follows. There exist three irreducible polynomials of
degree 2 over F3 : X 2 + 1, X 2 + X + 2 and X 2 + 2X + 2. There exist eight irreducible
polynomials of degree 3 over F3 : X 3 + 2X + 2, X 3 + X 2 + 2, X 3 + X 2 + X + 2, X 3 +
2X 2 + 2X + 2, X 3 + 2X + 1, X 3 + X 2 + 2X + 1, X 3 + 2X 2 + 1 and X 3 + 2X 2 + X + 1.
Cyclic codes admit encoding and decoding procedures in terms of the polynomials. It is convenient to have a generating matrix of a cyclic code X in a form similar to G^H_cycl for the Hamming [7, 4] code (see above). That is, we want to find the basis in X which gives the following picture in the corresponding generating matrix:

    G_cycl = ( g_0 g_1 . . . g_{N−k}  0  . . .  0
               0  g_0 g_1 . . . g_{N−k}  . . .  0
               ⋮                                ⋮
               0  . . .  0  g_0 g_1 . . . g_{N−k} ),        (2.5.11)

i.e. the rows correspond to the polynomials

    G_cycl = ( g(X)
               Xg(X)
               ⋮
               X^{k−1}g(X) ).        (2.5.12)

The code has rank k and may be used for encoding words of length k as follows. Given a word a = a_0 . . . a_{k−1}, we form the polynomial a(X) = ∑_{0≤i<k} a_iX^i and take
We see that the cyclic codes are naturally labelled by their generator polynomials.
Definition 2.5.23 Let X be the cyclic binary code of length N generated by
g(X). The check polynomial h(X) of X is defined as the ratio (1 + X N )/g(X).
That is, h(X) is a unique polynomial for which h(X)g(X) = 1 + X N .
We will use the standard notation gcd(f(X), g(X)) for the greatest common divisor of the polynomials f(X) and g(X), and lcm(f(X), g(X)) for their least common multiple. Denote by X_1 + X_2 the direct sum of two linear codes X_1, X_2 ⊆ H_{N,2}. That is, X_1 + X_2 consists of the linear combinations λ_1a^(1) + λ_2a^(2) where λ_1, λ_2 = 0, 1 and a^(i) ∈ X_i, i = 1, 2. Compare Example 2.1.8(vii).
Worked Example 2.5.24 Let X_1 and X_2 be two binary cyclic codes of length N, with generators g_1(X) and g_2(X). Prove that:
(a) X_1 ⊆ X_2 iff g_2(X) divides g_1(X);
(b) the intersection X_1 ∩ X_2 is a cyclic code generated by lcm(g_1(X), g_2(X));
(c) the sum X_1 + X_2 is a cyclic code generated by gcd(g_1(X), g_2(X)).
lcm(g1 (X), g2 (X)), the code generated by lcm(g1 (X), g2 (X)) must be strictly larger
than X1 X2 . This contradicts the definition of X1 X2 .
(c) Similarly, X1 + X2 is the minimal linear code containing both X1 and X2 .
Hence, its generator divides both g1 (X) and g2 (X), i.e. is their common divisor.
And if it is not equal to the gcd(g1 (X), g2 (X)) then it contradicts the above minimality property.
Worked Example 2.5.25 Let X be a binary cyclic code of length N with the
generator g(X) and the check polynomial h(X). Prove that a(X) X iff the polynomial (1 + X N ) divides a(X)h(X), i.e. a h(X) = 0 in F2 [X]/(1 + X N ).
Solution If a(X) X then a(X) = f (X)g(X) for some polynomial f (X)
F2 [X]/(1 + X N ). Then
a(X)h(X) = f (X)g(X)h(X) = f (X)(1 + X N )
which equals 0 in F2 [X]/(1 + X N ). Conversely, let a(X) F2 [X]/(1 + X N ) and
assume that a(X)h(X) = 0 mod (1 + X N ). Write a(X) = f (X)g(X) + r(X) where
deg r(X) <deg g(X). Then
a(X)h(X) = f (X)(1 + X N ) + r(X)h(X) = r(X)h(X) mod (1 + X N ).
Hence, r(X)h(X) = 0 mod (1 + X N ) which is only possible when r(X) = 0 (since
deg r(X)h(X) < N). Thus, a(X) = f (X)g(X) and a(X) X .
Worked Example 2.5.26 Prove that the dual of a binary cyclic code X is again cyclic, and find its generating matrix.
Solution If y X , the dual code, then the dot-product " x y# = 0 for all x X .
But " x y# = "x y#, i.e. y X , which means that X is cyclic.
Let g(X) = g0 + g1 X + + gNk X Nk be the generating polynomial for X ,
where N k = d is the degree of g(X) and k gives the rank of X . We know that
the generating matrix G of X may be written as

    G = ( g(X)
          Xg(X)
          ⋮
          X^{k−1}g(X) ).        (2.5.13)
'
g j hi j
j=0
= 1,
i = 0, N,
= 0,
1 i < N.
" j g j h # = 0 for j = 0, 1, . . . , N k 1, j = 0, . . . , k 1,
where h = hk hk1 . . . h0 . It is then easy to see that h gives rise to the generator
h (X) of X .
An alternative solution is based on Worked Example 2.5.25. We know that
a(X) X iff a h(X) = 0. Let k be the degree of g(X) then the degree of h(X)
equals N k. The degree deg[a(X)h(X)] is < 2N k, so the coefficients of X Nk ,
X Nk+1 , . . . , X N1 in a(X)h(X) all vanish. That is:
a0 hNk + a1 hNk1 + + aNk h0
a1 hNk + a2 hNk1 + + aNk+1 h0
..
.
= 0,
= 0,
..
.
Hence the matrix

    H = ( h*(X)
          Xh*(X)
          ⋮
          X^{N−k−1}h*(X) ),        (2.5.14)

where h*(X) = X^{N−k}h(X^{−1}), with the coefficient string h̄ = h_k h_{k−1} . . . h_0.
We conclude that the matrix H generates a code X′ ⊆ X⊥. But since the leading coefficient of h(X) equals 1, the rank of X′ equals N − k. Hence, X′ = X⊥.
It remains to check that polynomial h (X) divides 1 + X N . To this end,
we deduce from g(X)h(X) = 1 + X N that h(X 1 )g(X 1 ) = X N + 1. Hence
h (X)X k g(X 1 ) = 1 + X N , and as X k g(X 1 ) equals the polynomial gk + gk1 X +
+ g0 X k , the required fact follows.
2
Worked Example 2.5.27 Let X be a binary cyclic code of length N with generator g(X).
(a) Show that the set of codewords a X of even weight is a cyclic code and find
its generator.
(b) Show that X contains a codeword of odd weight iff g(1) = 0 or, equivalently,
word 1 X .
Solution (a) If the code X is even (i.e. contains only words of even weight) then every polynomial a(X) ∈ X has a(1) = ∑_{0≤i<N} a_i = 0. Hence, a(X) contains a factor (X + 1). Therefore, the generator g(X) has a factor (X + 1). The converse is also true: if (X + 1) divides g(X), or, equivalently, g(1) = 0, then every codeword a ∈ X is of even weight.
Now assume that X contains a word with an odd weight, i.e. g(1) = 1; that
is, (1 + X) does not divide g(X). Let X ev be the subcode in X formed by the
even codewords. A cyclic shift does not change the weight, so X ev is a cyclic
code. For the corresponding polynomials a(X) we have, as before, that (1 + X)
divides a(X). Thus, the generator gev (X) of X ev is divisible by (1 + X), hence
gev (X) = g(X)(X + 1).
(b) It remains to show that g(1) = 1 iff the word 1 ∈ X. The corresponding polynomial is 1 + ··· + X^{N−1}, the complementary factor to (1 + X) in the decomposition 1 + X^N = (1 + X)(1 + ··· + X^{N−1}). So, if g(1) = 1, i.e. g(X) does not contain the factor (1 + X), then g(X) must be a divisor of 1 + ··· + X^{N−1}. This implies that 1 ∈ X. The converse statement is established in a similar manner.
Worked Example 2.5.28 Let X be a binary cyclic code of length N with generator g(X) and check polynomial h(X).
(a) Prove that X is self-orthogonal iff h*(X) divides g(X), and self-dual iff h*(X) = g(X), where h*(X) = h_k + h_{k−1}X + ··· + h_0X^k and h(X) = h_0 + ··· + h_{k−1}X^{k−1} + h_kX^k is the check polynomial, with g(X)h(X) = 1 + X^N.
(b) Let r be a divisor of N : r|N . A binary code X is called r-degenerate if every
codeword a X is a concatenation c . . . c where c is a string of length r. Prove that
X is r-degenerate iff h(X) divides (1 + X r ).
Solution (a) Self-orthogonality means that X X , i.e. "a b# = 0 for all
a, b X . From Worked Example 2.5.26 we know that h (X) gives the generator polynomial of X . Then, by virtue of Worked Example 2.5.26, X X iff
h (X) divides g(X).
Self-duality means that X = X , that is h (X) = g(X).
(b) For N = rs, we have the decomposition
1 + X N = (1 + X r )(1 + X r + + X r(s1) ).
Now assume cyclic code X of length N with generator g(X) is r-degenerate. Then
the word g is of the form 1c1c . . . 1c for some string c of length r 1 (with c = 1c).
Let c(X) be the polynomial corresponding to c (of degree r 2). Then g(X) is
given by
1 + X c(X) + X r + X r+1 c(X) + + X r(s1) + X r(s1)+1 c(X)
= (1 + X r + + X r(s1) )[1 + X c(X)].
For the check polynomial h(X) we obtain

    h(X) = (1 + X^N) / [(1 + X^r + ··· + X^{r(s−1)})(1 + Xc(X))]
         = (1 + X^r) / (1 + Xc(X)),

i.e. h(X) is a divisor of (1 + X^r).
Conversely, let h(X)|(1 + X r ), with h(X)g(X) = 1 + X r where g(X) =
c j X j , with c0 = 1. Take c = c0 . . . cr1 ; repeating the above argument in
0 jr1
the reverse order, we conclude that the word g is the concatenation c . . . c. Then the
cyclic shift g is the concatenation c(1) . . . c(1) where c(1) = cr1 c0 . . . cr2 (= c,
the cyclic shift of c in {0, 1}r ). Similarly, for subsequent cyclic shift iterations
2 g, . . .. Hence, the basis vectors in X are r-degenerate, and so is the whole of X .
In the standard arithmetic, a (real or complex) polynomial p(X) of a given degree d is conveniently identified through its roots (or zeros) α_1, . . . , α_d (in general, complex), by means of the monomial decomposition p(X) = p_d ∏_{1≤i≤d}(X − α_i). In the binary arithmetic (and, more generally, the q-ary arithmetic), the roots of polynomials are still an extremely useful concept. In our situation, the roots help to construct the generator polynomial g(X) = ∑_{0≤i≤d} g_iX^i of a binary cyclic code with important predicted properties. Assume for the moment that the roots α_1, . . . , α_d of g(X) are a well-defined object, and the representation

    g(X) = ∏_{1≤i≤d}(X − α_i)

has a consistent meaning (which is provided within the framework of finite fields). Even without knowing the formal theory, we are able to make a couple of helpful observations.
Even without knowing the formal theory, we are able to make a couple of helpful
observations.
The first observation is that the α_i are Nth roots of unity, as they should be among the zeros of the polynomial 1 + X^N. Hence, they can be multiplied and inverted, i.e. they form an Abelian multiplicative group of size N, perhaps cyclic. Second, in the binary arithmetic, if α is a zero of g(X) then so is α^2, as g(X)^2 = g(X^2). Then α^4 is also a zero, as well as α^8, and so on. We conclude that the sequence α, α^2, . . . begins cycling: α^{2^d} = α (or α^{2^d−1} = 1) where d is the degree of g(X). That is, all Nth roots of unity split into disjoint classes, of the form C = {α, α^2, . . . , α^{2^{c−1}}}, of size c, where c = c(C) is a positive integer (with 2^c − 1 dividing N). The notation C(α) is instructive, with c = c(α). The members of the same class are said to be conjugate to each other. If we want a generating polynomial with root α then all conjugate roots of unity from C(α) will also be among the roots of g(X).
Thus, to form a generator g(X) we have to borrow roots from classes C and
enlist, with each borrowed root of unity, all members of their classes. Then, since
any polynomial a(X) from the cyclic code generated by g(X) is a multiple of g(X)
(see Theorem 2.5.10(iv)), the roots of g(X) will be among the roots of a(X). Conversely, if a(X) has roots i of g(X) among its roots then a(X) is in the code. We
see that cyclic codes are conveniently described in terms of roots of unity.
Example 2.5.29 (The Hamming [7, 4] code) Recall that the parity-check matrix H for the binary Hamming [7, 4] code X^H is 3 × 7; it lists as its columns all non-zero binary words of length 3: different orderings of these columns define equivalent codes. Later in this section we explain that the sequence of non-zero binary words of any given length 2^ℓ − 1, written in some particular order (or orders), can be interpreted as a sequence of powers of a single element ω: ω^0, ω, ω^2, . . . , ω^{2^ℓ−2}. The multiplication rule generating these powers is of a special type (multiplication of polynomials modulo a particular irreducible polynomial of degree ℓ). To stress this fact, we use in this section the notation ⊙ for this multiplication rule, writing ω^{⊙i} in place of ω^i. Anyway, for ℓ = 3, one appropriate order of the binary non-zero 3-words (out of the two possible orders) is

    H = ( 0 0 1 0 1 1 1
          0 1 0 1 1 1 0     ≅  ( ω^0 ω ω^{⊙2} ω^{⊙3} ω^{⊙4} ω^{⊙5} ω^{⊙6} ).
          1 0 0 1 0 1 1 )

Then, with this interpretation, the equation aH^T = 0, determining that the word a = a_0 . . . a_6 (or its polynomial a(X) = ∑_{0≤i<7} a_iX^i) lies in X^H, can be rewritten as

    ∑_{0≤i<7} a_i ω^{⊙i} = 0,   or   a(ω) = 0.

In other words, a(X) ∈ X^H iff ω is a root of a(X) under the multiplication rule ⊙ (which in this case is multiplication of binary polynomials of degree ≤ 2 modulo the polynomial 1 + X + X^3).
The last statement can be rephrased in this way: the Hamming [7, 4] code is equivalent to the cyclic code with the generator g(X) that has ω among its roots; in this case the generator is g(X) = 1 + X + X^3, with g(ω) = ω^0 + ω + ω^{⊙3} = 0. The alternative ordering of the columns of H^H is related in the same fashion to the polynomial 1 + X^2 + X^3.
We see that the Hamming [7, 4] code is defined by a single root ω, provided that we establish proper terms of operation with its powers. For that reason we can call ω the defining root (or defining zero) for this code. There are reasons to call the element ω primitive; cf. Sections 3.1–3.3.
Worked Example 2.5.30 A code X is called reversible if a = a_0a_1 . . . a_{N−1} ∈ X implies that a′ = a_{N−1} . . . a_1a_0 ∈ X. Prove that a cyclic code with generator g(X) is reversible iff g(α) = 0 implies g(α^{−1}) = 0.

Solution For the generator polynomial g(X) = ∑_{0≤i≤d} g_iX^i, with deg g(X) = d < N,
(2.5.15)
This implies that either g(X) | f(X) or g(X) | (h_1(X) + h_2(X)). We conclude that if the polynomial g(X) is irreducible, (2.5.15) is impossible, unless h_1(X) = h_2(X) and v(X) = 0. For one and only one polynomial h(X) we have

    f(X)h(X) = 1 mod g(X);

h(X) represents the inverse for f(X) in multiplication mod g(X). We write h(X) = f(X)^{−1}.
On the other hand, if g(X) is reducible, then g(X) = b(X)b′(X) where both b(X) and b′(X) are non-zero and have degree < d. That is, b(X)b′(X) = 0 mod g(X). If the multiplication mod g(X) led to a field, both b(X) and b′(X) would have inverses, b(X)^{−1} and b′(X)^{−1}. But then

    b(X)^{−1} ⊙ b(X) ⊙ b′(X) = b′(X) = 0,

and similarly b(X) = 0.
A field obtained via the above construction is called a polynomial field and is
often denoted by F2 [X]/"g(X)#. It contains 2d elements where d = deg g(X) (representing polynomials of degree < d). We will call g(X) the core polynomial of the
field. For the rest of this section we denote the multiplication in a given polynomial
field by . The zero polynomial and the unit polynomial are denoted correspondingly, by 0 and 1: they are obviously the zero and the unity of the polynomial field.
A key role is played by the following result.
Theorem 2.5.33 (a) The multiplicative group of non-zero elements in the polynomial field F_2[X]/⟨g(X)⟩ is isomorphic to the cyclic group Z_{2^d−1} of size 2^d − 1.
(b) The polynomial fields obtained by picking different irreducible polynomials of degree d are all isomorphic.

Proof We will only prove here assertion (a); assertion (b) will be established in Section 3.1. Take any element from the field, a(X) ∈ F_2[X]/⟨g(X)⟩, and observe that

    a^{⊙i}(X) := a(X) ⊙ ··· ⊙ a(X)   (i times)

(the multiplication in the field) takes at most 2^d − 1 values (the number of elements in the field less one, as the zero 0 is excluded). Hence there exists a positive integer r such that a^{⊙r}(X) = 1; the smallest such value of r is called the order of a(X).
Let r be the maximal order among the non-zero elements, attained at an element a(X), and let b(X) be any non-zero element, of order s; we check that s divides r. For a prime p, write

    s = p^{c′}l′   and   r = p^{c}l,

with integers c′, c ≥ 0 and l, l′ ≥ 1, where l and l′ are not divisible by p. We want to show that c′ ≤ c. Indeed, the element a^{⊙p^c}(X) has order l, b^{⊙l′}(X) has order p^{c′}, and the product a^{⊙p^c} ⊙ b^{⊙l′}(X) has order lp^{c′}. Hence, c′ ≤ c, or else r would not be maximal. This is true for any prime p, hence s divides r.
Thus, with r being the maximal order, every non-zero element b(X) in the field obeys b^{⊙r}(X) = 1. By using the pigeon-hole principle, we conclude that r = 2^d − 1, the number of non-zero elements of the field. Hence, with a(X) being an element of order r, the powers 1, a(X), . . . , a^{⊙(2^d−1)}(X) exhaust the multiplicative group of the field.
In the wake of Theorem 2.5.33, we can use the notation F_{2^d} for any polynomial field F_2[X]/⟨g(X)⟩ where g(X) is an irreducible binary polynomial of degree d. Further, the multiplicative group of non-zero elements in F_{2^d} is denoted by F*_{2^d}; it is cyclic (≅ Z_{2^d−1}, according to Theorem 2.5.33). Any generator of the group F*_{2^d} (whose ⊙-powers exhaust F*_{2^d}) is called a primitive element of the field F_{2^d}.
Example 2.5.34 We can see the importance of writing down the full list of irreducible polynomials. There are six irreducible binary polynomials of degree 5
(each of which is primitive):
1 + X 2 + X 5, 1 + X 3 + X 5, 1 + X + X 2 + X 3 + X 5,
1 + X + X 2 + X 4 + X 5, 1 + X + X 3 + X 4 + X 5,
1 + X2 + X3 + X4 + X5
(2.5.16)
(2.5.17)
The number of irreducible polynomials grows significantly with the degree: there
are 18 of degree 7, 30 of degree 8, and so on. However, there exist and are available
quite extensive tables of irreducible polynomials over various finite fields.
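The counts quoted in this example are easy to reproduce by brute force. Our Python sketch treats polynomials as bit-masks and tests irreducibility by trial division with all divisors of degree at most half the degree:

    def poly_mod(f, h):
        while f and f.bit_length() >= h.bit_length():
            f ^= h << (f.bit_length() - h.bit_length())
        return f

    def is_irreducible(p):
        d = p.bit_length() - 1
        for q in range(2, 1 << (d // 2 + 1)):    # candidate divisors, deg >= 1
            if poly_mod(p, q) == 0:
                return False
        return True

    for d in (2, 3, 4, 5, 7, 8):
        count = sum(is_irreducible((1 << d) | p) for p in range(1 << d))
        print(d, count)    # expected: 1, 2, 3, 6, 18, 30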
    Core polynomial 1 + X + X^3              Core polynomial 1 + X^2 + X^3
    power    polynomial     word             power    polynomial     word
    --       0              000              --       0              000
    X^⊙0     1              100              X^⊙0     1              100
    X        X              010              X        X              010
    X^⊙2     X^2            001              X^⊙2     X^2            001
    X^⊙3     1 + X          110              X^⊙3     1 + X^2        101
    X^⊙4     X + X^2        011              X^⊙4     1 + X + X^2    111
    X^⊙5     1 + X + X^2    111              X^⊙5     1 + X          110
    X^⊙6     1 + X^2        101              X^⊙6     X + X^2        011

                                Figure 2.6
    power     polynomial               coefficient string
    --        0                        0000
    X^⊙0      1                        1000
    X         X                        0100
    X^⊙2      X^2                      0010
    X^⊙3      X^3                      0001
    X^⊙4      1 + X                    1100
    X^⊙5      X + X^2                  0110
    X^⊙6      X^2 + X^3                0011
    X^⊙7      1 + X + X^3              1101
    X^⊙8      1 + X^2                  1010
    X^⊙9      X + X^3                  0101
    X^⊙10     1 + X + X^2              1110
    X^⊙11     X + X^2 + X^3            0111
    X^⊙12     1 + X + X^2 + X^3        1111
    X^⊙13     1 + X^2 + X^3            1011
    X^⊙14     1 + X^3                  1001

                                Figure 2.7
(c) The field F_2[X]/⟨1 + X + X^4⟩ contains 16 elements. The field table is given in Figure 2.7. In this case, the multiplicative group is ≅ Z_{15}, and the field can be denoted by F_16. As above, the element X ∈ F_2[X]/⟨1 + X + X^4⟩ yields a root of the polynomial 1 + X + X^4; the other roots are X^{⊙2}, X^{⊙4} and X^{⊙8}.
This example can be used to identify the Hamming [15, 11] code as (an equivalent to) the cyclic code with generator g(X) = 1 + X + X^4. We can now say that the Hamming [15, 11] code is (modulo equivalence) the cyclic code of length 15 with the defining root ω (= X) in the field F_2[X]/⟨1 + X + X^4⟩. As X is a generator of the multiplicative group of the field, we again could say that the defining root is a primitive element in F_16.
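The table in Figure 2.7 can be regenerated by repeatedly multiplying by X modulo the core polynomial; a short Python sketch of ours (elements as bit-masks, bit i = coefficient of X^i):

    CORE = 0b10011            # 1 + X + X^4
    d = 4

    def times_x(a):
        a <<= 1
        if a >> d:            # reduce modulo the core polynomial
            a ^= CORE
        return a

    a = 1                     # X^0
    for i in range(2**d - 1):
        word = "".join(str((a >> j) & 1) for j in range(d))
        print(f"X^{i:<2d} -> {word}")
        a = times_x(a)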
In general, take the field F_2[X]/⟨g(X)⟩, where g(X) = ∑_{0≤i≤d} g_iX^i is an irreducible binary polynomial of degree d. Then the powers X, X^{⊙2}, X^{⊙4}, . . . , X^{⊙2^{d−1}} will satisfy the equation g(·) = 0, i.e. they are the roots of the core polynomial. However, the ⊙-powers of X need not exhaust the whole multiplicative group in general: it only happens when g(X) is a primitive binary polynomial; for the detailed discussion of this property see Sections 3.1–3.3. For a primitive core polynomial g(X) we have, in addition, that the powers X^{⊙i} for i < d = deg g(X) coincide with X^i, while the further powers X^{⊙i}, d ≤ i ≤ 2^d − 1, are relatively easy to calculate.
With this in mind, we can pass to a general binary Hamming code.
Example 2.5.36 Let X^H be the binary Hamming [2^ℓ − 1, 2^ℓ − 1 − ℓ] code. We know that its parity-check matrix H features all non-zero column-vectors of length ℓ. These vectors, written in a particular order, list the consecutive powers ω^{⊙i}, i = 0, 1, . . . , 2^ℓ − 2, in the field F_2[X]/⟨g(X)⟩ where ω = X and g(X) = g_0 + g_1X + ··· + g_{ℓ−1}X^{ℓ−1} + X^ℓ is a primitive polynomial of degree ℓ. Thus,

    H = ( 1 0 . . . 0   g_0       . . .
          0 1 . . . 0   g_1       . . .
          ⋮ ⋮       ⋮   ⋮
          0 0 . . . 1   g_{ℓ−1}   . . . ),        (2.5.18)

or H ≅ ( ω^{⊙0}  ω^{⊙1}  . . .  ω^{⊙(2^ℓ−2)} ).
Hence, as before, the equation aH^T = 0 for a codeword is equivalent to the equation a(ω) = 0 for the corresponding polynomial. So, we can say that a(X) ∈ X^H iff ω is among the roots of a(X).
On the other hand, by construction, ω is a root of g(X): g(ω) = 0. Thus, we identify the Hamming [2^ℓ − 1, 2^ℓ − 1 − ℓ] code as equivalent to the cyclic code of length 2^ℓ − 1 with the generator polynomial g(X) (with the defining root ω). The role of ω can be played by any conjugate element from {ω, ω^2, . . . , ω^{2^{ℓ−1}}}.
The above idea leads to an immediate (and far-reaching) generalisation. Take N = 2^ℓ − 1 and let ω be a primitive element of the field F_{2^ℓ} ≅ F_2[X]/⟨g(X)⟩ where g(X) is a primitive polynomial. (In all the examples and problems from this chapter, this requirement is fulfilled.) Consider a defining set of roots: to start with, of the form ω, ω^2, ω^3, but more generally, ω, ω^2, . . . , ω^{δ−1}. (Using the parameter δ, an integer ≥ 3, is a tradition here.) Consider the cyclic code with these roots: what can we say about it? With the length N = 2^ℓ − 1, we can guess that it will yield a subcode of the Hamming [2^ℓ − 1, 2^ℓ − 1 − ℓ] code, and that it may correct more than a single error. This is the gist of the so-called (binary) BCH code construction (Bose–Chaudhuri, Hocquenghem, 1959).
In this section we restrict ourselves to a brief introduction to the BCH codes; in greater detail and generality these codes are discussed in Section 3.2. For N = 2^ℓ − 1 the field F_{2^ℓ} ≅ F_2[X]/⟨g(X)⟩ has the property that its non-zero elements are the Nth roots of unity (i.e. the zeros of the polynomial 1 + X^N). In other words,
    1 + X^N = ∏_{1≤j≤N} (X − α_j),

where the α_j list the whole of F*_{2^ℓ}. (In the terminology of Section 3.1, F_{2^ℓ} is the splitting field for 1 + X^N over F_2.) As before, we use the notation ω := X for the generator of the multiplicative cyclic group F*_{2^ℓ}. (In fact, it could be any generator of this group.)
Because ω^N = 1 and the power N is minimal with this property, the element ω is often called a primitive Nth root of unity. Consequently, the powers ω^k for 0 ≤ k < N yield distinct elements of the field. This fact is used below when we conclude that the product ∏_{1≤i<j≤δ−1}(ω^{k_j} − ω^{k_i}) ≠ 0. The binary BCH code of length N with designed distance δ is the cyclic code whose generator is

    g(X) = lcm(M_ω(X), M_{ω^2}(X), . . . , M_{ω^{δ−1}}(X)).        (2.5.20)

Here lcm stands for the least common multiple and M_φ(X) denotes the minimal binary polynomial with root φ. For brevity, we will use in this chapter the term binary BCH codes. (A more general class of BCH codes will be introduced in Section 3.2.)
Example 2.5.38 For N = 7, the appropriate polynomial field is F_2[X]/⟨1 + X + X^3⟩ or F_2[X]/⟨1 + X^2 + X^3⟩, i.e. one of the two realisations of the field F_8. Since 7 is a prime number, any non-zero, non-unit element of the field has multiplicative order 7, i.e. is a generator of the multiplicative group. In fact, we have the decomposition of the polynomial 1 + X^7 into irreducible factors:

    1 + X^7 = (1 + X)(1 + X + X^3)(1 + X^2 + X^3).

Further, if we choose the polynomial field F_2[X]/⟨1 + X + X^3⟩ then ω = X satisfies

    ω^3 = 1 + ω,   (ω^2)^3 = 1 + ω^2,   (ω^4)^3 = 1 + ω^4,

i.e. the conjugates ω, ω^2 and ω^4 are the roots of the core polynomial 1 + X + X^3:

    1 + X + X^3 = (X − ω)(X − ω^2)(X − ω^4).

Next, ω^3, ω^6 and ω^{12} = ω^5 are the roots of 1 + X^2 + X^3:

    1 + X^2 + X^3 = (X − ω^3)(X − ω^5)(X − ω^6).

Hence, the binary BCH code of length 7 with designed distance 3 is formed by the binary polynomials a(X) of degree ≤ 6 such that

    a(ω) = a(ω^2) = 0,   that is, a(X) is a multiple of 1 + X + X^3.

This code is equivalent to the Hamming [7, 4] code; in particular its true distance equals 3.
Next, the binary BCH code of length 7 with designed distance 4 is formed by the binary polynomials a(X) of degree ≤ 6 such that

    a(ω) = a(ω^2) = a(ω^3) = 0,   that is, a(X) is a multiple of
    (1 + X + X^3)(1 + X^2 + X^3) = 1 + X + X^2 + X^3 + X^4 + X^5 + X^6.

This is simply the repetition code R_7.
The staple of the theory of the BCH codes is

Theorem 2.5.39 (The BCH bound) The minimal distance of a binary BCH code with designed distance δ is at least δ.

The proof of Theorem 2.5.39 (sometimes referred to as the BCH theorem) is based on the following result.
Lemma 2.5.40 Consider the m × m Vandermonde determinant with entries from a commutative ring:

    det ( κ_1    κ_2    . . .  κ_m
          κ_1^2  κ_2^2  . . .  κ_m^2
          ⋮      ⋮              ⋮
          κ_1^m  κ_2^m  . . .  κ_m^m )
      = det ( κ_1  κ_1^2  . . .  κ_1^m
              κ_2  κ_2^2  . . .  κ_2^m
              ⋮    ⋮              ⋮
              κ_m  κ_m^2  . . .  κ_m^m )        (2.5.21)
      = ∏_{1≤l≤m} κ_l  ∏_{1≤i<j≤m} (κ_j − κ_i).        (2.5.22)
Proof of Theorem 2.5.39 Suppose a(X) = a_0 + a_1X + ··· + a_{N−1}X^{N−1} is a codeword of the BCH code X with designed distance δ, i.e. a(ω) = ··· = a(ω^{δ−1}) = 0. In matrix form,

    ( 1  ω        ω^2          . . .  ω^{N−1}
      1  ω^2      ω^4          . . .  ω^{2(N−1)}
      ⋮  ⋮        ⋮                    ⋮
      1  ω^{δ−1}  ω^{2(δ−1)}   . . .  ω^{(N−1)(δ−1)} )  ( a_0
                                                          a_1
                                                          ⋮
                                                          a_{N−1} )  = 0.

Due to Lemma 2.5.40, any (δ − 1) columns of this ((δ − 1) × N) matrix are linearly independent. Hence, there must be at least δ non-zero coefficients in a(X). Thus, the distance of X is ≥ δ.
Example 2.5.41 (Here a mistake in [18], p. 106, is corrected.) (a) Consider a BCH code with N = 15 and δ = 5. Use the following decomposition into irreducible polynomials:

    X^15 − 1 = (X + 1)(X^2 + X + 1)(X^4 + X + 1)(X^4 + X^3 + 1)(X^4 + X^3 + X^2 + X + 1).

The generator of the code is

    g(X) = (X^4 + X + 1)(X^4 + X^3 + X^2 + X + 1) = X^8 + X^7 + X^6 + X^4 + 1.

Indeed, g(ω^3) = g(ω^9) = 0. The set of zeros of X^4 + X^3 + X^2 + X + 1 is (ω^3, ω^6, ω^9, ω^{12}). The set of zeros of X^4 + X + 1 is (ω, ω^2, ω^4, ω^8). The set of zeros of X^4 + X^3 + 1 is (ω^7, ω^{14}, ω^{13}, ω^{11}). The set of zeros of X^2 + X + 1 is (ω^5, ω^{10}).
(b) Let N = 31 and ω be a primitive element of F_32. The minimal polynomial with root ω is

    M_ω(X) = (X − ω)(X − ω^2)(X − ω^4)(X − ω^8)(X − ω^{16}).

We find also the minimal polynomial for ω^5:

    M_{ω^5}(X) = (X − ω^5)(X − ω^{10})(X − ω^{20})(X − ω^9)(X − ω^{18}).

By definition, the generator of the BCH code of length 31 with designed distance δ = 8 is g(X) = lcm(M_ω(X), M_{ω^3}(X), M_{ω^5}(X), M_{ω^7}(X)). The minimal distance of this BCH code (which is, obviously, at least 9) is in fact at least 11. This follows from Theorem 2.5.39, because all the powers ω, ω^2, . . . , ω^{10} are listed among the roots of g(X).
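The computation in part (a) can be replayed with a short program. Our Python sketch realises F_16 as F_2[X]/⟨1 + X + X^4⟩ (field elements as 4-bit masks), forms minimal polynomials as products over conjugacy classes, and recovers the generator X^8 + X^7 + X^6 + X^4 + 1.

    CORE, DEG = 0b10011, 4

    def gf_mul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            b >>= 1
            a <<= 1
            if a >> DEG:
                a ^= CORE
        return r

    def gf_pow(a, n):
        r = 1
        for _ in range(n):
            r = gf_mul(r, a)
        return r

    def min_poly(elem):
        """Minimal binary polynomial of elem = prod over conjugates of (X + c)."""
        conj, c = [], elem
        while c not in conj:
            conj.append(c)
            c = gf_mul(c, c)
        p = [1]                                   # ascending coefficients
        for c in conj:
            q = [0] * (len(p) + 1)
            for i, pi in enumerate(p):
                q[i] ^= gf_mul(c, pi)
                q[i + 1] ^= pi
            p = q
        return p                                  # entries are 0/1

    def poly_mul_f2(a, b):
        q = [0] * (len(a) + len(b) - 1)
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                q[i + j] ^= ai & bj
        return q

    omega = 0b0010                                # omega = X
    m1 = min_poly(omega)                          # 1 + X + X^4
    m3 = min_poly(gf_pow(omega, 3))               # 1 + X + X^2 + X^3 + X^4
    g = poly_mul_f2(m1, m3)                       # lcm of M_w, M_w^2, M_w^3, M_w^4
    print(g)   # ascending coefficients of 1 + X^4 + X^6 + X^7 + X^8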
There exists a decoding procedure for a BCH code which is simple to implement: it generalises the Hamming code decoding procedure. In view of Theorem 2.5.39, the BCH code with designed distance δ corrects at least t = ⌊(δ − 1)/2⌋ errors. Suppose a codeword c = c_0 . . . c_{N−1} has been sent and corrupted to r = c + e where e = e_0 . . . e_{N−1}. Assume that e has at most t non-zero entries. Introduce the corresponding polynomials c(X), r(X) and e(X), all of degrees < N. For c(X) we have that c(ω) = c(ω^2) = ··· = c(ω^{δ−1}) = 0. Then, clearly,

    r(ω) = e(ω),  r(ω^2) = e(ω^2),  . . . ,  r(ω^{δ−1}) = e(ω^{δ−1}).        (2.5.23)

So, we calculate r(ω^i) for i = 1, . . . , δ − 1. If these are all 0, r(X) ∈ X (no error or at least t + 1 errors). Otherwise, let E = {i : e_i = 1} indicate the erroneous digits and assume that 0 < ♯E ≤ t. Introduce the error locator polynomial

    σ(X) = ∏_{i∈E} (1 − ω^i X),        (2.5.24)

with coefficients from F_{2^ℓ}, of degree ♯E and with the lowest coefficient 1. If we know σ(X), we can find which powers ω^{−i} are its roots and hence find the erroneous digits i ∈ E. We then simply change these digits and correct the errors.
In order to calculate σ(X), consider the formal power series

    ρ(X) = ∑_{j≥1} e(ω^j) X^j.

(Observe that, as ω^N = 1, the coefficients of this power series recur.) For the initial (δ − 1) coefficients we have the equalities, by virtue of (2.5.23),

    e(ω^j) = r(ω^j),   j = 1, . . . , δ − 1;

these are the only ones needed for our purpose, and they are calculated in terms of the received word r.
Now set

    ξ(X) = ∑_{i∈E} ω^i X ∏_{j∈E: j≠i} (1 − ω^j X).        (2.5.25)

Then

    ρ(X) = ∑_{j≥1} ∑_{i∈E} ω^{ij} X^j = ∑_{i∈E} ∑_{j≥1} (ω^i X)^j
         = ∑_{i∈E} ω^i X/(1 − ω^i X) = ξ(X)/σ(X).        (2.5.26)
Multiplying by σ(X), we obtain ξ(X) = σ(X)ρ(X); namely,

    (σ_0 + σ_1X + ··· + σ_tX^t)(r(ω)X + ··· + r(ω^{2t})X^{2t} + e(ω^{2t+1})X^{2t+1} + ···)
        = ξ_0 + ξ_1X + ··· + ξ_tX^t.        (2.5.27)

We are interested in the coefficients of X^k for t < k ≤ 2t: these satisfy

    ∑_{0≤j≤t} σ_j r(ω^{k−j}) = 0,        (2.5.28)

which does not involve any of the terms e(ω^l). We obtain the following equations:

    ( r(ω^{t+1})  r(ω^t)      . . .  r(ω)       ( σ_0
      r(ω^{t+2})  r(ω^{t+1})  . . .  r(ω^2)       σ_1
      ⋮           ⋮                  ⋮            ⋮
      r(ω^{2t})   r(ω^{2t−1}) . . .  r(ω^t) )     σ_t )  = 0.

The above matrix is t × (t + 1), so it always has a non-zero vector in the kernel; this vector identifies the error locator polynomial σ(X). We see that the above routine (called the Berlekamp–Massey decoding algorithm) enables us to specify the set E and hence correct ≤ t errors.
Unfortunately, the BCH codes are asymptotically bad: for any sequence of BCH codes of length N → ∞, either k/N → 0 or d/N → 0. In other words, they lie at the bottom of Figure 2.2. To obtain codes that meet the Gilbert–Varshamov (GV) bound, one needs more powerful methods, based on algebraic geometry. Such codes were constructed in the early 1970s (the Goppa and Justesen codes). It remains a problem to construct codes that lie above the Gilbert–Varshamov curve. As was mentioned on page 160, a new class of codes was invented in 1982 by Tsfasman, Vladut and Zink; these codes lie above the GV curve when the number of symbols in the code alphabet is large. However, for binary codes, the problem is still waiting for a solution.
Worked Example 2.5.42 Compute the rank and minimum distance of the cyclic code with generator polynomial g(X) = X^3 + X + 1 and parity-check polynomial h(X) = X^4 + X^2 + X + 1. Now let ω be a root of g(X) in the field F_8. We receive the word r(X) = X^5 + X^3 + X (mod X^7 − 1). Verify that r(ω) = ω^4, and hence decode r(X) using minimum-distance decoding.
Solution A cyclic code X of length N has generator polynomial g(X) F2 [X]
and parity-check polynomial h(X) F2 [X] with g(X)h(X) = 1 + X N . Recall that
if g(X) has degree k, i.e. g(X) = a0 + a1 X + + ak X k where ak = 0, then g(X),
241
Xg(X), . . . , X Nk1 g(X) form a basis for X . In particular, the rank of X equals
N k. In this example, N = 7, k = 3 and rank(X ) = 4.
If h(X) = b0 + b1 X + + bNk X Nk then the parity-check matrix H for X has
the form
...
b1
b0
0
... 0 0
bNk bNk1
0
b0
... 0 0
bNk bNk1 . . . b1
.
.
.
.
.
.
..
..
..
..
..
..
...
bNk bNk1 . . . b1 b0
;<
=
N
1 0 1 1 1 0 0
H = 0 1 0 1 1 1 0
0 0 1 0 1 1 1
and we have the following implications:
no zero column
no codewords of weight 1,
no repeated column no codewords of weight 2.
The minimum distance d(X ) of a linear code X is the minimum non-zero weight
of a codeword. In the example, d(X ) = 3. [In fact, X is equivalent to the Hamming [7, 4] code.]
>
Since g(X) F2 [X] is irreducible, the code X F2 [X] "X 7 1# is the cyclic
code defined by . The multiplicative cyclic group Z
7 of non-zero elements of
field F8 is
0 = 1, , 2 , 3 = + 1, 4 = 2 + ,
5 = 3 + 2 = 2 + + 1, 6 = 3 + 2 + = 2 + 1,
7 = 3 + = 1.
Next, the value r( ) is
r( ) = + 3 + 5
= + ( + 1) + ( 2 + + 1)
= 2 + = 4,
242
.
..
(k) divide rk1 (X) by rk1 (X):
rk2 (X) = qk (X)rk1 (X) + rk (X) where deg rk (X) < deg Rk1 (X),
...
The algorithm continues until the remainder is 0:
(s) divide rs2 (X) by rs1 (X):
rs2 (X) = qs (X)rs1 (X).
Then
gcd f (X), g(X) = rs1 (X).
(2.5.29)
At each stage, the equation for the current remainder rk (X) involves two previous
remainders. Hence, all remainders, including gcd( f (X), g(X)), can be written in
terms of f (X) and g(X). In fact,
Lemma 2.5.44
243
where
a1 (X) = b1 (X) = 0,
a0 (X) = 0, b0 (X) = 1,
ak (X) = qk (X)ak1 (X) + ak2 (X), k 1,
bk (X) = qk (X)bk1 (X) + bk2 (X), k 1.
Furthermore:
(1) deg ak (X) = deg qi (X), deg bk (X) = deg qk (X).
2ik
1ik
deg qk (X).
1ik+1
Proof
244
g j hi j =
j=0
'
1,
i = 0, N,
0,
1 i < N.
(2.6.1)
(b) Fix R (0, 1) and suppose we want to send one of a collection UN of messages
of length N , where the size UN = 2NR . The message is transmitted through an
245
MBSC with error-probability p < 1/2, so that we expect about pN errors. According to the asymptotic bound of part (a), for which values of p can we correct pN
errors, for large N ?
Solution (a) A code X FN2 is said to be E-error correcting if B(x, E) B(y, E) =
0/ for all x, y X with x = y. The Hamming bound for a code of size M, distance
d 1
d, correcting E =
errors is as follows. The balls of radius E about the
2
codewords are disjoint: their total volume equals M vN (E). But their union lies
inside FN2 , so M 2N /vN (E).
On the other hand, take an E-correcting code X of maximum size X . Then
there will be no word
y FN2 \ xX B(x, 2E + 1)
or we could add such a word to X , increasing the size but preserving the errorcorrecting property. Since every word y FN2 is less than d 1 from a codeword,
we can add y to the code. Hence, balls of radius d 1 cover the whole of FN2 , i.e.
M vN (d 1) 2N , or
M 2N /vN (d 1) (the VarshamovGilbert bound).
Combining these bounds yields, for (N, E) = log X N:
1
log vN (E)
log vN (2E + 1)
(N, E) 1
.
N
N
N
N
N
.
<
=
s1
N s+1 s
1 s
Consequently,
E
j
N
N
vN (E)
1 .
E
E j=0
1
1
log M lim sup log M 1 ( /2).
N
N
N
246
B
d 1
(b) We can correct pN errors if the minimum distance d satisfies
pN,
2
i.e. /2 p. Using the asymptotic Hamming bound we obtain R 1 ( /2)
1 (p). So, the reliable transmission is possible if p 1 (1 R),
The Shannon SCT states:
capacity C of a memoryless channel = sup I(X : Y ).
pX
Here I(X : Y ) = h(Y ) h(Y |X) is the mutual entropy between the single-letter
random input and output of the channel, maximised over all distributions of the
input letter X. For an MBSC with the error-probability p, the conditional entropy
h(Y |X) equals (p). Then
C = sup h(Y ) (p).
pX
But h(Y ) attains its maximum 1, by using the equidistributed input X (then Y is also
equidistributed). Hence, for the MBSC, C = 1 (p). So, a reliable transmission is
possible via MBSC with R 1 (p), i.e. p 1 (1 R). These two arguments
lead to the same answer.
Problem 2.3 Prove that the binary code of length 23 generated by the polynomial
g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 has minimum distance 7, and is perfect.
Hint: Observe that by the BCH bound (see Theorem 2.5.39) if a generator polynomial of a cyclic code has roots { , 2 , . . . , 1 } then the code has distance ,
and check that X 23 + 1 (X + 1)g(X)grev (X) mod 2, where grev (X) = X 11 g(1/X)
is the reversal of g(X).
Solution First, show that the code is BCH, of designed distance 5. Recall that if
is a root of a polynomial p(X) F2 [X] then so is 2 . Thus, if is a root of
g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 then so are 2 , 4 , 8 , 16 , 9 , 18 ,
13 , 3 , 6 , 12 . This yields the design sequence { , 2 , 3 , 4 }. By the BCH
theorem, the code X = "g(X)# has distance 5.
Next, the parity-check extension, X + , is self-orthogonal. To check this, we need
only to show that any two rows of the generating matrix of X + are orthogonal.
These are represented by
(X i g(X)|1) and (X j g(X)|1)
247
(2.6.2)
This implies that all words in X + have weight divisible by 4. Indeed, by inspection, all rows (X i g(X)|1) of the generating matrix of X + have weight 8.
Then, by induction on the number of rows involved in the sum, if c X + and
g(i) (X i g(X)|1) is a row of the generating matrix of X + then
w g(i) + c = w g(i) + w(c) 2w g(i) c ,
where g(i) c l = min g(i) l , cl , l = 1, . . . , 24. We know that 8|w g(i) and by
(i)
(i)
c
is
even,
so
2w
g c is divisthe induction hypothesis, 4|w(c).
Next,
w
g
(i)
ible by 4. Then the LHS, w g + c , is divisible by 4. Therefore, the distance of
X + is 8, as it is 5 and is divisible by 4. (Clearly, it cannot be bigger than 8 as
then it would be 12.) Then the distance of the original code, X , equals 7.
Finally, the code X is perfect 3-error correcting, since the volume of the 3-ball
in F23
2 equals
23
23
23
23
= 1 + 23 + 253 + 1771 = 2048 = 211 ,
+
+
+
3
2
1
0
and 212 211 = 223 . Here, obviously, 12 represents the rank and 23 the length.
Problem 2.4
Show that the Hamming code is cyclic with check polynomial
4
2
X + X + X + 1. What is its generator polynomial? Does Hammings original code
contain a subcode equivalent to its dual? Let the decomposition into irreducible
monic polynomials M j (X) be
l
X N + 1 = M j (X)k j .
j=1
(2.6.3)
248
The cyclic code with generator g(X) = X 3 + X + 1 has check polynomial h(X) =
X 4 + X 2 + X + 1. The parity-check matrix of the code is
1 0 1 1 1 0 0
0 1 0 1 1 1 0 .
(2.6.4)
0 0 1 0 1 1 1
The columns of this matrix are the non-zero elements of F32 . So, this is equivalent
to Hammings original [7, 4] code.
The dual of Hammings [7, 4] code has the generator polynomial X 4 + X 3 + X 2 + 1
(the reverse of h(X)). Since X 4 + X 3 + X 2 + 1 = (X + 1)g(X), it is a subcode of
Hammings [7, 4] code.
Finally, any irreducible polynomial M j (X) could be included in a generator of a
cyclic code in any power 0, . . . , k j . So, the number of possibilities to construct this
generator equals lj=1 (k j + 1).
Problem 2.5
Describe the construction of a ReedMuller code. Establish its
information rate and its distance.
m
m
Solution The space Fm
2 has N = 2 points. If A F2 , let 1A be the indicator
function of A. Consider the collection of hyperplanes
j = {p Fm
2 : p j = 0}.
Set h j = 1 j , j = 1, . . . , m, and h0 = 1Fm2 1. Define sets of functions Fm
2 F2 :
A0 = {h0 },
A1 = {h j ; j = 1, 2, . . . , m},
A2 = {hi h j ; i, j = 1, 2, . . . , m, i < j},
..
.
Ak+1 = {a h j ; a Ak , j = 1, 2, . . . , m, h j |a},
..
.
Am = {h1 hm }.
The union of these sets has cardinality N = 2m (there are 2m functions altogether).
N
Therefore, functions from m
i=0 Ai can be taken as a basis in F2 .
RM of length N = 2m is defined as
Then the ReedMuller code RM(r, m) = Xr,m
r
m
the span of ri=0 Ai and has rank
. Its information rate is
i=0 i
1 r m
i .
2m i=0
249
d RM(m, k) min 2d RM(m 1, k) , d RM(m 1, k 1) ,
which, by induction, yields
d RM(r, m) 2mr .
On the other hand, the vector h1 h2 hm is at distance 2mr from RM(m, r).
Hence,
d RM(r, m) = 2mr .
Problem 2.6 (a) Define a parity-check code of length N over the field F2 . Show
that a code is linear iff it is a parity-check code. Define the original Hamming code
in terms of parity-checks and then find a generating matrix for it.
(b) Let X be a cyclic code. Define the dual code
N
X = {y = y1 . . . yN : xi yi = 0 for all x = x1 . . . xN X }.
i=1
Prove that X is cyclic and establish how the generators of X and X are related to each other. Show that the repetition and parity-check codes are cyclic, and
determine their generators.
Solution (a) The parity-check code X PC of a (not necessarily linear) code X is
the collection of vectors y = y1 . . . yN FN2 such that the dot-product
N
y x = xi yi = 0 (in F2 ),
for all x = x1 . . . xN X .
i=1
From the definition it is clear that X PC is also the parity-check code for X , the
PC
linear code spanned by X : X PC = X . Indeed, if y x = 0 and y x = 0 then
y (x + x ) = 0. Hence, the parity-check code X PC is always linear, and it forms
a subspace dot-orthogonal to X . Thus, a given code X is linear iff it is a paritycheck code. A pair of linear codes X and X PC form a dual pair: X PC is the dual
of X and vice versa. The generating matrix H for X PC serves as a parity-check
matrix for X and vice versa.
250
1 1
0 1
0 0
0 0
0
1
1
0
1
0
1
1
0
1
0
1
0
0
1
0
0
0
.
0
1
(b) The generator of dual code g (X) = X N1 g(X 1 ). The repetition code has
g(X) = 1 + X + + X N1 and the rank 1. The parity-check code has g(X) = 1 + X
and the rank N 1.
Problem 2.7 (a) How does coding theory apply when the error rate p > 1/2?
(b) Give an example of a code which is not a linear code.
(c) Give an example of a linear code which is not a cyclic code.
(d) Define the binary Hamming code and its dual. Prove that the Hamming code is
perfect. Explain why the Hamming code cannot always correct two errors.
(e) Prove that in the dual code:
(i) The weight of any non-zero codeword equals 21 .
(ii) The distance between any pair of words equals 21 .
Solution (a) If p > 1/2, we reverse the output to get p = 1 p.
(b) The code X F22 with X = {11} is not linear as 00 X .
(c) The code X F22 with X = {00, 10} is linear, but not cyclic, as 01 X .
(d) The original Hamming [7, 4] code has distance 3 and is perfect one-error
correcting. Thus, making two errors in a codeword will always lead outside the
ball of radius 1 about the codeword, i.e. to a ball of radius 1 about a different
codeword (at distance 1 of the nearest, at distance 2 from the initial word). Thus,
one detects two errors but never corrects them.
(e) The dual of a Hamming [2 1, 2 1, 3] code is linear, of length N = 2 1
and rank , and its generating matrix is (2 1), with columns listing all nonzero vectors of length (the parity-check matrix of the original code). The rows of
this matrix are linearly independent; moreover, any row i = 1, . . . , has 21 digits
1. This is because each such digit comes from a column, i.e. a non-zero vector of
length , with 1 in position i; there are exactly 21 such vectors. Also any pair of
251
columns of this matrix are linearly independent, but there are triples of columns
that are linearly dependent (a pair of columns complemented by their sum).
Every non-zero dual codeword x is a sum of rows of the above generating matrix.
Suppose these summands are rows i1 , . . . , is where 1 i1 < < is . Then, as
above, the number of digits 1 in the sum equals the number of columns of this
matrix for which the sum of digits i1 , . . . , is is 1. We have no restriction on the
remaining s digits, so for them there are 2s possibilities. For digits i1 , . . . , is
we have 2s1 possibilities (a half of the total of 2s ). Thus, again 2s 2s1 = 21 .
We proved that the weight of every non-zero dual codeword equals 21 . That is,
the distance from the zero vector to any dual codeword is 21 . Because the dual
code is linear, the distance between any pair of distinct dual codewords x, x equals
21 :
of ways to place
0s and 1s outside J
with xi = 0 or 1
lJ
which yields 2|J| 2|J|1 = 21 . In other words, to get a contribution from a digit
(i)
x j = g j = 1, we must fix (i) a configuration of 0s and 1s over {1, . . . , } \ J (as it
iJ
is a part of the description of a non-zero vector of length N), and (ii) a configuration
of 0s and 1s over J,
an odd number of 1s.
with
To check that d X H
252
Solution (a) The necessary and sufficient condition for g(X) being the generator of
a cyclic code of length N is g(X)|(X N 1). The generator g(X) may be irreducible
or not; in the latter case it is represented as a product g(X) = M1 (X) Mk (X)
of its irreducible factors, with k d = deg g. Let s be the minimal number such
that N|2s 1. Then g(X) is factorised into the product of first-degree monomials
d
refers to the minimal field the splitting field for g, but this is not necessary.] Each
element i is a root of g(X) and also a root of at least one of its irreducible factors
M1 (X), . . . , Mk (X). [More precisely, each Mi (X) is a sub-product of the above firstdegree monomials.]
We want to select a defining set D of roots among 1 , . . . , d K: it is a collection comprising at least one root ji for each factor Mi (X). One is naturally tempted
to take a minimal defining set where each irreducible factor is represented by one
root, but this set may not be easy to describe exactly. Obviously, the cardinality |D|
of defining set D is between k and d. The roots forming D are all from field K but
in fact there may be some from its subfield, K K containing all ji . [Of course,
F2 K .] We then can identify the cyclic code X generated by g(X) with the set
of polynomials
+
*
f (X) F2 [X]/"X N 1# : f ( ) = 0 for all D .
It is said that X is a cyclic code with defining set of roots (or zeros) D.
(b) A binary BCH code of length N (for N odd) and designed distance is a cyclic
code with defining set { , 2 , . . . , 1 } where N and is a primitive Nth
root of unity, with N = 1. It is helpful to note that if is a root of a polynos1
mial p(X) then so are 2 , 4 , . . . , 2 . By considering a defining set of the form
{ , 2 , . . . , 1 } we fill the gaps in the above diadic sequence and produce an
ideal of polynomials whose properties can be analytically studied.
The simplest example is where N = 7 and D = { , 2 } where is a root of
3
X + X + 1. Here, 7 = ( 3 )2 = ( + 1)2 = 3 + = 1, so is a 7th root of
unity. [We used the fact that the characteristic is 2.] In fact, it is a primitive root.
Also, as was said, 2 is a root of X 3 + X + 1: ( 2 )3 + 2 + 1 = ( 3 + + 1)2 =
0, and so is 4 . Then the cyclic code with defining set { , 2 } has generator
X 3 +X +1 since all roots of this polynomial are engaged. We know that it coincides
with the Hamming [7, 4] code.
The Vandermonde determinant is
1
1
1
...
1
x1
x2
x3 . . . xn
.
= det
...
...
... ... ...
x1n1 x2n1 x3n1 . . . xnn1
253
Observe that if xi = x j (i = j) the determinant vanishes (two rows are the same).
Thus xi x j is a factor of ,
= P(x) (xi x j ),
i< j
(2.6.5)
i< j
where is a primitive Nth root of unity is called a BCH code of design distance
< N. Next, X BCH is a vector space over F2 and c X BCH iff
cH T = 0
where
1
1
H = 1
.
..
2
3
..
.
2
4
6
..
.
...
...
...
..
.
(2.6.6)
N1
2N2
3N3
..
.
(2.6.7)
1 1 2 2 . . . (N1)( 1)
Now rank H = . Indeed, by (2.6.5) for any minor H
det H = ( i j ) = 0.
i< j
254
(x, y) < d . Conclude that if d or more changes are made in a codeword then the
new word is closer to some other codeword than to the original one.
Suppose that a maximal [N, M, d] code is used for transmitting information via a
binary memoryless channel with the error-probability p, and the receiver uses the
maximum likelihood decoder. Prove that the probability of erroneous decoding,
ML , obeys the bounds
err
ML
1 b(N, d 1) err
1 b(N, (d 1)/2),
Problem 2.10
The Plotkin bound for an [N, M, d] binary code states that M
d
if d > N/2. Let M2 (N, d) be the maximum size of a code of length N and
d N/2
distance d , and let
1
( ) = lim log2 M2 (N, N).
N N
d
= 2d 2N(2d1) .
d (2d 1)/2
255
Solution If d > N/2 apply the Plotkin bound and conclude that ( ) = 0. If
d N/2 consider the partition of a code X of length N and distance d N/2
according to the last N (2d 1) digits, i.e. divide X into disjoint subsets, with
fixed N (2d 1) last digits. One of these subsets, X , must have size M such
that M 2N(2d1) M.
Hence, X is a code of length N = 2d 1 and distance d = d, with d > N /2.
Applying Plotkins bound to X gives
M
d
d
=
= 2d.
d N/2 d (2d 1)/2
Therefore,
M 2N(2d1) 2d.
Taking d = N with N yields ( ) 1 2 , 0 1/2.
Problem 2.11 State and prove the Hamming, Singleton and GilbertVarshamov
bounds. Give (a) examples of codes for which the Hamming bound is attained, (b)
examples of codes for which the Singleton bound is attained.
Solution The Hamming bound states that the size M of an E-error correcting code
X of length N,
2N
M
,
vN (E)
N
is the volume of an E-ball in the Hamming space
where vN (E) =
i
0iE
{0, 1}N . It follows from the fact that the E-balls about the codewords x X must
be disjoint:
M vN (E) = of points covered by M E-balls
2N = of points in {0, 1}N .
The Singleton bound is that the size M of a code X of length N and distance d
obeys
M 2Nd+1 .
It follows by observing that truncating X (i.e. omitting a digit from the codewords
x X ) d 1 times still does not merge codewords (i.e. preserves M) while the
resulting code fits in {0, 1}Nd+1 .
The GilbertVarshamov bound is that the maximal size M = M2 (N, d) of a
binary [N, d] code satisfies
2N
M
.
vN (d 1)
256
This bound follows from the observation that any word y {0, 1}N must be within
distance d 1 from a maximum-size code X . So,
M vN (d 1) of points within distance d 1 = 2N .
Codes attaining the Hamming bound are called perfect codes, e.g. the Hamming
[2 1, 2 1 , 3] codes. Here, E = 1, vN (1) = 1 + 2 1 = 2 and M = 22 1 .
Apart from these codes, there is only one example of a (binary) perfect code: the
Golay [23, 12, 7] code.
Codes attaining the Singleton bound are called maximum distance separable
(MDS): their check matrices have any N M rows linearly independent. Examples
of such codes are (i) the whole {0, 1}N , (ii) the repetition code {0 . . . 0, 1 . . . 1}
and the collection of all words x {0, 1}N of even weight. In fact, these are all
examples of binary MDS codes. More interesting examples are provided by Reed
Solomon codes that are non-binary; see Section 3.2. Binary codes attaining the
GilbertVarshamov bound for general N and d have not been constructed so far
(though they have been constructed for non-binary alphabets).
Problem 2.12 (a) Explain the existence and importance of error correcting codes
to a computer engineer using Hammings original code as your example.
(b) How many codewords in a Hamming code are of weight 1? 2? 3? 4? 5?
Solution (a) Consider the linear map F72 F32 given by the matrix H of the form
(2.6.4). The Hamming code X is the kernel ker H, i.e. the collection of words
x = x1 x2 x3 x4 x5 x6 x7 {0, 1}7 such that xH T = 0. Here, we can choose four digits,
say x4 , x5 , x6 , x7 , arbitrarily from {0, 1}; then x1 , x2 , x3 will be determined:
x1 = x4 + x5 + x7 ,
x2 = x4 + x6 + x7 ,
x3 = x5 + x6 + x7 .
It means that code X can be used for encoding 16 binary messages of length 4.
If y = y1 y2 y3 y4 y5 y6 y7 differs from a codeword x X in one place, say y = x + ek
then the equation yH T = ek H T gives the binary decomposition of number k, which
leads to decoding x. Consequently, code X allows a single error to be corrected.
Suppose that the probability of error in any digit is p << 1, independently of
what occurred to other digits. Then the probability of an error in transmitting a
non-encoded (4N)-digit message is
1 (1 p)4N 4N p.
257
But using the Hamming code we need to transmit 7N digits. An erroneous transmission requires at least two wrong digits, which occurs with probability
N
7 2
1 1
p
21N p2 << 4N p.
2
So, the extra effort of using 3 check digits in the Hamming code is justified.
(b) A Hamming code X H, of length N = 2 1 ( 3) consists of binary words
x = x1 . . . xN such that xH T = 0 where H is an N matrix whose columns
h(1) , . . . , h(N) are all non-zero binary vectors of length l. Hence, the number of
N
Problem 2.13 (a) The dot-product of vectors x, y from a binary Hamming space
HN is defined as x y = Ni=1 xi yi (mod 2), and x and y are said to be orthogonal
if x y = 0. What does it mean to say that X HN is a linear [N, k] code with
generating matrix G and parity-check matrix H ? Show that
X = {x HN : x y = 0 for all y X }
is a linear [N, N k] code and find its generator and parity-check matrices.
(b) A linear code X is called self-orthogonal if X X . Prove that X is selforthogonal if the rows of G are self and pairwise orthogonal. A linear code is called
self-dual if X = X . Prove that a self-dual code has to be an [N, N/2] code (and
hence N must be even). Conversely, prove that a self-orthogonal [N, N/2] code, for
N even, is self-dual. Give an example of such a code for any even N and prove that
a self-dual code always contains the word 1 . . . 1.
258
(c) Consider now a Hamming [2 1, 2 1] code XH, . Describe the generating
. Prove that the distance between any two codewords in X equals
matrix of XH,
H,
1
2 .
Solution By definition, X is preserved under the linear operations; hence X
is a linear code. From algebraic considerations, dim X = N k. The generating
matrix G of X coincides with H, and the parity-check matrix H with G.
If X X then the rows g(1) , . . . , g(k) of G are self- and pairwise orthogonal.
The converse is also true. From the previous observation, if X is self-dual then
k = N k, i.e. k = N/2, and N should be even. Similarly, if X is self-orthogonal
and k = N/2 then X is self-dual.
Let 1 = 1 . . . 1. If X = X then 1 g(i) = g(i) g(i) = 0. So, 1 X and hence
1 X . An example is a code with the generating matrix
1 1 1 ... 1
1 1 0 . . . 0
G = 1 0 1 . . . 0
. . . .
.. .. .. . . ...
1 0 0 ... 1
N/2
1
1
1
..
.
...
...
...
..
.
1
0
..
.
1 ... 1
N/2
N/2.
Clearly, w(g(i) ) = 21 as exactly half of all 2 vectors have 1 on any given position.
The proof is finished by induction on J.
A simple and elegant way is to use the MacWilliams identity (cf. Lemma 3.4.4)
which immediately gives
WX (s) = 1 + (2 1)s2
1
(2.6.8)
259
0011001
0100101
0010110
1110000
E
F
G
H
0111100
0001111
1101001
0110001
I
J
K
L
1010101
1100110
0101010
1001100
M
N
O
1111111
1000011
0000000
1011010
1 0 ... 0 1 ... 1
0 1 ... 0 0 ... 1
H =
. . . . . . . . . . . . . . . . . . . . . .
0 0 ... 1 1 ... 1
Here the columns are meant to be lexicographically ordered. Different matrices
obtained from the above by permuting the rows define different, but equivalent,
codes: they are all named Hamming codes.
To perform decoding, we have to fix a matrix H (the check matrix) and let it
be known to both the sender and the receiver. Upon receiving a word (string) y =
y1 . . . yN we form a syndrome vector yH T . If yH T = 0, we decode y by itself. (We
260
have no means to determine if the original codeword was corrupted by the channel
or not.)
If yH T = 0 then yH T coincides with a column of H. Suppose yH T gives column
j of H; then we decode y by
x = y + e j where e j = 0 . . . 1 . . . 0 (1 in digit j).
In other words, we change digit j in y and decide that it was the word sent through
the channel. This works well when errors in the channel are rare.
If = 3 a Hamming [7, 4] code contains 24 = 16 codewords. These codewords
are fixed when H is fixed: in the example they are used for encoding 15 letters from
A to O and the space character . Upon receiving a message we divide it into words
of length 7: in the example there are 15 words altogether. Performing the decoding
procedure leads to
JOHNNIEBEGOOD
0 1 0 . . . 0T ,
and
1 1 0 . . . 0T .
Hence, the minimal distance equals 3. Therefore, if a single error occurs, i.e. the
received word is at distance 1 from a codeword, then this codeword is uniquely
determined. Hence, the Hamming code is single-error correcting.
261
and
(1 + N)2N = 2 2N = 2N .
The information rate of the code equals
2 1
.
rank length =
2 1
The code with = 3 has the 3 7 parity-check matrix of the form (2.6.4); any
permutation of rows leads to an equivalent code. The generating matrix is 4 7:
1 0 0 0 1 1 1
0 1 0 0 0 1 1
0 0 1 0 1 0 1
0 0 0 1 0 1 1
and the information rate 4/7. The Hamming code with = 2 is trivial: it contains
a single non-zero codeword 1 1 1.
Problem 2.16
Define a BCH code of length N over the field Fq with designed
distance . Show that the minimum weight of such a code is at least .
Consider a BCH code of length 31 over the field F2 with designed distance 8.
Show that the minimum distance is at least 11.
Solution A BCH code of length N over the field Fq is defined as a cyclic code X
whose minimum degree generator polynomial g(X) Fq [X], with g(X)|(X N 1)
(and hence deg g(X) N), contains among its roots the subsequent powers ,
2 , . . . , 1 where Fqs is a primitive Nth root of unity. (This root lies
in an extension field Fqs the splitting field for X N 1 over Fq , i.e. N|qs 1.) Then
is called the designed distance for X ; the actual distance (which may be difficult
to calculate in a general situation) is .
If we consider the binary BCH code X of length 31, should be a primitive
root of unity of degree 31, with 31 = 1 (the root lies in an extension field F32 ).
262
8 = ( 4 )2 , 9 = ( 5 )8 , and 10 = ( 5 )2 .
That is, the defining set can be extended to
, 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10
(all these elements are distinct, as is a primitive 31st root of unity). In fact, code
X has designed distance 11. Hence, the minimum distance in X is 11.
Problem 2.17
Let X be a linear [N, k, d] code over the binary field F2 , and
G be a generating matrix of X , with k rows and N columns, such that exactly
d of the first rows entries are 1. Let G1 be the matrix, of k 1 rows and N d
columns, formed by deleting the first row of G and those columns of G with a
non-zero entry in the first row. Show that X1 , the linear code generated by G1 ,
has minimum distance d d/2. Here, for a real number x, x is the integer
satisfying x x < x + 1.
Show also that X1 has rank k 1. Deduce that
3 i4
N
d 2 .
0ik1
Solution Let x be the codeword in X represented by the first row of G and pick
a pair of other rows, say y and z. After the first deleting they become y and z ,
correspondingly. Both weights w(y ) and w(z ) must be d/2: otherwise at least
one of the original words y and z, say y, would have had minimum d/2 digits 1
among deleted d digits (as w(y) d by condition). But then
w(x + y) = w(y ) + d d/2 < d
which contradicts the condition that the distance of X is d.
We want to check that the weight w(y + z ) d/2. Assume the opposite:
w(y + z ) = m < d/2 .
Then m = w(y0 + z0 ) must be d m d/2 where y0 is the deleted part of y,
of length d, and z0 is the deleted part of z, also of length d. In fact, as before, if
m < d m then w(y + z) < d which is impossible. But if m d m then
w(x + y + z) = d m + m < d,
again impossible. Hence, the sum of any two rows of G1 has weight d/2.
263
This argument can be repeated for the sum of any number of rows of G1 (not
exceeding k 1). In fact, in the case of such a sum x + y + + z, we can pass to
new matrices, G and G1 , with this sum among the rows. We conclude that X1 has
minimum distance d d/2. The rank of X1 is k 1, for any k 1 rows of G1
are linearly independent. (The above sum cannot be 0.)
Now, the process of deletion can be applied to X1 (you delete d columns in
G1 yielding digits 1 in a row of G1 with exactly d digits 1). And so on, until you
exhaust the initial rank k by diminishing it by 1. This leads to the required bound
4
3
N d + d/2 + d/22 + + d 2k1 .
Problem 2.18 Define a cyclic linear code X and show that it has a codeword of
minimal length which is unique, under normalisation to be stated. The polynomial
g(X) whose coefficients are the symbols of this codeword is the (minimum degree)
generator polynomial of this code: prove that all words of the code are related to
g(X) in a particular way.
Show further that g(X) can be the generator polynomial of a cyclic code with
words of length N iff it satisfies a certain condition, to be stated.
There are at least three ways of determining the parity-check matrix of the code
from a knowledge of the generator polynomial. Explain one of them.
Solution Let X be the cyclic code of length N with generator polynomial g(X) =
gi X i of degree d. Without loss of generality, assume the code is non-trivial,
0id
of degree k = N d;
(b) a string a = a0 . . . aN1 X iff the polynomial a(X) =
ai X i has the
0iN1
264
One way to specify the parity-check matrix is to take the ratio (X N 1) g(X) =
h(X) = h0 + h1 X + + hk X k . Then form the N (N k) matrix
0 ... 0 0
hk hk1 . . .
0
hk hk1 . . . h1 . . . 0
.
(2.6.9)
H =
. . . . . .
. . . . . . . . . . . . . . .
0
0
. . . hk . . . h1 h0
j
The rows of H are the cyclic shifts h , 0 j d 1 = N k 1, of the string
h = hk . . . h0 0 . . . 0.
We claim that for all a X , aH T = 0. In fact, it suffices to check that for the
basis words j g, j gH T = 0, j = 0, . . . , k 1. That is, the dot-product
j1 g j2 h = 0, 0 j1 < k, 0 j2 < N k 1.
(2.6.10)
k1 g h = g0 hk + g1 hk1 = 0
since it gives the first coefficient (at monomial X) of the product g(X)h(X) =
X N 1. Similarly, for j1 = k 2 and j2 = 0, k2 g h gives the second coefficient
of g(X)h(X) (at monomial X 2 ) and is again equal to 0. And so on: for j1 = j2 = 0,
g h = 0 as the kth-degree coefficient in g(X)h(X).
Continuing, g h equals the (k + 1)st-degree coefficient in g(X)h(X), g 2 h
the (k + 2)nd, and so on; g Nk1 h = gd1 hk + gd hk1 the (N 1)st. As before,
they all vanish.
The same holds true when we simultaneously shift both words cyclically (when
possible) which leads to (2.6.10).
Conversely, suppose that aH T = 0 for some word a = a0 . . . aN1 . Write the corresponding polynomial a(X) as a(X) = f (X)g(X) + r(X) where the ratio f (X) =
fi X i and r(X) is the remainder. Then either r(X) = 0 or 1 deg r(X) =
0ik1
0ik
hi X ki .
265
Alternatively, let h(X) be the check polynomial for the cyclic code X length N
with a generator polynomial g(X) so that g(X)h(X) = X N 1. Then:
(a) X = { f (X): f (X)h(X) = 0 mod (X N e)};
(b) if h(X) = h0 + h1 X + + hNr X Nr then the parity-check matrix H of X
has the form (2.6.9);
(c) the dual code X is a cyclic code of dim X = r, and X = "h (X)#,
Nr h(X 1 ) = h1 (h X Nr + h X Nr1 + + h
where h (X) = h1
0
1
Nr ).
0 X
0
Problem 2.19
Consider the parity-check matrix H of a Hamming [2 1, 2
1] binary code. Form the parity-check matrix H of a [2 , 2 1] code by
augmenting H with a column of zeros and then with a row of ones. The dual of
the resulting code is called a first-order ReedMuller code. Show that a first-order
ReedMuller code can correct errors of up to 22 1 bits per codeword.
For the photographs of Mars taken by the Mariner spacecraft such code with
= 5 was used in 1972. What was the code rate? Why is this likely to have been
much less than the capacity of the channel?
Solution The code in question is [2 , + 1, 21 ]; with = 5, the information rate
equals 6/32 1/5. Let us check that all codewords except 0 and 1 have weight
21 . For 1 the code R() is defined by recursion
R( + 1) = {xx|x R()} {x, x + 1|x R()}.
So, the length of codewords in R( + 1) is obviously 2+1 . As {xx|x R()}
and {x, x + 1|x R()} are disjoint, the number of codewords is doubled, i.e.
R( + 1) = 2+2 . Finally, assuming that all codewords in R() except 0 and 1
have weight 21 , consider a codeword y R(l + 1). If y = xx is different from 0
or 1, then x = 0 or 1, and so w(y) = 2w(x) = 2 21 = 2 .
If y = x, x + 1 we must consider some cases. If x = 0 then y = 01, which has
weight 2l . If x = 1 then y = 10, which also has weight 2l . Finally, if x = 0 or 1
then w(x + 1) = 2 21 = 21 and w(y) = 2 21 = 2 . It is clear now that
codewords xx and x, x + 1 with w(x) = 21 are orthogonal to rows of parity-check
matrix H .
Up to 7 bits may be in error, thus the probability of a transmission error pe (for
a binary symmetric memoryless channel with the error-probability p) obeys
32 i
pe 1
p (1 p)32i ,
i
0i7
which is small when p is small. (As an estimate of an acceptable p, we can take the
solution to 1 + p log p + (1 p) log(1 p) = 26/32.) If the block length is fixed
(and rather small), with a low value of p we cant get near the capacity.
266
Indeed, for = 5, the code is [32, 6, 16], detecting 15 and correcting 7 errors. That
is, the code can correct a fraction > 1/5 of the total of 32 digits. Its information
rate is 6/32 and if the capacity of the (memoryless) channel is C = 1 (p) (where
p stands for the symbol-probability of error), we need the bound C > 6/32; that
is, (p) + 6/32 < 1, for a reliable transmission. This yields |p 1/2| > |p 1/2|
where p (0, 1) solves 26/32 = (p ). Definitely 0 p < 1/5 and 4/5 < p 1
would do. In reality the error-probability was much less.
Problem 2.20 Prove that any binary [5, M, 3] code must have M 4. Verify that
there exists, up to equivalence, exactly one [5, 4, 3] code.
Solution By the Plotkin bound, if d is odd and d > 12 (N 1) then
M2 (N, d) 2
d +1
.
2d + 1 N
In fact,
M2 (5, 3) 2
4
= 2 2 = 4.
6+15
Show that if d2 is the distance of the code with generating matrix G2 then d2 d/2.
Solution Let X be [N, k, d]. We can always form a generating matrix G of X where
the first row is a codeword x with w(x) = d; by permuting columns of G we can
have the first row in the form 1 . . . 1d 0 . . . 0Nd . So, up to equivalence,
: ;< = : ;< =
1...1 0...0
G=
.
G1
G2
Suppose d(G2 ) < d/2 then, without loss of generality, we may assume that there
exists a row of (G1 G2 ) where the number of ones among digits d + 1, . . . , N is
< d/2. Then the number of ones among digits 1, . . . d in this row is > d/2, as its
total weight is d. Then adding this row and 1 . . . 1 0 . . . 0 gives a codeword with
weight < d. So, d(G2 ) d/2.
267
Problem 2.22 (GilbertVarshamov bound) Prove that there exists a p-ary linear
[N, k, d] code if pk < 2N /vN1 (d 2). Thus, if pk is the largest power of p satisfying
this inequality, we have Mp (N, d) pk .
Solution We construct a parity-check matrix by selecting N columns of length
N k with the requirement that no d 1 columns are linearly dependent. The first
column may be any non-zero string in ZNk
p . On the step i 2 we must choose
a column which is not a linear combination of any d 2 (or fewer) of previously
selected columns. The number of such linear combinations (with non-zero coefficients) is
d2
i1
Si =
(p 1) j .
j
j=1
So, the parity-check matrix may be constructed iff SN + 1 < pNk . Finally, observe
that SN + 1 = vN1 (d 2). Say, there exists [5, 2k , 3] code if 2k < 32/5, so k = 2
and M2 (5, 3) 4, which is, in fact, sharp.
Problem 2.23 An element b Fq is called primitive if its order (i.e. the minimal
k such that bk = 1 mod q) is q 1. It is not difficult to find a primitive element of
the multiplicative group Fq explicitly. Consider the prime factorisation
s
q 1 = pj j.
j=1
(q1)/p j
(q1)/p j
= e. Set b j = a j
and
0. Because p j are distinct primes, it follows that n = 0 mod p j j for any j. Hence,
n = sj=1 p j j .
p
Problem 2.24 The minimal polynomial with a primitive root is called a primitive
polynomial. Check that among irreducible binary polynomials of degree 4 (see
(2.5.9)), 1 + X + X 4 and 1 + X 3 + X 4 are primitive and 1 + X + X 2 + X 3 + X 4 is
not. Check that all six irreducible binary polynomials of degree 5 (see (2.5.15))
are primitive; in practice, one prefers to work with 1 + X 2 + X 5 as the calculations
modulo this polynomial are slightly shorter. Check that among the nine irreducible
polynomials of degree 6 in (2.5.16), there are six primitive: they are listed in the
upper three lines. Prove that a primitive polynomial exists for every given degree.
268
Solution For the solution to the last part, see Section 3.1.
Problem 2.25 A cyclic code X of length N with the generator polynomial g(X)
of degree d = N k can be described in terms of the roots of g(X), i.e. the elements
1 , . . . Nk such that g( j ) = 0. These elements are called zeros of code X and
belong to a Galois field F2d . As g(X)|(1+X N ), they are also among roots of 1+X N .
That is, Nj = 1, 1 j N k, i.e. the j are N th roots of unity. The remaining k
roots of unity 1 , . . . , k are called non-zeros of X . A polynomial a(X) X iff,
in Galois field F2d , a( j ) = 0, 1 j N k.
(a) Show that if X is the dual code then the zeros of X are 1 1 , . . . , k 1 , i.e.
the inverses of the non-zeros of X .
(b) A cyclic code X with generator g(X) is called reversible if, for all x =
x0 . . . xN1 X , the word xN1 . . . x0 X . Show that X is reversible iff g( ) = 0
implies that g( 1 ) = 0.
(c) Prove that a q-ary cyclic code X of length N with (q, N) = 1 is invariant under
the permutation of digits such that q (i) = qi mod N (i.e. x xq ). If s = ordN (q)
then the two permutations i i + 1 and q (i) generate a subgroup of order Ns in
the group Aut(X ) of the code automorphisms.
Solution Indeed, since a(xq ) = a(x)q is proportional to the same generator polynomial it belongs to the same cyclic code as a(x).
Problem 2.26 Prove that there are 129 non-equivalent cyclic binary codes of
length 128 (including the trivial codes, {0 . . . 0} and {0, 1}128 ). Find all cyclic binary codes of length 7.
Solution The equivalence classes of the cyclic codes of length 2k are in a one-tok
one correspondence with the divisors of 1+X 2 ; the number of those equals 2k +1.
Furthermore, there are eight codes listed by their generators which are divisors of
X 7 1 as
X 7 1 = (1 + X)(1 + X + X 3 )(1 + X 2 + X 3 ).
3
Further Topics from Coding Theory
270
(a b) p = a p + (b) p .
(3.1.1)
p
k ak (b) pk .
0kp
p
For 1 k p 1, the value
is a multiple of p and the corresponding term
k
vanishes. Therefore, (a b) p = a p + (b) p . The inductive step is completed by the
n1
n1
same argument, with a and b replaced by a p and (b) p .
Lemma 3.1.6 The multiplicative group F of non-zero elements of a field F of
size q is isomorphic to the cyclic group Zq1 .
Proof Observe that for any divisor d|(q 1), group F contains exactly (d)
elements of multiplicative order d where is Eulers totient phi-function. (Recall
that (d) = {k : k < d, gcd(k, d) = 1}.) Well see that all elements of order d have
q1
the form a d r where a is a primitive element, r d and r, d are co-prime. In fact,
q 1 = (d), and F will have at least one element of order q 1 which
d:d|(q1)
271
{e, a, . . . , ad1 }. Observe that the cyclic group Zd has exactly (d) elements of
order d. So, the whole F has exactly (d) elements of order d; in other words, if
(d) is the number of elements in F of order d then either (d) = 0 or (d) = (d)
and
q1 =
(d)
d:d(n)
(d) = q 1,
d:d|n
(d) = (d).
Definition 3.1.7 A (multiplicative) generator of F (i.e. an element of multiplicative order q 1) is called a primitive element of field F. Although such an element
is non-unique, we will usually single out one such element and denote it by ;
of course a power r where r is coprime with (q 1) will also give a primitive
element.
If a F with F = q 1 then aq1 = e (the order of every element divides the
order of the group). Hence, aq = a, i.e. a is a root of the polynomial X q X in F.
But X q X can have only q roots (including zero 0), so F gives the set of all
roots of X q X.
Definition 3.1.8 Given fields K and F, with F K, field K is called the splitting
field for a polynomial g(X) with coefficients from F if (a) K contains all roots of
g(X), (b) there is no field K with F K K satisfying (a). We will write Spl(g(X))
for the splitting field for g(X).
Thus, if F = q then F contains all roots of polynomial X q X and is the splitting
field for this polynomial.
Lemma 3.1.9 Any two splitting fields K, K for the same polynomial g(X) with
coefficients from F coincide.
Proof In fact, take the intersection K K : it contains F and is a subfield of both
K and K . It must then coincide with each of K, K .
Corollary 3.1.10 For any prime p and natural s 1, there exists at most one
field with ps elements.
Proof Each such field is splitting for polynomial X q X with coefficients from
Z p and q = ps . So any two such fields coincide.
On the other hand, we will prove later the following.
Theorem 3.1.11 For any non-constant polynomial with coefficients from F, there
exists a splitting field.
272
Corollary 3.1.12 For any prime p and natural s 1, there exists precisely one
field with ps elements.
Proof of Corollary 3.1.12 Take again the polynomial X q X with coefficients from
Z p and q = ps . By Theorem 3.1.11, there exists the splitting field Spl(X q X)
where X q X = X(X q1 e) is factorised into linear polynomials. So, Spl(X q X)
contains the roots of X q X and has characteristic p (as it contains Z p ).
However, the roots of (X q X) form a subfield: if aq = a and bq = b then (a
q
b) = aq + (bq ) (Lemma 3.1.5) which coincides with a b. Also, (ab1 )q =
aq (bq )1 = ab1 . This field cannot be strictly contained in Spl(X q X) thus it
coincides with Spl(X q X).
It remains to check that all roots of (X q X) are distinct: then the cardinality
Spl(X q X) will be equal to q. In fact, if X q X had a multiple root then it would
have had a common factor with its derivative X (X q X) = qX q1 e. However,
qX q1 = 0 in Spl(X q X) and thus cannot have such factors.
Summarising, we have the two characterisation theorems for finite fields.
Theorem 3.1.13 All finite fields have size ps where p is prime and s 1 integer.
For all such p, s, there exists a unique field of this size.
The field of size q = ps will be denoted by Fq (a popular alternative notation is
GF(q) (a Galois field)). In the case of the simplest fields F p = {0, 1, . . . , p 1} (for
p is prime) we use symbol 1 instead of e for the unit.
Theorem 3.1.14 All finite fields can be arranged into sequences (towers). For
a prime p and positive integers s1 , s2 , . . .,
...
...
...
F ps1 s2 ...si
...
F ps1 s2
F ps1
Fp Zp
273
X0
X
X2
X3
X4
X5
X6
X7
X8
X9
X 10
X 11
X 12
X 13
X 14
polynomial
vector
(string)
0
1
X
X2
X3
1 + X3
1 + X + X3
1 + X + X2 + X3
1 + X + X2
X + X2 + X3
1 + X2
X + X3
1 + X2 + X3
1+X
X + X2
X2 + X3
0000
1000
0100
0010
0001
1001
1101
1111
1110
0111
1010
0101
1011
1100
0110
0011
(3.1.2)
274
From now on we will focus on polynomial representations of finite fields. Generalising concepts introduced in Section 2.5, consider
Definition 3.1.17 The set of all polynomials with coefficients from Fq is a commutative ring denoted by Fq [X]. A quotient ring Fq [X]/"g(X)# is where the operation is modulo a fixed polynomial g(X) Fq [X].
Definition 3.1.18 A polynomial g(X) Fq [X] is called irreducible (over Fq ) if it
admits no representation
g(X) = g1 (X)g2 (X)
with g1 (X), g2 (X) Fq [X].
A generalisation of Theorem 2.5.32 is presented in Theorem 3.1.19 below.
Theorem 3.1.19
Let g(X) Fq [X] have degree deg g(X) = d . Then
Fq [X]/"g(X)# is a field Fqd iff g(X) is irreducible.
Proof Let g(X) be an irreducible polynomial over Fq . To show that Fq [X]/"g(X)#
is a field we should check that each non-zero element f (X) Fq [X]/"g(X)# has an
inverse. Consider the set F( f ) of polynomials of the form f (X)h(X) mod g(X)
where h(X) Fq [X]/"g(X)# (the principal ideal generated by f (X)). If F( f ) contains the unity e Fq (the constant polynomial equal to e) then the corresponding
h(X) = f (X)1 . If not, the map h(X) f (X)h(X) mod g(X), from Fq [X]/"g(X)#
to itself, is not a surjection. That is, f (X)h1 (X) = f (X)h2 (X) mod g(X) for some
distinct h1 (X), h2 (X), i.e.
f (X) h1 (X) h2 (X) = r(X)g(X).
275
Then either g(X)| f (X) or g(X)| h1 (X) h2 (X) as g(X) is irreducible. So, either f (X) = 0 mod g(X) (a contradiction) or h1 (X) = h2 (X) mod g(X). Hence,
Fq [X]/"g(X)# is a field.
The inverse assertion is proved similarly: if g(X) is reducible then Fq [X]/"g(X)#
contains non-zero g1 (X), g2 (X) with g1 (X)g2 (X) = 0. Then Fq [X]/"g(X)# cannot
be a field.
The dimension Fq [X]/"g(X)# : Fq is equal to d, the degree of g(X), so
Fq [X]/"g(X)# = Fqd .
Worked Example 3.1.20
Prove that
g(X) has an inverse in the polynomial ring
Fq [X]/"X N e# iff gcd g(X), X N e = e.
Solution Consider the map Fq [X]/"X N e# Fq [X]/"X N e# given by h(X)
h(X)g(X) mod (X N e). If it is a surjection then there exists h(X) with
h(X)g(X) = e and h(X) = g(X)1 . Suppose it is not. Then there exist h(1) (X) =
h(2) (X) mod (X N e) such that h(1) (X)g(X) = h(2) (X)g(X) mod (X N e), i.e.
(h(1) (X) h(2) (X))g(X) = s(X)(X N e).
As (X N e) | (h(1) (X) h(2) (X)), this means that gcd(g(X), X N e) = e.
Conversely, if gcd(g(X), X N e) = d(X) = e then the equation h(X)g(X) = e
mod (X N e) gives
h(X)g(X) = e + q(X)(X N e)
where d(X)|LHS and d(X)|q(X)(X N e)). Therefore, d(X)|e: a contradiction.
Hence, g(X)1 does not exist.
Example 3.1.21 (Continuing Example 2.5.19) There are six irreducible binary
polynomials of degree 5:
1 + X 2 + X 5, 1 + X 3 + X 5, 1 + X + X 2 + X 3 + X 5,
1 + X + X 2 + X 4 + X 5, 1 + X + X 3 + X 4 + X 5,
1 + X 2 + X 3 + X 4 + X 5.
(3.1.3)
Then there are nine irreducible polynomials of degree 6, and so on. Calculating
irreducible polynomials of a large degree is a demanding task, although extensive
tables of such polynomials are now available on the web.
We are now going to prove Theorem 3.1.11.
Proofof Theorem 3.1.11 The key fact is that any non-constant polynomial g(X)
Fq [X] has a root in some extension of Fq . Without loss of generality, assume that
g(X) is irreducible, with deg g(X) = d. Take Fq [X]/"g(X)# = Fqd as an extension
field. In this field, g( ) = 0 where is polynomial X Fq [X]/"g(X)#, so g(X)
276
has a root. We can divide g(X) by X in Fqd and use the same construction
to prove that g1 (X) = g(X)/(X ) has a root in some extension of Fqt ,t < d.
Finally, we obtain a field containing all d roots of g(X), i.e. construct the splitting
field Spl(g(X)).
Definition 3.1.22 Given a field F K and an element K, we denote by
F( ) the smallest field containing F and (obviously, F F( ) K). Similarly,
F(1 , . . . , r ) is the smallest field containing F and elements 1 , . . . , r K. For
F = Fq and K, set
d1
M ,F (X) = (X )(X q ) . . . X q
,
(3.1.4)
where d is the smallest positive integer such that q = (such a d exists as will
be proved in Lemma 3.1.24).
A monic polynomial is the one with the highest coefficient. The minimal polynomial for K over F is a unique monic polynomial M (X) (= M ,F (X))
F[X] such that M ( ) = 0 and M (X)|g(X) for each g(X) F[X] with g( ) = 0.
When is a primitive element of K (generating K ), M (X) is called a primitive
polynomial (over F). The order of a polynomial p(X) F[X] is the smallest n such
that p(X)|(X n e).
Example 3.1.23 (Continuing Example 3.1.21.) In this example we deal with
polynomials over F2 . The irreducible polynomial X 2 + X + 1 is primitive and has
order 3. The irreducible polynomials X 3 + X + 1 and X 3 + X 2 + 1 are primitive and
of order 7. The polynomials X 4 + X 3 + 1 and X 4 + X + 1 are primitive and have
order 15 whereas X 4 + X 3 + X 2 + X + 1 is not primitive and of order 5. (It is helpful
to note that with d = 4, the order of X 4 +X 3 +1 and X 4 +X +1 equals 2d 1; on the
other hand, the order of element X in the field F2 [X]/"1+X +X 2 +X 3 +X 4 # equals
5, but its order, say, in the field F2 [X]/"1 + X + X 4 # equals 15.) All six polynomials
listed in (3.1.3) are primitive and have order 31 (i.e. appear in the decomposition
of X 31 + 1).
Lemma 3.1.24 Let Fq Fqd and Fqd . Let M (X) F[X] be the minimal
polynomial for , of degree deg M (X) = d . Then:
(a) M (X) is the only irreducible polynomial in Fq [X] with a root at .
(b) M (X) is the only monic polynomial in Fq [X] of degree d with a root at .
(c) M (X) has the form (3.1.4).
277
Proof Assertions (a), (b) follow from the definition. To prove (c), assume K is
a root of a polynomial f (X) = a0 + a1 X + + ad X d from F[X], i.e. ai i = 0.
0id
0id
0id
q
2
so q is a root. Similarly, q = q is a root, and so on.
2
s
For M (X) it yields that , q , q , . . . are roots. This will end when q =
for the first time (which proves the existence of such an s). Finally, s = d as all
d1
i
j
, q , . . . , q are distinct: if not then q = q where, say, i < j. Taking qd j
d+i
j
d
power of both sides, we get q
= q = . So, is a root of polynomial
d+i
j
X, and Spl(P(X)) = Fqd+i j . On the other hand, is a root of an
P(X) = X q
irreducible polynomial of degree d, and Spl(M (X)) = Fqd . Hence, d|(d + i j)
i
or d|(i j), which is impossible. This means that all the roots q , i < d, are
distinct.
Theorem 3.1.25 For any field Fq and integer d 1, there exists an irreducible
polynomial f (X) Fq [X] of degree d .
Proof Take a primitive element Fqd . Then Fq ( ), the minimal extension of
Fq containing , coincides with Fqd . The dimension [Fq ( ) : Fq ] of vector space
Fq ( ) over Fq equals [Fqd : Fq ] = d. The minimal polynomial M (X) for over
d1
Fq has distinct roots , q , . . . , q and therefore is of degree d.
Although proving irreducibility of a given polynomial is a problem with no general solution, the number of irreducible polynomials of a given degree can be evaluated by using an elegant (and not very complicated) method invoking the so-called
Mobius function.
Definition 3.1.26
1
(d)qn/d .
n d:
d|n
(3.1.5)
278
(3.1.6)
d|n
and
(n) = (d)
d|n
n
d
(3.1.7)
This equivalence follows when we observe that (a) the sum (d) is equal to 0 if
d|n
d: d|n
c: c|n/d
d: d|n/c
279
X
= Fqn . By
d and Spl X
q
n
q
Theorem 3.1.29, Spl(g(X)) Spl X X iff d|n.
n
n
Now if g(X)| X q X , each root of g(X) is a root of X q X . Then
n
Spl(g(X)) Spl X q X and hence d|n.
280
n
Conversely,
Spl X q X , then each root of
lies
qn
if d|n, i.e.
Spl(g(X))
g(X)
in
n
qn X , so
Spl X X . But Spl X q X is precisely
the
set
of
the
roots
of
X
n
n
each root of g(X) is that of X q X . Then g(X)| X q X .
Theorem 3.1.31 If g(X) Fq [X] is an irreducible polynomial of degree d and
d1
Spl(g(X)) = Fqd [X] is its root then all the roots of g(X) are , q , . . . , q .
d
Furthermore, d is the smallest positive integer such that q = .
d1
281
For a general irreducible polynomial, the notion of conjugacy is helpful: see Definition 3.1.34 below. This concept was introduced (and used) informally in Section
2.5 for fields F2s .
Definition 3.1.34
Elements , Fqn are called conjugate over Fq if
M ,Fq (X) = M ,Fq (X).
Summarising what was said above, we deduce the following assertion.
d1
power
of
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
1 + X + X4 1 + X3 + X4
vector
vector
(word)
(word)
0000
0000
1000
1000
0100
0100
0010
0010
0001
0001
1100
1001
0110
1101
0011
1111
1101
1110
1010
0111
0101
1010
1110
0101
0111
1011
1111
1100
1011
0110
1001
0011
(3.1.8)
282
Under the left table addition rule, the minimal polynomial M i (X) for the power
i is 1 + X + X 4 for i = 1, 2, 4, 8 and 1 + X 3 + X 4 for i = 7, 14, 13, 11, while for
i = 3, 6, 12, 9 it is 1 + X + X 2 + X 3 + X 4 and for i = 5, 10 it is 1 + X + X 2 . Under the
right table addition rule, we have to swap polynomials 1 + X + X 4 and 1 + X 3 + X 4 .
Polynomials 1 + X + X 4 and 1 + X 3 + X 4 are of order 15, polynomial 1 + X + X 2 +
X 3 + X 4 is of order 5 and 1 + X + X 2 of order 3.
4
A short way to produce these answers is to find the expression for i as a
2
3
linear combination of 1, i , i and i . For example, from the left table we
have for 7 :
7 4
= 28 = 3 + 2 + 1,
7 3
= 21 = 3 + 2 ,
3
4
and readily see that 7 = 1 + 7 , which yields 1 + X 3 + X 4 . For complete 2
ness, write down the unused expression for 7 :
7 2
= 14 = 12 2 = (1 + )3 2 = (1 + + 2 + 3 ) 2
= 2 + 3 + 4 + 5 = 2 + 3 + 1 + + (1 + ) = 1 + 3 .
283
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
vector
(word)
00000
10000
01000
00100
00010
00001
10100
01010
00101
10110
01011
10001
11100
01110
00111
10111
power
of
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
vector
(word)
11111
11011
11001
11000
01100
00110
00011
10101
11110
01111
10011
11101
11010
01101
10010
01001
(3.1.9)
Definition 3.1.38
An automorphism of Fqn over Fq (in short, an (Fqn , Fq )automorphism) is a bijection : Fqn Fqn with: (a) (a + b) = (a) + (b);
(b) (ab) = (a) (b); (c) (c) = c, for all a, b Fqn , c Fq .
Theorem 3.1.39 The set of (Fqn , Fq )-automorphisms is isomorphic to the cyclic
group Zn and generated by the Frobenius map q (a) = aq , a Fqn .
Proof Let Fqn be a primitive element. Then q 1 = e and M (X) Fq [X]
2
n1
has roots , q , q , . . . , q . An (Fqn ; Fq )-automorphism fixes the coefficients
j
of M (X), thus it permutes the roots, and ( ) = q for some j, 0 j n1. But
j
as is primitive, is completely determined by ( ). Then as q j ( ) = q =
( ), we have that = q j .
n
The rest of this section is devoted to a study of roots of unity, i.e. the roots of the
polynomial X n e over field Fq where q = ps and p =char (Fq ). Without loss of
generality, we suppose from now on that
gcd(n, q) = 1, i.e. n and q are co-prime.
(3.1.10)
Indeed, if n and q are not co-prime, we can write n = mpk . Then, by Lemma 3.1.5
k
X n e = X mp e = (X m e) p ,
and our analysis is reduced to the polynomial X m e.
284
285
Definition 3.1.46 The set of exponents i, iq, . . . , iqd1 where d(= d(i)) is the
minimal positive integer such that iqd = i mod n is called a cyclotomic coset (for i)
and denoted by Ci (= Ci (n, q)) (alternatively, C i is defined as the set of non-zero
d1
field elements i , iq , . . . , iq ).
Worked Example 3.1.47 Check that polynomials X 2 + X + 2 and X 3 + 2X 2 + 1
are primitive over F3 and compute the field tables for F9 and F27 generated by these
polynomials.
Solution The field F9 is isomorphic to F3 [X]/"X 2 + X + 2#. The multiplicative
powers of X are
2 2X + 1, 3 2X + 2, 4 2,
5 2X, 6 X + 2, 7 X + 1, 8 1.
The cyclotomic coset of is { , 3 } (as 9 = ). Then the minimal polynomial
M (X) = (X )(X 3 ) = X 2 ( + 3 )X + 4
= X 2 2X + 2 = X 2 + X + 2.
Hence, X 2 + X + 2 is primitive.
286
2 X 2 , 3 X 2 + 2, 4 X 2 + 2X + 2, 5 2X + 2,
6 2X 2 + 2X, 7 X 2 + 1, 8 X 2 + X + 2,
9 2X 2 + 2X + 2, 10 X 2 + 2X + 1, 11 X + 2,
12
X 2 + 2X, 13 2, 14 2X, 15 2X 2 , 16 2X 2 + 1,
17 2X 2 + X + 1, 18 X + 1, 19 X 2 + X,
20
2X 2 + 2, 21 2X 2 + 2X + 1, 22 X 2 + X + 1,
23 2X 2 + X + 2, 24 2X + 1, 25 2X 2 + X, 26 1.
The cyclotomic coset of in F27 is { , 3 , 9 }. Consequently, the primitive polynomial
M (X) = (X )(X 3 )(X 9 )
= X 3 ( + 3 + 9 )X 2 + ( 4 + 10 + 12 )X 13
= X 3 + 2X 2 + 1
as required.
Worked Example 3.1.48 (a) Consider the polynomial X 15 1 over F2 (with
n = 15, q = 2). Then = 2, s = ord15 (2) = 4 and Spl(X 15 1) = F24 = F16 .
The polynomial g(X) = 1 + X + X 4 is primitive: any of its roots are primitive
in F16 . So, the primitive (15, F2 )-root of unity is
= (2
4 1)/15
= .
This yields
X 9 1 = (1 + X)(1 + X + X 2 )(1 + X 3 + X 6 ).
287
|(qd 1),
f (X)| X e ,
|n iff f (X)| X n e ,
is the least positive integer such that f (X)| X e .
288
Proof (a) Spl( f (X)) = Fqd , hence every root of f (X) is a root of X q
it has ord( )|(qd 1).
d 1
e. So,
(b) Each root
of f (X) in Spl( f (X)) has ord( ) = and hence is a root of (X e).
So, f (X)| X e .
(c) If f (X)| X n e then each root of f (X) is a root of X n e, i.e. ord( )|n. So,
|n. Conversely, if n = k then (X e)|(X k e) and f (X)|(X n e) by (b).
289
degree d = ord (q), equals 2 if = d = 1, equals 0 in all other cases. In particular, the degree of an order irreducible polynomial always equals ord (q), i.e. the
minimal s such that qs = 1 mod . Here () is the Euler totient function.
The proofs of Theorems 3.1.52 and 3.1.53 are omitted (see [92]). We only
make a short comment about Theorem 3.1.53. If p(0) = 0, the order of irreducible
polynomial p(X) of degree d coincides with the order of any of its roots in the
multiplicative group Fqd . So, the order is iff d = ord (q) and p(X) divides the
so-called circular polynomial
Q (X) =
(X s ).
s:gcd(s,)=1
290
The smallest field Fq with this property (i.e. field Fq (1 , . . . , u )) is called a splitting field for p(X); we also say that p(X) splits over Fq (1 , . . . , u ). The splitting
field for p(X) is denoted by Spl(p(X)); an element Spl(p(X)) takes part in decomposition (3.1.13) iff p( ) = 0. Field Spl(p(X)) is described as the set {g( j )}
where j = 1, . . . , u, and g(X) Fq [X] are polynomials of degree < deg(p(X)). (ii)
Field Fq is splitting for the polynomial X q X. (iii) If polynomial p(X) of degree d isirreducible over Fq and is a root of p(X) in field Spl(p(X)) then Fqd
Fq [X] "p(X)# is isomorphic to Fq ( ) and all the roots of p(X) in Spl(p(X))
2
d1
are given by the conjugate elements , q , q , . . . , q . Thus, d is the smalld
est positive integer for which q = . (iv) Suppose that, for a given field Fq ,
a monic polynomial p(X) Fq [X] and an element from a larger field we have
p( ) = 0. Then there exists a unique minimal polynomial M (X) with the property
that M ( ) = 0 (i.e. such that any other polynomial p(X) with p( ) = 0 is divided
by M (X)). Polynomial M (X) is the unique irreducible polynomial over Fq vanishing at . It is also the unique polynomial of the minimum degree vanishing at .
We call M (X) the minimal polynomial of over Fq . If is a primitive element
of Fqd then M (X) is called a primitive polynomial for Fqd over Fq . We say that
elements , Fqd are conjugate over Fq if they have the same minimal polyd1
nomial over Fq . Then (v) the conjugates of Fqd over Fq are , q , . . . , q ,
d
where d is the smallest positive integer with q = . When = i where
is a primitive element, the congugacy class is associated with a cyclotomic coset
d1
C i = { i , iq , . . . , iq }.
Summary 1.58. Now assume that n and q = ps are co-prime and take polynomial
X n e. The roots of X n e in the splitting field Spl(X n e) are called nth roots
of unity over Fq . The set of all nth roots of unity is denoted by En . (i) Set En is
a cyclic subgroup of order n in the multiplicative group of field Spl(X n e). An
nth root of unity generating En is called a primitive nth root of unity. (ii) If Fqs is
Spl(X n e) then s is the smallest positive integer with n|(qs 1). (iii) Let n be the
set of primitive nth roots of unity over field Fq and n the set of primitive elements
of the splitting field Fqs = Spl(X s e). Then either n n = 0/ or n = n , the
latter happening iff n = qs 1.
291
(3.2.1)
X N e = X q1 e =
(X )
Fq
(as the splitting field Spl (X q X) is Fq ). Furthermore, owing to the fact that is a
primitive (q 1, Fq ) root of unity (or, equivalently, a primitive element of Fq ), the
minimal polynomial Mi (X) is just X i , for all i = 0, . . . , N 1.
An important property is that the RS codes are MDS. Indeed, the generator g(X)
of Xq,RS
, ,b has deg g(X) = 1. Hence, the rank k is given by
k = dim(Xq,RS
, ,b ) = N deg g(X) = N + 1.
(3.2.2)
By the generalised BCH bound (see Theorem 3.2.9 below), the minimal distance
d Xq,RS
, ,b = N k + 1.
But the Singleton bound states that d(X RS ) N k + 1. Hence,
RS
d(Xq,d,
,b ) = N k + 1 = .
(3.2.3)
292
Thus the RS codes have the largest possible minimal distance among all q-ary
codes of length q 1 and dimension k = q . Summarising, we obtain
Theorem 3.2.2
RS
Proof The proof is straightforward, as (Xq,RS
, ,b ) = Xq,q , ,b+ 1 .
0ik1
Lemma 3.2.5 Let a(X) = a0 + a1 X + + aN1 X N1 Fq [X] and be a primitive (N, Fq ) root of unity over Fq , N = q 1. Then
ai =
1
a( j ) i j .
N 0 jN1
1
1
1
a( j ) i j = N c( i ) = N c( Ni ),
N 0 jN1
(3.2.4)
293
Theorem 3.2.6
294
not known. Such an algorithm solution for the latter was found in 1969 by Elwyn Berlekamp and James Massey, and is known since as the BerlekampMassey
decoding algorithm (cf. [20]); see Section 3.3. Later on, other algorithms were
proposed: continued fraction algorithm and Euclidean algorithm (see [112]).
ReedSolomon codes played an important role in transmitting digital pictures
from American spacecraft throughout the 1970s and 1980s, often in combination
with other code constructions. These codes still figure prominently in modern space
missions although the advent of turbo-codes provides a much wider choice of coding and decoding procedures.
ReedSolomon codes are also a key component in compact disc and digital game
production. The encoding and decoding schemes employed here are capable of correcting bursts of up to 4000 errors (which makes about 2.5mm on the disc surface).
BCH
Definition 3.2.7 A BCH code Xq,N,
, ,b with parameters q, N, , and b is the
q-ary cyclic code XN = "g(X)# with length N, designed distance , such that its
generating polynomial is
(3.2.5)
g(X) = lcm M b (X), M b+1 (X), . . . , M b+ 2 (X) ,
i.e.
*
BCH
=
f (X) Fq [X] mod (X N 1) :
Xq,N,
, ,b
+
f ( b+i ) = 0, 0 i 2 .
If b = 1, this is a narrow sense BCH code. If is a primitive Nth root of unity, i.e. a
primitive root of the polynomial X N 1, the BCH code is called primitive. (Recall
that under condition gcd(q, N) = 1 these roots form a commutative multiplicative
group which is cyclic, of order N, and is a generator of this group.)
Lemma 3.2.8
BCH
The BCH code Xq,N,
, ,b has minimum distance .
Proof Without loss of generality consider a narrow sense code. Set the paritycheck ( 1) N matrix
2
...
N1
1
1 2
4
...
2(N1)
H = .
.
..
..
.
.
.
.
.
.
.
1
2(
1)
(
1)(N1)
...
1
The codewords of X are linear dependence relations between the columns of H.
Then Lemma 2.5.40 implies that any 1 columns of H are linearly independent.
In fact, select columns with top (row 1) entries k1 , . . . , k 1 where 0 k1 < <
k 1 N 1. They form a square ( 1) ( 1) matrix
295
k1 1
k2 1
...
k 1 1
k
k
k
k1 k1
2 2
...
1 k 1
D=
..
..
..
..
.
.
.
.
k1 k1 ( 2) k2 k2 ( 2) . . . k 1 k 1 ( 2)
that differs from the Vandermonde matrix by factors ks in front of the sth column.
Then the determinant of D is the product
1
1
...
1
k
k
k
1
2
...
1
1
det D = ks
..
..
..
..
.
s=1
.
.
.
k1 ( 2) k2 ( 2) . . . k 1 ( 2)
1
k
k
k
s
i
j
= 0,
=
i> j
s=1
and any 1 columns of H are indeed linearly independent. In turn, this means
that any non-zero codeword in X has weight at least . Thus, X has minimum
distance .
Theorem 3.2.9 (A generalisation of the BCH bound) Let be a primitive N th
root of unity and b 1, r 1 and > 2 integers, with gcd(r, N) = 1. Consider a
cyclic code X = "g(X)# of length N where g(X) is a monic polynomial of smallest degree with g( b ) = g( b+r ) = = g( b+( 2)r ) = 0. Prove that X has
d(X ) .
Proof As gcd(r, N) = 1, r is a primitive root of unity. So, we can repeat the
proof given above, with b replaced by bru where ru is found from ru + Nv = 1. An
alternative solution: the matrix N ( 1)
1
1
... 1
b
b+r
. . . b+( 2)r
2b
2(b+r)
. . . 2(b+( 2)r)
.
.
.
.
.
.
.
.
.
. .
.
(N1)b
(N1)b+r
(N1)(b+(
2)r)
...
checks the code X = "g(X)#. Take any of its ( 1) ( 1) submatrices, say,
with rows i1 < i2 < < i 1 . Denote it by D = (D jk ). Then
det D =
1l 1
1l 1
296
a( j )X n j .
aMS (X) =
(3.2.6)
j=1
Let q = 2 and a(X) F2 [X]/"X n 1#. Prove that the MattsonSolomon polynomial
aMS (X) is idempotent, i.e. aMS (X)2 = aMS (X) in F2 [X]/"X n 1#.
Solution Let a(X) =
0in1
3.2.5. In F2 , (nai )2 = nai , so aMS ( i )2 = aMS ( i ). For polynomials, write b(2) (X)
for the square in F2 [X] and b(X)2 for the square in F2 [X]/"X n 1#:
b(2) (X) = c(X)(X n 1) + b(X)2 .
Then
(2)
e e
... e
e
. . . n1
(2)
(aMS aMS ) .. ..
= 0.
..
.
.
. .
. .
e n1 . . . (n1)
( j i ) = 0,
0i< jn1
(2)
Vj =
i=0
i j vi , j = 0, . . . , N 1.
(3.2.7)
297
Lemma 3.2.12 (The inversion formula) The vector v is recovered from its Fourier
transform V by the formula
vi =
1 N1 i j
Vj.
N j=0
(3.2.8)
r j = 0 mod N.
j=0
r j = N mod p
j=0
N j=0
N k=0
j=0
N 1
a( j ) i j = N 1
0 jN1
=N
0 jN1
ak
0kN1
0 jN1
j(ki)
=N
ak jk i j
0kN1
ak N ki = ai .
0kN1
0 jN1
j =
( ) j = (e ( )N )(e )1 = 0.
0 jN1
Hence
ai =
1
a( j ) i j .
N 0 jN1
(3.2.9)
298
Worked Example 3.2.13 Give an alternative proof of the BCH bound: Let be a
primitive (N, Fq ) root of unity and b 1 and 2 integers. Let XN = "g(X)# be a
cyclic code where g(X) Fq [X]/"X N e# is a monic polynomial of smallest degree
having b , b+1 , . . . , b+ 2 among its roots. Then XN has minimum distance at
least .
Solution
Let a(X) =
0 jN1
a( i )X i =
0iN1
a( Ni )X i
0iN1
a( )X
i
Ni
1iN
a( j )X N j + 0 + + 0 (from b , . . . , b+ 2 )
1 jb1
+ a( b+ 1 )X Nb +1 + + a( N ).
(3.2.10)
+ a( b+ 1 )X N + + a( N )X b1
= X N p1 (X) + q(X)
= (X N e)p1 (X) + p1 (X) + q(X).
We see that cMS ( i ) = 0 iff p1 ( i ) + q( i ) = 0. But p1 (X) + q(X) is a polynomial
of degree N so it has at most N roots. Thus, cMS (X) has at most N
roots of the form i .
Therefore, the inversion formula (3.2.8) implies that the weight w(a(X)) (i.e. the
weight of the coefficient string a0 . . . aN1 ) obeys
w(a(X)) N the number of roots of cMS (X) of the form i .
(3.2.11)
That is,
w(a(X)) N (N ) = .
299
done it in their joint paper). For brevity, we take the value b = 1 (but will be able
to extend the definition to values of N > q 1).
Given N q, let S = {x1 , . . . , xN } Fq be a set of N distinct points in Fq (a
supporting set). Let Ev denote the evaluation map
Ev : f Fq [X] Ev( f ) = ( f (x1 ), . . . , f (xN )) FNq
(3.2.12)
and take
L = { f Fq [X] : deg f < k}.
(3.2.13)
Then the q-ary ReedSolomon code of length N and dimension k can be defined as
X = Ev(L);
(3.2.14)
B
d 1
erit has the minimum distance d = d(X ) = N k + 1 and corrects up to
2
rors. The encoding of a source message u = u0 . . . uk1 Fkq consists in calculating
the values of the polynomial f (X) = u0 + u1 X + + uk X k1 at points xi S.
Definition 3.2.1 (where X was defined as the set of polynomials c(X) =
cl X l Fq [X] with c( ) = c( 2 ) = = c( 1 ) = 0) emerges when
A
0l<q1
0l<N
N fl = c( Nl ), or N fNl1 = c( l+1 ), l = 0, . . . , N 1,
guaranteeing, in particular, that fk = = fN1 = 0.
Given f Fq [X] and y = y1 . . . yN FNq , set
dist ( f , y) =
1( f (xi ) = yi ).
1iN
B
d 1
. The above2
mentioned conventional decoding algorithms (the BerlekampMassey algorithm,
the continued fractions algorithm and the Euclidean algorithm) follow the same
principle: the algorithm either finds a unique f such that dist ( f , y) t or reports
that such f does not exist. On the other hand, given s > t, list decoding attempts to
find all f with dist ( f , y) s; the hope is that if we are lucky, the codeword with
this property will be unique, and we will be able to correct s errors, exceeding the
conventional limit of t errors.
Now assume y = y1 . . . yN is a received word and set t =
300
This idea goes back to Shannons bounded distance decoding: upon receiving
a word y, you inspect the Hamming balls around y until you encounter a closest
codeword (or a collection of closest codewords) to y. Of course, we want two
things: that (i) when we take s moderately larger than t, the chance of finding
two or more codewords within distance s is small, and (ii) the algorithm has a
reasonable computational complexity.
Example 3.2.14 The [32, 8] RS code over F32 has d = 25 and t = 12. If we take
s = 13, the Hamming ball about the received word y may contain two codewords.
However, assuming that all error vectors e of weight 13 are equally likely, the
probability of this event is 2.08437 1012 .
The GuruswamiSudan list decoding algorithm (see [59]) performs the task of
finding the codewords within distance s for t s tGS in a polynomial time. Here
L$
M
tGS = n 1
(k 1)n ,
and tGS can considerably exceed t.
In the above example, tGS = 17. Asymptotically, for RS codes of rate R, the
conventional decoding algorithms will correct
a fraction (1 R)/2 of errors, while
301
It is fruitful to think of X as the ideal in Fq [X]/"X N e# and consider all polynomials mod (X N e). Moreover, Fq [X]/"X N e# is a principal ideal ring: each
its ideal is of the form
"g(X)# = { f (X) : f (X) = g(X)h(X), h(X) Fq [X]/"X N e#}
(3.3.1)
g(X)|(X N e),
if deg g(X) = d then dim X = N d ,
X = { f (X) : f (X) = g(X)h(X), h(X) Fq [X], deg h(X) < N d},
if g(X) = g0 + g1 X + g2 X 2 + + gd X d , with gd = e, then g0 = 0 and
0 0 ... 0
g0 g1 g2 . . . gd
0 g0 g1 . . . gd1 gd 0 . . . 0
G=
...
...
0
...
g0
g1
. . . gd
is a generating matrix for X , with row i being the cyclic shift of row
i 1, i = 2, . . . , N d .
Conversely, for any polynomial g(X)|(X N e), the set "g(X)# =
{ f (X) : f (X) = g(X)h(X), h(X) Fq [X]/"X N e#} is an ideal in
Fq [X]/"X N e#, i.e. a cyclic code X , and the above properties (b)(d) hold.
Proof Take g(X) F2 [X] a non-zero polynomial of the least degree in X . Take
p(X) X and write
p(X) = q(X)g(X) + r(X), with deg r(X) < deg g(X).
Then r(X) mod (X N 1) belongs to X . This contradicts the choice of g(X) unless
r(X) = 0. Therefore, g(X)|p(x) which proves (i). Taking p(X) = X N 1 proves (ii).
Finally, if g(X) and g(X) both satisfy (i) and (ii) then g(X)|g(X) and g(X)|g(X),
implying g(X) = g(X).
Corollary 3.3.3 The cyclic codes of length N are in a one-to-one correspondence
with factors of X N e. In other words, the map
+
*
+
*
cyclic codes of length N
divisors of X N 1 ,
X g(X),
is a bijection.
302
303
1 0 1 1 1 0 0
0 1 0 1 1 1 0 .
0 0 1 0 1 1 1
The columns of this matrix are the non-zero elements of F32 . So, it is equivalent to
Hammings [7, 4] code.
The dual of Hammings [7, 4] code has generator polynomial X 4 + X 3 + X 2 + 1
(the reverse of h(X)). Since X 4 + X 3 + X 2 + 1 = (X + 1)g(X), it is a subcode of
Hammings [7, 4] code.
Worked Example 3.3.8 Let be a primitive N th root of unity. Let X = "g(X)#
be a cyclic code of length N . Show that the dimension dim (X ) equals the number
of powers j such that g( j ) = 0.
Solution Denote E(N) = { , 2 , . . . , N = e}, dim"g(X)# = N d, d = deg g(X).
But g(X) = (X i j ) where i1 , . . . , id are the zeros of "g(X)#. Hence, the
1 jd
304
hNr hNr1 . . . h1
h0
0 0 ... 0
0
h1
h0 0 . . . 0
hNr . . . . . .
,
H =
...
...
... ...
...
... ... ... ...
0
0
. . . hNr hNr1 . . . . . . . . . h0
(c) the dual code X is a cyclic code of dim X = r, and X = "g (X)#, where
Nr h(X 1 ) = h1 (h X Nr + h X Nr1 + + h
g (X) = h1
0
1
Nr ).
0 X
0
The generator g(X) of a cyclic code is specified, in terms of factorisation of
e, as a sub-product,
X N e = lcm M (X) : E(N) ,
(3.3.2)
XN
(3.3.3)
(3.3.4)
305
ij of length
space over Fq of dimension l, we associate ij with a (column) vector
01
0
2
T
H = .
..
0u
11 . . .
12 . . .
..
..
.
.
...
u
N1
1
N1
2
..
N1
u
(3.3.5)
can be considered as a parity-check matrix for the code with zeros 1 , . . . , u (with
the proviso that its rows may not be linearly independent).
Theorem 3.3.13 For q = 2, the Hamming [2l 1, 2l l 1, 3] code is equivalent
i
to a cyclic code "M (X)# = 0il1 (X 2 ) where is a primitive element in
F 2l .
Proof Let be a primitive (N, F2 ) root of unity where N = 2l 1. The splitting
field Spl (X N e) is F2l (as ordN (2) = l). So, is a primitive element in F2l .
l1
Take M (X) = (X )(X 2 ) X 2 , of degree l. The powers 0 =
e, , . . . , N1 form F2l , the list of the non-zero elements and the columns of the
l N matrix
0
H=
,
,...,
N1
(3.3.6)
consist of all non-zero binary vectors of length l. Hence, the Hamming [2l 1, 2l
l 1, 3] code is (equivalent to) the cyclic code "M (X)# whose zeros consist of a
primitive (2l 1; F2 ) root of unity and (necessarily) all the other roots of the
minimal polynomial for .
Theorem
3.3.14 If gcd(l, q 1) = 1 then the
l
l
q 1 q 1
,
l, 3 code is equivalent to the cyclic code.
q1 q1
q-ary
Hamming
306
1
Proof Write Spl(X N e) = Fql where l = ordN (q), N = qq1
. To justify the
l
q 1
= q 1 and l is the least positive integer with
selection of l observe that
N
l
1
> ql1 1.
this property as qq1
l
N=
ql 1
= 1 + + ql1 .
q1
(3.3.7)
g(X) = lcm M1 ,Fq (X), . . . , Mu ,Fq (X)(X)
(3.3.8)
307
"M (X)# and is equivalent to the Hamming code. We could try other possibilities
for zeros of X to see if it leads to interesting examples. This is the way to discover
the BCH codes [25], [70].
Recall the factorisation into minimal polynomials Mi (X)(= M i ,Fq (X)),
(3.3.9)
X N 1 = lcm Mi (X) : i = 0, . . . ,t ,
where is a primitive (N, Fq ) root of unity. The roots of Mi (X) are conjugate,
d1
i.e. have the form i , iq , . . . , iq where d(= d(i)) is the least integer 1 such
that iqd = i mod N. The set Ci = {i, iq, . . . , iqd1 } is the ith cyclotomic coset of q
mod N. So,
Mi (X) =
(X j ).
(3.3.10)
jCi
(3.3.11)
N ( 1) ordN (q).
e
...
b
b+1
...
b+ 2
T
2b
2(b+1)
2(b+
2)
.
H =
(3.3.12)
...
...
(N1)b
(N1)(b+1) . . .
(N1)(b+ 2)
The proper parity-check matrix H is obtained by removing redundant rows.
The binary BCH codes are simplest to deal with. Let Ci = {i, 2i, . . . , i2d1 } be
the ith cyclotomic coset (with d(= d(i)) being the smallest non-zero integer such
that i 2d = i mod N). Then u Ci iff 2u mod N Ci . So, Mi (X) = M2i (X), and for
all s 1 the polynomials
g2s1 (X) = g2s (X) = lcm{M1 (X), M2 (X), . . . , M2s (X)}.
308
BCH
BCH . So we can focus on the narrow
We immediately deduce that X2,N,2s+1
= X2,N,2s
sense BCH codes with odd designed distance = 2E + 1, and obtain an improvement of Theorem 3.3.16:
Theorem 3.3.17
BCH
The rank of a binary BCH code X2,N,2E+1
is N E ordN (2).
The problem of determining exactly the minimum distance of a BCH code has
been solved only partially (although a number of results exist in the literature). We
present the following theorem without proof.
Theorem 3.3.18 The minimum distance of a binary primitive narrow sense BCH
code is an odd number.
The previous results can be sharpened in a number of particular cases.
Worked Example 3.3.19
N(N 1) . . . (N i + 1) = S(E).
(3.3.14)
0iE+1
309
BCH
Corollary 3.3.21 If N = 2s 1 and s > 1 + log2 (E + 1)! then d(X2,2
s 1,2E+1 ) =
2E + 1. In particular, let N = 31 and s = 5. Then we easily verify that
5E
<
0iE+1
31
i
Set N = m, then
X N 1 = X m 1 = (X m 1)(1 + X m + + X ( 1)m ).
310
Theorem 3.3.25
There exists no infinite sequence of q-ary primitive BCH
codes XNBCH of length N such that d(XN )/N and rank(XN )/N are bounded away
from 0.
Decoding BCH codes can be done by using the so-called BerlekampMassey
algorithm. To begin with, consider a binary primitive narrow sense BCH code
BCH ) of length N = 2s 1 and designed distance 5. With E = 2 and
X BCH (= X2,N,5
N
sE
holds, and by Theorem 3.3.20, the distance
s 4, inequality 2 <
i
0iE+1
BCH
d X
equals 5. Thus, the code is two-error correcting. Also, by Theorem
3.3.17, the rank of X BCH is N 2s. [For s = 4, the rank is actually equal to
N 2s = 15 8 = 7.] So, X BCH is [2s 1, 2s 1 2s, 5].
The defining zeros are , 2 , 3 , 4 where is a primitive Nth root of unity
over F2 (which is also a primitive element of F2s ). We know that and 3
suffice as defining zeros: X BCH = {c(X) F2 [X]/"X N 1# : c( ) = c( 3 ) = 0}.
So, the parity-check matrix H in (3.3.12) can be taken in the form
e
2
N1
T
.
(3.3.16)
H =
e
3
6
3(N1)
It is instructive to compare the situation with the binary Hamming [2l 1, 2l
1l] code X ( H) . In the case of code X BCH , suppose again that a codeword c(X)
X was sent and the received word r(X) has 2 errors. Write r(X) = c(X) + e(X)
where the error polynomial e(X) now has weight 2. There are three cases to
consider: e(X) = 0, e(X) = X i or e(X) = X i + X j , 0 i = j N 1. If r( ) = r1
and r( 3 ) = r3 then e( ) = r1 and e( 3 ) = r3 . In the case of no error (e(X) = 0),
r1 = r3 = 0, and vice versa. In the single-error case (e(X) = X i ),
r3 = e( 3 ) = 3i = ( i )3 = (e( ))3 = r13 = 0.
Conversely, if r3 = r13 = 0 then e( 3 ) = e( )3 . If e(X) = X i + X j with i = j then
3i + 3 j = ( i + j )3 = 3i + 2i j + i 2 j + 3 j ,
i.e. 2i j + i 2 j = 0 or i + j = 0 which implies i = j, a contradiction. So, the
single error occurs iff r3 = r13 = 0, and the wrong digit is i such that r1 = i . So,
in the single-error case we identify a column of H, i.e. a pair ( i , 3i ) = (r1 , r3 )
and change digit i in r(X). This is completely similar to the decoding procedure for
Hamming codes.
In the two-error case (e(X) = X i + X j , i = j), in the spirit of the Hamming codes,
we try to find a pair of columns ( i , 3i ) and ( j , 3 j ) such that the sum ( i +
j , 3i + 3 j ) = (r1 , r3 ), i.e. solve the equation
r1 = i + j , r3 = 3i + 3 j .
311
Then find i, j such that y1 = i , y2 = j (y1 , y2 are called error locators). If such i,
j (or equivalently, error locators y1 , y2 ) are found, we know that errors occurred at
positions i and j.
It is convenient to introduce an error-locator polynomial (X) whose roots are
1
y1
1 , y2 :
(3.3.17)
where is the primitive element of F2s . [The rank of the code is N 2l and
for l 4 the distance equals 5, i.e. X is [2l 1, 2l 1 2l, 5] and corrects two
errors.] Assume that at most two errors occurred in a received word r(X) and let
r( ) = r1 , r( 3 ) = r3 . Then:
(a) if r1 = 0 then r3 = 0 and no error occurred;
(b) if r3 = r13 = 0 then a single error occurred at position i where r1 = i ;
(c) if r1 =
0 and r3 = r13 then two errors occurred: the error locator polynomial
(X) = 1 r1 X + (r3 r11 r12 )X 2 has two distinct roots N1i , N1 j and
the errors occurred at positions i and j.
For a binary BCH code with a general designed distance ( = 2t +1 is assumed
odd), we follow the same idea: compute
r1 = e( ), r3 = e( 3 ), . . . , r 2 = e( 2 )
for the received word r(X) = c(X) + e(X). Suppose that errors occurred at places
i1 , . . . , it . Then
e(X) =
1 jt
Xij.
312
i j = r1 ,
1 jt
3i j = r3 , . . . ,
1 jt
( 2)i j = r 2 ,
yj 2 = r 2 .
1 jt
y j = r1 ,
1 jt
y3j = r3 , . . . ,
1 jt
1 jt
(X) =
(1 y j X)
1 jt
i
has the roots y1
j . The coefficients i in (X) = i X can be determined from
0it
1
r2
r4
.
..
r2t4
r2t2
0
r1
r3
..
.
r2t5
r2t3
0
1
r2
..
.
..
.
...
0
0
r1
..
.
..
.
...
0
0
1
..
.
..
.
...
...
...
...
..
.
..
.
...
0
0
0
..
.
rt3
rt1
1
2
3
..
.
2t3
2t1
r1
r3
r5
..
.
r2t3
r2t1
313
(X) = 1 + 6 X + ( 13 + 12 )X 2 .
The roots of l(X) are 3 and 11 by the direct check. Hence we discover the errors
at the 4th and 12th positions.
Ak zk and WX (z) =
0kN
k
A
kz
(3.4.1)
0kN
N
1z
1
1 + (q 1)z WX
, z C,
(3.4.2)
WX (z) =
X
1 + (q 1)z
and takes a particularly elegant form in the binary case (q = 2):
1z
1
n
(1 + z) WX
WX (z) =
.
X
1+z
(3.4.3)
(3.4.4)
314
More generally, a linear representation D of a group G over a field F (not necessarily finite) is defined as a homomorphism
D : G GL(V ) : g D(g)
(3.4.5)
( j) : Fq S : u ju .
The character ( j) is non-trivial for j = 0. In fact, all characters of Fq can be described in this way, but we omit the proof of this assertion.
Next, we define a character of the group G = FNq . Fix a non-trivial onedimensional ordinary character : Fq S and a non-zero element v FNq and
define a character of the additive group G = FNq as follows:
(3.4.7)
(g) = 0.
(3.4.8)
gG
(h) (g) =
gG
(hg) = (g),
gG
gG
gG
315
Definition 3.4.3 The discrete Fourier transform (in short, DFT) of a function f
on FNq is defined by
f =
f (v)(v) .
(3.4.9)
vFN
q
Sometimes, the weight enumerator polynomial of code X is defined as a function of two formal variables x, y:
WX (x, y) =
xw(v) yNw(v)
(3.4.10)
vX
(if one sets x = z, y = 1, (3.4.10) coincides with (3.4.1)). So, we want to apply the
DFT to the function (no harm to say that x, y S )
g : FNq C [x, y] : v xw(v) yNw(v) .
Lemma 3.4.4
(3.4.11)
(3.4.12)
(3.4.13)
Then
Proof Let denote a non-trivial ordinary character of the additive group G = Fq .
Given Fq , set | | = 0 if = 0 and | | = 1 otherwise. Then for all u FNq we
compute
g(u) = "v, u# g(v)
vFN
q
"v, u# xw(v) yNw(v)
...
...
vFN
q
v0 Fq
vN1 Fq
N1
vi ui
i=0
N1
v0 Fq
vN1 Fq i=0
N1
gG
316
If ui = 0 then
gG
(gui )x = y (0)x = y x.
gG\0
f(x) = qk
f (y).
(3.4.14)
yX
xX
f(x) =
"v, x# f (v)
vFN
q xX
xX vFN
q
xX
"v, x# f (v)
vX xX
"v, x# f (v).
vFN
q \X xX
In the first sum we have "v, x# = (0) = 1 for all v X and all x X . In
the second sum we study the linear form
X Fq : x "v, x#.
Since v FNq \ X , this linear form is surjective, whence its kernel has dimension
k 1, i.e. for any g Fq there exist qk1 vectors x X such that "v, x# = g. This
implies
f(x) = qk
f (y) + qk1
f (y)
yX
xX
= qk
vFnq \X
f (v) (g)
gG
yX
(3.4.15)
Proof
317
g(v) = qk
vX
k
g(v)
vX
= q WX (y x, y + (q 1)x).
Substituting x = z, y = 1 we obtain (3.4.3).
Example 3.4.7 (i) For all codes X , WX (0) = A0 = 1 and WX (1) = X . When
N
X = FN
q , WX (z) = [1 + z(q 1)] .
(ii) For a binary repetition code X = {0000, 1111}, WX (x, y) = x4 + y4 . Hence,
WX (x, y) =
1
(y x)4 + (y + x)4 = y4 + 6x2 y2 + x4 .
2
(iii) Let X be the Hamming [7, 4] code. The dual code X has 8 codewords; all
except 0 are of weight 4. Hence, WX (x, y) = x7 + 7x4 y3 , and, by the MacWilliams
identity,
1
1
WX (x y, x + y) = 3 (x y)7 + 7(x y)4 (x + y)3
3
2
2
= x7 + 7x4 y3 + 7x3 y4 + y4 .
WX =
Hence, X has 7 words of weight 3 and 4 each. Together with the 0 and 1 words,
this accounts for all 16 words of the Hamming [7, 4] code.
Another way to derive the identity (3.4.1) is to use an abstract result related
to group algebras and character transforms for Hamming spaces FN
(which are
q
linear spaces over field Fq of dimension N). For brevity, the subscript q and superscript (N) will be often omitted.
Definition 3.4.8 The (complex) group algebra CFN for space FN is defined as
the linear space of complex functions G : x FN G(x) C equipped by a complex involution (conjugation) and multiplication. Thus, we have four operations for
functions G(x); addition and scalar (complex) multiplication are standard (pointwise), with (G + G )(x) = G(x) + G (x) and (aG)(x) = aG(x), G, G CFN ,
a C, x FN . The involution is just the (point-wise) complex conjugation:
G (x) = G(x) ; it is an idempotent operation, with G = G. However, the multiplication (denoted by ) is a convolution:
(G G )(x) =
G(y)G (x y), x FN .
(3.4.16)
yFN
This makes CFN a commutative ring and at the same time a (complex) linear
space, of dimension dim CFN = qN , with involution. (A set that is a commutative
318
ring and a linear space is called an algebra.) The natural basis in CFN is formed
by Diracs (or Kroneckers) delta-functions y , with y (x) = 1(x = y), x, y H .
If X FN is a linear code, we set GX (x) = 1(x X ).
The multiplication rule (3.4.16) requires an explanation. If we rewrite the
G(y)G(y ) (which makes the commuRHS in a symmetric form
y,y FN :y+y =x
tativity of the -multiplication obvious) then there will be an analogy with the
multiplication of polynomials. In fact, if A(t) = a0 + a1t + + al1t l1 and
A (t) = a 0 + a 1t + + a l 1t l 1 are two polynomials, with coefficient strings
(a0 , . . . , al1 ) and (a 0 , . . . , a l 1 ), then the product B(t) = A(t)A (t) has a string
am a m .
of coefficients (b0 , . . . , bl1+l 1 ) where bk =
m,m 0:m+m =k
From this point of view, rule (3.4.16) is behind some polynomial-type multiplication. Polynomials of degree n 1 form of course a (complex) linear space of
dimension n. However, they do not form a group (or even a semi-group). To make
a group, we should affiliate inverse monomials 1/t, 1/t 2 , and so on, and either consider infinite series or make an agreement that t n = 1 (i.e. treat t as an element of
a cyclic group, not a free variable). Similar constructions can be done for polynomials of several variables, but there we have a variety of possible agreements on
relations between variables.
Returning to our group algebra CH , we make the following steps:
(i) Produce a multiplicative version of the Hamming group H . That is, take a
collection of formal variables t (x) labelled by elements x H and postulate the
rule t (x)t (x ) = t (x+x ) for all x, x CH .
(ii) Then consider the set TH of all (complex) linear combinations G =
xH xt (x) and introduce (ii1) the addition G + G = xH (x + x )t (x) and (ii2)
the scalar multiplication aG = xH (ax )t (x) , G, G TH , a C. We again obtain a linear space of dimension qN , with the basis formed by basic combinations t (x) , x H . Obviously, TH and CH are isomorphic as linear spaces, with
G g.
(iii) Now remove brackets in t (x) (but keep the rule t xt x = t x+x ) and write
xH xt x as g(t) thinking that this is a function (in fact, a polynomial) of some
variable t obeying the above rule. Finally, consider the polynomial multiplication
g(t)g (t) in TH . Then TH and CH become isomorphic not only as linear spaces
but also as rings, i.e. as algebras.
The above construction is very powerful and can be used for any group, not just
for HN . Its power will be manifested in the derivation of the MacWilliams identity.
So, we will think of CH as a set of functions
g(t) =
xHn
xt x
(3.4.17)
319
t x;
(3.4.18)
xX
Xx (g)t x ,
(3.4.19a)
y (x y)
(3.4.19b)
xHn
where g (x , x Hn ) and
Xx (g) =
yHn
k=0
x:w(x)=k
0kn
Here
Ak =
x .
(3.4.21)
xH :w(x)=k
For a linear code X , with generating function gX (t) (see 3.4.18)), Ak gives the
number of codewords of weight k:
Ak = #{x X : w(x) = k}.
(3.4.22)
0kn
x: w(x)=k
320
where
A k =
Xx (g).
(3.4.24)
xH : w(x)=k
We have
Wg (s) = (1 + (q 1)s) Wg
n
1s
.
1 + (q 1)s
(3.4.25)
k=0
k=0
A k sk = Ak (1 s)k (1 + (q 1)s)nk
(3.4.26)
and expand:
n
(3.4.27)
i=0
j
i
j
(3.4.28)
j=0(i+kn)
0 (i + k n) = max [0, i + k n], i k = min [i, k].
Then
A k sk =
0kn
0kn
Ak
Ki (k)si =
0in
Ak Ki (k)si
0in 0kn
k
Ai Kk (i)s ,
0kn 0in
i.e.
A k =
Ai Kk (i).
(3.4.29)
0in
(3.4.30)
gX = #X gX .
(3.4.31)
Proof
By Lemma 3.4.2
321
Xu (gX ) = Xu
xX
(y u) = #X 1(u X ).
yX
Xx (gX )t x =
xH
= #X
#X 1(x X )t x
xH
xX
t x = #X gX (t).
Hence,
WgX (s) = #X WgX (s),
(3.4.32)
WX (s) =
Ak sk ,
WX (s) =
k=0
Ak sk
(3.4.33)
k=0
1
#X
Ai Kk (i),
(3.4.35)
0in
(3.4.36)
322
1
{(x, y) : x, y X , (x, y) = k},
M
k = 0, 1, . . . , N
(each pair x, y is counted two times). The numbers B0 , B1 , . . . , BN form the distance
distribution of code X . The expression
BX (s) =
Bk sk
(3.4.38)
0kN
X (s) =
sx .
(3.4.39)
(3.4.40)
xX
1
1
sx sy =
sxy
M xX yX
M x,yX
and hence
WhX (s) =
1
1 w(x y) = k sk = Bk sk
M 0kN x,yX :
0kN
= BX (s).
Now by the MacWilliams identity, for a given non-trivial character and the
corresponding transform , we obtain
Theorem 3.4.14 For hX (s) as above, if
hX (s) is the character transform and
Wh (s) its w-enumerator, with
X
k
Wh (s) = Bk s =
x (hX ) sk ,
X
0kN
0kN
w(x)=k
323
then
Bk =
Bi Kk (i),
0iN
x (hX (t)) =
and so,
Bk =
x (hX ) =
x:w(x)=k
1
|x (X )|2 0.
M w(x)=k
Thus:
Theorem 3.4.16
Bi Kk (i) 0.
(3.4.41)
0iN
Bi = M 2
0iN
or
Ei = M, with Ei =
0iN
1
Bi
M
(3.4.42)
Ei Kk (i) 0.
0iN
yFN
q :w(y)=k
"x,y# = Kk (i).
(3.4.43)
324
"x,y# =
...
yh1 Fq
yD
yhk Fq
j
= (q 1)k j
hi y
i=1 yFq
= (1) j (q 1)k j .
Hence,
N
M Bi Kk (i) =
"xy,z#
i=0
"x,z# |2 0.
zFN
q :w(z)=k xX
This leads us to the so-called linear programming (LP) bound stated in Theorem
3.4.17 below.
(The LP bound) The following inequality holds:
Theorem 3.4.17
0iN
and
0iN
325
Hence, for d even, as we can assume that E2i+1 = 0, the constraint in (3.4.44)
need only be considered for k = 0, . . . , [N/2].
Ei K0 (i) 0 follows from Ei 0.
0iN
M2 (N, d)
max
0iN
for k = 1, . . . ,
N
+ Ei Kk (i) 0
k
diN
(3.4.45)
A B
N
.
2
f (x) = 1 + f j K j (x)
j=1
(3.4.46)
326
j=d
f (0) = 1 + f j K j (0)
j=1
N
k=1
N
i=d
1 fk Bi (X )Kk (i)
N
= 1 Bi (X ) fk Kk (i)
i=d
N
k=1
= 1 Bi (X )( f (i) 1)
i=d
N
1 + Bi (X )
i=d
= M = Mq (N, d).
To obtain the Singleton bound select
Nd+1
f (x) = q
j=d
x
1
.
j
i=0
N i
N j
Ki (k) = q
N k
j
1
qN
f (i)Ki (k)
i=0
d1
N
N i
(k)/
K
N d +1 i
d 1
qd1 i=0
N k
N
=
/
0.
d 1
d 1
1
k=0
N k
N j
Kk (x) = q
N x
j
.
(3.4.47)
327
Ei
0iN
328
65/2
2
s 13s + 65/2
213
13
1 + 13 + +
5
for all s < 13 such that s2 13s + 65/2 > 0.
65 13
2
1 + 13 + 13 6 = 2.33277 106 :
21
not good enough. Next, s = 3 yields s2 13s + 65/2 = 9 39 + 65/2 = 5/2 > 0
and
M2 (13, 5)
13 11
65 13
212
2
2 :
1 + 13 + 13 6 + 13 2 5 = 13
5
111 66
not as good as Hammings. Finally, observe that 42 13 4 + 65/2 < 0, and the
procedure stops.
(3.5.1)
Then X is a [2m, m] linear code and has information rate 1/2. We can recover
from any non-zero codeword (a, b) X , as = ba1 (division in F2m ). Hence,
if = then X X = {0}.
Now, given = m (0, 1/2], we want to find = m such that code X has
minimum weight 2m . Since a non-zero binary (2m)-word can enter at most
one of the X s, we can find such if the number of the non-zero (2m)-words
329
of weight < 2m is < 2m 1, the number of distinct codes X . That is, we can
manage if
2m
< 2m 1
i
1i2m 1
2m
or even better,
< 2m 1. Now use the following:
i
1i2m
Lemma 3.5.2
For 0 1/2,
N
2N ( ) ,
k
0k N
(3.5.2)
(3.5.3)
Minimise the RHS of (3.5.3) in x = et for t > 0, i.e. for 0 < x < 1. This yields the
minimiser et = /(1 ) and the minimal value
N N
N
1
1+
1
2 1
1 N
1 N
= N N
= 2N ( )
,
2
2
with = 1 . Hence, (3.5.2) implies
N
N
1
N
N ( ) 1
.
k 2 2
2
0k N
330
= m =
1
1/2
log m
(3.5.4)
(with 0 < < 12 ), bound (3.5.4) becomes 2m2m/ log m < 2m 1 which is true for
m large enough. And m h1 (1/2) > 0, as m . Here and below, 1 is
the inverse function to (0, 1/2] ( ). In the code (3.5.1) with a fixed
the information rate is 1/2 but one cannot guarantee that d/2m is bounded away
from 0. Moreover, there is no effective way of finding a proper = m . However,
in 1972, Justensen [81] showed how to obtain a good sequence of codes cleverly
using the concatenation of words from an RS code.
More precisely, consider a binary (k1 k2 )-word a organised as k1 separate k2 words: a = a(0) a(1) . . . a(k1 1) . Pictorially,
k2
a =
k2
...
a(k1 1)
a(0)
a(i) F2k2 , 0 i k1 1.
We fix an [N1 , k1 , d1 ] code X1 over F2k2 called an outer code: X1 FN2k12 . Then
string a is encoded into a codeword c = c0 c1 . . . cN1 1 X1 . Next, each ci F2k2
is encoded by a codeword bi from an [N2 , k2 , d2 ] code X2 over F2 , called an inner
code. The result is a string b = b(0) . . . b(N1 1) FNq 1 N2 of length N1 N2 :
N2
N2
b =
...
b(0)
b(N1 1)
b(i) F2N2 , 0 i N1 1.
331
(3.5.6)
Ju
Ju
Ju
, x = 0 = d Xm,k
= min w(x) : x Xm,k
.
(3.5.7)
w Xm,k
For any fixed m, if the outer RS code XNRS , N = 2m 1, has minimum weight
Ju has
d then any super-codeword b = (c0 , c0 )(c1 , c1 ) . . . (cN1 , N1 cN1 ) Xm,k
d non-zero first components c0 , . . . , cN1 . Furthermore, any two inner codes
among X (0) , X (1) , . . . , X (N1) have only 0 in common. So, the corresponding d
ordered pairs, being from different codes, must be distinct. That is, super-codeword
b has d distinct non-zero binary (2m)-strings.
Ju is at least the sum of the weights
Next, the weight of super-codeword b Xm,k
of the above d distinct non-zero binary (2m)-strings. So, we need to establish a
lower bound on such a sum. Note that
k1
N(1 2R0 ).
d = N k+1 = N 1
N
Ju has at least N(1 2R ) distinct non-zero binary
Hence, a super-codeword b Xm,k
0
(2m)-strings.
Lemma 3.5.4 The sum of the weights of any N(1 2R0 ) distinct non-zero binary
(2m)-strings is
1 1
o(1) .
(3.5.8)
2mN(1 2R0 )
2
Proof By Lemma 3.5.2, for any [0, 1/2], the number of non-zero binary (2m)strings of weight 2m is
2m
22m ( ) .
i
1i2m
332
2m ( )
N(1 2R0 )
.
1
1
Select m =
(0, 1/2). Then m 1 (1/2), as h1 is
2 log(2m)
continuous on [0, 1/2]. So,
1
1
1
= 1
o(1).
m = 1
2 log(2m)
2
1
2m2m/ log(2m)
2m 1
m
1
2
0.
m
2m/
log(2m)
2 1 2
So the total weight of the N(1 2R0 ) distinct (2m)-strings is bounded below by
2mN(1 2R0 ) 1 (1/2) o(1) (1 o(1)) = 2mN(1 2R0 ) 1 (1/2) o(1) .
Thus the result follows.
Ju has
Lemma 3.5.4 demonstrates that Xm,k
Ju
1 1
o(1) .
w Xm,k 2mN(1 2R0 )
2
Then
(3.5.9)
Ju
w Xm,k
(1 2R0 )( 1 (1/2) o(1)) (1 2R0 ) 1 (1/2)
Ju
length Xm,k
c(1 2R0 ) > 0.
In the construction, R0 (0, 1/2). However, by truncating one can achieve any
given rate R0 (0, 1); see [110].
The next family of codes to be discussed in this section is formed by alternant
codes. Alternant codes are a generalisation of BCH (though in general not cyclic).
Like Justesen codes, alternant codes also form an asymptotically good family.
333
c11 . . .
..
..
M= .
.
cr1
c1n
.. .
.
. . . crn
m
As before, each ci j can be written as
c
i j (Fq ) , a column vector of length m over
Fq . That is, we can think of M as an (mr n) matrix over Fq (denoted again by M).
Given elements a1 , . . . , an Fqm , we have
a j ci j
c11 . . . c1n
a1
a1
1 jn
..
.. ..
.
..
.
.
.
M . = .
.
. . . =
.
an
cr1 . . . crn
an
a j cr j
1 jn
c =
c +
Furthermore, if b Fq and c, d Fqm then b
bc and
d = (c + d). Thus,
if a1 , . . . , an Fq , then
ai ci j
c11 . . . c1n
a1
a1
1 jn
.. ..
.
.
..
.
. . .. .. =
M . = .
.
.
an
an
cr1 . . . crn
ai cr j
1 jn
So, if the columns of M are linearly independent as r-vectors over Fqm , they are
also linearly independent as (rm)-vectors over Fq . That is, the columns of M are
linearly independent over Fq .
Recall that if is a primitive (n, Fqm ) root of unity and 2 then the n (m )
Vandermonde matrix over Fq
e
...
2
...
1
T
2
4
2(
1)
H =
...
...
n1 2(n1) . . . ( 1)(n1)
BCH (a proper parity-check matrix emerges
checks a narrow-sense BCH code Xq,n,
,
after column purging). Generalise it by taking an n r matrix over Fqm
h1 h1 1 . . . h1 1r2 h1 1r1
h2 h2 2 . . . h2 r2 h2 r1
2
2
(3.5.11)
A= .
,
..
..
.
.
.
.
.
.
.
hn hn n . . . hn nr2 hn nr1
334
h2 h2 2
A = .
..
.
..
hn hn n
. . . h1 1 r2
. . . h2 2 r2
..
..
.
.
r2
. . . hn n
r1
h1 1
r1
h2 2
.
r1
hn n
(3.5.12)
are linearly independent over Fqm . However, columns of A in (3.5.12) can be linearly dependent and purging such columns may be required to produce a genuine
parity-check matrix H.
Definition 3.5.5 Let = (1 , . . . , n ) and h = (h1 , . . . , hn ) where 1 , . . . , n are
distinct and h1 , . . . , hn non-zero elements of Fqm . Given r < n, an alternant code
XAlt
,h is the kernel of the n (rm) matrix A in (3.5.12).
(3.5.13)
335
(3.5.14a)
(3.5.14b)
bi (X i )1 Fqm [X]/"G(X)#.
(3.5.15)
1in
(3.5.16)
Clearly, XGo
,G is a linear code. The polynomial G(X) is called the Goppa polynomial; if G(X) is irreducible, we say that X Go is irreducible.
So, b = b1 . . . bn X Go iff in Fqm [X]
(3.5.17)
1in
Write G(X) = gi X i where deg G(X) = r, gr = 1 and r < n. Then in Fqm [X]
0ir
= 0 jr g j 0u j1 X u ij1u
336
and so
bi (G(X) G(i ))(X i )1 G(i )1
1in
= bi g j
1in
0 jr
0ur1
0u j1
ij1u X u G( j )1
X u bi G(i )1
1in
u+1 jr
g j ij1u .
1in
bi G(i )1
g j ij1u = 0
(3.5.18)
u+1 jr
for all u = 0, . . . , r 1.
Equation (3.5.18) leads to the parity-check matrix for X Go . First, we see that
the matrix
G(2 )1
...
G(n )1
G(1 )1
1 G(1 )1
2 G(2 )1 . . . n G(n )1
2 G(1 )1
22 G(2 )1 . . . n2 G(n )1
(3.5.19)
1
.
.
.
.
.
.
.
.
.
.
.
.
r1
r1
1
1
r1
1
1 G(1 )
2 G(2 )
. . . n G(n )
which is (n r) over Fm
q , provides a parity-check. As before, any r rows of matrix (3.5.19) are linearly independent over Fqm and so are its columns. Then again
we write (3.5.19) as an n (mr) matrix over Fq ; after purging linearly dependent
columns it will give the parity-check matrix H.
We see that X Go is an alternant code XAlt
,h where = (1 , . . . , n ) and h =
1
1
(G(1 ) , . . . , G(n ) ). So, Theorem 3.5.6 implies
Theorem 3.5.9 The q-ary Goppa code X = XGo
,G , where = {1 , . . . , n } and
deg G(X) = r < n, has length n, rank k satisfying n mr k n r and minimum
distance d(X ) r + 1.
As before, the above bound on minimum distance can be improved in the binary
case. Suppose that a binary word b = b1 . . . bn X where X is a Goppa code
XGo
,G , where F2m and G(X) F2 [X]. Suppose w(b) = w and bi1 = = biw = 1.
Take fb (X) = (X i j ) and write the derivative X fb (X) as
1 jw
where Rb (X) =
1 jw
have no common roots in any extension F2K , they are co-prime. Then b X Go
337
iff G(X) divides Rb (X) which is the same as G(X) divides X fb (X). For q = 2,
X fb (X) has only even powers of X (as its monomials are of the form X 1 times
a product of some i j s: this vanishes when is even). In other words, X fb =
h(X 2 ) = (h(X))2 for some polynomial h(X). Hence if g(X) is the polynomial of
lowest degree which is a square and divisible by G(X) then G(X) divides X fb (X)
iff g(X) divides X fb (X). So,
b X Go g(X)|X fb (X) Rb (X) = 0 mod g(X).
(3.5.21)
i1
over Fq obtained from the
in (3.5.12), we take the n (mr) matrix A = h j j
n r matrix A = h j i1
over Fqm by replacing the entries with rows of length m.
j
Then purge linearly dependent columns from A . Recall that h1 , . . . , hn are non-zero
and 1 , . . . , n are distinct elements of Fqm . Suppose a word u = c + e is received,
where c is the right codeword and e an error vector. We assume that r is even and
that t r/2 errors have occurred, at digits 1 i1 < < it n. Let the i j th entry
of e be ei j = 0. It is convenient to identify the error locators with elements i j : as
i = i for i = i (the i are distinct), we will know the erroneous positions if we
determine i1 , . . . , it . Moreover, if we introduce the error locator polynomial
t
(X) = (1 i j X) =
j=1
i X i ,
(3.5.22)
0it
, then it will be enough to find (X) (i.e. coeffiwith 0 = 1 and the roots i1
j
cients i ).
We have to calculate the syndrome (we will call it an A-syndrome) emerging by
acting on matrix A:
uA = eA = 0 . . . 0ei1 . . . eit 0 . . . 0 A.
338
(X) =
(1 i j X).
(3.5.23)
1 jt: j=k
1kt
Lemma 3.5.12
hik eik
si X i . It is convenient
0ir1
For all u = 1, . . . ,t ,
ei j =
(i1
)
j
hi j
1 jt, j= j
(1 i j i1
)
j
(3.5.24)
Proof Straightforward.
The crucial fact is that (X), (X) and s(X) are related by
Lemma 3.5.13
(3.5.25)
(X) (X)s(X) =
hik eik
1 jt: j=k
1kt
= hik eik
k
1 jt: j=k
(1 i j X) (X)
sl X l
0lr1
(1 i j X) (X)
l 1kt
= hik eik
k
(1 i X) (X) il X l
j
j=k
j=k
1 irk X r
= hik eik (1 i j X) 1 (1 ik X)
1 ik X
k
j=k
= hik ik (1 i j X) irk X r = 0 mod X r .
k
j=k
Lemma 3.5.13 shows the way of decoding alternant codes. We know that there
exists a polynomial q(X) such that
(3.5.26)
We also have deg (X) t 1 < r/2, deg (X) = t r/2 and that (X) and (X)
are co-prime as they have no common roots in any extension. Suppose we apply the
339
Euclid algorithm to the known polynomials f (X) = X r and g(X) = s(X) with the
aim to find (X) and (X). By Lemma 2.5.44, a typical step produces a remainder
rk (X) = ak (X)X r + bk (X)s(X).
(3.5.27)
If we want rk (X) and bk (X) to give (X) and (X), their degrees must match:
at least we must have deg rk (X) < r/2 and deg bk (X) r/2. So, the algorithm is
repeated until deg rk1 (X) r/2 and deg rk (X) < r/2. Then, according to Lemma
2.5.44, statement (3), deg bk (X) = deg X r deg rk1 (X) r r/2 = r/2. This is
possible as the algorithm can be iterated until rk (X) = gcd(X r , s(X)). But then
rk (X)| (X) and hence deg rk (X) deg (X) < r/2. So we can assume deg rk (X)
r/2, deg bk (X) r/2.
The relevant equations are
340
Then (X) is the error locator polynomial whose roots are the inverses of
i1 , . . . , yt = it , and i1 , . . . , it are the error digits. The values ei j are given by
ei j =
(i1
)
j
hi j l= j (1 il i1
)
j
(3.5.28)
341
q1 j = ( j )1 , j = b, . . . , b + 2.
That is, they are
qb , qb+1 , . . . , qb+q 1
and the generator polynomial g (X) for (X RS ) is
g (X) = (X b )(X b
+1
) . . . (X b
+q 1
342
(i) X = X ev or
(ii) X ev is an [N, k 1] linear subcode of X .
Prove that if the generating matrix G of X has no zero column then the total weight
w(x) equals N2k1 .
xX
and
A5 = N(N 1)(N 3)(N 7) 5!.
j
j=0s+2 2 +1
343
XH,
Ai Ks (i)
i=1
(3.6.1)
344
where
i
N i
(1) j ,
Ks (i) =
j
s
j
j=0s+iN
si
(3.6.2)
1
1)K (21 )
1
+
(2
s
2
21
2 1 21
s21
1
j
(1) .
= 1 + (2 1)
j
2 1 j
2
j=0s+21 2 +1
As =
345
346
So, and 3 suffice as zeros, and the generating polynomial g(X) equals
(X 5 + X 2 + 1)(X 5 + X 4 + X 3 + X 2 + 1)
= X 10 + X 9 + X 8 + X 6 + X 5 + X 3 + 1,
as required. In other words:
X = {c(X) F2 [X]/(X 31 + 1) : c( ) = c( 3 ) = 0}
= {c(X) F2 [X]/(X 31 + 1) : g(X)|c(X)}.
The rank of X equals 21. The minimum distance of X equals 5, its designed
distance. This follows from Theorem
3.3.20:
N
E
Let N = 2 1. If 2 <
then the binary narrow-sense primitive BCH
0iE+1 i
code of designed distance 2E + 1 has minimum distance 2E + 1.
In fact, N = 31 = 25 1 with = 5 and E = 2, i.e. 2E + 1 = 5, and
1024 = 210 < 1 + 31 +
31 30 31 30 29
+
= 4992.
2
23
Thus, X corrects two errors. The BerlekampMassey decoding procedure requires calculating the values of the received polynomial at the defining zeros. From
the F32 field table we have
u( ) = 12 + 11 + 9 + 7 + 6 + 2 + 1 = 3 ,
u( 3 ) = 36 + 33 + 27 + 18 + 6 + 1 = 9 .
So, u( 3 ) = u( )3 . As u( ) = 3 , we conclude that a single error occurred, at
digit three, i.e. u(X) is decoded by
c(X) = X 12 + X 11 + X 9 + X 7 + X 6 + X 3 + X 2 + 1
which is (X 2 + 1)g(X) as required.
Problem 3.4 Define the dual X of a linear [N, k] code of length N and dimension k with alphabet F. Prove or disprove that if X is a binary [N, (N 1)/2] code
with N odd then X is generated by a basis of X plus the word 1 . . . 1. Prove or
disprove that if a binary code X is self-dual, X = X , then N is even and the
word 1 . . . 1 belongs to X .
Prove that a binary self-dual linear [N, N/2] code X exists for each even N .
Conversely, prove that if a binary linear [N, k] code X is self-dual then k = N/2.
Give an example of a non-binary linear self-dual code.
Solution The dual X of the [N, k] linear code X is given by
X = {x = x1 . . . xN FN : x y = 0
for all y X }
347
1 0 0 0 0
0 1 0 0 0
X =
1 1 0 0 0 .
0 0 0 0 0
Then X is generated by
0 0 1 0 0
0 0 0 1 0 .
0 0 0 0 1
1 0 0 0 0 0 0 1 1 1 1 1
0 1 0 0 0 0 1 0 1 2 2 1
0 0 1 0 0 0 1 1 0 1 2 2
G=
0 0 0 1 0 0 1 2 1 0 1 2
0 0 0 0 1 0 1 2 2 1 0 1
0 0 0 0 0 1 1 1 2 2 1 0
Here rows of G are orthogonal (including self-orthogonal). Hence, X X .
But dim(X ) = dim(X ) = 6, so X = X .
Problem 3.5 Define a finite field Fq with q elements and prove that q must have
the form q = ps where p is a prime integer and s 1 a positive integer. Check that
p is the characteristic of Fq .
348
Prove that for any p and s as above there exists a finite field Fsp with ps elements,
and this field is unique up to isomorphism.
Prove that the set Fps of the non-zero elements of F ps is a cyclic group Z ps 1 .
Write the field table for F9 , identifying the powers i of a primitive element
F9 as vectors over F3 . Indicate all vectors in this table such that 4 = e.
Solution A field Fq with q elements is a set of cardinality q with two commutative
group operations, + and , with standard distributivity rules. It is easy to check that
char(Fq ) = p is a prime number. Then F p Fq and q = Fq = ps where s = [Fq : F p ]
is the dimension of Fq as a vector space over F p , a field of p elements.
Now, let Fq , the multiplicative group of non-zero elements from Fq , contain an
element of order q 1 = Fq . In fact, every b Fq has a finite order ord(b) = r(b);
set r0 = max[r(b) : b Fq ]. and fix a Fq with r(a) = r0 . Then r(b)|r0 for all
b Fq . Next, pick , a prime factor of r(b), and write r(b) = s , r0 = s . Let us
s
s
check that s s . Indeed, a has order , b order s and a b order s . Thus,
if s > s, we obtain an element of order > r0 . Hence, s s which holds for any
prime factor of r(b), and r(b)|r(a).
Then br(a) = e, for all b Fq , i.e. the polynomial X r0 e is divisible by (X b).
It must then be the product bFq (X b). Then r0 = Fq = q 1. Then Fq is a
cyclic group with generator a.
For each prime p and positive integer s there exists at most one field Fq with
q = ps , up to isomorphism. Indeed, if Fq and F q are two such fields then they both
are isomorphic to Spl(X q X), the splitting field of X q X (over F p , the basic
field).
The elements of F9 = F3 F3 with 4 = e are e = 01, 2 = 1 + 2 = 21,
4 = 02, 6 = 2 + = 12 where = 10.
Problem 3.6 Give the definition of a cyclic code of length N with alphabet Fq .
(N, Fq )-roots
What are the defining zeros of a cyclic code
why are they always
and
3s 1 3s 1
,
s, 3 code is equivalent
of unity? Prove that the ternary Hamming
2
2
to a cyclic code and identify the defining zeros of this cyclic code.
A sender uses the ternary [13, 10, 3] Hamming code, with field alphabet F3 =
{0, 1, 2} and the parity-check matrix H of the form
1 0 1 2 0 1 2 0 1 2 0 1 2
0 1 1 1 0 0 0 1 1 1 2 2 2 .
0 0 0 0 1 1 1 1 1 1 1 1 1
349
0mE
N
m
1
(q 1)
A
, with E =
B
d 1
,
2
(3.6.3)
shows that d = 3 and k = N l. So, the cyclic code with the parity-check matrix H
is equivalent to Hammings.
To decode the code in question, calculate the syndrome xH T = 2 0 2 = 2 (1 0 1)
indicating the error is in the 6th position. Hence, x 2e(6) = y + e(6) and the correct
word is c = 2 1 2 0 1 2 0 0 2 1 1 2 0.
Problem 3.7 Compute the rank and minimum distance of the cyclic code with
generator polynomial g(X) = X 3 +X +1 and parity-check polynomial h(X) = X 4 +
X 2 + X + 1. Now let be a root of g(X) in the field F8 . We receive the word
r(X) = X 5 + X 3 + X(mod X 7 1). Verify that r( ) = 4 , and hence decode r(X)
using minimum-distance decoding.
Solution A cyclic code X of length N has generator polynomial g(X) F2 [X]
and parity-check polynomial h(X) F2 [X] with g(X)h(X) = X N 1. Recall that
if g(X) has degree k, i.e. g(X) = a0 + a1 X + + ak X k where ak = 0, then
g(X), Xg(X), . . . , X nk1 g(X) form a basis for X . In particular, the rank of X
equals N k. In this question, k = 3 and rank(X ) = 4.
350
If h(X) = b0 + b1 X + + bNk X Nk
X is
...
b1
bNk bNk1
0
b
.
..
b
Nk
Nk1
..
..
..
0
.
.
.
...
0
b0
...
...
..
..
..
0
0
0
0
bNk bNk1 . . . b1 b0
;<
=
N
1 0 1 1 1 0 0
H = 0 1 0 1 1 1 0 .
0 0 1 0 1 1 1
no zero column
no codewords of weight 1
no repeated column no codewords of weight 2
Hence, d(X ) = 3. In fact, X is equivalent to Hammings original [7, 4] code.
Since g(X) F2 [X] is irreducible, the code X F72 = F2 [X] (X 7 1) is the
cyclic code defined by . The multiplicative cyclic group F8 Z
7 of non-zero
elements of field F8 is
0 = 1,
,
2,
3 = + 1,
4 = 2 + ,
5 = 3 + 2 = 2 + + 1,
6 = 3 + 2 + = 2 + 1,
7 = 3 + = 1.
Next, the value r( ) is
r( ) = + 3 + 5
= + ( + 1) + ( 2 + + 1)
= 2 +
= 4,
351
as required. Let c(X) = r(X)+X 4 mod(X 7 1). Then c( ) = 0, i.e. c(X) is a codeword. Since d(X ) = 3 the code is 1-error correcting. We just found a codeword
c(X) at distance 1 from r(X). Then r(X) is written as
c(X) = X + X 3 + X 4 + X 5 mod (X 7 1),
and should be decoded by c(X) under minimum-distance decoding.
Problem 3.8 If X is a linear [N, k] code, define its weight enumeration polynomial WX (s,t). Show that:
(a)
(b)
(c)
(d)
WX (1, 1) = 2k ,
WX (0, 1) = 1,
WX (1, 0) has value 0 or 1,
WX (s,t) = WX (t, s) if and only if WX (1, 0) = 1.
(3.6.4)
352
(3.6.7)
(3.6.8)
Problem 3.10 Let X be a linear code over F2 of length N and rank k and let Ai be
the number of words in X of weight i, i = 0, . . . , N . Define the weight enumerator
polynomial of X as
W (X , z) =
Ai zi .
0iN
(3.6.9)
[Hint: Consider g(u) = (1)uv zw(v) where w(v) denotes the weight of the
vFN
2
l1
(3.6.10)
l1
= 1 + (2l 1)z2 .
353
Solution The dual code X , of a linear code X with the generating matrix G and
the parity-check matrix H, is defined as a linear code with the generating matrix
H. If X is an [N, k] code, X is an [N, N k] code, and the parity-check matrix
for X is G.
Equivalently, X is the code which is formed by the linear subspace in FN2
orthogonal to X in the dot-product
"x, y# =
xi yi , x = x1 . . . xN , y = y1 . . . yN .
1iN
By definition,
W (X , z) =
zw(u) , W X , z =
zw(v) .
vX
uX
uX
(3.6.11)
zw(v) (1)"u,v#.
v
(3.6.12)
uX
Note that when v X , the sum (1)"u,v# = X . On the other hand, when
uX
v X then there exists u0 X such that "u0 , v# = 0 (i.e. "u0 , v# = 1). Hence, if
v X , then, with the change of variables u u + u0 , we obtain
uX
uX
(1)"u,v# = (1)"u,v#,
= (1)"u0 ,v#
uX
uX
(1)"u,v#
uX
(3.6.11) equals
1
X
vX
zw(v) X = W X , z .
(3.6.13)
zw(vi ) (1)ui vi
v1 ,...,vN 1iN
zw(a) (1)aui
1iN a=0,1
1iN
1 + z(1)ui .
(3.6.14)
354
Here w(a) = 0 for a = 0 and w(a) = 1 for a = 1. The RHS of (3.6.14) equals
(1 z)w(u) (1 + z)Nw(u) .
Hence, an alternative expression for (3.6.11) is
1
(1 + z)N
X
uX
1z
1+z
w(u)
=
1
1z
(1 + z)N W X ,
.
X
1+z
(3.6.15)
(3.6.16)
z)
(1
+
z)
i
dz X 0iN
z=1
1
N2N1 A1 (X )2N1 (only terms i = 0, 1 contribute)
=
X
2N N
(A1 (X ) = 0 as the code is at least 1-error correcting,
=
X 2
with distance 3).
Now take into account that
( X ) ( X ) = 2k 2Nk = 2N .
The equality
the average weight in X =
N
2
follows. The enumeration polynomial of the simplex code is obtained by substitution. In this case the average length is (2l 1)/2.
Problem 3.11 Describe the binary narrow-sense BCH code X of length 15 and
the designed distance 5 and find the generator polynomial. Decode the message
100000111000100.
355
Solution Take the binary narrow-sense BCH code X of length 15 and the designed
distance 5. We have Spl(X 15 1) = F24 = F16 . We know that X 4 + X + 1 is a
primitive polynomial over F16 . Let be a root of X 4 + X + 1. Then
M1 (X) = X 4 + X + 1, M3 (X) = X 4 + X 3 + X 2 + X + 1,
and the generator g(X) for X is
g(X) = M1 (X)M3 (X) = X 8 + X 7 + X 6 + X 4 + 1.
Take g(X) as example of a codeword. Introduce 2 errors at positions 4 and 12
by taking
u(X) = X 12 + X 8 + X 7 + X 6 + 1.
Using the field table for F16 , obtain
u1 = u( ) = 12 + 8 + 7 + 6 + 1 = 6
and
u3 = u( 3 ) = 36 + 24 + 18 + 1 = 9 + 3 + 1 = 4 .
As u1 = 0 and u31 = 18 = 3 = u3 , deduce that 2 errors occurred. Calculate the
locator polynomial
l(X) = 1 + 6 X + ( 13 + 12 )X 2 .
Substituting 1, , . . . , 14 into l(X), check that 3 and 11 are roots. This confirms
that, if exactly 2 errors occurred their positions are 4 and 12 then the codeword sent
was 100010111000000.
Problem 3.12 For a word x = x1 . . . xN FN2 the weight w(x) is the number
of non-zero digits: w(x) = {i : xi = 0}. For a linear [N, k] code X let Ai be the
number of words in X of weight i (0 i N). Define the weight enumerator
N
356
Problem 3.13 Let X be a binary linear [N, k, d] code, with the weight enumerator WX (s). Find expressions, in terms of WX (s), for the weight enumerators of:
(i) the subcode X ev X consisting of all codewords x X of even weight,
(ii) the parity-check extension X pc of X .
Prove that if d is even then there exists an [N, k, d] code where each codeword has
even weight.
Solution (i) All words with even weights from X belong to subcode X ev . Hence
ev
(s) =
WX
1
[WX (s) +WX (s)] .
2
357
Problem 3.16 Prove that the binary code of length 23 generated by the polynomial g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 has minimal distance 7, and is
perfect.
[Hint: If grev (X) = X 11 g(1/X) is the reversal of g(X) then
X 23 + 1 (X + 1)g(X)grev (X) mod 2.]
Solution First, show that the code is BCH, of designed distance 5. By the freshers
dream Lemma 3.1.5, if is a root of a polynomial f (X) F2 [X] then so is 2 .
Thus, if is a root of g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 then so are ,
2 , 4 , 8 , 16 , 9 , 18 , 13 , 3 , 6 , 12 . This yields the design sequence
358
359
Problem 3.17 Use the MacWilliams identity to prove that the weight distribution
of a q-ary MDS code of distance d is
N
j i
id+1 j
q
Ai =
(1)
1
i 0
j
jid
N
j i1
qid j , d i N.
=
(q 1) (1)
j
i
0 jid
[Hint: To begin the solution,
(a)
(b)
(c)
(d)
(e)
(3.6.18)
Use the fact that d(X ) = N k +1 and d(X ) = k +1 and obtain simplified equations involving ANk+1 , . . . , ANr only. Subsequently, determine ANk+1 , . . . , ANr .
Varying r, continue up to AN .]
Solution The MacWilliams identity is
1iN
i
A
i s =
Ni
1
i
A
(1
s)
.
1
+
(q
1)s
i
qk 1iN
r
N r
qk 0iNr
q 0ir
(the Leibniz rule (3.6.18) is used here). Formula (3.6.19) is the starting point. For
an MDS code, A0 = A
0 = 1, and
Ai = 0, 1 i N k (= d 1), A
i = 0, 1 i k (= d 1).
Then
1 N
1 Nr
1
N i
N
N 1
Ai = r
= r
,
+
r
r qk qk i=Nk+1
q N r
q r
360
i.e.
N
N i
(qkr 1).
Ai =
r
r
i=Nk+1
Nr
(3.6.20)
N
k1
+ ANk+2 =
(q2 1),
A
k2
k 2 Nk+1
361
Solution Write
i
N i
Kk (i) =
(1) j (q 1)k j .
j
k
j
0(i+kN) jki
Next:
(a) The following straightforward equation holds true:
i N
k N
Kk (i) = (q 1)
Ki (k)
(q 1)
i
k
(as all summands become insensitive to swapping i k).
N
N
For q = 2 this yields
Ki (k); in particular,
Kk (i) =
k
i
N
N
K0 (i) =
Ki (0) = Ki (0).
i
0
(b) Also, for q = 2: Kk (i) = (1)k Kk (N i) (again straightforward, after swapping
i i j).
N
N
K2i (k) which equals
(c) Thus, still for q = 2:
K (2i) =
k
2i k
N
N
2i
(1)
K (2i). That is,
K2i (N k) =
2i Nk
N k
Kk (2i) = KNk (2i).
Problem 3.19 What is an (n, Fq )-root of unity? Show that the set E(n,q) of the
(n, Fq )-roots of unity form a cyclic group. Check that the order of E(n,q) equals n if
n and q are co-prime. Find the minimal s such that E(n,q) Fqs .
Define a primitive (n, Fq )-root of unity. Determine the number of primitive
(n, Fq )-roots of unity when n and q are co-prime. If is a primitive (n, Fq )-root of
unity, find the minimal such that Fq .
Find representation of all elements of F9 as vectors over F3 . Find all (4, F9 )-roots
of unity as vectors over F3 .
Solution We know that any root of an irreducible polynomial of degree 2 over field
F3 = {0, 1, 2} belongs to F9 . Take the polynomial f (X) = X 2 + 1 and denote its
root by (any of the two). Then all elements of F9 may be represented as a0 + a1
where a0 , a1 F3 . In fact,
F9 = {0, 1, , 1 + , 2 + , 2 , 1 + 2 , 2 + 2 }.
362
= 1 + , 2 = 2 , 3 = 1 + 2 , 4 = 2,
5 = 2 + 2 , 6 = , 7 = 2 + , 8 = 1.
Hence, the roots of degree 4 are 2 , 4 , 6 , 8 .
Problem 3.20 Define a cyclic code of length N over the field Fq . Show that there
is a bijection between the cyclic codes of length N , and the factors of X N e in the
polynomial ring Fq [X].
Now consider binary cyclic codes. If N is an odd integer then we can find a finite
extension K of F2 that contains a primitive N th root of unity . Show that a cyclic
code of length N with defining set { , 2 , . . . , 1 } has minimum distance at
least . Show that if N = 2 1 and = 3 then we obtain the Hamming [2 1,
2 1 , 3] code.
Solution A linear code X FN
is a cyclic code if x1 . . . xN X implies that
q
x2 , . . . xN x1 X . Bijection of cyclic codes and factors of X N 1 can be established
as in Corollary 3.3.3.
Passing to binary codes, consider, for brevity, N = 7. Factorising in F72 renders
the decomposition
X 7 1 = (X 1)(X 3 + X + 1)(X 3 + X 2 + 1) := (X 1) f1 (X) f2 (X).
Suppose is a root of f1 (X). Since f1 (X)2 = f1 (X 2 ) in F2 [X] we have
f1 ( ) = f1 ( 2 ) = 0.
It follows that the cyclic code X with defining root has the generator polynomial
f1 (X) and the check polynomial (X 1) f2 (X) = X 4 + X 2 + X + 1. This property
363
si + s j )
si + s j = z
(3.6.22)
for any pair (z, z ) that may eventually occur as a syndrome under two errors.
A natural guess is to try a permutation that has some algebraic significance,
e.g. si = si si = (si )2 (a bad choice) or si = si si si = (si )3 (a good choice)
364
(1 . . . 00) (1 . . . 00)k
..
HT =
.
.
(1 . . . 11) (1 . . . 11)k
Then we have to deal with equations of the type
si + s j = z, ski + skj = z .
(3.6.23)
For solving (3.6.23), we need the field structure of the Hamming space, i.e. not
only multiplication but also division. Any field structure on the Hamming space
of length
N is isomorphic to F2N , and a concrete realisation of such a structure is
F2 [X] "c(X)#, a polynomial field modulo an irreducible polynomial c(X) of degree
N. Such a polynomial always exists: it is one of the primitive polynomials of degree
N. In fact, the simplest consistent system of the form (3.6.23) is
s + s = z, s3 + s = z ;
3
(3.6.24)
where s and s are words of length 4 (or equivalently their polynomials), and the
multiplication is mod 1 + X + X 4 . In the case of two errors it is guaranteed that
there is exactly one pair of solutions to (3.6.24), one vector occupying position i
and another position j, among the columns of the upper (Hamming) half of matrix
H. Moreover, (3.6.24) cannot have more than one pair of solutions because
z = s3 + s = (s + s )(s2 + ss + s ) = z(z2 + ss )
3
implies that
ss = z z1 + z2 .
(3.6.25)
365
Now (3.6.25) and the first equation in (3.6.24) give that s, s are precisely the roots
of a quadratic equation
X 2 + zX + z z1 + z2 = 0
(3.6.26)
(with z z1 + z2 = 0). But the polynomial in the LHS of (3.6.26) cannot have more
than two distinct roots (it could have no root or two coinciding roots, but it is
excluded by the assumption that there are precisely two errors). In the case of a
single error, we have z = z3 ; in this case s = z is the only root and we just find the
word z among the columns of the upper half of matrix H.
Summarising, the decoding scheme, in the case of the above [15, 7] code, is as
follows: Upon receiving word y, form a syndrome yH T = (z, z )T . Then
(i) If both z and z are zero words, conclude that no error occurred and decode y
by y itself.
(ii) If z = 0 and z3 = z , conclude that a single error occurred and find the location
of the error digit by identifying word z among the columns of the Hamming
check matrix.
(iii) If z = 0 and z3 = z , form the quadric (3.6.24), and if it has two distinct roots
s and s , conclude that two errors occurred and locate the error digits by identifying words s and s among the columns of the Hamming check matrix.
(iv) If z = 0 and z3 = z and quadric (3.6.26) has no roots, or if z is zero but z is
not, conclude that there are at least three errors.
Note that the case where z = 0, z3 = z and quadric (3.6.26) has a single root is
impossible: if (3.6.26) has a root, s say, then either another root s = s or z = 0 and
a single error occurs.
The decoding procedure allows us to detect, in some cases, that more than three
errors occurred. However, this procedure may lead to a wrong codeword when
three or more errors occur.
4
Further Topics from Information Theory
367
MAGC is particularly attractive because it allows one to do some handy and farreaching calculations with elegant answers.
However, Gaussian (and other continuously distributed) channels present a challenge that was absent in the case of finite alphabets considered in Chapter 1.
Namely, because codewords (or, using a slightly more appropriate term, codevectors) can a priori take values from a Euclidean space (as well as noise vectors),
the definition of the channel capacity has to be modified, by introducing a power
constraint. More generally, the value of capacity for a channel will depend upon
the so-called regional constraints which can generate analytic difficulties. In the
case of MAGC, the way was shown by Shannon, but it took some years to make
his analysis rigorous.
An input word of length N (designed to use the channel over N slots in succession) is identified with an input N-vector
x1
x(= x(N) ) = ... .
xN
We assume that xi R and hence x(N) RN (to make the notation shorter, the upper
index (N) will be often omitted).
In anadditive
channels an input vector x is transformed to a random vector
Y1
Y(N) = ... where Y = x + Z, or, component-wise,
YN
Y j = x j + Z j , 1 j N.
(4.1.1)
Z1
Z = ...
ZN
is a noise vector composed of random variables Z1 , . . . , ZN . Thus, the noise can be
characterised by a joint PDF f no (z) 0 where
z1
..
z= .
zN
368
no
Example 4.1.1
An additivechannel
is called Gaussian (an AGC, in short) if,
Z1
for each N, the noise vector ... is a multivariate normal; cf. PSE I, p. 114.
ZN
We assume from now on that the mean value EZ j = 0. Recall that the multivariate
normal distribution with the zero mean is completely determined by its covariance
matrix. More precisely, the joint PDF fZno(N) (z(N) ) for an AGC has the form
z1
1
1 T 1
..
N
(4.1.2)
1/2 exp 2 z z , z = . R .
N/2
(2 )
det
zN
Here is an N N matrix assumed
to be real, symmetric and strictly positive definite, with entries j j = E Z j Z j representing the covariance of noise random
variables Z j and Z j , 1 j, j N. (Real strict positive definiteness means that is
of the form BBT where B is an N N real invertible matrix; if is strictly positive
definite then has N mutually orthogonal eigenvectors, and all N eigenvalues of
are greater than 0.) In particular, each random variable Z j is normal: Z j N(0, 2j )
where 2j = EZ 2j coincides with the diagonal entry j j . (Due to strict positive definiteness, j j > 0 for all j = 1, . . . , N.)
If in addition the random variables Z1 , Z2 , . . . are IID, the channel is called memoryless Gaussian (MGC) or a channel with (additive) Gaussian white noise. In this
case matrix is diagonal: i j = 0 when i = j and ii > 0 when i = j. This is an
important model example (both educationally and practically) since it admits some
nice final formulas and serves as a basis for further generalisations.
Thus, an MGC has IID noise random variables Zi N(0, 2 ) where 2 =
independence is equivalent to decorreVarZi = EZi2 . For normal random
variables,
lation. That is, the equality E Z j Z j = 0 for all j, j = 1, . . . , N with j = j implies
that the components Z1 , . . . , ZN of the noise vector Z(N) are mutually independent.
This can be deduced from (4.1.2): if matrix has j j = 0 for j = j then is
diagonal, with det = j j , and the joint PDF in (4.1.2) decomposes into a
1 jN
369
Moreover, under the IID assumption, with j j 2 > 0, all random variables Z j
N(0, 2 ), and the noise distribution for an MGC is completely specified by a single
parameter > 0. More precisely, the joint PDF from (4.1.3) is rewritten as
N
1
1
exp 2 z2j .
2 1 jN
2
It is often convenient to think that an infinite random sequence Z
1 = {Z1 , Z2 , . . .}
(N)
is given, and the above noise vector Z is formed by the first N members of this
sequence. In the Gaussian case, Z
1 is called a random Gaussian process; with
EZ j
0, this
process
is determined, like before, by its covariance , with i j =
Cov Zi , Z j = E Zi Z j . The term white Gaussian noise distinguishes this model
from a more general model of a channel with coloured noise; see below.
Channels with continuously distributed noise are analysed by using a scheme
similar to the one adopted in the discrete case: in particular, if the channel is used
for transmitting one of M 2RN , R < 1, encoded messages, we need a codebook
that consists of M codewords of length N: xT (i) = (x1 (i), . . . , xN (i)), 1 i M:
x1 (M)
x1 (1)
(
)
xN (1)
xN (M)
The codebook is, of course, presumed to be known to both the sender and the
receiver. The transmission rate R is given by
R=
log2 M
.
N
(4.1.5)
Now suppose thata codevectorx(i) had been sent. Then the received random
x1 (i) + Z1
..
vector Y(= Y(i)) =
is decoded by using a chosen decoder d : y
.
xN (i) + ZN
d(y) XM,N . Geometrically, the decoder looks for the nearest codeword x(k),
relative to a certain distance (adapted to the decoder); for instance, if we choose to
use the Euclidean distance then vector Y is decoded by the codeword minimising
the sum of squares:
d(Y) = arg min
(4.1.6)
1 jN
when d(y) = x(i) we have an error. Luckily, the choice of a decoder is conveniently
resolved on the basis of the maximum-likelihood principle; see below.
370
There is an additional subtlety here: one assumes that, for an input word x to get a
chance of successful decoding, it should belong to a certain transmittable domain
in RN . For example, working with an MAGC, one imposes the power constraint
1
x2j
N 1
jN
(4.1.7)
where > 0 is a given constant. In the context of wireless transmission this means
that the amplitude square power per signal in an N-long input vector should be
bounded by , otherwise the result of transmission is treated as undecodable.
Geometrically, in order to perform decoding, the inputcodeword x(i) constituting
the codebook must lie inside the Euclidean ball BN2 ( N) of radius r = N
centred at 0 RN :
1/2
x1
..
(N)
2
x
r
B2 (r) = x = . :
.
j
1 jN
xN
The subscript 2 stresses that RN with the standard Euclidean distance is viewed as
a Hilbert 2 -space.
In fact, it is not required that the whole codebook XM,N lies in a decodable
domain; the agreement is only that if a codeword x(i) falls outside then it is decoded
wrongly with probability 1. Pictorially, the requirement is that most of codewords
lie within BN2 ((N )1/2 ) but not necessarily all of them. See Figure 4.1.
A reason for the regional constraint (4.1.7) is that otherwise the codewords
can be positioned in space at an arbitrarily large distance from each other, and,
eventually, every transmission rate would become reliable. (This would mean that
the capacity of the channel is infinite; although such channels should not be dismissed outright, in the context of an AGC the case of an infinite capacity seems
impractical.)
Typically, the decodable region D(N) RN is represented by a ball in RN , centred
at the origin, and specified relative to a particular distance in RN . Say, in the case
of exponentially distributed noise it is natural to select
x1
..
(N)
(N)
D = B1 (N ) = x = . : |x j | N
1 jN
xN
the ball in the 1 -metric. When an output-signal vector falling within distance r
from a codeword is decoded by this codeword, we have a correct decoding if (i)
the output signal falls in exactly one sphere around a codeword, (ii) the codeword
in question lies within D(N) , and (iii) this specific codeword was sent. We have
possibly an error when more than one codeword falls into the sphere.
371
Figure 4.1
(N)
(4.1.8)
As before, N = 1, 2, . . . indicates how many slots of the channel were used for transmission, and we will consider the limit N . Now assume that the distribution
(N)
(N)
Pch ( | x(N) ) is determined by a PMF fch (y(N) | x(N) ) relative to a fixed measure
(N) on RN :
(N)
Pch (Y(N)
A| x
(N)
)=
(N)
(4.1.9a)
(N) = (N times);
(4.1.9b)
for instance, (N) can be the Lebesgue measure on RN which is the product of
Lebesgue measures on R: dx(N) = dx1 dxN . In the discrete case where
digits xi represent letters from an input channel alphabet A (say, binary, with
A = {0, 1}), is the counting measure on A , assigning weight 1 to each symbol of the alphabet. Then (N) is the counting measure on A N , the set of all input
words of length N, assigning weight 1 to each such word.
372
fch (y j |x j ).
(4.1.10)
1 jN
Here fch (y|x) is the symbol-to-symbol channel PMF describing the impact of a
single use of the channel. For an MGC, fch (y|x) is a normal N(x, 2 ). In other
words, fch (y|x) gives the PDF of a random variable Y = x + Z where Z N(0, 2 )
represents the white noise affecting an individual input value x.
Next, we turn to a codebook XM,N , the image of a one-to-one map M RN
where M is a finite collection of messages (originally written in a message alphabet); cf. (4.1.4). As in the discrete case, the ML decoder dML decodes the received
(N)
word Y = y(N) by maximising fch (y| x) in the argument x = x(N) XM,N :
(N)
(4.1.11)
dML (y) = arg max fch (y| x) : x XM,N .
The case when maximiser is not unique will be treated as an error.
(N),
Another useful example is the joint typicality (JT) decoder dJT = dJT (see
below); it looks for the codeword x such that x and y lie in the -typical set TN :
dJT (y) = x if x XM,N and (x, y) TN .
(4.1.12)
The JT decoder is designed via a specific form of set TN for codes generated as
samples of a random code X M,N . Consequently, for given output vector yN and a
code XM,N , the decoded word dJT (y) XM,N may be not uniquely defined (or not
defined at all), again leading to an error. A general decoder should be understood
as a one-to-one map defined on a set K(N) RN taking points yN KN to points
x XM,N ; outside set K(N) it may be not defined correctly. The decodable region
K(N) is a part of the specification of decoder d (N) . In any case, we want to achieve
(N)
(N)
Pch d (N) (Y) = x|x sent = Pch Y K(N) |x sent
(N)
+ Pch Y K(N) , d(Y) = x|x sent 0
as N . In the case of an MGC, for any code XM,N , the ML decoder from
(4.1.6) is defined uniquely almost everywhere in RN (but does not necessarily give
the right answer).
We also require that the input vector x(N) D(N) RN and when x(N) D(N) ,
the result of transmission is rendered undecodable (regardless of the qualities of the
decoder used). Then the average probability of error, while using codebook XM,N
and decoder d (N) , is defined by
eav (XM,N , d (N) , D(N) ) =
1
e(x, d (N), D(N) ),
M xX
M,N
(4.1.13a)
373
(4.1.13b)
Here e(x, d (N) , D(N) ) is the probability of error when codeword x had been transmitted:
1,
x D(N) ,
(4.1.14)
e(x, d (N) , D(N) ) =
(N) (Y) = x|x , x D(N) .
P(N)
ch d
In (4.1.14) the order of the codewords in the codebook XM,N does not matter;
thus XM,N may be regarded simply as a set of M points in the Euclidean space RN .
Geometrically, we want the points of XM,N to be positioned so as to maximise the
chance of correct ML-decoding and lying, as a rule, within domain D(N) (which
again leads us to a sphere-packing problem).
To 3this end,
4 suppose that a number R > 0 is fixed, the size of the codebook XM,N :
M = 2NR . We want to define a reliable transmission rate as N in a fashion
similar to how it was done in Section 1.4.
Definition 4.1.2 Value R >30 is 4called a reliable transmission rate with regional
constraint D(N) if, with M = 2NR , there exist a sequence {XM,N } of codebooks
XM,N RN and a sequence {d (N) } of decoders d (N) : RN RN such that
lim eav (XM,N , d (N) , D(N) ) = 0.
(4.1.15)
Remark 4.1.3 It is easy to verify that a transmission rate R reliable in the sense of
average error-probability eav (XM,N , d (N) , D(N) ) is reliable for the maximum errorprobability emax (XM,N , d (N) , D(N) ). In fact, assume that R is reliable in the sense of
Definition 4.1.2, i.e. in the sense of the average error-probability. Take a sequence
{XM,N } of the corresponding codebooks with M = 2RN and a sequence {dN } of
(0)
the corresponding decoding rules. Divide each code XN into two halves, XN and
(1)
XN , by ordering the codewords in the non-decreasing order of their probabilities
(0)
of erroneous decoding and listing the first M (0) = M/2 codewords in XN and
(1)
(0)
the rest, M (1) = M M (0) , in XN . Then, for the sequence of codes {XM,N }:
(i) the information rate approaches the value R as N as
1
log M (0) R + O(N 1 );
N
(ii) the maximum error-probability, while using the decoding rule dN ,
1
M
(0)
Pemax XN , dN (1) Pe (x(N) , dN ) (1) Peav (XN , dN ) .
M
M
(1)
(N)
x
XN
374
C = sup R > 0 : R is reliable ;
(4.1.16)
it varies from channel to channel and with the shape of constraining domains.
It turns out (cf. Theorem 4.1.9 below) that for the MGC, under the average power
constraint threshold (see (4.1.7)), the channel capacity C( , 2 ) is given by the
following elegant expression:
C( , 2 ) =
1
log2 1 + 2 .
2
(4.1.17)
Furthermore, like in Section 1.4, the capacity C( , 2 ) is achieved by a sequence of random codings where codeword x(i) = (X1 (i), . . . , XN (i)) has IID
components X j (i) N(0, N ), j = 1, . . . , N, i = 1, . . . , M, with N 0 as
N . Although such random codings do not formally obey the constraint
(4.1.7) for finite N, it is violated with a vanishing
probability as N (since
1
2
X j (i) : 1 i M = 1 with a proper choice
lim supN P max
N 1
jN
of N ). Consequently, the average error-probability (4.1.13a) goes to 0 (of course,
for a random coding the error-probability becomes itself random).
Example 4.1.4 Next, we discuss an AGC with coloured Gaussian noise. Let a
codevector x = (x1 , . . . , xN ) have multi-dimensional entries
x j1
..
x j = . Rk ,
1 j N,
x jk
and the components Z j of the noise vector
Z1
Z = ...
ZN
375
Z j1
Z j = ... .
Z jk
N 1 jN
(4.1.18)
The formula for the capacity of an AGC with coloured noise is, not surprisingly,
more complicated. As Q = Q, matrices and Q may be simultaneously diagonalised. Let i and i , i = 1, . . . , k, be the eigenvalues of and Q, respectively
(corresponding to the same eigenvectors). Then
(l1 l )+
1
C( , Q, ) =
,
(4.1.19)
log2 1 +
2 1lk
l
1
where (l1 l )+ = max
,
0
. In other words, (l1 l )+ are the
l
l
1
eigenvalues of the matrix Q + representing the positive-definite part of the
Hermitian matrix Q1 . Next, = ( ) > 0 is determined from the condition
(4.1.20)
tr I Q + = .
The positive-definite part I Q + is in turn defined by
I Q + = + I Q +
where + is the orthoprojection (in Rk ) onto the subspace
spanned by
the eigenvectors of Q with eigenvalues l l < . In (4.1.20) tr I Q + 0 (since
tr AB 0 for all pair of positive-definite matrices), equals 0 for = 0 (as
376
I(X : Y ) = E log
(N)
I(X
(N )
:Y
) = E log
(4.1.21a)
Here fX(N) (x(N) ) and fY(N ) (y(N ) ) are the marginal PMFs for X(N) and Y(N ) (i.e.
joint PMFs for components of these vectors).
377
1,
x(N) D(N) ,
E(x(N) , D(N) ) =
(N) ) = x(N) |x(N) , x(N) D(N) ;
P(N)
ch dML (Y
cf. (4.1.14). Furthermore, we are interested in the expected value
(4.1.21b)
Next, given > 0, we can define the supremum of the mutual information per
signal (i.e. per a single use of the channel), over all input probability distributions
PX(N) with E (PX(N) , D(N) ) :
1
sup I(X(N) : Y(N) ) : E (PX(N) , D(N) ) ,
N
C = lim sup C ,N C = lim inf C .
C ,N =
(4.1.22)
(4.1.23)
We want to stress that the supremum in (4.1.22) should be taken over all probability distributions PX(N) of the input word X(N) with the property that the expected
error-probability is , regardless of whether these distributions are discrete or
continuous or mixed (contain both parts). This makes the correct evaluation of
CN, quite difficult. However, the limiting value C is more amenable, at least in
some important examples.
We are now in a position to prove the converse part of the Shannon second
coding theorem:
Theorem 4.1.5 (cf. Theorems 1.4.14 and
a channel given by
2.2.10.) Consider
(N)
a sequence of probability distributions Pch | x sent for the random output
words Y(N) and decodable domains D(N) . Then quantity C from (4.1.22), (4.1.23)
gives an upper bound for the capacity:
C C.
(4.1.24)
Proof Let R be a reliable transmission rate and {XM,N } be a sequence of codebooks with M = XM,N 2NR for which lim eav (XM,N , D(N) ) = 0. Consider the
N
(N)
pair (x, dML (y)) where (i) x = xeq is the random input word equidistributed over
XM,N , (ii) Y = Y(N) is the received word and (iii) dML (y) is the codeword guessed
while using the ML decoding rule dML after transmission. Words x and dML (Y) run
378
jointly over XM,N , i.e. have a discrete-type joint distribution. Then, by the generalised Fano inequality (1.2.23),
hdiscr (X|d(Y)) 1 + log(M 1)
XXM,N
1+
NR
Pch (dML (Y) = x|x sent)
M xX
M,N
(N)
1 + h(Xeq )
N
1
1
(N)
(N)
= I(Xeq : d(Y(N) )) + h(Xeq |d(Y(N) ))
N
N
1 (N) (N)
1
+ I(xeq : Y ) + N .
N
N
For any given > 0, for N sufficiently large, the average error-probability will
satisfy eav (XM,N , D(N) ) < . Consequently, R C ,N , for N large enough. (Because
eav (XM,N , D(N) ) gives a specific
the equidistribution over a codebook XM,N with
example of an input distribution PX(N) with E PX(N) , D(N) ) .) Thus, for all > 0,
R C , implying that the transition rate R C. Therefore, C C, as claimed.
R
The bound C C in (4.1.24) becomes exact (with C = C) in many interesting situations. Moreover, the expression for C simplifies in some cases of interest. For example, for an MAGC instead of maximising the mutual information I(X(N) : Y(N) )
for varying N it becomes possible to maximise I(X : Y ), the mutual information between single input and output signals subject to an appropriate constraint. Namely,
for an MAGC,
C = C = sup I(X : Y ) : EX 2 < .
(4.1.25a)
379
1 jN
h(Y j ) h(Z(N) )
h(Y j ) h(Z j ) .
(4.1.26)
1 jN
0
0
log2 e
1
log2 2 ( 2j + 2 ) +
EY 2
2
2( 2j + 2 ) j
1
log2 2 e( 2j + 2 ) ,
2
=
and consequently,
I(X j : Y j ) = h(Y j ) h(Z j )
log2 [2 e( 2j + 2 )] log2 (2 e 2 )
2j
= log2 1 + 2 ,
380
numbers, that lim PX(N) B(N) ( N ) = 1. Moreover, for any input probability
The bound
1 jN
N
jN
The Jensen inequality, applied to the concave function x log2 (1 + x), implies
2j
2j
1
1
1
log2 1 + 2 log2 1 +
2
2N 1
2
N 1
jN
jN
1
log2 1 + 2 .
2
After establishing Theorem 4.1.8, we will be able to deduce that the capacity
C( , 2 ) equals the RHS, confirming the answer in (4.1.17).
Example 4.1.7
repeated:
For the coloured Gaussian noise the bound from (4.1.26) can be
I(X(N) : Y(N) )
1 jN
Here we work with the mixed second-order moments for the random vectors of
input and output signals X j and Y j = X j + Z j :
1
2j .
N 1
jN
In this calculation we again made use of the fact that X j and Z j are independent
and the expected value EZ j = 0.
1
Next, as in the scalar case, I(X(N) : Y(N) ) does not exceed the difference
N
h(Y ) h(Z) where Z N(0, ) is the coloured noise vector and Y = X + Z is
a multivariate normal distribution maximising the differential entropy under the
trace restriction. Formally:
1
I(X(N) : Y(N) ) h( , Q, ) h(Z)
N
381
1
max log (2 )k e det(K + ) :
2
+
K positive-definite k k matrix with tr (QK) .
h( , Q, ) =
Write in the diagonal form = CCT where C is an orthogonal and the diagonal k k matrix formed by the eigenvalues of :
1 0 . . . 0
0 2 . . . 0
=
.
.
.
0 0
. 0
0
. . . k
1ik
eigenbasis), and B11 = 1 , . . . , Bkk = k are the eigenvalues of K. As before, assume Q = Q, then tr (B) = i i . So, we want to maximise the product
1ik
1ik
1ik
i i .
1ik
log(i + i ) +
1ik
i i
1ik
is maximised at
1
1
= i , i.e. i =
i , i = 1, . . . , k.
i + i
i
To satisfy the regional constraint, we take
1
i =
i
, i = 1, . . . , k,
i
+
and adjust > 0 so that
1ik
1
i i
= .
+
(4.1.28)
382
(4.1.29)
where the RHS comes from (4.1.28) with = 1/ . Again, we will show that the
capacity C( , Q, ) equals the last expression, confirming the answer in (4.1.19).
We now pass to the direct part of the second Shannon coding theorem for general
channels with regional restrictions. Although the statement of this theorem differs
from that of Theorems 1.4.15 and 2.2.1 only in the assumption of constraints upon
the codewords (and the proof below is a mere repetition of that of Theorem 1.4.15),
it is useful to put it in the formal context.
Theorem 4.1.8 Let a channel be specified by a sequence of conditional probabili
(N)
ties Pch | x(N) sent for the received word Y(N) and a sequence of decoding con
(N)
straints x(N) D(N) for the input vector. Suppose that probability Pch | x(N) sent
is given by a PMF fch (y(N) |x(N) sent) relative to a reference measure (N) . Given
c > 0, suppose that there exists a sequence of input probability distributions PX(N)
such that
(i) lim PX(N) (D(N) ) = 1,
N
(ii) the distribution PX(N) is given by a PMF fX(N) (x(N) ) relative to a reference
measure (N) ,
(iii) the following convergence in probability holds true: for all > 0,
N
limN PX(N) ,Y(N) T = 1,
(N) , Y(N) )
1
f
(N) ,Y(N) (x
x
c ,
TN = log+
N
fX(N) (x(N) ) fY(N) (Y(N) )
(4.1.30a)
where
(N)
(N)
(N) (N)
fX(N) ,Y(N) (x(N)
0 , y ) = fX(N) (x ) fch (y |x sent),
fY(N) (y(N) ) = fX(N) (xN ) fch (y(N) |x(N) sent) N dx(N) .
(4.1.30b)
383
(random) word YN = YN ( j) received, with the joint PMF fX(N) ,Y(N) as in (4.1.30b).
We take > 0 and decode YN by using joint typicality:
dJT (YN ) = xN (i) when xN (i) is the only vector among
xN (1), . . . , xN (M) such that (xN (i), YN ) TN .
Here set TN is specified in (4.1.30a).
Suppose a random vector xN ( j) has been sent. It is assumed that an error occurs
every time when
(i) xN ( j) D(N) , or
(ii) the pair (xN ( j), YN ) TN , or
(iii) (xN (i), YN ) TN for some i = j.
These possibilities do not exclude each other but if none of them occurs then
(a) xN ( j) D(N) and
(b) x( j) is the only word among xN (1), . . . , xN (M) with (xN ( j), YN ) TN .
Therefore, the JT decoder will return the correct result. Consider the average errorprobability
1
E( j, PN )
EM (PN ) =
M 1
jM
where E( j, PN ) is the probability that any of the above possibilities (i)(iii) occurs:
*
+ *
+
E( j, PN ) = P xN ( j) D(N) (xN ( j), YN ) TN
+
*
(xN (i), YN ) TN for some i = j
= E1 xN ( j) D(N)
+ E1 xN ( j) D(N) , dJT (YN ) = xN ( j) . (4.1.31)
The symbols P and E in (4.1.31) refer to (1) a collection of IID input vectors
xN (1), . . . , xN (M), and (2) the output vector YN related to xN ( j) by the action of
the channel. Consequently, YN is independent of vectors xN (i) with i = j. It is instructive to represent the corresponding probability distribution P as the Cartesian
product; e.g. for j = 1 we refer in (4.1.31) to
P = PxN (1),YN (1) PxN (2) PxN (M)
where PxN (1),YN (1) stands for the joint distribution of the input vector xN (1) and the
output vector YN (1), determined by the joint PMF
fxN (1),YN (1) (xN , yN ) = fxN (1) (xN ) fch (yN |xN sent).
384
Thanks to the condition that lim Px(N) (D(N) ) = 1, the first summand vanishes as
N
N
(x (i), YN ) TN = 2NR 1 P (xN (2), YN ) TN .
i=2
N
(x (i), YN ) TN 2N(Rc+3 )
i=2
Em (P ) = EPxN (1) PxN (M)
N
1
E( j)
M 1
jM
385
(N),
where e(x) = e(x, XM,N , D(N) , dJT ) is the error-probability for the input word x
in code XM,N , under the JT decoder and with regional constraint specified by D(N) :
1,
xN D(N) ,
e(x) =
(N),
Pch dJT
(YN ) = x|x sent , xN D(N) .
Hence, R is a reliable transmission rate in the sense of Definition 4.1.2. This completes the proof of Theorem 4.1.8.
We also have proved in passing the following result.
Theorem 4.1.9 Assume that the conditions of Theorem 4.1.5 hold true. Then,
for all R < C, there exists a sequence of codes XM,N of length N and size M 2RN
such that the maximum probability of error tends to 0 as N .
Example 4.1.10 Theorem 4.1.8 enables us to specify the expressions in (4.1.17)
and (4.1.19) as the true values of the corresponding capacities (under the ML rule):
for a scalar white noise of variance 2 , under an average input power constraint
x2j N ,
1 jN
C( , 2 ) =
1
log 1 + 2 ,
2
for a vector white noise with variances 2 = (12 , . . . , k2 ), under the constraint
xTj x j N ,
1 jN
C( , 2 ) =
1
( i2 )+
log
1
+
, where
2 1ik
i2
( i2 )+ = 2 ,
1ik
and for the coloured vector noise with a covariance matrix , under the constraint
xTj Qx j N ,
1 jN
(i1 i )+
1
,
C( , Q, ) =
log 1 +
2 1ik
i
where ( i i )+ = .
1ik
Explicitly, for a scalar white noise we take the random coding where the signals
X j (i), 1 j N, 1 i M = 2NR , are IID N(0, ). We have to check the
conditions of Theorem 4.1.5 in this case: as N ,
386
N =
1
P(X,Y )
.
log
N 1
P
X (X)PY (Y )
jM
Next, (ii): since pairs (X j ,Y j ) are IID, we apply the law of large numbers and
obtain that
P(X,Y )
= I(X1 : Y1 ).
N E log
PX (X)PY (Y )
But
I(X1 : Y1 ) = h(Y1 ) h(Y1 |X1 )
1
1
= log 2 e( + 2 ) log 2 e 2
2
2
1
C( , 2 ) as 0.
= log 1 +
2
2
Hence, the capacity equals C( , 2 ), as claimed. The case of coloured noise is
studied similarly.
Remark 4.1.11 Introducing a regional constraint described by a domain D does
not mean one has to secure that the whole code X should lie in D. To guarantee
that the error-probability Peav (X ) 0 we only have to secure that the majority
of codewords x(i) X belong to D when the codeword-length N .
Example 4.1.12
noise vector
Z1
Z = ...
ZN
387
has two-side exponential IID components Z j (2) Exp( ), with the PDF
1
fZ j (z) = e |z| , < z < ,
2
where Exp denotes the exponential distribution, > 0 and E|Z j | = 1/ (see PSE I,
Appendix). Again we will calculate the capacity under the ML rule and with a
regional constraint x(N) (N ) where
(
)
(N ) = x(N) RN : |x j | N .
1 jN
First, observe that if the random variable X has E|X| and the random variable
Z has E|Z| then E|X + Z| + . Next, we use the fact that a random variable
Y with PDF fY and E|Y | has the differential entropy
h(Y ) 2 + log2 ; with equality iff Y (2) Exp(1/ ).
In fact, as before, by Gibbs
h(Y ) =
0
0
1
fY (y)|y|dy + log
= 1 + log 2 + log
1
+ E|Y |
= 1+
1
2 + log2 ( j + 1 ) 2 + log2 ( )
N
1
= log2 1 + j
N
log2 1 + .
The same arguments as before establish that the RHS gives the capacity of the
channel.
388
M = 13
m=6
inf
= ln 7
2b
_
Figure 4.2
(4.1.32)
389
(4.1.33)
where the supremum is taken over all finite partitions and of intervals [A, A]
and [A b, A + b], and X and Y stand for the quantised versions of random
variables X and Y , respectively.
In other words, the input-signal distribution PX with
PX (A) = PX (A + 2b) = = PX (A 2b) = PX (A) =
1
m+1
(4.1.34)
maximises I(X : Y ) under assumptions (i) and (ii). We denote this distribution by
(A,A/m)
(bm,b)
, or, equivalently, PX
.
PX
However, if M = 2m, i.e. the number A of allowed signals is even, the calculation becomes more involved. Here, clearly, the uniform distribution U(A
b, A + b) for the output signal Y cannot be achieved. We have to maximise h(Y ) =
h(X + Z) within the class of piece-wise constant PDFs fY on [A b, A + b]; see
below.
Equal spacing in [A, A] is generated by points A/(2m 1), 3A/(2m
1), . . . , A; they are described by the formula (2k 1)A/(2m 1) for k =
1, . . . , m. These points divide the interval [A, A] into (2m 1) intervals of length
2A/(2m 1). With Z U(b, b) and A = b(m 1/2), we again have the outputsignal PDF fY (y) supported in [A b, A + b]:
pm /(2b),
for k = 1, . . . , m 1,
fY (y) =
p1 + p1 (2b), if b/2 y b/2,
for k = 1, . . . , m + 1,
p /(2b),
if b(m + 1/2) y b(m 1/2),
m
390
where
1
(2k 1)A
pk = pX b k
=P X =
, k = 1, . . . , m,
2
2m 1
stand for the input-signal probabilities. The entropy h(Y ) = h(X + Z) is written as
pk + pk+1 pk + pk+1 p1 + p1 p1 + p1
pm pm
ln
ln
ln
2
2b 1k<m
2
2b
2
2b
pk + pk+1 pk + pk+1 pm pm
ln
ln
.
2
2b
2
2b
m<k1
1km
L (PX ; ) reads
L (PX ; ) = 0, k = 1, . . . , m.
pk
Thus, we have m equations, with the same RHS:
pm (pm1 + pm )
2 + 2 = 0, (implies) pm (pm1 + pm ) = 4b2 e2 2 ,
4b2
(pk1 + pk )(pk + pk+1 )
2 + 2 = 0,
ln
4b2
(implies) (pk1 + pk )(pk + pk+1 ) = 4b2 e2 2 , 1 < k < m,
2p1 (p1 + p2 )
2 + 2 = 0 (implies) 2p1 (p1 + p2 ) = 4b2 e2 2 .
ln
4b2
ln
This yields
pm = pm1 + pm2 = = p3 + p2 = 2p1 ,
pm + pm1 = pm2 + pm3 = = p2 + p1 ,
K
for m even,
391
and
pm = pm1 + pm2 = = p2 + p1 ,
pm + pm1 = pm2 + pm3 = = p3 + p2 = 2p1 ,
K
for m odd.
1/(4b), A y 3A,
fY (y) = 1/(2b), A y A,
yielding Cinf = (ln 2)/2.
1/(4b), 3A y A,
For M = 4 (four input signals at A, A/3, A/3, A, with b = 2A/3): p1 = 1/6,
p2 = 1/3, and the maximising output-signal PDF is
392
whence
m+2
p1 ,
m
m2
p3 =
p1 ,
m
m+4
p4 =
p1 ,
m
m4
p1 ,
p5 =
m
..
.
2m 2
,
pm2 =
m
2
pm1 = ,
m
p2 =
(4.1.36)
pm = 2p1 ,
with p1 =
1
.
2(m + 1)
1
1
inf
ln 2.
and CA
= ln
2 4m(m + 1)
(4.1.37)
On the other hand, for a general odd m, the maximising input-signal distribution
PX has
m+1
,
2m(m + 1)
m1
,
p2 =
2m(m + 1)
m+3
p3 =
,
2m(m + 1)
m3
p4 =
,
2m(m + 1)
..
.
1
,
pm1 =
2m(m + 1)
m
.
pm =
m(m + 1)
p1 =
(4.1.38)
393
This yields the same answer for the maximum entropy and the restricted capacity:
1
1
h(Y ) = ln
2 4m(m + 1)b2
1
1
inf
ln 2.
and CA
= ln
2 4m(m + 1)
(4.1.39)
(4.1.40)
Then
1 p
1 2p
1
p
+ (A b)(1 2p) ln
hy(Y ) = Ap ln + (2b A)(1 p) ln
,
b
2b
2b
2b
(4.1.41)
and the equation dh(Y ) dp = 0 is equivalent to
pA = (1 p)2bA (1 2p)2(Ab) .
(4.1.42)
(4.1.43a)
394
We are interested in the solution lying in (0, 1/2) (in fact, in (1/3, 1/2)). For b =
3A/4, the equation becomes
pA = (1 p)A/2 (1 2p)A/2 ,
i.e.
whence p = (3 5) 2.
p2 = (1 p)(1 2p),
(4.1.43b)
Example 4.1.16 It is useful to look at the example where the noise random
variable Z has two components: discrete and continuous. To start with, one could
try the case where
fZ (z) = q0 + (1 q) (z; 2 ),
i.e. Z = 0 with probability q and Z N(0, 2 ) with probability 1 q (0, 1). (So,
1 q gives the total probability of error.) Here, we consider the case
fZ = q0 + (1 q)
1
1(|z| b),
2b
(4.1.44a)
p1 , p0 , p1 0, p1 + p0 + p1 = 1,
(4.1.44b)
where
with b = A and M = 3 (three signal levels in (A, A)). The input-signal entropy is
h(X) = h(p1 , p0 , p1 ) = p1 ln p1 p0 ln p0 p1 ln p1 .
The output-signal PMF has the form
1
fY (y) = q p1 A + p0 0 + p1 A + (1 q)
2b
p1 1(2A y 0) + p0 1(A y A) + p1 1(0 y 2A)
and its entropy h(Y ) (calculated relative to the reference measure on R, whose
absolutely continuous component coincides with the Lebesgue and discrete component assigns value 1 to points A, 0 and A) is given by
h(Y ) = q ln q (1 q) ln(1 q) qh(p1 , p0 , p1 )
p1 + p0
p1
+ p1 + p0 ln
(1 q)A p1 ln
2A
2A
p0 + p1
p1
+ p1 ln
+ p0 + p1 ln
.
2A
2A
395
p
1 p
(1q)A/q
.
If (1 q)A/q > 1 this equation yields a unique solution which defines an optimal
input-signal distribution PX of the form (4.1.44a)(4.1.44b).
If we wish to see what value of q yields the maximum of h(Y ) (and hence, the
maximum information capacity), we differentiate in q as well:
q
d
h(Y ) = 0 log
= (A 1)h(p, p, 1 2p) 2A ln 2A.
dq
1q
If we wish to consider a continuously distributed input signal on [A, A], with a
PDF fX (x), then the output random variable Y = X + Z has the PDF given by the
convolution:
0
1 (y+b)A
fX (x)dx.
fY (y) =
2b (yb)(A)
1
The differential entropy h(Y ) = fY (y) ln fY (y)dy, in terms of fX , takes the form
0 (x+z+b)A
0
0 b
1 A
1
h(X + Z) =
fX (x)
ln
fX (x )dx dzdx.
2b A
2b (x+zb)(A)
b
The PDF fX minimising the differential entropy h(X + Z) yields a solution to
0 b 0 (x+z+b)A
1
fX (x )dx + fX (x)
ln
0=
2b (x+zb)(A)
b
1
0 (x+z+b)A
fX (x )dx
fX (x + z + b) fX (x + z b) dz.
(x+zb)(A)
396
C = ln 10
2b
2b
Figure 4.3
with x A partition domain B (i.e. cover B but do not intersect each other) then,
for the input PMF Px with Px (x) = 1 ( A ) (a uniform distributionover A ), the
output-vector-signal PDF fY is uniform on B (that is, fY (y) = 1 area of B ).
Consequently, the output-signal entropy h(Y ) = ln area of B is attaining the
maximum over all input-signal PMFs Px with Px (A ) = 1 (and even attaining the
maximum over all input-signal PMFs Px with Px (B ) = 1 where B B is an
arbitrary subset with the property that x B S(x ) lies within B). Finally, the information capacity for the channel under consideration,
Cinf =
1 area of B
ln
nats/(scalar input signal).
2
4b2
1 area of D2
ln
nats/(scalar input signal),
2
4b2
of an additive channel with a uniform noise over (b, b), when the channel is used
two times per scalar input signal and the random vector input x = (X1 , X2 ) is subject
to the regional constraint x D2 . The maximising input-vector PMF assigns equal
probabilities to the centres of squares forming the partition.
A similar conclusion holds in R3 when the channel is used three times for every
input signal, i.e. the input signal is a three-dimensional vector x = (x1 , x2 , x3 ), and
so on. In general, when we use a K-dimensional input signal x = (x1 , . . . , xk ) RK ,
and the regional constraint is x DK RK where DK is a bounded domain that can
397
1 volume of DK
ln
nats/(scalar input signal)
K
(2b)K
(4.2.1)
(4.2.2)
Theorem 4.2.1 (ShannonMcMillanBreiman) For any stationary ergodic process X with finitely many values the information rate R = h, i.e. the limit in (4.2.3)
398
lim
(4.2.3)
The proof of Theorem 4.2.1 requires some auxiliary lemmas and is given at the
end of the section.
Worked Example 4.2.2 (A general asymptotic equipartition property) Given a
sequence of random variables
X1 , X2 , . . ., for all N = 1, 2, . . ., the distribution of
X1
..
N
the random vector x1 = . is determined by a PMF fxN (xN1 ) with respect to
1
XN
(N)
measure = (N factors). Suppose that the statement of the Shannon
McMillanBreiman theorem holds true:
1
log fxN (xN1 ) h in probability,
1
N
where h > 0 is a constant (typically, h = lim h(Xi )). Given > 0, consider the
i
typical set
x1
1
..
N
N
N
S = x1 = . : log fxN (x1 ) + h .
1
xN
0
SN
ties:
(4.2.4)
P(RN )
1=
RN
fxN (xN1 )
1
fxN (xN1 )
1
N(h+ )
1 jN
(dx j )
(dx j )
1 jN
SN
fxN (xN1 )
RN
1 jN
SN
1 jN
(4.2.5)
399
giving the upper bound (4.2.4). On the other hand, given > 0, we can take N
large so that P(SN ) 1 , in which case, for 0 < < h,
1 P(SN )
0
SN
fxN (xN1 )
1
N(h )
(dx j )
1 jN
SN 1 jN
X1
Y1
random vectors XN1 = ... and YN1 = ... which is determined by a (joint)
XN
YN
PMF fxN ,YN with respect to measure (N) (N) where (N) = and
1
(N) = (N factors in both products). Let fXN1 and fYN1 stand for the
(joint) PMFs of vectors XN1 and YN1 , respectively.
As in Worked Example 4.2.2, we suppose that the statements of the Shannon
McMillanBreiman theorem hold true, this time for the pair (XN1 , YN1 ) and each of
XN1 and YN1 : as N ,
1
1
log fXN (XN1 ) h1 , log fYN (YN1 ) h2 ,
1
1
N
N
in probability,
1
N
N
log fXN ,YN (X1 , Y1 ) h,
1
1
N
(4.2.6)
lim I(Xi : Yi ). Given > 0, consider the typical set formed by sample pairs (xN1 , yN1 )
where
x1
xN1 = ...
xN
400
and
y1
yN1 = ... .
yN
Formally,
%
1
= (xN1 , yN1 ) : log fxN (xN1 ) + h1 ,
1
N
1
log fYN (yN1 ) + h2 ,
1
N
K
1
N N
log fxN ,YN (x1 , y1 ) + h ; (4.2.7)
1
1
N
N
by the above assumption we have that lim P T = 1 for all > 0. Next, define
TN
(N)
(N)
N 0
T =
TN
Finally, consider an independent pair XN1 , YN1 where component XN1 has the same
PMF as XN1 and YN1 the same PMF as YN1 . That is, the joint PMF for XN1 and YN1
has the form
fXN ,YN (xN1 , yN1 ) = fXN (xN1 ) fYN (yN1 ).
(4.2.8)
Next, we assess the volume of set TN and then the probability that xN1 , YN1
1
TN .
Worked Example 4.2.3 (A general joint asymptotic equipartition property)
(I) The volume of the typical set has the following properties:
(N) (N) TN 2N(h+ ) , for all and N,
(4.2.9)
and, for all > 0 and 0 < < h, for N large enough, depending on ,
(N) (N) TN (1 )2N(h ) .
(II) For the independent pair XN1 , YN1 ,
P XN1 , YN1 TN 2N(h1 +h2 h3 ) , for all and N,
and, for all > 0, for N large enough, depending on ,
P XN1 , YN1 TN (1 )2N(h1 +h2 h+3 ) ,
for all .
(4.2.10)
(4.2.11)
(4.2.12)
401
Solution (I) Completely follows the proofs of (4.2.4) and (4.2.5) with integration
of fxN ,YN .
1
1
(II) For the probability P XN1 , YN1 TN we obtain (4.2.11) as follows:
P
0
XN1 , YN1 TN =
TN
by definition
0
TN
substituting (4.2.8)
2N(h1 ) 2N(h2 )
0
TN
(dxN1 ) (dyN1 )
according to (4.2.7)
2N(h1 ) 2N(h2 ) 2N(h+ ) = 2N(h1 +h2 h3 )
because of bound (4.2.9).
Finally, by reversing the inequalities in the last two lines, we can cast them as
2N(h1 + ) 2N(h2 + )
0
TN
(dxN1 ) (dyN1 )
according to (4.2.7)
(1 )2N(h1 + ) 2N(h2 + ) 2N(h ) = (1 )2N(h1 +h2 h+3 )
because of bound (4.2.10).
Formally, we assumed here that 0 < < h (since it was assumed in (4.2.10)), but
increasing only makes the factor 2N(h1 +h2 h+3 ) smaller. This proves bound
(4.2.12).
A more convenient (and formally a broader) extension of the asymptotic equipartition property is where we suppose that the statements of
the ShannonMcMillanBreiman
theorem
hold true directly for the ratio
N
N
N
N
fxN ,YN (X1 , Y1 ) fXN (X1 ) fYN (Y1 )) . That is,
1
(4.2.13)
402
where c > 0 is a constant. Recall that fXN ,YN represents the joint PMF while fXN
1
1
1
and fxN individual PMFs for the random input and output vectors xN and YN , with
1
;
(4.2.14)
TN = (XN1 , yN1 ) : log
N
fXN (xN1 ) fYN (yN1 )
1
1
N N
by assumption (4.2.13) we have that lim P X1 , Y1 TN = 1 for all > 0.
N
Again, we will consider an independent pair XN1 , YN1 where component XN1
has the same PMF as XN1 and YN1 the same PMF as YN1 .
Theorem 4.2.4 (Deviation from the joint asymptotic equipartition property)
Assume that property (4.2.13) holds true. For an independent pair XN1 , YN1 , the
probability that XN1 , YN1 TN obeys
P
XN1 , YN1
TN
(4.2.15)
(4.2.16)
TN
N(c )
N(c )
=2
TN
2N(c ) .
N N
X1 , Y1 TN
403
The first equality is by definition, the second step follows by substituting (4.2.8),
the third is by direct calculation, and the fourth because of the bound (4.2.14).
Finally, by reversing the inequalities in the last two lines, we obtain the bound
(4.2.16):
2N(c+ )
x(1)
X(i) with i C). Similarly, given a vector x = ... of values for x, denote by
x(n)
x(C) the argument {x(i) : i C} (the sub-column in x extracted by picking the rows
with i C). By the Gibbs inequality, for all partitions {C1 , . . . ,Cs } of set {1, . . . , n}
into non-empty disjoint subsets C1 , . . . ,Cs (with 1 s n), the integral
0
fx (x) log
fxn1 (x)
(dx( j)) 0.
fx(C1 ) (x(C1 )) . . . fx(Cs ) (x(Cs )) 1
jn
(4.2.17)
What is the partition for which the integral in (4.2.17) attains its maximum?
Solution The partition in question has s = n subsets, each consisting of a single
point. In fact, consider the partition of set {1, . . . , n} into single points; the corresponding integral equals
0
fx (x) log
fxn1 (x)
(dx( j)).
fXi (xi ) 1 jn
(4.2.18)
1in
Let {C1 , . . . ,Cs } be any partition of {1, . . . , n}. Multiply and divide the fraction
under the log by the product of joint PMFs fx(Cl ) (x(Cl )). Then the integral
1is
fx (x) log
fxn1 (x)
(dx( j)) + terms 0.
fx(Ci ) (x(Ci )) 1 jn
1is
404
Here x(C) = {X(i) : i C}, x(C) = {X(i) : i C}, and E I(x(C : Y )|x(C) stands
for the expectation of I(x(C : Y ) conditional on the value of x(C). Prove that this
sum does not depend on the choice of set C.
Solution Check that the expression in question equals I(x : Y ).
In Section 4.3 we need the following facts about parallel (or product) channels.
Worked Example 4.2.7 (Lemma A in [173]; see also [174]) Show that the
capacity of the product of r time-discrete Gaussian channels with parameters
( j , p( j) , 2j ) equals
j
p( j)
ln 1 +
.
(4.2.19)
C=
j 2j
1 jr 2
the case r = 2. For the direct part, assume that R < C1 +C2 and > 0 are given. For
sufficiently large we must find a code for the product channel with M = eR codewords and Pe < . Set = (C1 +C2 R)/2. Let X 1 and X 2 be codes for channels
1 and 2 respectively with M1 e(C1 ) and M2 e(C2 ) and error-probabilities
1
2
PeX , PeX /2. Construct a concatenation code X with codewords x = xk1 xl2
where xi X i , i = 1, 2. Then, for the product-channel under consideration, with
1
2
codes X 1 and X 2 , the error-probability PeX , X is decomposed as follows:
1
2
1
P error in channel 1 or 2| xk1 xl2 sent .
PeX , X =
M1 M2 1kM1 ,1lM2
By independence of the channels, PeX
direct part.
1, X 2
405
The proof of the inverse is more involved and we present only a sketch, referring
the interested reader to [174]. The idea is to apply the so-called list decoding:
suppose we have a code Y of size M and a decoding rule d = d Y . Next, given
that a vector y has been received at the output port of a channel, a list of L possible
Y , and the
code-vectors from Y has to be produced, by using a decoding rule d = dlist
decoding (based on rule d) is successful if the correct word is in the list. Then, for
the average error-probability Pe = PeY (d) over code Y , the following inequality is
satisfied:
Pe Pe ( d ) PeAV (L, d)
(4.2.20)
where the error-probability Pe (d) = PeY (d) refers to list decoding and PeAV (L, d) =
PeAV (Y , L, d) stands for the error-probability under decoding rule d averaged over
all subcodes in Y of size L.
Now, going back to the product-channel with marginal capacities C1 and C2 ,
choose R > C1 +C2 , set = (R C1 C2 )/2 and let the list size be L = eRL , with
RL = C2 + . Suppose we use a code Y of size eR with a decoding rule d and a
list decoder d with the list-size L. By using (4.2.20), write
Pe Pe (d)PeAV (eRL , d)
(4.2.21)
and use the facts that RL > C2 and the value PeAV (eRL , d) is bounded away from
zero. The assertion of the inverse part follows from the following observation discussed in Worked Example 4.2.8. Take R2 < R RL and consider subcodes L Y
of size L = eR2 . Suppose we choose subcode L at random, with equal probabilities. Let M2 = eR2 and PeY ,M2 (d) stand for the mean error-probability averaged
over all subcodes L Y of size L = eR2 . Then
Pe (d) PeY ,M2 (d) + ( )
(4.2.22)
where ( ) 0 as .
Worked Example 4.2.8 Let L = eRL and M = eR . We aim to show that if
R2 < R RL and M2 = eR2 then the following holds. Given a code X of size M ,
a decoding rule d and a list decoder d with list-size L, consider the mean errorprobability PeX ,M2 (d) averaged over the equidistributed subcodes S X of size
S = M2 . Then PeX ,M2 (d) and the list-error-probability PeX (d) satisfy
PeX (d) PeX ,M2 (d) + ( )
(4.2.23)
where ( ) 0 as .
Solution Let X , S and d be as above and suppose we use a list decoder d with
list-length L.
406
Given a subcode S X with M2 codewords, we will use the following decoding. Let L be the output of decoder d. If exactly one element x j S belongs
to L , the decoder for S will declare x j . Otherwise, it will pronounce an error.
Denote the decoder for S by d S . Thus, given that xk S was transmitted, the
resulting error-probability, under the above decoding rule, takes the form
Pek = p(L |xk )ES (L |xk )
L
C
DX ,M2
1 M2
p(L
|x
)E
(L
,
x
|x
)
j k
k
M2 k=1
L j=k
(4.2.25)
I JX ,M2
means the average over all selections of subcodes. As x j and xk
where
are chosen independently,
DX ,M2 C
DX ,M2 C
DX ,M2
C
= p(L |xk )
.
E2 (L , x j )
p(L |xk )E 2 (L , x j )
Next,
C
p(L |xk )
DX ,M2
xX
C
DX ,M2
1
L
p(L |x), E2 (L , x j |xk )
= ,
M
M
and we obtain
PeX ,M2 PeX (d) +
1
L
1 M2
p(L
|x)
M
M
M2 k=1
L xX
j=k
407
which implies
PeX ,M2 PeX (d) +
M2 L
.
M
(4.2.26)
(4.2.27)
1
1
)
= h(X0 |Xk
H (k) = E log p X0 |Xk
(4.2.28)
1
1
H = E log p X0 |X
).
= h(X0 |X
(4.2.29)
i=k
Set also
and
The proof is based on the following three results: Lemma 4.2.9 (the sandwich
lemma), Lemma 4.2.10 (a Markov approximation lemma) and Lemma 4.2.11 (a
no-gap lemma).
Lemma 4.2.9
Proof
(4.2.30)
p(X0n1 )
1
lim sup log
0 a.s.
1
p(X0n1 |X
)
n n
(4.2.31)
(k) n1
p(k) (X0n1 )
n1 p (x0 )
p(x
)
=
0
p(X0n1 )
p(x0n1 )
xn1 A
n
p(k) (x0n1 )
x0n1 An
= p(k) (A) 1.
408
1
Similarly, if Bn = Bn (X
) is a support event for pX n1 |X 1 (i.e. P(X0n1
0
1
) = 1), write
Bn |X
p(X0n1 )
p(x0n1 )
n1 1
p(x
|X
)
=
E
1
0
X
1
1
p(X0n1 |X
)
p(x0n1 |X
)
xn1 B
n
= EX 1
p(x0n1 )
x0n1 Bn
= EX 1 P(Bn ) 1.
(4.2.32)
a.s.
1
1
) H.
log p(X0n1 |X
n
(4.2.33)
1
1
) and f = log p(X0 |X
) into
Proof Substituting f = log p(X0 |Xk
Birkhoffs ergodic theorem (see for example Theorem 9.1 from [36]) yields
a.s.
1
1 n1
1
i1
) 0 + H (k)
log p(k) (X0n1 ) = log p(X0k1 ) log p(k) (Xi |Xik
n
n
n i=k
(4.2.34)
and
a.s.
1 n1
1
1
i1
) = log p(Xi |X
) H,
log p(X0n1 |X
n
n i=0
(4.2.35)
respectively.
So, by Lemmas 4.2.9 and 4.2.10,
1
1
1
1
lim log (k) n1 = H (k) ,
lim sup log
n1
p(X0 ) n n
p (X0 )
n n
(4.2.36)
409
and
1
1
1
1
lim inf log
lim log
= H, )
n1
n1 1
n n
n
n
p(X0 )
p(X0 |X )
which we rewrite as
1
1
H lim inf log p(X0n1 ) lim sup log p(X0n1 ) H (k) .
n
n
n
n
Lemma 4.2.11
(4.2.37)
(4.2.38)
As the set of values I is supposed to be finite, and the function p [0, 1] p log p
is bounded, the bounded convergence theorem gives that as k ,
H (k) = E
1
1
) log p(X0 = x0 |Xk
)
p(X0 = x0 |Xk
x0 I
x0 I
1
1
p(X0 = x0 |X
) log p(X0 = x0 |X
) = H.
410
The setting is as follows. Fix numbers , , p > 0 and assume that every seconds
a coder produces a real code-vector
x1
..
x= .
xn
where n = . All vectors x generated by the coder lie in a finite set X = Xn
Rn of cardinality M 2Rb = eRn (a codebook); sometimes we write, as before,
XM,n to stress the role of M and n. It is also convenient to list the code-vectors
from X as x(1), . . . , x(M) (in an arbitrary order) where
x1 (i)
x(i) = ... , 1 i M.
xn (i)
Code-vector x is then converted into a continuous-time signal
n
(4.3.1)
i=1
xi =
x(t)i (t)dt.
(4.3.2)
The instantaneous signal power at time t is associated with |x(t)|2 ; then the square1
norm ||x||2 = 0 |x(t)|2 dt = |xi |2 will represent the full energy of the signal in
1in
the interval [0, ]. The upper bound on the total energy spent on transmission takes
the form
||x||2 p , or x Bn ( p ).
(4.3.3)
(In the theory of waveguides, the dimension n is called the Nyquist number and the
value W = n/(2 ) /2 the bandwidth of the channel.)
The code-vector x(i) is sent through an additive channel, where the receiver gets
the (random) vector
Y1
..
(4.3.4)
Y = . where Yk = xk (i) + Zk , 1 k n.
Yn
411
From the start we declare that if x(i) X \ Bn ( p ), i.e. ||x(i)||2 > p , the
output signal vector Y is rendered non-decodable. In other words, the probability
of correctly decoding the output vector Y = x(i) + Z with ||x(i)||2 > p is taken to
be zero (regardless of the fact that the noise vector Z can be small and the output
vector Y close to x(i), with a positive probability).
Otherwise, i.e. when ||x(i)||2 p , the receiver applies, to the output vector Y,
a decoding rule d(= dn,X ), i.e. a map y K d(y) X where K Rn is a
decodable domain (where map d had been defined). In other words, if Y K
then vector Y is decoded as d(Y) X . Here, an error arises either if Y K or if
d(Y) = x(i) given that x(i) was sent. This leads to the following formula for the
probability of erroneously decoding the input code-vector x(i):
1,
Pe (i, d) =
Pch Y K or d(Y) = x(i)|x(i) sent ,
||x(i)||2 > p ,
||x(i)||2 p .
(4.3.5)
The average error-probability Pe = PeX ,av (d) for the code X is then defined by
Pe =
1
Pe (i, d).
M 1iM
(4.3.6)
Furthermore, we say that Rbit (or Rnat ) is a reliable transmission rate (for given
and p) if for all > 0 we can specify 0 ( ) > 0 such that for all > 0 ( )
there exists a codebook X of size X eRnat and a decoding rule d such that
Pe = PeX ,av (d) < . The channel capacity C is then defined as the supremum of all
reliable transmission rates, and the argument from Section 4.1 yields
C=
p
ln 1 +
(in nats);
2
2
cf. (4.1.17). Note that when , the RHS in (4.3.7) tends to p/(2 2 ).
(4.3.7)
412
.
..
...
..
0
.......
... ..
....... ... .. . . .. .. . .
T
. .. .. . .. . . .
..
..
. . . .. .... .
.
.
.
.
..
. .. . . . . .
... ... .
.. .
.. ..
.....
....
.
.
Figure 4.4
fi (t) =
413
(4.3.10)
(4.3.11)
1kn
Here
sinc (2Wt k) =
sin( s) , s = 0,
sinc(s) =
s R,
s
1,
s = 0,
(4.3.12)
F ( ) = (x)ei x dx, R.
(4.3.13)
1
( )ei x d .
F (x) =
2
(4.3.14)
(x) =
1
2
( )eix d .
(4.3.15)
1 (x)2 (x)dx =
1 ( )2 ( )d .
(4.3.16)
414
0.2
0.4
0.6
W=0.5
W=1
W=2
4
2
0
ts
2
4
Figure 4.5
415
Furthermore, the Fourier transform can be defined for generalised functions too;
see again [127]. In particular, the equations similar to (4.3.13)(4.3.14) for the
delta-function look like this:
0
0
1
i t
(t) =
e
d , 1 = (t)eit dt,
(4.3.17)
2
implying that the Fourier transform of the Dirac delta is ( ) 1. For the shifted
delta-function we obtain
0
1
k
=
t
eik /(2W ) ei t d .
2W
2
(4.3.18)
i t
d , 1[ , ] ( ) =
sinc(t)eit dt,
t, R1 (4.3.19)
(4.3.20)
416
Solution The shortest way to see this is to write the Fourier-decomposition (in
2 (R)) implied by (4.3.19):
1
2 W sinc (2Wt k) =
2 W
0 2 W
2 W
(4.3.21)
1 (| | 2 W )eik /(2W ) , k = 1, . . . , n,
2 W
are orthonormal. That is,
1
4 W
0 2 W
2 W
ei(kk ) /(2W ) d = kk
where
'
kk =
1,
k = k ,
0,
k = k ,
(4.3.22)
| fi (t)|2 dt,
(4.3.23)
and functions fi have been introduced in (4.3.10). Thus, the power constraint can
be written as
|| fi ||2 p /4 W = p0 .
(4.3.24)
In fact, the coefficients xk (i) coincide with the values fi (k/(2W )) of function fi
calculated at time points k/(2W ), k = 1, . . . , n; these points can be referred to as
sampling instances.
Thus, the input signal fi (t) develops in continuous time although it is completely
specified by its values fi (k/(2W )) = xk (i). Thus, if we think that different signals
are generated in disjoint time intervals (0, ), ( , 2 ), . . ., then, despite interference
caused by infinite tails of the function sinc(t), these signals are clearly identifiable
through their values at sampling instances.
The NyquistShannon assumption is that signal fi (t) is transformed in the channel into
g(t) = fi (t) + Z (t).
(4.3.25)
417
Here Z (t) is a stationary continuous-time Gaussian process with the zero mean
(EZ (t) 0) and the (auto-)correlation function
(4.3.26)
E Z (s)Z (t + s) = 202W sinc(2Wt), t, s R.
In particular, when t is a multiple of /W (i.e. point t coincides with a sampling
instance), the random variables Z (s) and Z (t + s) are independent. An equivalent
form of this condition is that the spectral density
( ) :=
eit E Z (0)Z (t) dt = 02 1 | | < 2 W .
(4.3.27)
(4.3.28)
1kn
418
1
Pe (i, d).
M 1iM
(4.3.30)
Value R(= Rnat ) is called a reliable transmission rate if, for all > 0, there exists
and a code X of size M eR such that Pe < .
Now fix a value (0, 1). The class A ( ,W, p0 ) = A ( ,W, p0 , ) is defined as
the set of functions f (t) such that
(i) f = D f where
D f (t) = f (t)1(|t| < /2), t R,
1
and f (t) has the Fourier transform eit f (t)dt vanishing for | | > 2 W ;
(ii) the ratio
|| f ||2
1 ;
|| f ||2
and
(iii) the norm || f ||2 p0 .
In other words, the transmittable signals f A ( ,W, p0 , ) are sharply localised in time and nearly band-limited in frequency.
419
As 0,
p0
202W
C( ) W ln 1 +
p0
.
1 02
p0
202W
(4.3.31)
(4.3.32)
Before going to (quite involved) technical detail, we will discuss some facts relevant to the product, or parallel combination, of r time-discrete Gaussian channels.
(In essence, this model was discussed at the end of Section 4.2.) Here, every time
units, the input signal is generated, which is an ordered collection of vectors
( j)
x1
+
* (1)
..
nj
(4.3.33)
x , . . . , x(r) where x( j) =
. R , 1 j r,
( j)
xn j
and n j = j with j being a given value (the speed of the digital production
from coder j). For each vector x( j) we consider a specific power constraint:
O O2
O ( j) O
Ox O p( j) , 1 j r.
The output signal is a collection of (random) vectors
( j)
Y
)
(
1.
( j)
( j)
( j)
.
Y(1) , . . . , Y(r) where Y( j) =
. and Yk = xk + Zk ,
( j)
Yn j
(4.3.34)
(4.3.35)
2
( j)
( j)
with Zk being IID random variables, Zk N 0, ( j) , 1 k n j , 1 j r.
420
A codebook X with information rate R, for the product-channel under consideration, is an array of M input signals,
(
x(1) (1), . . . , x(r) (1)
(1)
x (2), . . . , x(r) (2)
(4.3.36)
...
...
...
)
(1)
x (M), . . . , x(r) (M) ,
each of which has the same structure as in (4.3.33).*As before, a +
decoder d is a map
acting on a given set K of sample output signals y(1) , . . . , y(r) and taking these
signals to X .
As above, for i = 1, . . . , M,we define the error-probability
Pe (i, d) for code X
(1)
(r)
when sending an input signal x (i), . . . , x (i) :
2
Pe (i, d) = 1,
if x( j) (i) p( j) for some j = 1, . . . , r,
and
+
Y(1) , . . . , Y(r) K or
+
+ *
*
d Y(1) , . . . , Y(r) = x(1) (i), . . .
, x(r) (i)
*
+
| x(1) (i), . . . , x(r) (i) sent ,
2
if x( j) (i) < p( j) for all j = 1, . . . , r.
Pe (i, d) = Pch
The average error-probability Pe = PeX ,av (d) for code X (while using decoder d)
is then again given by
1
Pe =
Pe (i, d).
M 1iM
As usual, R is said to be a reliable transmission rate if for all > 0 there exists a
0 > 0 such that for all > 0 there exists a code X of cardinality M eR and a
decoding rule d such that Pe < . The capacity of the combined channel is again defined as the supremum of all reliable transmission rates. In Worked Example 4.2.7
the following fact has been established (cf. Lemma A in [173]; see also [174]).
Lemma 4.3.4
421
(4.3.38a)
(4.3.38b)
(4.3.38c)
x( j) 2 < p0
(4.3.39a)
1 j3
and
x(3) 2
x( j) 2 .
(4.3.39b)
1 j3
and
x(1) 2 < p0
(4.3.40a)
x(2) 2 < x(1) 2 + x(2) 2 .
(4.3.40b)
Worked Example 4.3.5 (cf. Theorem 1 in [173]). We want to prove that the
capacities of the above combined parallel channels of types IIII are as follows.
Case I, 1 2 :
(1 )p0
1
2
p0
+
C=
ln 1 +
ln 1 +
2
2
1 02
2 02
where
= min ,
2
.
1 + 2
(4.3.41a)
(4.3.41b)
422
(4.3.41c)
This means that the best transmission rate is attained when one puts as much energy into channel 2 as is allowed by (4.3.38b).
Case II:
Case III:
(1 )p0
1
ln 1 +
C=
2
(1 + 2 )12
2
p
(1 )p0
ln 1 +
+
+ 2.
2
2
( 1 + 2 ) 1
23
p0
1
p0
ln 1 +
C=
.
+
2
2
1 0
2(1 )02
(4.3.42)
(4.3.43)
Solution We present the proof for Case I only. For definiteness, assume that 1 <
2 . First, the direct part. With p1 = (1 )p0 , p2 = p0 , consider the parallel
combination of two channels, with individual power constraints on the input signals
x(1) and x(2) :
O O2
O O2
O (1) O
O O
(4.3.44a)
Ox O p1 , Ox(2) O p2 .
Of course, (4.3.44a) implies (4.3.38a). Next, with , condition (4.3.38b) also
holds true. Then, according to the direct part of Lemma 4.3.4, any rate R with
R < C1 (p1 ) +C2 (p2 ) is reliable. Here and below,
q
ln 1 +
(4.3.44b)
C (q) =
, = 1, 2.
2
02
This implies the direct part.
A longer argument is needed to prove the inverse. Set C = C1 (p1 ) + C2 (p2 ).
The aim is to show that any rate R > C is not reliable. Assume the opposite: there
exists such a reliable R = C + ; let us recall what it formally means. There exists
a sequence of values (l) and (a) a sequence of codes
(
(
)
)
X (l) = x(i) = x(1) (i), x(2) (i) , 1 i M (l)
423
+
of size M (l) e
composed of combined code-vectors x(i) = x(1) (i), x(2) (i)
with square-norms x(i)2 = x(1) (i)2 + x(2) (i)2 , and (b) a sequence of decoding maps d (l) : y K(l) d (l) (y) X (l) such that Pe 0. Here, as before,
R (l)
Pe = PeX
(l) ,av
1
Pe (i, d (l))
M (l) 1iM(l)
2
(l)
(2)
2
2
x1 (i)
..
R1 (l)
x(1) (i) =
.
(1)
x (l) (i)
1
(1)
x1 (i)
..
R2 (l)
and x(2) (i) =
.
(2)
x (l) (i)
2
are sent through their respective parts of the parallel-channel combination, which
results in output vectors
(1)
(2)
Y1 (i)
Y1 (i)
..
..
R1 (l) , Y(2) =
R2 (l)
Y(1) =
.
.
(1)
(2)
Y (l) (i)
Y (l) (i)
1
*
+
forming the combined output signal Y = Y(1) , Y(2) . The entries of vectors Y(1)
and Y(2) are sums
(1)
Yj
(1)
where Z j
(2)
and Zk
(1)
(1)
(2)
= x j (i) + Z j , Yk
(2)
(2)
= xk (i) + Zk ,
(2)
424
p0
p0
(2)
<
(xk )2 j
.
J0
J0
1k (l)
(4.3.45a)
1
ln J0 .
(l)
(4.3.45b)
(l)
On the other hand, the maximum error-probability for subcode X is not larger
than for the whole code X (l) (when using the same decoder d (l) ); consequently,
(l)
(l)
the error-probability PeX ,av d (l) Pe 0.
Having a fixed number J0 of classes in the partition of X (l) , we can find at
least one j0 {1, . . . , J0 } such that, for infinitely many l, the most numerous class
(l)
(l)
X coincides with X j . Reducing our argument to those l, we may assume that
(l)
(l)
(l)
X = X j0 for all l. Then, for all x(1) , x(2) X , with
(i)
x1
..
x(i) =
. , i = 1, 2,
(i)
xni
using (4.3.38a) and (4.3.45a)
O O2
O O2
( j0 1)
j0
O (1) O
(l) O (2) O
,
p0 (l) .
p
x
Ox O
O O
0
J0
J0
(
)
(l)
That is,
X , d (l)
is a coder/decoder sequence for the standard parallelchannel combination (cf. (4.3.34)), with
( j0 1)
j0
p0 .
p1 = 1
p0 and p2 =
J0
J0
(l)
As the error-probability PeX ,av d (l) 0, rate R is reliable for this combination
of channels. Hence, this rate does not surpass the capacity:
( j0 1)
j0
1
p0 +C2
p0 .
R C1
J0
J0
425
Here and below we refer to the definition of Ci (u) given in (4.3.44b), i.e.
R C1 ((1 )p0 ) +C2 ( p0 ) +
(4.3.46)
where = j0 /J0 .
Now note that, for 2 1 , the function
(4.3.47)
0 1 (t), 2 (t), . . ., of a variable t R, belonging to the Hilbert space 2 (R) (i.e. with
n (t)2 dt < ), called prolate spheroidal wave functions (PSWFs), such that
n ( ) =
(a) The Fourier transforms
over, the functions n (t) form an orthonormal basis in the Hilbert subspace
formed by functions from 2 (R) with this property.
(b) The functions n (t) := n (t)1(|t| < /2) (the restrictions of n (t) to
( /2, /2)) are pairwise orthogonal:
0
n (t)n (t)dt
0 /2
/2
(4.3.48a)
0 /2
/2
0 /2
/2
n (s) sinc 2W (t s) ds.
(4.3.48b)
426
That is, functions n (t)0 are the eigenfunctions, with the eigenvalues n , of the
integral operator
K(t, s) = 1(|s| < /2)(2W ) sinc 2W (t s)
sin(2 W (t s))
, /2 s;t /2.
= 1(|s| < /2)
(t s)
(d) The eigenvalues n satisfy the condition
n =
0 /2
/2
An equivalent formulation
can be given in terms involving the Fourier trans0
forms [Fn ] ( ) =
1
2
n (t)eit dt:
0 2 W
2 W
| [Fn ] ( ) |2 d
0
/2
/2
|n (t)|2 dt = n ,
which means that n gives a frequency concentration for the truncated function n .
(e) It can be checked that functions n (t) (and hence numbers n ) depend on W
and through the product W only. Moreover, for all (0, 1), as W ,
(4.3.48c)
That is, for large, nearly 2W of values n are close to 1 and the rest are close
to 0.
An n (t),
(4.3.49)
n1
where 1 (t), 2 (t), . . . are the PSWFs discussed in Worked Example 4.3.9 below
and A1 , A2 , . . . are IID random variables with An N(0, n ) where
n are the corresponding eigenvalues. Equivalently, one writes Z(t) = n1 n n n (t) where
n N(0, 1) IID random variables.
The proof of this fact goes beyond the scope of this book, and the interested
reader is referred to [38] or [103], p. 144.
427
The idea of the proof of Theorem 4.3.3 is as follows. Given W and , an input
signal s (t) from A ( ,W, p0 , ) is written as a Fourier series in the PSWFs n .
In this series, the first 2W summands represent the part of the signal confined
between the frequency band-limits 2 W and the time-limits /2. Similarly, the
noise realisation Z(t) is decomposed in a series in functions n . The action of the
continuous-time channel is then represented in terms of a parallel combination of
two jointly constrained discrete-time Gaussian channels. Channel 1 deals with the
first 2W PSWFs in the signal decomposition and has 1 = 2W . Channel 2 receives
the rest of the expansion and has 2 = +. The power constraint s2 p0 leads
to a joint constraint, as in (4.3.38a). In addition, a requirement emerges that the
energy allocated outside the frequency band-limits 2 W or time-limits /2 is
small: this results in another power constraint, as in (4.3.38b). Applying Worked
Example 4.3.5 for Case I results in the assertion of Theorem 4.3.3.
To make these ideas precise, we first derive Theorem 4.3.7 which gives an alternative approach to the NyquistShannon formula (more complex in formulation
but somewhat simpler in the (still quite lengthy) proof).
Theorem 4.3.7 Consider the following modification of the model from Theorem
4.3.3. The set of allowable signals A2 ( ,W, p0 , ) consists of functions t R
s(t) such that
(1) s2 =
|s(t)|2 dt p0 ,
(2) the Fourier transform [Fs]( ) = s(t)eit dt vanishes when | | > 2 W , and
0 /2
(3) the ratio
|s(t)|2 dt s2 > 1 . That is, the functions s
/2
The noise process is Gaussian, with the spectral density vanishing when | | > 2W
and equal to 02 for | | 2W .
Then the capacity of such a channel is given by
p0
p0
+ 2.
C = C = W ln 1 + (1 ) 2
(4.3.50)
20 W
20
As 0,
C W ln 1 +
p0
202W
428
(4.3.51)
and take (0, 1) and (0, min [ , 1 ]) such that R is still less than
( )p0
(1 + )p0
+
.
(4.3.52)
C = W (1 ) ln 1 +
2
20 W (1 )
202
According to Worked Example 4.3.5, C is the capacity of a jointly constrained
discrete-time pair of parallel channels as in Case I, with
1 = 2W (1 ), 2 = +, = , p = p0 , 2 = 02 ;
(4.3.53)
cf. (4.3.41a). We want to construct codes and decoding rules for the timecontinuous version of the channel,
asymptotically vanishing probability
(1) (2)yielding
is an allowable input signal for the parallel
of error as . Assume x , x
pair of discrete-time channels with parameters given in (4.3.53). The input for the
time-continuous channel is the following series of (W, ) PSWFs:
s(t) =
1k1
(1)
xk k (t) +
(2)
xk k+1 (t).
(4.3.54)
1k<
The first fact to verify is that the signal in (4.3.54) belongs to A2 ( ,W, p0 , ), i.e.
satisfies conditions (1)(3) of Theorem 4.3.7.
To check property (1), write
2
2 O O2 O O2
O O O O
(1)
(2)
+
x
s2 =
xk = Ox(1) O + Ox(2) O p0 .
k
1k1
1k<
Next, the signal s(t) is band-limited, inheriting this property from the PSWFs
k (t). Thus, (2) holds true.
A more involved argument is needed to establish property (3). Because the
PSWFs k (t) are orthogonal in 2 [ /2, /2] (cf. (4.3.48a)), and using the monotonicity of the values n (cf. (4.3.48b)), we have that
0 /2
(1 D )s||2
2
|s(t)| dt s2 =
1
||s||2
/2
2
2
(1)
(2)
(1 k ) xk
(1 k+1 ) xk
=
OO (1) OO2 OO (2) OO2 + OO (1) OO2 OO (2) OO2
+ x
+ x
x
1k<
1k1 x
O (1) O2
O (2) O2
Ox O
Ox O
1 1 O O2 O O2 + O O2 O O2 .
Ox(1) O + Ox(2) O
Ox(1) O + Ox(2) O
429
(1)
1k1
Zk k (t) +
(2)
Zk k+1 (t).
(4.3.55)
1k<
( j)
Here again, k (t) are the PSWFs and IID random variables Zk N(0, k ). Correspondingly, the output signal is written as
Y (t) =
(1)
1k1
Yk k (t) +
(2)
Yk k+1 (t)
(4.3.56)
1k<
where
( j)
Yk
( j)
( j)
= xk + Zk , j = 1, 2, k 1.
(4.3.57)
So, the continuous-time channel is equivalent to a jointly constrained parallel combination. As we checked, the capacity equals C specified in (4.3.52). Thus, for
R < C we can construct codes of rate R and decoding rules such that the errorprobability tends to 0.
For the converse, assume that there exists a sequence (l) , a sequence of
(l)
transmissible domains A2 ( (l) ,W, p0 , (l) ) described in (1)(3) and a sequence
(l)
of codes X (l) of size M = eR where
p0
(1 )p0
+ 2 .
R > W ln 1 +
2
2W 0
0
As usual, we want to show that the error-probability PeX ,av (d (l) ) does not tend to
0.
As before, we take > 0 and (0, 1 ) to ensure that R > C where
p0
(1 )
p0
+
.
C = W (1 + ) ln 1 +
2
(1 ) 2W 0 (1 + )
(1 )02
(l)
430
Then, as in the argument on the direct half, C is the capacity of the type I jointly
constrained parallel combination of channels with
=
, 2 = 02 , p = p0 , 1 = 2W (1 + ), 2 = +.
(4.3.58)
1
(l)
1k1 (l)
(1)
xk k (t) +
1k<
(2)
(4.3.59)
We want to show that the discrete-time signal x = x(1) , x(2) represents an allowable input to the type I jointly constrained parallel combination specified in
(4.3.38ac). By orthogonality of PSWFs k (t) in 2 (R) we can write
x2 = ||s||2 p0 (l)
ensuring that condition (4.3.38a) is satisfied. Further, using orthogonality of PSW
functions k (t) in 2 ( /2, /2) and the fact that the eigenvalues k decrease
monotonically, we obtain that
0 (l) /2
(1 D (l) ) s2
|s(t)|2 dt s2 =
1
||s||2
(l) /2
2
2
(1)
(2)
(1 k ) xk
1 k+1 (l) xk
=
+
x2
x2
1k1 (l)
1k<
O
O
2
Ox(2) O
1 1 (l)
.
x2
By virtue of (4.3.48c), 1 (l) for l large enough. Moreover, since 1
0 (l)
/2
(l) /2
|s(t)|2 dt
x
431
1 = 2W (1 ), 2 = +, p = p0 , 2 = 02 , = ,
(4.3.60)
where (0, 1) (cf. property (e) of PSWFs in Example 4.3.6) and (0, ) are
auxiliary values.
For the converse half we use the decomposition into two parallel channels, again
as in Case III, with
1 = 2W (1 + ), 2 = +, p = p0 , 2 = 02 , =
.
1
(4.3.61)
Here, as before, value (0, 1) emerges from property (e) of PSWFs, whereas
value (0, 1).
eit f (t)dt
vanishes for | | > 2W . Then, for all x R, function f can be uniquely reconstructed from its values f (x + n/(2W )) calculated at points x + n/(2W ), where
n = 0, 1, 2. More precisely, for all t R,
n
sin [2 (Wt n)]
.
(4.3.62)
f (t) = f
2W
2 (Wt n)
nZ1
By the famous uncertainty principle of quantum
Worked Example 4.3.9
physics, a function and its Fourier transform cannot be localised simultaneously
in finite intervals [ , ] and [2W, 2W ]. What could be said about the case
when both function and its Fourier transform are nearly localised? How can we
quantify the uncertainty in this case?
Solution Assume the function f 2 (R) and let f = F f L2 (R) be the Fourier
transform of f . (Recall that space 2 (R) consists of functions f on R with || f ||2 =
432
1
| f (t)|2 dt < + and that for all f , g 2 (R), the inner product
finite.) We shall see that if
0
0 t0 + /2
2
| f (t)| dt
| f (t)|2 dt = 2
t0 /2
and
0 2 W
2 W
f (t)g(t)dt is
|F f ( )| d
0
|F f ( )|2 d = 2
(4.3.63)
(4.3.64)
1
2
0 2 W
2 W
F f ( )ei t d =
f (s)
sin 2 W (t s)
ds.
t s
(4.3.65)
(4.3.66)
0 /2
/2
f (s)
sin 2 W (t s)
ds;
t s
(4.3.67)
see Example 4.3.6. The eigenvalues n of A obey 1 > 0 > 1 > and tend to zero
as n ; see [91]. We are interested in the eigenvalue 0 : it can be shown that 0
is a function of the product W . In fact, the eigenfunctions ( j ) of (4.3.67) yield an
orthonormal basis in 2 (R); at the same time these functions form an orthogonal
basis in 2 [ /2, /2]:
0 /2
/2
j (t)i (t)dt = i i j .
433
||D f ||
.
|| f ||
(4.3.70)
1
Indeed,
expand
f
=
f
D
f
+
D
f
and
observe
that
the
integral
f (t)
D f (t) g(t)dt = 0 (since the supports of g and f D f are disjoint). This implies
that
0
0
0
Re f (t)g(t)dt f (t)g(t)dt = D f (t)g(t)dt .
Hence,
1
Re
|| f ||||g||
f (t)g(t)dt
||D f ||
|| f ||
the formula
cos1
||D f ||
= cos1
|| f ||
n |an |2 n
n |an |2
1/2
.
(4.3.71)
= 0 and0 < 1;
0 < < 0 < 1 and 0 1;
1 + cos1 cos1 ;
0 < 1 and cos
0
= 1 and 0 < 0 .
Proof Given [0, 1], let G ( ) be the family of functions f L2 with norms
|| f || = 1 and ||D f || = . Next, determine ( ) := sup f G ( ) ||B f ||.
(a) If = 0, the family G (0) can contain no function with = B f = 1. Furthermore, if D f = 0 and B f = 1 for f B then f is analytic and f (t) = 0 for
|t| < /2, implying f 0. To show that G (0) contains functions with all values of
n Dn
[0, 1), we set fn =
. Then the norm ||B fn || = 1 n . Since there
1 n
434
1 n since
||Be
ipt
f || =
0
p+ W
p W
1/2
|Fn ( | d
2
,
f=
0 n
for n large when the eigenvalue n is close to 0. We have that f B, || f || = ||B f || =
1, while a simple computation shows that ||D f || = . This includes the case = 1
as, by choosing eipt f (t) appropriately, we can obtain any 0 < < 1.
(4.3.72)
with g orthogonal to both D f and B f . Taking the inner product of the sum in the
RHS of (4.3.72), subsequently, with f , D f , B f and g we obtain four equations:
1 = a1 2 + a2 2 +
2 = a1 2 + a2
0
2 = a1
g(t) f (t)dt,
B f (t)Dg(t)dt,
D f (t)B f (t)dt + a2 2 ,
f (t)g(t)dt = g2 .
2 + 2 1 + ||g||2 = a1
D f (t)B f (t)dt + a2
B f (t)D f (t)dt.
435
2 =
1 2 ||g||2
( 2
B f (t)D f (t)dt)
+ 1
which is equivalent to
2Re
2
2
0
1 2 ||g||2
2 ( 2
B f (t)D f (t)dt)
B f (t)D f (t)dt
D f (t)B f (t)dt
D f (t)B f (t)dt
2
0
1
2
+ (1 2 2 D f (t)B f (t)dt
2
0
1
2
||g|| 1 2 2 D f (t)B f (t)dt .
(4.3.73)
(4.3.74)
cos = Re
0
D f (t)B f (t)dt D f (t)B f (t)dt .
(4.3.75)
(4.3.76)
The locus of points ( , ) satisfying (4.3.76) is up and to the right of the curve
where
$
(4.3.77)
cos1 + cos1 = cos1 0 .
See Figure 4.6.
Equation (4.3.77) holds for the function f = b1 0 + b2 D0 with
R
R
1 2
1 2
and b2 =
.
b1 =
1 0
1 0
0
All intermediate values of are again attained by employing eipt f .
0.2
0.4
beta^2
0.6
0.8
1.0
436
0.0
W=0.5
W=1
W=2
0.0
0.2
0.4
0.6
0.8
1.0
alpha^2
Figure 4.6
437
We will state several facts, without proof, about the existence and properties of
the Poisson random measure introduced in Definition 4.4.1.
Theorem 4.4.2 For any non-atomic and -finite measure on R+ there exists
a unique PRM satisfying Definition 4.4.1. If measure has the form (dt) =
dt where > 0 is a constant (called the intensity of ), this PRM is a Poisson
process PP( ). If the measure has the form (dt) = (t)dt where (t) is a given
function, this PRM gives an inhomogeneous Poisson process PP( (t)).
Theorem 4.4.3 (The mapping theorem) Let be a non-atomic and -finite
measure on R such that for all t 0 and h > 0, the measure (t,t + h) of the
interval (t,t + h) is positive and finite (i.e. the value (t,t + h) (0, )), with
lim (0, h) = 0 and (R+ ) = lim (0, u) = +. Consider the function
h0
u+
f : u R+ (0, u),
and let f 1 be the inverse function of f . (It exists because f (u) = (0, u) is strictly
monotone in u.) Let M be the PRM( ). Define a random measure f M by
( f M)(I) = M( ( f 1 I)) = M( ( f 1 (a), f 1 (b))),
(4.4.1)
n+
1
.
2
(4.4.2)
438
Solution Since
0 1
1
(x)dx = ,
there are with probability 1 infinitely many points of in (1, 1). On the other
hand,
0 1
1+
(x)dx <
for every > 0, so that (1+ , 1 ) is finite with probability 1. This is enough
to label uniquely in ascending order the points of . Let
0 x
f (x) =
0
(y)dy.
(a, b) =
0 f 1 (b)
f 1 (a)
(x)dx = b a.
With this choice of f , the points ( f (Xn )) form a Poisson process of unit rate on R.
The strong law of large numbers shows that, with probability 1, as n ,
n1 f (Xn ) 1, and n1 f (Xn ) 1.
Now, observe that
1
1
(x) (1 x)3 and f (x) (1 x)2 , as x 1.
4
8
Thus, as n , with probability 1,
1
n1 (1 Xn )2 1,
8
which is equivalent to (4.4.2). Similarly,
1
1
(x) (1 + x)2 and f (x) (1 + x)1 , as x 1,
8
8
implying that with probability 1, as n ,
1
n1 (1 + Xn )1 1.
8
Hence, with probability 1
1
lim n(1 + Xn ) = .
n
8
439
Worked Example 4.4.5 Show that, if Y1 < Y2 < Y3 < are points of a Poisson
process on (0, ) with constant rate function , then
lim Yn /n =
with probability 1. Let the rate function of a Poisson process = PP( (x)) on
(0, 1) be
(x) = x2 (1 x)1 .
Show that the points of can be labelled as
< X2 < X1 <
1
< X0 < X1 <
2
and that
lim Xn = 0 , lim Xn = 1 .
Prove that
lim nXn = 1
f (x) =
1/2
( )d ,
and use the fact that f maps into a PP of constant rate on ( f (0), f (1)): f () =
PP(1). In our case, f (0) = and f (1) = , and so f () is a PP on R. Its points
may be labelled
< Y2 < Y1 < 0 < Y0 < Y1 <
with
lim Yn = , lim Yn = +.
n+
Now, as x 0,
f (x) =
0 1/2
x
f (Xn )
Yn
= lim
= 1, a.s.
n n
n
(1 ) d
0 1/2
x
2 d x1 ,
440
implying that
1
Xn
= 1, i.e. lim nXn = 1, a.s.
n n
n
lim
Similarly,
lim
n+
and as x 1,
f (x)
0 x
1/2
f (Xn )
= 1, a.s.,
n
(1 )1 d ln(1 x).
ln(1 Xn )
= 1, a.s.
n
(n An ) = (An ).
n
The value (E) can be finite or infinite. Our aim is to define a random counting
measure M = (M(A), A E ), with the following properties:
(a) The random variable M(A) takes non-negative integer values (including, possibly, +). Furthermore,
'
Po( (A)), if (A) < ,
(4.4.3)
M(A)
= + with probability 1, if (A) = .
(b) If A1 , A2 , . . . E are disjoint sets then
M (i Ai ) = M(Ai ).
(4.4.4)
1in
P (M(Ai ) = ki ) .
(4.4.5)
441
First assume that (E) < (if not, split E into subsets of finite measure). Fix a
random variable M(E) Po( (E)). Consider a sequence X1 , X2 , . . . of IID random points in E, with Xi (E), independently of M(E). It means that for all
n 1 and sets A1 , . . . , An E (not necessarily disjoint)
n n
(Ai )
(E) (E)
,
P M(E) = n, X1 A1 , . . . , Xn An = e
n!
i=1 (E)
(4.4.6)
and conditionally,
n
(Ai )
P X1 A1 , . . . , Xn An |M(E) = n =
.
i=1 (E)
(4.4.7)
Then set
M(E)
M(A) =
1(Xi A), A E .
(4.4.8)
i=1
and
P(R1 > r) = P(C(r) contains at most one point of M)
2
= (1 + r2 )e r , r > 0.
Similarly,
2
1
P(R2 > r) = 1 + r2 + ( r2 )2 e r , r > 0.
2
442
Then
ER0 =
ER1 =
0
0
0
0
1
P(R0 > r)dr =
2
e r d
2
2 r =
1
,
2
0
2
1
e r r2 dr
= +
0
2
0
2
1
1
+
2 r2 e r d 2 r
=
2 2 2 0
3
= ,
4
2
0
r2 r2
3
e
ER2 = +
dr
2
0
4
0
2
2
1
3
= +
2 r2 e r d 2 r
4 8 2 0
3
3
= +
4 16
15
= .
16
We shall use for the PRM M on the phase space E with intensity measure constructed in Theorem 4.4.6 the notation PRM(E, ). Next, we extend the definition
of the PRM to integral sums: for all functions g : E R+ define
0
M(E)
M(g) =
g(Xi ) :=
g(y)dM(y);
(4.4.9)
i=1
summation is taken over all points Xi E, and M(E) is the total number of such
points. Next, for a general g : E R we set
M(g) = M(g+ ) M(g ),
with the standard agreement that + a = + and a = for all a (0, ).
[When both M(g+ ) and M(g ) equal , the value M(g) is declared not defined.]
Then
Theorem 4.4.8 (Campbell theorem) For all R and for all functions g : E R
such that e g(y) 1 is -integrable
0
Ee M(g) = exp
(4.4.10)
e g(y) 1 d (y) .
E
Proof
443
Write
Ee M(g) = e (E)
( (E))k
k!
= e (E) exp
(E)
k
k
e g(x) d (x)
e g(x) d (x)
= exp
e g(x) 1 d (x) .
Corollary 4.4.9
g(y)d (y);
Example 4.4.10 Suppose that the wireless transmitters are located at the points
of Poisson process on R2 of rate . Let ri be the distance from transmitter i to
the central receiver at 0, and the minimal distance to a transmitter is r0 . Suppose
that the power of the received signal is Y = Xi rP for some > 2. Then
i
0
Ee Y = exp 2
(4.4.11)
e g(r) 1 rdr ,
r0
where g(r) =
P
r
444
A popular model in application is the so-called marked point process with the
space of marks D. This is simply a random measure on Rd D or on its subset. We
will need the following product property proved below in the simplest set-up.
Theorem 4.4.11 (The product theorem) Suppose that a Poisson process with the
constant rate is given on R, and marks Yi are IID with distribution . Define a
random measure M on R+ D by
M(A) =
(Tn ,Yn ) A , A R+ D.
(4.4.12)
n=1
M(A) =
n=1
We know that Nt Po( t). Further, given that Nt = k, the jump points T1 , . . . , Tk
have the conditional joint PDF fT1 ,...,Tk ( |Nt = k) given by (4.4.7). Then, by using
further conditioning, by T1 , . . . , Tk , in view of the independence of the Yn , we have
E e M(A) |Nt = k
= E E e M(A) |Nt = k; T1 , . . . , Tk
0t
0t
...
=
0
E exp
1
tk
I (xi ,Yi ) A |Nt = k; T1 = x1 , . . . , Tk = xk
k
i=1
0t 0
0 D
e IA (x,y) d (y)dx .
Then
Ee M(A) = e t
= exp
0t 0
( t)k
k=0
0t
1
k! t k
445
e IA (x,y) d (y)dx
0 D
e IA (x,y) 1 d (y)dx .
0 D
The expression e IA (x,y) 1 takes value e 1 for (x, y) A and 0 for (x, y) A.
Hence,
0
(4.4.13)
Ee M(A) = exp e 1 d (y)dx , R.
A
n 0
n
Ee M(A) = exp e 1 d (y)dx = Ee M(Ai ) , R,
i=1
Ai
i=1
446
0
e a(y) 1 (dy) ,
Ee = exp
E
where
=
a(X).
a(y)M(dy) =
0 0
(dx dm) e Gm/|x| 1 .
Ee F = exp
R3 0
dEe F
|
and equals
d =0
(x)dx
(x, dm)
Gm
= GM
|x|
dx
R3
1
1(|x| R).
|x|2
1
dx 2 1(|x| R) =
|x|
0R
dr
1 2
r
r2
d cos
d = 4 R
which yields
EF = 4 GMR.
Finally, let D be the distance to the nearest star contributing to F at least C. Then,
by the product theorem,
P(D d) = P(no points in A) = exp (A) .
Here
K
%
Gm
C ,
A = (x, m) R3 R+ : |x| d,
|x|
and (A) =
447
0d
1
dr r2 d cos
r
d M
dmem/M
Cr/G
0d
= 4 drreCr/(GM)
0
= 4
GM
C
2
Cd Cd/(GM)
Cd/(GM)
e
1e
.
GM
P(|xi y|)
(4.4.14)
xi
where P is the emitted signal power and the function describes the fading of the
signal. In the case of so-called Rayleigh fading (|x|) = e |x| , and in the case of
the power fading (|x|) = |x| , > 2. By the Campbell theorem
0
( ) = E e Y = exp 2
r e P(r) 1 dr .
(4.4.15)
0
448
Assuming that the signal Sk from the point xk is amplified by the coefficient
we write the signal
Yj =
h jk Sk + Z j , j = 1, . . . , J.
(4.4.16)
xk
e2 ir jk /
P /2 ,
r jk
(4.4.17)
where is the transmission wavelength and r jk = |y j xk |. The noise random variables Z j are assumed to be IID N(0, 02 ). A similar formula could be written for
Rayleigh fading. We know that in the case of J = 1 and a single transmitter K = 1,
by the NyquistShannon theorem of Section 4.3, the capacity of the continuous
time, additive white Gaussian noise channel Y (t) = X(t)(x, y) + Z(t) with attenu1 /2
ation factor (x, y), subject to the power constraint /2 X 2 (t)dt < P , bandwidth
W , and noise power spectral density 02 , is
P2 (x, y)
C = W log 1 +
.
(4.4.18)
2W 02
Next, consider the case of finite numbers K of transmitters and J of receivers
K
(4.4.19)
i=1
the box Bn of area n (i.e. size n); partition them into two sets S1 and S2 , so
449
Cn =
Rk j
k=1 j=1
Pk s2k
log
1
+
,
Pk 0, Pk nP k=1
202
n
max
where sk is the kth largest singular value of the matrix L = ((xk , y j )), 02 is the
noise power spectral density, and the bandwidth W = 1.
This result allows us to find the asymptotic of capacity as n . In the most
2
n) ); in the case of power
interesting case of Rayleigh fading R(n) = Cn /n O( (log
n
1/ (log n)2
P2 (xi , x j )
02 + k=i, j P2 (xk , x j )
(4.4.22)
where P, 02 , k > 0 and 0 < 1k . We say that a transmitter located at xi can send a
message to receiver located at x j if SNR(xi x j ) k. For any k > 0 and 0 < < 1,
let An (k, ) be an event that there exists a set Sn of at least n points of such that
for any two points s, d Sn , SNR(s, d) > k. It can be proved (see [48]) that for all
(0, 1) there exists k = k( ) such that
lim P An (k( ), ) = 1.
(4.4.23)
450
First, we note that any given transmitter may be directly connected to at most
1 + ( k)1 receivers. Indeed, suppose that nx nodes are connected to the node x.
Denote by x1 the node connected to x and such that
(|x1 x|) (|xi x|), i = 2, . . . , nx .
(4.4.24)
02 + P(|xi x|)
i=2
which implies
P(|x1 x|) k02 + k P(|xi x|)
i2
2
k0 + k (nx 1)P(|x1 x|) + k
P(|xi x|)
inx +1
(4.4.25)
( ) = O( 1 ).
(4.4.26)
Gaussian noise in a channel. Enlist in the codebook XM,Nthe random points X(i)
from process (N) lying inside the Euclidean ball B(N) ( N ) and surviving the
following purge. Fix r > 0 (the minimal distance of the random code) and for any
point X j of a Poisson process (N) generate an IID random variable T j U([0, 1])
(a random mark). Next, for every point X j of the original Poisson process examine
451
the ball B(N) (X j , r) of radius r centred at X j . The point X j will survive only if its
mark T j is strictly smaller than the marks of all other points from (N) lying in
B(N) (X j , r). The resulting point process (N) is known as the Matern process; it is
an example of a more general construction discussed in the recent paper [1].
The main parameter of a random codebook with codewords x(N) of length N is
the induced distribution of the distance between codewords. In the case of codebooks generated by stationary point processes it is convenient to introduce a function K(t) such that 2 K(t) gives the expected number of ordered pairs of distinct
points in a unit volume less than distance t apart. In other words, K(t) is the expected number of further points within t of an arbitrary point of a process. Say,
for Poisson process on R2 of rate , K(t) = t 2 . In random codebooks we are interested in models where K(t) is much smaller for small and moderate t. Hence,
random codewords appear on a small distance from one another much more rarely
than in a Poisson process. It is convenient to introduce the so-called product density
(t) =
2 dK(t)
,
c(t) dt
(4.4.27)
where c(t) depends on the state space of the point process. Say, c(t) = 2 t on R1 ,
c(t) = 2 t 2 on R2 , c(t) = 2 sint on the unit sphere, etc.
Some convenient models of this type have been introduced by B. Matern. Here
we discuss two rather intuitive models of point processes on RN . The first is obtained by sampling a Poisson process of rate and deleting any point which is
within 2R of any other whether or not this point has already been deleted. The rate
of this process for N = 2 is
M,1 = e4 R .
2
(4.4.28)
(4.4.29)
Here B((0, 0), 2R) is the ball with centre (0, 0) of radius 2R, and B((t, 0), 2R) is
the ball with centre (t, 0) of radius 2R. For varying this model has the maximum
rate of (4 eR2 )1 and
so cannot model densely packed codes. This is 10% of the
theoretical bound ( 12R2 )1 which is attained by the triangular lattice packing,
cf. [1].
The second Matern model is an example of the so-called marked point process.
The points of a Poisson process of rate are independently marked by IID random
variables with distribution U([0, 1]). A point is deleted if there is another point of
452
the process within distance 2R which has a bigger mark whether or not this point
has already been deleted. The rate of this process for N = 2 is
(4.4.30)
(4.4.31)
k(t) =
(4.4.32)
Jr0 ,a i
where Jr0 ,a denotes the set of interfering transmitters such that r0 ri < a. Let P
be the rate of Poisson process producing a Matern process after thinning. The rate
of thinned process is
1 exp P r02
=
.
r02
Using the Campbell theorem we compute the MGF of Xr0 :
( ) = E e Xr0
= exp P (a
r02 )
0 1
q(t)dt
0
r0
2r
g(r)
dr 1 . (4.4.33)
e
(a2 r02 )
453
P
and q(t) = exp P r02t is the retaining probability of a point
r 0
1
q(t)dt = , we obtain
of mark t. Since
P
0
0 a
2r
2
2
g(r)
e
( ) = exp (a r0 )
dr 1 .
(4.4.34)
2
r0 (a2 r0 )
Here g(r) =
k =
0 a
r0
2r(g(r))k dr =
2 Pk
Pk
.
k 2 r0k 2 ak 2
(4.4.35)
Engineers say that outage happens at the central receiver, i.e. the interference prevents one from reading a signal obtained from a sender at distance rs , if
P/rs
k.
02 + Jr0 ,a P/ri
Here, 02 is the noise power, rs is the distance to sender and k is the minimal SIR
(signal/noise ratio) required for successful reception. Different approximations of
outage probability based on the moments computed in (4.4.35) are developed. Typically, the distribution of Xr0 is close to log-normal; see, e.g., [113].
454
for some function f : {0, 1}d {0, 1} (a feedback function). The initial string
(x0 , . . . , xd1 ) is called an initial fill; it produces an output stream (xn )n0 satisfying
the recurrence equation
xn+d = f (xn , . . . , xn+d1 ),
for all n 0.
(4.5.1)
A feedback shift register is said to be linear (an LFSR, for short) if function f is
linear and c0 = 1:
d1
f (x0 , . . . , xd1 ) =
ci xi ,
where ci = 0, 1, c0 = 1;
(4.5.2)
i=0
xn+d =
ci xn+i
for all n 0.
(4.5.3)
i=0
where
0
0
V = ...
1
0
..
.
0
1
..
.
...
...
..
.
0
0
..
.
0 0 ...
0
c0 c1 c2 . . . cd2
0
0
..
.
(4.5.4)
xn
xn+1
..
n+d1
=
,
x
. .
xn+d2
1
cd1
xn+d1
(4.5.5)
By the expansion of the determinant along the first column one can see that det V =
1 mod 2: the cofactor for the (n, 1) entry c0 is the matrix Id1 . Hence,
det V = c0 det Id1 = c0 = 1, and the matrix V is invertible.
(4.5.6)
(4.5.7)
Observe that general feedback shift registers, after an initial run, become periodic:
Theorem 4.5.2 The output stream (xn ) of a general feedback shift register of
length d has the property that there exists integer r, 0 r < 2d , and integer D,
1 D < 2d r, such that xk+D = xk for all k r.
455
Proof A segment xM . . . xM+d1 determines uniquely the rest of the output stream
in (4.5.1), i.e. (xn , n M + d 1). We see that if such a segment is reproduced in
the stream, it will be repeated. There are 2d different possibilities for a string of d
subsequent digits. Hence, by the pigeonhole principle, there exists 0 r < R < 2d
such that the two segments of length d of the output stream, from positions r and
R onwards, will be the same: xr+ j = xR+ j , 0 r < R < d. Then, as was noted,
xr+ j = xR+ j for all j 0, and the assertion holds true with D = R r.
In the linear case (LFSR), we can repeat the above argument, with the zero string
discarded. This allows us to reduce 2d to 2d 1. However, an LFSR is periodic in
a proper sense:
Theorem 4.5.3 An LFSR (xn ) is periodic, i.e. there exists D 2d 1 such that
xn+D = xn for all n. The smallest D with this property is called the period of the
LFSR.
Proof Indeed, the column vectors xn+d1
, n 0, are related by the equation
n
xn+1 = Vxn = Vn+1 x0 , n 0, where matrix V was defined in (4.5.5). We noted that
det V = c0 = 0 and hence V is invertible. As was said before, we may discard the
zero initial fill. For each vector xn {0, 1}d there are only 2d 1 non-zero possibilities. Therefore, as in the proof of Theorem 4.5.2, among the initial 2d 1 vectors
xn , 0 n 2d 2, either there will be repeats, or there will be a zero vector. The
second possibility can be again discarded, as it leads to the zero initial fill. Thus,
suppose that the first repeat was for j and D + j: x j = x j+D , i.e. V j+D x0 = V j x0 .
If j = 0, we multiply by V1 and arrive at an earlier repeat. So: j = 0, D 2d 1
and VD x0 = x0 . Then, obviously, xn+D = Vn+D x0 = Vn x0 = xn .
Worked Example 4.5.4 Give an example of a general feedback register with
output k j , and initial fill (k0 , k1 , . . . , kN ), such that
(kn , kn+1 , . . . , kn+N ) = (k0 , k1 , . . . , kN ) for all n 1.
Solution Take f : {0, 1}2 {0, 1}2 with f (x1 , x2 ) = x2 1. The initial fill 00 yields
00111111111 . . .. Here, kn+1 = 0 = k1 for all n 1.
Worked Example 4.5.5 Let matrix V be defined by (4.5.5), for the linear recursion (4.5.3). Define and compute the characteristic and minimal polynomials
for V.
456
X
0
0
c0
1
X
..
.
0
1
..
.
...
...
..
.
0
0
..
.
0
0
..
.
(4.5.8)
0 0 ... X
1
c1 c2 . . . cd2 (cd1 + X)
(recall, entries 1 and ci are considered in F2 ). Expanding along the bottom row,
the polynomial hV (t) is written as a linear combination of determinants of size
(d 1) (d 1) (co-factors):
1 0 ... 0 0
X 0 ... 0 0
X 1 . . . 0 0
X 1 . . . 0 0
c0 det . . .
+ c1 det . . .
.
.. ..
.
.
.
.
.
.
.
.
.
. .
. .
. . .
. . .
0
0 ... X
0 0 ... X 1
1 ... 0 0
X . . . 0 0
.. . . .. ..
. . .
.
0 0 ... 0 1
X 1 ... 0
0 X ... 0
+(cd1 + X) det . . .
. . ...
.. ..
1
X
0
+ + cd2 det .
..
0
0
..
.
... 0 X
457
( j)
( j)
a0 e j + a1 Ve j + + ad j Vd j e j + Vd j +1 e j = 0.
(iv) Further, we form the corresponding polynomial
( j)
mV (X) =
( j)
ai X i + X d j +1 .
0id j
(v) Then,
(1)
(d)
mV (X) = lcm mV (X), . . . , mV (X) .
0
..
.
ej =
1
..
.
0
..
.
j.
..
.
mV (X) =
ci X i + X d = hV (X).
0id1
We see that the feedback polynomial C(X) of the recursion coincides with the
characteristic and the minimal polynomial for V. Observe that at X = 0 we obtain
hV (0) = C(0) = c0 = 1 = det V.
(4.5.9)
Any polynomial can be identified through its roots; we saw that such a description may be extremely useful. In the case of an LFSR, the following example is
instructive.
Theorem 4.5.6 Consider the binary linear recurrence in (4.5.3) and the corresponding auxiliary polynomial C(X) from (4.5.7).
(a) Suppose K is a field containing F2 such that polynomial C(X) has a root of
multiplicity m in K. Then, for all k = 0, 1, . . . , m 1,
xn = A(n, k) n , n = 0, 1, . . . ,
(4.5.10)
458
k = 0,
1,
A(n, k) =
(n l)+ mod 2, k 1.
(4.5.11)
0lk1
Here, and below, (a)+ stands for max[a, 0]. In other words, sequence x(k) =
(xn ), where xn is given by (4.5.10), is an output of the LFSR with auxiliary
polynomial C(X).
(b) Suppose K is a field containing F2 such that C(X) factorises in K into linear factors. Let 1 , . . . , r K be distinct roots of C(X) of multiplicities
m1 , . . . , mr , with mi = d . Then the general solution of (4.5.3) in K is
1ir
xn =
1ir 0kmi 1
(4.5.12)
for some bu,v K. In other words, sequences x(i,k) = (xn ), where xn = A(n, k)in
and A(n, k) is given by (4.5.11), span the set of all output streams of the LFSR
with auxiliary polynomial C(X).
Proof (a) If C(X) has a root K of multiplicity m then C(X) = (X )mC(X)
where C(X) is a polynomial of degree d m (with coefficients from a field K K).
Then, for all k = 0, . . . , m 1, and for all n d, the polynomial
dk
Dk,n (X) := X k k X nd C(X)
dX
(with coefficients taken mod 2) vanishes at X = (in field K):
Dk,n ( ) =
0id1
This yields
A(n, k) n =
ci A(n d + i, k) nd+i .
0id1
of root .
(b) First, observe that the set of output streams (xn )n0 forms a linear space W over
K (in the set of all sequences with entries from K). The dimension of W equals d,
as every stream is uniquely defined by a seed (initial fill) x0 x1 . . . xd1 Kd . On the
(i,k)
other hand, d = mi , the total number of sequences x(i,k) = xn
with entries
1ir
(i,k)
xn
= A(n, k)in , n = 0, 1, . . . .
459
1ir 0kmi 1
x(i,k)
1ir 0kmi 1
br,mr 1 x(r,mr 1) . This gives
1i<r
460
Upon seeing a stream of digits (xn )n0 , an observer may wish to determine
whether it was produced by an LFSR. This can be done by using the so-called
BerlekampMassey (BM) algorithm, solving a system of linear equations. If a sed1
d1
x0
x1
Ad = .
..
x1
x2
..
.
x2
x3
..
.
xd xd+1 xd+2
. . . xd
. . . xd+1
.. ,
..
.
.
. . . x2d
c0
c1
..
.
cd =
.
cd1
1
(4.5.13)
x0
x1
Ar = .
..
xr
x1
x2
..
.
x2
x3
..
.
xr+1 xr+2
. . . xr
. . . xr+1
.. .
..
.
.
. . . x2r
x0 x1 . . . xd
x1 x2 . . . xd+1
Ad
= 0, where Ad = ..
..
..
..
.
.
.
.
ad1
xd xd+1 . . . x2d
1
a0
a1
..
.
(e.g. by Gaussian elimination) and test sequence (xn ) for the recursion xn+d =
ai xn+i . If we discover a discrepancy, we choose a different vector cr
0id1
(xn ), consider a formal power series in X: x j X j . The fact that (xn ) is produced
j=0
461
by the LFSR with a feedback polynomial C(X) is equivalent to the fact that the
d
A(X)
x j X j = C(X) .
(4.5.14)
j=0
an = ci xni , n = 1, . . . ,
(4.5.15)
i=1
or
n1
an ci xni ,
xn =
i=1
n1
ci xni ,
i=0
n = 0, 1, . . . , d,
(4.5.16)
n > d.
In other words, A(X) takes part in specifying the initial fill, and C(X) acts as the
feedback polynomial.
Worked Example 4.5.7 What is a linear feedback shift register? Explain the
BerlekampMassey method for recovering the feedback polynomial of a linear
feedback shift register from its output. Illustrate in the case when we observe outputs
1 0 1 0 1 1 0 0 1 0 0 0 ...,
0 1 0 1 1 1 1 0 0 0 1 0 ...
and
1 1 0 0 1 0 1 1.
Solution An initial fill x0 . . . xd1 produces an output stream (xn )n0 satisfying the
recurrence equation
d1
xn+d =
ci xn+i
for all n 0.
i=0
462
1 0 1
A2 = 0 1 0 , with det A2 = 0,
1 0 1
c0
1 0 1 0
0 1 0 1
A3 =
1 0 1 1 , with det A3 = 0,
0 1 1 0
and then to A4 :
1
0
A4 =
1
0
1
0
1
0
1
1
1
0
1
1
0
0
1
1
0
0
1
1
0
, with det A4 = 0.
0
1
1
0
0 1 0
0 1 0
0 1
1 0 1
det
= 0, det 1 0 1 = 0, det
0 1 1
1 0
0 1 1
1 1 1
1
1
= 0
1
1
and
0
1
1
1
1
0
1
1
1
0
1
1
1
1
1
1
1
1
0
463
1
1
1
1
1
0 = 0.
0 0
0
1
This yields the solution: d = 4, xn+4 = xn + xn+1 . The linear recurrence relation is
satisfied by every term of the output sequence given. The feedback polynomial is
then X 4 + X + 1.
In the third example the recursion is xn+3 = xn + xn+1 .
LFSRs are used for producing additive stream ciphers. Additive stream ciphers
were invented in 1917 by Gilbert Vernam, at the time an engineer with the AT&T
Bell Labs. Here, the sending party uses an output stream from an LFSR (kn ) to
encrypt a plain text (pn ) by (zn ) where
zn = pn + kn mod 2, n 0.
(4.5.17)
(4.5.18)
but of course he must know the initial fill k0 . . . kd1 and the string c0 . . . cd1 . The
main deficiency of the stream cipher is its periodicity. Indeed, if the generating
LFSR has period D then it is enough for an attacker to have in his possession a
cipher text z0 z1 . . . z2D1 and the corresponding plain text p0 p1 . . . p2D1 , of length
2D. (Not an unachievable task for a modern-day Sherlock Holmes.) If by some luck
the attacker knows the value of the period D then he only needs z0 z1 . . . zD1 and
p0 p1 . . . pD1 . This will allow the attacker to break the cipher, i.e. to decrypt the
whole plain text, however long.
Clearly, short-period LFSRs are easier to break when they are used repeatedly. The history of World War II and the subsequent Cold War has a number of
spectacular examples (German code-breakers succeeding in part in reading British
Navy codes, British and American code-breakers succeeding in breaking German
codes, the American project Venona deciphering Soviet codes) achieved because
of intensive message traffic. However, even ultra-long periods cannot guarantee
safety.
As far as this section of the book is concerned, the period of an LFSR can be
increased by combining several LFSRs.
Theorem 4.5.8 Suppose a stream (xn ) is produced by an LFSR of length d1 ,
period D1 and with an auxiliary polynomial C1 (X), and a stream (yn ) by an LFSR
464
1ir2
1ir1 0kmi 1
1 jr2 0lm j 1
b j,l A(n, l) jn ,
(4.5.19)
The
can
be
written
as
a
sum
product
ai,k b j,l A(n, k)A(n, l)
A(n,t)ut (ai,k , b j,l ) where coefficients ut (ai,k , b j,l ) K. This gives
kltk+l1
A(n,t)
k,l:kltk+l1
A(n,t)vi, j;t (i j )n
corresponding to the generic form of the output stream for the LFSR with the auxiliary polynomial C(X) in statement (b).
Despite serious drawbacks, LFSRs remain in use in a variety of situations: they
allow simple enciphering and deciphering without lookahead and display a local effect of an error, be it encoding, transmission or decoding. More generally,
465
non-linear LFSRs often offer only marginal advantages while bringing serious disadvantages, in particular with deciphering.
(in F2 ).
466
All 16 initial fills have appeared in the list, so the analysis is complete.
Worked Example 4.5.11 Describe how an additive stream cipher operates. What
is a one-time pad? Explain briefly why a one-time pad is safe if used only once but
becomes unsafe if used many times. A one-time pad is used to send the message
x1 x2 x3 x4 x5 x6 y7 which is encoded as 0101011. By mistake, it is reused to send the
message y0 x1 x2 x3 x4 x5 x6 which is encoded as 0100010. Show that x1 x2 x3 x4 x5 x6 is
one of two possible messages, and find the two possibilities.
Solution A one-time pad is an example of a cipher based on a random key and
proposed by Gilbert Vernam and Joseph Mauborgne (the Chief of the USA Signal
Corps during World War II). The cipher uses a random number generator producing
a sequence k1 k2 k3 . . . from the alphabet J of size q. More precisely, each letter
is uniformly distributed over J and different letters are independent. A message
m = a1 a2 . . . an is encrypted as c = c1 c2 . . . cn where
ci = ai + ki mod q .
To show that the one-time pad achieves perfect secrecy, write
P(M = m,C = c) = P(M = m, K = c m)
= P(M = m)P(K = c m) = P(M = m)
1
;
qn
here the subtraction c m is digit-wise and mod q. Hence, the conditional probability
P(C = c|M = m) =
1
P(M = m,C = c)
= n
P(M = m)
q
467
the cipher key stream is known only to the sender and the recipient.) In the example,
we have
x1 x2 x3 x4 x5 x6 y7 0101011,
y0 x1 x2 x3 x4 x5 x6 0100010.
Suppose x1 = 0. Then
k0 = 0, k1 = 1, x2 = 0, k2 = 0, x3 = 0, k3 = 0, x4 = 1, k1 = 0,
x5 = 1, k5 = 0, x6 = 1, k6 = 1.
Thus,
k = 0100101, x = 000111.
If x1 = 1, every digit changes, so
k = 1011010, x = 111000.
Alternatively, set x0 = y0 and x7 = y7 . If the first cipher is q1 q2 . . ., the second is
p1 p2 . . . and the one-time pad is k1 , k2 , . . ., then
q j = x j+1 + k j , p j = x j + k j .
So,
x j + x j+1 = q j + p j ,
and
x1 + x2 = 0, x2 + x3 = 0,
x3 + x4 = 1, x4 + x5 = 0, x5 + x6 = 0.
This yields
x1 = x2 = x3 , x4 = x5 = x6 , x4 = x3 + 1.
The message is 000111 or 111000.
Worked Example 4.5.12
(a) Let : Z+ {0, 1} be given by (n) = 1 if n is
odd, (n) = 0 if n is even. Consider the following recurrence relation over F2 :
un+3 + un+2 + un+1 + un = 0.
(4.5.20)
468
469
such that
(f) for all key e K there is a key d K , with the property that Dd (Ee (P)) = P
for all plaintext P P.
Example 4.5.14 Suppose that two parties, Bob and Alice, intend to have a twoside private communication. They want to exchange their keys, EA and EB , by using an insecure binary channel. An obvious protocol is as follows. Alice encrypts
a plain-text m as EA (m) and sends it to Bob. He encrypts it as EB (EA (m)) and
returns it to Alice. Now we make a crucial assumption that EA and EB commute
for any plaintext m : EA EB (m ) = EB EA (m ). In this case Alice can decrypt
this message as DA (EA (EB (m))) = EB (m) and send this to Bob, who then calculates DB (EB (m)) = m. Under this protocol, at no time during the transaction is an
unencrypted message transmitted.
However, a further thought shows that this is no solution at all. Indeed, suppose
that Alice uses a one-time pad kA and Bob uses a one-time pad kB . Then any single interception provides no information about plaintext m. However, if all three
transmissions are intercepted, it is enough to take the sum
(m + kA ) + (m + kA + kB ) + (m + kB ) = m
to obtain the plaintext m. So, more sophisticated protocols should be developed:
this is where public key cryptosystems are helpful.
Another popular example is a network of investors and brokers dealing in a
market and using an open access cryptosystem such as RSA. An investors concern
is that a broker will buy shares without her authorisation and, in the case of a loss,
claim that he had a written request from the client. Indeed, it is easy for a broker
to generate a coded order requesting to buy the stocks as the encoding key is in the
public domain. On the other hand, a broker may be concerned that if he buys the
shares by the investors request and the market goes down, the investor may claim
that she never ordered this transaction and that her coded request is a fake.
However, it is easy to develop a protocol which addresses these concerns. An
investor Alice sends to a broker Bob, together with her request p to buy shares, her
electronic signature fB fA1 (p). After receiving this message Bob sends a receipt
r encoded as fA fB1 (r). If a conflict emerges, both sides can provide a third party
(say, a court) with these coded messages and the keys. Since no-one but Alice could
generate the message coded by fB fA1 and no-one but Bob could generate the message coded by fA fB1 , no doubts would remain. This is the gist of bit commitment.
The above-mentioned RSA (RivestShamirAdelman) scheme is a prime example
470
(4.5.21)
Number N is often called the RSA modulus (and made public). The value of the
Euler totient function is
(4.5.22)
(4.5.23)
[The value of d can be computed via the extended Euclids algorithm.] The public
key eB used for encryption is the pair (N, l) (listed in the public directory). The
sender (Alice), when communicating to Bob, understands that Bobs plaintext and
ciphertext sets are P = C = {1, . . . , N 1}. She then encrypts her chosen plaintext
m = 1, . . . , N 1 as the ciphertext
EN,l (m) = c where c = ml mod N.
(4.5.24)
Bobs private key dB is the pair (N, d) (or simply number d): it is kept secret
from public but made known to Bob. The recipient decrypts ciphertext c as
Dd (c) = cd mod N.
(4.5.25)
In the literature, l is often called the encryption and d the decryption exponent.
Theorem 4.5.15 below guarantees that
Dd (c) = mdl = m mod N,
(4.5.26)
471
(4.5.27)
Otherwise, i.e. when p|m, (4.5.27) still holds as m and (ml )d are both equal to
0 mod p. By a similar argument,
(ml )d = m mod q.
(4.5.28)
By the Chinese remainder theorem (CRT) [28], [114] (4.5.27) and (4.5.28)
imply (4.5.26).
Example 4.5.16 Suppose Bob has chosen p = 29, q = 31, with N = 899 and
(N) = 840. The smallest possible value of e with gcd(l, (N)) = 1 is l = 11, after
that 13 followed by 17, and so on. The (extended) Euclid algorithm yields d = 611
for l = 11, d = 517 for l = 13, and so on. In the first case, the encrypting key E899,11
is
m m11 mod 899, that is, E899,11 (2) = 250.
The ciphertext 250 is decoded by
D611 (250) = 250611 mod 899 = 2,
with the help of the computer. [The computer is needed even after the simplification
rendered by the use of the CRT. For instance, the command in Mathematica is
PowerMod[250,611,899].]
Worked Example 4.5.17 (a) Referring to the RSA cryptosystem with public key
(N, l) and private key ( (N), d), discuss possible advantages or disadvantages of
taking (i) l = 232 + 1 or (ii) d = 232 + 1.
(b) Let a (large) number N be given, and we know that N is a product of two distinct
prime numbers, N = pq, but we do not know the numbers p and q. Assume that
another positive integer, m, is given, which is a multiple of (N). Explain how to
find p and q.
(c) Describe how to solve the bit commitment problem by means of the RSA.
Solution Using l = 232 + 1 provides fast encryption (you need just 33 multiplications using repeated squaring). With d = 232 + 1 one can decrypt messages quickly
(but an attacker can easily guess it).
472
(b) Next, we show that if we know a multiple m of (N) then it is easy to factor N.
Given positive integers y > 1 and M > 1, denote by ordM (y) the order of y relative
to M:
(4.5.29)
y2 1 mod p, y2 1 mod q.
t
So, gcd y2 1, N = p, as required.
473
has size (p 1)/2. We will do this by exhibiting such a subset of size (p 1)/2.
Note that
(4.5.30)
Furthermore:
Alices public key is N = pq; her secret key is the pair (p, q);
Alices plaintext and ciphertext are numbers m = 0, 1 . . . , N 1,
and her encryption rule is EN (m) = c where c = m2 mod N.
(4.5.31)
(4.5.32)
Then
m p = c1/2 mod p and mq = c1/2 mod q,
i.e. m p and mq are the square roots of c mod p and mod q, respectively. In fact,
2
p1
c = c mod p;
m p = c(p+1)/2 = c(p1)/2 c = m p
at the last step the EulerFermat theorem has been used. The argument for mq is
similar. Then Alice computes, via Euclids algorithm, integers u(p) and v(q) such
that
u(p)p + v(q)q = 1.
474
s = u(p)pmq v(q)qm p ] mod N.
These are four square roots of c mod N. The plaintext m is one of them. To
secure that she can identify the original plaintext, Alice may reduce the plaintext
space P, allowing only plaintexts with some special features (like the property
that their first 32 and last 32 digits are repetitions of each other), so that it becomes
unlikely that more than one square root has this feature. However, such a measure
may result in a reduced difficulty of breaking the cipher as it will be not always
true that the reduced problem is equivalent to factoring.
(4.5.33)
475
Then is called the discrete logarithm, mod p, of b to base ; some authors write
= dlog b mod p. Computing discrete logarithms is considered a difficult problem: no efficient (polynomial) algorithm is known, although there is no proof that
it is indeed a non-polynomial problem. [In an additive cyclic group Z/(nZ), the
DLP becomes b = mod n and is solved by the Euclid algorithm.]
The DiffieHellman protocol allows Alice and Bob to establish a common secret
key using field tables for F p , for a sufficient quantity of prime numbers p. That is,
they know a primitive element in each of these fields. They agree to fix a large
prime number p and a primitive element F p . The pair (p, ) may be publicly
known: Alice and Bob can fix p and through the insecure channel.
Next, Alice chooses a {0, 1, . . . , p 2} at random, computes
A = a mod p
and sends A to Bob, keeping a secret. Symmetrically, Bob chooses b {0, 1, . . . ,
p 2} at random, computes
B = b mod p
and sends B to Alice keeping b secret. Then
Alice computes Ba mod p and Bob computes Ab mod p,
and their secret key is the common value
K = ab = Ba = Ab mod p.
The attacker may intercept p, , A and B but knows
neither a = dlog A mod p nor b = dlog B mod p.
If the attacker can find discrete logarithms mod p then he can break the secret
key: this is the only known way to do so. The opposite question solving the
discrete logarithm problem if he is able to break the protocol remains open (it is
considered an important problem in public key cryptography).
However, like previously discussed schemes, the DiffieHellman protocol has a
particular weak point: it is vulnerable to the man in the middle attack. Here, the
attacker uses the fact that neither Alice nor Bob can verify that a given message really comes from the opposite party and not from a third party. Suppose the attacker
can intercept all messages between Alice and Bob. Suppose he can impersonate
Bob and exchange keys with Alice pretending to be Bob and at the same time impersonate Alice and exchange keys with Bob pretending to be Alice. It is necessary
to use electronic signatures to distinguish this forgery.
476
477
and Alices public key is (p = 37, = 2, A = 26), her plaintexts are 0, 1, . . . , 36 and
private key a = 12. Assume Bob has chosen b = 32; then
B = 232 mod 37 = 4.
Suppose Bob wants to send m = 31. He encrypts m by
c = Ab m mod p = (26)32 m mod 37 = 10 31 mod 37 = 14.
Alice decodes this message as 232 = 7 and 724 = 26 mod 37,
14 232(37121) mod 37 = 14 724 = 14 26 mod 37 = 31.
Worked Example 4.5.21 Suppose that Alice wants to send the message today
to Bob using the ElGamal encryption. Describe how she does this using the prime
p = 15485863, = 6 a primitive root mod p, and her choice of b = 69. Assume
that Bob has private key a = 5. How does Bob recover the message using the
Mathematica program?
Solution Bob has public key (15485863, 6, 7776), which Alice obtains. She converts the English plaintext using the alphabet order to the numerical equivalent:
19, 14, 3, 0, 24. Since 265 < p < 266 , she can represent the plaintext message as a
single 5-digit base 26 integer:
m = 19 264 + 14 263 + 3 262 + 0 26 + 24 = 8930660.
Now she computes b = 669 = 13733130 mod 15485863, then
m ab = 8930660 777669 = 4578170 mod 15485863.
Alice sends c = (13733130, 4578170) to Bob. He uses his private key to compute
( b ) p1a = 137331301548586315 = 2620662 mod 15485863
and
( )a m ab = 2620662 4578170 = 8930660 mod 15485863,
and converts the message back to the English plaintext.
Worked Example 4.5.22 (a) Describe the RabinWilliams scheme for coding
a message x as x2 modulo a certain N . Show that, if N is chosen appropriately,
breaking this code is equivalent to factorising the product of two primes.
(b) Describe the RSA system associated with a public key e, a private key d and
the product N of two large primes.
Give a simple example of how the system is vulnerable to a homomorphism
attack. Explain how a signature system prevents such an attack. Explain how to
factorise N when e, d and N are known.
478
Solution (a) Fix two large primes p, q 1 mod 4 which forms a private key; the
broadcasted public key is the product N = pq. The properties used are:
(i) If p is a prime, the congruence a2 d mod p has at most two solutions.
(ii) For a prime p = 1 mod 4, i.e. p = 4k 1, if the congruence a2 c mod p has
a solution then a c(p+1)/4 mod p is one solution and a c(p+1)/4 mod p is
another solution. [Indeed, if c a2 mod p then, by the EulerFermat theorem,
c2k = a4k = a(p1)+2 = a2 mod p, implying ck = a.]
The message is a number m from M = {0, 1, . . . , N 1}. The encrypter (Bob) sends
(broadcasts) m = m2 mod N. The decrypter (Alice) uses property (ii) to recover the
two possible values of m mod p and two possible values of m mod q. The CRT then
yields four possible values for m: three of them would be incorrect and one correct.
So, if one can factorise N then the code would be broken. Conversely, suppose
that we can break the code. Then we can find all four distinct square roots u1 , u2 ,
u3 , u4 mod N for a general u. (The CRT plus property (i) shows that u has zero
or four square roots unless it is a multiple of p and q.) Then u j u1 (calculable via
Euclids algorithm) gives rise to the four square roots, 1, 1, 1 and 2 , of 1 mod N,
with
1 1 mod p, 1 1 mod q
and
2 1 mod p, 2 1 mod q.
By interchanging p and q, if necessary, we may suppose we know 1 . As 1 1
is divisible by p and not by q, the gcd(1 1, N) = p; that is, p can be found by
Euclids algorithm. Then q can also be identified.
In practice, it can be done as follows. Assuming that we can find square roots
mod N, we pick x at random and solve the congruence x2 y2 mod N. With probability 1/2, we have x y mod N. Then gcd(x y, N) is a non-trivial factor of N.
We repeat the procedure until we identify a factor; after k trials the probability of
success is 1 2k .
(b) To define the RSA cryptosystem let us randomly choose large primes p and q.
By Fermats little theorem,
x p1 1 mod p, xq1 1 mod q.
Thus, by writing N = pq and (N) = lcm(p 1, q 1), we have
x (N) 1 mod N,
for all integers x coprime to N.
479
Next, we choose e randomly. Either Euclids algorithm will reveal that e is not
co-prime to (N) or we can use Euclids algorithm to find d such that
de 1 mod (N).
With a very high probability a few trials will give appropriate d and e.
We now give out the value e of the public key and the value of N but keep secret
the private key d. Given a message m with 1 m N 1, it is encoded as the
integer c with
1 c N 1 and c me mod N.
Unless m is not co-prime to N (an event of negligible probability), we can decode
by observing that
m mde cd mod N.
As an example of a homomorphism attack, suppose the system is used to transmit a number m (dollars to be paid) and someone knowing this replaces the coded
message c by c2 . Then
(c2 )d m2de m2
and the recipient of the (falsified) message believes that m2 dollars are to be paid.
Suppose that a signature B(m) is also encoded and transmitted, where B is a
many-to-one function with no simple algebraic properties. Then the attack above
will produce a message and signature which do not correspond, and the recipient
will know that the message was tampered with.
Suppose e, d and N are known. Since
de 1 0 mod (N)
and (N) is even, de 1 is even. Thus de 1 = 2a b with b odd and a 1.
Choose x at random. Set z xb mod N. By the CRT, z is a square root of
1 mod N = pq if and only if it is a square root of 1 mod p and q. As F2 is a field,
x2 1 mod p (x 1)(x + 1) 0 mod p
(x 1) 0 mod p or (x + 1) 0 mod p.
Thus 1 has four square roots w mod N satisfying w 1 mod p and w
1 mod q. In other words,
w 1 mod N, w 1 mod N,
w w1 mod N with w1 1 mod p and w1 1 mod q
or
w w2 (mod N) with w2 1 mod p and w1 1 mod q.
480
P(T < t) = (1 ep j t ) .
(4.6.1)
j=1
Let L = NT be the total number of coupons collected by the time the complete
set of coupon types is obtained. Show that ET = EL. Hence, or otherwise, deduce
that EL does not depend on .
Solution Part (a) directly follows from the definition of a Poisson process.
(b) Let T j be the time of the first collection of a type j coupon. Then T j Exp(p j ),
independently for different j. We have
T = max T1 , . . . , Tm ,
and hence
m
m
P(T < t) = P max T1 , . . . , Tm < t = P(T j < t) = 1 ep j t .
j=1
j=1
481
Next, observe that the random variable L counts the jumps in the original Poisson
process (Nt ) until the time of collecting a complete set of coupon types. That is:
L
T = Si ,
i=1
where S1 , S2 , . . . are the holding times in (Nt ), with S j Exp( ), independently for
different j. Then
E(T |L = n) = nES1 = n 1 .
Moreover, L is independent of the random variables S1 , S2 , . . .. Thus,
ET = P(L = n)E T |L = n = ES1 nP(L = n) = 1 EL.
nm
nm
But
ET =
=
0
j=1
1 1 ep j t dt,
m
j=1
P(Nt i),
t 0,
i = 1, . . . , k,
(4.6.2)
482
Suppose now that arrivals to the first queue stop at time T . Determine the mean
number of customers at the ith queue at each time t T .
Solution We apply the product theorem to the Poisson process of arrivals with
random vectors Yn = (Sn1 , . . . , Snk ) where Sni is the service time of the nth customer
at the ith queue. Then
Vi (t) = the number of customers in the ith queue at time t
= 1 the nth customer arrived in the first queue at
n=1
n=1
n=1
Here (Jn : n N) denote the jump times of a Poisson process of rate , and the
measures M and on (0, ) Rk+ are defined by
M(A) =
(Jn ,Yn ) A , A (0, ) Rk+
n=1
and
(0,t] B = t (B).
The product theorem states that M is a Poisson random measure on (0, ) Rk+
with intensity measure . Next, the set Ai (t) (0, ) Rk+ is defined by
*
Ai (t) = ( , s1 , . . . , sk ) : 0 < < t, s1 , . . . , sk 0
+
and + s1 + + si1 t < + s1 + + si
'
=
( , s1 , . . . , sk ) : 0 < < t, s1 , . . . , sk 0
and
i1
l=1
l=1
sl t < sl
@
.
Sets Ai (t) are pairwise disjoint for i = 1, . . . , k (as t can fall between subsei1
l=1
l=1
quent partial sums sl and sl only once). So, the random variables Vi (t) are
independent Poisson.
483
A direct verification is through the joint MGF. Namely, let $N_t\sim{\rm Po}(\lambda t)$ be the number of arrivals at the first queue by time $t$. Then write
$$M_{V_1(t),\dots,V_k(t)}(\theta_1,\dots,\theta_k) = \mathbb E\exp\big(\theta_1V_1(t)+\dots+\theta_kV_k(t)\big) = \mathbb E\Big[\mathbb E\Big(\exp\Big(\sum_{i=1}^{k}\theta_iV_i(t)\Big)\,\Big|\,N_t;\,J_1,\dots,J_{N_t}\Big)\Big].$$
In turn, given $n=1,2,\dots$ and points $0<\tau_1<\dots<\tau_n<t$, the conditional expectation is
$$\mathbb E\Big(\exp\Big(\sum_{i=1}^{k}\theta_iV_i(t)\Big)\,\Big|\,N_t=n;\,J_1=\tau_1,\dots,J_n=\tau_n\Big) = \mathbb E\exp\Big(\sum_{i=1}^{k}\theta_i\sum_{j=1}^{n}\mathbf 1\big((\tau_j,(S_j^1,\dots,S_j^k))\in A_i(t)\big)\Big)$$
$$= \mathbb E\exp\Big(\sum_{j=1}^{n}\sum_{i=1}^{k}\theta_i\,\mathbf 1\big((\tau_j,(S_j^1,\dots,S_j^k))\in A_i(t)\big)\Big) = \prod_{j=1}^{n}\mathbb E\exp\Big(\sum_{i=1}^{k}\theta_i\,\mathbf 1\big((\tau_j,(S_j^1,\dots,S_j^k))\in A_i(t)\big)\Big).$$
Averaging over the jump times (which, conditional on $N_t=n$, are distributed as the order statistics of $n$ points chosen independently and uniformly on $(0,t)$) and summing over $n$ turns this into
$$\prod_{i=1}^{k}\exp\big(\Lambda_i(t)\,(e^{\theta_i}-1)\big),\qquad \Lambda_i(t) = \lambda\int_0^t P\Big(\sum_{l=1}^{i-1}S_l\le t-\tau<\sum_{l=1}^{i}S_l\Big)d\tau,$$
the joint MGF of independent Poisson random variables. By the uniqueness of a random variable with a given MGF, this implies that
$$V_i(t)\sim{\rm Po}\Big(\lambda\int_0^t P\Big(\sum_{l=1}^{i-1}S_l\le t-\tau<\sum_{l=1}^{i}S_l\Big)d\tau\Big),\quad\text{independently.}$$
In particular,
$$\mathbb E V_i(t) = \Lambda_i(t) = \lambda\int_0^t P\Big(\sum_{l=1}^{i-1}S_l\le s<\sum_{l=1}^{i}S_l\Big)ds = \lambda\,\mathbb E\int_0^t\mathbf 1(N_s=i-1)\,ds = P(N_t\ge i),$$
where $(N_s)$ denotes the Poisson process of rate $\lambda$ whose holding times are the (exponential) service times $S_1,S_2,\dots$, as in (4.6.2).
Finally, write $V_i(t,T)$ for the number of customers in queue $i$ at time $t$ after closing the entrance at time $T$. Then
$$\mathbb E V_i(t,T) = \lambda\int_0^T P\big(N_{t-\tau}=i-1\big)d\tau = \lambda\,\mathbb E\int_{t-T}^{t}\mathbf 1(N_s=i-1)\,ds = P(N_t\ge i)-P(N_{t-T}\ge i).$$
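The identity $\mathbb E V_i(t)=P(N_t\ge i)$ is easy to check numerically. The sketch below assumes i.i.d. exponential service times with the same rate $\lambda$ as the arrivals (the setting used in the display above); the parameter values and helper names are our own.

```python
import math
import random

def tandem_counts(lam, k, t, rng):
    """One realisation of (V_1(t),...,V_k(t)): Poisson(lam) arrivals on [0, t],
    each customer receiving i.i.d. Exp(lam) service at each of the k queues."""
    V = [0] * k
    s = rng.expovariate(lam)                  # first arrival time
    while s < t:
        done = s
        for i in range(k):
            done += rng.expovariate(lam)      # service time at queue i+1
            if done > t:                      # still being served there at time t
                V[i] += 1
                break
        s += rng.expovariate(lam)             # next arrival
    return V

def poisson_tail(mu, i):                      # P(Po(mu) >= i)
    return 1.0 - sum(math.exp(-mu) * mu**j / math.factorial(j) for j in range(i))

rng = random.Random(0)
lam, k, t, runs = 1.0, 3, 2.0, 20000
totals = [0] * k
for _ in range(runs):
    V = tandem_counts(lam, k, t, rng)
    for i in range(k):
        totals[i] += V[i]
for i in range(k):
    print(f"queue {i+1}: mean V = {totals[i]/runs:.3f},  P(N_t >= {i+1}) = {poisson_tail(lam*t, i+1):.3f}")
```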
(ii) Let $N_T^{\rm ch}$ be the number of checkouts used at time $T$. By the product theorem, 4.4.11, $N_T^{\rm ch}\sim{\rm Po}(\Lambda(T))$, where
$$\Lambda(T) = \lambda\int_0^T\!du\int_0^\infty f(s,u)\,\mathbf 1\big(u+s<T<u+s+g(s)\big)\,ds.$$
For a direct verification, write
$$N_T^{\rm ch} = \sum_{i=1}^{N_T^{\rm arr}}\mathbf 1\big(J_i+S_i<T<J_i+S_i+g(S_i)\big),$$
where $N_T^{\rm arr}$ is the number of arrivals by time $T$. Then
$$\mathbb E\exp\Big(\theta\sum_{i=1}^{N_T^{\rm arr}}\mathbf 1\big(J_i+S_i<T<J_i+S_i+g(S_i)\big)\Big)
= e^{-\lambda T}\sum_{k\ge 0}\frac{\lambda^k}{k!}\int_{[0,T]^k}\mathbb E\exp\Big(\theta\sum_{i=1}^{k}\mathbf 1\big(t_i+S_i<T<t_i+S_i+g(S_i)\big)\Big)dt_1\cdots dt_k$$
$$= e^{-\lambda T}\sum_{k\ge 0}\frac{\lambda^k}{k!}\Big(\int_0^T\mathbb E\exp\Big(\theta\,\mathbf 1\big(t+S<T<t+S+g(S)\big)\Big)dt\Big)^k$$
$$= \exp\Big(\lambda\int_0^T\Big[\mathbb E\exp\Big(\theta\,\mathbf 1\big(t+S<T<t+S+g(S)\big)\Big)-1\Big]dt\Big)
= \exp\Big(\lambda\,(e^{\theta}-1)\int_0^T\!\!\int_0^\infty f(s,u)\,\mathbf 1\big(u+s<T<u+s+g(s)\big)\,ds\,du\Big),$$
which is the MGF of a Poisson random variable with mean $\Lambda(T)$, as claimed.
measure with intensity $\mu\big([0,t]\times[0,y]\big) = \lambda t F(y)$, where $F(y)=\int_0^y h(x)\,dx$ (the time $t=0$ corresponds to 9am).
(a) Now, the number of students leaving the library between 3pm and 4pm (i.e. $6\le t\le 7$) has a Poisson distribution ${\rm Po}(\Lambda(A))$ where $A = \{(r,s):\ s\in[0,7],\ r\in[6-s,7-s] \text{ if } s\le 6;\ r\in[0,7-s] \text{ if } s>6\}$. Here
$$\Lambda(A) = \lambda\int_0^8 dF(r)\int_{(6-r)_+}^{(7-r)_+}ds = \lambda\int_0^8\big[(7-r)_+-(6-r)_+\big]\,dF(r).$$
So, the number of students leaving the library between 3pm and 4pm is Poisson with mean $\Lambda(A)$.
(b) Here
$$(7-y)_+-(6-y)_+ = \begin{cases} 0, & \text{if } y\ge 7,\\ 7-y, & \text{if } 6\le y\le 7,\\ 1, & \text{if } y\le 6.\end{cases}$$
The mean number of students leaving the library between 3pm and 4pm is
$$\Lambda(A) = \lambda\int_0^8\big[(7-y)_+-(6-y)_+\big]\,dF(y),$$
as required.
(c) For students still to be there at closing time we require $J+H\ge 8$, as $H$ ranges over $[0,8]$ and $J$ ranges over $[8-H,8]$. Let
$$B = \{(t,x):\ t\in[0,8],\ x\in[8-t,8]\}.$$
So,
$$\Lambda(B) = \lambda\int_0^8 dt\int_{8-t}^{8}dF(x) = \lambda\Big[8-\int_0^8(8-x)\,dF(x)\Big] = \lambda\Big[8-8\int_0^8 dF(x)+\int_0^8 x\,dF(x)\Big];$$
but $\int_0^8 dF(x)=1$ and $\int_0^8 x\,dF(x)=\mathbb E[H]=1$ imply $\int_0^8(8-x)\,dF(x)=7$. Hence, the expected number of students in the library at closing time is $(8-7)\lambda = \lambda$.
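For a concrete check of part (c), the following sketch simulates one day of arrivals and counts the students still present at closing time. The stay-time density is our choice ($H\sim{\rm U}(0,2)$, so that $\mathbb E H=1$ as above), and the arrival rate is arbitrary.

```python
import random

def students_at_closing(lam, rng, close=8.0):
    """One day: Poisson(lam) arrivals per hour on [0, close]; each student stays
    H hours, here H ~ U(0, 2) so that E[H] = 1 (this particular density is an
    assumption of the sketch, not part of the problem)."""
    count, s = 0, rng.expovariate(lam)
    while s < close:
        if s + rng.uniform(0.0, 2.0) > close:   # still in the library at closing
            count += 1
        s += rng.expovariate(lam)
    return count

rng = random.Random(0)
lam, days = 5.0, 20000
mean = sum(students_at_closing(lam, rng) for _ in range(days)) / days
print(f"simulated mean at closing: {mean:.3f}   predicted (8 - 7)*lam = {lam:.3f}")
```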
Problem 4.5 (i) Prove Campbell's theorem, i.e. show that if $M$ is a Poisson random measure on the state space $E$ with intensity measure $\mu$ and $a: E\to\mathbb R$ is a bounded measurable function, then
$$\mathbb E\big[e^{\theta X}\big] = \exp\Big(\int_E\big(e^{\theta a(y)}-1\big)\,\mu(dy)\Big),\qquad\text{where } X = \int_E a(y)\,M(dy). \qquad(4.6.3)$$
(ii) Shots are heard at jump times $J_1,J_2,\dots$ of a Poisson process with rate $\lambda$. The initial amplitudes of the gunshots $A_1,A_2,\dots\sim{\rm Exp}(2)$ are IID exponentially distributed with parameter $2$, and the amplitudes decay linearly at rate $\alpha$. Compute the MGF of the total amplitude $X_t$ at time $t$:
$$X_t = \sum_n A_n\big(1-\alpha(t-J_n)\big)_+\mathbf 1(J_n\le t);$$
here $x_+ = x$ if $x\ge 0$ and $0$ otherwise.
Solution (i) Conditional on $M(E)=n$, the $n$ points $Y_1,\dots,Y_n$ of $M$ are IID with common distribution $\mu(\cdot)/\mu(E)$, so
$$\mathbb E\Big[e^{\theta\sum_{k=1}^{n}a(Y_k)}\,\Big|\,M(E)=n\Big] = \Big(\int_E e^{\theta a(y)}\,\mu(dy)\big/\mu(E)\Big)^{n}.$$
Hence,
$$\mathbb E\big[e^{\theta X}\big] = \sum_n\mathbb E\big[e^{\theta X}\,\big|\,M(E)=n\big]\,P(M(E)=n) = \sum_n\Big(\int_E e^{\theta a(y)}\,\mu(dy)\big/\mu(E)\Big)^{n}e^{-\mu(E)}\frac{\mu(E)^n}{n!} = \exp\Big(\int_E\big(e^{\theta a(y)}-1\big)\,\mu(dy)\Big).$$
(ii) Fix $t$ and let $E = [0,t]\times\mathbb R_+$, and let $\mu$ and $M$ be such that $\mu(ds,dx) = 2\lambda e^{-2x}\,ds\,dx$, $M(B) = \sum_n\mathbf 1\{(J_n,A_n)\in B\}$. By the product theorem $M$ is a Poisson random measure with intensity measure $\mu$. Set $a_t(s,x) = x\big(1-\alpha(t-s)\big)_+$; then $X_t = \int a_t(s,x)\,M(ds,dx)$. So, by Campbell's theorem, for $\theta<2$,
$$\mathbb E\big[e^{\theta X_t}\big] = \exp\Big(2\lambda\int_0^t\!\!\int_0^\infty\Big(e^{\theta x(1-\alpha(t-s))_+}-1\Big)e^{-2x}\,dx\,ds\Big) = \exp\Big(\lambda\int_0^t\frac{\theta\big(1-\alpha(t-s)\big)_+}{2-\theta\big(1-\alpha(t-s)\big)_+}\,ds\Big).$$
Splitting the integral at $s = t-1/\alpha$ (below which the integrand vanishes) and integrating, we obtain
$$\mathbb E\big[e^{\theta X_t}\big] = e^{-\lambda\min[t,1/\alpha]}\left(\frac{2-\theta(1-\alpha t)_+}{2-\theta}\right)^{2\lambda/(\alpha\theta)};$$
in the case $t>1/\alpha$ this equals $e^{-\lambda/\alpha}\big(2/(2-\theta)\big)^{2\lambda/(\alpha\theta)}$.
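The closed form just obtained can be compared with an empirical MGF. In the sketch below the rate $\lambda$, decay rate $\alpha$, horizon $t$ and argument $\theta<2$ are arbitrary choices, and `mgf_formula` implements the expression derived above.

```python
import math
import random

def shot_noise(lam, alpha, t, rng):
    """One sample of X_t = sum_n A_n (1 - alpha*(t - J_n))_+ over arrivals J_n <= t,
    with amplitudes A_n ~ Exp(2) (rate 2)."""
    x, s = 0.0, rng.expovariate(lam)
    while s < t:
        x += rng.expovariate(2.0) * max(0.0, 1.0 - alpha * (t - s))
        s += rng.expovariate(lam)
    return x

def mgf_formula(theta, lam, alpha, t):
    """Closed form derived above (valid for theta < 2)."""
    c0 = max(0.0, 1.0 - alpha * t)
    return (math.exp(-lam * min(t, 1.0 / alpha))
            * ((2.0 - theta * c0) / (2.0 - theta)) ** (2.0 * lam / (alpha * theta)))

rng = random.Random(0)
lam, alpha, t, theta, runs = 1.5, 0.8, 3.0, 1.0, 200000
emp = sum(math.exp(theta * shot_noise(lam, alpha, t, rng)) for _ in range(runs)) / runs
print(f"empirical E[exp(theta*X_t)] = {emp:.4f},  formula = {mgf_formula(theta, lam, alpha, t):.4f}")
```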
Problem 4.6 Seeds are planted in a field $S\subseteq\mathbb R^2$. The random way they are sown means that they form a Poisson process $\Pi$ on $S$ with density $\lambda(x,y)$. The seeds grow into plants that are later harvested as a crop, and the weight of the plant at $(x,y)$ has mean $m(x,y)$ and variance $v(x,y)$. The weights of different plants are independent random variables. Show that the total weight $W$ of all the plants is a random variable with finite mean
$$I_1 = \iint_S m(x,y)\,\lambda(x,y)\,dx\,dy$$
and variance
$$I_2 = \iint_S\big[m(x,y)^2+v(x,y)\big]\,\lambda(x,y)\,dx\,dy.$$
Solution Suppose first that
$$\Lambda = \iint_S\lambda(x,y)\,dx\,dy$$
is finite. Then the number $N$ of plants is finite and has the distribution ${\rm Po}(\Lambda)$.
Conditional on $N$, their positions may be taken as independent random variables $(X_n,Y_n)$, $n=1,\dots,N$, with density $\lambda/\Lambda$ on $S$. The weights $W_n$ of the plants are then independent, with
$$\mathbb E W_n = \frac1\Lambda\iint_S m(x,y)\,\lambda(x,y)\,dx\,dy = \frac{I_1}{\Lambda}\quad\text{and}\quad \mathbb E W_n^2 = \frac1\Lambda\iint_S\big[m(x,y)^2+v(x,y)\big]\,\lambda(x,y)\,dx\,dy = \frac{I_2}{\Lambda}.$$
Consequently,
$$\mathbb E\big(W\,|\,N\big) = \sum_{n=1}^{N}\Lambda^{-1}I_1 = N\Lambda^{-1}I_1$$
and
$${\rm Var}\big(W\,|\,N\big) = \sum_{n=1}^{N}\big(\Lambda^{-1}I_2-\Lambda^{-2}I_1^2\big) = N\big(\Lambda^{-1}I_2-\Lambda^{-2}I_1^2\big).$$
Then
$$\mathbb E W = \mathbb E N\,\Lambda^{-1}I_1 = I_1$$
and, using $\mathbb E N = {\rm Var}\,N = \Lambda$,
$${\rm Var}\,W = \mathbb E\big[{\rm Var}(W\,|\,N)\big] + {\rm Var}\big(\mathbb E(W\,|\,N)\big) = \big(I_2-\Lambda^{-1}I_1^2\big)+\Lambda^{-1}I_1^2 = I_2.$$
If $\Lambda$ is infinite, partition $S$ into disjoint regions $S_k$ with $\iint_{S_k}\lambda\,dx\,dy<\infty$ and let $W(k)$ be the (independent) total weight of the plants in $S_k$; then
$$\mathbb E W = \sum_k\mathbb E W(k) = \sum_k\iint_{S_k}m\,\lambda\,dx\,dy = \iint_S m\,\lambda\,dx\,dy = I_1,$$
and similarly for the variance.
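A short simulation illustrates the identities $\mathbb E W=I_1$ and ${\rm Var}\,W=I_2$. The sketch below takes $S=[0,1]^2$ with a constant density and normally distributed weights (possibly negative, which does not affect the moment identities); all these specific choices and the helper names are ours, made only for illustration.

```python
import math
import random

def poisson(mu, rng):
    k, p, c, u = 0, math.exp(-mu), math.exp(-mu), rng.random()
    while u > c:
        k += 1
        p *= mu / k
        c += p
    return k

def total_weight(lam, m, v, rng):
    """One realisation of W: a Poisson(lam) number of plants uniform on the unit
    square, each with an independent N(m(x,y), v(x,y)) weight."""
    n = poisson(lam, rng)
    w = 0.0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        w += rng.gauss(m(x, y), math.sqrt(v(x, y)))
    return w

rng = random.Random(0)
lam = 10.0                                   # constant density on S = [0,1]^2
m = lambda x, y: 1.0 + x                     # mean weight
v = lambda x, y: 0.25                        # weight variance
samples = [total_weight(lam, m, v, rng) for _ in range(40000)]
mean = sum(samples) / len(samples)
var = sum((w - mean) ** 2 for w in samples) / len(samples)
# I1 = lam * integral of (1+x) = 15;  I2 = lam * integral of ((1+x)^2 + 0.25)
I1, I2 = 15.0, 10.0 * (7.0 / 3.0 + 0.25)
print(f"E[W]   ~ {mean:.3f}   I1 = {I1:.3f}")
print(f"Var[W] ~ {var:.3f}   I2 = {I2:.3f}")
```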
Problem 4.7 A Poisson process $\Lambda$ of lines in the plane is specified by the measure of a set $B$ of lines,
$$\mu(B) = \lambda\iint_B dp\,d\theta, \qquad(4.6.4)$$
where a line is parametrised by the length $p$ of, and the angle $\theta$ made by, the perpendicular from the origin to the line. Let $\Sigma$ be the set of intersection points of pairs of lines of $\Lambda$. Is $\Sigma$ a Poisson process?
Solution Suppose that $\mu$ is a measure on the space $L$ of lines in $\mathbb R^2$ not passing through $0$. A Poisson process $\Pi$ with mean measure $\mu$ is a random countable subset of $L$ such that
(1) the number $N(A)$ of points of $\Pi$ in a measurable subset $A$ of $L$ has distribution ${\rm Po}(\mu(A))$, and
(2) for disjoint $A_1,\dots,A_n$, the $N(A_j)$ are independent.
In the problem, the number $N$ of lines which meet the disc $D$ of centre $0$ and radius $r$ equals the number of lines with $p<r$. It is Poisson with mean
$$\int_0^{r}\!\!\int_0^{2\pi}\lambda\,dp\,d\theta = 2\pi\lambda r.$$
If there is at least one point of $\Sigma$ in $D$ then there must be at least two lines of $\Lambda$ meeting $D$, and this has probability
$$\sum_{n\ge 2}\frac{(2\pi\lambda r)^n}{n!}e^{-2\pi\lambda r} = 1-(1+2\pi\lambda r)e^{-2\pi\lambda r}.$$
The probability of a point of $\Sigma$ lying in $D$ is strictly less than this, because there may be two lines meeting $D$ whose intersection lies outside $D$.
Finally, $\Sigma$ is not a Poisson process, since, with positive probability, it has collinear points.
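The strict inequality in the last remark can be seen numerically. The sketch below samples the lines meeting the disc $D$ directly (their number is ${\rm Po}(2\pi\lambda r)$, with $p$ and $\theta$ uniform given that), finds pairwise intersections, and compares the frequency of an intersection inside $D$ with the bound $1-(1+2\pi\lambda r)e^{-2\pi\lambda r}$; the numerical parameters are arbitrary.

```python
import math
import random

def poisson(mu, rng):
    x, p, c, u = 0, math.exp(-mu), math.exp(-mu), rng.random()
    while u > c:
        x += 1
        p *= mu / x
        c += p
    return x

def lines_hitting_disc(lam, r, rng):
    """Lines of the process meeting D: N ~ Po(2*pi*lam*r); each has p ~ U(0, r),
    theta ~ U(0, 2*pi), and equation x*cos(theta) + y*sin(theta) = p."""
    n = poisson(2 * math.pi * lam * r, rng)
    return [(rng.uniform(0, r), rng.uniform(0, 2 * math.pi)) for _ in range(n)]

def has_intersection_in_disc(lines, r):
    for i in range(len(lines)):
        p1, t1 = lines[i]
        for j in range(i + 1, len(lines)):
            p2, t2 = lines[j]
            det = math.cos(t1) * math.sin(t2) - math.sin(t1) * math.cos(t2)
            if abs(det) < 1e-12:
                continue                       # (almost) parallel lines
            x = (p1 * math.sin(t2) - p2 * math.sin(t1)) / det
            y = (math.cos(t1) * p2 - math.cos(t2) * p1) / det
            if x * x + y * y < r * r:
                return True
    return False

rng = random.Random(0)
lam, r, runs = 0.3, 1.0, 20000
hit = sum(has_intersection_in_disc(lines_hitting_disc(lam, r, rng), r) for _ in range(runs)) / runs
bound = 1 - (1 + 2 * math.pi * lam * r) * math.exp(-2 * math.pi * lam * r)
print(f"P(some point of Sigma in D) ~ {hit:.3f}  <  P(at least two lines meet D) = {bound:.3f}")
```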
Problem 4.8 Particular cases of the Poisson–Dirichlet distribution for the random sequence $(p_1,p_2,p_3,\dots)$ with parameter $\theta$ appeared in PSE II; the definition is given below. Show that, for any polynomial $\phi$ with $\phi(0)=0$,
$$\mathbb E\Big[\sum_{n\ge 1}\phi(p_n)\Big] = \theta\int_0^1\phi(x)\,x^{-1}(1-x)^{\theta-1}\,dx. \qquad(4.6.5)$$
Here the sequence $(p_1,p_2,\dots)$ is constructed as follows. Let $\sigma_1\ge\sigma_2\ge\dots$ be the points of a Poisson process on $(0,\infty)$ with intensity measure $\theta x^{-1}e^{-x}\,dx$ (where $\theta>0$ can be arbitrary). Then the sum $\sigma=\sum_{n\ge 1}\sigma_n$ is finite with probability $1$, has the Gamma distribution ${\rm Gam}(\theta)$ and is independent from the vector $p=(p_1,p_2,\dots)$ of the ratios $p_n=\sigma_n/\sigma$, with
$$p_1\ge p_2\ge\dots,\qquad \sum_{n\ge 1}p_n = 1,\quad\text{with probability } 1.$$
Here Gam stands for the Gamma distribution; see PSE I, Appendix.
To prove (4.6.5), we can take $p_n=\sigma_n/\sigma$ and use the fact that $\sigma$ and $p$ are independent. For $k\ge 1$,
$$\mathbb E\sum_{n\ge 1}\sigma_n^k = \theta\int_0^\infty x^k\,x^{-1}e^{-x}\,dx = \theta\,\Gamma(k).$$
On the other hand, since $\sigma_n^k=\sigma^k p_n^k$ and $\sigma\sim{\rm Gam}(\theta)$ is independent of $p$,
$$\mathbb E\sum_{n\ge 1}\sigma_n^k = \mathbb E\,\sigma^k\ \mathbb E\sum_{n\ge 1}p_n^k = \frac{\Gamma(\theta+k)}{\Gamma(\theta)}\,\mathbb E\sum_{n\ge 1}p_n^k.$$
Thus,
$$\mathbb E\sum_{n\ge 1}p_n^k = \frac{\theta\,\Gamma(k)\Gamma(\theta)}{\Gamma(k+\theta)} = \theta\int_0^1 x^{k-1}(1-x)^{\theta-1}\,dx.$$
We see that the identity (4.6.5) holds for $\phi(x)=x^k$ (with $k\ge 1$) and hence, by linearity, for all polynomials $\phi$ with $\phi(0)=0$.
492
0 b
a
x1 (1 x) 1 dx.
If a > 1/2, there can be at most one such pn , so that p1 has the PDF
x1 (1 x) 1 on (1/2, 1).
But this fails on (0, 1/2), and the identity (4.6.5) does not determine the distribution
of p1 on this interval.
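The moment identity $\mathbb E\sum_n p_n^k=\theta\,{\rm B}(k,\theta)$ can be checked by simulation. The sketch below generates the atoms via the stick-breaking (GEM$(\theta)$) representation, whose atoms are a size-biased reordering of the Poisson–Dirichlet atoms, so sums over all atoms are unchanged; the truncation tolerance and parameter values are our choices.

```python
import math
import random

def pd_power_sum(theta, k, rng, tol=1e-12):
    """One sample of sum_n p_n^k for PD(theta), generated through stick breaking:
    successive proportions are Beta(1, theta).  For k >= 1 the truncated tail
    contributes at most the remaining mass, so the tolerance controls the error."""
    total, remaining = 0.0, 1.0
    while remaining > tol:
        w = rng.betavariate(1.0, theta)        # stick-breaking proportion
        p = remaining * w
        total += p ** k
        remaining -= p
    return total

theta, k, runs = 2.0, 3, 20000
rng = random.Random(0)
emp = sum(pd_power_sum(theta, k, rng) for _ in range(runs)) / runs
exact = theta * math.gamma(k) * math.gamma(theta) / math.gamma(k + theta)   # theta*B(k, theta)
print(f"E[sum p_n^{k}] ~ {emp:.4f}   theta*B(k, theta) = {exact:.4f}")
```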
Problem 4.9 The positions of trees in a large forest can be modelled as a Poisson process $\Pi$ of constant rate $\lambda$ on $\mathbb R^2$. Each tree produces a random number of seeds having a Poisson distribution with mean $\mu$. Each seed falls to earth at a point uniformly distributed over the circle of radius $r$ whose centre is the tree. The positions of the different seeds relative to their parent tree, and the numbers of seeds produced by a given tree, are independent of each other and of $\Pi$. Prove that, conditional on $\Pi$, the seeds form a Poisson process $\Pi'$ whose mean measure depends on $\Pi$. Is the unconditional distribution of $\Pi'$ that of a Poisson process?
Solution By a direct calculation, the seeds from a tree at $X$ form a Poisson process with rate
$$\lambda_X(x) = \begin{cases}\mu\big/(\pi r^2), & |x-X|<r,\\ 0, & \text{otherwise.}\end{cases}$$
Superposing these independent Poisson processes gives a Poisson process with rate
$$\Lambda(x) = \sum_{X\in\Pi}\lambda_X(x);$$
it clearly depends on $\Pi$. The unrealistic assumption of a circular uniform distribution is chosen to create no doubt about this dependence: in this case $\Pi$ can be reconstructed from the contours of $\Lambda$.
Here we meet for the first time the doubly stochastic (Cox) processes, i.e. Poisson processes with random intensity. The number of seeds $N(A)$ in a bounded set $A$ has mean
$$\mathbb E N(A) = \mathbb E\int_A\Lambda(x)\,dx$$
and variance
$${\rm Var}\,N(A) = \mathbb E\big[{\rm Var}\big(N(A)\,|\,\Pi\big)\big]+{\rm Var}\big(\mathbb E\big[N(A)\,|\,\Pi\big]\big) = \mathbb E N(A)+{\rm Var}\int_A\Lambda(x)\,dx > \mathbb E N(A).$$
Hence, $\Pi'$ is not a Poisson process.
Problem 4.10 A uniform Poisson process $\Pi$ in the unit ball of $\mathbb R^3$ is one whose mean measure is Lebesgue measure (volume) on
$$B = \{(x,y,z)\in\mathbb R^3:\ r^2 = x^2+y^2+z^2\le 1\}.$$
Show that
$$\Pi_1 = \{r:\ (x,y,z)\in\Pi\}$$
is a Poisson process on $[0,1]$ and find its mean measure. Show that
$$\Pi_2 = \{(x/r,\,y/r,\,z/r):\ (x,y,z)\in\Pi\}$$
is a Poisson process on the unit sphere and find its mean measure. Are $\Pi_1$ and $\Pi_2$ independent?
Solution By the mapping theorem, $\Pi_1$ and $\Pi_2$ are Poisson processes. The expected number of points of $\Pi_1$ in an interval $(a,b)\subseteq[0,1]$ equals the volume of the spherical shell $\{a<r<b\}$, i.e.
$$\frac{4\pi}{3}\big(b^3-a^3\big).$$
Thus, the mean measure of $\Pi_1$ has the PDF $4\pi r^2$ $(0<r<1)$. Similarly, the expected number of points of $\Pi_2$ in $A\subseteq\partial B$ equals
$$\text{(the conic volume from $0$ to $A$)} = \frac13\times\text{(the surface area of $A$)}.$$
Finally, $\Pi_1$ and $\Pi_2$ are not independent since they have the same number of points.
Problem 4.11 The points of $\Pi$ are coloured randomly either red or green, the probability of any point being red being $r$, $0<r<1$, and the colours of different points being independent. Show that the red and the green points form independent Poisson processes.
Solution Let $\mu$ be the mean measure of $\Pi$, and let $N_1(A)$, $N_2(A)$ be the numbers of red and green points in a set $A$ with $\mu(A)<\infty$. Conditional on $N(A)=N_1(A)+N_2(A)=n$, the number of red points is binomial ${\rm Bin}(n,r)$, so that, for $j,k\ge 0$,
$$P\big(N_1(A)=j,\,N_2(A)=k\big) = e^{-\mu(A)}\frac{\mu(A)^{j+k}}{(j+k)!}\binom{j+k}{j}r^j(1-r)^k = e^{-r\mu(A)}\frac{(r\mu(A))^j}{j!}\;e^{-(1-r)\mu(A)}\frac{((1-r)\mu(A))^k}{k!}.$$
Hence, $N_1(A)$ and $N_2(A)$ are independent Poisson random variables with means $r\mu(A)$ and $(1-r)\mu(A)$, respectively.
If A1 , A2 , . . . are disjoint sets then the pairs
(N1 (A1 ), N2 (A1 )), (N1 (A2 ), N2 (A2 )), . . .
are independent, and hence
N1 (A1 ), N1 (A2 ), . . . and N2 (A1 ), N2 (A2 ), . . .
are two independent sequences of independent random variables. If $\mu(A)=\infty$ then $N(A)=\infty$ a.s., and since $r>0$ and $1-r>0$, there are a.s. infinitely many red and green points in $A$.
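A minimal simulation of the colouring theorem for a single set $A$: with $N(A)\sim{\rm Po}(\mu(A))$ points coloured red independently with probability $r$, the red and green counts should have means $r\mu(A)$, $(1-r)\mu(A)$ and vanishing covariance. The parameter values below are arbitrary.

```python
import math
import random

def poisson(mu, rng):
    k, p, c, u = 0, math.exp(-mu), math.exp(-mu), rng.random()
    while u > c:
        k += 1
        p *= mu / k
        c += p
    return k

def colour_counts(mu, r, rng):
    """One realisation: N ~ Po(mu) points, each independently red with probability r."""
    n = poisson(mu, rng)
    red = sum(1 for _ in range(n) if rng.random() < r)
    return red, n - red

rng = random.Random(0)
mu, r, runs = 4.0, 0.3, 50000
reds = greens = cross = 0.0
for _ in range(runs):
    n1, n2 = colour_counts(mu, r, rng)
    reds += n1; greens += n2; cross += n1 * n2
m1, m2 = reds / runs, greens / runs
print(f"mean red   = {m1:.3f}   (r*mu = {r*mu:.3f})")
print(f"mean green = {m2:.3f}   ((1-r)*mu = {(1-r)*mu:.3f})")
print(f"cov(red, green) = {cross/runs - m1*m2:.4f}   (close to 0, as independence predicts)")
```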
Problem 4.12 A model of a rainstorm falling on a level surface (taken to be the plane $\mathbb R^2$) describes each raindrop by a triple $(X,T,V)$, where $X\in\mathbb R^2$ is the horizontal position of the centre of the drop, $T$ is the instant at which the drop hits the plane, and $V$ is the volume of water in the drop. The points $(X,T,V)$ are assumed to form a Poisson process on $\mathbb R^4$ with a given rate $\lambda(x,t,v)$. The drop forms a wet circular patch on the surface, with centre $X$ and a radius that increases with time, the radius at time $T+t$ being a given function $r(t,V)$. Find the probability that a point $\xi\in\mathbb R^2$ is dry at time $\tau$, and show that the total rainfall in the storm has expectation
$$\int_{\mathbb R^4}v\,\lambda(x,t,v)\,dx\,dt\,dv.$$
Solution The point $\xi$ is wet at time $\tau$ if and only if some point $(X,T,V)$ of the process satisfies $T<\tau$ and $\|X-\xi\|<r(\tau-T,V)$; the number of such points is Poisson with mean
$$\Lambda = \int\lambda(x,t,v)\,\mathbf 1\big(t<\tau,\ \|x-\xi\|<r(\tau-t,v)\big)\,dx\,dt\,dv.$$
Hence, the probability that $\xi$ is dry is $e^{-\Lambda}$ (or $0$ if $\Lambda=+\infty$). Finally, the formula for the expected total rainfall,
$$\mathbb E\sum_{(X,T,V)}V = \int_{\mathbb R^4}v\,\lambda(x,t,v)\,dx\,dt\,dv,$$
follows from Campbell's theorem (cf. (4.6.3)).
where
$$d_{jk} = \sum_{1\le t\le n}\frac{(a_{jt}-a_{kt})^2}{v_t}. \qquad(4.6.6)$$
Suppose that $M=2$ and that the transmitted waveforms are subject to the power constraint $\sum_{1\le t\le n}a_{jt}^2\le K$, $j=1,2$. Which choice of the two waveforms minimises the probability of error?
[Hint: You may assume validity of the bound $P(Z\ge a)\le\exp(-a^2/2)$, where $Z$ is a standard ${\rm N}(0,1)$ random variable.]
Solution Let $f_j = f_{\rm ch}(y\,|\,X=A_j)$ be the PDF of receiving a vector $y$ given that a waveform $A_j=(a_{jt})$ was transmitted. Then
$$P({\rm error})\le\frac1M\sum_j\sum_{k:\,k\ne j}P\big(\{y:\ f_k(y)\ge f_j(y)\}\,\big|\,X=A_j\big).$$
Let $V$ be the diagonal matrix with the diagonal elements $v_t$. In the present case,
$$f_j = C\exp\Big(-\frac12\sum_{t=1}^{n}(y_t-a_{jt})^2/v_t\Big) = C\exp\Big(-\frac12(Y-A_j)^{\rm T}V^{-1}(Y-A_j)\Big).$$
Then if $X=A_j$ and $Y=A_j+\varepsilon$ we have
$$\log f_k-\log f_j = -\frac12(A_j-A_k+\varepsilon)^{\rm T}V^{-1}(A_j-A_k+\varepsilon)+\frac12\varepsilon^{\rm T}V^{-1}\varepsilon = -\frac12 d_{jk}-(A_j-A_k)^{\rm T}V^{-1}\varepsilon = -\frac12 d_{jk}+\sqrt{d_{jk}}\,Z,$$
where $Z\sim{\rm N}(0,1)$.
Hence, by the hint,
$$P\big(f_k\ge f_j\,\big|\,X=A_j\big) = P\big(Z\ge\tfrac12\sqrt{d_{jk}}\big)\le e^{-d_{jk}/8},$$
so for $M=2$ we minimise the bound on the error probability by maximising
$$d_{12} = \sum_{1\le t\le n}\frac{(a_{1t}-a_{2t})^2}{v_t} = (A_1-A_2)^{\rm T}V^{-1}(A_1-A_2)$$
subject to
$$\sum_{1\le t\le n}a_{jt}^2\le K,\quad\text{or}\quad A_j^{\rm T}A_j\le K,\ j=1,2.$$
By Cauchy--Schwarz,
$$(A_1-A_2)^{\rm T}V^{-1}(A_1-A_2)\le 2\big(A_1^{\rm T}V^{-1}A_1+A_2^{\rm T}V^{-1}A_2\big), \qquad(4.6.7)$$
with equality holding when $A_1=-A_2$. Further, in our case $V$ is diagonal, and the right-hand side of (4.6.7), subject to the power constraint, is maximised when $A_j^{\rm T}A_j=K$, $j=1,2$, with all the power placed on the coordinates where the noise variance is smallest. We conclude that
$$a_{1t} = -a_{2t} = b_t,$$
with $b_t$ non-zero only for $t$ such that $v_t$ is minimal, and $\sum_t b_t^2 = K$.
Problem 4.15 A random variable $Y$ is distributed on the non-negative integers. Show that the maximum entropy of $Y$, subject to $\mathbb E Y\le M$, is
$$(M+1)\log(M+1)-M\log M.$$
A memoryless channel adds to its input $X\in\{0,1,2,\dots\}$ an independent noise $Z$ with $P(Z=1)=p$, $P(Z=0)=q=1-p$, where $p\le 1/3$, so that the output is $Y=X+Z$; the input is constrained by $\mathbb E X\le q$. Show that the entropy-maximising output distribution is $P(Y=r)=2^{-(r+1)}$, $r\ge 0$, and determine the capacity of the channel.
Describe, very briefly, the problem of determining the channel capacity if $p>1/3$.
Solution For the first part, we
$$\text{maximise } h(Y) = -\sum_{y\ge 0}p_y\log p_y\ \text{ subject to }\ p_y\ge 0,\ \sum_y p_y=1,\ \sum_y y\,p_y = M.$$
The Lagrangian method (or Gibbs' inequality) gives the geometric maximiser
$$p_y = (1-\lambda)\lambda^y,\quad y\ge 0,\qquad\text{with}\quad \lambda = \frac{M}{M+1},\ \ 1-\lambda = \frac1{M+1},$$
whose entropy equals $(M+1)\log(M+1)-M\log M$, as required; since this value increases with $M$, the same distribution is also optimal under the weaker constraint $\mathbb E Y\le M$.
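As a numerical illustration, the sketch below evaluates the entropy of the mean-$M$ geometric distribution and compares it with the claimed maximum and with the entropy of another mean-$M$ law on the non-negative integers (Poisson, computed here purely for comparison).

```python
import math

def geometric_entropy(M):
    """Entropy (nats) of p_y = (1/(M+1)) * (M/(M+1))**y, the mean-M geometric law."""
    lam = M / (M + 1.0)
    return -math.log(1.0 - lam) - M * math.log(lam)

def poisson_entropy(M, terms=500):
    """Entropy (nats) of Po(M), another mean-M law, for comparison."""
    h, p = 0.0, math.exp(-M)
    for y in range(terms):
        if p > 0.0:
            h -= p * math.log(p)
        p *= M / (y + 1)
    return h

for M in (0.5, 1.0, 3.0):
    target = (M + 1) * math.log(M + 1) - M * math.log(M)
    print(f"M={M}: geometric {geometric_entropy(M):.6f}, "
          f"claimed max {target:.6f}, Poisson {poisson_entropy(M):.6f}")
```

For every $M$ the geometric value coincides with $(M+1)\log(M+1)-M\log M$ and dominates the Poisson entropy, as the maximisation predicts.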
For the channel, write the output probability generating function as
$$\mathbb E z^Y = \mathbb E z^X\,\mathbb E z^Z = \mathbb E z^X(q+pz).$$
The entropy-maximising output is geometric with mean $1$, i.e. $\mathbb E z^Y = 1/(2-z)$, so the required input PGF is
$$\mathbb E z^X = \frac1{(2-z)(q+pz)} = \frac1{1+p}\Big[(2-z)^{-1}+p\,(q+pz)^{-1}\Big] = \frac1{1+p}\sum_{r\ge 0}\Big[\big(1/2\big)^{1+r}+\big(p/q\big)\big(-p/q\big)^{r}\Big]z^r.$$
For $p\le 1/3$ these coefficients are non-negative, so the optimum is attainable; the output $Y$ is then geometric with mean $1$, and the capacity equals
$$h(Y)-h(Z) = 2\log 2+p\log p+q\log q.$$
If $p>1/3$ then $p/q>1/2$ and the alternate probabilities become negative, which means that there is no distribution for $X$ giving an optimum for $Y$. Then we would have to maximise
$$-\sum_y p_y\log p_y,\quad\text{subject to } p_y = p\,\pi_{y-1}+q\,\pi_y,$$
where $\pi_y\ge 0$, $\sum_y\pi_y = 1$ and $\sum_y y\,\pi_y\le q$.
Problem 4.16 Assuming the bounds on channel capacity asserted by the second coding theorem, deduce the capacity of a memoryless Gaussian channel.
A channel consists of $r$ independent memoryless Gaussian channels, the noise in the $i$th channel having variance $v_i$, $i=1,2,\dots,r$. The compound channel is subject to an overall power constraint $\mathbb E\sum_i x_{it}^2\le p$, for each $t$, where $x_{it}$ is the input of the $i$th channel at time $t$. Determine the capacity of this compound channel.
Solution By the second coding theorem, the capacity of the compound channel is
$$C = \max\Big\{\sum_{1\le i\le r}\frac12\log\Big(1+\frac{p_i}{v_i}\Big):\ p_i\ge 0,\ \sum_i p_i\le p\Big\}.$$
The Lagrangian $L = \sum_i\frac12\log\big(1+p_i/v_i\big)-\lambda\sum_i p_i$ has
$$\frac{\partial L}{\partial p_i} = \frac12\big(v_i+p_i\big)^{-1}-\lambda,\qquad i=1,\dots,r,$$
and the maximum is attained at
$$p_i = \max\Big[0,\ \frac1{2\lambda}-v_i\Big] = \Big(\frac1{2\lambda}-v_i\Big)_+,$$
where $\lambda>0$ is chosen so that
$$\sum_i\Big(\frac1{2\lambda}-v_i\Big)_+ = p.$$
The existence and uniqueness of $\lambda$ follows since the LHS monotonically decreases from $+\infty$ to $0$. Thus,
$$C = \frac12\sum_i\Big(\log\frac1{2\lambda v_i}\Big)_+.$$
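The optimisation above is the classical water-filling allocation, and the level $1/(2\lambda)$ can be found by a direct search rather than by solving for $\lambda$ explicitly. The sketch below (our own helper, not from the text) computes the allocation and the resulting capacity in nats for a few power budgets.

```python
import math

def water_filling(v, p):
    """Water-filling over channels with noise variances v and total power p.
    Returns (allocation p_i = (w - v_i)_+ with channels sorted by noise,
    capacity in nats), where the level w = 1/(2*lambda) solves sum (w - v_i)_+ = p."""
    v = sorted(v)
    for j in range(len(v), 0, -1):           # try filling the j least noisy channels
        w = (p + sum(v[:j])) / j
        if w > v[j - 1]:                      # consistent: all j of them are active
            break
    alloc = [max(0.0, w - vi) for vi in v]
    cap = 0.5 * sum(math.log(w / vi) for vi in v if vi < w)
    return alloc, cap

v = [1.0, 2.0, 4.0]
for p in (0.5, 3.0, 10.0):
    alloc, cap = water_filling(v, p)
    print(f"p = {p}: allocation = {[round(a, 3) for a in alloc]}, capacity = {cap:.4f} nats")
```

For small $p$ all power goes to the least noisy channel; as $p$ grows the water level rises above the larger $v_i$ and the remaining channels are switched on, exactly as $p_i=(1/(2\lambda)-v_i)_+$ prescribes.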
Problem 4.17 Here we consider random variables taking values in a given set $A$ (finite, countable or uncountable) whose distributions are determined by PMFs with respect to a given reference measure $\nu$. Let $\phi$ be a real function and $\alpha$ a real number. Prove that the maximum $h_{\max}(X)$ of the entropy $h(X) = -\int f_X(x)\log f_X(x)\,\nu(dx)$ subject to the constraint $\mathbb E\,\phi(X)=\alpha$ is achieved at the random variable $X$ with the PMF
$$f_X(x) = \frac1\Xi\exp\big(-\beta\phi(x)\big), \qquad(4.6.8{\rm a})$$
where $\Xi = \Xi(\beta) = \int\exp\big(-\beta\phi(x)\big)\,\nu(dx)$ is the normalising constant and $\beta$ is chosen so that
$$\frac1\Xi\int\phi(x)\exp\big(-\beta\phi(x)\big)\,\nu(dx) = \alpha. \qquad(4.6.8{\rm b})$$
Assume that a value $\beta$ with the property (4.6.8b) exists.
Show that if, in addition, the function $\phi$ is non-negative, then, for any given $\alpha>0$, the PMF $f_X$ from (4.6.8a), (4.6.8b) maximises the entropy $h(X)$ under the wider constraint $\mathbb E\,\phi(X)\le\alpha$.
Consequently, calculate the maximal value of $h(X)$ subject to $\mathbb E\,\phi(X)\le\alpha$, in the following cases: (i) when $A$ is a finite set, $\nu$ is a positive measure on $A$ (with $\nu_i = \nu(\{i\})$ and $\nu(A) = \sum_{j\in A}\nu_j$) and $\phi(x)\equiv 1$, $x\in A$; (ii) when $A$ is
Bibliography
[18] R.E. Blahut. Principles and Practice of Information Theory. Reading, MA:
Addison-Wesley, 1987.
[19] R.E. Blahut. Theory and Practice of Error Control Codes. Reading, MA: Addison-Wesley, 1983. See also Algebraic Codes for Data Transmission. Cambridge:
Cambridge University Press, 2003.
[20] R.E. Blahut. Algebraic Codes on Lines, Planes, and Curves. Cambridge:
Cambridge University Press, 2008.
[21] I.F. Blake, R.C. Mullin. The Mathematical Theory of Coding. New York: Academic
Press, 1975.
[22] I.F. Blake, R.C. Mullin. An Introduction to Algebraic and Combinatorial Coding
Theory. New York: Academic Press, 1976.
[23] I.F. Blake (ed). Algebraic Coding Theory: History and Development. Stroudsburg,
PA: Dowden, Hutchinson & Ross, 1973.
[24] N. Blachman. Noise and its Effect on Communication. New York: McGraw-Hill,
1966.
[25] R.C. Bose, D.K. Ray-Chaudhuri. On a class of error correcting binary group codes. Information and Control, 3(1), 68–79, 1960.
[26] W. Bradley, Y.M. Suhov. The entropy of famous reals: some empirical results.
Random and Computational Dynamics, 5, 349359, 1997.
[27] A.A. Bruen, M.A. Forcinito. Cryptography, Information Theory, and ErrorCorrection: A Handbook for the 21st Century. Hoboken, NJ: Wiley-Interscience,
2005.
[28] J.A. Buchmann. Introduction to Cryptography. New York: Springer-Verlag, 2002.
[29] P.J. Cameron, J.H. van Lint. Designs, Graphs, Codes and their Links. Cambridge:
Cambridge University Press, 1991.
[30] J. Castineira Moreira, P.G. Farrell. Essentials of Error-Control Coding. Chichester:
Wiley, 2006.
[31] W.G. Chambers. Basics of Communications and Coding. Oxford: Clarendon, 1985.
[32] G.J. Chaitin. The Limits of Mathematics: A Course on Information Theory and the
Limits of Formal Reasoning. Singapore: Springer, 1998.
[33] G. Chaitin. Information-Theoretic Incompleteness. Singapore: World Scientific,
1992.
[34] G. Chaitin. Algorithmic Information Theory. Cambridge: Cambridge University
Press, 1987.
[35] F. Conway, J. Siegelman. Dark Hero of the Information Age: In Search of Norbert
Wiener, the Father of Cybernetics. New York: Basic Books, 2005.
[36] T.M. Cover, J.M. Thomas. Elements of Information Theory. New York: Wiley,
2006.
[37] I. Csiszar, J. Korner. Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic Press, 1981; Budapest: Akademiai Kiado,
1981.
[38] W.B. Davenport, W.L. Root. Random Signals and Noise. New York: McGraw Hill,
1958.
[39] A. Dembo, T. M. Cover, J. A. Thomas. Information theoretic inequalities. IEEE
Transactions on Information Theory, 37, (6), 15011518, 1991.
[40] R.L. Dobrushin. Taking the limit of the argument of entropy and information functions. Teoriya Veroyatn. Primen., 5, (1), 2937, 1960; English translation: Theory
of Probability and its Applications, 5, 2532, 1960.
[41] F. Dyson. The Tragic Tale of a Genius. New York Review of Books, July 14, 2005.
[42] W. Ebeling. Lattices and Codes: A Course Partially Based on Lectures by F. Hirzebruch. Braunschweig/Wiesbaden: Vieweg, 1994.
[43] N. Elkies. Excellent codes from modular curves. STOC01: Proceedings of the
33rd Annual Symposium on Theory of Computing (Hersonissos, Crete, Greece),
pp. 200208, NY: ACM, 2001.
[44] S. Engelberg. Random Signals and Noise: A Mathematical Introduction. Boca Raton, FL: CRC/Taylor & Francis, 2007.
[45] R.M. Fano. Transmission of Information: A Statistical Theory of Communication.
New York: Wiley, 1961.
[46] A. Feinstein. Foundations of Information Theory. New York: McGraw-Hill, 1958.
[47] G.D. Forney. Concatenated Codes. Cambridge, MA: MIT Press, 1966.
[48] M. Franceschetti, R. Meester. Random Networks for Communication. From Statistical Physics to Information Science. Cambridge: Cambridge University Press,
2007.
[49] R. Gallager. Information Theory and Reliable Communications. New York: Wiley,
1968.
[50] A. Gofman, M. Kelbert. An upper bound for Kullback–Leibler divergence with a small number of outliers. Mathematical Communications, 18, (1), 75–78, 2013.
[51] S. Goldman. Information Theory. Englewood Cliffs, NJ: Prentice-Hall, 1953.
[52] C.M. Goldie, R.G.E. Pinch. Communication Theory. Cambridge: Cambridge
University Press, 1991.
[53] O. Goldreich. Foundations of Cryptography, Vols 1, 2. Cambridge: Cambridge
University Press, 2001, 2004.
[54] V.D. Goppa. Geometry and Codes. Dordrecht: Kluwer, 1988.
[55] S. Gravano. Introduction to Error Control Codes. Oxford: Oxford University Press,
2001.
[56] R.M. Gray. Source Coding Theory. Boston: Kluwer, 1990.
[57] R.M. Gray. Entropy and Information Theory. New York: Springer-Verlag, 1990.
[58] R.M. Gray, L.D. Davisson (eds). Ergodic and Information Theory. Stroudsburg,
CA: Dowden, Hutchinson & Ross, 1977 .
[59] V. Guruswami, M. Sudan. Improved decoding of ReedSolomon codes and algebraic geometry codes. IEEE Trans. Inform. Theory, 45, (6), 17571767, 1999.
[60] R.W. Hamming. Coding and Information Theory. 2nd ed. Englewood Cliffs, NJ:
Prentice-Hall, 1986.
[61] T.S. Han. Information-Spectrum Methods in Information Theory. New York:
Springer-Verlag, 2002.
[62] D.R. Hankerson, G.A. Harris, P.D. Johnson, Jr. Introduction to Information Theory
and Data Compression. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC, 2003.
[63] D.R. Hankerson et al. Coding Theory and Cryptography: The Essentials. 2nd ed.
New York: M. Dekker, 2000. (Earlier version: D. G. Hoffman et al. Coding Theory:
The Essentials. New York: M. Dekker, 1991.)
[64] W.E. Hartnett. Foundations of Coding Theory. Dordrecht: Reidel, 1974.
[65] S.J. Heims. John von Neumann and Norbert Wiener: From Mathematics to the
Technologies of Life and Death. Cambridge, MA: MIT Press, 1980.
[66] C. Helstrom. Statistical Theory of Signal Detection. 2nd ed. Oxford: Pergamon
Press, 1968.
[67] C.W. Helstrom. Elements of Signal Detection and Estimation. Englewood Cliffs,
NJ: Prentice-Hall, 1995.
[68] R. Hill. A First Course in Coding Theory. Oxford: Oxford University Press, 1986.
[69] T. Ho, D.S. Lun. Network Coding: An Introduction. Cambridge: Cambridge University Press, 2008.
[70] A. Hocquenghem. Codes correcteurs d'erreurs. Chiffres, 2, 147–156, 1959.
[71] W.C. Huffman, V. Pless. Fundamentals of Error-Correcting Codes. Cambridge:
Cambridge University Press, 2003.
[72] J.F. Humphreys, M.Y. Prest. Numbers, Groups, and Codes. 2nd ed. Cambridge:
Cambridge University Press, 2004.
[73] S. Ihara. Information Theory for Continuous Systems. Singapore: World Scientific,
1993 .
[74] F.M. Ingels. Information and Coding Theory. Scranton: Intext Educational Publishers, 1971.
[75] I.M. James. Remarkable Mathematicians. From Euler to von Neumann.
Cambridge: Cambridge University Press, 2009 .
[76] E.T. Jaynes. Papers on Probability, Statistics and Statistical Physics. Dordrecht:
Reidel, 1982.
[77] F. Jelinek. Probabilistic Information Theory. New York: McGraw-Hill, 1968.
[78] G.A. Jones, J.M. Jones. Information and Coding Theory. London: Springer, 2000.
[79] D.S. Jones. Elementary Information Theory. Oxford: Clarendon Press, 1979.
[80] O. Johnson. Information Theory and the Central Limit Theorem. London: Imperial
College Press, 2004.
[81] J. Justensen. A class of constructive asymptotically good algebraic codes. IEEE
Transactions Information Theory, 18(5), 652656, 1972.
[82] M. Kelbert, Y. Suhov. Continuity of mutual entropy in the large signal-to-noise
ratio limit. In Stochastic Analysis 2010, pp. 281299, 2010. Berlin: Springer.
[83] N. Khalatnikov. Dau, Centaurus and Others. Moscow: Fizmatlit, 2007.
[84] A.Y. Khintchin. Mathematical Foundations of Information Theory. New York:
Dover, 1957.
[85] T. Klove. Codes for Error Detection. Singapore: World Scientific, 2007.
[86] N. Koblitz. A Course in Number Theory and Cryptography. New York: Springer,
1993 .
[87] H. Krishna. Computational Complexity of Bilinear Forms: Algebraic Coding Theory and Applications of Digital Communication Systems. Lecture notes in control
and information sciences, Vol. 94. Berlin: Springer-Verlag, 1987.
[88] S. Kullback. Information Theory and Statistics. New York: Wiley, 1959.
[89] S. Kullback, J.C. Keegel, J.H. Kullback. Topics in Statistical Information Theory.
Berlin: Springer, 1987.
[90] H.J. Landau, H.O. Pollak. Prolate spheroidal wave functions, Fourier analysis and
uncertainty, II. Bell System Technical Journal, 6484, 1961.
[91] H.J. Landau, H.O. Pollak. Prolate spheroidal wave functions, Fourier analysis and
uncertainty, III. The dimension of the space of essentially time- and band-limited
signals. Bell System Technical Journal, 12951336, 1962.
[92] R. Lidl, H. Niederreiter. Finite Fields. Cambridge: Cambridge University Press,
1997.
[93] R. Lidl, G. Pilz. Applied Abstract Algebra. 2nd ed. New York: Wiley, 1999.
[94] E.H. Lieb. Proof of entropy conjecture of Wehrl. Commun. Math. Phys., 62, (1),
3541, 1978.
[95] S. Lin. An Introduction to Error-Correcting Codes. Englewood Cliffs, NJ; London:
Prentice-Hall, 1970.
[96] S. Lin, D.J. Costello. Error Control Coding: Fundamentals and Applications.
Englewood Cliffs, NJ: Prentice-Hall, 1983.
[97] S. Ling, C. Xing. Coding Theory. Cambridge: Cambridge University Press, 2004.
[98] J.H. van Lint. Introduction to Coding Theory. 3rd ed. Berlin: Springer, 1999.
[99] J.H. van Lint, G. van der Geer. Introduction to Coding Theory and Algebraic
Geometry. Basel: Birkhauser, 1988.
[100] J.C.A. van der Lubbe. Information Theory. Cambridge: Cambridge University
Press, 1997.
[101] R.E. Lewand. Cryptological Mathematics. Washington, DC: Mathematical Association of America, 2000.
[102] J.A. Llewellyn. Information and Coding. Bromley: Chartwell-Bratt; Lund:
Studentlitteratur, 1987.
[103] M. Loève. Probability Theory. Princeton, NJ: van Nostrand, 1955.
[104] D.G. Luenberger. Information Science. Princeton, NJ: Princeton University Press,
2006.
[105] D.J.C. Mackay. Information Theory, Inference and Learning Algorithms.
Cambridge: Cambridge University Press, 2003.
[106] H.B. Mann (ed). Error-Correcting Codes. New York: Wiley, 1969 .
[107] M. Marcus. Dark Hero of the Information Age: In Search of Norbert Wiener, the
Father of Cybernetics. Notices of the AMS 53, (5), 574579, 2005.
[108] A. Marshall, I. Olkin. Inequalities: Theory of Majorization and its Applications.
New York: Academic Press, 1979 .
[109] V.P. Maslov, A.S. Chernyi. On the minimization and maximization of entropy in
various disciplines. Theory Probab. Appl. 48, (3), 447464, 2004.
[110] F.J. MacWilliams, N.J.A. Sloane. The Theory of Error-Correcting Codes, Vols I,
II. Amsterdam: North-Holland, 1977.
[111] R.J. McEliece. The Theory of Information and Coding. Reading, MA: Addison-Wesley, 1977. 2nd ed. Cambridge: Cambridge University Press, 2002.
[112] R. McEliece. The Theory of Information and Coding. Student ed. Cambridge:
Cambridge University Press, 2004.
[113] A. Menon, R.M. Buecher, J.H. Read. Impact of exclusion region and spreading in
spectrum-sharing ad hoc networks. ACM 1-59593-510-X/06/08, 2006 .
[114] R.A. Mollin. RSA and Public-Key Cryptography. New York: Chapman & Hall,
2003.
[115] R.H. Morelos-Zaragoza. The Art of Error-Correcting Coding. 2nd ed. Chichester:
Wiley, 2006.
[116] G.L. Mullen, C. Mummert. Finite Fields and Applications. Providence, RI:
American Mathematical Society, 2007.
[117] A. Myasnikov, V. Shpilrain, A. Ushakov. Group-Based Cryptography. Basel:
Birkhauser, 2008.
[118] G. Nebe, E.M. Rains, N.J.A. Sloane. Self-Dual Codes and Invariant Theory. New
York: Springer, 2006.
[119] H. Niederreiter, C. Xing. Rational Points on Curves over Finite Fields: Theory and
Applications. Cambridge: Cambridge University Press, 2001.
[120] W.W. Peterson, E.J. Weldon. Error-Correcting Codes. 2nd ed. Cambridge,
MA: MIT Press, 1972. (Previous ed. W.W. Peterson. Error-Correcting Codes.
Cambridge, MA: MIT Press, 1961.)
[121] M.S. Pinsker. Information and Information Stability of Random Variables and Processes. San Francisco: Holden-Day, 1964.
[122] V. Pless. Introduction to the Theory of Error-Correcting Codes. 2nd ed. New York:
Wiley, 1989.
[123] V.S. Pless, W.C. Huffman (eds). Handbook of Coding Theory, Vols 1, 2. Amsterdam: Elsevier, 1998.
[124] P. Piret. Convolutional Codes: An Algebraic Approach. Cambridge, MA: MIT
Press, 1988.
[125] O. Pretzel. Error-Correcting Codes and Finite Fields. Oxford: Clarendon Press,
1992; Student ed. 1996.
[126] T.R.N. Rao. Error Coding for Arithmetic Processors. New York: Academic Press,
1974.
[127] M. Reed, B. Simon. Methods of Modern Mathematical Physics, Vol. II. Fourier
analysis, self-adjointness. New York: Academic Press, 1975.
[128] A. Renyi. A Diary on Information Theory. Chichester: Wiley, 1987; initially published Budapest: Akademiai Kiado, 1984.
[129] F.M. Reza. An Introduction to Information Theory. New York: Constable, 1994.
[130] S. Roman. Coding and Information Theory. New York: Springer, 1992.
[131] S. Roman. Field Theory. 2nd ed. New York: Springer, 2006.
[132] T. Richardson, R. Urbanke. Modern Coding Theory. Cambridge: Cambridge University Press, 2008.
[133] R.M. Roth. Introduction to Coding Theory. Cambridge: Cambridge University
Press, 2006.
[134] B. Ryabko, A. Fionov. Basics of Contemporary Cryptography for IT Practitioners.
Singapore: World Scientific, 2005.
[135] W.E. Ryan, S. Lin. Channel Codes: Classical and Modern. Cambridge: Cambridge
University Press, 2009.
[136] T. Schurmann, P. Grassberger. Entropy estimation of symbol sequences. Chaos, 6,
(3), 414427, 1996.
[137] P. Seibt. Algorithmic Information Theory: Mathematics of Digital Information Processing. Berlin: Springer, 2006.
[138] C.E. Shannon. A mathematical theory of cryptography. Bell Lab. Tech. Memo.,
1945.
[139] C.E. Shannon. A mathematical theory of communication. Bell System Technical
Journal, 27, July, October, 379423, 623658, 1948.
[140] C.E. Shannon: Collected Papers. N.J.A. Sloane, A.D. Wyner (eds). New York:
IEEE Press, 1993.
[141] C.E. Shannon, W. Weaver. The Mathematical Theory of Communication. Urbana,
IL: University of Illinois Press, 1949.
[142] P.C. Shields. The Ergodic Theory of Discrete Sample Paths. Providence, RI:
American Mathematical Society, 1996.
[143] M.S. Shrikhande, S.S. Sane. Quasi-Symmetric Designs. Cambridge: Cambridge
University Press, 1991.
[144] S. Simic. Best possible global bounds for Jensen functionals. Proc. AMS, 138, (7),
24572462, 2010.
[145] A. Sinkov. Elementary Cryptanalysis: A Mathematical Approach. 2nd ed. revised
and updated by T. Feil. Washington, DC: Mathematical Association of America,
2009.
[146] D. Slepian, H.O. Pollak. Prolate spheroidal wave functions, Fourier analysis and
uncertainty, Vol. I. Bell System Technical Journal, 4364, 1961 .
[147] W. Stallings. Cryptography and Network Security: Principles and Practice. 5th ed.
Boston, MA: Prentice Hall; London: Pearson Education, 2011.
[148] H. Stichtenoth. Algebraic Function Fields and Codes. Berlin: Springer, 1993.
[149] D.R. Stinson. Cryptography: Theory and Practice. 2nd ed. Boca Raton, FL;
London: Chapman & Hall/CRC, 2002.
[150] D. Stoyan, W.S. Kendall. J. Mecke. Stochastic Geometry and its Applications.
Berlin: Academie-Verlag, 1987 .
[151] C. Schlegel, L. Perez. Trellis and Turbo Coding. New York: Wiley, 2004.
[152] S. Sujan. Ergodic Theory, Entropy and Coding Problems of Information Theory. Praha: Academia, 1983.
[153] P. Sweeney. Error Control Coding: An Introduction. New York: Prentice Hall,
1991.
[154] Te Sun Han, K. Kobayashi. Mathematics of Information and Coding. Providence,
RI: American Mathematical Society, 2002.
[155] T.M. Thompson. From Error-Correcting Codes through Sphere Packings to Simple
Groups. Washington, DC: Mathematical Association of America, 1983.
[156] R. Togneri, C.J.S. deSilva. Fundamentals of Information Theory and Coding
Design. Boca Raton, FL: Chapman & Hall/CRC, 2002.
[157] W. Trappe, L.C. Washington. Introduction to Cryptography: With Coding Theory.
2nd ed. Upper Saddle River, NJ: Pearson Prentice Hall, 2006.
[158] M.A. Tsfasman, S.G. Vladut. Algebraic-Geometric Codes. Dordrecht: Kluwer
Academic, 1991.
[159] M. Tsfasman, S. Vladut, T. Zink. Modular curves, Shimura curves and Goppa
codes, better than Varshamov–Gilbert bound. Mathematische Nachrichten, 109, 21–28, 1982.
[160] M. Tsfasman, S. Vladut, D. Nogin. Algebraic Geometric Codes: Basic Notions.
Providence, RI: American Mathematical Society, 2007.
[161] M.J. Usher. Information Theory for Information Technologists. London: Macmillan, 1984.
[162] M.J. Usher, C.G. Guy. Information and Communication for Engineers. Basingstoke: Macmillan, 1997
[163] I. Vajda. Theory of Statistical Inference and Information. Dordrecht: Kluwer, 1989.
[164] S. Verdu. Multiuser Detection. New York: Cambridge University Press, 1998.
[165] S. Verdu, D. Guo. A simple proof of the entropy-power inequality. IEEE Trans. Inform. Theory, 52, (5), 2165–2166, 2006.
[166] L.R. Vermani. Elements of Algebraic Coding Theory. London: Chapman & Hall,
1996.
[167] B. Vucetic, J. Yuan. Turbo Codes: Principles and Applications. Norwell, MA:
Kluwer, 2000.
[168] G. Wade. Coding Techniques: An Introduction to Compression and Error Control.
Basingstoke: Palgrave, 2000.
[169] J.L. Walker. Codes and Curves. Providence, RI: American Mathematical Society,
2000.
[170] D. Welsh. Codes and Cryptography. Oxford, Oxford University Press, 1988.
[171] N. Wiener. Cybernetics or Control and Communication in Animal and Machine.
Cambridge, MA: MIT Press, 1948; 2nd ed: 1961, 1962.
[172] J. Wolfowitz. Coding Theorems of Information Theory. Berlin: Springer, 1961; 3rd
ed: 1978.
[173] A.D. Wyner. The capacity of the band-limited Gaussian channel. Bell System Technical Journal, 359–395, 1966.
[174] A.D. Wyner. The capacity of the product of channels. Information and Control,
423433, 1966.
[175] C. Xing. Nonlinear codes from algebraic curves beating the Tsfasman–Vladut–Zink bound. IEEE Transactions on Information Theory, 49, 1653–1657, 2003.
[176] A.M. Yaglom, I.M. Yaglom. Probability and Information. Dordrecht, Holland:
Reidel, 1983.
[177] R. Yeung. A First Course in Information Theory. Boston: Kluwer Academic, 1992;
2nd ed. New York: Kluwer, 2002.
Index
bound
BCH bound, 237, 295
Elias bound, 177
Gilbert bound, 198
Gilbert–Varshamov bound, 154
Griesmer bound, 197
Hamming bound, 150
Johnson bound, 177
linear programming bound, 322
Plotkin bound, 155
Singleton bound, 154
bar-product, 152
capacity, 61
capacity of a discrete channel, 61
capacity of a memoryless Gaussian channel with
white noise, 374
capacity of a memoryless Gaussian channel with
coloured noise, 375
operational channel capacity, 102
character (as a digit or a letter or a symbol), 53
character (of a homomorphism), 313
modular character, 314
trivial, or principal, character, 313
character transform, 319
characteristic of a field, 269
channel, 60
additive Gaussian channel (AGC), 368
memoryless Gaussian channel (MGC), 368
memoryless additive Gaussian channel (MAGC),
366
memoryless binary channel (MBC), 60
memoryless binary symmetric channel (MBSC), 60
noiseless channel, 103
channel capacity, 61
operational channel capacity, 102
check matrix: see parity-check matrix
cipher (or a cryptosystem), 463
additive stream cipher, 463
cipher (or a cryptosystem) (cont.)
one-time pad cipher, 466
public-key cipher, 467
ciphertext, 468
code, or encoding, viii, 4
alternant code, 332
BCH code, 213
binary code, 10, 95
cardinality of a code, 253
cyclic code, 216
decipherable code, 14
dimension of a linear code, 149
D error detecting code, 147
dual code, 153
equivalent codes, 190
E error correcting code, 147
Golay code, 151
Goppa code, 160, 334
Hamming code, 199
Huffman code, 9
information rate of a code, 147
Justesen code, 240, 332
lossless code, 4
linear code, 148
maximal distance separating (MDS), 155
parity-check code, 149
perfect code, 151
prefix-free code, 4
random code, 68, 372
rank of a linear code, 184
Reed–Muller (RM) code, 203
Reed–Solomon code, 256, 291
repetition code, 149
reversible cyclic code, 230
self-dual code, 201, 227
self-orthogonal, 227
simplex code, 194
codebook, 67
random codebook, 68
coder, or encoder, 3
codeword, 4
random codeword, 6
coding: see encoding
coloured noise, 374
concave, 19, 32
strictly concave, 32
concavity, 20
conditionally independent, 26
conjugacy, 281
conjugate, 229
convergence almost surely (a.s.), 131
convergence in probability, 43
convex, 32
strictly convex, 104
convexity, 142
core polynomial of a field, 231
coset, 192
cyclotomic coset, 285
leader of a coset, 192
cryptosystem (or a cipher), 468
bit commitment cryptosystem, 468
ElGamal cryptosystem, 475
public key cryptosystem, 468
RSA (Rivest–Shamir–Adleman) cryptosystem, 468
Rabin, or RabinWilliams cryptosystem, 473
cyclic group, 231
generator of a cyclic group
cyclic shift, 216
data-processing inequality, 80
detailed balance equations (DBEs), 56
decoder, or a decoding rule, 65
geometric (or minimal distance) decoding rule, 163
ideal observer (IO) decoding rule, 66
maximum likelihood (ML) decoding rule, 66
joint typicality (JT) decoder, 372
decoding, 167
decoding alternant codes, 337
decoding BCH codes, 239, 310
decoding cyclic codes, 214
decoding Hamming codes, 200
list decoding, 192, 405
decoding Reed–Muller codes, 209
decoding Reed–Solomon codes, 292
decoding Reed–Solomon codes by the Guruswami–Sudan algorithm, 299
syndrome decoding, 193
decrypt function, 469
degree of a polynomial, 206, 214
density of a probability distribution (PDF), 86
differential entropy, 86
digit, 3
dimension, 149
dimension of a code, 149
dimension of a linear representation, 314
discrete Fourier transform (FFT), 296
discrete-time Markov chain (DTMC), 1, 3
discrete logarithm, 474
distributed system, or a network (of transmitters), 436
Dirac δ-function, 318
distance, 20
Kullback–Leibler distance, 20
Hamming distance, 144
minimal distance of a code, 147
distance enumerator polynomial, 322
divisor, 217
greatest common divisor (gcd), 223
dot-product, 153
doubly stochastic (Cox) random process, 492
electronic signature, 469, 476
encoding, or coding, vii, 4
Huffman encoding, 9
ShannonFano encoding, 9
random coding, 67
entropy, vii, 7
axiomatic definition of entropy, 36
binary entropy, 7
conditional entropy, 20
differential entropy, 86
entropy of a random variable, 18
entropy of a probability distribution, 18
joint entropy, 20
mutual entropy, 28
entropy–power inequality, 92
q-ary entropy, 7
entropy rate, vii, 41
relative entropy, 20
encrypt function, 468
ergodic random process (stationary), 397
ergodic transformation of a probability space, 397
error locator, 311
error locator polynomial, 239, 311
error-probability, 58
extension of a code, 151
parity-check extension, 151
extension field, 261
factor (as a divisor), 39
irreducible factor, 219
prime factor, 39
factorization, 230
fading of a signal, 447
power fading, 447
Rayleigh fading, 447
feedback shift register, 453
linear feedback shift register (LFSR), 454
feedback polynomial, 454
field (a commutative ring with inverses), 146, 230
extension field, 261
Galois field, 272
finite field, 194
polynomial field, 231
primitive element of a field, 230, 232
splitting field, 236, 271
Frobenius map, 283
Gaussian channel, 366
additive Gaussian channel (AGC), 368
memoryless Gaussian channel (MGC), 368
memoryless additive Gaussian channel (MAGC),
366
Gaussian coloured noise, 374
Gaussian white noise, 368
Gaussian random process, 369
generating matrix, 185
equiprobable, or uniform, distribution, 3, 22
exponential distribution (with exponential density),
89
geometric distribution, 21
joint probability, 1
multivariate normal distribution, 88
normal distribution (with univariate normal
density), 89
Poisson distribution, 101
probability mass function (PMF), 366
probability space, 397
prolate spheroidal wave function (PSWF), 425
protocol of a private communication, 469
Diffie–Hellman protocol, 474
prefix, 4
prefix-free code, 4
product-channel, 404
public-key cipher, 467
quantum mechanics, 431
random code, 68, 372
random codebook, 68
random codeword, 6
random measure, 436
Poisson random measure (PRM), 436
random process, vii
Gaussian random process, 369
Poisson random process, 436
stationary random process, 397
stationary ergodic random process, 397
random variable, 18
conditionally independent random variables, 26
equiprobable, or uniform, random variable,
3, 22
exponential random variable (with exponential
density), 89
geometric random variable, 21
independent identically distributed (IID) random
variables, 1, 3
joint probability distribution of random variables, 1
normal random variable (with univariate normal
density), 89
Poisson random variable, 101
random vector, 20
multivariate normal random vector, 88
rank of a code, 184
rank-nullity property, 186
rate, 15
entropy rate, vii, 41
information rate of a source, 15
reliable encoding (or encodable) rate, 15
reliable transmission rate, 62
reliable transmission rate with regional constraint,
373
regional constraint for channel capacity, 367
register, 453
feedback shift register, 453
linear feedback shift register (LFSR), 454
feedback, or auxiliary, polynomial of an LFSR, 454
initial fill of register, 454
output stream of a register, 454
repetition code, 149
repetition of a code, 152
ring, 217
ideal of a ring, 217
quotient ring, 274
root of a cyclic code, 230
defining root of a cyclic code, 233
root of a polynomial, 228
root of unity, 228
primitive root of unity, 236
sample, viii, 2
signal/noise ratio (SNR), 449
sinc function, 413
size of a code, 147
space, 35
Hamming space, 144
space 2 (R1 ), 415
linear space, 146
linear subspace, 148
space of a linear representation, 314
state space of a Markov chain, 35
vector space over a field, 269
stream, 463
strictly concave, 32
strictly convex, 104
string, or a word (of characters, digits, letters or
symbols), 3
source of information (random), 2, 44
Bernoulli source, 3
equiprobable Bernoulli source, 3
Markov source, 3
stationary Markov source, 3
spectral density, 417
stationary, 3
stationary Markov source, 3
stationary random process, 397
stationary ergodic random process, 397
supercritical network, 449
symbol, 2
syndrome, 192
theorem
Brunn–Minkowski theorem, 93
Campbell theorem, 442
Cayley–Hamilton theorem, 456
central limit theorem (CLT), 94
Doob–Lévy theorem, 409
local De Moivre–Laplace theorem, 53
mapping theorem, 437
theorem (cont.)
product theorem, 444
Shannon theorem, 8
Shannons noiseless coding theorem (NLCT), 8
Shannons first coding theorem (FCT), 42
Shannons second coding theorem (SCT), or noisy
coding theorem (NCT), 59, 162
Shannons SCT: converse part, 69
Shannons SCT: strong converse part, 175
Shannons SCT: direct part, 71, 163
ShannonMcMillanBreiman theorem, 397
totient function, 270
transform
character transform, 319
Fourier transform, 296